Hello Data Engineers,
Welcome back to another edition of the data engineering digest - a monthly newsletter containing updates, inspiration, and insights from the community.
Here are a few things that happened this month in the community:
Microsoft announces Fabric - “Data analytics for the era of AI”.
Trino seems great, why isn’t it more popular?
Using a hybrid Snowflake/Databricks architecture + real user experiences.
Is there a better alternative to CDC for ELT?
Data Engineers share lessons learned the hard way.
Are lakehouses replacing data warehouses or are warehouses here to stay?
Community Discussions
Here are the top posts you may have missed:
1. Microsoft announces Fabric data platform
Also trending: So I watched a few videos about Fabric, and started to cry a little…
It seems almost daily that the next “data analytics platform for all” is announced, but the vast majority either fade into obscurity or are relegated to niche use cases. Microsoft in particular is infamous for releasing products that could have used more user testing before launch, which prompted the community to share their thoughts on the announcement and see what everyone else thinks.
Microsoft Fabric is an Azure-hosted managed service that combines several existing Microsoft products, such as Data Factory, Synapse, and Power BI, with generative AI, and markets itself as an “end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need.”
💡 Key Insight
The popularity of ChatGPT has been unprecedented, leading what seems like the majority of tech companies to either pursue their own entrance into the market or reshape their branding around it. Members comment that Fabric clearly aims to capitalize on this trend and create further vendor lock-in. There is also a general frustration that, because Microsoft relies heavily on enterprise sales targeting executives, data engineers will have to spend a lot of time convincing their leadership that Fabric likely won’t be a solution to all of their problems.
“Can’t wait to yet again explain to non technical management why we wont use these products.”
The general sentiment from the community is relatively harsh, with members commenting that Microsoft is trying too hard to pull Databricks’ market into its own ecosystem. Still, it will be interesting to see how effectively modern data engineering tools and infrastructure platforms can utilize AI. We aren’t particularly compelled yet, but the AI landscape is changing even faster than data engineering’s, so it’s worth keeping your eyes peeled despite the sheer quantity of products entering the market.
2. Experiences with trino? What am I missing
Often members seek advice on how to better use tools, wondering why they’re not a fan of one that’s often lauded in the community. This month, one member asked the opposite, stating that they love Trino but can’t understand why there’s not more hype around it. For those unfamiliar with Trino, it’s a distributed SQL query engine that is commonly used to access data from a data lake or query from several different data sources at the same time (query federation).
💡 Key Insight
Several members blame the marketing behind the product, stating that they love the product itself. Since Trino was formerly named Presto, and products like Athena are managed versions of Trino, members say it can be confusing to understand how the product positions itself (read: the differences between Presto and Trino). Others point out that the product faces competition from all the major cloud providers (AWS Redshift Spectrum, BigQuery external tables, etc.).
Trino also requires hosting on your own infrastructure, which for most organizations means a cloud provider, and since those providers already have managed offerings as noted above, it’s typically easier to forgo self-hosted Trino in favor of the managed option.
Starburst.io is a relatively new company that offers a managed, cloud-agnostic version of Trino. As for quantitative reasons to choose Trino over other offerings, some data engineers state that Trino is significantly more performant for its cost than its competitors for querying and ETL workloads, though you may need additional staff to support the infrastructure.
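To make the query-federation idea concrete, here is a minimal sketch using Trino’s official Python client (the `trino` package). The host, catalogs, and table names are illustrative assumptions, not a reference setup:

```python
# Minimal sketch: one Trino query joining a Postgres table with a
# data lake table. Assumes a Trino cluster with `postgresql` and
# `hive` catalogs already configured; all names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical cluster endpoint
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# Query federation: both sources are addressed as catalog.schema.table,
# so no data needs to be copied between systems first.
cur.execute("""
    SELECT o.order_id, o.amount, c.segment
    FROM postgresql.public.orders AS o
    JOIN hive.default.customers AS c
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= DATE '2023-01-01'
""")
for row in cur.fetchall():
    print(row)
```

The same pattern works for any pair of configured connectors, which is exactly the appeal the original poster was describing.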
3. Anyone using Snowflake + Databricks?
Both Snowflake and Databricks are popular tools that are often discussed separately in the community, but have you ever thought about combining them? If so, you’re not alone. While there is increasing overlap in how the two platforms are used, each continues to excel in its own niche, so why not get the best of both worlds?
The author of this post brainstormed using Snowflake as the source of truth and running Databricks for ML on top of it. Alternatively, you could create a medallion-style architecture, with the bronze and silver layers built in a Databricks lakehouse and only the gold layer living in Snowflake, Kimball-style, for analytics.
“Sure, I'm familiar with the fact that Databricks, with its Lakehouse architecture, is capable of supporting both data warehousing and machine learning workloads. However, if an organization heavily relies on both of these workloads, I question whether Databricks performs as effectively in the realm of data warehousing compared to Snowflake. Instead of solely relying on one platform, why not leverage both platforms to get the benefits from each and achieve the best possible outcomes?”
💡 Key Insight
Several members shared that they are using both Snowflake and Databricks, with a popular pattern being to build the bronze and silver layers in Databricks and then create the gold layer in Snowflake.
Some of the more controversial comments were around how expensive this architecture seemed, which is unsurprising given that both tools are often considered pricey by the community. Suggestions for reducing the cost included using spot instances in Databricks for workloads that can tolerate them and negotiating a discount by committing to a longer-term contract. Proponents of the architecture said that while it is expensive, it is also efficient, and as long as you keep data moving in one direction it can be a great model for serving data scientists, machine learning engineers, traditional business users, and data analysts alike.
Other members noted that Snowflake is easier to maintain, while Databricks gives you more room for customization.
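For a sense of what the “gold layer in Snowflake” handoff can look like, here is a minimal sketch from a Databricks notebook using the Snowflake Spark connector that Databricks bundles. The account, secret scope, and table names are all hypothetical placeholders:

```python
# Minimal sketch: promote a curated silver table to a gold-layer fact
# table in Snowflake from a Databricks notebook. All names below are
# illustrative assumptions, not a reference configuration.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",        # hypothetical account
    "sfUser": "etl_user",
    "sfPassword": dbutils.secrets.get("etl", "sf_password"),
    "sfDatabase": "ANALYTICS",
    "sfSchema": "GOLD",
    "sfWarehouse": "TRANSFORM_WH",
}

# Aggregate the silver table into a daily fact for BI consumption.
gold_df = (
    spark.table("silver.orders")
         .groupBy("order_date", "region")
         .agg({"amount": "sum"})
         .withColumnRenamed("sum(amount)", "total_amount")
)

(gold_df.write
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", "FCT_DAILY_ORDERS")
    .mode("overwrite")
    .save())
```

Keeping the write one-directional, Databricks to Snowflake, is what commenters meant by “keeping data moving in one direction.”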
Interested in trying this out yourself? Dive into the comments for more specifics on how people are implementing this hybrid architecture in real life.
4. Why would you ever not use CDC for ELT?
Data engineers commonly use change data capture (CDC) to replicate inserts, updates, and deletes from a source database to a data warehouse or data lake, and from there apply the popular ELT pattern. However, it’s not always possible to use CDC, so what do data engineers do in those cases? Or is there an even better alternative to CDC altogether?
“As the title says... I can't think of a reason why CDC would ever not be the "gold standard" for any ELT data integration processes?”
💡 Key Insight
The community agreed that CDC is the “gold standard” when you can use it, but most of the time the business simply doesn’t need it, which frees data engineers to choose a simpler, easier-to-maintain method. CDC requires more effort to set up and maintain, and there are caveats to be aware of, such as a flood of changes outpacing the database transaction log and forcing you into a full refresh.
Members mentioned a few alternatives to CDC:
Full refresh (aka full load): The easiest to set up and maintain. The sentiment is that you should always start here and move to other methods as needed.
Delta load: Add a modified_at column, or use a primary key id, to retrieve only the records that are new or modified since the last pipeline run. This option sits between a full load and CDC (see the sketch below).
Insert + overwrite last N days: Useful when there isn’t a modified column or id you can rely on, but you know changes only occur within a bounded timeframe.
The downsides to these approaches are typically increased load on the source database and the time required to query and extract changes, compared with CDC. They also offer no efficient way to capture deleted records.
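As an illustration of the delta-load and overwrite-last-N-days options above, here is a minimal sketch of both extraction patterns against a Postgres source. The connection string, table, and column names are hypothetical:

```python
# Minimal sketch: two CDC alternatives against a Postgres source.
# The DSN, table, and column names are illustrative assumptions.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@source-db/app")  # hypothetical DSN

def load_delta(last_watermark):
    """Delta load: pull only rows modified since the previous run."""
    query = sa.text("""
        SELECT * FROM orders
        WHERE modified_at > :watermark
        ORDER BY modified_at
    """)
    with engine.connect() as conn:
        rows = conn.execute(query, {"watermark": last_watermark}).fetchall()
    # Upsert `rows` into the warehouse, then persist the new watermark
    # (e.g., the max modified_at seen) for the next run.
    return rows

def reload_last_n_days(n_days=7):
    """Insert + overwrite: re-extract a trailing window when there is no
    reliable modified_at column but changes are bounded in time."""
    query = sa.text("""
        SELECT * FROM orders
        WHERE order_date >= CURRENT_DATE - :n
    """)
    # In the warehouse, delete the same trailing window and insert
    # (or MERGE) these rows in its place.
    with engine.connect() as conn:
        return conn.execute(query, {"n": n_days}).fetchall()
```

Note that neither variant can see rows deleted from the source table, which is exactly the downside called out above.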
5. What have you learned the hard way?
One data engineer created a great thread detailing a few life lessons that they learned and asked other members for lessons that they’ve learned the hard way.
Their post resonates because, let’s face it, haven’t we all learned valuable lessons the hard way?
💡 Key Insight
Some of the main lessons members shared are not even technical: communicate with stakeholders early, and understand not only what they’re looking for but also the data itself and why it matters. The most well-designed ETL pipeline, connected to an elegantly constructed data warehouse and reporting solution, is useless if stakeholders don’t use it.
Other engineers bemoan the cost of the cloud and encourage a stronger focus on documentation.
The last piece of advice, which we can probably all agree with, is that data always comes second to the product. While many products these days live or die by their data, data is useless unless there are users who care about it. It might be annoying to cater to the business side of an application and ship changes that seem half-baked, but a basic understanding of the business can go a long way, even for the most “heads-down” data engineers.
6. Checking in: lake houses don't seem to be replacing data warehouses
One community member expected the tooling and features of the lakehouse architecture to become comparable to those of data warehouses and eventually replace them, but that doesn’t seem to be happening.
“I can't seem to find much that would suggest a lake house comes anywhere close to the performance and or features, complex SQL syntax, of a data warehouse. Am I missing something?”
It’s worth noting that the author seems to be referring to modern cloud data warehouses like Snowflake, Redshift, and BigQuery.
💡 Key Insight
The community believes lakehouses are not yet replacing data warehouses for a few reasons:
Lakehouses really shine if you are working with hundreds of petabytes or more of data, with streaming data, or with a mix of structured and unstructured data. Most companies don’t fit this profile, and a data warehouse works well for them, even up to petabytes of structured data.
Lakehouses and warehouses can handle the same tasks in some cases, but warehouses are more efficient for structured data at smaller scales.
Migrating data architecture is a lengthy and costly process.
Modern MPP databases have a similar architecture to lakehouses under the hood. With these tools, you trade control for ease of use and less management overhead.
Ultimately, it seems again that there is no one clear winner and the choice depends on your organization's needs.
🎁 Bonus:
🔍 ODD Platform - An open-source data discovery and observability service
🙌 How Leading Data Organizations Achieve Success: Prioritize People, Process, and Product
📅 Upcoming Events
Here are some events happening next month:
6/26-29: Snowflake Summit
6/26-29: Data + AI Summit by Databricks
Share an event with the community here or view the full calendar
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
We’ll see you next month!