Data Engineering Digest, July 2024
Hello Data Engineers,
Welcome back to another edition of the Data Engineering Digest, a monthly newsletter containing updates, inspiration, and insights from the community.
Here are a few things that happened this month in the community:
Skepticism about Polars?
The top 3 file formats DEs love.
Scaling databases for big data.
What are some open-source Snowflake alternatives?
Inmon, Kimball & Data Vault, when to use which?
The biggest challenges data engineers face part 2.
Community Discussions
Here are the top posts you may have missed:
1. I'm skeptical about polars
Wondering if you should be using Polars for dataframe-based workflows? You’re not alone. This month, there were quite a few data engineers wondering where exactly Polars fits in a data engineer’s toolbelt.
For a long time Pandas has been the go-to for manipulating smaller datasets. It’s popular for exploratory data analysis (EDA) but often criticized for its poor performance and confusing syntax. In fact, many data engineers advise against using it at all in production pipelines.
Polars is a newer dataframe library that boasts better performance and the ability to handle larger datasets than Pandas. But are the benefits large enough to make the switch, and if so, when should you decide to use a tool like Spark instead?
💡 Key Insight
Several data engineers claim that they regularly see orders-of-magnitude speed improvements when using Polars over Pandas. The community also finds the Polars syntax much more intuitive overall; readability is a huge and underrated advantage. One drawback is that because Polars is still relatively new, you’ll find fewer answers and code examples on the web.
The next question is, when would you use Polars over something like Spark?
We like this succinct quote from the article “Polars vs Spark. Real Talk.” by Confessions of a Data Guy:
If you can replace Spark with Polars, you shouldn’t have been using Spark.
If you can replace working Polars pipelines with Spark, then you’re wasting time.
At the risk of oversimplifying, it boils down to two questions: can you process the data on a single machine, and can you avoid the overhead (and maintenance) of running a Spark cluster? If the answer to both is yes, check out Polars.
2. If you could only use 3 different file formats for the rest of your career, which would you choose?
Most data engineers tend to work with a multitude of data sources. For better or for worse, file formats tend to be a substantial part of working with these data sources, and data engineers are quite opinionated on which file formats are superior. Join the conversation and share which 3 you would choose here.
💡 Key Insight
The top 3 file formats data engineers chose were:
CSV - comma-separated values is a classic format designed for tabular data. It’s important to note that while CSV files are probably the most popular data format, they’re not always the best choice. Nested or semi-tabular data is better served by formats like JSON or XML.
JSON - widely used for storing and exchanging data in a human-readable format. It’s commonly used in web applications and APIs.
Parquet - popular format due to its efficient columnar storage structure, which allows for fast data processing and retrieval. It’s optimized for big data applications, making it ideal for large datasets.
The file format that got the most hate:
XML - while less common, still holds significance in certain industries and legacy systems. It’s a powerful “meta format” that can be used to represent complex data relationships and metadata, which can be valuable in specific use cases.
3. Why can’t databases scale to handle big data?
We have a 20TB dataset that we literally cannot load into Oracle... The engineers built a spark cluster with S3 storage but no one likes it because it doesn’t behave like a traditional database (inserts, updates, small queries still require a job to be launched). If they wanted a huge database why not just scale Oracle and Tableau?
This is a classic vertical vs horizontal scaling question which all data engineers should fundamentally understand.
💡 Key Insight
Databases CAN and do scale to handle big data.
Technically, any database will scale vertically to the point where you run into hardware and cost limitations. For example, you can buy a machine with a TB of RAM but it will be much more expensive and harder to find than several machines with a moderate amount of RAM.

This is how modern MPP (massively parallel processing) databases work: they distribute data processing workloads across multiple machines, allowing them to scale horizontally and cost-effectively. A few examples are StarRocks, ClickHouse, Apache Druid, and Snowflake.
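The distribute-then-merge idea behind MPP engines can be sketched in a few lines. This is a toy illustration, not any particular engine: rows are hash-partitioned across hypothetical nodes, each node aggregates its slice independently, and a coordinator merges the partial results:

```python
from collections import defaultdict

# Hypothetical rows of (grouping key, value).
rows = [("east", 100), ("west", 200), ("east", 150), ("west", 250)]
NUM_NODES = 2

# Distribute rows across nodes by hashing the grouping key.
partitions = defaultdict(list)
for region, amount in rows:
    partitions[hash(region) % NUM_NODES].append((region, amount))

# Each node computes a partial aggregate independently...
partials = []
for node_rows in partitions.values():
    agg = defaultdict(int)
    for region, amount in node_rows:
        agg[region] += amount
    partials.append(agg)

# ...and a coordinator merges the partial results into the final answer.
totals = defaultdict(int)
for partial in partials:
    for region, amount in partial.items():
        totals[region] += amount
```

Because each node only touches its own partition, adding nodes adds capacity roughly linearly, which is the horizontal-scaling property the quote above is missing from a single Oracle box.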
4. What if there is a good open-source alternative to Snowflake?
One community member posted a survey asking whether the community would use an open-source alternative to Snowflake and what their views are on Snowflake’s pros and cons. The post sparked a great discussion about what alternatives exist today (there are several).
💡 Key Insight
While no single alternative is going to match all of Snowflake’s features, there are several great open source options that you can run yourself such as: Apache Druid, Apache Pinot, ClickHouse, Apache Doris and StarRocks. Another interesting project that was mentioned is called Databend which specifically calls itself a modern alternative to Snowflake. Several other members also mentioned using distributed query engines like Trino on top of a data lake as an alternative.
The biggest value of a managed service like Snowflake is exactly that: it’s managed for you, and you don’t have to piece different systems together yourself. Keep in mind that open source can have a higher total cost of ownership (TCO) when factoring in deployment and maintenance time. When weighing resources, human time should not be overlooked.
5. Inmon, Kimball & Data Vault
Which data modeling technique is the most widely used and why?
Inmon, Dimensional Modeling (Kimball) and Data Vault are the 3 that we hear about the most but there are several data modeling techniques that can be useful in different situations. We touched on this topic briefly here in May.
💡 Key Insight
All 3 of the techniques listed above are popular in different parts of the world and have their own strengths and weaknesses in terms of ease of use and startup cost. There is no single winner because, as we data engineers often say: “it depends…”
That being said, the most common approach is typically a hybrid, mixing techniques where they make sense. We recommend that data engineers not get bogged down trying to strictly adhere to one technique. One theme we’ve observed is that engineers may start with something like OBT (One Big Table), which has a lower startup cost, while building out the rest of the warehouse over time using a technique with a higher startup cost like Data Vault or dimensional modeling.
The biggest takeaway here is this: the data landscape is changing rapidly and it’s no longer good enough to know just one data modeling technique. You need to be able to adapt quickly and use concepts from multiple techniques to deliver business value in both the short and long term.
6. What Are the Biggest Challenges in Data Engineering Today?
I've been working in the data industry for over 20 years, and I've noticed that many companies lack robust data governance and quality processes. They seem to prioritize building pipelines, acquiring and processing data, and delivering it to the business, without focusing on governance and quality… What other significant challenges do you think we are facing today?
💡 Key Insight
We discussed this topic last month but wanted to add a few more interesting challenges discussed this month as well.
Another common challenge for data teams in general is that they are often treated by the business as a cost center. It’s therefore unsurprising that data is often not a priority and that data quality, security, and time to value suffer. The AI trend may change things here, but it’s too early to tell.
But perhaps the biggest challenge? It’s just gotten too boring.
“These days frameworks just do all the work for you (and that's a good thing).”
As tooling gets better and removes more and more of the technical challenges, more of our focus can go towards the other duties of data engineering like data security, data management, data ops and data architecture design.
📖 Related reading: Fundamentals of Data Engineering
🎁 Bonus:
📅 Upcoming Events
8/15: DuckCon #5 in Seattle
Share an event with the community here or view the full calendar
Opportunities to get involved:
Share your story on how you got started in DE
Want to get involved in the community? Sign up here.
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Want to contribute? Learn how you can get involved.
Stay tuned next month for more updates, and thanks for being a part of the Data Engineering community.