Data Engineering Digest, June 2024
Hello Data Engineers,
Welcome back to another edition of the Data Engineering Digest, a monthly newsletter with updates, inspiration, and insights from the community.
Here are a few things that happened this month in the community:
The biggest pains of being a data engineer.
Our data environment is bad…or is it?
Rare skills for data engineers.
Hot takes on the latest trends.
What’s the deal with all this ice…berg?
Self-serve BI: myth or real?
Community Discussions
Here are the top posts you may have missed:
1. What are the biggest pains you have as a data engineer?
From tooling annoyances to just wanting to be able to take a bit more holiday, what are your biggest bugbears atm?
I'll go first - people (execs) **not getting** data and the power it has to automate stuff.
One data engineer invited the community to vent about the biggest pains of their careers. Unsurprisingly, the thread got a ton of attention.
💡 Key Insight
Several of the frustrations were related to working with non-technical stakeholders. A few examples were stakeholders believing the data is wrong because it doesn’t align with their preconceived expectations and non-technical stakeholders driving timelines and technical decisions.
Similarly, other users highlighted pervasive issues around bureaucracy and an aversion to change getting in the way of solid data engineering practices. For example, not wanting to change a process because “it’s been done this way for 30 years” or conversely, automating things that shouldn’t be done in the first place.
One final frustration was with other data engineers who chase the latest tech but ignore fundamental data engineering skills like modeling and architecture practices. The newest shiny tools won’t fix all of your problems unfortunately.
2. How bad is the data environment where you work?
I have bridging tables that don't bridge. I have tables with no keys. I have tables with incomprehensible soup of abbreviations as names. I have columns with the same business name in different databases that have different values and both are incorrect…Is it this bad everywhere or is it better where you work?
It’s unfortunately not uncommon to hear stories like this. Cutting corners and a lack of best practices or governance create “data debt” that has to be paid off later, just as in software engineering. Often, though, it never gets paid off and the bill comes due… And even when best practices are in place, data engineers often have no control over source data and its quality.
💡 Key Insight
Here are just a few of the many stories data engineers shared about their environment:
The main schema in our production OLTP DB is named "johntest".
We had a SAS cluster setup that was called THE LAMER ENVIRONMENT. For a fortune 100 company. LAMER was the admin dude's last name.
I am part of an organization where tables are named table1, table2, up to tableN, and the PMs and BAs are calling it the same name during standup meetings.
Years of buying the latest and greatest tools that become obsolete, buying a better version that then goes obsolete is a cycle many companies go through…migrating from legacy systems to new is its own industry at this point.
It’s comforting to know that you’re not alone in dealing with these challenges and there are some learnings we can take away from the community.
Whatever you build is generally not going to change. Don’t fall into the trap of thinking “we can fix it later” because addressing tech debt is rarely prioritized. Along these lines:
Write documentation.
Pick a naming convention and stick with it (but don’t name things after yourself).
Give yourself more time than you think you need for new projects.
Enforce boundaries in a way that makes sense - don’t be combative but be strategic in the way that you display progress and timelines.
3. What are some rarely explored niches in data engineering?
The question is essentially “what are some data engineering skills or practices that are overlooked?”
There are other questions packed into this question. Is there a skill you’re missing that’s holding you back in your career? Are there skills that may be niche but more valuable? What should I be focusing my time on next?
Let’s see what the community had to say.
💡 Key Insight
Data engineers largely recommended focusing on the non-technical aspects of the job: social skills and networking, delivering business value, and building stakeholder relationships were the three most-highlighted skills that most engineers could stand to improve.
On a more technical level, folks recommended learning proper software engineering practices and embracing the full lifecycle of software, including the less glorified parts like testing, CI/CD, and monitoring/logging.
Finally, many cautioned not to get caught up in buzzwords and trends. While finding a niche and developing skills can be helpful, ignoring personal development in solid data engineering fundamentals is a mistake.
4. What are everyone’s hot takes on some of the current data trends?
It’s finally summer, and while most people are staying indoors to escape the heat, the drama is cooking! One member encouraged people to post their hot takes on current data trends.
💡 Key Insight
The most popular hot takes all revolved around companies wanting to jump on the AI hype train despite not having the people, infrastructure, or high quality data to make it possible. Data engineers know this problem well because they are the ones closest to the data and are going to be crucial to unlocking the value that AI promises.
Other notable hot takes included:
Don’t pre-optimize and actually spend time looking at the data and performance instead.
Workflow orchestrators are bloatware.
DE focuses on how to get a job done instead of thinking about if the job should get done.
If you read *The Data Warehouse Toolkit* and *Designing Data-Intensive Applications* and actually reference them as you work, you are ahead of 99.99% of engineers.
Iceberg is not necessary for most companies.
5. Iceberg? What’s the big deal?
Speaking of trends and that thing you probably don’t need: data engineers seem to be buzzing about the Apache Iceberg table format these days, and for good reason.
Iceberg is designed to solve some important problems when working with big data, giving data engineers additional flexibility and control when designing a data lake. Other table formats, such as Delta Lake and Apache Hudi, compete with Iceberg as well.
💡 Key Insight
A table format, sometimes called an open table format, combines a file format (like Parquet) with an additional metadata layer that unlocks new capabilities on the data in your data lake. For example, Iceberg lets you track schema evolution and partition evolution, travel back in time to a previous table state, create table branches and tag table states (similar to git), and handle concurrent reads and writes. These are extremely useful features that traditional data lakes were missing, and the metadata layer also improves data governance on data lakes, which has historically been challenging.
A table format like Iceberg, combined with cheap object storage like S3 and the flexibility to use any compute engine you want, gives data engineers arguably the most choice with virtually no lock-in. So why was it a hot take that most companies don’t need it?
Most companies just don’t have enough data to warrant using it (let alone a data lake) and it takes more time and expertise to unlock the benefits of these table formats compared to the modern database offerings available.
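To make the “metadata layer” idea concrete, here’s a toy Python sketch of how snapshot-based table formats enable time travel: every commit produces an immutable snapshot listing the data files that make up the table at that point, and reading an old snapshot is time travel. This is a conceptual illustration only, not Iceberg’s actual metadata model or API, and the file names are made up.

```python
from dataclasses import dataclass

# Toy model of a table format's metadata layer: each commit creates an
# immutable snapshot listing the data files in the table at that moment.
# "Time travel" is simply reading an older snapshot. (Conceptual sketch
# only -- not Iceberg's real metadata structures.)

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # the files that make up the table at this point

class ToyTable:
    def __init__(self):
        self.snapshots = []

    def commit(self, added_files):
        # A commit never rewrites history: it appends a new snapshot that
        # references the previous files plus the newly added ones.
        current = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, current + tuple(added_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def scan(self, snapshot_id=None):
        # Default: read the latest snapshot; pass an id to time-travel.
        snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id - 1]
        return list(snap.data_files)

table = ToyTable()
table.commit(["part-000.parquet"])
table.commit(["part-001.parquet"])

print(table.scan())               # latest snapshot: both files
print(table.scan(snapshot_id=1))  # time travel: only the first file
```

Because old snapshots are never mutated, concurrent readers can keep scanning a consistent snapshot while a writer commits a new one, which is the same property that lets real table formats support concurrent reads and writes.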
🤓 Recommended reading: What is an Open Table Format?
6. The Self-serve BI Myth
That story usually starts with an engineer or data scientist who's frustrated because they spend too much time writing queries and preparing dashboards for business people. They think that if they make BI easy enough, everyone will be able to “self-serve”, but that rarely ever happens.
Self-serve business intelligence (internal analytics) is the goal of every organization but very few seem to be able to achieve it. One member posted their thoughts in a blog post that generated a great discussion.
💡 Key Insight
While self-serve BI has been around for decades, it still seems to be challenging for many companies. The technical piece of enabling self-serve, building an easy-to-use semantic layer, is well documented in *The Data Warehouse Toolkit*, which data engineers still recommend to this day.
The community believes the remaining challenge isn’t tooling: it’s improving data literacy through training and seeking out power users who will champion data within their own teams. Our advice would be to find a handful of data champions across departments where you have prior relationships, show value and quick wins to build trust, and constantly work on user friendliness and accessibility.
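To show what a semantic layer does in practice, here’s a minimal sketch: business-friendly metric and dimension names mapped to the SQL that produces them, so a BI tool or power user can ask for “revenue by month” without writing SQL. The metric names, table, and columns here are hypothetical, and real semantic layers (e.g. in BI tools) are far richer.

```python
# Minimal semantic-layer sketch: business terms map to SQL expressions,
# so non-technical users pick names instead of writing queries.
# All metric/dimension/table names below are hypothetical examples.

METRICS = {
    "revenue": "SUM(order_total)",
    "order_count": "COUNT(DISTINCT order_id)",
}

DIMENSIONS = {
    "month": "DATE_TRUNC('month', ordered_at)",
    "region": "region",
}

def build_query(metric: str, dimension: str, table: str = "fct_orders") -> str:
    """Translate a (metric, dimension) request into a SQL query string."""
    measure = METRICS[metric]
    dim = DIMENSIONS[dimension]
    return (
        f"SELECT {dim} AS {dimension}, {measure} AS {metric}\n"
        f"FROM {table}\n"
        f"GROUP BY 1\n"
        f"ORDER BY 1"
    )

print(build_query("revenue", "month"))
```

The point isn’t the code, it’s the contract: once the definitions of “revenue” and “month” live in one place, every dashboard and every data champion gets the same numbers, which is most of the trust-building battle.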
📅 Upcoming Events
No events submitted for next month. Submit an event using the link below!
Share an event with the community here or view the full calendar
Opportunities to get involved:
Share your story on how you got started in DE
Want to get involved in the community? Sign up here.
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Want to contribute? Learn how you can get involved.
Stay tuned next month for more updates, and thanks for being a part of the Data Engineering community.