Data Engineering Digest, August 2023

Aug 31, 2023

Hello Data Engineers,

Welcome back to another edition of the Data Engineering Digest, a monthly newsletter containing updates, inspiration, and insights from the community.

Here are a few things that happened this month in the community:

IC to Data Engineering Manager Pitfalls and Advice
Underrated Open Source Tools
dbt-core Production Setups
Fun Industries to Work in As a DE
SQL for Data Engineers in Real Life
Quality Podcasts and YouTube Channels

Community Discussions

Here are the top posts you may have missed:

1. Moving on to a managerial position

One long-term senior data engineer who was offered a management role asked the community for general advice and pitfalls.

Management requires very different skills from engineering, and many ICs who have just become managers feel the urge to do everything themselves. The ability to delegate effectively, give engineers space to succeed (and even fail), and resist the urge to micromanage is crucial.

💡 Key Insight

Most commenters recommended using common sense, taking ownership of helping other people succeed, and giving trust (even when not deserved). It’s also important for leadership to be aware of the pitfalls of productivity metrics and overzealous planning, since these practices can hurt a team's morale.

Data engineering managers should instead empower their teams by choosing owners, providing clear guidance, and allowing them to make decisions without relying on permission from upper management. A final word of advice for future managers is to genuinely have an interest in people and understand that some team members may need additional help feeling autonomous.

2. Underrated Open Source Data Tools

We often hear about the popular open-source tools used in data engineering, but one member was looking for underrated tools for a talk they are preparing.

These kinds of discussions are usually insightful, and we found at least a few new tools that are great but less well known.

💡 Key Insight

Here are some recommendations from members:

Debezium: change data capture (CDC) tool with support for several popular sources/targets.
ClickHouse: open-source OLAP DBMS for real-time analytics.
DataDiff: Compare tables within or across databases.
OpenMetadata: all-in-one platform for data discovery, data lineage, data quality, observability, governance, and team collaboration.
Apache NiFi: easy to use, powerful, and reliable system to process and distribute data.

3. How does your team use dbt-core without dbt-cloud?

Have you ever wondered how other teams have set up dbt-core to be production-ready?

You may already know about dbt Cloud and how it takes care of CI/CD and infrastructure so teams can focus on building their project. For teams that can't use dbt Cloud, however, there doesn't seem to be a standard way to host dbt-core.

💡 Key Insight

For scheduling/orchestrating dbt-core the most popular tools were:

Airflow
Dagster (dbt integration)
GitHub Actions

Almost all of the data engineers who commented containerized their dbt projects with Docker and passed in environment variables to change the configuration. We love to see it.

From there, jobs are typically submitted to serverless compute services such as Amazon Elastic Kubernetes Service (EKS) on Fargate, Google Kubernetes Engine (GKE), or AWS Batch on Fargate.

A CI/CD pipeline is set up to generate dbt docs whenever new dbt models are deployed to production. These docs are then served via an S3 static site or GitHub Pages.

While there is no right or wrong way to run your dbt-core project, the community seems to have arrived at a similar set of patterns (orchestrator + serverless + static site).
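To make the "container plus environment variables" pattern concrete, here is a minimal sketch of a container entrypoint helper in Python. It is an illustration only, not a setup any commenter shared: the variable names DBT_TARGET, DBT_PROFILES_DIR, and DBT_SELECT are hypothetical, while `dbt build`, `--target`, `--profiles-dir`, and `--select` are dbt-core CLI options.

```python
import os
import shlex


def build_dbt_command(env, subcommand="build"):
    """Assemble a dbt-core CLI invocation from environment-style settings.

    DBT_TARGET, DBT_PROFILES_DIR and DBT_SELECT are hypothetical variable
    names chosen for this sketch; they are not defined by dbt itself.
    """
    cmd = ["dbt", subcommand]
    # Which profile target (e.g. dev or prod) to run against.
    cmd += ["--target", env.get("DBT_TARGET", "dev")]
    # Where profiles.yml lives inside the container image.
    cmd += ["--profiles-dir", env.get("DBT_PROFILES_DIR", ".")]
    # Optional single selector, such as a tag or model name.
    select = env.get("DBT_SELECT")
    if select:
        cmd += ["--select", select]
    return cmd


if __name__ == "__main__":
    # A real entrypoint would hand this list to subprocess.run(..., check=True);
    # here the command is only printed so the sketch runs without dbt installed.
    print(shlex.join(build_dbt_command(os.environ)))
```

The same image can then be scheduled by Airflow, Dagster, or GitHub Actions with different environment variables per environment, which is the design choice most commenters described.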

4. “Fun” domains to work as DE?

One community member shared how they are disappointed with their experience working as a data engineer in the marketing industry. Even though the tools they get to work with are nice, they are frustrated with how their hard work often doesn't get used to make decisions and how the overall goal of getting users to click more ads is unfulfilling.

A question for data engineers: What are some other "boring" industries? Join the discussion.

💡 Key Insight

The most popular answer was the medical device industry (think J&J, Stryker, Medtronic, etc.). Medical devices produce a diverse set of data with a plethora of interesting use cases.

Along the same lines, another member chimed in with companies that make data products in general. “Data products are essentially an amalgamation of a variety of use cases that can be useful for customers across different business domains.” An example of this would be a fitness-tracking company that could both serve users their daily stats and provide that data to bioinformatics companies for research purposes. Another pro for product-based companies is that they aim to maximize the utilization of their data, allowing you to explore various aspects of the same data.

A few other suggestions were: Non-profits, EdTech, Sports Betting/Analytics, and our favorite: an international cookie company.

One controversial take: the finance industry. While some feel like it’s a fun industry and pays well, others say it’s highly dependent on the company and project.

5. What kind of SQL are you guys writing and how complex is it?

Are you currently learning SQL and wondering what it looks like in the real world? Maybe you already write SQL and you wonder if other data engineers use SQL differently than you do.

It’s always helpful to know what tools and techniques are being used in practice vs. what you learn via classes or books. Even if you use it regularly you never know what you might learn from someone more experienced.

💡 Key Insight

The best advice given around this subject is to keep your SQL simple and readable. Simpler scripts that are well-organized are easier to understand (for future you and others), easier to debug, and more maintainable by other engineers. Readability is an underutilized metric with which to judge scripts - while performance is extremely important (and in some sectors by far the most important metric), having readable code allows for greater code development velocity in a collaborative environment and fewer bugs.

To make your SQL more readable, engineers recommend using CTEs to break up logic into more bite-sized chunks. Other than that, DEs primarily use window functions, MERGE statements, views, UDFs and occasionally stored procedures (but stored procs are a bit controversial).
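As a rough illustration of the CTE advice, the sketch below breaks a "best day per customer" query into named steps and uses a window function for the ranking. The `orders` table, its columns, and the helper name are made up for this example, and it runs against an in-memory SQLite database through Python's standard `sqlite3` module (assuming a SQLite build of 3.25 or later, which is needed for window functions).

```python
import sqlite3

QUERY = """
WITH daily_totals AS (
    -- Step 1: collapse raw orders into one row per customer per day.
    SELECT customer_id, order_date, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id, order_date
),
ranked AS (
    -- Step 2: rank each customer's days by spend (ties go to the earlier date).
    SELECT customer_id, order_date, total,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY total DESC, order_date
           ) AS rn
    FROM daily_totals
)
-- Step 3: keep only each customer's best day.
SELECT customer_id, order_date, total
FROM ranked
WHERE rn = 1
ORDER BY customer_id
"""


def best_day_per_customer(rows):
    """Load (customer_id, order_date, amount) rows and run the CTE query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [
        (1, "2023-08-01", 10.0),
        (1, "2023-08-01", 5.0),
        (1, "2023-08-02", 12.0),
        (2, "2023-08-01", 7.0),
        (2, "2023-08-03", 7.0),
    ]
    for row in best_day_per_customer(sample):
        print(row)
```

Each CTE can be read and tested on its own, which is the readability benefit commenters described; the same logic written as nested subqueries would be harder to debug.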

6. Does anyone actually watch Data Engineering podcasts/youtubers?

One data engineer discusses the scarcity of quality content on data engineering topics available via podcasts and YouTube. While there is plenty of software engineering and front-end content, it can be challenging to find well-produced, vendor-independent data engineering content that goes beyond the basics.

For data engineers, it can be difficult to stay informed of the latest tools and trends in the field. And while there is plenty of content out there, it’s oftentimes created by people who have a vested interest in the product/service being discussed vs. a third party who can give a more unbiased opinion.

💡 Key Insight

Most of the community agreed that they wish there were more advanced content. One member suggested that advanced, quality content is scarce because data engineering is demanding and creating content after work is tiring. Another reason could be that data engineers are often bound by an NDA and therefore legally can’t publicly discuss their day-to-day work.

Despite the sentiment, there were a few recommendations for folks to check out.

Thu Vu: for Python and SQL
ByteByteGo: for system design and general SWE
Large Data Bank: for database internals and programming
System Design Fight Club: for more in-depth system design
Unapologetically Technical: for technical deep dives into different tools
nullQueries: for general data engineering and data architecture
The Joe Reis Show: for general data engineering

Data Engineering Community
