Data Engineering Digest, September 2023
Hello Data Engineers,
Welcome back to another edition of the data engineering digest - a monthly newsletter containing updates, inspiration, and insights from the community.
📢 Community Announcements 📢
This month we released v1 of our community’s open source salary tool. Thank you to Jake T., Saurabh J., Nicholas P., and Jon P. for your contributions!
The NYC Data Engineers meetup is looking for co-organizers! You’ll get a chance to work with Joseph Machado from Start Data Engineering. If you’re interested in getting involved, please sign up here.
DataTune 2024 call for speakers: a data community conference in Nashville in March 2024. Please submit talks about analytics, data science, data engineering, and database management here.
Here are a few things that happened this month in the community:
ETL platform choices and culture.
Educational backgrounds of non-techies turned data engineers.
How bad is a DE that doesn’t use Python?
Are data contracts really this complicated?
Advice on a data lake with DuckDB.
Whitepapers that every data engineer should read!
Community Discussions
Here are the top posts you may have missed:
1. What ETL platform do you use and what are the pros and cons?
One data engineer working on a business intelligence team asked the community for ETL platform recommendations. They currently work for a healthcare company with 8,000 employees.
While no one can make a firm recommendation based on the sparse details above, some patterns emerge from the experiences other data engineers shared. As an important note, there is no one-size-fits-all “best” platform. Deciding on the “best” ETL platform depends on the use case, budget constraints, personnel, etc.
💡 Key Insight
If character is destiny, then a company’s culture determines its data stack.
In risk-averse industries like banking and government, vendor solutions tend to be preferred. You pay not just for a managed platform; an outsized value is also placed on the support that comes with it. Changes are made carefully and slowly because there is a larger risk of something happening to the valuable data these organizations protect. Finally, while these tools tend to be a one-stop shop, they are rarely customizable. In other words, they can probably do most things pretty well, but that last 10% of work may be impossible.
Some tools mentioned in this area were:
Informatica
SSIS
IBM Data Stage
On the other hand, companies with a strong engineering culture tend to prefer open-source solutions and build more things in-house to fit their needs. These organizations place a greater value on flexibility and the ability to iterate quickly. As a caveat, open source does not always mean “more advanced”. While open source tools are typically cheaper, it’s crucial to consider the time staff need to learn the tool and become operational. Besides the hourly rate of staff, time is one of the few resources you can never recover. Open source tools tend to have less robust documentation and fewer avenues to ask (and find answers to) questions. If using a managed tool saves you time and money, giving up the small bit of customization that an open source tool provides may be worth it.
Tools mentioned by these types of orgs were:
Open source tools: Meltano, dbt, Airbyte
Python for a custom platform
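For teams in the build-it-yourself camp, a “custom platform” often starts as little more than an extract-transform-load script wired into a scheduler. Below is a minimal, hedged sketch of that pattern; the API endpoint, table name, and connection string are hypothetical placeholders, not anything a specific poster described.

```python
# Minimal custom ETL sketch: extract from an API, lightly transform, load to a warehouse.
# The endpoint, table, and connection string below are illustrative placeholders.
import requests
import pandas as pd
from sqlalchemy import create_engine


def extract(url: str) -> pd.DataFrame:
    """Pull raw records from a (hypothetical) JSON API."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Light cleanup: normalize column names and drop exact duplicates."""
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df.drop_duplicates()


def load(df: pd.DataFrame, table: str, conn_str: str) -> None:
    """Append the cleaned frame into the target warehouse table."""
    engine = create_engine(conn_str)
    df.to_sql(table, engine, if_exists="append", index=False)


if __name__ == "__main__":
    raw = extract("https://example.com/api/orders")
    load(transform(raw), "stg_orders", "postgresql://user:pass@host:5432/analytics")
```

The trade-off is exactly the one described above: full flexibility, but you own the documentation, testing, and operational burden yourself.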
2. What is your background education?
A recent chemical engineering grad with no prior tech experience was curious how other folks without a tech background got their start in data engineering.
Tip: check out our resources on transitioning into data engineering
Most folks probably know by now that data engineering can be a difficult field to enter due to the lack of junior opportunities and degrees that teach skills specific to data engineering. Most of the folks with a non-tech background enter the field from adjacent roles like data analyst or business intelligence engineer.
💡 Key Insight
Members listed a wide variety of educational backgrounds: Chemistry, Music, Biology, Film Editing, Finance, and even no college degree.
It may not be easy to get your break, but the wide variety of backgrounds is a testament to how open companies can be as long as you can prove you can do the job. Other industries are far less forgiving when you lack the requisite education. If you’re interested in transitioning to data engineering, we’ve put together a getting started guide.
3. Ok to not really use any python as a DE?
“I work in a DE role but find myself doing about 99% of my job in SQL. I’m building out dim tables and scripting the stored procedures needed for the ETL. I rarely use any Python and when I do it’s to throw my stored procedure into a really simple DAG and schedule it.”
We hear about Python a lot in data engineering, and most engineers consider it a required core skill. So if you’re rarely using it, you may be worried about falling behind.
💡 Key Insight
The reality is that SQL will be the better tool for most of the transformations you need to do in data engineering. Remember: you want to bring the compute to the data, not the data to the compute.
The advice from the community can be summed up as: only use Python (or another general-purpose language) if you can’t do a certain transformation in SQL or if it’s much more readable in Python. Machine learning was cited as an example of when to prefer a programming language, though some companies are working on bringing this back into the database (🎥 watch Less is More with PostgresML).
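As a rough illustration of “bring the compute to the data”: the aggregation below runs entirely inside the warehouse, so only the small result set crosses the wire. The table and column names are made up for the example.

```python
# "Bring the compute to the data": run the aggregation in the warehouse and
# fetch only the result. Table/column names here are illustrative.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@host:5432/analytics")

QUERY = """
    SELECT customer_id,
           DATE_TRUNC('month', ordered_at) AS order_month,
           SUM(amount)                     AS total_spend
    FROM   orders
    GROUP  BY customer_id, DATE_TRUNC('month', ordered_at)
"""

with engine.connect() as conn:
    monthly_spend = conn.execute(sa.text(QUERY)).fetchall()  # small result set only

# The anti-pattern would be `SELECT * FROM orders` into a DataFrame and then
# grouping in pandas: the same logic, but every raw row leaves the database.
```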
Some data engineers mentioned that while you may not need a programming language day-to-day, you will almost certainly still be interviewed on it, and if all you use is SQL, you will probably get bored pretty quickly.
4. Data contracts is a buzzword
“This concept has been made way too complicated than it needs to be. I think we need some level setting here.”
This conversation got a lot of buzz, likely because of all the talk about data contracts on social media in recent years. For those who are unfamiliar, a data contract is a coded document that defines and enforces the schema and expectations of a data source, which data consumers can use to prevent breaking changes. You can think of it like an SLA.
If you’re looking for an example, see datacontract.com, which also references PayPal’s open-sourced data contract template released earlier this year.
💡 Key Insight
While schema management (aka the original data contracts) may not be a new concept, it’s worth understanding because it can save you a ton of time and future headaches. Schemas can change frequently due to external producers you have no control over, manually generated data, or product teams simply being unaware the data is being consumed.
A common example is a data team consuming CDC data from a database without the product team knowing. Because the product team doesn’t realize the data is being used, they make changes that impact downstream analytics. The value data contracts add here is not just preventing breaking changes but also making data producers aware of what data is being used downstream.
While you can implement organizational policies to communicate these changes manually, data contracts can be a version controlled way of communicating asynchronously.
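To make the idea concrete, here is a deliberately minimal, library-free sketch of the enforcement half of a data contract: the agreed schema lives in version control as code, and a check fails loudly when a producer’s output drifts. The field names are hypothetical; real implementations usually define contracts in YAML/JSON and run checks in CI or the pipeline itself.

```python
# Minimal data-contract check: the agreed schema is code, and any drift in the
# producer's output fails fast instead of silently breaking consumers.
# Field names and types here are illustrative only.
EXPECTED_SCHEMA = {
    "order_id": int,
    "customer_id": int,
    "amount": float,
    "ordered_at": str,  # ISO-8601 timestamp
}


def validate_record(record: dict) -> list[str]:
    """Return a list of contract violations for a single record."""
    violations = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    extra = set(record) - set(EXPECTED_SCHEMA)
    if extra:
        violations.append(f"unexpected fields: {sorted(extra)}")
    return violations


# Example: the producer renames `amount` to `order_amount`; the check surfaces
# the break before it ever reaches downstream analytics.
print(validate_record({"order_id": 1, "customer_id": 7,
                       "order_amount": 9.99, "ordered_at": "2023-09-01"}))
```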
Related news: this month Gable unveiled its new platform for data contracts.
5. Views on using duckdb + S3 as a datalake/datalakehouse/datawarehouse
Proposed architecture: raw data (S3) -> transform data (DuckDB) -> store data (PostgreSQL) -> Visualize data (dashboards)
The engineer who proposed this wondered whether it’s a good idea at all and whether it would scale.
Since DuckDB is an in-process analytical database (think SQLite, but for analytics), it unlocks new possibilities, and it’s natural for data engineers to wonder about all of the places it can be used.
💡 Key Insight
It’s probably not the best solution for a large-scale data lake, largely because DuckDB is not built for distributed processing (i.e., it’s limited to a single machine), but if you’re working with small data volumes it should be fine. A more common solution for data lake compute would be a query engine like Presto/Trino or Spark. Still, if you’re interested in playing around with this idea, check out this tutorial on building a “poor man’s” data lake with DuckDB.
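If you do want to experiment with the proposed stack, a rough sketch of the first two stages (raw Parquet on S3, transformed with DuckDB) might look like the following. The bucket, path, and credentials are placeholders, and loading the result into PostgreSQL is left to whichever client you prefer.

```python
# Rough sketch of the S3 -> DuckDB stage of the proposed architecture.
# Bucket, path, and credentials are placeholders; this runs on a single machine only.
import duckdb

con = duckdb.connect()                 # in-process, no server needed
con.execute("INSTALL httpfs;")         # enables reading directly from S3
con.execute("LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")
con.execute("SET s3_access_key_id='...';")
con.execute("SET s3_secret_access_key='...';")

# Transform raw Parquet files with plain SQL and pull back only the aggregate.
daily_revenue = con.execute("""
    SELECT CAST(ordered_at AS DATE) AS order_date,
           SUM(amount)              AS revenue
    FROM   read_parquet('s3://my-raw-bucket/orders/*.parquet')
    GROUP  BY 1
    ORDER  BY 1
""").df()   # small aggregated result as a pandas DataFrame

# From here, write `daily_revenue` into PostgreSQL with any client
# (e.g. pandas.DataFrame.to_sql) for the dashboards to query.
print(daily_revenue.head())
```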
6. What whitepapers should every data engineer read?
One senior data engineer asked the community for fundamental papers that would be useful for any data engineer to read. They asked for white papers, but it turns out they were actually looking for scientific research papers related to DE. Still, there was a great discussion with plenty of recommendations.
The differences between commercial white papers and scientific papers are:
White papers are written by a company’s staff vs scientific papers are written by academic researchers and are peer reviewed
White papers have the goal of persuading the reader to make a specific decision vs scientific papers have the goal of contributing to the body of scientific knowledge
White papers can be extremely well researched and relevant to you, but it’s important to understand the distinction between the two.
💡 Key Insight
Here are the most popular suggestions (not all are papers):
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Data Mesh Principles and Logical Architecture (read our April edition: Is the Data Mesh paradigm really practical?)
A Relational Model of Data for Large Shared Data Banks by Edgar F. Codd (the inventor of the relational model)
And one of our own suggestions:
🎁 Bonus:
📅 Upcoming Events
Share an event with the community here or view the full calendar
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Want to contribute? Learn how you can get involved.
We’ll see you next month!