Data Engineering Digest, September 2023
Hello Data Engineers,
Welcome back to another edition of the Data Engineering Digest, a monthly newsletter containing updates, inspiration, and insights from the community.
📢 Community Announcements 📢
This month we released v1 of our community's open source salary tool. Thank you to Jake T., Saurabh J., Nicholas P., and Jon P. for your contributions!
The NYC Data Engineers meetup is looking for co-organizers! You'll get a chance to work with Joseph Machado from Start Data Engineering. If you're interested in getting involved, please sign up here.
DataTune 2024 call for speakers: a data community conference in Nashville in March 2024. Please submit talks about analytics, data science, data engineering, and database management here.
Here are a few things that happened this month in the community:
ETL platform choices and culture.
Educational backgrounds of non-techies turned data engineers.
How bad is a DE that doesn't use Python?
Are data contracts really this complicated?
Advice on a data lake with DuckDB.
Whitepapers that every data engineer should read!
Community Discussions
Here are the top posts you may have missed:
1. What ETL platform do you use and what are the pros and cons?
One data engineer working on a business intelligence team asked the community for ETL platform recommendations. They currently work for a healthcare company with 8,000 employees.
While no one can give a recommendation based on the sparse details above, you can see some patterns from the experiences other data engineers shared. As an important note, there is no one-size-fits-all "best" platform. Deciding on the "best" ETL platform depends on the use case, budget constraints, personnel, etc.
💡 Key Insight
If character is destiny, then a company's culture determines its data stack.
In risk-averse industries like banking and government, vendor solutions tend to be preferred. You pay not just for a managed platform but also for the support that comes with it, which these organizations value highly. Changes are made carefully and slowly because there is a larger risk of something happening to the valuable data these organizations protect. Finally, while these tools tend to be a one-stop shop, they are rarely customizable. In other words, they can probably do most things pretty well, but that last 10% of work may be impossible.
Some tools mentioned in this area were:
Informatica
SSIS
IBM DataStage
On the other hand, companies with a strong engineering culture tend to prefer open-source solutions and build more things in-house to fit their needs. These organizations place a greater value on flexibility and the ability to iterate quickly. As a caveat, open source does not always mean "more advanced". While typically cheaper, it's crucial to consider the time staff need to learn the tool and become operational. Besides the hourly rate of staff, time is one of the few resources you can never recover. Open-source tools tend to have less robust documentation and fewer avenues to ask (and find answers to) questions. If a managed tool saves you time and money, losing the small bit of customization that an open-source tool provides may be worth it.
Tools mentioned by these types of orgs were:
Open source tools: Meltano, dbt, Airbyte
Python for a custom platform
2. What is your background education?
A recent chemical engineering grad with no prior knowledge of data engineering was curious how other folks without a tech background got their start in the field.
Tip: check out our resources on transitioning into data engineering
Most folks probably know by now that data engineering can be a difficult field to enter due to the lack of junior opportunities and of degrees that teach skills specific to data engineering. Many people with a non-tech background enter the field from adjacent roles like data analyst or business intelligence engineer.
💡 Key Insight
Members listed a wide variety of educational backgrounds: Chemistry, Music, Biology, Film Editing, Finance, and even no college degree.
It may not be easy to get your break, but the wide variety of backgrounds is a testament to how open companies can be as long as you can prove you can do the job. Other industries are much less forgiving if you lack the requisite education. If you're interested in transitioning to data engineering, we've put together a getting started guide.
3. Ok to not really use any python as a DE?
"I work in a DE role but find myself doing about 99% of my job in SQL. I'm building out dim tables and scripting the stored procedures needed for the ETL. I rarely use any Python and when I do it's to throw my stored procedure into a really simple DAG and schedule it."
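For anyone unfamiliar with the setup the poster describes, it can be surprisingly small. Here's a minimal sketch using Airflow 2.x, where the warehouse connection string and the refresh_dim_tables procedure are hypothetical stand-ins:

```python
# Minimal Airflow DAG that schedules a single stored procedure call.
# The connection string and procedure name are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_stored_procedure():
    import psycopg2  # assumes a Postgres warehouse and psycopg2 installed

    with psycopg2.connect("dbname=dw user=etl") as conn:
        with conn.cursor() as cur:
            cur.execute("CALL refresh_dim_tables()")  # all logic lives in SQL

with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2023, 9, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    PythonOperator(
        task_id="refresh_dims",
        python_callable=run_stored_procedure,
    )
```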
We hear about Python a lot in data engineering, and most engineers consider it a required core skill. So if you're rarely using it, you may worry about falling behind.
💡 Key Insight
The reality is that SQL is the better tool for most of the transformations you need to do in data engineering. Remember: you want to bring the compute to the data, not the data to the compute.
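To make that concrete, here's a minimal sketch of the difference, using sqlite3 only so it runs anywhere; the orders table and its columns are hypothetical:

```python
# "Bring the compute to the data": let the database aggregate
# instead of shipping every row to Python first.
import sqlite3

con = sqlite3.connect("warehouse.db")  # hypothetical warehouse

# Pushed down: the database scans and aggregates; one row comes back.
(total,) = con.execute(
    "SELECT SUM(amount) FROM orders WHERE status = 'shipped'"
).fetchone()

# Anti-pattern: pull every row into Python, then aggregate there.
rows = con.execute("SELECT amount, status FROM orders").fetchall()
total_again = sum(amount for amount, status in rows if status == "shipped")
```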
The advice from the community can be summed up as: only use Python (or another general-purpose language) if you can't do a certain transformation in SQL, or if it's much more readable in Python. Machine learning was cited as an example of when to prefer a programming language, but there are some companies working on bringing this back into the database (🎥 watch Less is More with PostgresML).
Some data engineers mentioned that while you may not need a programming language day-to-day, you will almost certainly still be interviewed on it, and if all you use is SQL, you will probably get bored pretty quickly.
4. Data contracts is a buzzword
"This concept has been made way too complicated than it needs to be. I think we need some level setting here."
This conversation got a lot of buzz, likely because of all the talk about data contracts on social media in recent years. For those who are unfamiliar, a data contract is a machine-readable document that defines and enforces the schema and expectations of a data source, which data consumers can use to prevent breaking changes. You can think of it like an SLA.
If you're looking for an example, see datacontract.com, which also references PayPal's open-sourced data contract template released earlier this year.
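In code, a contract can be as small as a typed schema that both producers and consumers test against. Here's a minimal sketch using pydantic, with a hypothetical OrderEvent schema:

```python
# A data contract as code: a typed schema that fails loudly when a
# producer ships a breaking change. Field names are hypothetical.
from pydantic import BaseModel, ValidationError

class OrderEvent(BaseModel):
    order_id: int
    customer_id: int
    amount_usd: float

def validate_record(record: dict) -> bool:
    try:
        OrderEvent(**record)
        return True
    except ValidationError as err:
        # Surface the violation to the producer instead of letting it
        # silently corrupt downstream tables.
        print(f"Contract violation: {err}")
        return False

# A producer renames amount_usd to amount -> the contract catches it.
validate_record({"order_id": 1, "customer_id": 7, "amount": 9.99})
```

Checked into version control and run in CI on both sides, a schema like this becomes exactly the kind of asynchronous communication channel described below.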
💡 Key Insight
While schema management (aka the original data contracts) may not be a new concept, it's worth understanding because it can save you a ton of time and future headaches. Schemas can change frequently due to external producers you have no control over, manually generated data, or product teams simply being unaware the data is being consumed.
A common example is a data team consuming CDC data from a database without the product team's knowledge. Because the product team doesn't realize the data is being used, they make changes that break downstream analytics. The value add for data contracts here is not just preventing breaking changes but also making data producers aware of what data is being used downstream.
While you can implement organizational policies to communicate these changes manually, data contracts offer a version-controlled way to communicate them asynchronously.
Related news: this month Gable unveiled its new platform for data contracts.
5. Views on using duckdb + S3 as a datalake/datalakehouse/datawarehouse
Proposed architecture: raw data (S3) -> transform data (DuckDB) -> store data (PostgreSQL) -> visualize data (dashboards)
The engineer who proposed this wondered whether it is a good idea at all and whether it would scale.
Since DuckDB is an in-process analytical database (think SQLite, but for analytics), it unlocks new possibilities, and it's natural for data engineers to wonder about all of the places it can be used.
💡 Key Insight
It's probably not the best solution for a large-scale data lake, largely because DuckDB is not built for distributed processing (i.e., it's limited to a single machine), but if you're working with small data volumes it should be fine. A more common solution for data lake compute would be a query engine like Presto/Trino or Spark. Still, if you're interested in playing around with this idea, check out this tutorial on building a "poor man's" data lake with DuckDB.
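For the curious, here's roughly what the proposed pipeline looks like at small scale. This is a sketch, not a recommendation: the S3 path and Postgres connection string are hypothetical, AWS credentials are assumed to come from the environment, and it assumes duckdb, pandas, SQLAlchemy, and a Postgres driver are installed:

```python
# raw data (S3) -> transform (DuckDB) -> store (Postgres) -> dashboards
import duckdb
import sqlalchemy

con = duckdb.connect()
con.execute("INSTALL httpfs")  # adds S3 support to DuckDB
con.execute("LOAD httpfs")

# Transform: aggregate raw Parquet directly on S3 inside DuckDB.
daily = con.sql("""
    SELECT order_date, SUM(amount_usd) AS revenue
    FROM read_parquet('s3://my-bucket/raw/orders/*.parquet')
    GROUP BY order_date
""").df()

# Store: land the small aggregated result in Postgres for dashboards.
engine = sqlalchemy.create_engine("postgresql://user:pass@localhost/dw")
daily.to_sql("daily_revenue", engine, if_exists="replace", index=False)
```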
6. What whitepapers should every data engineer read?
One senior data engineer asked the community for fundamental papers that would be useful for any data engineer to read. They asked for white papers, but it turns out they were looking for scientific research papers related to DE. Still, there was a great discussion with plenty of recommendations.
The differences between commercial white papers and scientific papers are:
White papers are written by a company's staff, while scientific papers are written by academic researchers and are peer-reviewed.
White papers aim to persuade the reader to make a specific decision, while scientific papers aim to contribute to the body of scientific knowledge.
White papers can be extremely well researched and relevant to you, but it's important to understand the distinction between the two.
💡 Key Insight
Here are the most popular suggestions (not all are papers):
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Data Mesh Principles and Logical Architecture (read our April edition: Is the Data Mesh paradigm really practical?)
A Relational Model of Data for Large Shared Data Banks by Edgar F. Codd (the inventor of the relational model)
And one of our own suggestions:
🎁 Bonus:
💻 DE project: Spark, Delta Lake, Great Expectations & Docker w/ best practices!
🏎️ Formula 1 streaming pipeline
📅 Upcoming Events
Share an event with the community here or view the full calendar
What did you think of today's newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Want to contribute? Learn how you can get involved.
We'll see you next month!