Data Engineering Digest, January 2024
Hello Data Engineers,
Welcome back to another edition of the Data Engineering Digest, a monthly newsletter containing updates, inspiration, and insights from the community.
Here are a few things that happened this month in the community:
What’s next after the lakehouse architecture? The riverboat?
Discussion on using SQL for everything
What happened to Julia and why is Rust so popular?
Apache Iceberg in production
Advice for studying to transition into DE
The best course/certification/degree of all time!
Community Discussions
Here are the top posts you may have missed:
1. What is next after Lakehouse architecture?
“Is there something new in this space ready to replace lakehouses?”
The data space is perpetually evolving, and data engineers who want to keep up tend to look to the community for insights into emerging trends and technologies. Knowing what might replace the lakehouse architecture helps data engineers plan strategically and adopt technologies that align with evolving business needs. But a new trend doesn’t necessarily mean replacement. Often we see a hybrid approach that integrates the strengths of emerging architectures with existing ones.
See related: Lakehouse Architectures - How does it look for you?
💡 Key Insight
Existing tools have been adding features that blur the lines between architectures. For example, back in 2022 Snowflake announced integrations with open table formats, so it can be used to implement a lakehouse while still looking like the familiar data warehouse on the front end. All of the major players have been announcing similar features and encroaching on each other’s territories to compete. Data ingestion and workflow orchestration tools look like the next targets to be cannibalized.
Many members also pointed out that most companies haven’t even moved to a lakehouse architecture yet. There may be a lot of buzz about the latest and greatest architecture, but many data engineers still believe that the traditional data warehouse is the right choice for most companies and that the newest solutions are really only geared towards large, complex, data-driven companies.
2. Discussion: Is SQL all you really need? A perspective from an older colleague.
One member expressed frustration about a coworker’s dogmatic insistence that SQL is the single source of truth and that if a job can be done with SQL, then it should be done with SQL.
"Ah, this is the problem with modern technologies. See, when I learnt SQL 15 years ago, it hasn't really changed much so I've never had to learn anything new. With modern technologies, it always changes and you could be using something completely different next year whereas I've only ever used SQL at all of my jobs."
Technology evolves perhaps too quickly, and many engineers make the mistake of focusing their efforts on chasing the latest technology instead of mastering proven skills. There’s always a balance - unless it’s your passion, perhaps pigeon-holing yourself into a single technology isn’t the best bet, but at the same time being too broad and not mastering any skills lowers your earning potential and impact tremendously.
💡 Key Insight
The best advice is to always use the right tool for the job, which means understanding the strengths and weaknesses of each tool. While SQL is a foundational skill for data engineers and is typically the go-to, that doesn’t mean you should never use anything else. Beyond SQL, modern data engineers are expected to know at least one general-purpose programming language and to apply software engineering best practices such as version control and CI/CD.
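To make the "right tool for the job" point concrete, here is a minimal sketch using only the Python standard library. The sample data and the function names (`seconds_per_user`, `hosts_visited`) are hypothetical and are not taken from the community discussion. The idea is simply that a set-based aggregation reads naturally in SQL, while something like URL parsing is usually easier in a general-purpose language.

```python
import sqlite3
from urllib.parse import urlparse

# Hypothetical sample data: (user_id, page_url, seconds_on_page)
ROWS = [
    (1, "https://example.com/docs/intro", 30),
    (1, "https://example.com/blog/post-1", 45),
    (2, "https://example.com/docs/setup", 60),
    (2, "https://example.org/pricing", 15),
]


def seconds_per_user(rows):
    """Set-based aggregation: a natural fit for SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE visits (user_id INTEGER, page_url TEXT, seconds INTEGER)"
        )
        conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT user_id, SUM(seconds) FROM visits "
            "GROUP BY user_id ORDER BY user_id"
        )
        return cursor.fetchall()
    finally:
        conn.close()


def hosts_visited(rows):
    """URL parsing: simpler with a general-purpose language's standard library."""
    return sorted({urlparse(url).hostname for _, url, _ in rows})


if __name__ == "__main__":
    print(seconds_per_user(ROWS))
    print(hosts_visited(ROWS))
```

Neither half is "better"; in practice the two are often combined in one pipeline, with SQL doing the heavy set-based lifting and a general-purpose language handling the parts SQL makes awkward.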
A large piece of what to learn and use also revolves around what motivates you as an individual. Perhaps the missing key is what excites you and keeps you passionate about data engineering? For some, that is staying on the cutting edge and exploring different technologies. For others, it may be developing mastery of a single tool.
3. Why are we ignoring Julia for data engineering and going for rust?
You may have heard about the Rust language recently as it’s gained popularity in the DE community due to its performance benefits and use in popular tools like Polars, Apache Arrow DataFusion, and several bindings for other DE tools. Julia is another general purpose language that was gaining popularity a few years ago but hasn’t seen as much adoption in the DE community. One member asked, why is Rust so popular? Will there ever be one unified language we all use?
The current favorite language among data engineers is Python but there are some concerns about its performance and the packaging ecosystem. If you’re struggling with some of the downsides of Python you may be looking out for a more performant tool like Rust or Julia.
💡 Key Insight
Both languages first appeared publicly in the early 2010s, but Rust was able to amass a larger community and user base, which helped it gain popularity. Rust’s focus domains are CLI tooling, WebAssembly, networking, and embedded use cases, whereas Julia is more focused on solving computational science problems.
Aside from the Julia vs Rust question, some members pointed out that while base Python is slow, it’s often used as a wrapper around a faster library written in a lower-level language like C++, which allows you to get the best of both worlds. So while Julia or Rust may be faster than Python and solve some of those problems, the benefits would have to be large enough to offset the effort of learning a new language. And currently, if we added Python to the graphs above, it would eclipse the other languages, so we had to leave it out.
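The wrapper point can be illustrated with a small sketch that assumes CPython, where built-ins such as `sum`, `map`, and `operator.mul` are implemented in C. The function names are hypothetical, and the timings will vary by machine and interpreter version, so no numbers are claimed here. The same idea, at a much larger scale, is what libraries with compiled cores provide.

```python
import operator
import timeit


def sum_squares_loop(values):
    """Sum of squares with an explicit loop executed by the interpreter."""
    total = 0
    for value in values:
        total += value * value
    return total


def sum_squares_builtin(values):
    """Same result, but the iteration runs inside C-implemented built-ins."""
    return sum(map(operator.mul, values, values))


if __name__ == "__main__":
    data = list(range(100_000))
    # Both approaches must agree before comparing their speed.
    assert sum_squares_loop(data) == sum_squares_builtin(data)
    for fn in (sum_squares_loop, sum_squares_builtin):
        seconds = timeit.timeit(lambda: fn(data), number=20)
        print(f"{fn.__name__}: {seconds:.3f}s")
```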
Let us know what you think in the comments.
4. Are you using Iceberg in production?
“I’m a big believer in the data lake, even during the Snowflake hype. But it’s always been difficult to set up and use. Iceberg (as well as Hudi and Delta) made data lakes easier with built-in support for upserts and optimizations.
…
Are you using Iceberg in production? Are you considering migrating your existing data lake to Iceberg?”
There has been a lot of enthusiasm around data lakehouse architectures recently, driven by the popularity of open table formats that let you combine the benefits of a traditional data warehouse with those of a data lake. Apache Iceberg is one of the popular formats, often discussed alongside Delta Lake, but because these formats are newer, some data engineers may be hesitant to adopt them just yet. Data engineers tend to decide whether to adopt new tech based not just on technical features but also on factors like community engagement and openness.
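For readers unfamiliar with the term, the sketch below shows what an upsert means at the row level: update rows whose key already exists and insert the rest. This is a plain-Python illustration with a hypothetical `upsert` helper and made-up rows. It is not Iceberg's API, and real table formats apply the same idea to data files with snapshots and transactional guarantees.

```python
def upsert(table, changes, key="id"):
    """Return a new list of rows with `changes` merged into `table` by `key`.

    Rows whose key already exists are updated field by field;
    rows with a new key are appended.
    """
    merged = {row[key]: dict(row) for row in table}
    for row in changes:
        existing = merged.get(row[key], {})
        merged[row[key]] = {**existing, **row}
    return list(merged.values())


if __name__ == "__main__":
    # Hypothetical example rows
    orders = [
        {"id": 1, "status": "new"},
        {"id": 2, "status": "new"},
    ]
    updates = [
        {"id": 2, "status": "shipped"},  # existing key: update
        {"id": 3, "status": "new"},      # new key: insert
    ]
    for row in upsert(orders, updates):
        print(row)
```

Without table-format support, the equivalent operation on a plain data lake usually means rewriting whole files or partitions by hand, which is the difficulty the quoted post describes.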
💡 Key Insight
The most popular tools data engineers were using with Iceberg in production were:
AWS Athena (Presto/Spark) - most popular
Trino (Formerly Presto)
AWS Glue (Spark)
AWS EMR (Spark)
The common theme with most of these options is that they are managed by a cloud provider and integrate well with Iceberg. As more tools integrate with open table formats, we expect to see more adoption, but it may still be too early to tell.
5. People who transitioned to DE, how did you study?
“I want to move into Data Engineering. I've started learning on my own and I'm curious about how others have done it and would recommend doing it.”
We see variations of this question frequently and we are working on answering this in a community roadmap. It won’t be easy but with the community’s insight and ideas we can make something that will hopefully help answer a lot of these questions. If you’re interested, you can get involved here.
💡 Key Insight
Several data engineers shared their stories of working in a data-driven role and finding opportunities at work to build the skills they needed to eventually land a data engineering job. While self-learning and projects are great, ideally you want to proactively find real-world opportunities at work. Identify things that could be automated and data that could be analyzed or optimized. If you have a data team, ask to shadow them sometimes and offer to help with simple tasks to start.
Of course, hard work is important, but it also takes a little bit of luck. So keep yourself motivated however you can and use our community for inspiration and support. Many people have successfully made this transition and shared their experiences, so the community is a treasure trove of information. If you’re looking for project inspiration, check out our data engineering projects page, where we’ve curated high-quality projects submitted by the community. You can also use our project template on GitHub to start your own.
6. What's the best course, certification or degree of all time?
Another FAQ is about the best course/certification for data engineers. We’ve even put together a table of the best learning resources that are frequently recommended by data engineers.
Any community needs excited newcomers in order to grow, and data engineering is no exception. The tech field in particular can be a bit snobby to new devs, and while many newcomers should work on their Googling skills, it’s important to be as supportive as possible.
💡 Key Insight
If you’re just starting out, Reis and Housley’s Fundamentals of Data Engineering is the most frequently recommended book to start with, and we recommend it as well (no affiliation). The Data Warehouse Toolkit is another popular suggestion, as data modeling is an important but often overlooked skill for new data engineers. In terms of certifications, AWS/Azure/GCP certs are useful for learning cloud tooling, but they probably won’t have much of an impact on your ability to get a job. If you’re looking for something more hands-on, the Data Engineering Zoomcamp is a great free option and is recommended by many community members.
📅 Upcoming Events
2/6: Chill Data Summit
2/28 - 3/1: PGConf India
Share an event with the community here or view the full calendar
Opportunities to get involved:
Share your story on how you got started in DE
Want to get involved in the community? Sign up here.
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Want to contribute? Learn how you can get involved.
Stay tuned next month for more updates, and thanks for being a part of the Data Engineering community.