Hello Data Engineers,
Welcome back to another edition of the data engineering digest - a monthly newsletter containing updates, inspiration, and insights from the community.
Here are a few things that happened this month in the community:
Rethinking Data Governance: Challenges and Opportunities
Managing Schema Evolution Effectively
Navigating Advanced SQL Challenges
Google’s $32B Acquisition of Wiz: What It Means for Cloud Security
The Rise of Synthetic Data: Trends and Insights
Modern Data Center Challenges
Community Discussions
Here are the top posts you may have missed:
1. What's your honest take on Data Governance?
Data governance is the framework and set of procedures used to maintain data quality and security throughout the data lifecycle (input, storage, transformation, access, and deletion). Too much governance can slow down data operations, and too little can lead to security risks and bad data.
💡 Key Insight
The benefits of data governance are clear: increased data trust, improved security and fewer data silos that hinder efficiency. However, opinions on implementation feasibility vary: while some dismiss it as a “pipe dream”, others argue that with targeted adjustments, it can be successfully implemented in any organization.
The common barriers to data governance we found were:
Lack of organizational buy-in: One of the biggest hurdles is ensuring that both teams and leadership take ownership and prioritize data governance initiatives.
Lower priority: Governance often isn't a focus until data issues arise.
Time and resource investment: Proper governance takes time, and if it’s overly complex or slows things down, teams might bypass the process entirely.
Tips to make data governance more successful:
Keep it simple - especially at smaller organizations. Start small and prioritize your most used data assets.
Deliver value incrementally so that stakeholders see results early, which helps earn buy-in for bigger investments.
Keep open lines of communication with data consumers. Data engineers need to be talking to teams to understand what they need and to build trust.
For a deeper dive into data governance strategies, check out these community recommendations:
2. How do you handle data schema evolution in your company?
Schema evolution refers to the process of modifying the schema of a database in a way that preserves the integrity of existing data and maintains backward compatibility with previous versions. It’s one of the core data engineering challenges, and learning how to manage it can save you the headaches of broken pipelines and data loss.
💡 Key Insight
Typically, the best way to manage schema evolution is to be part of the discussion about changes. This may mean being involved in upstream PRs or creating data contracts to facilitate discussion.
Other, less ideal approaches include automatically merging in new schemas or validating data before processing (and stopping if validation fails). These reactive approaches can lead to other issues, such as unexpected behavior and pipeline disruptions.
On the technology side, a schema registry can help. A schema registry is a platform that stores your versioned schemas and provides helper functions that producers and consumers use to serialize and deserialize data against them. You can also “roll your own”: one data engineer shared that they stored their schemas in JSON files, versioned them in git, and used Python to serialize and deserialize them.
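To make the “roll your own” idea concrete, here is a minimal sketch, not the setup the data engineer actually used. It assumes a made-up schema format (a "fields" map with a "type" and an optional "default"), and it keeps the schemas as inline JSON strings where a real setup would read files tracked in git. The names SCHEMA_FILES, serialize, and deserialize are all hypothetical.

```python
import json

# Hypothetical schema files keyed by version. In a real setup these would be
# JSON files versioned in git; inline strings keep the sketch self-contained.
SCHEMA_FILES = {
    1: '{"fields": {"user_id": {"type": "int"}, "email": {"type": "str"}}}',
    2: '{"fields": {"user_id": {"type": "int"}, "email": {"type": "str"}, '
       '"country": {"type": "str", "default": "unknown"}}}',
}

# Maps the type names used in the (assumed) schema format to Python types.
TYPES = {"int": int, "str": str}


def load_schema(version):
    """Parse the schema for one version."""
    return json.loads(SCHEMA_FILES[version])


def serialize(record, version):
    """Validate a record against a schema version, then wrap it with that version."""
    for name, spec in load_schema(version)["fields"].items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if not isinstance(record[name], TYPES[spec["type"]]):
            raise TypeError(f"{name} should be of type {spec['type']}")
    return json.dumps({"schema_version": version, "data": record})


def deserialize(payload, reader_version):
    """Read a payload using the reader's schema, filling defaults for added fields."""
    data = json.loads(payload)["data"]
    result = {}
    for name, spec in load_schema(reader_version)["fields"].items():
        if name in data:
            result[name] = data[name]
        elif "default" in spec:
            result[name] = spec["default"]
        else:
            raise ValueError(f"field has no value and no default: {name}")
    return result


# A record written with schema v1 is still readable by a consumer on schema v2.
old_payload = serialize({"user_id": 7, "email": "a@example.com"}, version=1)
print(deserialize(old_payload, reader_version=2))
```

The design choice worth noting is that the new field in version 2 carries a default. That is what lets a consumer on the newer schema keep reading older records, which is the backward compatibility described above.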
3. What’s the most difficult SQL code you had to write?
Like this member, you may be trying to understand what level of SQL skill a typical data engineering role requires and what counts as “advanced.” The basics of SQL can generally be learned quickly, but a few areas commonly trip up newcomers.
💡 Key Insight
In terms of writing SQL, most data engineers felt that recursive CTEs were the most difficult, with window functions a close second. That said, most felt that writing SQL generally wasn’t the hard part; reading, understanding, and debugging complicated SQL was the more common struggle.
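For anyone who hasn’t met a recursive CTE yet, here is a small sketch of the pattern. It runs against an in-memory SQLite database through Python’s built-in sqlite3 module. The employees table and its rows are invented for illustration, and syntax details can differ in other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL),
        (2, 'Ben', 1),
        (3, 'Cy', 2),
        (4, 'Dee', 1);
""")

REPORTING_CHAIN = """
WITH RECURSIVE chain(id, name, depth) AS (
    -- Anchor member: start from employees who have no manager.
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    -- Recursive member: add employees whose manager is already in the chain.
    SELECT e.id, e.name, c.depth + 1
    FROM employees AS e
    JOIN chain AS c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth, name;
"""

rows = conn.execute(REPORTING_CHAIN).fetchall()
print(rows)  # [('Ada', 0), ('Ben', 1), ('Dee', 1), ('Cy', 2)]
```

The part that tends to trip people up is that the recursive member refers to the CTE itself, and the query keeps running until a pass adds no new rows.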
Some tips for improving SQL readability and debugging:
Break up SQL statements into CTEs for improved readability.
LLMs can be a useful tool for analyzing and explaining long SQL queries.
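As a rough illustration of the first tip, the sketch below runs the same query twice: once with nested subqueries and once split into named CTEs. It uses Python’s built-in sqlite3 module, and the orders table and its values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('acme', 120.0), ('acme', 80.0), ('globex', 40.0), ('initech', 300.0);
""")

# Nested version: the reader has to work from the innermost subquery outward.
NESTED = """
SELECT customer, total
FROM (SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer)
WHERE total > (SELECT AVG(total)
               FROM (SELECT SUM(amount) AS total FROM orders GROUP BY customer))
ORDER BY customer;
"""

# CTE version: each step has a name and reads top to bottom.
WITH_CTES = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer
),
average_total AS (
    SELECT AVG(total) AS avg_total FROM customer_totals
)
SELECT ct.customer, ct.total
FROM customer_totals AS ct
CROSS JOIN average_total AS a
WHERE ct.total > a.avg_total
ORDER BY ct.customer;
"""

nested_rows = conn.execute(NESTED).fetchall()
cte_rows = conn.execute(WITH_CTES).fetchall()
print(cte_rows)  # [('acme', 200.0), ('initech', 300.0)]
```

Both queries return the customers whose total spend is above the average. The CTE version also helps with debugging, since you can select from customer_totals on its own to check an intermediate step.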
Industry Pulse
Here are some industry topics and trends from this month:
1. Google announces agreement to acquire Wiz
Google announced a $32 billion acquisition of Wiz, a leading cloud security platform that is set to join Google Cloud. The acquisition is an investment by Google in two large and growing trends in the AI era: improved cloud security and the ability to use multiple clouds (multicloud). Wiz’s products are expected to continue to work and be available across all major clouds, including Amazon Web Services, Microsoft Azure, and Oracle Cloud platforms.
2. The Synthetic Data Generation Trend
Another story from Google this month was a partnership with Gretel, a synthetic data generation company. Synthetic data generation is expected to grow as a way to accelerate AI training. Earlier this year, Nvidia and OpenAI also discussed moving towards synthetic data factories, which may be related to reports that AI companies are running out of real-world data to train powerful AI models.
3. Data Center Challenges
Over the past year, discussions in big tech have focused on expanding data center infrastructure to meet the growing demand for AI. However, recent challenges are reshaping the conversation. This month, reports indicate that data centers are under pressure to become more water efficient and to secure additional power capacity, and that Microsoft has allegedly abandoned plans to create new data centers in the US and Europe. Google and Meta quickly scooped up some of the freed capacity. Meanwhile, China is also experiencing weakened demand for data centers, leaving many facilities unused.
🎁 Bonus:
📅 Upcoming Events
4/22-24: Data Council 2025
4/25: PGDay Chicago 2025
Share an event with the community here or view the full calendar
Opportunities to get involved:
Want to get involved in the community? Sign up here.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Want to contribute? Learn how you can get involved.
Stay tuned next month for more updates, and thanks for being a part of the Data Engineering community.