Data Engineering Digest, June 2023
Hello Data Engineers,
Welcome back to another edition of the Data Engineering Digest, a monthly newsletter containing updates, inspiration, and insights from the community.
📢 Community Announcement 📢
We are opening up a professional community for data engineers who want to network, discuss advanced topics, and attend in-person events. It’s free to join, but there is currently a waitlist, so don’t wait to sign up. For more information about this new community and why we are starting it, read our announcement.
Here are a few things that happened this month in the community:
Recommendations for ADF replacement: Azure to GCP
Beyond Python and SQL: Exploring other worthwhile languages for Data Engineers
Biggest challenges in building data pipelines
What is your favorite data catalog?
Does anyone else hate Pandas?
How do data engineers use SQL day-to-day?
Community Discussions
Here are the top posts you may have missed:
1. Orchestration
One data engineer has been asked to migrate their platform from Azure to GCP. While parts of the platform are cloud agnostic, they will have to move off their orchestration tool (Azure Data Factory), so they are looking for replacements.
“Currently our environment is: ADF for orchestration, databricks for transformations, AZDL Gen 2 storage, and synapse for reporting (which is slowly going away as we transition to Databricks)...What tools are you using that can replace ADF?”
This is a prime example of vendor lock-in, and a case where using a third-party tool for your data pipelines would be useful. Check out our community’s list of popular workflow orchestration tools.
💡 Key Insight
Airflow is still the most popular and widely used workflow orchestrator in the industry, but Dagster was another popular recommendation here.
One data engineer pointed out that GCP has a managed version of Airflow (Cloud Composer) so you don’t have to maintain the infrastructure yourself. AWS also has a managed Airflow offering (MWAA).
A few engineers cautioned against relying exclusively on Databricks Workflows because it’s not as flexible for non-Databricks pipelines.
“I would caution people against becoming overly reliant on databricks workflows as they won't really handle all the non-databricks scheduling needs you will eventually have. Which means you'll eventually be supporting 2 orchestrators or have to do a migration.”
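For readers who haven’t used it, an Airflow workflow is just a DAG of tasks defined in plain Python. Here’s a minimal sketch using the TaskFlow API of Airflow 2.x; the pipeline name and the extract/transform/load logic are hypothetical placeholders:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2023, 6, 1), catchup=False)
def nightly_sales_pipeline():
    # Hypothetical three-step pipeline: extract -> transform -> load.

    @task
    def extract() -> list[dict]:
        # Placeholder: pull raw rows from a source system.
        return [{"order_id": 1, "amount": 42.0}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Placeholder: keep only valid rows.
        return [r for r in rows if r["amount"] > 0]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder: write cleaned rows to the warehouse.
        print(f"loaded {len(rows)} rows")

    load(transform(extract()))


nightly_sales_pipeline()
```

Because Cloud Composer and MWAA run stock Airflow, a DAG like this ports across clouds with little change, which is exactly the escape hatch from vendor lock-in discussed above.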
2. In DE, is there a language that is actually worth learning besides Python and SQL?
“Python and SQL are must-know-to-survive languages. Besides that, everything else's value seems dubious at best.
The next three on the list of importance would be Scala, Java, and R. But like I said, the value of learning any of those three is very questionable.
C/C++ could be considered for writing Python extensions, and Go and Rust for their future potential.
What do you think about this whole situation?”
Many data engineers struggle to decide where to spend their learning time because job descriptions often list more than one language as a requirement. Python covers the vast majority of jobs, but as performance requirements increase, data engineers may start to see other languages used.
💡 Key Insight
The general sentiment from the community is that unless you know you’ll be working with a specific language beyond Python & SQL, you probably don’t need to invest time into it.
Here are some examples where it would be useful to learn a second language:
R: Bioinformatics, Pharma, academia, and many areas in the science industry use R.
Scala: Very rarely used, but the jobs that do use it tend to pay well. Useful if you will be working with Spark. Some argue that even if you never use it, you’ll learn concepts that make you a better engineer.
Java: Similar to Scala.
Instead, data engineers recommend investing time into learning an infrastructure-as-code (IaC) tool like Terraform or Pulumi, as well as Bash for working on servers.
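To make the IaC suggestion concrete, here’s what Pulumi looks like in Python, which keeps you in a language you already know. This is a minimal sketch; the bucket name is hypothetical, and it assumes the pulumi and pulumi-gcp packages plus a configured GCP project:

```python
import pulumi
from pulumi_gcp import storage

# Hypothetical example: a GCS bucket as a landing zone for raw pipeline data.
raw_bucket = storage.Bucket(
    "raw-landing-zone",  # logical resource name in the Pulumi stack
    location="US",
    uniform_bucket_level_access=True,
)

# Export the bucket URL so other stacks or pipelines can reference it.
pulumi.export("raw_bucket_url", raw_bucket.url)
```

Running `pulumi up` previews and applies the change; the point is that your storage, clusters, and orchestrator infrastructure become reviewable code instead of console clicks.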
3. Discussion: What are your biggest problems when building data pipelines?
“I was wondering what are the three biggest problems you encounter on a day-to-day basis.
For example, is it extracting unstructured data, merging data streams, meeting throughput or latency requirements, keeping upstream and downstream schemas in sync, managing a large number of components in the pipeline, etc. or something else that gives you headaches? Curious to hear!”
This was a market research question, but it generated some interesting discussion from the community.
💡 Key Insight
TL;DR: the technical problems are usually the easy part; it’s the business and people problems that are harder for engineers to solve.
The most common daily problems data engineers shared were:
Data quality: not being able to control the quality of source data, plus assumptions about the data going stale as people leave the org (see the sketch after this list).
Naming conventions: folks from different parts of the org wanting different names for the same object, or just changing their minds.
Changing business requirements: gotta be agile!
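On the data quality point, many teams blunt the “can’t control the source” problem with lightweight validation at the pipeline boundary, failing loudly before bad data lands downstream. Here’s a minimal sketch in plain Python; the column names and rules are hypothetical, and dedicated tools like Great Expectations cover this at scale:

```python
def validate_orders(rows: list[dict]) -> list[str]:
    """Return human-readable data quality violations for a batch of rows."""
    errors = []
    required = {"order_id", "amount", "created_at"}
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        if row["amount"] is None or row["amount"] < 0:
            errors.append(f"row {i}: invalid amount {row['amount']!r}")
    return errors


batch = [
    {"order_id": 1, "amount": 19.99, "created_at": "2023-06-01"},
    {"order_id": 2, "amount": -5.00, "created_at": "2023-06-02"},
    {"order_id": 3, "created_at": "2023-06-03"},  # missing "amount"
]
for problem in validate_orders(batch):
    print(problem)
```

Checks like these won’t fix upstream quality, but they turn silent data corruption into a visible, assignable incident.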
4. What is your favorite data catalog?
Related: How does your company handle data documentation?
One member asked what kind of data catalog everyone else is using and was searching for reviews.
“What data catalog is your company currently using? What do you like about it and hate about it? What tool would you replace your current data catalog with if you could make the decision?”
A data catalog uses metadata and data management tools to organize all of the data assets within your organization. You can think of it like the catalog computer at your local library that helps you find a specific book, but for your data instead.
Ideally, a data catalog should be able to grant the right access to the right data to the right people at the right time.
💡 Key Insight
Many of the comments were jokes about how generous it is to assume they have a data catalog at all. Data catalogs seem to be one of those tools that everyone agrees are useful but few organizations will invest in.
A few of the recommended data catalogs from engineers who have used them were:
OpenMetadata
Alation
DataHub
Amundsen
And last but not least, Excel stored in Sharepoint 🙃
5. Does anyone else hate Pandas?
Pandas seems almost ubiquitous for Python users, which makes one user question why it’s so popular.
“It’s almost like the developers of Pandas said “Hey. You know how everyone knows SQL? Let’s make a program that uses completely different syntax. I’m sure users will love it”
Pandas can be finicky even for relatively simple operations, and it’s comparatively slow for large-scale data processing. By switching to a library built for larger datasets, users can save valuable computing resources and time, and reduce headaches from debugging tricky Pandas manipulations.
💡 Key Insight
Other users agreed, with the general caveat that Pandas is more popular with Data Analysts and Data Scientists. Many users sang the praises of Polars and PySpark as alternatives.
Polars was recommended for vertical scaling (a more powerful compute instance), while for truly big data, users recommended PySpark and scaling horizontally (multiple instances in a cluster).
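For a quick taste of why Polars comes up so often, here’s the kind of filter-and-aggregate it’s built for; the dataset is made up, and it assumes a recent Polars (older versions spell group_by as groupby):

```python
import polars as pl

# Hypothetical toy dataset of request latencies per team.
df = pl.DataFrame(
    {
        "team": ["a", "a", "b", "b"],
        "latency_ms": [120, 95, 310, 280],
    }
)

# Filter, then aggregate; Polars expressions are query-optimized
# and execute on a multi-threaded Rust engine.
summary = (
    df.filter(pl.col("latency_ms") > 100)
    .group_by("team")
    .agg(pl.col("latency_ms").mean().alias("avg_slow_ms"))
)
print(summary)
```

The expression API takes some unlearning if you’re used to Pandas, but many commenters felt it’s more predictable than Pandas’ mix of indexing rules.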
6. 🤔❓How do you use SQL in daily tasks as a Data Engineer?
Related: How do you guys ace your SQL skills?
A relatively new data engineer asked how crucial SQL is for their upcoming data engineering position. Based on feedback from another thread, they purchased and read through the book T-SQL Fundamentals (highly recommended by the community as well!). However, it wasn’t clear to them how concepts like CTEs, complex joins, and window functions would be useful in data engineering.
In many organizations, data engineers not only create pipelines but also do ad-hoc analytics and build visualizations, while in others they focus solely on infrastructure and pipelines. So, if it really varies by job, what level of SQL is worth learning as a data engineer?
💡 Key Insight
Here’s how our members say they use SQL in their day-to-day:
Stored procedures for code portability.
Window functions for common aggregations such as: ranking, ordering, moving averages, cumulative averages, and partitioned aggregations. Window functions can also improve performance by reducing the number of joins needed for these aggregations. Here is an advanced window functions course recommended by DEs.
CTEs for readability and code reusability. Prefer CTEs over subqueries whenever possible. (Both CTEs and window functions are illustrated in the sketch below.)
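Here’s a minimal, runnable illustration of both ideas using Python’s built-in sqlite3 module (SQLite has supported window functions and CTEs since version 3.25; the table and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('ada', 10), ('ada', 30), ('bob', 20), ('bob', 5);
    """
)

query = """
WITH ranked AS (  -- CTE: give a name to an intermediate result
    SELECT
        customer,
        amount,
        -- window functions: rank and total *within* each customer
        RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk,
        SUM(amount) OVER (PARTITION BY customer) AS customer_total
    FROM orders
)
SELECT customer, amount, customer_total
FROM ranked
WHERE rnk = 1;  -- each customer's largest order
"""
for row in con.execute(query):
    print(row)  # ('ada', 30.0, 40.0), ('bob', 20.0, 25.0)
```

The same pattern (a window over a partition, wrapped in a CTE) shows up constantly in deduplication and “latest record per key” pipeline steps.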
Another member pointed out that even if you don’t use these parts of SQL in your pipelines, you’ll very likely use them when analyzing a dataset yourself to gain a better understanding of it.
🎁 Bonus:
📅 Upcoming Events
7/20-21: DataConnect Conference
Share an event with the community here or view the full calendar
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
We’ll see you next month!