Data Engineering Digest, October 2023
Hello Data Engineers,
Welcome back to another edition of the Data Engineering Digest: a monthly newsletter with updates, inspiration, and insights from the community.
Here are a few things that happened this month in the community:
Is an OLAP database required for data warehousing?
Examples of “serious” companies using code-based visualization tools.
Is Python our fate?
Things DEs do not like about being a data engineer.
To use or not to use Pandas in production, that is the question.
What actually is a Data Mart and why you should build one.
Community Discussions
Here are the top posts you may have missed:
1. Do I really need an OLAP database for a data warehouse?
OLAP databases outperform OLTP databases on analytical workloads, which is why they are commonly used in data warehousing. But are they absolutely necessary if you’re working with small-scale data? What other, potentially cheaper, options are available?
Choosing the appropriate tool for the job is part of every data engineer’s role, so we always want to know what options are available and what their tradeoffs are. For every new database decision, you’ll need to weigh cost, performance, and maintenance.
💡 Key Insight
While we sometimes use the term “OLAP database”, OLAP is really just an approach to answering multidimensional queries quickly, and that approach can be applied to other types of data storage. Some databases are simply designed to work better with an OLAP approach, which is why they are called OLAP databases.
It’s perfectly fine to use OLTP databases for smaller analytical workloads; in fact, they’ve been used this way for decades. Several databases also provide features that make them more suitable for OLAP workloads, such as columnstore indexes or column-oriented backends.
So why use an OLTP database instead of an OLAP one, then?
From a cost perspective, you can generally run an OLTP database on a server for a low fixed cost, whereas most popular data warehousing solutions use consumption-based pricing. Predictable pricing frees you from constantly worrying about optimization, as long as performance meets your latency requirements. Data warehousing solutions aren’t always more expensive, but if they’re configured or used poorly you will pay for it, so they require some expertise to run efficiently.
The downsides of the OLTP route are typically more configuration (DB settings, networking, updates), and if your data grows quickly you will usually have to scale the solution yourself.
Popular OLTP databases used for data warehousing are SQL Server and PostgreSQL.
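To make the insight concrete, here is a minimal sketch of an OLTP-style, row-oriented database serving a small analytical workload. We use SQLite from the Python standard library as a stand-in for a general-purpose database like PostgreSQL, and the table and values are invented for illustration:

```python
import sqlite3

# A row-oriented OLTP-style database (SQLite, standing in for
# PostgreSQL or SQL Server) holding a small fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 200.0)],
)

# A typical OLAP-style aggregation; at small scale this runs fast
# enough that a dedicated OLAP engine buys you little.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 200.0), ('west', 200.0)]
```

The same `GROUP BY` query would work unchanged on a columnar warehouse; the difference only starts to matter as data volume grows.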
2. Examples of “serious” companies using anything other than Power BI/Tableau?
It seems like most companies use no-code visualization tools like Power BI or Tableau, but one member wondered what scale of companies use pro-code visualization tools like Shiny, Dash, or D3.js.
We’re not 100% sure of the intent behind the question, but one guess is that some data engineers who also have to build visualizations would prefer to do it in code rather than in a no-code tool. Regardless, it sparked a great discussion!
💡 Key Insight
One reason no-code visualization tools are so popular is that most of the business people consuming and creating their own analytics are not technical. They choose no-code because it lets them analyze data quickly with a low learning curve. And once a tool is widely adopted, switching is very difficult: all of those reports and dashboards would have to be re-created, to say nothing of retraining folks on a new tool.
Another force at play is enterprise giants like Microsoft designing their products with this mindset and making it easy to export data and integrate into commonly used tools like Excel and PowerPoint. For Microsoft’s Power BI, it looks like this strategy is working.
3. Is Python our fate?
“Python is a nice dynamic language. I have nothing against it. I see people adding type hints, static checkers like MyPy, etc... We're turning Python into Typescript basically. And why not? That's one way to go to achieve better type safety. But ...can we do ourselves a favor and use a proper statically typed language? 😂”
Python is the most widely used language in data engineering but why?
💡 Key Insight
There are several reasons why most data engineering is done in Python and it mainly boils down to the performance/cost tradeoff.
From a business perspective, most businesses need to deliver value fast, which for data engineering translates into quickly writing code that does the job. At the same time, most businesses don’t need the very high performance or real-time processing that would benefit from a statically typed language. There are also hiring and maintenance considerations.
Data engineers like Python because it is faster to write and easier to read and understand than most other languages.
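For context on the “turning Python into TypeScript” point in the quote: type hints let a static checker like mypy catch type errors before the code ever runs, without leaving Python. A small sketch (the function and field names are hypothetical):

```python
from typing import Optional

def parse_count(raw: dict[str, str]) -> Optional[int]:
    """Parse the 'count' field of a raw record; return None if missing."""
    value = raw.get("count")
    if value is None:
        return None
    return int(value)

# A static checker such as mypy would flag this call at check time,
# because parse_count expects a dict[str, str], not a list:
# parse_count(["not", "a", "dict"])  # error: incompatible argument type

print(parse_count({"count": "42"}))  # 42
print(parse_count({}))               # None
```

At runtime the hints change nothing; the safety comes from running the checker in CI, which is exactly the tradeoff the quoted member is weighing against a natively statically typed language.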
DE Tip: If you do need a statically typed language some popular ones for data engineering are Java and Scala.
4. What do you *not* like about being a data engineer?
“So many people hype it up because of the salary, but what's the reality of being a data engineer?”
You see a lot of social media touting data engineering as an easy high salary job but there are challenges just like any other profession.
💡 Key Insight
The most commonly mentioned challenges data engineers faced were:
Pressure to deliver quickly, creating tech debt in the process.
Getting requirements from stakeholders for visualizations is a very iterative and time-consuming process. Oftentimes, stakeholders don’t know exactly what they want.
A focus on solely moving data without getting to see the downstream value it creates (analytics, ML, data science).
Data sources are messy and you usually don’t have any control over them.
Project-based work and no time to continue improving or iterating on older projects.
It’s a role that doesn’t usually get recognition and only gets attention when there’s an issue.
Being on call and having to fix issues off hours and under high-pressure situations.
5. Should a data engineer use Pandas in production code?
Pandas is a popular Python library for analyzing and manipulating data. Data engineers often use Pandas for exploratory data analysis (EDA) and for simple data transformations.
Some data engineers are criticized for using Pandas in data pipelines, which leads them to ask why, or when, they shouldn’t use it.
💡 Key Insight
It’s always about using the right tool for the job. While Pandas is widely used, it isn’t designed for large datasets. If you’re building a data pipeline, you also need to consider scalability; ideally you won’t have to re-engineer the pipeline as the data grows.
On the flip side, a distributed processing engine like Spark might be more scalable but if your data is small then it might actually be slower because there is overhead to run it.
Pandas tip: always pin your Pandas version in production so an upgrade doesn’t silently change behavior.
A rule of thumb is if your data can fit into memory then use a tool like Pandas, otherwise use something like Spark.
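One middle ground worth noting: even within Pandas you can bound memory use by streaming a file in chunks instead of loading it all at once. A minimal sketch (the CSV content is invented for illustration):

```python
import io
import pandas as pd

# Pretend this CSV is too large to load comfortably in one read.
csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n")

# Instead of pd.read_csv(...) on the whole file, iterate over chunks
# and aggregate incrementally, keeping memory bounded by chunk size.
total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["amount"].sum()

print(total)  # 100.0
```

This only helps for aggregations and row-wise transforms that don’t need the whole dataset in memory; for genuinely large-scale joins and shuffles, an engine like Spark is still the better fit.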
6. What the f#%! *actually is* a Data Mart and why should I build one?
“I don't mean the definition of "Data Mart " that we all know. Subset of data... aimed at specific business teams... pre-aggregated... etc...
What I don't understand, and can't seem to find actual examples online that draw from realistic datasets, is what it is supposed to look like in practice.”
💡 Key Insight
Here we typically share an insight, but one member put it so succinctly that we decided to share their answer (lightly edited and formatted for readability). All credit to u/PangeanPrawn, and thank you for being part of the community.
“There are traditionally 3(ish) layers to a data pipeline:
1. Raw (data lake)
a. Intermediate stage
2. Warehouse (sometimes called an 'edw' for enterprise data warehouse)
3. Data mart
Raw: The raw data is just that. Tables that are essentially copies of files from outside sources. Sometimes the data lake is actually in blob storage and then replicated into tables in a database.
EDW: The EDW is the clean version of the data. It is still generally separated by source, though there might be some models here that join the source data, depending on how your business is set up. This is usually a star schema.
Data mart: These are transformations of the models in your warehouse that are specific to a downstream system connection. Could be a 1 to 1 with a report. The idea is you want to be doing as little transformation in reporting tools as possible, because SQL is just better at it. Business users should be able to query a specific model in the data mart as the backend for a report, ML-model etc.”
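The three layers the member describes can be sketched end to end with a toy dataset. We use an in-memory SQLite database purely for illustration; the table names and values are invented, and a real setup would use an actual warehouse with star-schema models:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Raw layer: an as-is copy of a source extract, everything as text,
#    whitespace and all (values invented).
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "10.5", " east"), ("2", "20.0", "west"), ("3", "5.5", "east ")],
)

# 2. Warehouse (EDW) layer: cleaned and typed, still close to the source.
conn.execute("""
    CREATE TABLE edw_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           TRIM(region)              AS region
    FROM raw_orders
""")

# 3. Data mart layer: an aggregation shaped 1:1 for a downstream report,
#    so the reporting tool does no transformation of its own.
conn.execute("""
    CREATE TABLE mart_sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM edw_orders
    GROUP BY region
""")

rows = conn.execute(
    "SELECT region, total_amount FROM mart_sales_by_region ORDER BY region"
).fetchall()
print(rows)  # [('east', 16.0), ('west', 20.0)]
```

A business user would point their report straight at `mart_sales_by_region` and never touch the raw or EDW layers.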
🎁 Bonus:
📅 Upcoming Events
Share an event with the community here or view the full calendar
Opportunities to get involved:
Want to get involved? Sign up here.
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Stay tuned next month for more updates, and thanks for being a part of the Data Engineering community.