Hello Data Engineers,
Welcome back to another edition of the data engineering digest - a monthly newsletter containing updates, inspiration, and insights from the community.
Here are a few things that happened this month in the community:
Banking + Open Source: crazy or doable?
PDF processing challenges and solutions
Data Engineers weigh in on the Musk/Social Security debacle
Insights from Airflow’s Annual Survey
Impressive Netflix Impressions ETL
How Cloudflare makes sense of 700 Million events/second
Community Discussions
Here are the top posts you may have missed:
1. Banking + Open Source ETL: Am I Crazy or Is This Doable?
One community member recently started a new job at a bank and is dealing with data pipeline performance issues. Their current stack is SSIS and SSAS, part of the Microsoft SQL Server stack that became popular a few decades ago (and is still widely used in the enterprise). They are considering moving to open-source tools to improve performance and avoid vendor lock-in, but aren't sure whether that's realistic given the sensitive nature of the data.
Banking is a highly regulated industry and this discussion highlights some of the unique challenges you may face if you work in this industry.
💡 Key Insight
The community’s first reaction? Slow down. While fresh perspectives are valuable, proposing a data pipeline migration without deeply understanding the existing system is risky. A slow pipeline doesn’t always mean the entire stack needs replacing - bottlenecks might stem from inefficient queries, bad indexing, or poor scheduling.
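Before pitching a replacement, it's worth quantifying where the time actually goes. Here's a minimal sketch, assuming Python with pyodbc and read access to SQL Server's query-stats DMVs (the server, database, and credentials are placeholders):

```python
# Pull the most expensive cached queries from SQL Server's DMVs to see
# whether the slowness comes from a handful of bad queries rather than
# the platform itself. Connection details below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-sql-server;DATABASE=your_dw;"
    "Trusted_Connection=yes;"
)

query = """
SELECT TOP 10
    qs.execution_count,
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    SUBSTRING(st.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;
"""

for row in conn.cursor().execute(query):
    print(row.avg_elapsed_us, row.avg_logical_reads, row.query_text)
```

If a handful of queries dominate the elapsed time, index and query fixes may solve the problem with far less risk than a platform migration.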
Even if an open-source alternative could improve performance, migrations in banking aren’t just about speed - they’re about compliance, risk, and cost. A major infrastructure change like this requires a clear, quantifiable benefit to outweigh the disruption. Beyond security and cost, vendor support is often a hard requirement in industries like this: even if open-source technology clears the other hurdles, you may still need a vendor to manage it and provide support.
2. Fastest way to process 1 TB worth of pdf data
I have an S3 bucket worth 1 TB of PDF data. I need to extract text from them and do some pre-processing. What is the fastest way to do this?
PDFs are a common source of unstructured data, and they have a reputation for being difficult to parse because they can mix text, images, and embedded data. That said, there are a variety of tools for extracting information from PDFs in 2025, and parsing unstructured data sources is a trend that is expected to keep growing.
💡 Key Insight
PDFs come in three main formats: text-based, image-based, and hybrid (containing both text and images).
Text-based PDFs are the easiest to parse, as the text can be selected and extracted directly.
Image-based PDFs present a greater challenge, as they require Optical Character Recognition (OCR) to convert the image of the text into machine-readable text. This process can be error-prone, especially with poor image quality, unusual fonts, or complex layouts.
Hybrid PDFs contain both text and images, requiring a combination of direct text extraction and OCR (a minimal extraction sketch follows below).
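As a rough starting point, here's a sketch of that decision in Python, assuming the open-source pypdf, pdf2image, and pytesseract libraries (which need local Poppler and Tesseract installs); the file name is a placeholder:

```python
# Extract text from a PDF: use the embedded text layer when present,
# otherwise rasterize the page and run OCR on it.
from pypdf import PdfReader
from pdf2image import convert_from_path  # needs Poppler installed
import pytesseract                       # needs Tesseract installed


def extract_pages(path: str) -> list[str]:
    pages = []
    for i, page in enumerate(PdfReader(path).pages):
        text = (page.extract_text() or "").strip()
        if not text:
            # Image-based (or scanned) page: render it and OCR the image.
            image = convert_from_path(path, first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image)
        pages.append(text)
    return pages


print("\n".join(extract_pages("statement.pdf")))
```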
Most of the proprietary solutions for parsing PDFs are built for scale out of the box, but if you’re using an open-source library, you can pair it with a distributed processing engine like Apache Spark for large workloads.
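If you go the open-source route at the 1 TB scale from the question, one way to pair the two is to read the PDFs as binary blobs and run the parser inside a Spark UDF. A minimal sketch, assuming PySpark with S3 access configured and pypdf installed on the executors (bucket names and paths are placeholders):

```python
# Distribute PDF text extraction across a Spark cluster by reading the
# files as binary blobs and running the parser inside a UDF.
import io

from pypdf import PdfReader
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pdf-extract").getOrCreate()


@udf(returnType=StringType())
def pdf_to_text(content: bytes) -> str:
    # Text-layer extraction only; an OCR fallback could be added here.
    reader = PdfReader(io.BytesIO(content))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


pdfs = spark.read.format("binaryFile").load("s3a://your-bucket/pdfs/*.pdf")
result = pdfs.select("path", pdf_to_text("content").alias("text"))
result.write.mode("overwrite").parquet("s3a://your-bucket/extracted_text/")
```

Whether this beats a managed document-parsing service depends on your cluster, your file sizes, and how much OCR you need; profiling a sample of files first is usually worth it.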
Open source libraries:
Proprietary solutions:
Many LLMs can extract information from documents as well.
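For example, here's a minimal sketch using the OpenAI Python SDK; any chat-style LLM API would look similar. The model name, prompt, and fields are illustrative placeholders, and you'd feed it text extracted by one of the libraries above:

```python
# Send extracted PDF text to an LLM and ask for structured fields.
# The model name, prompt, and fields are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

page_text = "..."  # text produced by pypdf/pytesseract, as above

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract fields from the document as JSON."},
        {"role": "user", "content": f"Return the invoice number, date, and total:\n{page_text}"},
    ],
)
print(response.choices[0].message.content)
```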
3. Is the social security debacle as simple as the doge kids not understanding what COBOL is?
As a skeptic of everything, regardless of political affiliation, I want to know more. I have no experience in this field and figured I’d go to the source.
This past month, Elon Musk shared findings from an audit of US Social Security data, saying that a “cursory examination of Social Security” data conducted by the Department of Government Efficiency identified “people in there that are 150 years old.” This data point was then misinterpreted by the President, who misleadingly claimed “millions and millions of people over 100 years old” may be receiving improper benefits from Social Security. The claim naturally gained lots of attention, and one popular explanation for the anomaly was a quirk in how COBOL (a decades-old programming language likely used in the SSA’s systems) stores dates. Is the answer that simple?
The data engineering community weighs in.
💡 Key Insight
Most data engineers in the community believe that the issue is less about a flaw in COBOL and more about a data literacy challenge. In legacy systems - such as the Social Security system, which has been in place since 1935 - the data is inherently complex. New hires often face a steep learning curve to understand all the nuances and processes before they can draw accurate conclusions. And it’s not a challenge specific to the government either - onboarding at any company takes weeks if not months.
An investigation by factcheck.org revealed that the likely culprit was historical data maintenance: the SSA historically had inadequate controls for updating the database with new deaths, and the benefits of fixing the historical data were not worth the costs. This is a classic problem any data person can relate to: data sources relying on human input are messy. The investigation highlighted that this is not a new problem and that the SSA put measures in place to address it in 2015. Additionally, the US federal government already has several other countermeasures to prevent fraud.
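To make that concrete, here's a minimal sketch of the kind of sanity check that flags an implausible record as a data quality issue rather than a conclusion about the world (pandas, with hypothetical column names, sample rows, and an illustrative age threshold):

```python
# Flag records whose derived age is implausible so they can be reviewed
# as data quality issues rather than treated as evidence of fraud.
# Column names, sample data, and the 115-year threshold are illustrative.
import pandas as pd

records = pd.DataFrame(
    {
        "person_id": [1, 2, 3],
        "birth_date": ["1890-01-01", "1950-01-15", "1980-07-04"],
        "death_date": [None, None, "2020-03-01"],
    }
)

records["birth_date"] = pd.to_datetime(records["birth_date"])
records["death_date"] = pd.to_datetime(records["death_date"])

as_of = pd.Timestamp("2025-03-01")
end = records["death_date"].fillna(as_of)
records["age_years"] = (end - records["birth_date"]).dt.days / 365.25

# "Alive on paper" records older than 115 are almost certainly stale data
# (e.g. a death that was never recorded), not living beneficiaries.
suspect = records[records["death_date"].isna() & (records["age_years"] > 115)]
print(suspect)
```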
The audit findings aligned with what many experienced engineers expected, and the incident underscores the importance of rigorous data literacy. It serves as a reminder that before drawing conclusions, it’s crucial to understand the context, history, and limitations of the data at hand.
Read the full investigation here
Industry Pulse
Here are some industry topics and trends from this month:
1. Airflow Survey 2024
The Airflow project just released the results of its annual survey, with over 5,000 responses from around the world. It contains interesting findings about the types of workloads running on Airflow, the kinds of companies running it, and some of its biggest challenges.
2. Introducing Impressions at Netflix
Every image you hover over isn’t just a visual placeholder; it’s a critical data point that fuels Netflix’s sophisticated personalization engine. These images, or “impressions,” play a pivotal role in transforming your interaction from simple browsing into an immersive binge-watching experience, all tailored to your unique tastes. Netflix’s new impressions-based personalization system is a reminder that subtle UX interactions can have massive data implications. In this multi-part blog series, Netflix engineering takes you behind the scenes of their system that processes billions of impressions daily.
3. Over 700 million events/second: How we make sense of too much data
As of December 2024, Cloudflare’s data pipeline is ingesting up to 706M events per second generated by various services, and that represents 100x growth since their 2018 data pipeline blog post… All of these data streams power things like Logs, Analytics, and billing, as well as other products, such as training machine learning models for bot detection. This blog post is focused on techniques Cloudflare uses to efficiently and accurately deal with the high volume of data they ingest for their Analytics products.
🎁 Bonus:
📅 Upcoming Events
3/1-3/7: Open Data Day 2025
Share an event with the community here or view the full calendar
Opportunities to get involved:
Want to get involved in the community? Sign up here.
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Want to contribute? Learn how you can get involved.
Stay tuned next month for more updates, and thanks for being a part of the Data Engineering community.