Hello Data Engineers,
Welcome back to another edition of the data engineering digest - a monthly newsletter containing updates, inspiration, and insights from the community.
Here are a few things that happened this month in the community:
Data Engineers’ Dream Architectures
Are no-code tools dying out?
Why can’t databases store images, audio, and video?
Sneak Peek: How Zapier runs Zaps at scale
Optimizing LinkedIn’s search pipeline with Spark
Scaling analytics and reporting at Cloudflare
Community Discussions
Here are the top posts you may have missed:
1. What would be your dream architecture?
We rarely get to choose our tech stack from scratch - constraints like switching costs, legacy systems, and internal politics usually make that impossible. But if you could pick any tools for a data architecture, from ingestion to visualization, what would you choose?
💡 Key Insight
Here’s what the community would choose if starting from a clean slate (a quick sketch of how these pieces fit together follows the list):
Orchestration: Airflow, Dagster
ETL Logic: Python, dbt
Storage: Snowflake, Postgres
Containerization: Kubernetes, Docker, Podman
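To make the list concrete, here is a minimal sketch, assuming Airflow 2.x and an existing dbt project, of how a few of these favorites commonly fit together: Airflow orchestrating a Python extraction step followed by a dbt transformation. The DAG name, project path, and extraction logic are hypothetical placeholders, not anything proposed in the thread.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_to_warehouse():
    # Placeholder for Python ingestion logic, e.g. pulling from an API
    # and loading raw rows into Snowflake or Postgres staging tables.
    ...


with DAG(
    dag_id="daily_elt",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    extract = PythonOperator(
        task_id="extract_raw_data",
        python_callable=extract_to_warehouse,
    )
    transform = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --project-dir /opt/dbt",  # hypothetical path
    )
    # dbt models run only after the raw data has landed.
    extract >> transform
```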
2. Did no-code/low-code tools lose favor, or were they never in style?
I feel like I never hear about Talend or Informatica now. Or Alteryx. Who’s the biggest player in this market anyway? I thought the concept was cool when I heard about it years ago. What happened?
No-code and low-code ETL tools have been around for decades and were popularized in the early 2000s. They promised increased accessibility and speed by enabling less technical users to move and transform data without waiting on engineering.
💡 Key Insight
Since the early 2000s, data engineers have shifted toward modular, open-source, code-first platforms for greater flexibility and control. Still, many businesses rely on low-code tools because they shorten time to value: data isn’t valuable until it’s ready to be analyzed, which creates pressure to decentralize simple workflows. There are also many companies that can’t afford, or don’t need, a dedicated data engineering team.
Some of the no-code/low-code tools data engineers use today are:
Azure Data Factory (ADF)
Alteryx
Matillion
Informatica
KNIME
3. Why don’t databases store images, audio, and video?
…From an AI perspective, embeddings are the foundation for running any AI workload. Currently, the process is to generate these embeddings outside of the database and insert them.
With AI adoption growing, isn’t it beneficial to have databases generate embeddings on the fly for this kind of data?
💡 Key Insight
Traditional databases aren’t optimized for storing large binary (blob) data like images, audio, and video. Blob storage services like Amazon S3, Azure Blob Storage, and GCP Cloud Storage were built to store this kind of unstructured data at scale and make it easily accessible. A common pattern is to keep the unstructured data in blob storage and store a reference to it in the database, getting the best of both worlds.
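As a minimal sketch of that pattern, assuming boto3 and psycopg2, with hypothetical bucket, key, and table names: the binary payload goes to blob storage, and the database keeps only a reference plus queryable metadata.

```python
import boto3
import psycopg2

s3 = boto3.client("s3")
bucket, key = "my-media-bucket", "videos/demo.mp4"  # hypothetical names

# 1. Upload the large binary object to blob storage.
s3.upload_file("demo.mp4", bucket, key)

# 2. Store only a reference and metadata in the relational database.
conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO media_assets (s3_uri, content_type) VALUES (%s, %s)",
        (f"s3://{bucket}/{key}", "video/mp4"),
    )
```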
Since the author specifically mentioned embeddings for AI use cases, it’s worth noting that vector databases and extensions let you store embeddings in a database. You still need to generate the embeddings externally - typically via an embedding API - because generation is computationally intensive, requires additional dependencies and hardware, and is a very different workload from serving and searching the embeddings.
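Here is a minimal sketch of that division of labor, assuming Postgres with the pgvector extension; the embed() stub stands in for whatever external embedding API you call, and the table name and dimension are hypothetical.

```python
import psycopg2


def embed(text: str) -> list[float]:
    # Stand-in for an external embedding API call; generation happens
    # outside the database. Returns a dummy vector here.
    return [0.0] * 768


conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id        bigserial PRIMARY KEY,
               body      text,
               embedding vector(768)  -- must match the model's dimension
           )"""
    )
    text = "hello data engineers"
    vec = embed(text)  # generated externally, then inserted
    cur.execute(
        "INSERT INTO documents (body, embedding) VALUES (%s, %s)",
        (text, "[" + ",".join(str(x) for x in vec) + "]"),  # pgvector text format
    )
```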
Industry Pulse
Here are some industry topics and trends from this month:
1. How Zapier runs isolated tasks on AWS Lambda and upgrades functions at scale
Zapier is a leading no-code automation provider whose customers use its platform to automate workflows and move data across more than 8,000 applications, such as Slack, Salesforce, Asana, and Dropbox. In this post, you’ll learn how Zapier built its serverless architecture, focusing on three key aspects: using Lambda functions to run isolated Zaps, operating over a hundred thousand Lambda functions through Zapier’s control-plane infrastructure, and strengthening its security posture while reducing maintenance effort by adding automated function upgrade and cleanup workflows to the platform.
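As a generic illustration only (not Zapier’s actual code), the isolation idea boils down to each task living in its own Lambda function, with a control plane invoking it by name. The function name and payload below are hypothetical.

```python
import json

import boto3

lambda_client = boto3.client("lambda")


def run_isolated_task(function_name: str, payload: dict) -> dict:
    """Invoke one per-task Lambda synchronously and return its output."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload).encode(),
    )
    return json.loads(response["Payload"].read())


# e.g. run_isolated_task("zap-12345-step-1", {"record_id": 42})
```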
2. Optimizing LinkedIn Sales Navigator’s search pipeline with Spark
At LinkedIn, the engineering team migrated separate search use cases to a centralized, hosted search service known as search-as-a-service. One of the largest use cases is the search system for LinkedIn Sales Navigator, which leverages AI to identify prospects, enhance engagement, and elevate conversations. During this migration, the team also moved the data manipulation (DM) pipeline that builds the search index from MapReduce to Spark and significantly improved the Spark jobs’ efficiency. In this blog post, the LinkedIn engineering team shares insights gained from optimizing this intricate Spark-based pipeline, which comprises over 100 DM jobs, and how they cut the total execution time from 6-7 hours to around 3 hours.
3. How TimescaleDB helped us scale analytics and reporting
At Cloudflare, PostgreSQL and ClickHouse are the standard databases for transactional and analytical workloads… ClickHouse is the more recent addition to the stack; the Cloudflare engineering team started using it around 2017, and it has enabled them to ingest tens of millions of rows per second while supporting millisecond-level query performance. ClickHouse is a remarkable technology, but like all systems it involves trade-offs. In this post, the Cloudflare engineering team explains why they chose TimescaleDB, a Postgres extension, over ClickHouse to build the analytics and reporting capabilities in the Zero Trust product suite.
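For a feel of what that choice looks like in practice, here is a minimal sketch (illustrative only, not Cloudflare’s schema): because TimescaleDB is a Postgres extension, an analytics table is ordinary SQL plus a hypertable for time partitioning. The table and columns are hypothetical.

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb")
    cur.execute(
        """CREATE TABLE IF NOT EXISTS request_metrics (
               time       timestamptz NOT NULL,
               account_id bigint,
               requests   bigint
           )"""
    )
    # Partition by time so reporting queries scan only relevant chunks.
    cur.execute(
        "SELECT create_hypertable('request_metrics', 'time', if_not_exists => TRUE)"
    )
```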
🎁 Bonus:
🍁 What I Learned From Processing All of Statistics Canada's Tables
🤔 Are data modeling and understanding the business all that is left for data engineers in 5-10 years?
📅 Upcoming Events
Share an event with the community here or view the full calendar
Opportunities to get involved:
Want to get involved in the community? Sign up here.
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Want to contribute? Learn how you can get involved.
Stay tuned next month for more updates, and thanks for being a part of the Data Engineering community.