What is Data Ingestion?
What is Data Ingestion?
Data Ingestion is the critical, foundational first phase of the massive data engineering lifecycle. It is the specific architectural process of extracting massive volumes of raw, chaotic data from highly disparate external systems (such as operational databases, third-party SaaS APIs, and live web server logs) and securely transporting that data into a centralized storage environment, typically the raw Bronze layer of an Open Data Lakehouse or Cloud Data Warehouse.
Without highly robust, bulletproof data ingestion architecture, the entire downstream data ecosystem collapses. An incredibly complex machine learning model or a beautiful executive dashboard is mathematically useless if the ingestion pipeline drops 5% of the transactional data or suffers a massive latency delay. The primary challenge of Data Ingestion is managing the intense physical friction of interacting with external, fragile source systems across the public internet.
The Two Paradigms of Ingestion
Data engineering teams architect ingestion pipelines utilizing two distinct strategies, dictated entirely by the latency requirements of the business.
1. Batch Ingestion
Batch Ingestion is the historical standard and remains the absolute backbone of enterprise data. In a batch architecture, the data pipeline wakes up at a specific, scheduled interval (e.g., every night at midnight, or every hour). It opens a massive connection to the source system, extracts millions of historical records in one massive sweep, and writes them to the Lakehouse.
Batch ingestion is highly efficient, mathematically easy to validate, and consumes very little sustained compute. However, it guarantees data staleness. If the CEO looks at the dashboard at 4:00 PM, they are viewing data that is 16 hours old.
2. Streaming (Real-Time) Ingestion
Streaming Ingestion completely eliminates latency. Instead of running on a schedule, the pipeline establishes a continuous, persistent, always-open connection to the source system (often utilizing tools like Apache Kafka or AWS Kinesis). The exact millisecond a customer clicks a button on the website, that individual event is captured, pushed through the streaming pipeline, and landed in the Data Lakehouse instantaneously.
While streaming provides absolute real-time visibility, it is exponentially more complex to engineer, requiring highly advanced distributed state management and massive continuous compute resources to ensure data is not duplicated during network outages.
The Challenges of the Extraction Layer
The physical act of pulling data out of a source system is highly dangerous.
- API Rate Limiting: If an engineer writes a script to extract customer data from the Salesforce API, Salesforce aggressively monitors that connection. If the script requests too much data too quickly, Salesforce triggers a strict Rate Limit, violently crashing the ingestion pipeline. Engineers must build complex “Backoff and Retry” logic into the ingestion script to handle this throttling elegantly.
- Database Contention: If an engineer runs a massive
SELECT *query against the live PostgreSQL database powering the main website to extract the daily sales, that query consumes massive CPU resources. It can cause the live website to freeze, directly impacting actual customers. To solve this, advanced ingestion utilizes Change Data Capture (CDC), reading the invisible, low-level transaction logs of the database rather than querying the tables directly, ensuring zero impact on live operations.
Summary of Technical Value
Data Ingestion is the heavy-lifting logistical layer of the modern data stack. By securely navigating API rate limits, managing massive network throughput, and utilizing a hybrid architecture of deep historical batch loads and lightning-fast real-time streams, robust data ingestion pipelines provide the absolute critical guarantee that the central Data Lakehouse is constantly fueled with the complete, accurate reality of the business.
Learn More
To learn more about the Data Lakehouse, read the book “Lakehouse for Everyone” by Alex Merced. You can find this and other books by Alex Merced at books.alexmerced.com.