Data Architectures Decoded | Part 2: Data Processing

Batch vs. Stream: Optimizing your data processing strategy.

Welcome back to Data Architectures Decoded. In Part 1, we explored the world of data ingestion: how data gets into our systems, whether trickling in like cars on a highway (streaming) or arriving in bulk like containers being loaded onto ships at a shipyard (batch). We talked about why it matters and highlighted some standout tools to get the job done. Now that we’ve got the data through the door, what’s next? That’s where Part 2 comes in: Data Processing. This is where raw data starts turning into something truly valuable.

Imagine data as raw ingredients in a kitchen. Ingestion is gathering those ingredients (flour, eggs, spices, etc.), but they’re not a meal yet. Data processing is the cooking: chopping, mixing, and seasoning until you’ve got a dish ready to serve.

Whether you’re whipping up a feast in one go or plating dishes fresh off a sushi conveyor belt, processing is what transforms the raw into the remarkable. Let’s dive into what that means, why it’s crucial, and how it works in the wild.

What is Data Processing?

In the realm of data architectures, data processing is the art of manipulating and transforming data to make it useful. It’s the step where we clean up messy inputs, enrich them with context, aggregate them into meaningful summaries, or reshape them for analysis.

Why does this matter? For data engineers, managers, and executives, processed data is the fuel for smarter decisions, faster insights, and competitive edges. Without it, you’re stuck with a pile of raw numbers and logs that don’t tell you much. Processing turns that pile into dashboards, recommendations, or fraud alerts, the outputs that actually move the needle for your business.

Types of Data Processing: Batch vs. Stream

Data processing comes in two flavors: batch processing and stream processing. Each has its own strengths, ideal use cases, and best tools for the job.

Batch Processing: Cooking in Bulk

Batch processing is like prepping a week’s worth of meals in one go. You gather a large amount of data, process it all at once, and store the results for later use. It’s a method designed for efficiency with big datasets, typically run on a schedule (like daily or weekly), rather than delivering instant results.

Here’s how it breaks down:

  • Extract: Collect data from various sources, such as databases, APIs, or flat files. For example, you might pull sales records from an e-commerce platform or logs from a web server. (See Part 1 for more on data ingestion)
  • Transform: Process the data by cleaning it (e.g., removing duplicates), performing calculations (e.g., aggregating monthly revenue), or combining datasets (e.g., joining customer profiles with purchase histories to analyze buying trends).
  • Load: Store the processed data in a data warehouse, like Snowflake or Google BigQuery, where it’s ready for reporting or analysis when needed.

This sequence follows the ETL framework (Extract, Transform, Load), a standard approach for preparing data in bulk.
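
To make the three steps concrete, here is a minimal sketch of a batch ETL job in Python using pandas and SQLAlchemy. The file paths, column names, and warehouse connection string are placeholders invented for the example, not references to a real system.

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull a day's sales export and customer profiles (hypothetical file paths)
sales = pd.read_csv("exports/sales_2024_01_01.csv")
customers = pd.read_csv("exports/customers.csv")

# Transform: remove duplicate orders, join customer profiles, aggregate revenue per customer
sales = sales.drop_duplicates(subset="order_id")
enriched = sales.merge(customers, on="customer_id", how="left")
daily_revenue = enriched.groupby("customer_id", as_index=False)["amount"].sum()

# Load: write the result to a warehouse table (connection string is a placeholder)
engine = create_engine("snowflake://user:password@account/db/schema")
daily_revenue.to_sql("daily_revenue", engine, if_exists="replace", index=False)
```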

Figure 1. Batch processing data like meal prepping for a week.

Use Case: Data Warehousing

Batch processing powers data warehousing, where historical data is collected and stored for long-term analysis. For instance, a retailer might process six months of sales data to uncover patterns, such as which products peak during holiday seasons. It’s best suited for scenarios where you need comprehensive insights over time, not real-time updates.

Tools to Know
  • Apache Spark: A high-performance tool that processes data in memory, making it significantly faster than traditional disk-based systems. It excels at handling large-scale tasks, like analyzing terabytes of server logs to detect performance issues or running complex SQL queries on historical customer data.
    Use Spark when you need quick results from big datasets but don’t require immediate, real-time processing (a minimal sketch follows this list).
  • Hadoop MapReduce: A distributed processing framework that breaks down massive datasets into smaller chunks and processes them across multiple machines. It’s slower than Spark due to its reliance on disk operations, but it’s highly reliable for huge, stable workloads, like processing years of archived financial transactions overnight. It’s a go-to for scenarios where throughput matters more than speed.
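
As a rough illustration of the Spark bullet above, the sketch below uses PySpark to batch-process a day of server logs and count errors per service. The log path, schema, and output location are assumptions made for the example.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_log_batch").getOrCreate()

# Extract: read one day of JSON server logs (path and fields are assumptions)
logs = spark.read.json("s3://example-logs/2024-01-01/")

# Transform: keep only errors and count them per service to surface problem areas
error_counts = (
    logs.filter(F.col("level") == "ERROR")
        .groupBy("service")
        .count()
        .orderBy(F.desc("count"))
)

# Load: write the aggregated results for downstream reporting
error_counts.write.mode("overwrite").parquet("s3://example-processed/error_counts/2024-01-01/")
```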

When to Use It

Batch processing shines in these situations:

  • Non-Urgent Tasks: When results can wait, such as generating quarterly financial summaries or preparing annual compliance reports.
  • Large-Scale Data: It’s ideal for crunching massive volumes, like training machine learning models on years of user behavior data.
  • Scheduled Workflows: Perfect for consistent, predictable processing, like updating a data warehouse with daily sales figures.

In essence, batch processing delivers a reliable stash of prepared data insights, ready whenever you need them; just don’t expect it to handle on-the-fly requests.

Stream Processing: Real-Time Data Handling

Stream processing is like a sushi bar’s conveyor belt: data arrives continuously, and you process it immediately to serve fresh insights. Unlike batch processing, which handles data in scheduled bulk operations, stream processing tackles it as it flows, enabling real-time or near-real-time analysis.
It’s built for speed and responsiveness, perfect for applications where timing is everything. And here’s how it works:

  • Event-Driven Architecture: Data comes in as a stream of events (e.g., each click, transaction, or sensor reading acts as an individual trigger).
  • Continuous or Windowed Processing: The system processes these events either one at a time or in small time windows (e.g., analyzing the last 5 seconds of activity).
  • Instant Reactions: The result is an immediate response, such as updating a live dashboard or sending an alert the moment something unusual occurs.
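
Here is a deliberately simplified sketch of that event-driven, windowed pattern using the kafka-python client: events are consumed as they arrive and aggregated in five-second windows. The topic name, broker address, and event fields are hypothetical.

```python
import json
import time
from collections import defaultdict
from kafka import KafkaConsumer

# Consume click events as they arrive (topic and broker are placeholders)
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

window_start = time.time()
counts = defaultdict(int)

for event in consumer:
    counts[event.value["page"]] += 1          # react to each event as it flows in
    if time.time() - window_start >= 5:       # close the 5-second window
        print(dict(counts))                   # e.g., push to a live dashboard or alerting hook
        counts.clear()
        window_start = time.time()
```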

Figure 2. Stream processing in action: a sushi conveyor belt delivering fresh data as it arrives, ready for immediate use.

Use Case: Real-Time Analytics

Stream processing drives real-time analytics, delivering insights when delays aren’t an option. Examples include:

  • Fraud Detection: Financial systems can identify and flag suspicious transactions instantly, stopping fraud before it’s finalized.
  • Traffic Management: Apps like Waze process live data to reroute drivers around traffic jams as they form, improving efficiency on the fly.

These scenarios showcase stream processing’s ability to act fast and keep insights current.

Tools to Know
  • Apache Flink: A framework widely used for complex event processing, Flink handles tasks like tracking a user’s journey across multiple interactions (e.g., clicks, searches, purchases) with millisecond accuracy. It’s suited for low-latency needs, such as monitoring live social media trends or processing real-time bids in online auctions (a minimal sketch follows this list).
  • Apache Kafka Streams: Integrated with Kafka’s data pipeline, Kafka Streams offers a lightweight framework for real-time data flows. It’s ideal for applications like updating recommendation engines as users browse or processing continuous telemetry from IoT devices.
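
To give a feel for the Flink bullet above, here is a minimal PyFlink DataStream sketch that keys a stream of click events by user and keeps a running count. A real deployment would read from a streaming connector such as Kafka; the in-memory collection below is just a stand-in.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Stand-in source: in production this would be a Kafka or similar streaming connector
clicks = env.from_collection([("user_1", 1), ("user_2", 1), ("user_1", 1)])

# Key by user and keep a running click count per user (stateful processing)
running_counts = (
    clicks.key_by(lambda event: event[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1]))
)

running_counts.print()
env.execute("running_click_counts")
```
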
When to Use It

Stream processing shines when:

  • Timing is Critical: Immediate responses are needed, such as flagging fraud, delivering live traffic updates, or adjusting inventory in real time.
  • Data Flows Continuously: It’s perfect for ongoing streams like website activity, factory sensor readings, or stock market data.
  • Insights Need to Adapt: Use it for dynamic analysis, like personalizing Netflix recommendations mid-session or optimizing ad bids as user behavior shifts.

Compared to batch processing’s bulk approach, stream processing acts as data arrives, making it essential for time-sensitive and high-velocity environments.

Note on ETL Flexibility

ETL (Extract, Transform, Load) is often tied to batch processing in traditional setups, but it’s not limited to that approach. Each stage (extraction, transformation, and loading) can operate as either batch or stream-based, offering flexibility to match your needs.
In Part 1, we saw how extraction (or ingestion) can happen in batches, like pulling a day’s worth of data, or as a stream, like capturing live events.

The same applies here: transformation and loading can adapt to either mode, letting you mix and match for the right outcome:

  • Batch Extraction + Batch Transformation: Traditional ETL for scheduled tasks like monthly reports.
  • Stream Extraction + Stream Transformation: Real-time ETL for instant insights, like fraud detection.
  • Batch Extraction + Stream Transformation: Extract a batch of historical data and process it in a streaming pipeline for immediate analysis (e.g., analyzing past sales for a live dashboard).
  • Stream Extraction + Batch Transformation: Collect real-time data and process it in batches for aggregated insights (e.g., gathering website clicks every hour for a summary report).

For example, a financial institution might use stream extraction and transformation for fraud alerts while using batch extraction and transformation for daily transaction summaries. This adaptability ensures your pipelines align with business goals.

Batch vs. Stream Processing: A Side-by-Side Comparison

Choosing between batch and stream processing can be confusing, as it involves trade-offs in speed, complexity, and scalability. The comparison below summarizes the key differences, making it easier to decide which approach fits your needs.

  • Data handling: Batch works on large, bounded datasets collected over time; stream works on unbounded data continuously as it arrives.
  • Latency: Batch runs on a schedule (hourly, daily, weekly) and delivers results after each run; stream delivers results in real time or near real time.
  • Typical use cases: Batch suits data warehousing, historical reporting, and model training; stream suits fraud detection, live dashboards, and real-time personalization.
  • Representative tools: Batch leans on Apache Spark and Hadoop MapReduce; stream leans on Apache Flink and Kafka Streams.

Best Practices for Data Processing Pipelines

Effective data processing demands a structured approach to design and execution. The four practices outlined below strengthen pipeline durability and adaptability, ensuring pipelines meet both present and future requirements.

1. Modular Design

  • Overview: Segment pipelines into distinct, reusable units.
    This method permits isolated modifications without impacting the entire system.
  • Implementation: Use functions or microservices to preserve adaptability. Look at your processes and identify clear boundaries between each one. For instance, you might have one function for entity resolution, another for sales aggregation and price resolution, and so on (see the sketch after this list).
  • Example: Should a transformation or entity resolution process fail or need an update, it can be corrected independently, leaving the rest of the processing unaffected.
  • Benefit: Facilitates troubleshooting and system enhancements, bolstering pipeline stability.
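
A minimal sketch of that modular idea in Python, with each stage as its own small function; the record fields and logic are invented for illustration.

```python
def resolve_entities(records):
    # Drop duplicate events that share the same order_id (simplified entity resolution)
    seen, unique = set(), []
    for record in records:
        if record["order_id"] not in seen:
            seen.add(record["order_id"])
            unique.append(record)
    return unique

def aggregate_sales(records):
    # Roll line items up into per-customer revenue totals
    totals = {}
    for record in records:
        totals[record["customer_id"]] = totals.get(record["customer_id"], 0) + record["amount"]
    return totals

def load_to_warehouse(totals):
    # Stand-in sink: in practice this would write to Snowflake or BigQuery
    print(totals)

def run_pipeline(raw_records):
    # Each stage can be tested, fixed, or replaced without touching the others
    load_to_warehouse(aggregate_sales(resolve_entities(raw_records)))
```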

2. Idempotent Operations

  • Overview: Design operations to be idempotent, ensuring repeated executions yield identical outcomes to a single execution. This is by far the most vital practice for failure management, especially in streaming contexts. 
  • Implementation: Assign unique identifiers to events and verify prior processing to prevent duplication, as sketched below.
  • Example: In e-commerce, idempotence avoids duplicate charges or repeated emails during transaction retries.
  • Benefit: Upholds data accuracy and reliability during system recovery attempts.
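
A bare-bones sketch of the idea: each event carries a unique identifier, and the handler checks it before acting, so a retry cannot charge a customer twice. The in-memory set stands in for a durable store such as a database table or Redis.

```python
processed_ids = set()  # stand-in for a durable store (e.g., a database table or Redis)

def charge_customer(customer_id, amount):
    print(f"Charging {customer_id}: {amount}")  # placeholder for the real payment call

def handle_payment_event(event):
    # Idempotent: a retried or duplicated event is recognized and skipped
    if event["event_id"] in processed_ids:
        return
    charge_customer(event["customer_id"], event["amount"])
    processed_ids.add(event["event_id"])
```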

3. Error Handling

  • Overview: Prepare for failures with thorough error management, incorporating logging, retries, and contingency measures.
  • Implementation: Leverage patterns such as a Kafka dead-letter topic for flawed data and Spark’s error logs for analysis. Apply retry mechanisms for temporary disruptions, as sketched below.
  • Example: In a streaming pipeline, malformed sensor data can be logged and redirected, allowing valid data processing to proceed.
  • Benefit: Minimizes data loss and expedites resolution of issues.
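
A rough sketch of the dead-letter pattern with the kafka-python producer: malformed sensor readings are logged and parked on a separate topic while valid data keeps flowing. Topic names and the broker address are placeholders.

```python
import json
import logging
from kafka import KafkaProducer

logger = logging.getLogger("sensor_pipeline")

# Producer used to park bad records on a dead-letter topic (names are placeholders)
dlq = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def process_reading(raw):
    try:
        reading = json.loads(raw)
        return reading["sensor_id"], float(reading["value"])
    except (json.JSONDecodeError, KeyError, ValueError) as err:
        logger.warning("Malformed reading redirected to dead-letter topic: %s", err)
        dlq.send("sensor-readings-dlq", value={"raw": str(raw), "error": str(err)})
        return None
```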

4. Monitoring and Observability

  • Overview: Regularly assess pipeline performance through metrics like throughput, latency, and error rates. Tools such as Prometheus or Grafana offer critical insights.
  • Implementation: Set up alerts for metric irregularities and address deviations promptly, as illustrated in the sketch below.
  • Example: In a marketing campaign, track click data processing times to mitigate delays before they affect outcomes.
  • Benefit: Enables early detection and correction of issues, maintaining operational consistency.
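
A small sketch using the prometheus_client library: a counter tracks throughput and a histogram tracks per-event latency, exposed on a local endpoint for Prometheus to scrape (and Grafana to chart). Metric names and the port are arbitrary choices for the example.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

CLICKS_PROCESSED = Counter("clicks_processed_total", "Clicks processed by the pipeline")
PROCESSING_LATENCY = Histogram("click_processing_seconds", "Per-click processing latency")

start_http_server(8000)  # metrics are exposed at http://localhost:8000/metrics

def handle_click(click):
    with PROCESSING_LATENCY.time():   # records how long this click takes to process
        time.sleep(0.01)              # stand-in for the real transformation work
    CLICKS_PROCESSED.inc()
```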

Following these practices is critical for sustaining efficient data processing pipelines.

Real-World Applications of Data Processing

Data processing is a key factor in business success for companies like Netflix and Walmart. Each uses a specific method (stream processing for Netflix, batch processing for Walmart) to meet its distinct needs. These examples show how data processing improves user experience, operational efficiency, and market position.

Netflix: Stream Processing for Real-Time Personalization

Netflix operates in an environment where speed and relevance matter. With millions of subscribers creating billions of daily events, such as plays, pauses, and searches, the platform needs to process this data immediately to keep its recommendation system accurate. Stream processing allows Netflix to analyze user actions as they occur, keeping recommendations current and personalized.

Netflix uses Apache Flink, a stream processing framework designed for fast, low-latency data handling. Flink processes billions of events in real time, updating suggestions as soon as a user finishes an episode. A notable feature is stateful processing, which tracks a user’s session across multiple actions. For instance, if a user watches several episodes of a series, the system records their progress and genre preferences, adjusting recommendations instantly.

The outcome is clear: Netflix keeps users engaged with tailored content, leading to longer viewing sessions and lower churn rates. In a competitive field where attention is critical, stream processing gives Netflix an edge in personalized entertainment.

Walmart: Batch Processing for Supply Chain Efficiency

Walmart manages a large network of stores, products, and supply chains. To maintain smooth operations, they use batch processing to analyze big datasets overnight and prepare for the next day’s demands.

Every night, Walmart gathers sales data, inventory levels, and transactions from its global stores. To efficiently process this massive dataset, it relies on Apache Spark, a powerful batch processing engine designed for large-scale workloads. Spark can process terabytes of data quickly, providing insights by morning. A key method is data partitioning, which splits large datasets into smaller parts processed simultaneously across multiple machines. This approach prevents delays, even with huge data volumes.

The result is practical: Walmart restocks shelves accurately, forecasts demand, and reduces overstock or shortages. Batch processing supports their operational efficiency, helping them compete in a busy market.

The Value of Data Processing

Netflix and Walmart demonstrate a central idea: data processing is a business advantage. For Netflix, stream processing delivers real-time personalization that retains users. For Walmart, batch processing ensures supply chain accuracy that boosts efficiency. While their methods differ, both companies show how choosing the right processing approach creates value.

In today's data-driven world, success hinges on efficient data processing. Whether it's engaging viewers or optimizing inventory, companies like Netflix and Walmart demonstrate how the right tools and strategies transform data into a competitive advantage. The key takeaway for any organization: invest in data processing to stay ahead.

Wrapping Up: Why Data Processing Matters

In Part 2 of Data Architectures Decoded, we have explored data processing, the step where raw data turns into useful information. We looked at two main methods: batch processing, which handles big chunks of data on a schedule, and stream processing, which gives real-time results as data comes in. But what happens to all that data after it’s been processed? Where does it go, and how do you make sure it’s ready when you need it? In Part 3, we’ll move on to data storage, the next key step in the data architecture journey. We’ll cover where your processed data lives, how to organize it, and which tools can help you manage it well.

Author

Quentin O. Kasseh

Quentin has over 15 years of experience designing cloud-based, AI-powered data platforms. As a tech startup founder, he specializes in transforming complex data into scalable solutions.
