Streamlining Data ETL with Apache Airflow and Python
In today’s data-driven world, the ability to process, analyze, and derive insights from vast volumes of data is crucial for businesses and organizations. Data engineering is the backbone of this data-driven ecosystem, responsible for collecting, transforming, and preparing data for downstream analysis. However, as data complexity and scale continue to grow, managing data engineering processes manually becomes impractical. This is where Apache Airflow steps in as a powerful workflow automation tool that simplifies, streamlines, and enhances the efficiency of data engineering tasks.
1. Use Case: Automating Data ETL
Imagine a retail company with an extensive online presence that generates vast amounts of sales data daily. The company needs to analyze this data to make informed decisions, such as adjusting inventory levels, optimizing marketing campaigns, and identifying customer trends. To accomplish this, they must collect data from various sources, clean and transform it, and load it into a data warehouse for analysis.
Sample Data:
Here’s a simplified sample dataset representing daily sales data:
Date,Product,Category,Units Sold,Revenue
2023-01-01,Product A,Electronics,50,1000
2023-01-01,Product B,Apparel,30,450
2023-01-02,Product A,Electronics,55,1100
2023-01-02,Product B,Apparel,35,525
2023-01-03,Product A,Electronics,60,1200
2023-01-03,Product B,Apparel,40,600
...
This dataset includes information such as the date of sale, the product sold, the product category, the number of units sold, and the revenue generated. The retail company needs to regularly collect, transform, and load this data into their data warehouse to keep their analysis up to date and enable data-driven decision-making. This is where Apache Airflow comes into play to automate the ETL (Extract, Transform, Load) process efficiently.
2. Data Extraction and Transformation
In this section, we will delve into the data extraction and transformation steps using Python scripts and Apache Airflow DAGs.
Data Extraction with Python:
To start the ETL process, we need to extract data from the source, which could be databases, APIs, or external files. In our retail sales scenario, we will focus on extracting data from a CSV file. Python provides various libraries to handle data extraction, such as Pandas.
Here’s a Python snippet illustrating how to read our sample CSV data:
import pandas as pd
# Define the file path
data_file = 'sales_data.csv'
# Read the CSV file into a Pandas DataFrame
data = pd.read_csv(data_file)
# Display the first few rows of the data
print(data.head())
This code uses the Pandas library to read the CSV file into a DataFrame, making it easy to work with the data.
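If you also want the Date column parsed as datetime values rather than plain strings, read_csv can handle that directly; a small, optional variation on the snippet above:
# Parse the Date column as datetime while reading the CSV
data = pd.read_csv(data_file, parse_dates=['Date'])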
Data Transformation with Python:
Once we have the data, we often need to clean and transform it to prepare it for analysis. In our example, we’ll perform a simple data transformation by calculating the total revenue for each product category.
Here’s a Python snippet to perform this transformation:
# Group the data by 'Category' and calculate the sum of 'Revenue'
category_revenue = data.groupby('Category')['Revenue'].sum().reset_index()
# Display the transformed data
print(category_revenue)
This code groups the data by the ‘Category’ column and calculates the sum of ‘Revenue’ for each category, resulting in a DataFrame with the total revenue per category.
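For the six sample rows shown above, the output would look something like this (the actual totals depend on the full dataset, which is truncated here):
      Category  Revenue
0      Apparel     1575
1  Electronics     3300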
Creating an Apache Airflow DAG:
While Python is excellent for data extraction and transformation at a small scale, automating these tasks becomes crucial when dealing with large volumes of data over time. Apache Airflow provides a powerful solution for creating data pipelines as Directed Acyclic Graphs (DAGs).
Here’s a high-level explanation of creating an Airflow DAG for our scenario:
Define the DAG structure, including tasks and dependencies.
Define Python functions for each task, encapsulating the ETL logic.
Configure task dependencies to ensure they execute in the correct order.
Below is a simplified example of an Apache Airflow DAG for our retail sales scenario. Note that this is a basic illustration, and in a real-world scenario, you would include error handling, scheduling, and more complex logic.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Define the default_args dictionary
default_args = {
    'owner': 'your_name',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

# Create a DAG instance
dag = DAG(
    'retail_sales_etl',
    default_args=default_args,
    schedule_interval='@daily',  # Run the DAG daily
    catchup=False,
)
# Define Python functions for extraction and transformation
def extract_data():
    # Your data extraction logic here
    pass

def transform_data():
    # Your data transformation logic here
    pass
# Define Airflow tasks
extract_task = PythonOperator(
    task_id='extract_data',
    python_callable=extract_data,
    dag=dag,
)

transform_task = PythonOperator(
    task_id='transform_data',
    python_callable=transform_data,
    dag=dag,
)
# Define task dependencies
extract_task >> transform_task
In this simplified DAG, extract_data and transform_data represent Python functions that perform data extraction and transformation. The >> operator defines the task dependency, ensuring that transform_data runs only after extract_data has completed successfully.
This Airflow DAG automates the ETL process, making it efficient, repeatable, and able to process new data on a regular schedule.
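To complete the pipeline, you would typically add a third task for the load step. Here is a minimal sketch, assuming a load_data function that wraps the MySQL loading logic covered in the next section:
# Placeholder for the load step (implemented in the next section)
def load_data():
    # Your data loading logic here
    pass

load_task = PythonOperator(
    task_id='load_data',
    python_callable=load_data,
    dag=dag,
)

# Chain all three tasks: extract, then transform, then load
extract_task >> transform_task >> load_task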
3. Loading Data into MySQL
Now that we’ve successfully extracted and transformed our sample retail sales data, the next step is to load it into our MySQL database. In this section, we will demonstrate how to establish a connection to MySQL using Python and then load our transformed data into the database.
Connecting to MySQL with Python:
To interact with MySQL from Python, we'll use the mysql-connector-python library, which provides a convenient way to establish a connection and execute SQL queries.
First, you need to install the library using pip if you haven’t already:
pip install mysql-connector-python
Next, let’s create a Python script to connect to our MySQL database. Replace the placeholders in the code with your actual database credentials.
import mysql.connector

# MySQL database credentials
db_config = {
    'host': 'your_host',
    'user': 'your_user',
    'password': 'your_password',
    'database': 'your_database',
}

# Create a connection to MySQL
conn = mysql.connector.connect(**db_config)

# Create a cursor object to interact with the database
cursor = conn.cursor()

# Keep the cursor and connection open for now; we'll close them
# after loading the data in the next step
In this script, we establish a connection to MySQL using the provided credentials. Make sure to replace 'your_host', 'your_user', 'your_password', and 'your_database' with your actual database details.
Loading Data into MySQL:
Now that we have a MySQL connection, let’s load our transformed data into a MySQL table. We’ll assume that you’ve already created an empty table in your database with the same structure as your transformed data.
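If you haven't created the table yet, a minimal definition could look like the sketch below; the table name your_table and the column types are placeholders you should adapt to your own schema:
# Hypothetical table definition -- adjust the name and column types as needed
create_table_query = """
CREATE TABLE IF NOT EXISTS your_table (
    Category VARCHAR(100) NOT NULL,
    Revenue DECIMAL(12, 2) NOT NULL
)
"""
cursor.execute(create_table_query)
conn.commit()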
Here’s how you can load data into MySQL using Python:
# Replace 'your_table' with your actual table name
table_name = 'your_table'

# SQL statement to insert data
insert_query = f"INSERT INTO {table_name} (Category, Revenue) VALUES (%s, %s)"

# Prepare data for insertion (category_revenue is the DataFrame from the transformation step)
data_to_insert = [(row['Category'], row['Revenue']) for _, row in category_revenue.iterrows()]

# Execute the insert query
cursor.executemany(insert_query, data_to_insert)

# Commit the changes to the database
conn.commit()

# Close the cursor and connection now that loading is complete
cursor.close()
conn.close()
In this code snippet, we specify the target table name, build an SQL insert query, prepare the data for insertion, and then execute the query using executemany. Finally, we commit the changes to the database and close the cursor and connection.
By following these steps, you can seamlessly load your transformed data into MySQL, making it accessible for various data-driven applications and real-time dashboards.
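To plug this load step into the Airflow DAG from earlier, one option is to wrap the connection and insert logic into a single helper function that the load_data task could call. The sketch below assumes the same placeholder credentials and table name as above; in a real DAG, the transformed DataFrame would be passed between tasks via XCom or an intermediate file rather than a module-level variable:
import mysql.connector

def load_to_mysql(category_revenue):
    """Insert the aggregated category revenue into MySQL (placeholder credentials)."""
    conn = mysql.connector.connect(
        host='your_host',
        user='your_user',
        password='your_password',
        database='your_database',
    )
    cursor = conn.cursor()
    try:
        insert_query = "INSERT INTO your_table (Category, Revenue) VALUES (%s, %s)"
        data_to_insert = [(row['Category'], row['Revenue'])
                          for _, row in category_revenue.iterrows()]
        cursor.executemany(insert_query, data_to_insert)
        conn.commit()
    finally:
        cursor.close()
        conn.close()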
4. Conclusion
In this article, we've taken a hands-on approach to understanding the process of extracting, transforming, and loading (ETL) data into a MySQL database using Python. Here are the key takeaways:
Data Extraction and Transformation: We learned how to use Python to extract data from a sample CSV file, perform an essential data transformation, and prepare it for loading into a database. In our example, this meant aggregating revenue by product category.
MySQL Database: We explored how to connect to a MySQL database using Python and load data into it. This step is crucial for making our transformed data accessible for various data-driven applications.
Apache Airflow: We highlighted the significance of Apache Airflow in automating the ETL process. By creating Directed Acyclic Graphs (DAGs), we can orchestrate data workflows efficiently, ensuring data is updated and processed on a regular schedule.
By mastering the ETL process, you can streamline your data pipeline, ensuring that data is up-to-date, accurate, and readily available for analysis. Whether you’re working with financial data, sales data, or any other domain, these techniques are fundamental for effective data management and analysis.