Real-Time E-Commerce Analytics with Apache Flink, Elasticsearch, and Postgres
Introduction
In today's fast-paced digital economy, real-time analytics is crucial for gaining business insights and making data-driven decisions. This post covers a real-time sales analytics pipeline built using Apache Flink, Kafka, Elasticsearch, and Postgres. By leveraging stream processing, this project enables real-time tracking of financial transactions, category-wise sales, and daily/monthly revenue.
Project Overview
This project is an Apache Flink application that processes financial transaction data streamed via Kafka. The processed data is then stored in both Postgres (for structured storage) and Elasticsearch (for full-text search and analysis). The entire system is orchestrated with Docker Compose, so the whole stack can be brought up locally with a single command.
Architecture
The system architecture follows this flow:
Sales Transaction Generator (Python - main.py):
- Simulates real-time financial transactions and pushes data into Kafka topics.
Kafka Consumer (Flink Application - DataStreamJob.java):
- Reads transaction data from Kafka.
- Performs aggregations and transformations.
- Writes the results into Postgres and Elasticsearch.
Postgres Database:
- Stores structured data for analytical queries.
- Maintains the tables transactions, sales_per_category, sales_per_day, and sales_per_month.
Elasticsearch:
- Stores indexed transaction data for full-text search and visualization.
- Enables fast retrieval and filtering of financial records.
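To make the Flink job's aggregations concrete, here is a minimal Python sketch of the same logic applied to a static batch of records. The real job in DataStreamJob.java computes this continuously over the Kafka stream; the field names (category, total_amount, date) are illustrative assumptions, not the project's actual schema.

```python
from collections import defaultdict

# Illustrative transaction records; the real schema lives in the Flink job
# and the Postgres tables (field names here are assumptions).
transactions = [
    {"category": "electronics", "total_amount": 120.0, "date": "2024-05-01"},
    {"category": "books", "total_amount": 35.5, "date": "2024-05-01"},
    {"category": "electronics", "total_amount": 80.0, "date": "2024-05-02"},
]

def sales_per_category(records):
    """Sum revenue per product category (mirrors the sales_per_category table)."""
    totals = defaultdict(float)
    for r in records:
        totals[r["category"]] += r["total_amount"]
    return dict(totals)

def sales_per_day(records):
    """Sum revenue per calendar day (mirrors the sales_per_day table)."""
    totals = defaultdict(float)
    for r in records:
        totals[r["date"]] += r["total_amount"]
    return dict(totals)
```

The monthly rollup (sales_per_month) is the same idea keyed on the year-month prefix of the date.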
Setting Up the Project
Prerequisites
Docker & Docker Compose
Apache Flink (1.20.0)
Java 21
Maven
Python (for transaction generator)
Installation Steps
Clone the repository:
git clone <repository_url>
cd flink-ecommerce-analytics
Start the infrastructure using Docker Compose:
docker-compose up
Generate sales transactions:
python main.py
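For a feel of what main.py produces, here is a hedged sketch of a transaction generator. The field names, the topic name financial_transactions, and the kafka-python dependency are all assumptions for illustration; check main.py in the repository for the actual schema and producer setup.

```python
import json
import random
import uuid
from datetime import datetime, timezone

def generate_transaction():
    """Build one random sales transaction (illustrative field names)."""
    price = round(random.uniform(5, 500), 2)
    quantity = random.randint(1, 5)
    return {
        "transaction_id": str(uuid.uuid4()),
        "product_category": random.choice(["electronics", "books", "clothing"]),
        "price": price,
        "quantity": quantity,
        "total_amount": round(price * quantity, 2),
        "transaction_date": datetime.now(timezone.utc).isoformat(),
    }

def stream_to_kafka(bootstrap="localhost:9092", topic="financial_transactions"):
    """Push generated transactions into Kafka.

    Requires the kafka-python package and the broker from docker-compose;
    the topic name is an assumption. Not invoked at import time.
    """
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(topic, generate_transaction())
    producer.flush()
```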
Compile and build the Flink application:
mvn clean compile package
Run the Flink job:
/opt/homebrew/Cellar/apache-flink/1.20.0/bin/flink run -c flinkecommerce.DataStreamJob target/flinkecommerce-1.0-SNAPSHOT.jar
Verify data in Postgres:
SELECT * FROM transactions;
Search indexed records in Elasticsearch:
curl -X GET "http://localhost:9200/transactions/_search" -H 'Content-Type: application/json' -d'{"query": {"match_all": {}}}'
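The same search can be issued from Python using only the standard library; a small sketch, assuming the Elasticsearch container from docker-compose is reachable on localhost:9200 and the index name transactions matches the curl command above:

```python
import json
import urllib.request

def match_all_query(size=10):
    """Build an Elasticsearch match_all query body."""
    return {"query": {"match_all": {}}, "size": size}

def search_transactions(host="http://localhost:9200", index="transactions"):
    """POST the query to the _search endpoint and return the raw hits.

    Requires a running Elasticsearch instance (not called at import time).
    """
    req = urllib.request.Request(
        f"{host}/{index}/_search",
        data=json.dumps(match_all_query()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _search accepts both GET-with-body and POST
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["hits"]["hits"]
```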
Running Flink in VS Code (macOS)
For development, install the required VS Code extensions:
Maven for Java
Extension Pack for Java
Create a Maven Project
mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.20.0 \
-DgroupId=com.example \
-DartifactId=flink-project \
-Dversion=1.0-SNAPSHOT \
-DinteractiveMode=false
Start Flink Cluster
From the Flink installation directory (e.g. /opt/homebrew/Cellar/apache-flink/1.20.0 on Homebrew), run:
bin/jobmanager.sh start
bin/taskmanager.sh start
Conclusion
This real-time analytics pipeline demonstrates how Apache Flink can be used to process streaming transaction data at scale. By integrating Postgres and Elasticsearch, businesses can gain powerful insights into their sales trends, helping them make data-driven decisions efficiently.
Stay tuned for more updates, and feel free to reach out with questions or suggestions! 🚀