I attended a free Data Engineering BootCamp by Zach on YouTube a few months ago. The content was incredibly rich, covering everything from basic data pipelines to advanced distributed systems. Unfortunately, not long after, Zach transitioned the videos into a paid subscription model, and the free content disappeared. Determined not to lose this valuable knowledge, I embarked on a personal project: I gathered transcripts of the bootcamp sessions, which were available as PDFs, and built an interactive chatbot that could answer questions based on that entire corpus.

In this post, I'll walk you through the end-to-end process of building this Retrieval-Augmented Generation (RAG) system. I'll cover everything, from processing hundreds of thousands of tokens of bootcamp transcripts to designing a hybrid search mechanism that combines keyword-based and semantic retrieval, and finally generating answers using an OpenAI model (o1-mini) optimized for reasoning.
The GitHub repo for the project is linked at the end of this post!
The Motivation
The bootcamp’s videos were a treasure trove of data engineering insights. When they went behind a paywall, I realized that I had a unique opportunity to create a tool that would allow anyone to query that knowledge base—even if they couldn’t access the original videos. My goal was to build a chatbot that:
Understands natural language queries.
Retrieves the most relevant parts of a 700,000-token corpus.
Generates concise, informative responses.
Demo
Overview of the Architecture
The system is composed of several interconnected components:
Document Processing and Chunking
I started by collecting 39 PDFs containing the full transcripts (if you need an automated script to download Google Docs as PDFs, take a look at this repo). Using a custom Python module, I extracted text from these PDFs and broke them down into smaller chunks. This wasn't as simple as splitting paragraphs; to maintain context, I used a sliding window technique in which each chunk includes not only a portion of the text but also the surrounding sentences. This ensures the context stays coherent when it is later used by the language model.

Indexing with Elasticsearch and ChromaDB
For retrieval, I built a hybrid search system:

Elasticsearch is used to perform BM25-based keyword searches on the transcripts. This traditional approach helps retrieve documents that contain relevant terms.
ChromaDB is employed to store embeddings generated from the text chunks using a SentenceTransformer model (all-MiniLM-L6-v2). This allows for semantic search, where the meaning of the query is compared with the stored document embeddings.
By combining these two methods with a reciprocal rank fusion strategy, I could efficiently select the most relevant chunks from the corpus.
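To make that fusion strategy concrete, here is a minimal sketch of reciprocal rank fusion (not the exact code from the repo): every chunk receives a score of 1/(k + rank) from each ranked list it appears in, and the combined score determines the final ordering. The constant k damps the influence of lower-ranked results, so a chunk that shows up in both lists tends to beat one that tops only a single list.

```python
from collections import defaultdict

def reciprocal_rank_fusion(keyword_ids, semantic_ids, k=60, top_n=5):
    """Merge two ranked lists of chunk IDs using reciprocal rank fusion.

    Each chunk receives 1 / (k + rank) from every list it appears in;
    the highest combined scores win. k=60 is the commonly used constant.
    """
    scores = defaultdict(float)
    for ranked_list in (keyword_ids, semantic_ids):
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: chunk 12 ranks highly in both lists, so it comes out on top.
print(reciprocal_rank_fusion([12, 7, 3], [5, 12, 9]))
```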
LLM Integration for Answer Generation
Once the context is retrieved, it's fed into an OpenAI-powered language model. Instead of using heavyweight models like GPT-3.5 or GPT-4, I chose the cost-effective o1-mini-2024-09-12 model. I opted for o1-mini specifically for its strong reasoning capabilities relative to its cost. This model provided the right balance of performance and affordability, enabling me to handle large contexts without breaking the bank.

FastAPI and Streamlit
The backend is built using FastAPI, which exposes an endpoint (/chat) for handling queries. The endpoint:

Receives a user's question.
Performs a hybrid search on the knowledge base.
Constructs a prompt that includes a system instruction, the retrieved context, and the query.
Sends this prompt to the OpenAI API to generate a response.
The frontend is implemented in Streamlit, offering an interactive chat interface that displays the conversation history and even lets users inspect the source documents used to generate the answer.
Diving Deeper: How It All Works
Document Processing
I wrote a module called document_processing.py that uses pypdf to extract text from each PDF. Then, using NLTK's sentence tokenizer, I split the text into sentences and grouped them into chunks of 128 sentences. To ensure each chunk is contextually meaningful, I add a sliding window of 3 sentences before and after each chunk. This method balances the need for sufficient context against overwhelming the language model with too much text at once.
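As a rough sketch of that logic (function and parameter names here are illustrative, not the exact contents of document_processing.py, and it assumes NLTK's punkt tokenizer data is available):

```python
from pypdf import PdfReader
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer model

def chunk_pdf(path, chunk_size=128, window=3):
    """Split a PDF transcript into overlapping sentence chunks."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    sentences = nltk.sent_tokenize(text)

    chunks = []
    for start in range(0, len(sentences), chunk_size):
        end = start + chunk_size
        core = " ".join(sentences[start:end])
        # Sliding window: prepend/append a few neighbouring sentences for context.
        context = " ".join(sentences[max(0, start - window):min(len(sentences), end + window)])
        chunks.append({"content": core, "context": context})
    return chunks
```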
Hybrid Search with Elasticsearch and ChromaDB
The hybrid search mechanism is the heart of the retrieval process:
Elasticsearch: The elasticsearch_manager.py module creates an index (if one doesn't exist) and indexes each document chunk. When a query is made, Elasticsearch performs a BM25 search over the chunk content and context.

ChromaDB: Meanwhile, hybrid_search.py handles semantic search by embedding the chunks using SentenceTransformer and storing these vectors in ChromaDB. When a query comes in, the system encodes the query into an embedding and retrieves similar chunks based on cosine similarity.

Result Fusion: The results from both searches are merged using a reciprocal rank fusion technique, ensuring that the final context provided to the language model is both relevant and comprehensive. A minimal sketch of this retrieval flow follows below.
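Putting the two stores together, a retrieval call looks roughly like the sketch below (assuming the Elasticsearch 8.x Python client and a local ChromaDB collection; index, collection, and field names are illustrative rather than copied from the repo):

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer
import chromadb

es = Elasticsearch("http://localhost:9200")
embedder = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="chroma_db").get_or_create_collection("bootcamp_chunks")

def hybrid_search(query, top_k=10):
    # 1) BM25 keyword search over the chunk text in Elasticsearch.
    es_hits = es.search(
        index="bootcamp_chunks",
        query={"multi_match": {"query": query, "fields": ["content", "context"]}},
        size=top_k,
    )
    keyword_ids = [hit["_id"] for hit in es_hits["hits"]["hits"]]

    # 2) Semantic search: embed the query and pull the nearest chunks from ChromaDB.
    query_vec = embedder.encode(query).tolist()
    chroma_hits = collection.query(query_embeddings=[query_vec], n_results=top_k)
    semantic_ids = chroma_hits["ids"][0]

    # 3) Merge both rankings with reciprocal rank fusion (see the earlier sketch).
    return reciprocal_rank_fusion(keyword_ids, semantic_ids, top_n=top_k)
```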
Generating Responses with OpenAI
The llm_integration.py module wraps around the OpenAI API. It constructs a prompt by combining:
A system prompt that instructs the assistant to be concise and to ask for clarification if the context is insufficient.
The retrieved context (with sources indicated).
The user’s query.
The chosen o1-mini-2024-09-12 model is used here because it offers strong reasoning capabilities at a fraction of the cost of larger models like GPT-3.5 or GPT-4. The focus is on ensuring that even with extensive input contexts, the model can generate meaningful and precise answers.
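Roughly, the prompt assembly and API call look like the sketch below; the chunk field names (source, context) are assumptions for illustration rather than the repo's exact schema, and the instructions are folded into the user message because the early o1 models restricted the system role.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTIONS = (
    "You are a data engineering assistant. Answer concisely using only the "
    "provided context, and ask for clarification if the context is insufficient."
)

def generate_answer(question, chunks):
    # Each chunk carries its source so the answer can point back to where it came from.
    context = "\n\n".join(f"[{c['source']}]\n{c['context']}" for c in chunks)
    prompt = f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"

    # o1-mini restricts the system role, so everything goes into a single user message.
    response = client.chat.completions.create(
        model="o1-mini-2024-09-12",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```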
API and Frontend
Backend: The FastAPI server (api.py) handles incoming POST requests on /chat. It retrieves context, constructs the prompt, calls the OpenAI API, and returns the response along with the source document identifiers. A stripped-down sketch of this endpoint appears after this list.

Frontend: The Streamlit app (frontend/app.py) provides a chat interface where users can ask questions. It displays the conversation history and allows users to view the sources for each answer.
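For reference, the endpoint could look like the following; it reuses the hybrid_search and generate_answer sketches from earlier, and load_chunks is a hypothetical helper standing in for however the repo fetches chunk text and sources.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    question: str

@app.post("/chat")
def chat(request: ChatRequest):
    # 1) Retrieve the most relevant chunks via hybrid search.
    chunk_ids = hybrid_search(request.question)
    chunks = load_chunks(chunk_ids)  # hypothetical helper returning chunk text + source

    # 2) Build the prompt and call the OpenAI API.
    answer = generate_answer(request.question, chunks)

    # 3) Return the answer together with the source documents used.
    return {"answer": answer, "sources": sorted({c["source"] for c in chunks})}
```

The Streamlit frontend then simply POSTs the user's question to this endpoint and renders the answer and its sources in the chat history.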
Performance Considerations and Cost Management
One of the biggest challenges was handling the large amount of context from 700,000 tokens. Initially, I experimented with feeding up to 80k tokens into the model, which resulted in very long inference times (over 30 seconds). To address this:
Selective Retrieval: The hybrid search mechanism ensures only the most relevant chunks are used.
Chunking and Context Limitation: I ensure the model isn't overwhelmed by limiting the number of sentences per chunk and adding contextual windows.
Cost-Effective Model Selection: I specifically chose the o1-mini-2024-09-12 model for its cost-effectiveness and reasoning prowess. While models like GPT-3.5 or GPT-4 are impressive, they are significantly more expensive. o1-mini strikes a balance between cost and performance, making it ideal for this project.
Challenges and Next Steps
Challenges
Managing Large Contexts: With 700,000 tokens spanning 39 PDFs, it was challenging to ensure that the model received only the most relevant context without overwhelming it.
Cost Management: Using a model that provides good reasoning without incurring high costs was critical.
Next Steps
Re-Ranking: Incorporate a re-ranking step to further refine the retrieval results by selecting the top relevant documents from an initially larger set.
Streaming and Parallelization: Optimize API calls by enabling streaming responses and parallelizing retrieval processes to reduce latency.
Deployment: Containerize the application with Docker and deploy it on a cloud platform for better scalability.
User Authentication: Add authentication to secure the application, especially if it’s deployed publicly.
Conclusion
This project represents my journey in preserving and democratizing the knowledge I gained during the free Data Engineering BootCamp. By leveraging modern NLP techniques, hybrid search methods, and cost-effective language models, I built an interactive assistant that brings valuable content to users in a conversational format. I hope this inspires others to explore creative ways to harness and share knowledge!