What is RAG?
RAG, or Retrieval-Augmented Generation, is a method that combines the power of large language models (LLMs) with external knowledge sources. Unlike traditional LLMs, which generate responses based only on their training data, a RAG system retrieves relevant information from a knowledge base (like PDFs or documents) before generating an answer.
This allows for:
- More accurate and up-to-date answers.
- Context-aware responses based on your own data.
Key Components of a RAG System
A RAG system combines retrieval techniques with the power of generative models. Let’s break down its key components and how they work together:
1. Document Loader
The first step in a RAG pipeline is reading and extracting content from source documents. These can include PDFs, Word files, or plain text. A document loader converts them into raw textual data that can be processed further.
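As a minimal sketch of this step (using PyMuPDF, the same library we use later in this guide), extracting the raw text of a PDF might look like this:
import fitz  # PyMuPDF

def extract_pdf_text(pdf_path):
    # Open the PDF and concatenate the text of every page
    doc = fitz.open(pdf_path)
    return "".join(page.get_text() for page in doc)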
2. Text Chunker
Once the full document is loaded, the next step is to break it down into manageable pieces. This is called chunking. Large documents can exceed model limits or lose context if not properly segmented. A chunker splits the text into smaller, overlapping segments, making it easier for the system to process while preserving meaningful context.
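For example, with LangChain's RecursiveCharacterTextSplitter (the splitter we use later), chunking takes only a few lines; the chunk size and overlap below are illustrative values, not hard requirements:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the extracted text into overlapping chunks (sizes are illustrative)
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
chunks = splitter.split_text(full_text)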
3. Embedding Generator
Each chunk of text needs to be transformed into a numerical format that the system can work with. This is done through embeddings: vector representations of text. Embeddings capture the semantic meaning of a chunk, allowing the system to compare and retrieve similar content based on the user’s query.
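A minimal sketch, assuming Ollama is running locally with the nomic-embed-text model pulled (the same setup we use later), could call its embeddings endpoint like this:
import requests

def embed(text, model="nomic-embed-text"):
    # Ask the local Ollama server for a vector representation of the text
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    )
    response.raise_for_status()
    return response.json()["embedding"]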
4. Vector Store
Once all chunks are embedded, they are stored in a vector database. This storage solution enables efficient similarity search. When a user asks a question, the system searches the vector store to find chunks most relevant to the query. FAISS is commonly used here for its speed and scalability in handling high-dimensional vectors.
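Continuing the sketches above (embed and chunks are the hypothetical helpers from the previous snippets), building a FAISS index is straightforward:
import faiss
import numpy as np

# Embed every chunk and stack the vectors into a float32 matrix
vectors = np.array([embed(chunk) for chunk in chunks]).astype("float32")

# IndexFlatL2 performs exact nearest-neighbour search using L2 distance
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)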
5. Query Processing
When the user inputs a question, the system generates an embedding for the query. It then performs a similarity search in the vector store to retrieve the top relevant document chunks. These chunks act as the context that guides the language model to produce an accurate and grounded answer.
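Using the hypothetical embed helper and the index from the sketches above, retrieving the top matches for a question is a single search call:
# Embed the question and look up the 3 closest chunks
query_vec = np.array(embed("What is this document about?")).astype("float32").reshape(1, -1)
distances, indices = index.search(query_vec, 3)
top_chunks = [chunks[i] for i in indices[0]]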
6. Language Model (LLM)
The final stage involves combining the retrieved document chunks with the user’s question. This combined input is passed to a language model, which uses the context to generate a response. The better the retrieved context, the more accurate and relevant the answer will be.
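As a rough sketch (assuming the same local Ollama server with llama3.2 pulled, using top_chunks from the previous snippet and a hypothetical question string, and without the streaming we add later), the generation step can be as simple as:
# Build a prompt that grounds the model in the retrieved chunks
context = "\n\n".join(top_chunks)
prompt = f"""Use the following context to answer the question:

{context}

Question: {question}
Answer:"""

# Ask the local LLM for a single, non-streamed completion
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:latest", "prompt": prompt, "stream": False},
)
response.raise_for_status()
print(response.json()["response"])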
7. Caching Layer (Optional)
To improve performance, a caching layer can be introduced. Instead of recomputing embeddings and rebuilding the vector index every time, the system saves them locally. On the next run, it can directly load them — significantly speeding up processing and reducing redundant computations.
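In our case this amounts to writing the FAISS index and the chunk list to disk, which is exactly what the full script below does:
import pickle

# Save the index and chunks after the first run...
faiss.write_index(index, "faiss_index.index")
with open("chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)

# ...and load them on later runs instead of rebuilding everything
index = faiss.read_index("faiss_index.index")
with open("chunks.pkl", "rb") as f:
    chunks = pickle.load(f)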
What We’re Going to Build
In this blog, we’ll build a basic but functional RAG system that:
- Accepts a PDF as input: The system takes a PDF file, reads its content using PyMuPDF, and prepares it for processing.
- Chunks the text: The text is split into smaller, overlapping chunks using LangChain’s RecursiveCharacterTextSplitter. This makes the content easier to embed and retrieve later.
- Generates embeddings using Ollama locally: Each chunk is passed to a local embedding model via http://localhost:11434 to get its vector representation. This model could be nomic-embed-text or similar.
- Stores them in FAISS: The generated embeddings are stored in a FAISS index for fast similarity search and retrieval.
- Caches the chunks and index using Pickle: To avoid reprocessing the same PDF repeatedly, the chunks and FAISS index are cached locally using Python’s pickle module.
- Accepts user queries in the terminal: The CLI waits for user input, converts the question to an embedding, and performs a similarity search over the FAISS index.
- Streams the answer using an LLM: The relevant chunks and query are sent to a local LLM (e.g., llama3.2:latest) hosted via Ollama, and the response is streamed token-by-token in real time to the terminal.
Let’s begin
Before building anything, let’s set up a clean development environment.
Step 1: Install Python
Make sure Python 3.9 or newer is installed.
python3 --version
If not installed, download it from python.org/downloads.
Step 2: Set Up a Virtual Environment
python3 -m venv rag-env
source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
Step 3: Install Required Python Libraries
pip install langchain faiss-cpu pymupdf requests numpy
📝 Note: faiss-cpu is the CPU version of FAISS. If you’re using a GPU, install faiss-gpu instead.
Step 4: Install & Run Ollama Locally
Ollama lets you run LLMs and embedding models on your own machine.
Check out our detailed guide to learn how to set up and install Ollama locally.
Pull the required models:
ollama pull llama3.2:latest
ollama pull nomic-embed-text:latest
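You can confirm both models were downloaded by listing them:
ollama list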
Make sure the Ollama service is running:
ollama serve
By default, it runs on http://localhost:11434.
Step 5: Create Project Folder Structure
rag-project/
├── rag_pdf_cli.py   # Main script
├── docs/
│   └── myfile.pdf   # Your source PDFs
└── rag-env/         # Your Python virtual environment
Place any PDFs you want to query inside the docs/ folder.
Step 6: Create the Python script (rag_pdf_cli.py) and add the following code.
import os
import sys
import fitz  # PyMuPDF for PDF processing
import requests
import numpy as np
import faiss  # Facebook AI Similarity Search for efficient similarity search
import pickle
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Function to load and chunk PDF content
# Params:
#   pdf_path: Path to the PDF file
#   chunk_size: Size of each text chunk (in characters)
#   chunk_overlap: Number of characters to overlap between chunks
def load_pdf_chunks(pdf_path, chunk_size=200, chunk_overlap=50):
    # Open PDF and extract text from all pages
    doc = fitz.open(pdf_path)
    full_text = "".join([page.get_text() for page in doc])
    # Split text into overlapping chunks for better context preservation
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_text(full_text)

# Function to get text embeddings from Ollama API
# Params:
#   text: Input text to embed
#   model: Name of the embedding model to use
def get_ollama_embedding(text, model="nomic-embed-text"):
    url = "http://localhost:11434/api/embeddings"
    payload = {"model": model, "prompt": text}
    # Make API request to Ollama
    response = requests.post(url, json=payload)
    response.raise_for_status()
    data = response.json()
    return data["embedding"]

# Function to build FAISS index for similarity search
# Params:
#   chunks: List of text chunks to index
def build_faiss_index(chunks):
    # Get embedding dimension from first chunk
    dim = len(get_ollama_embedding(chunks[0]))
    # Initialize FAISS index using L2 distance
    index = faiss.IndexFlatL2(dim)
    # Create embeddings for all chunks
    vectors = [get_ollama_embedding(chunk) for chunk in chunks]
    # Add vectors to FAISS index
    index.add(np.array(vectors).astype("float32"))
    return index, vectors

# Function to retrieve most relevant chunks for a query
# Params:
#   query: User's question
#   chunks: List of all text chunks
#   index: FAISS index
#   k: Number of top chunks to retrieve
def get_top_chunks(query, chunks, index, k=3):
    # Get embedding for query
    query_vec = np.array(get_ollama_embedding(query)).astype("float32").reshape(1, -1)
    # Search for similar vectors in FAISS index
    distances, indices = index.search(query_vec, k)
    return [chunks[i] for i in indices[0]]

# Function to stream responses from Ollama LLM
# Params:
#   prompt: Input prompt for the LLM
#   model: Name of the LLM model to use
def stream_ollama_response(prompt, model="llama3.2:latest"):
    url = "http://localhost:11434/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": True}
    # Stream response from Ollama API
    response = requests.post(url, json=payload, stream=True)
    response.raise_for_status()
    # Process streamed response
    for line in response.iter_lines(decode_unicode=True):
        if line:
            try:
                data = json.loads(line)
                print(data.get("response", ""), end="", flush=True)
            except json.JSONDecodeError:
                continue
    print()  # New line after complete response

# Main RAG (Retrieval Augmented Generation) pipeline
def run_rag(pdf_path):
    # Define cache files for FAISS index and text chunks
    index_file = "faiss_index.index"
    chunks_file = "chunks.pkl"

    # Load cached data if available
    if os.path.exists(index_file) and os.path.exists(chunks_file):
        print("[+] Loading cached index and chunks...")
        index = faiss.read_index(index_file)
        with open(chunks_file, "rb") as f:
            chunks = pickle.load(f)
    else:
        # Process PDF and create new index if cache doesn't exist
        print("[+] Loading and chunking PDF...")
        chunks = load_pdf_chunks(pdf_path)
        print(f"[+] Loaded {len(chunks)} chunks.")
        print("[+] Building FAISS index...")
        index, _ = build_faiss_index(chunks)
        # Cache the index and chunks for future use
        faiss.write_index(index, index_file)
        with open(chunks_file, "wb") as f:
            pickle.dump(chunks, f)
        print("[+] Cached index and chunks saved.")

    # Interactive question-answering loop
    while True:
        query = input("\nYour Question (or type 'exit'): ")
        if query.lower() == "exit":
            break
        # Retrieve relevant chunks and create context
        top_chunks = get_top_chunks(query, chunks, index)
        context = "\n\n".join(top_chunks)
        # Create prompt with context and question
        prompt = f"""Use the following context to answer the question:

{context}

Question: {query}
Answer:"""
        # Generate and stream response
        stream_ollama_response(prompt)

# CLI entry point
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python rag_pdf_cli.py path_to_pdf")
    else:
        run_rag(sys.argv[1])
Step 7: Run the App
Once you’ve saved the script as rag_pdf_cli.py, you can run it like this:
python rag_pdf_cli.py docs/myfile.pdf
Conclusion
In this guide, we explored how to build a basic text-based RAG (Retrieval-Augmented Generation) system using Python, Ollama, and FAISS. From setting up the environment and processing PDFs, to generating embeddings and streaming LLM responses, you now have a working prototype that brings together modern LLM techniques and efficient vector search.
For more insightful tutorials, visit our Tech Blogs and explore the latest in Laravel, AI, and Vue.js development!