
7 Powerful Steps to Build a RAG System with Ollama

What is RAG?

RAG, or Retrieval-Augmented Generation, is a method that combines the power of large language models (LLMs) with external knowledge sources. Unlike traditional LLMs, which generate responses based only on their training data, a RAG system retrieves relevant information from a knowledge base (like PDFs or documents) before generating an answer.

This allows for:

  • More accurate and up-to-date answers.
  • Context-aware responses based on your own data.

Key Components of a RAG System

A RAG system combines retrieval techniques with the power of generative models. Let’s break down its key components and how they work together:

1. Document Loader

The first step in a RAG pipeline is reading and extracting content from source documents. These can include PDFs, Word files, or plain text. A document loader converts them into raw text that can be processed further.
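
A minimal sketch of this step with PyMuPDF (the fitz module used in the full script below) looks like this; the file path is just an example:

import fitz  # PyMuPDF

doc = fitz.open("docs/myfile.pdf")
raw_text = "".join(page.get_text() for page in doc)  # concatenate text from all pages
doc.close()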

2. Text Chunker

Once the full document is loaded, the next step is to break it down into manageable pieces. This is called chunking. Large documents can exceed a model's context limits or lose coherence if not properly segmented. A chunker splits the text into smaller, overlapping segments, making it easier for the system to process while preserving meaningful context.
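
As a quick preview of the splitter used in Step 6, here is how LangChain's RecursiveCharacterTextSplitter chunks the extracted text (raw_text comes from the previous step; the sizes mirror the script's defaults):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
chunks = splitter.split_text(raw_text)
print(f"Created {len(chunks)} chunks")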

3. Embedding Generator

Each chunk of text needs to be transformed into a numerical format that the system can work with. This is done through embeddings: vector representations of text. Embeddings capture the semantic meaning of a chunk, allowing the system to compare and retrieve similar content based on the user's query.
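
A single chunk can be embedded through Ollama's /api/embeddings endpoint, just as the helper function in Step 6 does; the chunk text here is a placeholder:

import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "example chunk of text"},
)
resp.raise_for_status()
embedding = resp.json()["embedding"]  # a list of floats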

4. Vector Store

Once all chunks are embedded, they are stored in a vector database. This storage solution enables efficient similarity search. When a user asks a question, the system searches the vector store to find chunks most relevant to the query. FAISS is commonly used here for its speed and scalability in handling high-dimensional vectors.
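
Here is a minimal sketch of indexing a list of embedding vectors with FAISS, matching the exact L2 search (IndexFlatL2) used in the script below; vectors is assumed to be the list of embeddings produced in the previous step:

import faiss
import numpy as np

dim = len(vectors[0])                   # embedding dimensionality
index = faiss.IndexFlatL2(dim)          # brute-force L2 (Euclidean) index
index.add(np.array(vectors).astype("float32"))
print(index.ntotal, "vectors indexed")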

5. Query Processing

When the user inputs a question, the system generates an embedding for the query. It then performs a similarity search in the vector store to retrieve the top relevant document chunks. These chunks act as the context that guides the language model to produce an accurate and grounded answer.
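
Retrieval then boils down to embedding the query and asking FAISS for its nearest neighbors, as get_top_chunks does in Step 6 (get_ollama_embedding, index, and chunks are the objects defined in the script and sketches above):

import numpy as np

query = "What is this document about?"   # example question
query_vec = np.array(get_ollama_embedding(query)).astype("float32").reshape(1, -1)
distances, indices = index.search(query_vec, 3)   # indices of the 3 closest chunks
top_chunks = [chunks[i] for i in indices[0]]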

6. Language Model (LLM)

The final stage involves combining the retrieved document chunks with the user's question. This combined input is passed to a language model, which uses the context to generate a response. The better the retrieved context, the more accurate and relevant the answer will be.
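
A simplified, non-streaming sketch of this step: the retrieved chunks (context) and the question (query) are folded into one prompt and sent to Ollama's /api/generate endpoint. The full script below streams the response instead.

import requests

prompt = f"Use the following context to answer the question:\n\n{context}\n\nQuestion: {query}\nAnswer:"
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:latest", "prompt": prompt, "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])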

7. Caching Layer (Optional)

To improve performance, a caching layer can be introduced. Instead of recomputing embeddings and rebuilding the vector index every time, the system saves them locally. On the next run, it can directly load them — significantly speeding up processing and reducing redundant computations.
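
Concretely, this means persisting the FAISS index and the chunk list to disk and loading them back on later runs, exactly as run_rag does in Step 6:

import pickle
import faiss

# First run: save the index and chunks
faiss.write_index(index, "faiss_index.index")
with open("chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)

# Later runs: load them instead of recomputing
index = faiss.read_index("faiss_index.index")
with open("chunks.pkl", "rb") as f:
    chunks = pickle.load(f)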

What We’re Going to Build

In this blog, we'll build a basic but functional RAG system that:

  • Accepts a PDF as input: The system takes a PDF file, reads its content using PyMuPDF, and prepares it for processing.
  • Chunks the text: The text is split into smaller, overlapping chunks using LangChain’s RecursiveCharacterTextSplitter. This makes the content easier to embed and retrieve later.
  • Generates embeddings using Ollama locally: Each chunk is passed to a local embedding model via http://localhost:11434 to get its vector representation. This model could be nomic-embed-text or similar.
  • Stores them in FAISS: The generated embeddings are stored in a FAISS index for fast similarity search and retrieval.
  • Caches the chunks and index using Pickle: To avoid reprocessing the same PDF repeatedly, the chunks and FAISS index are cached locally using Python’s pickle module.
  • Accepts user queries in the terminal: The CLI waits for user input, converts the question to an embedding, and performs a similarity search over the FAISS index.
  • Streams the answer using an LLM: The relevant chunks and query are sent to a local LLM (e.g., llama3.2:latest) hosted via Ollama, and the response is streamed token-by-token in real time to the terminal.

Let’s begin

Before building anything, let’s set up a clean development environment.

Step 1: Install Python

Make sure Python 3.9 or newer is installed.

python3 --version

If not installed, download it from python.org/downloads.

Step 2: Set Up a Virtual Environment

python3 -m venv rag-env
source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate

Step 3: Install Required Python Libraries

pip install langchain faiss-cpu pymupdf requests numpy

📝 Note: faiss-cpu is the CPU version of FAISS. If you’re using a GPU, install faiss-gpu instead.

Step 4: Install & Run Ollama Locally

Ollama lets you run LLMs and embedding models on your own machine.

Check out our detailed guide to learn how to set up and install Ollama locally.

Pull the required models:

ollama pull llama3.2:latest
ollama pull nomic-embed-text:latest

Make sure the Ollama service is running:

ollama serve

By default, it runs on http://localhost:11434.
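
If you want a quick sanity check from Python that the server is reachable and the models are pulled, the /api/tags endpoint lists locally available models (assuming the default port):

import requests

models = requests.get("http://localhost:11434/api/tags").json()
print([m["name"] for m in models.get("models", [])])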

Step 5: Create Project Folder Structure

rag-project/
├── rag_pdf_cli.py        # Main script
├── docs/
│   └── myfile.pdf        # Your source PDFs
└── rag-env/              # Your Python virtual environment

Place any PDFs you want to query inside the docs/ folder.

Step 6: Create the Python Script (rag_pdf_cli.py)

Create rag_pdf_cli.py in the project root and add the following code:

import os
import sys
import fitz  # PyMuPDF for PDF processing
import requests
import numpy as np
import faiss  # Facebook AI Similarity Search for efficient similarity search
import pickle
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Function to load and chunk PDF content
# Params:
#   pdf_path: Path to the PDF file
#   chunk_size: Size of each text chunk (in characters)
#   chunk_overlap: Number of characters to overlap between chunks
def load_pdf_chunks(pdf_path, chunk_size=200, chunk_overlap=50):
    # Open PDF and extract text from all pages
    doc = fitz.open(pdf_path)
    full_text = "".join([page.get_text() for page in doc])
    # Split text into overlapping chunks for better context preservation
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_text(full_text)

# Function to get text embeddings from Ollama API
# Params:
#   text: Input text to embed
#   model: Name of the embedding model to use
def get_ollama_embedding(text, model="nomic-embed-text"):
    url = "http://localhost:11434/api/embeddings"
    payload = {
        "model": model,
        "prompt": text
    }
    # Make API request to Ollama
    response = requests.post(url, json=payload)
    response.raise_for_status()
    data = response.json()
    return data["embedding"]

# Function to build FAISS index for similarity search
# Params:
#   chunks: List of text chunks to index
def build_faiss_index(chunks):
    # Get embedding dimension from first chunk
    dim = len(get_ollama_embedding(chunks[0]))
    # Initialize FAISS index using L2 distance
    index = faiss.IndexFlatL2(dim)
    # Create embeddings for all chunks
    vectors = [get_ollama_embedding(chunk) for chunk in chunks]
    # Add vectors to FAISS index
    index.add(np.array(vectors).astype("float32"))
    return index, vectors

# Function to retrieve most relevant chunks for a query
# Params:
#   query: User's question
#   chunks: List of all text chunks
#   index: FAISS index
#   k: Number of top chunks to retrieve
def get_top_chunks(query, chunks, index, k=3):
    # Get embedding for query
    query_vec = np.array(get_ollama_embedding(query)).astype("float32").reshape(1, -1)
    # Search for similar vectors in FAISS index
    distances, indices = index.search(query_vec, k)
    return [chunks[i] for i in indices[0]]

# Function to stream responses from Ollama LLM
# Params:
#   prompt: Input prompt for the LLM
#   model: Name of the LLM model to use
def stream_ollama_response(prompt, model="llama3.2:latest"):
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": True
    }
    # Stream response from Ollama API
    response = requests.post(url, json=payload, stream=True)
    response.raise_for_status()

    # Process streamed response
    for line in response.iter_lines(decode_unicode=True):
        if line:
            try:
                data = json.loads(line)
                print(data.get("response", ""), end="", flush=True)
            except json.JSONDecodeError:
                continue
    print()  # New line after complete response

# Main RAG (Retrieval Augmented Generation) pipeline
def run_rag(pdf_path):
    # Define cache files for FAISS index and text chunks
    index_file = "faiss_index.index"
    chunks_file = "chunks.pkl"

    # Load cached data if available
    if os.path.exists(index_file) and os.path.exists(chunks_file):
        print("[+] Loading cached index and chunks...")
        index = faiss.read_index(index_file)
        with open(chunks_file, "rb") as f:
            chunks = pickle.load(f)
    else:
        # Process PDF and create new index if cache doesn't exist
        print("[+] Loading and chunking PDF...")
        chunks = load_pdf_chunks(pdf_path)
        print(f"[+] Loaded {len(chunks)} chunks.")

        print("[+] Building FAISS index...")
        index, _ = build_faiss_index(chunks)

        # Cache the index and chunks for future use
        faiss.write_index(index, index_file)
        with open(chunks_file, "wb") as f:
            pickle.dump(chunks, f)
        print("[+] Cached index and chunks saved.")

    # Interactive question-answering loop
    while True:
        query = input("\nYour Question (or type 'exit'): ")
        if query.lower() == "exit":
            break

        # Retrieve relevant chunks and create context
        top_chunks = get_top_chunks(query, chunks, index)
        context = "\n\n".join(top_chunks)

        # Create prompt with context and question
        prompt = f"""Use the following context to answer the question:

{context}

Question: {query}
Answer:"""

        # Generate and stream response
        stream_ollama_response(prompt)

# CLI entry point
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python rag_pdf_cli.py path_to_pdf")
    else:
        run_rag(sys.argv[1])

Step 7: Run the App

Once you’ve saved the script as rag_pdf_cli.py, run it with:

python rag_pdf_cli.py docs/myfile.pdf

Conclusion

In this guide, we explored how to build a basic text-based RAG (Retrieval-Augmented Generation) system using Python, Ollama, and FAISS. From setting up the environment and processing PDFs, to generating embeddings and streaming LLM responses, you now have a working prototype that brings together modern LLM techniques and efficient vector search.

For more insightful tutorials, visit our Tech Blogs and explore the latest in Laravel, AI, and Vue.js development!
