Retrieval-Augmented Generation (RAG) has become a popular approach for building AI chatbots that give accurate, context-aware answers. In this tutorial, we will build a RAG-based chatbot where users can upload multiple documents (PDF, Word, etc.) to create a custom knowledge base. When a user asks a question, the chatbot fetches the relevant context from these documents using Chroma DB (a vector database) and returns a precise answer.
Watch the full step-by-step video tutorial here:
👉 YouTube Video: Build RAG-Based Chatbot with Multi-File Upload + Chroma DB
What is RAG (Retrieval-Augmented Generation)?
RAG combines:
- Retrieval → fetching the most relevant chunks from a knowledge base (here, the uploaded documents stored in Chroma DB).
- Generation → an LLM that composes the final answer using the retrieved context.
This approach ensures:
- Answers are grounded in your own documents instead of only the model's training data.
- Fewer hallucinations and more precise, context-aware responses.
Architecture Overview:
- User uploads documents (PDF, Word, PPT, etc.)
- Text extraction → Convert documents into text (a minimal extraction sketch is included with the code snippets below).
- Embeddings creation → Convert text into vector embeddings.
- Store embeddings in Chroma DB.
- Query flow:
  - Convert the user query into an embedding.
  - Fetch the top N relevant chunks from Chroma DB.
  - Pass context + query to the LLM for the response.
Tools & Libraries Used:
- Python for backend
- Chroma DB for vector storage
- Hugging Face LLM
- Hugging Face Embedding Model
- Flask for API
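To follow along, the snippets below roughly rely on flask, chromadb, requests, and huggingface_hub (plus pypdf and python-docx if you use the optional extraction sketch), all installable with pip.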
Below are the code snippets for the different steps. For the complete walkthrough and the GitHub repository link, you can go through my YouTube video.
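Extract Text and Chunk the Documents:
The repository handles extraction and chunking in its own way; the snippet below is only a minimal sketch, assuming pypdf for PDFs and python-docx for Word files, with a naive fixed-size character chunker. The function names (extract_text, chunk_text) are illustrative.
from pypdf import PdfReader
from docx import Document

def extract_text(path: str) -> str:
    # PDF via pypdf, .docx via python-docx, anything else read as plain text
    if path.lower().endswith(".pdf"):
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if path.lower().endswith(".docx"):
        doc = Document(path)
        return "\n".join(p.text for p in doc.paragraphs)
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Fixed-size character chunks with a small overlap between consecutive chunks
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks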
Initialize Chroma DB and Create a Collection
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="docs")
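PersistentClient writes the collection to disk at the given path, so the stored embeddings survive application restarts, and get_or_create_collection lets the same code run whether or not the collection already exists.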
Generate Embeddings:
import requests
from typing import List
import os
import logging
headers = {
    "Authorization": f"Bearer {os.getenv('HUGGINGFACEHUB_API_TOKEN')}",
    "Content-Type": "application/json"
}
logger = logging.getLogger(__name__)
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "sentence-transformers/distilbert-base-nli-mean-tokens")
# Function to get embedding for a given text using Hugging Face API
# This function assumes that the environment variable EMBEDDING_MODEL is set to a valid Hugging Face model ID
# Example: EMBEDDING_MODEL="sentence-transformers/distilbert-base-nli-mean-tokens"
def get_embedding(text: str) -> List[float]:
    try:
        url = f"https://api-inference.huggingface.co/models/{EMBEDDING_MODEL}"
        logger.debug(f"Embedding request URL: {url}")
        response = requests.post(url, headers=headers, json={"inputs": text}, timeout=60)
        response.raise_for_status()
        embedding = response.json()
        return embedding  # A list of floats for sentence-transformers models
    except Exception as e:
        logger.error(f"Error getting embedding: {e}")
        raise ValueError("Failed to get embedding from the model.")
Store embedded data in Chroma DB:
def add_documents_to_vector_store(docs: list[str], ids: list[str], filename: str) -> None:
    try:
        # One embedding per chunk; the filename is stored as metadata so queries can be scoped to uploaded files
        vectors = [get_embedding(text) for text in docs]
        collection.add(
            documents=docs,
            embeddings=vectors,
            ids=ids,
            metadatas=[{"filename": filename} for _ in docs]
        )
    except Exception as e:
        logger.error(f"Error adding documents to vector store: {e}")
        raise ValueError(f"Failed to add documents to vector store: {e}")
Query Chroma DB:
def query_similar_documents(query: str, top_k: int = 3) -> list[str]:
    try:
        # get_uploaded_files() is defined elsewhere in the app and returns the filenames uploaded so far
        context_files = get_uploaded_files()
        print(f"Context files available: {context_files}")
        if len(context_files) == 0:
            return []
        query_vector = get_embedding(query)
        results = collection.query(
            query_embeddings=[query_vector],
            n_results=top_k,
            where={"filename": {"$in": context_files}}
        )
        return results["documents"][0]
    except Exception as e:
        logger.error(f"Error querying similar documents: {e}")
        raise ValueError(f"Failed to query similar documents: {e}")
Combine with LLM for Answer Generation:
import os
import logging
from huggingface_hub import InferenceClient
from app.core.chroma_store import query_similar_documents
logger = logging.getLogger(__name__)
HUGGINGFACE_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
LLM_MODEL = os.getenv("LLM_MODEL")  # e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
def build_client() -> InferenceClient:
    return InferenceClient(
        model=LLM_MODEL,
        token=HUGGINGFACE_API_TOKEN,
        timeout=300
    )

def get_llm_response(prompt: str):
    try:
        client = build_client()
        SYSTEM_PROMPT = """You are a personal assistant specialized in providing well-formatted answers from the context provided."""
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Provide a well-formatted answer from the given context.\n\n{prompt}"}
        ]
        # Low temperature keeps the answer close to the retrieved context
        result = client.chat.completions.create(
            model=LLM_MODEL,
            messages=messages,
            temperature=0.1,
            max_tokens=200
        )
        text = result.choices[0].message.content
        # Some providers return the content as a list of parts; join them into a single string
        if isinstance(text, list):
            text = "".join(part.get("text", "") if isinstance(part, dict) else str(part) for part in text)
        return text
    except Exception as e:
        logger.error(f"Error getting LLM response: {e}")
        raise ValueError("Failed to get response from the LLM model.")
✅ Full Video Tutorial
📌 Watch the complete step-by-step guide with implementation here:
👉 Build RAG-Based Chatbot with Multi-File Upload + Chroma DB