Building Custom AI Applications with LLMs and RAG

Introduction: The Power Duo – LLMs and RAG

Large Language Models (LLMs) like GPT-4, Llama 2, or Mistral have demonstrated incredible capabilities in understanding and generating human-like text. They are trained on vast amounts of data, enabling them to perform tasks like summarization, translation, content creation, and answering general questions.

However, LLMs have limitations:

  1. Knowledge Cut-off: Their knowledge is limited to their training data, meaning they can’t access real-time or very recent information.
  2. Hallucinations: They can sometimes generate plausible but factually incorrect or nonsensical information.
  3. Lack of Domain Specificity: While general-purpose, they might lack the deep, specific knowledge required for niche applications (e.g., internal company policies, specialized medical research).

This is where Retrieval-Augmented Generation (RAG) comes in. RAG enhances LLMs by giving them access to external, authoritative knowledge bases during the generation process. Instead of relying solely on their internal, static knowledge, RAG allows LLMs to “look up” information and use it to formulate more accurate, contextually relevant, and up-to-date responses.

How RAG works with LLMs: Imagine an LLM as a brilliant student who has read many books. RAG is like giving that student access to a meticulously organized library and teaching them how to find and cite the most relevant books before answering a question.

The core idea is a two-step process (a short code sketch follows this list):

  1. Retrieval: Given a user query, the system first retrieves relevant information from a pre-defined knowledge base (your custom data).
  2. Generation: This retrieved information, along with the original query, is then fed to the LLM as additional context, guiding it to generate a more informed and accurate response.
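
Sketched as plain Python, that loop looks roughly like the following. This is conceptual only: retriever and llm are placeholders here, and the rest of this tutorial shows how to build real ones with LangChain, a vector database, and an LLM of your choice.

def answer_with_rag(query, retriever, llm):
    # 1. Retrieval: fetch the chunks most relevant to the query
    relevant_chunks = retriever.invoke(query)
    context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)

    # 2. Generation: let the LLM answer using the retrieved context as grounding
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt)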

Benefits of RAG for Custom AI Applications:

  • Improved Accuracy and Reduced Hallucinations: By grounding responses in factual, external data, RAG significantly reduces the chances of the LLM “hallucinating” or providing incorrect information.
  • Access to Up-to-Date and Domain-Specific Knowledge: You can feed RAG systems with your latest internal documents, proprietary databases, or real-time information, ensuring the LLM’s responses are current and highly relevant to your specific domain.
  • Cost-Effective Customization: Instead of expensive and time-consuming fine-tuning or retraining of LLMs for specific tasks or data, RAG provides a more efficient way to customize their knowledge.
  • Enhanced Transparency and Explainability: Because the LLM’s response is based on retrieved documents, you can often provide citations or point to the source material, increasing trust and allowing for verification.
  • Scalability: RAG systems can be scaled to handle large volumes of data and queries, making them suitable for enterprise-level applications.
  • Flexibility: Easily update the knowledge base without modifying the LLM itself, making your application adaptable to evolving information.

Step-by-Step Tutorial: Building a Custom AI Application with LLMs and RAG

Let’s walk through the general steps to build a custom RAG-based AI application. We’ll use a conceptual example of a “Company Policy Q&A Bot.”

Phase 1: Data Preparation and Indexing (The Knowledge Base)

This is the foundation of your RAG system – preparing the external knowledge the LLM will retrieve from.

Step 1: Gather Your Data

Identify and collect all the relevant documents that your AI application needs to “know” about.

  • Examples for Company Policy Bot: PDF handbooks, internal wikis, HR documents, compliance guidelines, FAQs, internal memos.
  • Formats: Can be diverse – text files (.txt), PDFs, Word documents (.docx), web pages, Markdown files, database records, etc.

Step 2: Load and Process Documents

You’ll need to load these documents into a format your system can work with. Libraries like LangChain are excellent for this.

from langchain_community.document_loaders import PyPDFLoader, TextLoader, DirectoryLoader
import os

def load_documents(data_directory="data/"):
    documents = []
    # Load PDF files
    pdf_loader = DirectoryLoader(data_directory, glob="./*.pdf", loader_cls=PyPDFLoader)
    documents.extend(pdf_loader.load())

    # Load text files (if any)
    text_loader = DirectoryLoader(data_directory, glob="./*.txt", loader_cls=TextLoader)
    documents.extend(text_loader.load())

    # You can add other loaders for different file types
    print(f"Loaded {len(documents)} documents.")
    return documents

# Assuming your policy documents are in a 'data' folder
# documents = load_documents()

Step 3: Chunk Documents

LLMs have a limited context window (the amount of text they can process at once). Your documents are likely too long to fit. You need to break them down into smaller, manageable “chunks.”

Strategy: Use a TextSplitter (e.g., RecursiveCharacterTextSplitter) that attempts to split semantically (by paragraphs, sentences) before resorting to fixed-size splits. Overlapping chunks can help maintain context.

from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,  # Max characters per chunk
        chunk_overlap=200, # Overlap between chunks for context
        length_function=len,
        is_separator_regex=False,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks.")
    return chunks

# chunks = chunk_documents(documents)

Step 4: Create Embeddings

Embeddings are numerical representations (vectors) of text that capture its semantic meaning. Documents with similar meanings will have similar embedding vectors.

  • Choose an Embedding Model: You’ll need an embedding model (e.g., all-MiniLM-L6-v2, OpenAIEmbeddings, GoogleGenerativeAIEmbeddings). This model converts your text chunks into vectors.

from langchain_community.embeddings import SentenceTransformerEmbeddings # For local models
# Or from langchain_openai import OpenAIEmbeddings # For OpenAI
# Or from langchain_google_genai import GoogleGenerativeAIEmbeddings # For Google

def get_embeddings_model():
    # Using a local Sentence Transformer model for demonstration
    # pip install sentence-transformers
    model_name = "all-MiniLM-L6-v2"
    embeddings_model = SentenceTransformerEmbeddings(model_name=model_name)
    return embeddings_model

# embeddings_model = get_embeddings_model()

Step 5: Store Embeddings in a Vector Database

A vector database is specialized for storing and efficiently querying these embedding vectors. When a user asks a question, its embedding will be compared against the stored document embeddings to find the most relevant chunks.

  • Popular Choices: ChromaDB (local/embedded), Pinecone, Weaviate, Milvus, FAISS.

from langchain_community.vectorstores import Chroma
import shutil # For clearing directory

CHROMA_PATH = "chroma_db"

def create_vector_store(chunks, embeddings_model):
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH) # Clear existing DB for fresh start

    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings_model,
        persist_directory=CHROMA_PATH
    )
    vector_store.persist()  # recent Chroma versions persist automatically; kept for older releases
    print(f"Vector store created with {len(chunks)} chunks.")
    return vector_store

# Full data processing workflow
# documents = load_documents()
# chunks = chunk_documents(documents)
# embeddings_model = get_embeddings_model()
# vector_store = create_vector_store(chunks, embeddings_model)

Phase 2: Building the RAG Application (Query and Response)

Now that your knowledge base is indexed, you can build the application logic to answer user queries.

Step 6: Initialize the LLM

Choose and set up the Large Language Model you want to use. This could be a cloud-based API (e.g., OpenAI’s GPT models, Google’s Gemini) or a locally hosted model (e.g., via Ollama).

from langchain_openai import ChatOpenAI # For OpenAI
# Or from langchain_community.llms import Ollama # For local Ollama models

def get_llm():
    # For OpenAI, ensure OPENAI_API_KEY is set in your environment
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.1)
    # For Ollama:
    # llm = Ollama(model="mistral")
    return llm

# llm = get_llm()

Step 7: Implement the Retrieval Mechanism

When a user asks a question, you need to:

  1. Convert the query into an embedding using the same embedding model used for your documents.
  2. Use this query embedding to search your vector database for the most semantically similar (i.e., relevant) document chunks.

def retrieve_context(query, vector_store, k=4):
    # k is the number of relevant documents to retrieve
    retriever = vector_store.as_retriever(search_kwargs={"k": k})
    relevant_docs = retriever.invoke(query)
    return relevant_docs

# query = "What is the company's policy on remote work?"
# retrieved_context = retrieve_context(query, vector_store)
# print(f"Retrieved {len(retrieved_context)} relevant documents.")

Step 8: Construct the Augmented Prompt

Combine the original user query with the retrieved relevant document chunks. This augmented prompt is what you’ll send to the LLM.

  • Prompt Engineering: The way you structure this prompt is crucial. You want to instruct the LLM to use the provided context to answer the question and to state if it cannot find an answer within the given context.

from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

def build_rag_chain(llm, retriever):
    prompt_template = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful assistant for answering questions about company policies.
        Use the following context to answer the question. If you don't know the answer based on the provided context,
        simply state that you don't know, and do not make up an answer.
        ---
        {context}
        """),
        ("human", "{input}")
    ])

    document_chain = create_stuff_documents_chain(llm, prompt_template)
    retrieval_chain = create_retrieval_chain(retriever, document_chain)
    return retrieval_chain

# retriever = vector_store.as_retriever(search_kwargs={"k": 4})
# rag_chain = build_rag_chain(llm, retriever)

Step 9: Generate the Response

Send the augmented prompt to the LLM and receive its generated response.

# response = rag_chain.invoke({"input": "What is the company's policy on remote work?"})
# print(response["answer"])
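
For transparency (one of the benefits noted earlier), you can also surface the documents the answer was grounded in. As a hedged note: with create_retrieval_chain, the result dictionary usually carries the retrieved chunks under a “context” key, though exact keys can vary across LangChain versions.

# Optional: list the source documents behind the answer
# for doc in response.get("context", []):
#     print("Source:", doc.metadata.get("source", "unknown"))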

Phase 3: Deployment and Iteration

Step 10: Build a User Interface (Optional but Recommended)

For a practical application, you’ll want a way for users to interact with your bot.

  • Simple Web App: Frameworks like Streamlit, Gradio, or Flask can be used to create a simple chat interface; a minimal Streamlit sketch follows below.
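
The sketch below shows roughly what that could look like with Streamlit. It assumes the helper functions from earlier steps live in a module of your own (hypothetically named rag_pipeline.py) and that the Chroma index from Phase 1 already exists on disk; treat it as a starting point, not a finished app.

import streamlit as st
from langchain_community.vectorstores import Chroma

# Hypothetical module holding the functions defined in Steps 4, 6, and 8
from rag_pipeline import get_embeddings_model, get_llm, build_rag_chain

CHROMA_PATH = "chroma_db"

@st.cache_resource  # build the chain once per session, not on every rerun
def load_chain():
    embeddings_model = get_embeddings_model()
    vector_store = Chroma(persist_directory=CHROMA_PATH, embedding_function=embeddings_model)
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    return build_rag_chain(get_llm(), retriever)

st.title("Company Policy Q&A Bot")
question = st.text_input("Ask a question about company policy:")
if question:
    response = load_chain().invoke({"input": question})
    st.write(response["answer"])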

Step 11: Testing and Evaluation

  • Test with Diverse Queries: Ask questions that are both directly answerable by your documents and those that aren’t (a simple spot-check script is sketched after this list).
  • Monitor for Hallucinations: Check if the bot ever provides incorrect information.
  • Evaluate Relevance: Are the retrieved documents actually relevant to the query?
  • User Feedback: Implement a feedback mechanism to gather user input for continuous improvement.
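
One lightweight way to run such spot checks is a small script that replays a handful of test questions through the chain and prints both the answer and the retrieved sources. This is a hypothetical harness that assumes the rag_chain from Step 8 is in scope; the questions are placeholders for your own test set.

test_questions = [
    "What is the company's policy on remote work?",   # should be answerable from the docs
    "How many vacation days do new hires receive?",   # answerable if documented
    "What is the weather in Paris today?",            # should produce an "I don't know"
]

for question in test_questions:
    response = rag_chain.invoke({"input": question})
    print(f"Q: {question}")
    print(f"A: {response['answer']}")
    for doc in response.get("context", []):  # inspect retrieval relevance
        print("  retrieved from:", doc.metadata.get("source", "unknown"))
    print("-" * 40)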

Step 12: Continuous Improvement and Updates

  • Update Knowledge Base: As company policies change or new information becomes available, update your documents, re-chunk them, and re-index your vector database. This is a key advantage of RAG – you don’t need to retrain the LLM!
  • Refine Prompt Engineering: Experiment with different system prompts to guide the LLM’s behavior and improve response quality.
  • Consider Advanced RAG Techniques: For more complex scenarios, explore techniques like re-ranking retrieved documents, query expansion, or using a “router” to direct queries to different knowledge bases (a simple query-expansion sketch follows below).
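
As one illustration of query expansion, the sketch below asks the LLM for a couple of alternative phrasings of the user’s question, retrieves chunks for each phrasing, and de-duplicates the results before they are handed to the generation step. It reuses the llm and vector_store objects from earlier steps and is an assumption-laden starting point, not a library recipe.

def expanded_retrieve(query, llm, vector_store, k=4):
    # Ask the LLM for alternative phrasings of the question (query expansion)
    rewrite_prompt = "Rewrite the following question in two different ways, one per line:\n" + query
    rewrites = llm.invoke(rewrite_prompt).content.splitlines()

    retriever = vector_store.as_retriever(search_kwargs={"k": k})
    seen, merged = set(), []
    for candidate in [query] + [r.strip() for r in rewrites if r.strip()]:
        for doc in retriever.invoke(candidate):
            if doc.page_content not in seen:  # drop duplicate chunks across phrasings
                seen.add(doc.page_content)
                merged.append(doc)
    return merged

# expanded_docs = expanded_retrieve("What is the remote work policy?", llm, vector_store)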

Example Application Flow (Company Policy Q&A Bot)

  1. User: “What is the company’s policy on working from home?”
  2. Retrieval Step:
    • The user’s question is embedded into a vector.
    • This vector is used to query the chroma_db (vector database).
    • The vector database returns the top k (e.g., 4) most similar document chunks from your stored policy documents (e.g., paragraphs from the “Remote Work Policy” document).
  3. Augmentation Step:
    • The original question and the retrieved document chunks are combined into a single prompt: “You are a helpful assistant for answering questions about company policies. Use the following context to answer the question. If you don’t know the answer based on the provided context, simply state that you don’t know, and do not make up an answer. [Retrieved chunk 1: “Employees are eligible for remote work…”] [Retrieved chunk 2: “Approval process for remote work involves…”] [Retrieved chunk 3: “Equipment provided for remote setups…”] What is the company’s policy on working from home?”
  4. Generation Step:
    • This augmented prompt is sent to the LLM (e.g., GPT-3.5-turbo).
    • The LLM uses its general language understanding capabilities, combined with the specific context provided, to generate an accurate and relevant answer about the company’s remote work policy.
  5. Output: The LLM provides the answer to the user.
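
Putting the pieces together, the end-to-end wiring of this flow is roughly the following script, which simply reuses the functions defined in Steps 2 through 8:

if __name__ == "__main__":
    # Phase 1: build (or rebuild) the knowledge base
    documents = load_documents("data/")
    chunks = chunk_documents(documents)
    embeddings_model = get_embeddings_model()
    vector_store = create_vector_store(chunks, embeddings_model)

    # Phase 2: answer a question against it
    llm = get_llm()
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    rag_chain = build_rag_chain(llm, retriever)

    response = rag_chain.invoke({"input": "What is the company's policy on working from home?"})
    print(response["answer"])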

Conclusion

By combining the powerful generative capabilities of LLMs with the precise, up-to-date information retrieval of RAG, you can build highly effective and domain-specific AI applications. This tutorial provides a foundational understanding and practical steps to get started. Remember, continuous testing, iteration, and updating your knowledge base are key to maintaining a valuable and reliable custom AI solution. Good luck with your AI development endeavors!
