RAG: Chatting with Documents via Vector Search

Author: Pyland

Published: 30.06.2026

#llm #ai

RAG (Retrieval-Augmented Generation) is a pattern for letting a model answer questions about your own documents. Instead of fine-tuning the model, you retrieve the relevant fragments at query time and pass them in as context.

The problem: LLMs hallucinate

Claude's training data ends at a cutoff date and does not include your private documents. Asked about them without extra context, the model can generate plausible-sounding but fabricated text.

RAG solves this

User question
      ↓
Vector search over the document database
      ↓
Relevant fragments (chunks) retrieved
      ↓
Fragments → Claude's system prompt
      ↓
Claude answers based on the real text
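
To make the "vector search" step concrete, here is a minimal sketch in plain Python. The three-dimensional vectors, the sample sentences, and the `top_k` helper are made up for illustration; in a real system the vectors come from an embedding model and the lookup is done by a vector database.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2):
    """Rank stored chunks by distance to the query vector, closest first."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in index]
    return sorted(scored)[:k]

# Hypothetical toy "embeddings": (chunk text, vector)
INDEX = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on Sundays.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # stands in for an embedded question about refunds
nearest = top_k(query_vec, INDEX, k=2)
for distance, text in nearest:
    print(f"{distance:.3f}  {text}")
```

The two refund sentences come out closest because their vectors point in nearly the same direction as the query; the unrelated sentence ends up with a distance close to 1.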

Implementation

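import anthropic
import chromadb

client_ai = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
chroma = chromadb.PersistentClient(path="./chroma_db")
# Use cosine distance so the 0.5 threshold and the "1 - distance" relevance
# score are meaningful; Chroma's default metric is squared L2.
# The metric is fixed when the collection is first created.
collection = chroma.get_or_create_collection(
    "docs", metadata={"hnsw:space": "cosine"}
)

SYSTEM_BASE = "You are an assistant. Answer only based on the provided documents."

def search(query: str, n: int = 3) -> list[dict]:
    """Return up to n chunks closest to the query, dropping weak matches."""
    results = collection.query(
        query_texts=[query],
        n_results=n,
        include=["documents", "metadatas", "distances"]
    )
    return [
        # Fall back to "unknown" when a chunk was stored without a source
        {"text": doc, "source": (meta or {}).get("source", "unknown"), "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )
        if dist < 0.5  # relevance filter: a smaller distance is a closer match
    ]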

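def build_context(chunks: list[dict]) -> str:
    """Format the retrieved chunks as labeled fragments for the system prompt."""
    if not chunks:
        return ""
    parts = []
    for i, chunk in enumerate(chunks, 1):
        # Display-only score; clamped to 0..1 in case the distance falls outside that range
        relevance = min(1.0, max(0.0, 1 - chunk["distance"]))
        parts.append(
            f"[Fragment {i} | Source: {chunk['source']} | Relevance: {relevance:.0%}]\n"
            f"{chunk['text']}"
        )
    return "\n\n".join(parts)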

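def rag_chat(question: str, history: list[dict]) -> str:
    """Answer a question from the retrieved chunks and record the turn in history."""
    chunks = search(question)
    context = build_context(chunks)

    system = SYSTEM_BASE
    if context:
        system += f"\n\nDocuments:\n{context}"
    else:
        system += "\n\nNo documents found. Let the user know."

    # Build the request from a copy so a failed API call leaves history untouched
    messages = history + [{"role": "user", "content": question}]
    response = client_ai.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=messages,
    )
    # Join the text blocks of the reply
    answer = "".join(block.text for block in response.content if block.type == "text")
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant", "content": answer})
    return answer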
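
The implementation above queries a collection that is assumed to be populated already. A minimal sketch of that indexing step might look like the following. The helpers `chunk_text` and `index_document`, the character-based sizes, and the `"{source}-{i}"` id scheme are assumptions for illustration, not part of the original code; `collection` is expected to be a Chroma collection, whose `add` method takes `ids`, `documents`, and `metadatas`.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

def index_document(collection, text: str, source: str,
                   chunk_size: int = 500, overlap: int = 50) -> int:
    """Chunk one document, add the chunks to the collection, return the chunk count."""
    chunks = chunk_text(text, chunk_size, overlap)
    if not chunks:
        return 0
    collection.add(
        ids=[f"{source}-{i}" for i in range(len(chunks))],  # hypothetical id scheme
        documents=chunks,
        metadatas=[{"source": source} for _ in chunks],  # read back as meta["source"]
    )
    return len(chunks)

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Storing `source` in the metadata is what lets the search step report where each fragment came from. Splitting by characters is the simplest option; splitting by sentences, paragraphs, or tokens is a common alternative.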

Key RAG decisions

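Decision	Impact
chunk_size	Search precision: small chunks match narrowly but lose surrounding context, large chunks blur the match
n_results	Amount of context passed to the model, and the prompt tokens spent on it
distance threshold	Filtering out irrelevant results; set too strictly, it also drops relevant ones
System prompt	Behavior when no data is found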
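
The distance threshold and the amount of context interact, which a small sketch can show. The `select_chunks` helper, its `max_chars` budget, and the sample hits below are hypothetical; the hit dictionaries use the same `text` / `source` / `distance` keys that `search` returns.

```python
def select_chunks(hits: list[dict], max_distance: float = 0.5,
                  max_chars: int = 2000) -> list[dict]:
    """Keep the closest hits under max_distance until the character budget is spent."""
    selected, used = [], 0
    for hit in sorted(hits, key=lambda h: h["distance"]):
        if hit["distance"] >= max_distance:
            break  # hits are sorted, so every later one is even further away
        if used + len(hit["text"]) > max_chars:
            continue  # too long for the remaining budget; a shorter hit may still fit
        selected.append(hit)
        used += len(hit["text"])
    return selected

# Made-up search results with made-up distances
sample_hits = [
    {"text": "A" * 900, "source": "a.md", "distance": 0.12},
    {"text": "B" * 900, "source": "b.md", "distance": 0.31},
    {"text": "C" * 900, "source": "c.md", "distance": 0.44},
    {"text": "D" * 100, "source": "d.md", "distance": 0.48},
    {"text": "E" * 100, "source": "e.md", "distance": 0.73},
]

strict = select_chunks(sample_hits, max_distance=0.2)
default = select_chunks(sample_hits)
print([h["source"] for h in strict])
print([h["source"] for h in default])
```

With the strict threshold only the closest fragment survives. With the default settings the third fragment is skipped because it would exceed the character budget, while the short fourth one still fits.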
