
Document Chunking Strategies for RAG

Author: Pyland
Published: 30.06.2026

Embedding a long document as a single vector averages out its meanings, so a specific question often fails to retrieve the right paragraph. The solution is to split the document into chunks and index each one separately.

Basic Chunking with Overlap

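def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of chunk_size characters; neighbors share overlap characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would make step <= 0 and the loop would never end
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks = []
    step = chunk_size - overlap  # step is smaller than size → creates overlap
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this chunk reached the end; skip a tail that would only repeat it
        start += step

    return chunks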

Example (character offsets, end exclusive):
- chunk_size=500, overlap=50 → step=450
- Chunk 1: characters 0–500
- Chunk 2: characters 450–950
- Chunk 3: characters 900–1400

Overlap reduces the loss of meaning at chunk boundaries: any span of text no longer than the overlap appears whole in at least one chunk.
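To check the offsets in the example, here is a minimal sketch that computes chunk spans without touching any text. The helper name `chunk_spans` is hypothetical (it is not part of the article's code), and it assumes chunking stops as soon as a chunk reaches the end of the text.

```python
def chunk_spans(text_len: int, chunk_size: int = 500, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) character offsets, end exclusive, for each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break  # assumption: stop once a chunk reaches the end of the text
        start += step
    return spans


spans = chunk_spans(1400)
print(spans)  # [(0, 500), (450, 950), (900, 1400)]

# Characters shared by each pair of neighboring chunks
shared = [prev_end - next_start for (_, prev_end), (next_start, _) in zip(spans, spans[1:])]
print(shared)  # [50, 50]
```

Each neighboring pair shares exactly `overlap` characters, which is what keeps short boundary-straddling passages intact in one of the two chunks.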

Why One Large Chunk Doesn’t Work

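# `collection` is assumed to be an existing vector-store collection whose
# add() accepts documents and ids (for example, a ChromaDB collection).

# ❌ Bad: the entire document as a single chunk
collection.add(documents=[full_text], ids=["doc"])
# One embedding has to represent every topic in the document → imprecise search

# ✅ Good: split into pieces, one embedding per chunk
chunks = chunk_text(full_text)
collection.add(
    documents=chunks,
    ids=[f"doc_chunk_{i}" for i in range(len(chunks))]
)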
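The dilution effect can be illustrated without a real model. The sketch below uses a toy bag-of-words vector as a stand-in for an embedding (`toy_embed` and `cosine` are hypothetical helpers, and real embedding models behave differently in detail), but the same pattern shows up: the query scores higher against the one relevant chunk than against the whole document.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


paragraphs = [
    "cats sleep most of the day",
    "python lists are mutable sequences",
    "volcanoes erupt when magma rises",
]
full_text = " ".join(paragraphs)
query = toy_embed("python lists mutable")

# One vector for the whole document vs. one vector per paragraph
whole_score = cosine(query, toy_embed(full_text))
best_chunk_score = max(cosine(query, toy_embed(p)) for p in paragraphs)

print(round(whole_score, 2), round(best_chunk_score, 2))  # 0.43 0.77
```

The unrelated paragraphs add length to the document vector without adding any overlap with the query, so the similarity drops.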

Metadata for Traceability

def chunk_document(text: str, source: str, chunk_size: int = 500, overlap: int = 50) -> list[dict]:
    """Chunk text and attach an ID and source metadata to every chunk."""
    chunks = chunk_text(text, chunk_size, overlap)
    return [
        {
            "text": chunk,
            "id": f"{source}_chunk_{i}",  # unique as long as source names are unique
            "metadata": {
                "source": source,
                "chunk_index": i,
                "total_chunks": len(chunks)
            }
        }
        for i, chunk in enumerate(chunks)
    ]
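One way these records might be used, sketched under assumptions: `to_add_kwargs` and `cite` are hypothetical helpers, and the `metadatas` keyword assumes a store whose `add()` accepts per-document metadata alongside `documents` and `ids` (ChromaDB collections do; other stores may differ).

```python
def to_add_kwargs(records: list[dict]) -> dict[str, list]:
    """Turn chunk records into parallel lists for a collection.add(**kwargs) call."""
    return {
        "documents": [r["text"] for r in records],
        "ids": [r["id"] for r in records],
        "metadatas": [r["metadata"] for r in records],  # assumed keyword name
    }


def cite(metadata: dict) -> str:
    """Build a human-readable reference for a retrieved chunk."""
    return f'{metadata["source"]} (chunk {metadata["chunk_index"] + 1} of {metadata["total_chunks"]})'


# Records in the shape produced by chunk_document
records = [
    {"text": "first piece", "id": "notes.md_chunk_0",
     "metadata": {"source": "notes.md", "chunk_index": 0, "total_chunks": 2}},
    {"text": "second piece", "id": "notes.md_chunk_1",
     "metadata": {"source": "notes.md", "chunk_index": 1, "total_chunks": 2}},
]

kwargs = to_add_kwargs(records)
# collection.add(**kwargs)  # `collection` is assumed to exist, as in the earlier snippet
print(cite(records[1]["metadata"]))  # notes.md (chunk 2 of 2)
```

With the metadata stored next to each chunk, a search hit can be traced back to its file and position, which is what makes citing sources in an answer possible.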

Loading a Folder of Documents

from pathlib import Path

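def load_documents(folder: str) -> list[dict]:
    """Chunk every .txt and .md file in the top level of folder."""
    docs = []
    # sorted() makes the result order reproducible; iterdir() order is arbitrary
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in (".txt", ".md"):
            text = path.read_text(encoding="utf-8")
            docs.extend(chunk_document(text, source=path.name))
    return docs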
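`load_documents` reads only the top level of the folder and uses the bare file name as the source. If nested folders need to be covered, file names alone can collide and produce duplicate chunk IDs. A possible variant, offered as a sketch rather than the article's method, walks the tree and uses the relative path as the source; `iter_text_files` is a hypothetical helper.

```python
import tempfile
from collections.abc import Iterator
from pathlib import Path


def iter_text_files(folder: str, suffixes: tuple[str, ...] = (".txt", ".md")) -> Iterator[tuple[str, str]]:
    """Yield (source, text) for matching files under folder, including subfolders."""
    root = Path(folder)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            # The relative path keeps sources unique across subfolders
            source = path.relative_to(root).as_posix()
            yield source, path.read_text(encoding="utf-8")


# Demo on a temporary folder with two files that share the same name
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("top-level notes", encoding="utf-8")
    (root / "sub" / "a.txt").write_text("nested notes", encoding="utf-8")
    (root / "skip.png").write_bytes(b"\x89PNG")
    found = list(iter_text_files(tmp))

print([source for source, _ in found])  # ['a.txt', 'sub/a.txt']
```

Each `(source, text)` pair can then be passed to `chunk_document(text, source=source)` in place of `path.name`.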

Chunking Parameters

chunk_size (characters) | overlap (characters) | When to use
200–300 | 20–30 | Precise search over short facts
500–800 | 50–100 | Balanced (most use cases)
1000–2000 | 100–200 | Long reasoning, rich context

A chunk that is too small loses context. A chunk that is too large dilutes meaning.
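To get a feel for what each setting costs, the following sketch estimates how many chunks a document produces. `estimate_chunks` is a hypothetical helper; it counts the minimum number of overlapping windows needed to cover the text, so a particular chunker may emit one extra short tail chunk.

```python
import math


def estimate_chunks(text_len: int, chunk_size: int, overlap: int) -> int:
    """Minimum number of overlapping windows needed to cover text_len characters."""
    if text_len <= 0:
        return 0
    step = chunk_size - overlap
    return max(1, math.ceil((text_len - overlap) / step))


# Lower bound of each row in the table, applied to a 10,000-character document
for chunk_size, overlap in [(200, 20), (500, 50), (1000, 100)]:
    count = estimate_chunks(10_000, chunk_size, overlap)
    duplicated = overlap / chunk_size  # share of each chunk repeated from its neighbor
    print(f"chunk_size={chunk_size}, overlap={overlap}: {count} chunks, {duplicated:.0%} duplicated")
# chunk_size=200, overlap=20: 56 chunks, 10% duplicated
# chunk_size=500, overlap=50: 23 chunks, 10% duplicated
# chunk_size=1000, overlap=100: 11 chunks, 10% duplicated
```

Smaller chunks mean more vectors to embed and store, while the overlap ratio determines how much text is embedded twice.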
