An embedding is a numerical vector that represents a piece of text. Texts that are similar in meaning get vectors that are close to each other. This is the foundation of semantic search in RAG systems.
Analogy
Imagine a space with coordinates. Each word is a point:
- “cat” and “kitten” — nearly in the same spot
- “cat” and “automobile” — far apart
- “Python” and “programming” — close together
An embedding is exactly those coordinates in such a space, just with hundreds of dimensions.
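The analogy can be made concrete with a toy example. This is a minimal sketch: the two-dimensional coordinates below are invented for illustration and do not come from any model, and the helper names `euclidean` and `nearest` are hypothetical.

```python
import math

# Hypothetical 2-D coordinates, invented for illustration only.
# Real embeddings come from a model and have hundreds of dimensions.
points = {
    "cat": (0.90, 0.80),
    "kitten": (0.85, 0.90),
    "automobile": (-0.70, 0.20),
}


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)


def nearest(word):
    """Return the other word whose point lies closest to `word`."""
    others = (w for w in points if w != word)
    return min(others, key=lambda w: euclidean(points[word], points[w]))


print(nearest("cat"))  # kitten
```

Semantic search does the same thing at scale: it looks up the stored points nearest to the query's point.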
How ChromaDB creates embeddings
ChromaDB automatically creates embeddings during add() using the built-in all-MiniLM-L6-v2 model:
import chromadb
collection = chromadb.PersistentClient("./db").get_or_create_collection("docs")
# ChromaDB computes embeddings for each text automatically
collection.add(
    documents=["Python is a programming language", "The cat is sitting on the sofa"],
    ids=["doc_0", "doc_1"],
)
# During search, the query is also turned into an embedding
results = collection.query(
    query_texts=["How do I start programming?"],
    n_results=1,
)
# Returns "Python is a programming language" as the closest match
Cosine distance
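ChromaDB's default distance metric is squared L2, so the scale below applies only to collections created with cosine distance, for example get_or_create_collection("docs", metadata={"hnsw:space": "cosine"}). Cosine distance ranges from 0 to 2: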
distance = 0.0 → texts are identical
distance = 0.3 → very similar (relevance ~70%)
distance = 0.7 → weak connection
distance = 2.0 → opposite in meaning
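results = collection.query(query_texts=["question"], n_results=5, include=["distances"])
for dist in results["distances"][0]:
    # 1 - cosine distance is the cosine similarity (from -1 to 1);
    # reading it as a percentage is a rough heuristic
    relevance = 1 - dist
    print(f"Relevance: {relevance:.1%}")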
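To show where the 0 to 2 range comes from, here is a minimal sketch of cosine distance computed by hand. The function name `cosine_distance` and the sample vectors are assumptions made for illustration, and this is not how ChromaDB computes distances internally.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


v = [1.0, 0.0]
print(cosine_distance(v, [3.0, 0.0]))   # 0.0 (same direction, length ignored)
print(cosine_distance(v, [0.0, 1.0]))   # 1.0 (perpendicular, unrelated)
print(cosine_distance(v, [-1.0, 0.0]))  # 2.0 (opposite direction)
```

Only the angle between the vectors matters, not their length, which is why a long text and a short text can still end up close together.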
Why not plain keyword search?
Question: "How do I write a function?"
Text: "def lets you declare a subroutine in Python"
Keyword search finds no match, because the question and the text share no content words. Semantic search gives the text high relevance, because the meanings are close.
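The keyword side of this comparison can be checked with a few lines of code. This is a rough sketch of a naive keyword matcher, not a real search engine: the `keywords` helper and the tiny stop-word list are assumptions made for illustration.

```python
import re

# A tiny, hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "do", "how", "i", "in", "you"}


def keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


question = "How do I write a function?"
text = "def lets you declare a subroutine in Python"

shared = keywords(question) & keywords(text)
print(shared)  # set()
```

The empty set means a keyword matcher has nothing to go on, even though "function" and "subroutine" describe the same idea.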
Limitations of ChromaDB’s built-in model
- The all-MiniLM-L6-v2 model handles English better than other languages
- For better quality with non-English text: OpenAI text-embedding-3-small or text-embedding-ada-002
# Example with OpenAI embeddings (for reference)
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
# Create the client that the original snippet used without defining
client = chromadb.PersistentClient("./db")
ef = OpenAIEmbeddingFunction(api_key="sk-...", model_name="text-embedding-3-small")
collection = client.get_or_create_collection("docs", embedding_function=ef)