This guide covers embeddings in depth: the foundation of modern Retrieval-Augmented Generation (RAG) systems and semantic search. You will learn what embeddings are, how they work, which models to choose in 2026, practical implementation patterns with LangChain/LangGraph, chunking strategies, quality considerations, and production best practices with complete, runnable code.
Embeddings
What Are Embeddings?
Embeddings are dense vector representations of text (or other data) in a high-dimensional space where semantic similarity is captured by vector proximity (usually cosine similarity or Euclidean distance).
Two similar sentences will have vectors that point in nearly the same direction, while unrelated sentences will be far apart. This enables fast, meaningful search without exact keyword matching.
Semantic Vector Representations
- Sparse vectors (e.g., TF-IDF, BM25): High-dimensional, mostly zeros, keyword-based.
- Dense embeddings: Low-to-medium dimensional (256–3072), floating-point numbers, capture meaning.
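To make the sparse side of this contrast concrete, here is a minimal sketch using only the standard library. The `bag_of_words` helper and the two toy sentences are illustrative assumptions; real sparse retrieval would apply TF-IDF or BM25 weighting rather than raw counts.

```python
from collections import Counter

def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    """Return a sparse count vector: one slot per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["the car is fast", "the automobile is quick"]
vocabulary = sorted({word for doc in docs for word in doc.split()})

vec_a = bag_of_words(docs[0], vocabulary)
vec_b = bag_of_words(docs[1], vocabulary)

# Only the function words overlap; "car"/"automobile" and "fast"/"quick"
# occupy different slots, so keyword matching cannot see that they are synonyms.
shared_terms = [w for w, a, b in zip(vocabulary, vec_a, vec_b) if a and b]
print(vocabulary)
print(vec_a, vec_b)
print(shared_terms)
```

A dense embedding model would be expected to place these two sentences close together, which is the gap that semantic search fills.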
How Embeddings Work
- Input text → Tokenizer → Transformer model (or other architecture)
- Model outputs a fixed-size vector (e.g., 1536 dimensions)
- Vectors are stored in a vector database with metadata
- At query time: embed query → nearest-neighbor search (ANN)
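The flow above can be sketched end to end with a brute-force search over an in-memory list. Everything here is a stand-in: `toy_embed` is a hypothetical letter-frequency function, not a real embedding model, and a production system would use a vector database with an ANN index instead of a linear scan.

```python
import math

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a model: unit-length letter-frequency vector (26 dims)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# "Vector store": each record keeps the vector plus its text and metadata.
store = [
    {"vector": toy_embed(text), "text": text, "metadata": {"source": source}}
    for text, source in [
        ("LangGraph builds stateful agent graphs", "langgraph.md"),
        ("Chroma stores vectors on disk", "chroma.md"),
    ]
]

def search(query: str, k: int = 1) -> list[dict]:
    """Embed the query, then rank records by dot product (cosine for unit vectors)."""
    q = toy_embed(query)
    ranked = sorted(
        store,
        key=lambda rec: sum(a * b for a, b in zip(q, rec["vector"])),
        reverse=True,
    )
    return ranked[:k]

top = search("stateful agent graphs in LangGraph")
print(top[0]["metadata"]["source"])
```

The fixed output size matters: every text, short or long, maps to the same number of dimensions, so any query vector can be compared against any stored vector.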
Embedding Models
Popular options in 2026:
| Model | Provider | Dimensions | Strengths | Cost / Speed |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Excellent quality, easy API | Low cost, fast |
| text-embedding-3-large | OpenAI | 3072 | Highest quality from OpenAI | Higher cost |
| voyage-3-large | Voyage AI | 1024/2048 | Top retrieval performance | Competitive |
| BGE / E5 / Stella | Hugging Face | 768–1024 | Strong open-source, privacy | Free (local) |
| Qwen3-Embedding / Jina v4 | Various | Flexible | Multilingual, lightweight | Excellent open models |
OpenAI Embeddings
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # or "text-embedding-3-large"
    # dimensions=1024,  # optional reduction (Matryoshka)
    openai_api_key="...",  # or omit this and set the OPENAI_API_KEY environment variable
)

vector = embeddings.embed_query("What is LangGraph?")
print(len(vector))  # 1536
Local Embedding Models
import torch  # needed for the CUDA availability check below
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",  # strong performer
    # model_name="intfloat/e5-mistral-7b-instruct",
    # model_name="Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True},  # important for cosine similarity
)

# Batch embedding
texts = ["Document one...", "Document two..."]
vectors = embeddings.embed_documents(texts)
Text Similarity Concepts
- Cosine Similarity: Most common for embeddings (angle between vectors)
- Euclidean / L2 Distance: Used by some vector stores
- Dot Product: Fast when vectors are normalized
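from sklearn.metrics.pairwise import cosine_similarity

# vec1 and vec2 can be any two embedding vectors of the same length
vec1 = embeddings.embed_query("What is LangGraph?")
vec2 = embeddings.embed_query("How does LangGraph work?")
sim = cosine_similarity([vec1], [vec2])[0][0]  # closer to 1.0 means more similar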
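The three metrics are closely related once vectors have unit length. The sketch below, using only the standard library and made-up 3-dimensional vectors, checks that the dot product then equals cosine similarity and that squared L2 distance equals 2 - 2 * cosine, so all three produce the same ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up vectors, scaled to unit length.
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])

cos_ab = cosine(a, b)
dot_ab = dot(a, b)
squared_l2 = l2_distance(a, b) ** 2

print(cos_ab, dot_ab, squared_l2)
```

For unnormalized vectors these identities do not hold, which is why the metric configured in the vector store needs to match how the vectors were produced.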
Embedding Dimensions
Higher dimensions usually mean better quality, at the cost of more storage and slower search.
Many modern models, including OpenAI's text-embedding-3 family, are trained with Matryoshka Representation Learning, so you can truncate their vectors to fewer dimensions with minimal quality loss.
Example with OpenAI:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)
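If you shorten stored vectors yourself instead of using the `dimensions` parameter, the truncated vector is no longer unit length and should be re-normalized before cosine or dot-product search. A minimal sketch, where `truncate_and_renormalize` is a hypothetical helper and the input vector is made up:

```python
import math

def truncate_and_renormalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components, then rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]  # made-up unit vector standing in for a model output
short = truncate_and_renormalize(full, 2)
print(len(short), sum(x * x for x in short))
```

This only preserves quality for models trained with Matryoshka-style objectives; truncating an arbitrary model's vectors this way is likely to hurt retrieval.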
Chunking Before Embedding
Critical step. Bad chunking destroys retrieval quality.
Recursive Character Text Splitter (Recommended Baseline)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("docs/")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # measured in characters by default (very roughly 200 English tokens)
    chunk_overlap=150,  # also in characters
    separators=["\n\n", "\n", ".", " ", ""],
)

chunks = splitter.split_documents(docs)
print(len(chunks))
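To see what `chunk_size` and `chunk_overlap` mean in isolation, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on the listed separators); `sliding_window_chunks` is a hypothetical helper for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Simplified character-window splitter; ignores separators entirely."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces

demo_chunks = sliding_window_chunks(
    "abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3
)
print(demo_chunks)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.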
Semantic Chunking (Advanced)
from langchain_experimental.text_splitter import SemanticChunker

semantic_splitter = SemanticChunker(
    embeddings,  # any embeddings model
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=85,
)

chunks = semantic_splitter.split_documents(docs)
Embedding Large Documents
Best practices:
- Split into meaningful chunks (400–1000 tokens)
- Add contextual metadata (document title, section, page)
- Use hierarchical indexing (summary + detailed chunks)
- Consider parent-document retriever in LangChain
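The metadata and parent-document ideas above can be sketched without any framework. This is an illustration of the concept only, not LangChain's parent-document retriever API; `Chunk`, `chunk_with_context`, and the sample document are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_with_context(doc_id: str, title: str, sections: dict[str, str], size: int) -> list[Chunk]:
    """Split each section into fixed-size pieces and tag every piece with its origin."""
    result = []
    for section, body in sections.items():
        for i in range(0, len(body), size):
            result.append(
                Chunk(
                    text=body[i:i + size],
                    metadata={"parent_id": doc_id, "title": title, "section": section},
                )
            )
    return result

parents = {
    "doc-1": {"Intro": "Embeddings map text to vectors.", "Usage": "Embed, store, search."}
}
small_chunks = chunk_with_context("doc-1", "Embeddings Guide", parents["doc-1"], size=16)

# After retrieving a small chunk, follow parent_id back to the full document.
hit = small_chunks[0]
full_document = parents[hit.metadata["parent_id"]]
print(hit.metadata, list(full_document))
```

Small chunks keep the embedding focused, while the `parent_id` link lets you hand the model the wider context once a chunk matches.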
Embedding Quality Considerations
- Domain adaptation (fine-tune or use domain-specific models)
- Multilingual support
- Length normalization
- Batch size for speed
- Caching embeddings during development
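Caching is the easiest of these to sketch. The wrapper below memoizes vectors by a hash of the model name plus the text, so repeated runs during development only embed what is new. `CachedEmbedder` and `fake_embed` are hypothetical names; in practice the wrapped function would call a real model, and the cache would live on disk rather than in a dict.

```python
import hashlib

class CachedEmbedder:
    """Wraps any batch embed function and memoizes results by content hash."""

    def __init__(self, embed_fn, model_name: str):
        self.embed_fn = embed_fn
        self.model_name = model_name
        self.cache: dict[str, list[float]] = {}
        self.calls = 0  # number of texts actually sent to embed_fn

    def _key(self, text: str) -> str:
        # Include the model name so switching models never reuses stale vectors.
        return hashlib.sha256(f"{self.model_name}\n{text}".encode("utf-8")).hexdigest()

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Deduplicate while preserving order, then embed only uncached texts in one batch.
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self.cache]
        if missing:
            self.calls += len(missing)
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.cache[self._key(text)] = vector
        return [self.cache[self._key(t)] for t in texts]

def fake_embed(texts):  # hypothetical stand-in for a real model call
    return [[float(len(t))] for t in texts]

cached = CachedEmbedder(fake_embed, model_name="demo-model")
cached.embed_documents(["alpha", "beta", "alpha"])
cached.embed_documents(["alpha", "gamma"])
print(cached.calls)
```

Sending all uncached texts in a single call also covers the batch-size point: one larger request is usually cheaper than many single-text requests.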
Updating and Rebuilding Embeddings
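from langchain_chroma import Chroma

# Incremental update: new_docs holds newly loaded documents and
# vectorstore is an existing Chroma instance
new_chunks = splitter.split_documents(new_docs)
vectorstore.add_documents(new_chunks)

# Full rebuild (when changing model or chunking strategy);
# all_chunks is every chunk re-split with the new settings
vectorstore = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    collection_name="my_knowledge_base_v2",
    persist_directory="./chroma_db",
)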
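A full rebuild is needed because vectors from different models (or different chunking settings) are not comparable with each other. One way to make that explicit is to derive the collection name from the configuration, as in this sketch. `index_version` and `needs_rebuild` are hypothetical helpers, and the suffix scheme is an assumption rather than a Chroma convention.

```python
import hashlib
import json

def index_version(model_name: str, chunk_size: int, chunk_overlap: int) -> str:
    """Derive a short, deterministic tag from everything that affects the vectors."""
    config = {"model": model_name, "chunk_size": chunk_size, "chunk_overlap": chunk_overlap}
    digest = hashlib.sha256(json.dumps(config, sort_keys=True).encode("utf-8")).hexdigest()
    return digest[:8]

def needs_rebuild(stored_version: str, model_name: str, chunk_size: int, chunk_overlap: int) -> bool:
    """True when the stored index was built with a different configuration."""
    return stored_version != index_version(model_name, chunk_size, chunk_overlap)

current = index_version("text-embedding-3-small", 800, 150)
collection_name = f"my_knowledge_base_{current}"
print(collection_name)
print(needs_rebuild(current, "text-embedding-3-large", 800, 150))
```

Storing the tag alongside the index lets a startup check decide between an incremental update and a rebuild instead of relying on someone remembering to bump a `_v2` suffix.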
Common Embedding Mistakes
- Using fixed-size chunking without overlap
- Using a distance metric that does not match the model or index (cosine vs. Euclidean)
- Not normalizing embeddings
- Embedding entire documents instead of chunks
- Ignoring metadata filtering
- Using outdated models (text-embedding-ada-002 instead of the text-embedding-3 family)
- No evaluation of retrieval quality (use RAGAS or custom metrics)
- Storing raw text without preprocessing (remove noise, normalize)
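For the evaluation point, a custom metric can be as small as the sketch below, which computes recall@k and mean reciprocal rank over a hand-labelled query set. The chunk IDs and relevance labels are made up for illustration; RAGAS and similar tools provide richer metrics on top of this kind of signal.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical labelled queries: (retrieved IDs in ranked order, relevant IDs).
eval_set = [
    (["c3", "c1", "c7"], {"c1"}),
    (["c2", "c9", "c4"], {"c2", "c4"}),
    (["c5", "c6", "c8"], {"c0"}),
]

mean_recall = sum(recall_at_k(r, rel, k=2) for r, rel in eval_set) / len(eval_set)
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
print(mean_recall, mrr)
```

Re-running the same labelled set after every change to the model, chunk size, or preprocessing shows whether the change actually helped.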
Best Practices for Embeddings
- Start with text-embedding-3-small or bge-large-en-v1.5
- Always use chunk overlap (10–20%)
- Prefer semantic or recursive splitting
- Add rich metadata (source, date, section, importance)
- Use hybrid search (vector + BM25/keyword)
- Normalize embeddings for cosine similarity
- Monitor retrieval metrics continuously
- Version your embedding model + chunking strategy
- Cache embeddings aggressively during development
- Test multiple models on your own data; never trust public leaderboards blindly
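Hybrid search needs a way to merge the vector ranking with the keyword ranking. One common choice is Reciprocal Rank Fusion, sketched here with made-up chunk IDs. This is one option among several (weighted score blending is another), and `k=60` is the conventional default constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["c1", "c2", "c3"]   # made-up IDs from a dense (embedding) search
keyword_hits = ["c3", "c1", "c4"]  # made-up IDs from a BM25/keyword search

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Because the fusion uses ranks rather than raw scores, it avoids having to put cosine similarities and BM25 scores on a common scale.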
Pro Tip – Flexible Embedding Wrapper
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings


class FlexibleEmbeddings:
    """Route to OpenAI or Hugging Face embeddings based on the model name."""

    def __init__(self, model_name: str = "text-embedding-3-small"):
        if "text-embedding" in model_name:
            self.embeddings = OpenAIEmbeddings(model=model_name)
        else:
            self.embeddings = HuggingFaceEmbeddings(model_name=model_name)

    def embed_documents(self, texts):
        return self.embeddings.embed_documents(texts)

    def embed_query(self, text):
        return self.embeddings.embed_query(text)


# Easy switching
embeddings = FlexibleEmbeddings("BAAI/bge-large-en-v1.5")
Embeddings are the single most important factor in RAG performance after your data quality. Invest time here and your agents will become dramatically more accurate and reliable.