RAG

Embeddings

Intermediate

This guide covers embeddings in depth — the foundation of all modern Retrieval-Augmented Generation (RAG) systems and semantic search. You will learn what embeddings are, how they work, which models to choose in 2026, practical implementation patterns with LangChain/LangGraph, chunking strategies, quality considerations, and production best practices with complete, runnable code.

What Are Embeddings?

Embeddings are dense vector representations of text (or other data) in a high-dimensional space where semantic similarity is captured by vector proximity (usually cosine similarity or Euclidean distance). Two similar sentences will have vectors that point in nearly the same direction, while unrelated sentences will be far apart. This enables fast, meaningful search without exact keyword matching.
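To make "proximity means similarity" concrete, here is a minimal sketch that uses hand-made 3-dimensional vectors. The numbers are invented for illustration and do not come from any model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the values are made up for illustration.
toy_vectors = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "I forgot my login credentials": [0.8, 0.2, 0.1],
    "Best pizza toppings": [0.0, 0.1, 0.9],
}

query = toy_vectors["How do I reset my password?"]
scores = {text: cosine(query, vec) for text, vec in toy_vectors.items()}

# Print the texts from most to least similar to the query.
for text, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {text}")
```

The two account-related sentences score close to 1.0, while the unrelated sentence scores close to 0.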

Semantic Vector Representations

Sparse vectors (e.g., TF-IDF, BM25): High-dimensional, mostly zeros, keyword-based.
Dense embeddings: Low-to-medium dimensional (256–3072), floating-point numbers, capture meaning.

Dense embeddings power modern RAG because they understand synonyms, context, and intent.
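The limitation of keyword-based vectors can be shown with a tiny bag-of-words sketch. This is a simplified stand-in for TF-IDF/BM25 (plain term counts, no weighting), and the two sentences are made-up examples.

```python
def bag_of_words(text, vocabulary):
    """Sparse representation: one count per vocabulary term."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

docs = ["the car is fast", "the automobile is quick"]
vocabulary = sorted({tok for d in docs for tok in d.lower().split()})

sparse_a = bag_of_words(docs[0], vocabulary)
sparse_b = bag_of_words(docs[1], vocabulary)

# Only terms that literally appear in both documents contribute to the overlap.
shared_terms = [t for t, a, b in zip(vocabulary, sparse_a, sparse_b) if a and b]

print(vocabulary)
print(sparse_a, sparse_b)
print("shared terms:", shared_terms)
```

The two sentences mean the same thing, yet they only share "is" and "the". A dense model is expected to place "car"/"automobile" and "fast"/"quick" near each other, which is what keyword counts cannot do.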

How Embeddings Work

Input text → Tokenizer → Transformer model (or other architecture)
Model outputs a fixed-size vector (e.g., 1536 dimensions)
Vectors are stored in a vector database with metadata
At query time: embed query → nearest-neighbor search (ANN)
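The four steps above can be sketched end to end without any external service. Two things here are assumptions for illustration only: `toy_embed` is a hypothetical stand-in for a real model (it hashes tokens into buckets and captures no meaning), and the search is an exact brute-force scan, where a real vector database would use an approximate (ANN) index.

```python
import hashlib
import math

DIM = 256  # fixed output size; real models use e.g. 1536

def toy_embed(text):
    """Hypothetical stand-in for a model: hash each token into a bucket."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length, so dot product == cosine

# "Vector database": a list of (vector, metadata) records.
texts = [
    "LangGraph builds agent workflows",
    "Chroma stores vectors",
    "Pizza dough needs yeast",
]
store = [(toy_embed(text), {"id": i, "text": text}) for i, text in enumerate(texts)]

def search(query, k=2):
    """Exact nearest-neighbor search; an ANN index would approximate this."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), meta) for vec, meta in store]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]

results = search("which tool stores vectors")
for score, meta in results:
    print(f"{score:.3f}  {meta['text']}")
```

Swapping `toy_embed` for a real embedding model and the list for a vector store keeps the same shape: embed once at index time, embed the query at search time, rank by similarity.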

Embedding Models

Popular options in 2026:

Model	Provider	Dimensions	Strengths	Cost / Speed
text-embedding-3-small	OpenAI	1536	Excellent quality, easy API	Low cost, fast
text-embedding-3-large	OpenAI	3072	Highest quality from OpenAI	Higher cost
voyage-3-large	Voyage AI	1024/2048	Top retrieval performance	Competitive
BGE / E5 / Stella	Hugging Face	768–1024	Strong open-source, privacy	Free (local)
Qwen3-Embedding / Jina v4	Various	Flexible	Multilingual, lightweight	Excellent open models

OpenAI Embeddings

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",   # or "text-embedding-3-large"
    # dimensions=1024,                # optional reduction (Matryoshka)
    openai_api_key="..."              # or omit and set the OPENAI_API_KEY environment variable
)

vector = embeddings.embed_query("What is LangGraph?")
print(len(vector))  # 1536

Local Embedding Models

import torch
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",   # strong performer
    # model_name="intfloat/e5-mistral-7b-instruct",
    # model_name="Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True}   # important for cosine similarity
)

# Batch embedding
texts = ["Document one...", "Document two..."]
vectors = embeddings.embed_documents(texts)

Text Similarity Concepts

Cosine Similarity: Most common for embeddings (angle between vectors)
Euclidean / L2 Distance: Used by some vector stores
Dot Product: Fast when vectors are normalized

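from sklearn.metrics.pairwise import cosine_similarity

# Embed two texts with any embeddings object defined above, then compare them
vec1 = embeddings.embed_query("What is LangGraph?")
vec2 = embeddings.embed_query("How do I build agents with LangGraph?")

sim = cosine_similarity([vec1], [vec2])[0][0]  # closer to 1.0 means more similar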
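The three metrics are closely related once vectors are normalized to unit length. The sketch below uses two made-up vectors to check that the dot product then equals cosine similarity, and that squared L2 distance equals 2 - 2 * cosine, so all three produce the same ranking.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = normalize([3.0, 4.0, 0.0])  # becomes [0.6, 0.8, 0.0]
b = normalize([4.0, 3.0, 0.0])  # becomes [0.8, 0.6, 0.0]

cos_sim = dot(a, b)       # for unit vectors, dot product equals cosine similarity
dist = l2_distance(a, b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(round(cos_sim, 4), round(dist ** 2, 4), round(2 - 2 * cos_sim, 4))
```

This is why a mismatch between the metric a model expects and the metric a vector store is configured with matters mostly for unnormalized vectors.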

Embedding Dimensions

Higher dimensions usually mean better quality, but also more storage and slower search.
Many modern models are trained with Matryoshka Representation Learning, so you can truncate their vectors to fewer dimensions with minimal quality loss.

Example with OpenAI:

embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)
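To show what dimension reduction buys, here is a rough sketch of manual truncation plus a storage estimate. It assumes float32 storage (4 bytes per value) and ignores index and metadata overhead; the 4-dimensional vector is made up. Truncation like this is only expected to preserve quality for models trained with Matryoshka-style objectives.

```python
import math

def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` components, then rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def storage_bytes(num_vectors, dims, bytes_per_float=4):
    """Raw float32 storage, ignoring index and metadata overhead."""
    return num_vectors * dims * bytes_per_float

full = [0.5, 0.5, 0.5, 0.5]  # made-up 4-d unit vector
short = truncate_and_renormalize(full, 2)

print(short)
print(storage_bytes(1_000_000, 1536) / 1e9, "GB for 1M vectors at 1536 dims")
print(storage_bytes(1_000_000, 512) / 1e9, "GB for 1M vectors at 512 dims")
```

Under these assumptions, one million vectors take about 6.1 GB at 1536 dimensions and about 2.0 GB at 512.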

Chunking Before Embedding

Critical step. Bad chunking destroys retrieval quality.

Recursive Character Text Splitter (Recommended Baseline)

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("docs/")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,          # measured in characters by default (roughly 200 tokens of English)
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " ", ""]
)

chunks = splitter.split_documents(docs)
print(len(chunks))
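To see what `chunk_size` and `chunk_overlap` do mechanically, here is a minimal sliding-window sketch in plain Python. It is not the library's implementation: the recursive splitter additionally tries the separators in order so that chunks end on paragraph or sentence boundaries, while this version only shows the size and overlap arithmetic. The function name is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each window re-uses the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still fully contained in at least one chunk if it is shorter than the overlap.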

Semantic Chunking (Advanced)

from langchain_experimental.text_splitter import SemanticChunker

semantic_splitter = SemanticChunker(
    embeddings,                     # any embeddings model
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=85
)

chunks = semantic_splitter.split_documents(docs)
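The idea behind percentile-based breakpoints can be sketched in a few lines. This is a simplified illustration of the concept, not the `SemanticChunker` implementation: it measures the cosine distance between neighboring sentence vectors and starts a new chunk wherever the distance exceeds the chosen percentile. The 2-dimensional vectors are invented so the example runs without a model.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def percentile(values, pct):
    """Linear-interpolation percentile, pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def semantic_split(sentences, vectors, pct=85):
    """Start a new chunk wherever the jump between neighbors is unusually large."""
    if len(sentences) < 2:
        return list(sentences)
    gaps = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(gaps, pct)
    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], gaps):
        if gap > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Made-up sentence vectors: three about cats, two about GPUs.
sentences = ["Cats purr.", "Cats nap a lot.", "Kittens play.",
             "GPUs run kernels.", "CUDA uses threads."]
vectors = [[1.0, 0.0], [0.98, 0.05], [0.95, 0.1], [0.05, 1.0], [0.0, 1.0]]

chunks = semantic_split(sentences, vectors)
print(chunks)
```

The only large jump is between the third and fourth sentence, so the text is split into one chunk about cats and one about GPUs.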

Embedding Large Documents

Best practices:

Split into meaningful chunks (400–1000 tokens)
Add contextual metadata (document title, section, page)
Use hierarchical indexing (summary + detailed chunks)
Consider parent-document retriever in LangChain
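The metadata and parent-document ideas above can be sketched as follows. This shows the concept only and is not the LangChain retriever's API: small child chunks are what gets embedded and searched, and each carries a `parent_id` that points back to the larger section handed to the LLM. All names (`Chunk`, `split_with_context`, the field names) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def split_with_context(doc_id, title, sections, max_chars=60):
    """Split each section into small child chunks that point back to their parent."""
    parents, children = {}, []
    for index, (section_name, body) in enumerate(sections):
        parent_id = f"{doc_id}:{index}"
        parents[parent_id] = body  # full section text, returned after a child matches
        for start in range(0, len(body), max_chars):
            children.append(Chunk(
                text=body[start:start + max_chars],
                metadata={"title": title, "section": section_name, "parent_id": parent_id},
            ))
    return parents, children

# Placeholder section bodies stand in for real document text.
sections = [("Intro", "x" * 100), ("Setup", "y" * 50)]
parents, children = split_with_context("guide", "Embeddings Guide", sections)

# After a child chunk matches a query, look up the full parent section.
best_child = children[2]
print(best_child.metadata, len(parents[best_child.metadata["parent_id"]]))
```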

Embedding Quality Considerations

Domain adaptation (fine-tune or use domain-specific models)
Multilingual support
Length normalization
Batch size for speed
Caching embeddings during development
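Caching is the easiest of these to illustrate. The sketch below wraps any embedding function with an in-memory cache keyed by model name and text hash, so repeated texts are never sent to the model twice. `CachedEmbedder` and `fake_embed` are hypothetical names; a real setup would pass the model's batch embedding call and probably persist the cache to disk.

```python
import hashlib

class CachedEmbedder:
    """Wraps any embed function and skips texts it has already seen."""

    def __init__(self, embed_fn, model_name):
        self.embed_fn = embed_fn      # callable: list[str] -> list[list[float]]
        self.model_name = model_name  # part of the key, so models never mix
        self.cache = {}
        self.calls = 0                # number of texts actually sent to the model

    def _key(self, text):
        return hashlib.sha256(f"{self.model_name}\n{text}".encode()).hexdigest()

    def embed_documents(self, texts):
        # dict.fromkeys removes duplicates while keeping order.
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self.cache]
        if missing:
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.cache[self._key(text)] = vector
            self.calls += len(missing)
        return [self.cache[self._key(t)] for t in texts]

# Hypothetical stand-in for a real model call.
def fake_embed(texts):
    return [[float(len(t)), 1.0] for t in texts]

embedder = CachedEmbedder(fake_embed, model_name="demo-model")
first = embedder.embed_documents(["alpha", "beta", "alpha"])
second = embedder.embed_documents(["beta", "gamma"])
print(embedder.calls)  # counts only the texts that reached fake_embed
```

Five texts are requested across the two calls, but only the three distinct ones reach the embedding function.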

Updating and Rebuilding Embeddings

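from langchain_chroma import Chroma

# Incremental update (assumes vectorstore is an existing store, e.g. Chroma,
# and new_docs is a list of freshly loaded documents)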
new_chunks = splitter.split_documents(new_docs)
vectorstore.add_documents(new_chunks)

# Full rebuild (when changing model or chunking strategy)
vectorstore = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    collection_name="my_knowledge_base_v2",
    persist_directory="./chroma_db"
)
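Deciding between an incremental update and a full rebuild can be automated with content hashes. The following is one possible sketch, not a feature of any particular vector store: it compares stored hashes against the current documents to find what needs re-embedding, and derives a collection name from the model and chunking settings so that changing either one naturally lands in a fresh collection. All function names are hypothetical.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_update(indexed_hashes, current_docs):
    """Compare stored hashes with current documents and decide what to re-embed.

    indexed_hashes: {doc_id: hash at last indexing}
    current_docs:   {doc_id: current text}
    """
    to_add = [d for d in current_docs if d not in indexed_hashes]
    to_update = [d for d, text in current_docs.items()
                 if d in indexed_hashes and indexed_hashes[d] != content_hash(text)]
    to_delete = [d for d in indexed_hashes if d not in current_docs]
    return to_add, to_update, to_delete

def collection_name(base, model_name, chunk_size, chunk_overlap):
    """Derive a name that changes whenever the model or chunking settings change."""
    config = f"{model_name}|{chunk_size}|{chunk_overlap}"
    return f"{base}_{hashlib.sha256(config.encode()).hexdigest()[:8]}"

indexed = {"a": content_hash("old a"), "b": content_hash("same"), "c": content_hash("gone")}
current = {"a": "new a", "b": "same", "d": "fresh"}

to_add, to_update, to_delete = plan_update(indexed, current)
print(to_add, to_update, to_delete)
print(collection_name("kb", "text-embedding-3-small", 800, 150))
```

Unchanged documents (here "b") are skipped entirely, which keeps embedding costs proportional to what actually changed.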

Common Embedding Mistakes

Using fixed-size chunking without overlap
Wrong distance metric (cosine vs euclidean)
Not normalizing embeddings
Embedding entire documents instead of chunks
Ignoring metadata filtering
Using outdated models (text-embedding-ada-002 instead of the text-embedding-3 family)
No evaluation of retrieval quality (use RAGAS or custom metrics)
Storing raw text without preprocessing (remove noise, normalize)
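For the evaluation point, two simple custom metrics are often enough to start with: recall@k and mean reciprocal rank (MRR). The sketch below assumes you have a small hand-labeled set of queries with known relevant document ids; the ids and rankings shown are invented.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical labeled queries: retrieved ids (ranked) and the ids judged relevant.
eval_set = [
    {"retrieved": ["d3", "d1", "d7"], "relevant": {"d1", "d2"}},
    {"retrieved": ["d5", "d9", "d4"], "relevant": {"d5"}},
]

mean_recall = sum(recall_at_k(q["retrieved"], q["relevant"], k=3) for q in eval_set) / len(eval_set)
mrr = sum(reciprocal_rank(q["retrieved"], q["relevant"]) for q in eval_set) / len(eval_set)

print(f"recall@3 = {mean_recall:.2f}, MRR = {mrr:.2f}")
```

Re-running the same labeled set after every change to the model, chunk size, or overlap turns these choices into measurable comparisons.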

Best Practices for Embeddings

Start with text-embedding-3-small or bge-large-en-v1.5
Always use chunk overlap (10–20%)
Prefer semantic or recursive splitting
Add rich metadata (source, date, section, importance)
Use hybrid search (vector + BM25/keyword)
Normalize embeddings for cosine similarity
Monitor retrieval metrics continuously
Version your embedding model + chunking strategy
Cache embeddings aggressively during development
Test multiple models on your own data — never trust public leaderboards blindly
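Hybrid search needs a way to merge two ranked lists whose scores are not comparable. Reciprocal rank fusion (RRF) is one common option, shown here as a sketch. The source does not prescribe a fusion method, so treat RRF and the constant `k=60` as assumptions; the result lists are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["d2", "d1", "d4"]   # hypothetical dense-retriever order
keyword_hits = ["d1", "d3", "d2"]  # hypothetical BM25 order

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that rank well in both lists ("d1" and "d2") rise to the top, while documents found by only one retriever are kept but ranked lower.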

Pro Tip – Flexible Embedding Wrapper

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings


class FlexibleEmbeddings:
    """Route OpenAI model names to the API and everything else to a local model."""

    def __init__(self, model_name: str = "text-embedding-3-small"):
        if "text-embedding" in model_name:
            self.embeddings = OpenAIEmbeddings(model=model_name)
        else:
            self.embeddings = HuggingFaceEmbeddings(model_name=model_name)

    def embed_documents(self, texts):
        return self.embeddings.embed_documents(texts)

    def embed_query(self, text):
        return self.embeddings.embed_query(text)

# Easy switching
embeddings = FlexibleEmbeddings("BAAI/bge-large-en-v1.5")

After data quality, your embedding and chunking choices are among the biggest drivers of RAG performance. Invest time here and your agents' retrieval will become noticeably more accurate and reliable.
