Embeddings
This guide covers embeddings in depth. They are the foundation of most modern Retrieval-Augmented Generation (RAG) and semantic search systems. You will learn what embeddings are, how they work, which models to consider in 2026, how to implement them with LangChain/LangGraph, how to chunk documents, what affects embedding quality, and which practices hold up in production, with code examples throughout.

What Are Embeddings?

Embeddings are dense vector representations of text (or other data) in a high-dimensional space where semantic similarity is captured by vector proximity (usually cosine similarity or Euclidean distance). Two similar sentences will have vectors that point in nearly the same direction, while unrelated sentences will be far apart. This enables fast, meaningful search without exact keyword matching.

Semantic Vector Representations

  • Sparse vectors (e.g., TF-IDF, BM25): High-dimensional (one dimension per vocabulary term), mostly zeros, keyword-based.
  • Dense embeddings: Low-to-medium dimensional (256–3072), floating-point numbers, capture meaning.

Dense embeddings power modern RAG because they capture synonyms, context, and intent rather than exact term overlap.

How Embeddings Work

  1. Input text → Tokenizer → Transformer model (or other architecture)
  2. Model outputs a fixed-size vector (e.g., 1536 dimensions)
  3. Vectors are stored in a vector database with metadata
  4. At query time: embed query → nearest-neighbor search (ANN)
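
The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `toy_embed` is a hypothetical stand-in for a real model (a hashed bag-of-words, so it matches shared words rather than meaning), `TinyVectorStore` is an invented in-memory class, and the search is exact brute force where a real vector database would use an ANN index.

```python
import hashlib
import numpy as np

DIM = 64  # toy dimensionality; real models emit hundreds to thousands of dimensions


def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real model: hashed bag-of-words, L2-normalized."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


class TinyVectorStore:
    """In-memory store: keeps each vector next to its text and metadata (step 3)."""

    def __init__(self):
        self.vectors = []
        self.records = []

    def add(self, text: str, metadata: dict) -> None:
        self.vectors.append(toy_embed(text))
        self.records.append({"text": text, "metadata": metadata})

    def search(self, query: str, k: int = 2):
        """Embed the query, then rank stored vectors by dot product (step 4)."""
        q = toy_embed(query)
        scores = np.array(self.vectors) @ q  # vectors are unit length, so this is cosine
        top = np.argsort(-scores)[:k]
        return [(self.records[i], float(scores[i])) for i in top]


store = TinyVectorStore()
store.add("langgraph builds stateful agent graphs", {"source": "notes.md"})
store.add("chroma stores vectors on disk", {"source": "db.md"})
store.add("bananas are yellow fruit", {"source": "misc.md"})

for record, score in store.search("stateful agent graphs"):
    print(round(score, 3), record["text"], record["metadata"])
```

Swapping `toy_embed` for a real embedding model is what turns this keyword-like matcher into semantic search; the storage and query flow stay the same.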

Embedding Models

| Model | Provider | Dimensions | Strengths | Cost / Speed |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Excellent quality, easy API | Low cost, fast |
| text-embedding-3-large | OpenAI | 3072 | Highest quality from OpenAI | Higher cost |
| voyage-3-large | Voyage AI | 1024 / 2048 | Top retrieval performance | Competitive |
| BGE / E5 / Stella | Hugging Face | 768–1024 | Strong open-source, privacy | Free (local) |
| Qwen3-Embedding / Jina v4 | Various | Flexible | Multilingual, lightweight | Excellent open models |

OpenAI Embeddings

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",   # or "text-embedding-3-large"
    # dimensions=1024,                # optional reduction (Matryoshka)
    openai_api_key="..."              # or omit and set the OPENAI_API_KEY environment variable
)

vector = embeddings.embed_query("What is LangGraph?")
print(len(vector))  # 1536

Local Embedding Models

import torch  # needed for the CUDA availability check below
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",   # strong performer
    # model_name="intfloat/e5-mistral-7b-instruct",
    # model_name="Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True}   # important for cosine similarity
)

# Batch embedding
texts = ["Document one...", "Document two..."]
vectors = embeddings.embed_documents(texts)

Text Similarity Concepts

  • Cosine Similarity: Most common for embeddings (angle between vectors)
  • Euclidean / L2 Distance: Used by some vector stores
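  • Dot Product: Equivalent to cosine similarity when vectors are normalized to unit length, and cheaper to compute

from sklearn.metrics.pairwise import cosine_similarity

# Embed two texts with any embeddings object defined above
vec1 = embeddings.embed_query("What is LangGraph?")
vec2 = embeddings.embed_query("Explain LangGraph")

sim = cosine_similarity([vec1], [vec2])[0][0]  # 1.0 means the vectors point the same way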
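
To make the relationship between the three metrics concrete, here is a small NumPy sketch using two made-up vectors (the values are arbitrary). It illustrates that, once vectors are normalized to unit length, the dot product equals cosine similarity and squared L2 distance is just `2 - 2 * cosine`, so all three produce the same ranking.

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


# Two arbitrary example vectors, normalized to unit length
a = normalize(np.array([1.0, 2.0, 3.0]))
b = normalize(np.array([2.0, 1.0, 0.5]))

cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(a @ b)
l2_squared = float(np.sum((a - b) ** 2))

print("cosine     :", cosine)
print("dot        :", dot)          # same as cosine for unit vectors
print("L2 squared :", l2_squared)   # equals 2 - 2 * cosine for unit vectors
```

This is why the choice of metric matters mostly for unnormalized vectors: with normalized embeddings, a store configured for dot product, cosine, or L2 returns neighbors in the same order.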

Embedding Dimensions

Higher dimensions usually mean better quality, but also more storage and slower search.
Many modern models are trained with Matryoshka Representation Learning, so you can truncate their vectors to fewer dimensions with minimal quality loss.
Example with OpenAI:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)
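
If you already have full-size vectors stored, truncation can also be done by hand. The sketch below assumes the vectors come from a model trained with Matryoshka Representation Learning (truncating other models' vectors can hurt quality badly); `truncate_embedding` is a hypothetical helper name. The key detail is re-normalizing after slicing, because a truncated vector is no longer unit length.

```python
import numpy as np


def truncate_embedding(vector, dims: int) -> np.ndarray:
    """Keep the first `dims` components, then rescale back to unit length."""
    v = np.asarray(vector, dtype=float)[:dims]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


full = np.array([0.5, 0.5, 0.5, 0.5])   # toy unit-length vector
short = truncate_embedding(full, 2)

print(short, np.linalg.norm(short))
```

All vectors in one index need the same truncation, including query vectors at search time.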

Chunking Before Embedding

Chunking is a critical step: poor chunk boundaries degrade retrieval quality no matter how good the embedding model is.

Recursive Character Text Splitter (Recommended Baseline)

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("docs/")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,          # measured in characters by default (roughly 200 tokens of English)
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " ", ""]
)

chunks = splitter.split_documents(docs)
print(len(chunks))
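
To see what `chunk_size` and `chunk_overlap` mean mechanically, here is a deliberately simplified sliding-window chunker in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to split on the separators first); `chunk_text` is a hypothetical helper that only illustrates the size and overlap arithmetic.

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-size character windows that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(piece)
```

Each window repeats the last `chunk_overlap` characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.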

Semantic Chunking (Advanced)

from langchain_experimental.text_splitter import SemanticChunker

semantic_splitter = SemanticChunker(
    embeddings,                     # any embeddings model
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=85
)

chunks = semantic_splitter.split_documents(docs)
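
The idea behind percentile-based semantic chunking can be sketched with NumPy. This is a simplified illustration under stated assumptions, not LangChain's implementation: `semantic_breakpoints` and `group_sentences` are hypothetical names, and the sentence vectors are assumed to be precomputed by whatever embedding model you use. A new chunk starts wherever the cosine distance between neighboring sentences exceeds the chosen percentile of all neighbor distances.

```python
import numpy as np


def semantic_breakpoints(sentence_vectors, percentile: float = 85.0) -> list[int]:
    """Return indices where a new chunk should start."""
    vecs = np.asarray(sentence_vectors, dtype=float)
    if len(vecs) < 2:
        return []
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    # Cosine distance between each sentence and the one that follows it
    distances = 1.0 - np.sum(vecs[:-1] * vecs[1:], axis=1)
    threshold = np.percentile(distances, percentile)
    return [i + 1 for i, d in enumerate(distances) if d > threshold]


def group_sentences(sentences, sentence_vectors, percentile: float = 85.0) -> list[str]:
    """Join sentences into chunks, cutting at the detected breakpoints."""
    chunks, start = [], 0
    for end in semantic_breakpoints(sentence_vectors, percentile) + [len(sentences)]:
        chunks.append(" ".join(sentences[start:end]))
        start = end
    return chunks


# Toy 2-D vectors: the first two sentences share a topic, the last two share another
sentences = ["A.", "B.", "C.", "D."]
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(group_sentences(sentences, vectors))
```

A lower percentile produces more, smaller chunks; a higher one cuts only at the sharpest topic shifts.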

Embedding Large Documents

Best practices:

  • Split into meaningful chunks (400–1000 tokens)
  • Add contextual metadata (document title, section, page)
  • Use hierarchical indexing (summary + detailed chunks)
  • Consider parent-document retriever in LangChain
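
As one possible way to attach contextual metadata, the sketch below builds chunk records that carry the document title, section, and page, and also prepends a short context header to the text that gets embedded. `ChunkRecord` and `build_records` are hypothetical names for illustration; LangChain's own `Document` class plays a similar role with its `page_content` and `metadata` fields.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    text: str                                   # text that will be embedded
    metadata: dict = field(default_factory=dict)  # used for filtering and citations


def build_records(doc_title: str, sections) -> list[ChunkRecord]:
    """sections: iterable of (section_name, page, list_of_chunk_texts)."""
    records = []
    for section_name, page, chunk_texts in sections:
        for position, chunk in enumerate(chunk_texts):
            # Prepend a short header so the embedded text carries its origin
            contextual_text = f"{doc_title} > {section_name}\n{chunk}"
            records.append(ChunkRecord(
                text=contextual_text,
                metadata={
                    "title": doc_title,
                    "section": section_name,
                    "page": page,
                    "position": position,
                },
            ))
    return records


records = build_records("Handbook", [
    ("Intro", 1, ["alpha", "beta"]),
    ("Setup", 3, ["gamma"]),
])
for r in records:
    print(r.metadata, repr(r.text))
```

Whether the header helps depends on the corpus, so it is worth comparing retrieval quality with and without it.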

Embedding Quality Considerations

  • Domain adaptation (fine-tune or use domain-specific models)
  • Multilingual support
  • Length normalization
  • Batch size for speed
  • Caching embeddings during development
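
Caching during development can be as simple as a dictionary keyed by a hash of the model name and the text. The sketch below is an in-memory illustration only: `CachedEmbedder` is a hypothetical class, and `embed_fn` is assumed to be any callable that takes a list of strings and returns one vector per string (for example, the `embed_documents` method of an embeddings object).

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function so each unique text is embedded only once."""

    def __init__(self, embed_fn, model_name: str):
        self.embed_fn = embed_fn      # callable: list[str] -> list[list[float]]
        self.model_name = model_name  # part of the key, so models never share entries
        self.cache = {}

    def _key(self, text: str) -> str:
        raw = f"{self.model_name}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def embed_documents(self, texts):
        # dict.fromkeys removes duplicates while keeping order
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self.cache]
        if missing:
            # One batched call for everything not yet cached
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.cache[self._key(text)] = vector
        return [self.cache[self._key(t)] for t in texts]


calls = []


def fake_embed(texts):
    """Stand-in embedding function that records each batch it receives."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


embedder = CachedEmbedder(fake_embed, model_name="demo-model")
first = embedder.embed_documents(["a", "bb", "a"])
second = embedder.embed_documents(["bb", "ccc"])
print(first, second, calls)
```

Including the model name in the key matters: reusing cached vectors after switching models would silently mix incompatible vector spaces.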

Updating and Rebuilding Embeddings

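from langchain_chroma import Chroma

# Incremental update: assumes an existing `vectorstore` and newly loaded `new_docs`
new_chunks = splitter.split_documents(new_docs)
vectorstore.add_documents(new_chunks)

# Full rebuild (when changing model or chunking strategy): re-embed every chunk
# into a new collection, since vectors from different models are not comparable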
vectorstore = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    collection_name="my_knowledge_base_v2",
    persist_directory="./chroma_db"
)
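
One way to keep incremental updates from creating duplicates is to derive a stable ID for each chunk from its content and the index version, then compare against the IDs already stored. The sketch below is store-agnostic and uses hypothetical helper names (`chunk_id`, `plan_update`); it only computes what to add and what to delete, and leaves the actual writes to whichever vector store you use.

```python
import hashlib


def chunk_id(source: str, text: str, index_version: str) -> str:
    """Deterministic ID: same source, text, and version always give the same ID."""
    raw = f"{index_version}|{source}|{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


def plan_update(existing_ids: set, chunks: list, index_version: str):
    """chunks: list of (source, text). Returns (to_add, to_delete)."""
    wanted = {chunk_id(s, t, index_version): (s, t) for s, t in chunks}
    to_add = {cid: pair for cid, pair in wanted.items() if cid not in existing_ids}
    to_delete = existing_ids - set(wanted)
    return to_add, to_delete


version = "bge-large-en-v1.5/chunk800"   # illustrative version label
old_chunks = [("a.pdf", "x"), ("a.pdf", "y")]
existing = {chunk_id(s, t, version) for s, t in old_chunks}

new_chunks = [("a.pdf", "y"), ("a.pdf", "z")]
to_add, to_delete = plan_update(existing, new_chunks, version)
print(len(to_add), "to embed,", len(to_delete), "to delete")
```

Because the version label is part of the ID, changing the embedding model or chunking strategy marks every existing chunk for replacement, which matches the full-rebuild case above.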

Common Embedding Mistakes

  • Using fixed-size chunking without overlap
  • Wrong distance metric (cosine vs euclidean)
  • Not normalizing embeddings
  • Embedding entire documents instead of chunks
  • Ignoring metadata filtering
  • Using outdated models (text-embedding-ada-002 instead of the text-embedding-3 family)
  • No evaluation of retrieval quality (use RAGAS or custom metrics)
  • Storing raw text without preprocessing (remove noise, normalize)
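
For the evaluation point, two simple custom metrics go a long way: recall@k and mean reciprocal rank (MRR). The sketch below assumes you have, for each test query, the ranked list of retrieved chunk IDs and a hand-labeled set of relevant IDs; the function names and the sample IDs are illustrative. These are not the RAGAS metrics, just a lightweight baseline.

```python
def recall_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k: int = 5) -> dict:
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


runs = [
    (["d1", "d2", "d3"], ["d2"]),
    (["d4", "d5", "d6"], ["d9", "d4"]),
]
print(evaluate(runs, k=2))
```

Running the same labeled queries after every model or chunking change turns "it feels better" into a number you can track.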

Best Practices for Embeddings

  1. Start with text-embedding-3-small or bge-large-en-v1.5
  2. Always use chunk overlap (10–20%)
  3. Prefer semantic or recursive splitting
  4. Add rich metadata (source, date, section, importance)
  5. Use hybrid search (vector + BM25/keyword)
  6. Normalize embeddings for cosine similarity
  7. Monitor retrieval metrics continuously
  8. Version your embedding model + chunking strategy
  9. Cache embeddings aggressively during development
  10. Test multiple models on your own data — never trust public leaderboards blindly
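
Hybrid search (item 5) needs a way to merge a vector ranking with a keyword ranking. One common, score-free option is Reciprocal Rank Fusion (RRF), sketched below. The function name and the sample document IDs are illustrative, and `k=60` is the constant conventionally used with RRF, not something this guide prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60) -> list:
    """Merge several ranked ID lists; each appearance adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]    # ranked by embedding similarity
keyword_hits = ["b", "d", "a"]   # ranked by BM25 or another keyword scorer

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF only looks at ranks, so it sidesteps the problem that cosine scores and BM25 scores live on different scales.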

Pro Tip – Flexible Embedding Wrapper

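from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

class FlexibleEmbeddings: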
    def __init__(self, model_name: str = "text-embedding-3-small"):
        if "text-embedding" in model_name:
            self.embeddings = OpenAIEmbeddings(model=model_name)
        else:
            self.embeddings = HuggingFaceEmbeddings(model_name=model_name)
    
    def embed_documents(self, texts):
        return self.embeddings.embed_documents(texts)
    
    def embed_query(self, text):
        return self.embeddings.embed_query(text)

# Easy switching
embeddings = FlexibleEmbeddings("BAAI/bge-large-en-v1.5")

After data quality, embedding quality is one of the most important factors in RAG performance. Invest time here and your agents will become noticeably more accurate and reliable.
