# Understanding Embeddings

## Vector Embeddings Explained
Embeddings are the secret sauce behind semantic search. They convert text into numerical vectors that capture meaning, so you can find documents by concept rather than by exact keyword.
**Key Insight:** Similar meanings → Similar vectors → Similar positions in vector space
## How Embeddings Work
```python
# Creating embeddings with OpenAI
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    """Convert text to a 1536-dimensional vector."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Example: compare two sentences
text1 = "The cat sat on the mat"
text2 = "A feline rested on the rug"
text3 = "Python is a programming language"

emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
emb3 = get_embedding(text3)

# Cosine similarity: 1 = identical, 0 = unrelated
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Cat sentences: {cosine_similarity(emb1, emb2):.3f}")  # ~0.92
print(f"Cat vs Python: {cosine_similarity(emb1, emb3):.3f}")  # ~0.45
```

## Chunking Strategies
Before embedding documents, you need to split them into chunks. Chunk size affects retrieval quality:
- **Too small (~100 tokens):** missing context, fragmented information
- **Too large (~2000 tokens):** diluted relevance, wasted context window
- **Sweet spot (200-500 tokens):** a good default for most use cases
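Note that these budgets are in tokens, not characters. As a rough way to check where a chunk falls, here is a minimal sketch assuming the `tiktoken` package (OpenAI's open-source tokenizer library); the `count_tokens` helper is illustrative:

```python
# Sketch: measuring chunk size in tokens (assumes the tiktoken package)
import tiktoken

# cl100k_base is the encoding used by OpenAI's current embedding models
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

print(count_tokens("The cat sat on the mat"))  # a tiny chunk: ~6 tokens
```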
**Chunking Methods:**

1. **Fixed-size:** split every N characters/tokens
2. **Sentence-based:** split on sentence boundaries (see the sketch below)
3. **Recursive text splitting:** split on paragraphs, then sentences, then words
4. **Semantic chunking:** group by topic (advanced)
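To make the sentence-based approach concrete, here is a minimal sketch; the regex split and the `sentence_chunks`/`max_chars` names are illustrative assumptions, not a library API:

```python
# Sketch: sentence-based chunking with a naive regex sentence splitter
import re

def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Naive split on ., !, ? followed by whitespace; a real implementation
    # would use a proper sentence tokenizer
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because splits only happen on sentence boundaries, a single sentence longer than `max_chars` becomes its own oversized chunk; handling that edge case is left out of the sketch.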
**Exercise:** Write a function that splits text into chunks of approximately 500 characters with a 50-character overlap.
💡 **Hint:** To create overlap, move forward by `(chunk_size - overlap)` instead of `chunk_size`.
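For reference, here is one possible solution following the hint (the name `chunk_text` is illustrative):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows."""
    chunks = []
    step = chunk_size - overlap  # advance by less than chunk_size so chunks overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap means the last `overlap` characters of each chunk reappear at the start of the next one, so a sentence cut by a chunk boundary is still intact in at least one chunk.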