# Understanding Embeddings

## Vector Embeddings Explained
Embeddings are the secret sauce behind semantic search. They convert text into numerical vectors that capture meaning, so you can find documents by concept rather than by exact keyword.
**Key Insight:** Similar meanings → Similar vectors → Similar positions in vector space
## How Embeddings Work
```python
# Creating embeddings with OpenAI
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    """Convert text to a 1536-dimensional vector."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Example: compare two sentences
text1 = "The cat sat on the mat"
text2 = "A feline rested on the rug"
text3 = "Python is a programming language"

emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
emb3 = get_embedding(text3)

# Cosine similarity: 1 = identical, 0 = unrelated
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Cat sentences: {cosine_similarity(emb1, emb2):.3f}")  # ~0.92
print(f"Cat vs Python: {cosine_similarity(emb1, emb3):.3f}")  # ~0.45
```

## Chunking Strategies
Before embedding documents, you need to split them into chunks. Chunk size affects retrieval quality:
- **Too small (~100 tokens):** missing context, fragmented information
- **Too large (~2000 tokens):** diluted relevance, wasted context window
- **Sweet spot (200-500 tokens):** a good default for most use cases
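Note that these budgets are in tokens, not characters. As a rough way to check where a chunk falls, here is a minimal sketch assuming the `tiktoken` package (OpenAI's open-source tokenizer library); the `count_tokens` helper is illustrative:

```python
# Sketch: measuring chunk size in tokens (assumes the tiktoken package)
import tiktoken

# cl100k_base is the encoding used by OpenAI's current embedding models
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

print(count_tokens("The cat sat on the mat"))  # a tiny chunk: ~6 tokens
```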
**Chunking Methods:**

1. **Fixed-size:** split every N characters/tokens
2. **Sentence-based:** split on sentence boundaries (see the sketch below)
3. **Recursive text splitting:** split on paragraphs, then sentences, then words
4. **Semantic chunking:** group by topic (advanced)
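To make the sentence-based approach concrete, here is a minimal sketch; the regex split and the `sentence_chunks`/`max_chars` names are illustrative assumptions, not a library API:

```python
# Sketch: sentence-based chunking with a naive regex sentence splitter
import re

def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Naive split on ., !, ? followed by whitespace; a real implementation
    # would use a proper sentence tokenizer
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because splits only happen on sentence boundaries, a single sentence longer than `max_chars` becomes its own oversized chunk; handling that edge case is left out of the sketch.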
**Exercise:** Write a function that splits text into chunks of approximately 500 characters with a 50-character overlap.
💡 **Hint:** To create overlap, move forward by `(chunk_size - overlap)` instead of `chunk_size`.
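For reference, here is one possible solution following the hint (the name `chunk_text` is illustrative):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows."""
    chunks = []
    step = chunk_size - overlap  # advance by less than chunk_size so chunks overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap means the last `overlap` characters of each chunk reappear at the start of the next one, so a sentence cut by a chunk boundary is still intact in at least one chunk.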