Semantic Code Search
Codegen’s VectorIndex enables semantic code search using embeddings. This allows you to search codebases with natural language queries and find semantically related code, even when the exact terms aren’t present.
Basic Usage
Create and save a vector index for your codebase:
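A minimal sketch of building the index. The VectorIndex class comes from the source; the import paths and the create/save method names are assumptions based on Codegen’s extension layout, so check them against your installed version:

```python
from codegen import Codebase
from codegen.extensions import VectorIndex

# Parse the local codebase (embedding creation below needs OPENAI_API_KEY set)
codebase = Codebase("./")

# Build an embedding for every file and persist the index for later reuse
index = VectorIndex(codebase)
index.create()  # calls OpenAI's embeddings endpoint for each file/chunk
index.save()    # writes the index to disk
```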
Later, load the index and perform semantic searches:
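For example (a sketch, assuming load/similarity_search method names and a result shape of (filepath, score) pairs; verify against your version of Codegen):

```python
from codegen import Codebase
from codegen.extensions import VectorIndex

codebase = Codebase("./")
index = VectorIndex(codebase)
index.load()  # load the previously saved embeddings from disk

# Natural-language query; k controls how many files to return
results = index.similarity_search("how are user sessions authenticated?", k=5)
for filepath, score in results:
    print(f"{score:.3f}  {filepath}")
```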
The search uses cosine similarity between embeddings to find the most semantically related files, regardless of exact keyword matches.
Getting Embeddings
You can also get embeddings for arbitrary text using the same model:
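Whether VectorIndex exposes its own helper for this depends on your version; the snippet below shows the equivalent direct call to OpenAI’s embeddings endpoint with the same model:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["function that parses JSON", "database connection pooling"],
)
vectors = [item.embedding for item in resp.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 1536 dimensions each
```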
How It Works
The VectorIndex class:
- Processes each file in your codebase
- Splits large files into chunks that fit within token limits
- Uses OpenAI’s text-embedding-3-small model to create embeddings
- Stores embeddings in a numpy array for efficient similarity search
- Saves the index to disk for reuse
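The chunking step can be pictured with a simplified sketch. This is not Codegen’s implementation: whitespace-separated words stand in for tokens here, whereas a real version would count model tokens (e.g. with tiktoken):

```python
def chunk_text(text: str, max_tokens: int = 8000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks that fit an embedding model's limit.

    Words stand in for tokens; a production version would use a real tokenizer.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    chunks = []
    step = max_tokens - overlap  # overlap preserves context across boundaries
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks
```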
When searching:
- Your query is converted to an embedding using the same model
- Cosine similarity is computed between the query and all file embeddings
- The most similar files are returned, along with their similarity scores
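The search step above reduces to a small numpy computation. A self-contained sketch (the real index operates on its stored file embeddings and maps indices back to file paths):

```python
import numpy as np

def top_k_cosine(query: np.ndarray, file_embeddings: np.ndarray, k: int = 3):
    """Return indices and scores of the k embeddings most similar to query."""
    # Normalize rows so that dot products equal cosine similarities
    q = query / np.linalg.norm(query)
    m = file_embeddings / np.linalg.norm(file_embeddings, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(scores)[::-1][:k]  # highest similarity first
    return top, scores[top]
```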
Creating embeddings requires an OpenAI API key with access to the embeddings endpoint.
Example Searches
Here are some example semantic searches that illustrate the kinds of queries the index handles:
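For instance (the queries are illustrative, and the similarity_search method name and (filepath, score) result shape are assumptions; results depend on your codebase):

```python
# Assumes `index` is a loaded VectorIndex, as in Basic Usage above
for query in [
    "where are HTTP request retries handled?",
    "code that validates user email addresses",
    "logic for rate limiting API calls",
]:
    results = index.similarity_search(query, k=3)
    print(query)
    for filepath, score in results:
        print(f"  {score:.3f}  {filepath}")
```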
The semantic search can understand concepts and return relevant results even when the exact terms aren’t present in the code.