An ORM for vector databases
April 22, 2023
This week I wrote an initial version of an ORM for vector databases. It lets you define indexes as Python objects and search them using method chaining. The API aligns closely with existing SQL ORMs like SQLAlchemy or Peewee, so the learning curve for getting started with this library should be minimal.
Read on for a quick introduction to vectordb-orm, or hop into the source code here.
vectordb-orm offers an opinionated way to define and query for objects that have vector embeddings. Everything is oriented around the declared schema of the objects that you're looking to store. Typehints specify what kind of data these fields should accept and the ORM takes care of synchronizing the database to this schema definition. To define an example object that has a unique identifier with embedding fields, do:
class MyObject(VectorSchemaBase):
    __collection_name__ = 'my_collection'
    __consistency_type__ = ConsistencyType.STRONG

    id: int = PrimaryKeyField()
    purchase: int
    tag: str = VarCharField(max_length=128)
    embedding: np.ndarray = EmbeddingField(dim=128, index=Milvus_IVF_FLAT(cluster_units=128))
Each key is optionally configured by a constructor that gives additional options. Some of these are required to give additional metadata about what the database expects (like in the case of embedding dimensions). The type annotations themselves indicate what form the values will take, and are used for casting and validation from the backend storage systems.
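To make the idea of annotations driving casting and validation concrete, here is a minimal sketch in plain Python. It is not how vectordb-orm is implemented; `SchemaBase` and `Purchase` are hypothetical names invented for illustration, and the only library call used is the standard `typing.get_type_hints`.

```python
from typing import get_type_hints


class SchemaBase:
    """Hypothetical base class (an assumption, not the vectordb-orm code)."""

    def __init__(self, **values):
        # Read the declared annotations, e.g. {"id": int, "tag": str}.
        hints = get_type_hints(type(self))
        for name, expected in hints.items():
            if name not in values:
                raise ValueError(f"missing field: {name}")
            value = values[name]
            if not isinstance(value, expected):
                # Cast raw storage values (such as "1") to the annotated type.
                # A failed cast raises ValueError, which acts as validation.
                value = expected(value)
            setattr(self, name, value)


class Purchase(SchemaBase):
    id: int
    purchase: int
    tag: str


# Pretend the backend handed the id back as a string.
row = Purchase(id="1", purchase=7, tag="in-store")
print(row.id, row.purchase, row.tag)
```

The same annotations that give your IDE its hints are enough to decide how each stored value should be converted, which is why a single schema declaration can serve both purposes.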
Querying also makes use of these type definitions to define the fields that you can search. Searching relies on native Python operations so requests can filter for values:
results = (
    session
    .query(MyObject)
    .filter(MyObject.tag == 'in-store', MyObject.purchase > 5)
    .order_by_similarity(MyObject.embedding, search_vector)
    .limit(2)
    .all()
)
Once the query executes, it'll cast the found database objects into instances of MyObject. Each result also carries the relevancy score computed by the vector similarity method. This lets you pass these ORM objects around your application logic, complete with IDE typehinting:
print(results[0].result.tag, results[0].score)
> in-store 0.05
The ORM masks a good amount of complexity on the backend for each provider, like casting types, field validation, and constructing the correct queries to the backend providers.
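As a rough illustration of what that masking can involve, the sketch below shows how Python comparison operators could be captured as data and then rendered for two different backends. It is an assumption about the general technique, not the library's internals: `Field`, `Condition`, `to_milvus_expr`, and `to_pinecone_filter` are hypothetical names. The output shapes follow the Milvus boolean expression syntax and the Pinecone metadata filter syntax (`$eq`, `$gt`).

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Condition:
    """One captured comparison, e.g. tag == "in-store"."""
    field: str
    op: str
    value: Any


class Field:
    """Hypothetical column handle that records comparisons instead of evaluating them."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return Condition(self.name, "==", other)

    def __gt__(self, other):
        return Condition(self.name, ">", other)


def to_milvus_expr(conditions):
    # Milvus-style filters are a single boolean expression string.
    # json.dumps quotes strings and leaves numbers bare.
    return " and ".join(
        f"{c.field} {c.op} {json.dumps(c.value)}" for c in conditions
    )


PINECONE_OPS = {"==": "$eq", ">": "$gt"}


def to_pinecone_filter(conditions):
    # Pinecone-style filters are a nested dict; top-level keys are ANDed.
    # This sketch assumes at most one condition per field.
    return {c.field: {PINECONE_OPS[c.op]: c.value} for c in conditions}


tag = Field("tag")
purchase = Field("purchase")
conditions = [tag == "in-store", purchase > 5]

print(to_milvus_expr(conditions))
print(to_pinecone_filter(conditions))
```

The application code only ever writes the comparison once; the per-provider rendering is the part an ORM can keep out of sight.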
Rather severe context length limitations in the current generation of LLMs have given rise to retrieval-based approaches, which are often paired with agent patterns like the ReAct model. In this design pattern you convert a user's query or the current context into an embedding, then retrieve the most semantically similar pieces of content from a vector database. These can either be documents in a search system or memories in a more general purpose chatbot.
There's a lot of movement in building the ideal vector database. Like most distributed databases, there is usually some fundamental tradeoff between consistency, recall, and querying speed. The most popular right now are Pinecone, Weaviate, and Milvus, but new ones are popping up all the time, each weighing these core tradeoffs differently.
Given different requirements as deployments grow, I see the actual database in large part as an implementation detail. As it stands right now the switching costs between databases are pretty high.
Why an ORM
The mental model for different vector databases is effectively the same, and very similar to conventional relational databases. You have a datapoint that has some metadata and is enriched with a vector embedding. You want to do some combination of SELECT from this table, where SELECT queries involve both filtering for exact match data and finding similar vectors to some new embedding input.
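That shared mental model can be sketched in a few lines of plain Python: filter rows on exact-match metadata, then rank what is left by similarity to a query vector. This is a brute-force illustration with made-up data; real vector databases rely on approximate indexes rather than scanning every row, and `search` and `cosine_similarity` here are hypothetical helpers, not part of any provider's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(rows, predicate, query_vector, limit):
    """Filter on metadata first, then rank the survivors by similarity."""
    candidates = [row for row in rows if predicate(row)]
    scored = [
        (cosine_similarity(row["embedding"], query_vector), row)
        for row in candidates
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Illustrative rows only; the values are invented for this example.
rows = [
    {"id": 1, "tag": "in-store", "purchase": 9, "embedding": [1.0, 0.0]},
    {"id": 2, "tag": "online", "purchase": 7, "embedding": [1.0, 0.1]},
    {"id": 3, "tag": "in-store", "purchase": 6, "embedding": [0.0, 1.0]},
    {"id": 4, "tag": "in-store", "purchase": 2, "embedding": [1.0, 0.0]},
]

hits = search(
    rows,
    predicate=lambda r: r["tag"] == "in-store" and r["purchase"] > 5,
    query_vector=[1.0, 0.0],
    limit=2,
)
for score, row in hits:
    print(row["id"], round(score, 2))
```

Row 2 is close to the query vector but fails the tag filter, and row 4 matches the vector exactly but fails the purchase filter, so neither is returned.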
Despite these similarities, each vector database provider has its own API structure, and these APIs are largely incompatible with one another. As a result, each major project has to re-implement these backends manually for its own business logic so that the community can plug and play with their own favorite vectordbs.
An ORM naturally makes this easier by abstracting the complexities of backends from user-written application code. And so vectordb-orm was born. Like traditional ORMs it also allows for:
- Improved code maintainability and readability by abstracting low-level database operations
- Easy switching between different vector database providers without changing the application logic
- Encouraging best practices and design patterns for working with vector data
- Native typehints in your IDE when developing
- (Future) Centralized optimizations for insert batching and search pagination
vectordb-orm is still quite new so it only supports Milvus and Pinecone backends at the moment. A few items on the roadmap for future versions:
- Add support for additional databases. Weaviate and Redis are the next two on my priority list.
- Support bulk insertion of input vectors for the providers that support them. This can significantly speed up the initial upsert time for requests that go over the wire.
- Support more complex chaining of filters as backends allow. Allow "and" chaining to create more complicated predicates. For the providers that don't support these commands natively, provide a local implementation that fetches data and then post-processes locally.
- Enhanced documentation and community support, including sample projects and tutorials.
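For the bulk insertion item on this roadmap, the core idea is to group records into fixed-size batches so each network round trip carries many vectors. The sketch below is one possible shape, not a planned API: `batched` and `bulk_insert` are hypothetical helpers, and `insert_batch` stands in for whatever batch upsert call a given provider exposes.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def bulk_insert(records, insert_batch, batch_size=100):
    """Send records in batches and return the number of calls made.

    insert_batch is a hypothetical callable (an assumption) that wraps a
    provider's batch upsert.
    """
    calls = 0
    for batch in batched(records, batch_size):
        insert_batch(batch)
        calls += 1
    return calls


# Stand-in backend that just records what it was sent.
sent = []
calls = bulk_insert(range(250), sent.append, batch_size=100)
print(calls, [len(batch) for batch in sent])
```

With 250 records and a batch size of 100, the stand-in backend is called three times instead of 250.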
If you give vectordb-orm a spin and have some thoughts on the API contract or missing functionality, I'm all ears.