[Read time: 4 minutes] April 22, 2023

This week I wrote an initial version of an ORM for vector databases. It lets you define indexes as Python objects and search for them using method chaining. The API aligns closely with existing SQL ORMs like SQLAlchemy or Peewee, so the learning curve for this library should be minimal.

Read on for a quick introduction to vectordb-orm, or hop into the source code here.

Introducing vectordb-orm

vectordb-orm offers an opinionated way to define and query for objects that have vector embeddings. Everything is oriented around the declared schema of the objects that you're looking to store. Typehints specify what kind of data these fields should accept and the ORM takes care of synchronizing the database to this schema definition. To define an example object that has a unique identifier with purchase, tag, and embedding fields, do:

class MyObject(VectorSchemaBase):
    __collection_name__ = 'my_collection'
    __consistency_type__ = ConsistencyType.STRONG

    id: int = PrimaryKeyField()
    purchase: int
    tag: str = VarCharField(max_length=128)
    embedding: np.ndarray = EmbeddingField(dim=128, index=Milvus_IVF_FLAT(cluster_units=128))

Each key is optionally configured by a constructor that gives additional options. Some of these are required to give additional metadata about what the database expects (like in the case of embedding dimensions). The type annotations themselves indicate what form the values will take, and are used for casting and validation from the backend storage systems.
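
To make the casting idea concrete, here's a minimal sketch of how annotations can drive conversion of raw backend values. This isn't vectordb-orm's actual implementation; `TypedRecord` and `Purchase` are hypothetical names used only for illustration.

```python
from typing import Any, get_type_hints


class TypedRecord:
    """Hypothetical base class that casts raw backend values using annotations."""

    def __init__(self, **raw: Any) -> None:
        for name, expected in get_type_hints(type(self)).items():
            if name not in raw:
                raise ValueError(f"missing field: {name}")
            # Cast the stored value (often a string or generic number)
            # to the type declared in the class annotation.
            setattr(self, name, expected(raw[name]))


class Purchase(TypedRecord):
    id: int
    purchase: int
    tag: str


# Values as they might come back from a backend: loosely typed.
row = Purchase(id="7", purchase=12.0, tag="in-store")
print(row.id, row.purchase, row.tag)
```

The same annotations that feed the IDE's autocomplete double as the validation schema, so the two can't drift apart.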

Querying also makes use of these type definitions to define the fields that you can search. Searching relies on native Python operations so requests can filter for values:

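results = (
    session.query(MyObject)  # assumed entry point; the start of this chain is missing here
    .filter(MyObject.tag == 'in-store', MyObject.purchase > 5)
    .order_by_similarity(MyObject.embedding, search_vector)
    .all()  # assumed terminal call that executes the query
)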

Once the query executes, it'll cast the found database objects into instances of MyObject. It will also return the relevancy score returned by the vector similarity method. This lets you pass these ORM objects around your application logic, complete with IDE typehinting:

print(results[0].result.tag, results[0].score)

> in-store 0.05

The ORM masks a good amount of complexity on the backend for each provider, like casting types, field validation, and constructing the correct queries to the backend providers.
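
The filter syntax works because comparison operators can be overloaded to return a description of the comparison instead of a boolean. A minimal sketch of that idea follows. It is not the library's code; `Field` and `Condition` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Condition:
    """A single comparison captured from a Python expression (hypothetical)."""
    field: str
    op: str
    value: Any

    def to_expr(self) -> str:
        # Render as a simple filter string, e.g. tag == 'in-store'
        return f"{self.field} {self.op} {self.value!r}"


class Field:
    """Hypothetical stand-in for a schema attribute accessed on the class."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: Any) -> Condition:  # type: ignore[override]
        return Condition(self.name, "==", other)

    def __gt__(self, other: Any) -> Condition:
        return Condition(self.name, ">", other)


tag = Field("tag")
purchase = Field("purchase")

# Each comparison yields a Condition object rather than True/False.
conditions = [tag == "in-store", purchase > 5]
filter_expr = " and ".join(c.to_expr() for c in conditions)
print(filter_expr)
```

Each backend can then translate the captured conditions into its own query dialect.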

Why VectorDBs

Rather severe context length limitations in the current generation of LLMs have given rise to approaches like the ReAct model. In this design pattern you convert a user's query or the current context into an embedding, then retrieve the most semantically similar pieces of content from a vector database. These can either be documents in a search system or memories in a more general purpose chatbot.

There's a lot of movement in building the ideal vector database. Like most distributed databases there is usually some fundamental tradeoff between consistency, recall, and querying speed. The most popular right now are Pinecone, Weaviate, and Milvus but new ones are popping up all the time, each with a different claim about how it weighs the core tradeoffs in search recall.

Given different requirements as deployments grow, I see the actual database in large part as an implementation detail. As it stands right now the switching costs between databases are pretty high.

Why an ORM

The mental model for different vector databases is effectively the same, and very similar to conventional relational databases. You have a datapoint that has some metadata and is enriched with a vector embedding. You want to do some combination of INSERT and SELECT from this table, where SELECT queries involve both filtering for exact match data and finding similar vectors to some new embedding input.
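
That shared mental model fits in a few lines. The sketch below is a brute-force, in-memory stand-in for a vector database, assuming cosine similarity as the ranking metric. `Row`, `insert`, and `select` are illustrative names, not any provider's API.

```python
import math
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    tag: str
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


table: list[Row] = []


def insert(row: Row) -> None:
    """INSERT: store a datapoint with its metadata and embedding."""
    table.append(row)


def select(tag: str, query: list[float], limit: int) -> list[tuple[float, Row]]:
    """SELECT: exact-match filter on metadata, then rank by vector similarity."""
    candidates = [row for row in table if row.tag == tag]
    scored = [(cosine_similarity(row.embedding, query), row) for row in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


insert(Row(1, "in-store", [1.0, 0.0]))
insert(Row(2, "online", [1.0, 0.0]))
insert(Row(3, "in-store", [0.0, 1.0]))

top = select("in-store", [0.9, 0.1], limit=1)
print(top[0][1].id, round(top[0][0], 3))
```

Real backends replace the linear scan with an approximate index, which is where the recall and speed tradeoffs come from.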

Despite these similarities, each of the vector database providers has its own API structure, and these are largely incompatible with one another. As such each major project has to re-implement these backends manually for its own business logic to let the community plug and play with their favorite vectordbs.

An ORM naturally makes this easier by abstracting the complexities of backends from user-written application code. And so vectordb-orm was born. Like traditional ORMs it also allows for:

  • Improved code maintainability and readability by abstracting low-level database operations
  • Easy switching between different vector database providers without changing the application logic
  • Encouraging best practices and design patterns for working with vector data
  • Native typehints in your IDE when developing
  • (Future) Centralized optimizations for insert batching and search pagination

The Future

vectordb-orm is still quite new so it only supports Milvus and Pinecone backends at the moment. A few items on the roadmap for future versions:

  • Add support for additional databases. Weaviate and Redis are the next two on my priority list.
  • Support bulk insertion of input vectors for the providers that support them. This can significantly speed up the initial upsert time for requests that go over the wire.
  • Support more complex chaining of filters as backends allow. Allow OR and AND chaining to create more complicated predicates. For the providers that don't support these commands natively, provide a local implementation that fetches data and then post-processes locally.
  • Enhanced documentation and community support, including sample projects and tutorials.

If you give vectordb-orm a spin and have some thoughts on the API contract or missing functionality, I'm all ears.

[Read time: 4 minutes] April 13, 2023

The parameters of LLMs today have to jointly store facts as well as general reasoning logic.

When you ask the model to reason about something that it hasn't seen before, I consider that general reasoning. This could be some word-based predicate logic or otherwise executing rote tasks where experts commonly agree upon the end result. I've been pleasantly surprised at how well modern LLMs can reason - although it's certainly not infallible. For instance:

> If a hippo is the only creature in the world to have milk, and Shawn is currently drinking milk, where did it come from?

If a hippo is the only creature in the world to have milk, and Shawn is currently drinking milk, then it must have come from a hippo. However, it's important to note that this hypothetical situation is not reflective of reality, as many other mammals produce milk as well.

Of course, I assume hippo milk wasn't in the training data.

But you can also ask the LLM about real world situations, and many of the answers are at least loosely grounded in the truth:

> What are your favorite coffee shops in san francisco, ca?

As an AI, I don't have personal preferences or experiences. However, I can provide you with a list of popular and highly-rated coffee shops in San Francisco based on information available up to September 2021. Keep in mind that new establishments may have opened or others may have closed since then, so it's always good to check the most recent reviews and information.

1. Four Barrel Coffee - Known for their artisanal, single-origin roasts and hip ambiance.
2. Blue Bottle Coffee - A popular Bay Area-based chain, famous for their cold brews and pour-overs.
3. Ritual Coffee Roasters - Another local favorite, specializing in high-quality, ethically sourced beans.
4. Sightglass Coffee - Offers specialty coffees with a focus on sustainability and direct trade relationships.
5. Philz Coffee - Known for their custom blends, Mint Mojito Iced Coffee, and unique brewing methods.

While some of these are a bit too crowded for me, I can certainly back up most of the suggestions.

The model weights are serving dual purposes, clearly. No doubt what gives the model some portion of its reasoning ability is the ability to pattern match across a diversity of input examples. But let's try to decouple the two phenomena. There is likely a lottery ticket within the model weights that encodes some generalized pattern. It's the lottery ticket which supports the bulk of generalized reasoning, versus just ad-hoc pattern matching to its universe of observed examples.

If I were forced to choose a large model (hundreds of billions of parameters) that can reason and memorize, versus just a smaller model (hundreds of millions to low billions) that can reason - which do I pick?

I'd argue that the utility of having the model memorize data is actually relatively low, outside of some appealing tech demos.

  • Without being able to verify the accuracy of the original data source, it's difficult to ensure that the model is actually providing trustworthy information. Having source documents readily accessible lets you trace the provenance and place your trust in the original authors - versus in a blackbox.
  • Most industry-scale uses of LLMs are not going to rely on world-knowledge memorization. They're going to be reasoning across user documents, providing sentiment analysis, and extracting key terms. They're not going to be regurgitating facts.
  • Bigger models require more server-side resources, which affects their ability to do fast computation and increases latency from prompt to solution. When prompts are able to specify all input data in one-shot, this might be okay. But when chaining LLMs that are focused on different purposes, this quickly becomes untenable.

I'd much rather have some combination of interpretable and summarization systems when it comes to information retrieval. This follows the embedding database and search paradigm:

Input Query -> Vector Space -> Indexed Document Search -> Summarize Input Documents

Right now we can't have reasoning without memorization. But I think one day we will. Some combination of pruning parameters to identify the lottery ticket or model distillation might be enough to get us there, along with some creative training recipes. And how much will that shrink model size? Maybe enough to bring them on-device or onto the edge.

This reduction in model size could have a significant impact on the deployment of LLMs in real world applications. By making models smaller and more efficient, we could enable on-device or edge-based processing, which would drastically reduce latency and increase real-time responsiveness. This would open up new possibilities for LLMs in areas like robotics, autonomous vehicles, and other applications that require fast decision-making.

Currently, the reliance on centralized server hosting for large language models results in latency ranging from 5 to 20 seconds, which might be acceptable for one-shot queries but can quickly become a drain when manipulating queries or chaining multiple LLMs together. The more application developers can assume latency is near-zero (like we do for graphics rendering, quick matrix multiplication, and even CDN edge caches today) the more applications will use LLMs as a fundamental primitive.

[Read time: 2 minutes] March 30, 2023

I don't know if people have come up with a good acronym for Python services that compete with MERN or LAMP, but if they have then SQLAlchemy and Alembic are almost certainly included. SQLAlchemy (recently in version 2.0) makes it easy to define ORM schemas for database objects and Alembic keeps everything updated with automatically generated migration files.

If you're using this stack then you probably know the pain that code enums introduce. Declaring an enum requirement in a model is pretty straightforward:

from sqlalchemy import Column, Enum as SqlEnum

# Declared inside your model class; MyEnum is a standard Python enum.Enum
enum_field = Column(SqlEnum(MyEnum))

And Alembic will even pick up on the new enum creation:

def upgrade():
    op.add_column('my_table', sa.Column('enum_field', sa.Enum("A", "B", name='myenum'), nullable=True))

So far, so good. Unfortunately when you actually change this enum (as you know does happen) you're out of luck. Alembic ignores this enum value change even when it's outdated from the current database value. So this change:

# Before
class MyEnum(Enum):
    A = "A"
    B = "B"

# After
class MyEnum(Enum):
    A = "A"
    B = "B"
    C = "C"

Creates no diff:

def upgrade():
    pass

And will result in a database error if you actually try to use it.

(builtins.LookupError) C is not among the defined enum values. Enum name: myenum. Possible values: A, B
[SQL: INSERT INTO invitations...

Spoiler alert: We probably want to use it.
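
Without a plugin, the fallback is a hand-written migration. Here's a minimal sketch of the diffing step, assuming a PostgreSQL backend (where `ALTER TYPE ... ADD VALUE` appends to an enum) and SQLAlchemy's default of persisting enum member names. The helper `enum_additions` is hypothetical, and the database's current values are passed in by hand rather than queried.

```python
from enum import Enum


class MyEnum(Enum):
    A = "A"
    B = "B"
    C = "C"


def enum_additions(enum_cls: type[Enum], db_values: list[str], type_name: str) -> list[str]:
    """Return SQL statements for members present in code but missing in the database."""
    existing = set(db_values)
    return [
        # Member names are what SQLAlchemy stores by default for Enum columns.
        f"ALTER TYPE {type_name} ADD VALUE '{member.name}'"
        for member in enum_cls
        if member.name not in existing
    ]


# Pretend the database still only knows about A and B.
statements = enum_additions(MyEnum, ["A", "B"], "myenum")
print(statements)
```

In a real migration each statement would go through `op.execute(...)` inside `upgrade()`. Doing this by hand for every enum change gets old quickly.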

I stumbled upon alembic-autogenerate-enums, which is a neat approach to solve this problem. This lets you make changes to an enum value that's already inserted into the database and have alembic auto-generate the value migration commands:

poetry run alembic revision --autogenerate -m "add new enum value"

This will now result in the following:

def upgrade():
    op.sync_enum_values('public', 'myenum', ['A', 'B'], ['A', 'B', 'C'], [('simple_model', 'enum_field')], False)

def downgrade():
    op.sync_enum_values('public', 'myenum', ['A', 'B', 'C'], ['A', 'B'], [('simple_model', 'enum_field')], True)

Running the upgrade will add value C to the database enum specification without affecting previous values A & B. Downgrades to the previous alembic revision will strip this C value (assuming no existing database objects are using it) and restore state A & B.

The version 0.2.0 that I started using only had support for forward migration of enums (appending new values to the overall definition) but couldn't downgrade() to previous enum revisions. You usually only want to downgrade enum definitions in very limited circumstances, but still, it seemed like a good thing to add for locally testing schema changes. My PR hasn't yet hit PyPI but you can grab the latest from master.

[Read time: 7 minutes] March 20, 2023

GPT-4 was announced this past week. Some key takeaways from their paper:

  1. Greater performance on human tests (Uniform Bar Exam, LSAT, AP Calculus BC) in addition to ML benchmarks, showing greater logical reasoning ability and the capability to synthesize information across multiple academic domains
  2. Introduction of multimodal input where images can be included alongside text, and where text prompts can reference the content of the images themselves
  3. Greater sequence lengths available, with models for 8K tokens and 32K tokens compared to GPT-3.5's current 4K

Improvements in (1) and (2) speak for themselves. But personally I'm far more excited about the trends we're seeing in (3).

Historical Attention Limitations

The explosion of transformer architectures (BERT all the way to the GPT family) was facilitated through attention. Attention allowed for far richer modeling of long-term text dependencies, since the end of a sequence could directly reference the start of a sequence with no loss of representational power. Generative models use this same insight, with the outputted text able to consider all the text that is provided in the prompt.

Attention unlocked the key performance leap from LSTMs and RNNs, where both suffer from a local-window bias because of their need to back-propagate through time. The further back you go in a sequence the more the gradient decreases because of continuous multiplication, classically known as the vanishing gradient problem. This prevented such long-range dependencies from being learned during a regular training loop.
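
A rough numeric illustration of that repeated multiplication, assuming (purely for illustration) that every time-step scales the gradient by the same factor of 0.9:

```python
def gradient_after(steps: int, per_step_factor: float = 0.9) -> float:
    """Gradient magnitude after back-propagating through `steps` time-steps."""
    return per_step_factor ** steps


for steps in (1, 10, 50, 100):
    print(f"{steps:>3} steps back: gradient scaled by {gradient_after(steps):.2e}")
```

Real networks don't have a single constant factor, but the compounding effect is the same: a token 100 steps back contributes almost nothing to the update.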

Attention did have one core limitation. Whereas LSTMs can (theoretically) support an unlimited sequence length, attention models force a maximum sequence length up-front so they can add positional encodings and fit all time-steps in memory concurrently. This sequence length also isn't particularly long. It was 512 tokens for the original generation of transformers and has slowly been creeping into the thousands with larger server-based models like GPT. GPT-3 was capped at 2048 and 3.5 is 4096. This is roughly 3000 words assuming about 3/4 of a word per token. Since classic attention complexity grows quadratically with input length, it becomes progressively harder to increase the model's context window.
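
To put numbers on the quadratic growth, here's a back-of-the-envelope sketch. It assumes 2 bytes per score (16-bit floats) and counts a single attention score matrix for one head in one layer, so real memory use is a multiple of these figures.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_value


for seq_len in (512, 2048, 4096, 8192, 32768):
    mib = attention_matrix_bytes(seq_len) / 2**20
    approx_words = int(seq_len * 0.75)  # about 3/4 of a word per token
    print(f"{seq_len:>6} tokens (~{approx_words:>5} words): {mib:>7.1f} MiB per head per layer")
```

Doubling the context window quadruples the matrix, which is why going from 4K to 32K is a much bigger jump than the 8x in tokens suggests.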

Sequence Length In the Wild

Complex input domains (academic papers, financial reports, legal, code) typically have much larger windows of relevance. The preamble might provide term definitions for things that follow. The "Related Works" section might provide context for the advancements revealed later on. The author might sketch out two arguments in their own sections and link them together in the third.

Chat might be an even more salient example with the explosion of assistants like ChatGPT and Microsoft Copilot. A conversation naturally meanders - and to provide a human-like experience, a model needs to be able to cite information provided way earlier in the dialogue. The aggregate of earlier information typically dictates the conversation path to follow. Ideally, it would be able to consider all previous sessions you've had with the model; building up its internal knowledge of your habits and preferences to adapt its communication style over time.

Previous implementations to parse these domains would rely on hacks - some pre-processing model to extract high-likelihood text regions, information retrieval of top likelihood documents, or heuristics about the user's current state. The limitation of LLM sequence length constrained the model from being able to learn the best internal representation for these domains.

Linear Attention Models

As far as I've seen, OpenAI hasn't released technical details on how they're increasing sequence length for GPT-4. If I had to guess it's some clever partitioning of GPU resources, a lower level optimized attention head, and potentially some compression of the input bits - like switching from 32-bit or 16-bit floating point to 8-bit representations.

There's been a niche field of research into sub-quadratic or linear attention implementations that allow for much longer context windows to fit into memory. These largely have not caught on, as mentioned in a good survey of the field:

Many applications even in language and vision are still dominated by vanilla Transformers with quadratic attention and none of these xformer variants have caught on as the defacto standard. There might be multiple explanations from multiple angles for this phenomena... As noted by (Rabe and Staats, 2021), for applications that require to flex on sequence length and memory needs time to time, it might be suffice to ‘just sequentially process it’ even if that might not be inherently as satisfying as finding a theoretical approximate.

In other words, typically sliding the context window of 512, 2048, or 4096 is enough to model most problems. But some of this perspective no doubt stems from the fields in which ML is already applied, not fields which might benefit from it in the future. Most of the classic benchmarks of language models are short text classification or generation that cap out at a few thousand tokens.

Context windows are more important with the current breed of chatbots. As LLMs have to interface with a large amount of user chat history in one conversation, or follow instructions over large and complex task definitions, they need a greater view of user context.

I'm optimistic that the increase of data in fields that require long-context windows will encourage more of this research, and the deployment of these models into highly optimized LLMs. Existing linear attention models have mostly been technical proofs of concept, without shipping an end-to-end generative model like LLaMA or GPT. Some of this comes down to different goals. Many linear attention models are non-autoregressive because of their training process, so they can't do the text generation that has exploded over the past year. But others can and simply haven't benefited from the same computational resources as the LLMs being shipped from OpenAI and Facebook. I bet this will change.

The approach that's gaining the widest traction is Flash Attention. Unlike most of the other linear attention architectures, it provides a numerically exact implementation of attention that is still linear in memory usage (thanks to lower-level CUDA memory access optimizations). Most of the new OSS models are being trained with Flash Attention by default; it's easily swappable for classic attention without some of the performance drawbacks of approximate methods. Having the same API might be what the linear attention community needed to really take off.
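
To see why a tiled implementation can stay exact, here's a toy sketch of the streaming softmax trick that this style of kernel builds on. It's plain Python over a single query's scores with scalar values, not the CUDA kernel itself, and the function names are made up for illustration.

```python
import math


def full_attention(scores: list[float], values: list[float]) -> float:
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def streaming_attention(scores: list[float], values: list[float], block_size: int) -> float:
    """Same result, but only one block of scores is examined at a time."""
    running_max = -math.inf
    numerator = 0.0
    denominator = 0.0
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        block_weights = [math.exp(s - new_max) for s in block_scores]
        numerator = numerator * scale + sum(w * v for w, v in zip(block_weights, block_values))
        denominator = denominator * scale + sum(block_weights)
        running_max = new_max
    return numerator / denominator


scores = [0.1, 2.5, -1.0, 3.2, 0.7, 1.9]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
full = full_attention(scores, values)
streamed = streaming_attention(scores, values, block_size=2)
print(full, streamed)
```

Because the running totals are rescaled whenever a larger score shows up, the full score row never has to sit in memory at once, yet the output matches the reference up to floating-point rounding.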

Lossless attention is non-human, but maybe that's okay

The framing of quadratic attention is quite different from how humans think. It provides the entire sequence at one time, allowing the model to recover the exact content that was provided earlier in a sequence. Humans obviously don't have this level of precision. When you're having a conversation, you can remember generally what was said a few minutes ago, but you can't remember word-for-word. Brains do varying degrees of information compression; short-term memory has moderate compression to allow for higher fidelity and long-term memory has more aggressive compression to allow more content to be stored.

In ML terms humans are creating a continuous representation of the conversation. Let's call this R1. We contextualize new information I that we receive based on our current representation. This serves two goals: to understand the additional information even when not self-explanatory P(I|R1) as well as to update our internal representation for future turns R1 -> R2.

LLMs that use quadratic attention don't do this, which probably sets an upper limit on how large their context windows can grow. One benefit of linear attention research is that there's typically an intermediary stage (like kernel approximation) that does vaguely resemble the human compression process. But this compression process could be modeled more directly.

  1. We could model the representation state more explicitly by providing an external vector that captures session state. This vector necessarily has to be very high dimensional to handle all the subtleties of potentially pertinent information that happened previously. But in theory - and with enough training data - I believe a model can learn a useful compression into this intermediary state. It's a way to outsource some of the memory storage requirements without having to optimize attention directly.
  2. We could use existing LLMs to continuously summarize a longer session over time. Instead of verbatim appending new inputs to the session (I1 + I2), we continuously ask the model to generate a new R. This is guaranteed to keep the session plus input less than the maximum allowable sequence length, to allow for even longer context windows to be given in a text-only domain.

The second is much easier, since it layers on top of existing models. The drawback might be the same that occurs to people. Some concepts are more easily captured as abstract thoughts before pushing them into words. An abstract vector representation might provide that property better than forced language summarization.
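
The second option can be sketched as a small loop. This is a toy under stated assumptions: `summarize` stands in for a call to whatever LLM you'd use, and the stub here just keeps the most recent words so the example runs offline. Token counts are approximated by word counts.

```python
from typing import Callable

Summarizer = Callable[[str, int], str]


def update_representation(
    representation: str, new_input: str, summarize: Summarizer, max_tokens: int
) -> str:
    """R1 + I -> R2: fold the new input into a bounded running summary."""
    combined = f"{representation}\n{new_input}".strip()
    return summarize(combined, max_tokens)


def keep_recent_words(text: str, max_tokens: int) -> str:
    """Stand-in for an LLM summarization call (hypothetical): keeps the last words."""
    return " ".join(text.split()[-max_tokens:])


turns = [
    "My name is Sam and I prefer short answers.",
    "I'm planning a ski trip to Tahoe in February.",
    "What boots should I buy for a beginner?",
]

representation = ""
for turn in turns:
    representation = update_representation(representation, turn, keep_recent_words, max_tokens=12)

print(representation)
```

A real summarizer would decide what to keep based on relevance rather than recency, which is exactly where the lossy compression comes in.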


I'm particularly excited to see production models with growing sequence lengths, alongside increasing research into sub-quadratic attention. The business mania around NLP right now might be the push that we need to double down on longer context windows. The longer the input sequence the more useful the current generation of LLMs will be: information extraction, summarization, and basic inference. I'm excited to get my hands on gpt4-32k; and even more excited for the prospect of gpt5-1m. A man can dream, right?

[Read time: 8 minutes] March 8, 2023

In Reinforcement Learning there's a core tradeoff between exploration vs. exploitation. In a limited-time game, you have to choose how much you pursue novel paths and how much you stick with what you know. The latter becomes that much harder once you know that a policy mostly works. When do you decide to stop optimizing and call perfection the enemy of good?

Common wisdom says children explore while adults exploit. Children are willing to try new things in their spare time. Adults fall back on old hobbies and accept their limitations. At some point, we tend to transition from one to the other - perhaps because of risk intolerance, time limitations, or sheer laziness.

There's a body of psychology research into cultivating growth mindsets and how certain behaviors can help or hurt learning. At its core, the research shows that mindset is more important than neural plasticity. I'm sure that's right. But I feel like the hesitation to learn new things can be summed up more directly. Namely: it sucks to suck and you often have to suck in public.


  • There's often physical pain associated with learning something new. It's a core limbic function to avoid pain and undue risks. You fall during a new sport; your hands fatigue from playing guitar; you land a cut from a chef's knife chopping veggies.
  • There's always emotional pain with learning something new. You're good at the majority of things you do every day. Being back in a position where you're no longer good - where your body is no longer performing with the accuracy you've come to expect - takes a mental toll. You don't throw the ball where you want it; you can't play the note; your soufflé looks like a pancake.
  • Most hobbies are not mastered in private. You need to get out on the court and face another team; you need to play alongside another musician; you want to cook dinner for your family and not give everyone food poisoning.

I never learned how to ski growing up. We went on a handful of family trips to the mountain, but I would always end up re-learning the same pizza/french fry technique in a group lesson. Throughout the weekend I would get a bit better but then I would take off my skis and wait until the next season to put them on again. I never spent enough time accelerating into the fun area.

As it turns out - a ton of people at Stanford learned how to ski at some point. A lot learned on the east coast, some in Europe. We'd sometimes go to Tahoe as a dorm and I'd snowshoe instead of heading to the mountain. It didn't seem practical to learn again during a weekend. Especially with a higher center of gravity and exceeding the weight limit for the magic carpet.

This winter I decided to learn. I was living in Tahoe for the season since we found a good deal on a lease.[1] I bought a season pass, skipped the gym membership, and started going to the mountain as my main hobby.

I vividly remember my first run. I was at the top of an absolute cliff face[2] with two friends - both pro skiers. I was convinced I was going to fall on my face and roll the rest of the way down.

They skipped the beginner instructions and went straight to parallel skiing. Keep both skis tight together and alternate pressure from one foot to the other. Start with meandering turns almost perpendicular to the run. I followed their advice and got all of ten feet before landing in the snowbank beside us. After that, I not-so-gracefully got down the hill.

We looped those same two runs for the rest of the day as I got my ski legs under me. I'm sure to them it felt like we were circling a city block over and over again. Consistent scenery and not working up a sweat. For me, it still felt like I was balancing on toothpicks - some alien thing breaking the connection between me and the ground. But we got to have some laughs and catch up along the way.

At this point, twelve weeks later, I can get down any run. Some more gracefully than others. And I certainly have the thought "Dang, this is steep" more than a few times in a weekend. But sometimes you just have to swallow that feeling, face your skis down, and push off.

My takeaway from the past few months is that frequency is key. If you want to master technique quickly, you have to be on the mountain multiple days in a row and loop as many runs as you can. Infrequency is a non-starter.

I managed to carve out two hours a day to head to the mountain - filling the same slot that going to the gym typically does. On a weekday from 2-4, you're not battling any lines so you can loop a run for as long as you want. Ride up, ski down, and immediately do it again. I've gotten more skiing done in two hours on a Tuesday than I have in nine hours on a Saturday. Then I drive home and get back to work from 4-8. If you work remotely, weekdays are the move.

Another argument for frequency is pure economics. Skiing is prohibitively expensive for a day pass, even on weekdays. I've seen anywhere from $150-$300 depending on where you're going. The consolidation of passes into Ikon and Epic allows for a much lower cost per day, assuming you're committing to go frequently. For at least one season this seemed like the best way to get started and get comfortable.

I started off renting gear from Sports Basement, a local sports coop in the Bay Area. They have a package that rents skis, boots, poles, and a helmet. That worked fine while I was testing the water. But after day three my feet were killing me. Everyone I talked to said that bad boots were probably the culprit.

Most people say getting solid boots is the most important priority on the mountain. They last for years and give you more control of your skis. So - I took the plunge and bought boots. I went to a local shop where they measured my size and pulled a few different versions from the back. The whole process took two hours. They were thorough.

I was beyond thankful I didn't try to buy them online. The whole process felt a bit like going to Ollivanders in Harry Potter. The boot store owner gave major "The boot chooses you" energy. The nuance even in the same boot size can be significant. And plastic isn't the most comfortable material if it doesn't fit well.

I forced myself to ski with friends as much as I could. They dragged me on runs I had never done before - and where I wasn't positive I'd be able to get down. But they never got me into a situation where I was too far outside my ability horizon. And to that, I credit their understanding of my current capabilities. With maybe a dash of luck thrown in.

I've found that skiing is an activity where your ability quickly outpaces your confidence. You can probably ski most things, you just have to override the feeling of steepness along the way. You need to trust the people you're skiing with. I heard too many stories about people getting stuck on some rough terrain where they had to slide down for an hour. I made sure my friends were on the same page: let's go N+1 not N+10 on each step. With enough trust capital built, I was able to focus on form.

I also started bringing headphones for the times when I wasn't with other people. It let me catch up on a podcast, audio book, or just listen to music when I was on some solo lifts. Looping the same terrain doesn't provide novelty - but it does let you isolate the environment and just focus on technique.

I was told early on to "do every run twice and never end on a bad run." If you can get down one time, you can get down again. That one run is usually enough to rewrite your mental perception of steepness. And that confidence means more speed and more fun on the ride. I've made that a motto. Even after a few falls it's a good idea to get back up and try again.

There was a Red Bull documentary a few years ago about a famous motocross jumper. He said that on every run he took, he would have 10 escape lanes before he had to commit. Being able to bail early let him follow his intuition if something felt wrong. When he finally had to commit, he knew that he had plenty of opportunities behind him where he could have bailed. Having an out let him commit.

That's reflected in my own experience. When you jam on the downhill ski hard enough, you can stop on almost any terrain. Stress testing that theory - and stopping enough times along the way - built confidence in my intuition around speed. Learning how to stop allowed me to go fast.

I now love getting to the mountain - for the views, for the exercise, for the feeling of adventure. But it was also a good reminder about the process of learning something new. It was rough at times and it sometimes still will be for the foreseeable future. But that's part of the experience for which I'm grateful. Sucking in public really isn't that bad.

  1. This house currently has a caved-in ceiling because of the 50 feet of snow that hit Tahoe this winter. But it was nice while it lasted. 

  2. It was objectively a bunny slope.