Reasoning vs. Memorization in LLMs

April 13, 2023

The parameters of LLMs today have to jointly store facts as well as general reasoning logic.

When you ask the model to reason about something that it hasn't seen before, I consider that general reasoning. This could be some word-based predicate logic, or otherwise executing rote tasks where experts commonly agree on the end result. I've been pleasantly surprised at how well modern LLMs can reason - although they're certainly not infallible. For instance:

> If a hippo is the only creature in the world to have milk, and Shawn is currently drinking milk, where did it come from?

If a hippo is the only creature in the world to have milk, and Shawn is currently drinking milk, then it must have come from a hippo. However, it's important to note that this hypothetical situation is not reflective of reality, as many other mammals produce milk as well.

Of course, I assume hippo milk wasn't in the training data.

But you can also ask the LLM about real-world situations, and much of what it answers is at least loosely grounded in the truth:

> What are your favorite coffee shops in san francisco, ca?

As an AI, I don't have personal preferences or experiences. However, I can provide you with a list of popular and highly-rated coffee shops in San Francisco based on information available up to September 2021. Keep in mind that new establishments may have opened or others may have closed since then, so it's always good to check the most recent reviews and information.

1. Four Barrel Coffee - Known for their artisanal, single-origin roasts and hip ambiance.
2. Blue Bottle Coffee - A popular Bay Area-based chain, famous for their cold brews and pour-overs.
3. Ritual Coffee Roasters - Another local favorite, specializing in high-quality, ethically sourced beans.
4. Sightglass Coffee - Offers specialty coffees with a focus on sustainability and direct trade relationships.
5. Philz Coffee - Known for their custom blends, Mint Mojito Iced Coffee, and unique brewing methods.

While some of these are a bit too crowded for me, I can certainly back up most of the suggestions.

The model weights are clearly serving dual purposes. No doubt some portion of the model's reasoning ability comes from its ability to pattern match across a diversity of input examples. But let's try to decouple the two phenomena. There is likely a lottery ticket within the model weights, a sparse subnetwork that encodes some generalized pattern. It's the lottery ticket that supports the bulk of generalized reasoning, versus just ad-hoc pattern matching to its universe of observed examples.

If I were forced to choose a large model (hundreds of billions of parameters) that can reason and memorize, versus just a smaller model (hundreds of millions to low billions) that can reason - which do I pick?

I'd argue that the utility of having the model memorize data is actually relatively low, outside of some appealing tech demos.

  • Without being able to verify the accuracy of the original data source, it's difficult to ensure that the model is actually providing trustworthy information. Having source documents readily accessible lets you trace the provenance and place your trust in the original authors - versus in a black box.
  • Most industry-scale uses of LLMs are not going to rely on world-knowledge memorization. They're going to be reasoning across user documents, providing sentiment analysis, and extracting key terms. They're not going to be regurgitating facts.
  • Bigger models require more server-side resources, which slows computation and increases latency from prompt to response. When a prompt can specify all of its input data in one shot, this might be okay. But when chaining LLMs that are focused on different purposes, the delays add up and this quickly becomes untenable.

I'd much rather have a combination of an interpretable search system and a summarization system when it comes to information retrieval. This follows the embedding database and search paradigm:

Input Query -> Vector Space -> Indexed Document Search -> Summarize Input Documents
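
To make that flow concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names (`embed`, `cosine`, `search`, `build_prompt`) are hypothetical, and a toy bag-of-words vector stands in for a learned embedding model. The summarization step is left as a prompt that you would hand to whatever LLM you use.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts.

    Assumption: a real system would call a learned embedding model here.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Number the sources so the answer can be traced back to them."""
    sources = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved))
    return (
        "Answer using only these sources and cite them by number.\n"
        f"{sources}\n\nQuestion: {query}"
    )


documents = [
    "Four Barrel Coffee roasts single-origin beans in San Francisco.",
    "Hippos are large semi-aquatic mammals native to Africa.",
    "Blue Bottle Coffee is known for pour-overs and cold brew.",
]

query = "where can I get pour-overs coffee"
top = search(query, documents, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

The point of the design is that the facts live in the numbered sources, where you can check their provenance, while the model only has to reason over and summarize what it was handed.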

Right now we can't have reasoning without memorization. But I think one day we will. Some combination of pruning parameters to identify the lottery ticket, or model distillation, might be enough to get us there with a few creative training recipes. And how much will that shrink model size? Maybe enough to bring models on-device or onto the edge.
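
As a rough picture of what pruning parameters means, here is a sketch of one-shot magnitude pruning on a flat list of weights. It is only an illustration under simplifying assumptions: the `magnitude_prune` helper and the example weights are hypothetical, and actually finding a lottery ticket typically involves repeated rounds of pruning and retraining, which are not shown here.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# Hypothetical weights, chosen only to make the effect visible.
weights = [0.8, -0.05, 0.3, 0.01, -0.9, 0.02]
sparse = magnitude_prune(weights, sparsity=0.5)
print(sparse)  # [0.8, 0.0, 0.3, 0.0, -0.9, 0.0]
```

Half of the parameters are gone, and the bet is that the surviving large-magnitude weights carry most of the behavior you care about.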

This reduction in model size could have a significant impact on the deployment of LLMs in real-world applications. By making models smaller and more efficient, we could enable on-device or edge-based processing, which would drastically reduce latency and improve real-time responsiveness. This would open up new possibilities for LLMs in areas like robotics, autonomous vehicles, and other applications that require fast decision-making.

Currently, the reliance on centralized server hosting for large language models results in latency ranging from 5 to 20 seconds, which might be acceptable for one-shot queries but can quickly become a drain when iterating on queries or chaining multiple LLMs together. The more application developers can assume latency is near-zero (like we do for graphics rendering, quick matrix multiplication, and even CDN edge caches today), the more applications will use LLMs as a fundamental primitive.
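
A quick back-of-the-envelope calculation shows why chaining hurts. The sketch below assumes the calls run strictly one after another, so their latencies simply add. The 5 and 20 second figures are the range quoted above; the 0.2 second on-device figure and the chain length of four are hypothetical, picked only for comparison.

```python
def chain_latency(per_call_seconds: float, num_calls: int) -> float:
    """Total wait for a strictly sequential chain of LLM calls."""
    return per_call_seconds * num_calls


NUM_CALLS = 4  # hypothetical chain length

scenarios = [
    ("server, best case", 5.0),
    ("server, worst case", 20.0),
    ("on-device (hypothetical)", 0.2),
]

for label, per_call in scenarios:
    total = chain_latency(per_call, NUM_CALLS)
    print(f"{label}: {total:.1f}s for {NUM_CALLS} chained calls")
```

Under these assumptions a four-step chain takes between 20 and 80 seconds on a server, versus under a second at the hypothetical on-device speed.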
