# Legacy code and AI copilots

July 11, 2024

Code LLMs seemingly have a much more difficult task than freeform writing. Since programs need to conform to a language specification, code generation is incredibly sensitive to small issues. A missing comma or a missing argument leaves a program that simply doesn't run; correctness is often a binary outcome. When writing prose, the worst downside is a stilted paragraph or a few missing commas. Given that programming fluency is learned from training data alone (versus more formal guarantees like constrained grammars), it's miraculous they work as well as they do.

In addition to the core language[1], code LLMs also have to interoperate with an ecosystem of constantly changing dependencies. Packages ship different functions and syntax from one version to the next. To further complicate matters, LLMs are typically unaware of the dependency versions of the code they're being trained on; they take all source files as independent examples. GitHub's Copilot tries to address this at runtime by building a .h-like representation of the functions that can be called from the current file. But not everything can be expressed in function calls alone.
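
As a rough illustration of that idea (my own sketch, not Copilot's actual format), the trick is to condense a sibling module down to its callable surface before injecting it into the prompt:

```js
// lib/data.js: a full implementation that lives elsewhere in the project.
// (The module name and functions here are hypothetical.)
export async function fetchData(endpoint) {
  const res = await fetch(`https://api.example.com/${endpoint}`);
  return res.json();
}

export function formatTimestamp(ms) {
  return new Date(ms).toISOString();
}

// A header-style summary of the same module, the kind of context a copilot
// could inject: signatures only, no bodies.
//
//   async function fetchData(endpoint)
//   function formatTimestamp(ms)
```

The summary tells the model what it can call, but says nothing about which package versions those callees expect.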

Back when I was working on a few Next.js prototypes, I spent a lot of time using their new app-based router convention. I won't dive into all the differences since the topic at hand is language modeling, not JavaScript. But for reference, here's a quick comparison between the two approaches.

This is the page router:

```js
import { fetchData } from '../lib/data';

export default function Page({ serverData }) {
  return (
    <div>
      <h1>Server-side Rendered Page</h1>
      <pre>{JSON.stringify(serverData, null, 2)}</pre>
    </div>
  );
}

export async function getServerSideProps() {
  const data = await fetchData();

  return {
    props: {
      serverData: data
    }
  };
}
```

And this is the app router, first introduced in Next.js 13:

```js
import { fetchData } from '../lib/data';

export default async function Page() {
    const data = await fetchData();
    return (
        <div>
            <h1>Server-side Rendered Page</h1>
            <pre>{JSON.stringify(data, null, 2)}</pre>
        </div>
    );
}
```

The app router is clearer to write and lets you branch directly inside the view that's rendered, which can be convenient for more complicated interfaces. There's solid documentation available that walks through the most common use cases.
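
For instance, since the page itself is an async server component, you can branch on the fetched data before returning any markup. A minimal sketch building on the example above (the empty-state copy is just a placeholder):

```js
import { fetchData } from '../lib/data';

export default async function Page() {
  const data = await fetchData();

  // Conditional switch directly in the server component.
  if (!data) {
    return <p>Nothing to show yet.</p>;
  }

  return (
    <div>
      <h1>Server-side Rendered Page</h1>
      <pre>{JSON.stringify(data, null, 2)}</pre>
    </div>
  );
}
```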

Before GPT-4 was updated[2], the app router was too new a paradigm to be included in its training data. ChatGPT was therefore unable to generate app router logic, and even when I added examples to a prompt manually, it would have a higher incidence of generation errors than the equivalent page router code. But even now that it's included, it's dwarfed by the older corpus.

## Probing GPT's knowledge

The chat below is with the current GPT-4 version. It still defaults to page-router examples even when the app router is clearly specified:


Me: Write me a standard server and client boilerplate function for next.js's app based router. It should use server side rendering to render the page. Make sure to use the app based router.

GPT:

Sure! Below is a basic example of a Next.js app with server-side rendering (SSR) using the App Router. This example includes a server component, a client component, and a Next.js page that uses SSR to render the page.

First, let's set up the project structure:

```
/my-next-app
  /pages
    /_app.js
    /index.js
  /components
    /ServerComponent.js
    /ClientComponent.js
  /styles
    /globals.css
  package.json
  next.config.js
```

Step 1: Set Up the `package.json` File

Create a `package.json` file for your project if you don't have one:

```json
{
  "name": "my-next-app",
  "version": "1.0.0",
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start"
  },
  "dependencies": {
    "next": "latest",
    "react": "latest",
    "react-dom": "latest"
  }
}
```

Step 2: Install Dependencies

Run the following command to install the necessary dependencies:

```bash
npm install
```

Step 3: Create the Server Component

Create a file named `ServerComponent.js` in the `components` directory:

```js
// components/ServerComponent.js
import React from 'react';

const ServerComponent = ({ data }) => {
  return (
    <div>
      <h1>Server Component</h1>
      <p>Data from server: {data}</p>
    </div>
  );
};

export default ServerComponent;
```

Step 4: Create the Client Component

Create a file named `ClientComponent.js` in the `components` directory:

```js
// components/ClientComponent.js
import React from 'react';

const ClientComponent = () => {
  return (
    <div>
      <h1>Client Component</h1>
      <p>This is rendered on the client side.</p>
    </div>
  );
};

export default ClientComponent;
```

Step 5: Create the Custom App Component

Create a file named `_app.js` in the `pages` directory:

```js
// pages/_app.js
import '../styles/globals.css';

function MyApp({ Component, pageProps }) {
  return <Component {...pageProps} />;
}

export default MyApp;
```

Step 6: Create the Page Component with SSR

Create a file named `index.js` in the `pages` directory:

```js
// pages/index.js
import React from 'react';
import ServerComponent from '../components/ServerComponent';
import ClientComponent from '../components/ClientComponent';

const Home = ({ serverData }) => {
  return (
    <div>
      <ServerComponent data={serverData} />
      <ClientComponent />
    </div>
  );
};

export async function getServerSideProps() {
  // Fetch data from an external API or perform other server-side operations here
  const serverData = 'Hello from the server!';

  return {
    props: {
      serverData
    }
  };
}

export default Home;
```

`getServerSideProps` is not used within the app router convention; this example still follows the old page router format. That's no surprise, given the training data is heavily tilted towards the page convention, which was in the wild for far longer.

If you then follow up to correct the model, however, it does get it right:


Me: No, this is not the app convention. This is the page convention.


GPT:

Got it, I understand now that you are looking for the app directory-based structure that was introduced in Next.js 13. Here is a boilerplate setup using the app-based router with server-side rendering (SSR).

Step 1: Set Up Project Structure

Create the following project structure:

```
/my-next-app
  /app
    /page.js
    /layout.js
    /server-component.js
    /client-component.js
  /styles
    /globals.css
  package.json
  next.config.js
```

Step 2: Set Up the `package.json` File

Create a `package.json` file for your project if you don't have one:

```json
{
  "name": "my-next-app",
  "version": "1.0.0",
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start"
  },
  "dependencies": {
    "next": "13.0.0",
    "react": "18.0.0",
    "react-dom": "18.0.0"
  }
}
```

Step 3: Install Dependencies

Run the following command to install the necessary dependencies:

```bash
npm install
```

Step 4: Create the Server Component

Create a file named `server-component.js` in the `app` directory:

```js
// app/server-component.js
import React from 'react';

const ServerComponent = ({ data }) => {
  return (
    <div>
      <h1>Server Component</h1>
      <p>Data from server: {data}</p>
    </div>
  );
};

export default ServerComponent;
```

Step 5: Create the Client Component

Create a file named `client-component.js` in the `app` directory:

```js
// app/client-component.js
'use client';
import React from 'react';

const ClientComponent = () => {
  return (
    <div>
      <h1>Client Component</h1>
      <p>This is rendered on the client side.</p>
    </div>
  );
};

export default ClientComponent;
```

Step 6: Create the Root Layout

Create a file named `layout.js` in the `app` directory:

```js
// app/layout.js
import '../styles/globals.css';

export default function RootLayout({ children }) {
  return (
    <html>
      <head>
        <title>My Next.js App</title>
      </head>
      <body>{children}</body>
    </html>
  );
}
```

Step 7: Create the Page Component with SSR

Create a file named `page.js` in the `app` directory:

```js
// app/page.js
import React from 'react';
import ServerComponent from './server-component';
import ClientComponent from './client-component';

async function fetchServerData() {
  // Simulate fetching data from an external API
  return 'Hello from the server!';
}

export default async function Page() {
  const serverData = await fetchServerData();

  return (
    <div>
      <ServerComponent data={serverData} />
      <ClientComponent />
    </div>
  );
}
```

It takes additional nudging to encourage it to explore other areas of its weight space. And that's assuming that the models even have coverage of the latest frameworks within their training corpus. What if they don't?

## Crawling is key

LLMs are notoriously data hungry. That's true in general, but it's particularly important for keeping up with the latest nuances of programming libraries.

This is one place where Google still holds a pretty key advantage. I think it's undisputed that their search architecture is still the most sophisticated around; Jeff Dean co-created MapReduce just to handle the data volume. But an experienced team could quickly eat away at any particulars of their crawling advantage.[3] Even an implementation that's 90% as good would still be able to collect the majority of data that's available online.

But Google's advantage isn't just technical. It's also cultural and economically incentivized. Most websites want to be found by Google. Maybe that's for product discovery, maybe it's to expand the reach of high-quality content, or maybe it's paid SEO. Regardless of the reason, you usually want Google to crawl your websites because you want people to click into your domain and become legitimate traffic. Even if you don't love Google indexing your content, you'll usually put up with it anyway. The juice is worth the squeeze.[4]

Other crawlers (and I include crawlers for AI training data in this category) don't get the royal treatment. Most webmasters block them at the robots.txt level, enforce IP-based bans on the most common data center addresses, or institute more sophisticated fingerprinting to try to detect human presence. If you've ever tried to build a crawler, you'll notice that you get rate limited or blocked long before Google's crawlers ever would. The 66.249.x.x CIDR block really is a special gem.
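
The robots.txt side of that gatekeeping is a one-liner per crawler. A minimal example that welcomes Googlebot while turning away OpenAI's GPTBot and Common Crawl's CCBot, two commonly blocked AI-training user agents:

```
# robots.txt
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Of course, robots.txt is purely advisory; the IP bans and fingerprinting are what actually enforce it.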

Search engines might have the data advantage, but right now they're still training their models like everyone else. They have a batch training job that lasts for weeks or months, then they release the model. They'll repeat the exercise a few months later with new data and perhaps a new architecture. They're not actually leveraging their continuous crawling advantage in model development.

## RAG Retrieval

There is a camp of ML researchers who believe RAG (retrieval-augmented generation) is the right abstraction to solve this problem. RAG introduces a conceptual framework: figure out what the user is looking for, then present that data to the LLM alongside the original query. You can update the index independently, without having to update the model itself.

Most of this pipeline looks more like a classic information retrieval system than a modern LLM. It specifically combines an upfront pre-processing workflow with an inference workflow; a minimal sketch of both follows the two lists below.

Upfront:

  • Map your dataset (in this case a crawl of the Internet) into datapoints of raw text.
  • Embed this text into an embedding space that captures its core meaning, or different embeddings that capture different questions that the page might answer.
  • Save these embeddings in an embedding store that's fast to query with a nearest-neighbor search.

Inference:

  • The user comes to a chatbot with a question.
  • Generate relevant search queries for this question. These queries could be phrases or whole sentences. They can also be generated by a supplementary LLM or algorithmically via TF-IDF or BM25.
  • Embed these search queries into the same embedding space as the documents that were indexed above.
  • Retrieve the top k (typically 5-15) documents from the index and inject these into a prompt.
  • Rely on the LLM to leverage the contextual data that is relevant, or ignore it if it's not helpful to answer the user's query.
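
Here's that sketch as a minimal in-memory example. The `embed()` function is a toy stand-in for a real embedding model and the store is a plain array searched by cosine similarity; both are assumptions for illustration, not a production index:

```js
// Toy embedding: hash words into a fixed-size vector. A real system would
// call an embedding model here; this stand-in just keeps the sketch runnable.
function embed(text, dims = 64) {
  const vec = new Array(dims).fill(0);
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    let h = 0;
    for (const ch of word) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
    vec[h % dims] += 1;
  }
  return vec;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na * nb) || 1);
}

// Upfront: map documents to raw text, embed them, save them to a store.
const documents = [
  'The app router was introduced in Next.js 13 and uses async server components.',
  'getServerSideProps is the data-fetching hook for the pages router.',
];
const store = documents.map((text) => ({ text, embedding: embed(text) }));

// Inference: embed the query, retrieve the top-k nearest documents, and
// inject them into the prompt ahead of the user's question.
function retrieve(query, k = 2) {
  const q = embed(query);
  return store
    .map((doc) => ({ ...doc, score: cosine(q, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

const question = 'How do I fetch data with the app router?';
const context = retrieve(question).map((d) => `- ${d.text}`).join('\n');
console.log(`Context:\n${context}\n\nQuestion: ${question}`);
```

A real deployment swaps the array for an approximate nearest-neighbor index and the hash for a learned embedding, but the shape of the pipeline stays the same.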

As context windows get longer, we can fit more documents into the prompt. This lets the model attend directly to the source content and potentially cite it verbatim. That's in contrast to knowledge baked into the core model weights, which is both uninterpretable and fuzzier; models usually can't cite their original training data verbatim.[5]

RAG is only as good as your embedding space, however. You have to be sure the right documents are retrieved to help the model answer the question. Otherwise it risks confusing the generation more than it aids the answer.

## Stream-Based Training

I once went to a lecture by a tech lead on Google's speech-to-text project. He mentioned that Google received enough training data each day (either through in-house recording or through user-submitted samples) that they were able to train their models in a continuous loop: train the base model as a bulk job, then progressively fine-tune every day.

Unlike RAG, stream-based training avoids training a separate query generator and embedding model by baking that knowledge directly into the core model. It's also faster and cheaper at inference time, since it spends fewer tokens on pure context in the prompt. The downside is that it can produce more hallucinations that aren't supported by the original data; a proverbial crossing of wires between the conventions used by different framework revisions. But as models have gotten bigger and our training recipes have become more stable, I've seen a solid decrease in the frequency of hallucinations for common programming tasks. So I'm optimistic that we'll solve this problem more structurally, since it affects much more than just code generation.

Stream-based training is a largely untapped recipe in modern LLMs, but I can see it gaining traction over time as we converge on the optimal size/speed tradeoffs for most daily tasks. With the march towards subtle architecture differences and ever-larger weights, there's just not much incentive today to spend the time continuously training your model. Why burn the GPUs when you could be working on another model architecture that reaches SOTA in a few months anyway?


  1. Even there, there can be some large additions from one minor version to the next, especially in batteries-included languages like Python. 

  2. Currently fine-tuned on 2023 data; originally it was locked to a crawl completed in 2021. 

  3. Once something in CS is figured out the first time, it's always quick to figure out the second. 

  4. NYT, I'm looking over at you. 

  5. Although there is some evidence to the contrary. 
