[Read time: 7 minutes] August 20, 2023

flash-attention is a low-level implementation of exact attention. Unlike torch, which executes the attention matrix multiplications as separate operations, flash-attention combines them into a fused kernel, which can speed up execution by 85%. And since attention is such a core primitive of most modern language models, it makes for much faster training and inference across the board.

It now has an install time that's just as fast. Tri Dao (the main package author) and I recently added precompiled binaries to the python package. I'll say upfront: this particular implementation is a rather unorthodox use of wheels. Standard wheels can only depend on the operating system and python version; flash-attention requires the CUDA and torch versions as well. So it naturally required a bit of off-roading. This is a breakdown of the approach we took.

What's in a wheel

Many optimized libraries contain C or C++ extensions, which must be built at some point before Python can execute them at runtime. Python's had support for these for a long time: first the setuptools egg format and now wheels. Both let maintainers delegate compilation to a CI machine before clients install the package. They pre-build a version of the code - targeted for a specific OS and Python version - and push this to pypi alongside the raw code. One build, potentially millions of installs.

There are three main concepts for wheels:

sdist (source distribution): Raw code that is uploaded to pypi, so individuals can build from scratch if necessary. This is the fallback behavior when wheels are not available.

bdist (binary distribution): Compiled version of code, shipped as binaries. This could be wheels, but also .exe for Windows executables, .rpm for Red Hat package manager, etc.

bdist_wheel: a type of bdist that creates a .whl file. These wheels conform to PEP 427, which specifies the naming convention and archive format a built wheel must follow for it to be picked up by pypi and pip.
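The compatibility metadata lives directly in the wheel's filename. As a rough sketch of the PEP 427 convention (ignoring the optional build tag), a filename can be split apart like this:

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a PEP 427 wheel filename into its compatibility tags:
    {distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl
    """
    stem = filename.removesuffix(".whl")
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }

tags = parse_wheel_filename("numpy-1.25.0-cp311-cp311-manylinux_2_17_x86_64.whl")
print(tags["python_tag"], tags["platform_tag"])
```

pip compares these tags against the host interpreter and OS to decide which file to download; notice there's no slot for a CUDA or torch version.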

Solution Sketch

The flash-attention library is a Python wrapper over C++ and CUDA, so at install time it needs to compile itself for the current OS and installed dependencies. Here we have a 5D tensor of dependencies: OS, Python Version, Torch Version, CUDA version, and the C++11 binary interface.

The existing wheel installation behavior was so close to what we needed, but it couldn't quite be shoehorned. There's a hard assumption in pip that wheels will only vary by host operating system and python version, and that wheel filenames will be named accordingly. As such there's a lot of path-sniffing logic to determine matching resources. It would be near impossible to override all of these places without writing a fully custom wheel installer.

So let's take a step back. How is this whole process architected?

  1. pip / setuptools determine if there's a compatible resource in pypi
  2. If so, we install this version
  3. Otherwise, we use the sdist raw code to build the binary from scratch.

The build itself is handled through bdist_wheel, which implements the following logic for packages that aren't already built:

  1. Determine dependencies: OS, Python Version
  2. Construct a string for the wheel filename, specifying its dependencies
  3. run() the wheel building, which sets up the C-level compiler and builds the file in a temporary directory
  4. After running, determine if there were any errors in the build process
  5. If everything worked, move the filepath out of the built artifact into a permanent location

There might not be an easy way to modify the standard wheel installation logic, but there is an easy way to short circuit this build process. We target the third step. Instead of always running the build, we make it conditional: If a matching dependency is found, use that as the built file. If it's not, build from scratch. If we're clever about filepaths, from the rest of the bdist_wheel pipeline it will look like we just built the file. From there all downstream linking and installation should happen the same as if everything's completely local.

The Code

1. Install a custom cmdclass in setup()

class CachedWheelsCommand(_bdist_wheel):
    def run(self):
        if FORCE_BUILD:
            return super().run()
        ...

setup(
    ...,
    cmdclass={
        'bdist_wheel': CachedWheelsCommand,
        "build_ext": BuildExtension,
    } if ext_modules else {
        'bdist_wheel': CachedWheelsCommand,
    },
)

By default, setuptools will build wheels through the bdist_wheel command. It supports overriding the class that's used, however, by specifying an alternative command class in the setup() call. We keep the bulk of the logic but re-implement the behavior of the main runner.

We also keep an environment variable for force-building from scratch. This is useful if clients run into issues at install time, and it's also how CI uses our same setup.py file to force a new wheel build before it's pushed to the artifact repository.
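The flag itself can be as simple as an environment variable check at the top of setup.py (the variable name mirrors flash-attention's, but treat the snippet as illustrative):

```python
import os

# When set, always compile from source instead of fetching a prebuilt wheel
FORCE_BUILD = os.getenv("FLASH_ATTENTION_FORCE_BUILD", "FALSE") == "TRUE"
print(FORCE_BUILD)
```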

2. Determining Dependencies

class CachedWheelsCommand(_bdist_wheel):
    def run(self):

        # Determine the version numbers that will be used to determine
        # the correct wheel
        # We're using the CUDA version used to build torch, not the
        # one currently installed
        torch_cuda_version = parse(torch.version.cuda)
        torch_version_raw = parse(torch.__version__)
        python_version = f"cp{sys.version_info.major}{sys.version_info.minor}"
        platform_name = get_platform()
        flash_version = get_package_version()
        cuda_version = f"{torch_cuda_version.major}{torch_cuda_version.minor}"
        torch_version = f"{torch_version_raw.major}.{torch_version_raw.minor}"
        cxx11_abi = str(torch._C._GLIBCXX_USE_CXX11_ABI).upper()


In addition to the standard wheel dependencies, flash-attention requires a specific CUDA version, torch version, and C++11 ABI setting to run. We first parse the versions that are installed locally to make sure we're pulling a compatible wheel.

3. Conventional github source

BASE_WHEEL_URL = "https://github.com/Dao-AILab/flash-attention/releases/download/{tag_name}/{wheel_name}"

class CachedWheelsCommand(_bdist_wheel):
    def run(self):

        # Determine wheel URL based on CUDA version, torch version, python version and OS
        wheel_filename = f'{PACKAGE_NAME}-{flash_version}+cu{cuda_version}torch{torch_version}cxx11abi{cxx11_abi}-{python_version}-{python_version}-{platform_name}.whl'
        wheel_url = BASE_WHEEL_URL.format(
            tag_name=f"v{flash_version}", wheel_name=wheel_filename
        )
        print("Guessing wheel URL: ", wheel_url)


Since the package is already hosted on github, including the wheels in a release was pretty natural. Each file uploaded to a github release is addressable by the tag_name of the release it's attached to. We let the setup script guess this automatically using the current flash version specified in the versions file. The general artifact pattern is specified in BASE_WHEEL_URL.

From there, we use the dependencies to build up a conventional name for the wheel. The format of the actual wheel_filename doesn't technically matter. We just need to make sure CI builds the file to the same path, so it's uploaded properly to the github releases.

4. Download

class CachedWheelsCommand(_bdist_wheel):
    def run(self):
        try:
            urllib.request.urlretrieve(wheel_url, wheel_filename)

            # Make the archive
            if not os.path.exists(self.dist_dir):
                os.makedirs(self.dist_dir)

        except urllib.error.HTTPError:
            print("Precompiled wheel not found. Building from source...")
            # If the wheel could not be downloaded, build from source
            super().run()

While we could also parse the release page and determine artifacts that way, these github artifacts can be accessed directly from their URL. This is the most direct route to checking for the presence of a compatible wheel file that has been uploaded. If the URL resolves, we have a prebuilt file available and download it locally. If not, we fall back to the super's run() behavior, which proceeds to build the file from scratch.

5. Swapping the file

class CachedWheelsCommand(_bdist_wheel):
    def run(self):


            impl_tag, abi_tag, plat_tag = self.get_tag()
            archive_basename = f"{self.wheel_dist_name}-{impl_tag}-{abi_tag}-{plat_tag}"

            wheel_path = os.path.join(self.dist_dir, archive_basename + ".whl")
            print("Raw wheel path", wheel_path)
            os.rename(wheel_filename, wheel_path)


The rest of the bdist_wheel pipeline assumes that run() will write its artifact to a specific place. It uses the convention in wheel_path here to validate success and to move this file to other system paths. We copy this same naming pattern from where it's defined originally in the bdist_wheel class.


Writing to this convention ensures that for all downstream logic, the wheel is treated like a locally built artifact.


I've written custom cmd classes before, so I knew on some level that they were just executing arbitrary python code. But my mental model was still that they're only used for building local binaries. Doing surgery on the existing build_ext logic seemed out of the question.

But at the end of the day, the existing run() logic has a pretty simple API contract. Callers expect it to take in the raw source and write a binary file to a path. Everything else is left up to the implementation. cmdclass really is just a generic install hook for packages that need wheels; you can write as much logic as you'd like here and pip will run it alongside the install.

This custom wheel command decreased the package install time from a worst case 2 hours (on one core) to an easy 10 seconds. The remaining time just comes from having to download the wheel itself. It's certainly made our CI pipelines faster - hope it can also speed up your development workflow.

[Read time: 5 minutes] May 26, 2023

The most compelling bearish case I see for LLMs is that they'll plateau in performance, either because we'll saturate novel data from large scale crawling or because of some fundamental logical reasoning limitation of existing transformer architectures. "Hooray," the critics will cheer. We can go back to humans being in the driver's seat. Skynet will have to wait until another day.

That last part might be true. But even a bearish case for performance can still change a lot. Let's make three conservative assumptions:

  1. LLMs will be trained in largely the same way as they are today
  2. LLMs will only have knowledge of public information through large-scale text corpora
  3. Experts (either individually or a consortium) will still perform better than LLMs for professional grade tasks

In this world the real breakthrough with large language models might not be exceeding human levels of performance in a discrete task. Perhaps it's enough that they can approach human level performance in a variety of tasks. There might be more whitespace in intersectional disciplines than in aiming for true expert status in any one.

Jack of all trades

There's a good book by David Epstein that argues for breadth over depth for most pursuits in life. The subtitle of the book sums it up: "Why Generalists Triumph in a Specialized World." His illustrative examples are mostly focused on research in education, sports, and business. These arguments mostly boil down to two key observations:

  1. The majority of successful adults took a circuitous route to their professions. They almost all built up general skills in childhood before focusing on niche domains later.
  2. More problems than ever before require some intersectionality. They require people to take novel information or signal from across domains and apply them to new problems (think: data science x product, surgery x robotics, law x net neutrality).

Extrapolating a bit from these two points: Learning general skills allowed people to build up a mental model of the world. This applied both physiologically (precision while throwing a ball, fast twitch muscle fibers, etc) and neurologically (mental models for thinking about problems, collaboration, creativity). These skills were broad enough to be applied to their more specific field.

Large capacity generative models are by definition generalists. They are not limited to training data from a specific discipline; instead, it's the exposure to a whole variety of domains that seems to give them their positive emergent reasoning skills.

This broad training also means individuals can ask them about a whole range of areas, and they'll usually give an answer with some compelling validity. Even when hallucinating, the answers often sound plausible. They have internalized a lot of the vocabulary and facts of each domain. LLMs in many ways are a programmatic Jack of all trades.

The rare connective tissue between disciplines

Back when I was at university I was surprised by how little collaboration happened between departments. Even within one department people are often so separated by philosophical schools of thought that they continue to pursue their own research interests in shallow lanes.1 This is despite clear applicability and relevancy of external domains.

Even in ML, a lot of the most innovative work of the past ten years came from other fields. Optimization momentum in Adam came from an interpretation of physics; word embeddings came from linguistics and lexical semantics going back to 1957.

These are anecdotal experiences but it seems like the trend rings true. Most experts have spent their lives honing a skill in one particular discipline. They went deep to pursue some new course. And they needed to; there are enough smart people working on hard problems in each discipline that to truly do something novel, you have to go into the weeds.

But there are just not enough hours in the day to get a grasp of everything. An expert can't wade through arxiv to pattern match between multiple disparate disciplines in the hope of stumbling upon something that might help their own pursuit. And if they try, they are often faced with such differing terminology for the same concepts that they can miss a pattern that's otherwise hiding in plain sight. I'm reminded of the overloaded meanings for the k constant: everything from the spring constant in physics to the Boltzmann constant in thermodynamics.

Enter language models

Where humans can't pursue that breadth, neural networks might be able to. They have enough of that breadth codified into weights that they're able to regurgitate it on command. And if they're able to summarize facts from a particular discipline, there's no reason to think they might not be able to automatically mine similarities between two of them.

I'm convinced there is a sea of research questions (perhaps some quite meaningful) that are relatively low hanging fruit today. They've just been historically overlooked because they sit at the intersection of two disciplines, or require pulling on threads from the far ranges of two unrelated fields. Connecting these dots likely doesn't require a PhD. It doesn't require being a true world class expert. But it does require enough understanding of two subjects to pattern match in novel ways.

This fits with the above theses on the current limitations of LLMs. They might not exceed human performance but still could develop some breakthroughs - even if that breakthrough is just reframing the problem in understandable terms, and having a person run from there.

The key lifecycle of early-stage research is:

  1. Need identification: What are the biggest problems that are facing a given research area today?
  2. Literature review: Why has existing prior art in your field not solved for this problem?
  3. (External) What relevant streams of research exist from other disciplines?
  4. Frame hypotheses and experiment.

The third step of the cycle seems like the most obvious use of these broad models today. But with a bit more sophistication, there's no reason to think they can't run a broader loop of this lifecycle: research literature, develop additional questions, perform additional research, and hone from there.


LLMs may not be the best at everything, but their very nature as broad, generalist systems might just prove to be their biggest strength. And this isn't even a pie-in-the-sky idea reserved for the next generation of architectures. I'd be surprised if these models aren't already working behind the scenes in some PhD theses.

As the original saying goes, "A jack of all trades is a master of none, but oftentimes better than a master of one."

  1. This is one phenomenon that even artificial intelligence research does not escape. The school of ML that comes from a more classical logician heritage dismissed end-to-end neural networks for years. Now the tables have turned and it's the neural network practitioners that are largely ignoring the logicians. 

[Read time: 3 minutes] May 11, 2023

I took a computer vision course in college with a rather provocative professor. One of the more memorable lectures opened with a declaration: representations in machine learning are everything1. With a good enough representation, everything is basically a linear classifier. Focus on the representation, not on the network.

When we were simultaneously training neural networks with millions of parameters, I thought it a rather insane thing to say.

Technically he's right - of course. If you shove most of the difficult work into projecting an input into a numerical representation (one conditioned on the task you want to accomplish), you have by definition turned your problem into a separable one. Once you've solved your problem, you've basically... solved your problem. Technically speaking, the representations can collapse to one common set for True and one common set for False.
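A toy illustration of the claim: XOR is the classic problem no linear classifier can solve in its raw 2D form, but one hand-crafted representation feature makes it trivially separable.

```python
import numpy as np

# XOR truth table: no line in (x1, x2) space separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Add a single representation feature: x1 * x2
phi = X[:, 0] * X[:, 1]

# In the lifted space (x1, x2, x1*x2) a single linear threshold now works
score = X[:, 0] + X[:, 1] - 2 * phi
pred = (score > 0.5).astype(int)
print(pred)  # → [0 1 1 0], matching y exactly
```

All the hard work happened in choosing phi; the classifier itself is one inequality.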

It's interesting to think about this again with the recent wave of autoregressive models. The latency of the current generation of generative models comes from feeding every newly generated token back into the decoder - one by one by one. The longer your desired output, the longer it will take the model to generate the answer.

But if you can fit a lot of data into the initial request to the model, and manage to frame the problem as a binary classification, results can be delivered almost instantly. You still need the model to encode the initial string, but this is lightning fast: a few big tensor multiplications that can be parallelized accordingly. That's it. Results happen in one decoder step.
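The mechanics are model-agnostic: after encoding the prompt, one forward pass yields next-token logits, and classification is just a comparison of two entries. A minimal sketch (the logits and token ids here are made up):

```python
import math

def one_step_classify(next_token_logits, true_id, false_id):
    """Decide True/False from a single decoder step by comparing the
    logits of the two candidate answer tokens (softmax restricted to them)."""
    t, f = next_token_logits[true_id], next_token_logits[false_id]
    p_true = math.exp(t) / (math.exp(t) + math.exp(f))
    return p_true >= 0.5

# Pretend logits from one forward pass; ids 0 and 1 stand in for "True"/"False"
logits = [2.1, -0.3, 0.5]
print(one_step_classify(logits, true_id=0, false_id=1))  # → True
```

No sampling loop, no token-by-token feedback - the cost is one encode plus one comparison.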

Representations are hidden to the average user of a language model but they're hiding just underneath the surface. It's the representation that lives within a layer or two short of the projection head, right before the network has to decide the next word to output. It has a sense of what you're intending to do; at least in so far as that intent minimizes the perplexity of what comes next.

Interestingly, encouraging the model to "think" or "criticize" itself in writing helps to refine these representations further. Encouraging some text-based linear reasoning can be enough to arrive at a right answer. This is true even though jumping straight to predicting the output might result in a wrong one.

Another perspective on LLMs is that they're universal representation agents, able to condition a good representation on the goals that you want to achieve. Once it has that internal state, the eventual projection to [True, False] is the easy part. Representations might not be everything, but they come pretty close.

  1. Representations being the numerical equivalents of real world objects. Neural networks operate on numbers; not words or images. So before a network can even start learning, you need to make some choices about how to turn those words into floating points (bigram, character-level, wordpiece, byte-pair encoding, etc). 

[Read time: 7 minutes] April 28, 2023

I was considering a quick weekend project to route Siri requests to ChatGPT. My immediate thought was pretty simple based on my existing mental model for Siri as a local<->server pipeline:

  • Set up a MITM proxy or custom DNS server
  • Intercept outgoing Siri requests and route them to a backend LLM
  • Respond with a correctly formatted payload, potentially with some additional utilization of the widgets that are bundled in iOS for weather or calculated responses

That required me seriously poking around with Siri for the first time in a couple years. A lot has changed since I last took a look.

Everything's Local

Since iOS 15, all Siri processing is done locally on device. There's a local speech-to-text model, a local natural-language-understanding module, and a local text-to-speech model. The local NLU module surprised me the most. All logic appears hard-coded and baked into the current iOS version. It'll respond to known tasks like the weather, setting a timer, converting weight, looking up a word, and sending a text message. For all other requests it will open up a web view and search your default search engine for a result. It doesn't attempt to create a text response to queries that fall out of its known action space.

To confirm this behavior I set up a proxy server and started capturing requests. Making a Siri request indeed issued no external requests - until I asked for something that requires current world knowledge. Asking What's the Weather? routes a specific weather request to Apple's SiriSearch backend through PegasusKit, a private framework that contains some miscellaneous utilities for image search and server communication.

No Visible Conversation History

One of the original Siri announcement demos was reading a text message, checking for conflicting appointments, and responding to the text message. This demonstrated some contextual understanding - discussing a topic, having a sidebar, and going back to the same topic again. It was impressive because it was similar to how humans communicate. Because we have a robust context memory, we can use sentences that drop the subject, object, or verb because the meaning can still be inferred from what was said before.

On previous versions of iOS, the logical object in Siri was one of these conversations. You'd hold down the home button and Siri would take over the screen. New requests would pop to the top, but you could scroll up to reveal past requests in the same session. The new Siri removed support for these conversation flows. But the underlying logic is still there, as evidenced by requests that do reference previous context:

How's the weather in San Francisco?
How about in Oakland?

This works - it successfully knows we're asking about weather. It's just that the interface for previous prompts is hidden. The new logical object in Siri is intended to be ad hoc questions.

An aside on old NLU

The previous generation of personal assistants had control logic that was largely hard-coded. They revolved around the idea of an intent - a known task that a user wanted to do, like sending a message or searching for weather. Detecting this intent might be keyword based or trained into a model that converts a sequence to a one-hot class space. But generally speaking there were discrete tasks, and the job of the NLU pipeline was to delegate them to sub-modules. If it believes you're looking for weather, a sub-module would attempt to detect what city you're asking about. This motivated a lot of the research into NER (named entity recognition): detecting the specific objects of interest and mapping them to real world quantities - city:San Francisco and city:SF both to id:4467, for instance.

Conversational history was implemented by keeping track of what the user had wanted in previous steps. If a new message is missing some intent, it would assume that a previous message in the flow had a relevant intent. This process of back-detecting the relevant intent was mostly hard-coded or involved a shallow model.
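The pipeline the last two paragraphs describe - keyword intent detection, entity lookup, and falling back to the previous turn's intent - can be sketched in a few lines. The keyword lists and the Oakland id are invented for illustration; only the id 4467 mirrors the example above:

```python
INTENT_KEYWORDS = {
    "weather": ("weather", "forecast", "temperature"),
    "message": ("text", "message"),
}
# NER output mapped to real-world ids
CITY_IDS = {"san francisco": 4467, "sf": 4467, "oakland": 4411}

def detect_intent(utterance: str, previous_intent=None):
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return intent
    # Back-detection: inherit the intent from earlier in the conversation
    return previous_intent

def resolve_city(utterance: str):
    text = utterance.lower()
    for name, city_id in CITY_IDS.items():
        if name in text:
            return city_id
    return None

turn1 = detect_intent("How's the weather in San Francisco?")
turn2 = detect_intent("How about in Oakland?", previous_intent=turn1)
print(turn1, turn2)  # → weather weather
```

The second turn carries no weather keyword at all; the shallow "memory" is just a carried-over variable.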

With increasing device processing and neural inference, all of these models could be brought to the edge. So why not?


I don't know the internal reason why Apple chose to roll out local Siri processing in iOS 15 - but we can loosely speculate. The first beta was released at WWDC in June 2021, which means work on the local migration probably started around a year prior, in June 2020. GPT-3 was released at nearly the same time: June 2020. Before that point generative models were still pretty niche; their main strength was generating cohesive text, not logical reasoning or reliable output. The risk of malicious output was too high and there was no clear roadmap to decreasing hallucinations or increasing logical abilities.

So, given this landscape, I imagine Apple had two key motivations:

  1. Getting Siri on a local device would decrease latency and increase its ability to function offline.

    Those are big wins for a platform that often forced users to wait longer for a server response than it would take to do the task themselves. Speech-to-text and text-to-speech models were getting good enough to deploy on the edge, with inference fast enough to happen in realtime. And Siri's business logic itself was always a relatively simple control system, so it would be easy enough to implement locally. There was no need to keep this pipeline on the server.

  2. Privacy

    Apple has long tried to push more processing to the edge to avoid sending data to their servers when avoidable. Object detection in photos happens locally, encrypted iMessages are routed through a central routing system but otherwise sent directly to devices for storage, etc. Siri was a hole in this paradigm - so if Apple could push it to the edge, why wouldn't they?


The new generation of self-supervised LLMs have almost nothing in common with this previous generation of NLU models. They may support task delegation through something like ChatGPT Plugins or LangChain, but their control logic and subsequent follow-ups are all emergent properties of the training data. They don't limit their universe of responses to known intents, which has proven incredibly powerful both in their ability to respond in natural language and in their ability to bridge logic across multiple sub-systems.

Apple's in somewhat of a bind here. On one hand - they made a switch to local devices to improve offline support and improve privacy. On the other - the new generation of LLM models are drastically better than the NLU approaches of previous years. They support more functionality and better reasoning than the systems that came before.

Can't Apple just implement a new backend to Siri using LLMs? There's been a lot of movement in compressing LLMs onto laptops and phones using bit quantization. The phone POCs have focused on the 7B or 11B Alpaca models because of memory requirements (and almost certainly inference computation speeds). This is in the ballpark of the GPT3.5 model powering ChatGPT (at 20B) but a far cry from GPT-4's 1T parameters 1.

At least until we improve model distillation and quantization, we can assume local models will always be a generation behind server hosted versions. And people are perfectly willing to use server processing to access the latest and greatest models for both personal and business use2. 11B models are useful; 1T models are super useful; 5T models will probably be even more so - although with some diminishing returns to scale. Privacy might take a backseat to processing performance.

I have no doubt that Apple is working on a local generative architecture that can back future versions of Siri. I'd actually put money on them rebranding Siri in iOS 17 or iOS 18 and dropping the legacy baggage. The real question in my mind is how Apple will weigh higher cognitive performance (server-side only) against more privacy (local only).

This is how I'd roadmap a feature rollout like this:

  1. V1. Re-introduce a server-side processing model. Speech can be converted into text on-device for speed of device->server text streaming, but the LLM processing logic should be on the server.
  2. V2. Allow 3rd party applications to provide a manifest with their own API contracts. Define what each API endpoint does and the data that it requires to work. If the LLM detects that these applications are relevant to the current query, route the parsed data into the application payload and send it back to the device.
  3. V3. Add a local model to the device that's used when offline, routing to the server-side model when users have the bandwidth.
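For the V2 step, a 3rd-party manifest might look something like this - every app name, path, and field here is invented for illustration:

```python
import json

# Hypothetical manifest a weather app could register with the OS assistant
manifest = {
    "app": "WeatherNow",
    "endpoints": [
        {
            "path": "/forecast",
            "description": "Current conditions and 5-day forecast for a city",
            "params": {"city": "string", "units": "metric | imperial"},
        }
    ],
}

# The assistant's LLM would match a parsed query against these descriptions,
# fill in the params, and hand the payload back to the device
print(json.dumps(manifest["endpoints"][0]["params"]))
```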

OS integration is certainly where we're headed. I'll hold my breath for WWDC this year to see if any of these dreams are realized, and what it looks like when they are.

  1. At least according to Semafor. There's been some public debate about how many parameters GPT-4 actually contains. 

  2. Quoting the linked, "ChatGPT is the fastest growing service in the history of the internet. In February 2023, it reached the 100 million user mark." 

[Read time: 7 minutes] April 27, 2023

Minimum Viable Products (MVPs) are popular in startups because they allow testing of underlying risks without needing to build a full solution. In hardware companies, MVPs demonstrate that a solution can be physically built given specification constraints. In software companies, they prove that a problem is legitimate and there is consumer interest in the solution. Once you get the initial signal, iterate. Move fast and break things. The other clichés also apply.

Public infrastructure obviously isn't treated the same way. You can't iterate when you're building huge things. You also can't tolerate failure in the same way. You don't want a bridge constructed in a month only to fall down the year after. The bulk of the bureaucracy for infrastructure is making sure projects meet this bar of safety; safe to use, safe to be around, and safe for the environment.

But we do have minimum viable infrastructure: situations where we have some basic infrastructure but it's simply not good. You can check the boxes on a tourism advertisement and that's about it. It doesn't actually solve for the key needs that the infrastructure seeks to address but it's typically visible enough for people to be aware of its existence.

As one representative case, a staggering number of people I speak with in San Francisco categorically refuse to ride the Muni, which services the bus and lightrail lines in the city. Common complaints include reliability, noise, safety, and the inconvenience of switching lines. The drawbacks are so severe they've opted to avoid public transit altogether. If their destination is nearby, they'll walk; otherwise, they'll call an Uber or buy a car1.

But as someone who still rides Muni despite its problems, I'll be the first to say it's really not that bad. And given my own hesitations, I'm always surprised that I often end up being its strongest defender. But the reputation of public transit in the Bay Area is unfortunately so deeply buried in the trash can that it would be a Herculean task to lift it out. Only 57 percent of Muni riders rate its overall service positively; and that's only considering people that actually ride the Muni.

That's bad. It causes a negative flywheel - fewer people ride transit, resulting in less public support for funding transit, which in turn leads to service interruptions or cuts, and even fewer riders. The cycle continues.

Irritatingly, nothing's actually being done about it. There needs to be more investment in public transportation around US metro centers to make it legitimately useful, but I'm sure voters would balk at the pricetag to fund the changes that are actually required. For comparison, here's a non-scientific comparison of SF and a few other cities2:

(All figures in USD.)

San Francisco
  Operating budget: $1.3B
  Revenue: $219M (usage fares) + $361M (parking fees)
  Notes: Shortfall made up by the city general fund and state operating grants; large capital projects funded by proposition funds
  Latest project costs: $300M (Van Ness Bus Lanes), $1.95B (SF Central Subway)

London
  Operating budget: $9.82B
  Revenue: $11.32B
  Notes: $925M in capital renewals, $518M in net interest costs
  Latest project costs: $25B (Elizabeth line), $435M (DLR Train Upgrade)

Copenhagen
  Operating budget: Unknown
  Revenue: Unknown
  Latest project costs: $2.3B (M1 & M2 Subways), $3.8B (M3 Subway), $492M (M4 Subway)

Auckland
  Operating budget: $2.29B (amortized budget for the decade)
  Revenue: $2B

Muni's operating budget is $1.3 billion annually. What's relatively rare is the composition of that budget: a large gap between what it costs to run and the revenue brought in by ridership fares. Even when you include parking revenue, the SFMTA still relies on the city and state to subsidize the remaining $720 million shortfall. Imagine increasing the Muni budget by 2x to match Auckland, or 8x to match London. Based on current revenue, almost all of that increase would have to be funded by additional tax revenue. Most voters would shrug and ask why. They're not going to use it anyway.
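The arithmetic behind those figures is simple enough to sketch. This is a rough back-of-the-envelope using the numbers from the table above, not official accounting:

```python
# Figures from the comparison table above (USD)
muni_budget = 1.30e9       # SFMTA annual operating budget
fare_revenue = 219e6       # usage fares
parking_revenue = 361e6    # parking fees

# Gap that city/state subsidies must cover
shortfall = muni_budget - (fare_revenue + parking_revenue)
print(f"Shortfall covered by subsidies: ${shortfall / 1e6:.0f}M")  # $720M

# Rough multipliers to match other cities' operating budgets
for city, budget in {"Auckland": 2.29e9, "London": 9.82e9}.items():
    print(f"{city}: {budget / muni_budget:.1f}x Muni's budget")
```

The multipliers come out to roughly 1.8x for Auckland and 7.6x for London, which I've rounded to 2x and 8x above.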

Let's take away SF's minimum viable infrastructure for a second. Let's imagine that the city had no public transit. None at all. No buses, no light rail, no subway, no anything. Just cars on the winding hills for as far as the eye can see. It can still have a Caltrain terminal in Mission Bay; that I'll give to you.

Given San Francisco's progressive tendencies and its wealthy tax base, the absence of public transit would certainly be deemed unacceptable. Task forces would be formed, activists would get involved, and before you know it there would be a proposal on the local ballot. The outcome would be one of two things:

  1. Subsidize ridesharing, like Uber or Waymo
  2. Invest in public transit

I imagine the bureaucratic headache of the first approach would be severe but not entirely insurmountable. More problematic is the congestion: rideshare services would need more cars to meet the demand, and those cars would necessarily make traffic worse. Public transit has the benefit of bypassing individual cars, either in dedicated lanes or underground, so it can be a net win for reducing traffic and getting people places faster.

So let's assume SF chooses to invest in a serious public transit works project. How do you pay for it? Most likely by levying additional property or sales taxes, or by selling bonds against a future return on investment. I'd bet voters would pass that proposition in a heartbeat, even if it ended up being 2x, 5x, or 10x Muni's budget today. The unknown whitespace of something new (dare I say, going from 0 to 1) creates excitement. The promise of a better future for a currently broken system just doesn't deliver in the same way.

The danger of having bad infrastructure at all is that people conclude it can never get good. There's nothing physically intractable about building good infrastructure in San Francisco, or in the US more broadly; it's a question of policy and funding. But I do believe those policy issues are intractable without a hard fork of the current system.

Let's instead play this out:

  1. Modify California's stringent (and often unnecessarily litigious) environmental review for certain public transit projects. Have a fixed public comment window and then stop accepting roadblocks.
  2. Take two billion dollars (the same amount SF just spent on one 2-mile subway extension) and invite a public competition for digging tunnels and building stations. The Boring Company is an obvious candidate, but let the city put its innovation where its mouth is. I'm sure the proptech ecosystem would be thrilled to throw its hat into the ring.
  3. Create a new public-private company to manage the new effort. Cap the staffing at 20 of the best people you can find. Pay them well above top of market, contingent on meeting certain project milestones.
  4. Don't discount the public-facing brand. Before any concrete is poured, people will look to the graphic design for signs that this effort is a break from the same bureaucratic process that gave us Muni, BART, and Caltrain. You have some good material to tap if you're looking for inspiration.
  5. Make it go where people want to go outside of working hours. Connect Crissy Field with the Financial District. Connect the Mission with the Sunset. Connect the Richmond with the Embarcadero.
  6. Dig. Dig out of the limelight. Dig without having to close Van Ness for 5 years. But whatever you do, just dig.
  7. Don't open it prematurely. It's better to exceed expectations than to fall flat. First impressions are lasting. Build momentum, keep building excitement.

Then finally, in 5 years, or 10 years, release it to the public. Maybe time the opening to the first flower blooms of spring. Bypass rush-hour traffic and get from your office building to the ocean shore. Go out in Hayes when you live in Potrero and pop back home. For that vision, 2028 or 2033 doesn't even sound that far away.

The core of my point - and unfortunately I fear it may be true - is that we need a reset on public transit. For all the promises of finally reinvesting in infrastructure development, we're not doing the best job. Voters need to be handed a more comprehensive choice: a package of law changes that make it easier to build, the establishment of leaner agencies to conduct that building, and a better marketing message to the people: "If you fund this, we're going all in. It'll be worth the wait."

There's nothing minimum viable about that.

  1. When I really pushed, most people said they last tried to ride the Muni 5 years ago. Some admitted to never trying to ride it at all. Its damaged reputation so far preceded it that they just wrote it off entirely. 

  2. I looked around for a centralized source that aggregated funding, revenue, and large projects but came up short. Digging into the raw data, I was surprised to find how variable accounting models are for transit around the world. There are a lot of subtle differences in how revenue sources are reported, what is drawn from the city budget versus increases in tax rates, and how large infrastructure projects are green-lit and funded. If anyone has more updated or consistent data, please shoot it my way.