Independent work: November recap

# December 22, 2022

Quick update for the second month of independent work. I've been working on many of the same projects as last month, in addition to a couple of new ones.

## Dagorama

Dagorama is a new open source project I'm starting to use in more deployments. It's not quite ready for prime-time but I see its promise.

## The Motivation

Background processing jobs are often dependent on one another. You need to bundle up multiple datapoints for GPU batch processing. You need to crawl a group of webpages, then do a similarity measurement between them.

Whenever I ran into these situations, I'd usually shove them into an existing Celery or Kafka system. But those tools are built for queues. Once you escape the queue and start dealing with dependencies between input datapoints, you're really looking at a fully fledged computation graph instead of a task queue.

Airflow: I've used it extensively at a previous job. You specify workflows as programmed flows, but these have to be curated ahead of time; something like dynamically branching jobs depending on the current state isn't possible. The requirement to run a full scheduler, webserver, and metadata database also makes it too heavy for smaller deployments. And the Python API, while improved in recent versions, still feels more like configuration than actual programming.
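
As a point of comparison, here's a minimal sketch of that configuration-first style (assuming Airflow 2.x; the task names are illustrative):

```python
# Minimal Airflow 2.x DAG: the graph is declared up front at parse time,
# which is what gives the API its "configuration" feel.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def crawl():
    print("crawl pages")


def compare():
    print("compute similarity")


with DAG(
    dag_id="crawl_and_compare",
    start_date=datetime(2022, 12, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    crawl_task = PythonOperator(task_id="crawl", python_callable=crawl)
    compare_task = PythonOperator(task_id="compare", python_callable=compare)

    # Edges are fixed when the file is parsed; branching on runtime state is awkward.
    crawl_task >> compare_task
```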

Dask: Works, but it's a relatively heavy requirement. It feels like you're programming a computation graph because, well, you are. While it's solid for parallel computing and has great NumPy/Pandas integration, it's primarily focused on data processing rather than general workflow orchestration. The distributed scheduler is also relatively complex to configure and monitor. Good in development, harder in production.
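
For flavor, here's a small dask.delayed sketch (the function names and similarity metric are placeholders) showing how explicitly you wire up the graph:

```python
# dask.delayed builds the task graph lazily; nothing runs until .compute().
from dask import delayed


def crawl(url):
    return f"contents of {url}"


def similarity(pages):
    return len(pages)  # placeholder metric


urls = ["https://example.com/a", "https://example.com/b"]
pages = [delayed(crawl)(url) for url in urls]
score = delayed(similarity)(pages)

print(score.compute())  # the graph only executes here
```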

Ray: Powerful for distributed computing, but clearly designed for enterprise use cases. It requires complex cluster setup and brings along a heavy dependency footprint. The actor-based model also often requires restructuring existing code to fit its paradigm. For simpler workflows, it feels like using a sledgehammer to crack a nut.
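
Even the task-based (non-actor) path asks you to restructure code around remote calls and futures; a minimal sketch with placeholder functions:

```python
# Every function becomes a remote task, and even a local run spins up
# a Ray "cluster" with its own worker processes.
import ray

ray.init()


@ray.remote
def crawl(url):
    return f"contents of {url}"


@ray.remote
def similarity(pages):
    return len(pages)  # placeholder metric


futures = [crawl.remote(url) for url in ["https://example.com/a", "https://example.com/b"]]
score = similarity.remote(ray.get(futures))
print(ray.get(score))
```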

The common theme across all of these cons was a lack of simplicity. I wanted the benefits of well-defined background processing, but easier to spin up for smaller projects. I wanted something:

  • Python-native: Client code should work with modern Python best practices, type-hinting, linting, etc.
  • Lightweight: One simple Python dependency, self-contained, with code that's easy to read end to end to figure out what is going on.
  • Zero configuration: Works out of the box and scales from 0-1.

This led me to write a POC of Dagorama. It works with simple @actions decorators that can be attached to any function, with data flowing from one to the next. It requires a single-machine broker (similar to the Redis model) to track the state of the computation. It's easy to deploy in Docker and to mirror the remote setup locally.
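
To make that concrete, here's a rough sketch of the shape of such a workflow. This is illustrative only: the decorator name, import path, and signatures below are hypothetical stand-ins, not Dagorama's actual API.

```python
# Hypothetical sketch of a decorator-driven DAG; none of these names are
# Dagorama's real API.
from my_dag_library import action  # hypothetical import


@action
def crawl(url: str) -> str:
    return f"contents of {url}"


@action
def similarity(pages: list[str]) -> float:
    return float(len(pages))  # placeholder metric


def run() -> None:
    # Calling a decorated function registers it with the broker; downstream
    # actions consume the results as they become available.
    pages = [crawl(url) for url in ["https://example.com/a", "https://example.com/b"]]
    print(similarity(pages))
```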

## When to ship?

One question I've debated is when to officially ship. The repo is certainly at an MVP stage: it works in local testing and on a small cluster I'm using for development. I've decided my requirements for an official launch post are going to be:

  • More robust unit test coverage
  • Deployment in a production environment (to find any obvious bugs)
  • Documentation of the core concepts
  • Documentation of the core APIs

I'm also going to try to put together a simple demo of Dagorama in action.
