# Independent work: November recap

December 22, 2022

Quick update for the second month of independent work. I've been working on many of the same projects as last month, in addition to a couple new ones.

## Dagorama

Dagorama is a new open source project I'm starting to use in more deployments. It's not quite ready for prime-time but I see its promise.

### The Motivation

Background processing jobs often depend on one another. Maybe you need to bundle up multiple datapoints for GPU batch processing, or crawl a group of webpages and then run a similarity measurement across them.

Whenever I ran into these situations, I'd usually shove them into an existing Celery or Kafka system. But those tools are built around queues. Once you escape the queue and start feeding the output of one task into the input of another, you're really looking at a fully fledged computation graph, not a task queue.

**Airflow**: I've used it extensively at a previous job. You specify workflows as programmed flows, but they have to be curated ahead of time; something like dynamically branching jobs depending on the current state isn't possible. The requirement to run a full scheduler, webserver, and metadata database also makes it too heavy for smaller deployments. The Python API, while improved in recent versions, still feels more like configuration than actual programming.
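
For a concrete sense of what I mean by "configuration more than programming", a minimal Airflow 2.x DAG looks roughly like this (the task bodies are placeholders; the point is that the graph is declared statically up front):

```python
# Minimal Airflow 2.x sketch: the task graph is wired up ahead of time,
# in configuration-style code, rather than branching on runtime state.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_pages(**context):
    ...  # placeholder task body


def compare_pages(**context):
    ...  # placeholder task body


with DAG(
    dag_id="crawl_and_compare",
    start_date=datetime(2022, 11, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_pages", python_callable=fetch_pages)
    compare = PythonOperator(task_id="compare_pages", python_callable=compare_pages)

    fetch >> compare  # dependencies are fixed before the DAG ever runs
```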

**Dask**: It works, but it's a relatively heavy requirement. It feels like you're programming a computation graph because, well, you are. It's solid for parallel computing and has great NumPy/Pandas integration, but it's primarily focused on data processing rather than general workflow orchestration. The distributed scheduler is also relatively complex to configure and monitor. Good in development, harder in production.
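
As a point of comparison, the crawl-then-compare example above might look like this with `dask.delayed` (placeholder bodies again): you assemble an explicit task graph, then ask a scheduler to run it.

```python
# Sketch of the crawl-then-compare flow with dask.delayed: you build an
# explicit task graph, then hand it to a scheduler with compute().
from dask import delayed


@delayed
def crawl(url: str) -> str:
    return f"<html for {url}>"  # placeholder


@delayed
def similarity(pages: list) -> float:
    return 0.87 if len(pages) > 1 else 1.0  # placeholder


pages = [crawl(url) for url in ["https://example.com/a", "https://example.com/b"]]
score = similarity(pages)  # still lazy: this is a graph node, not a value
print(score.compute())     # nothing actually runs until compute()
```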

**Ray**: Powerful for distributed computing, but clearly designed for enterprise use cases. It requires complex cluster setup and brings along a heavy dependency footprint. The actor-based model also often requires restructuring existing code to fit its paradigm. For simpler workflows, it feels like using a sledgehammer to crack a nut.
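
For reference, the same toy flow in Ray's remote-task model looks roughly like this (placeholder bodies): even a tiny job means standing up a Ray runtime and restructuring call sites around `.remote()` handles.

```python
# Sketch of the same flow in Ray: functions become remote tasks and
# every call site is rewritten around .remote() and object refs.
import ray

ray.init()  # starts (or connects to) a Ray runtime, even for a toy job


@ray.remote
def crawl(url: str) -> str:
    return f"<html for {url}>"  # placeholder


@ray.remote
def similarity(pages: list) -> float:
    return 0.87 if len(pages) > 1 else 1.0  # placeholder


page_refs = [crawl.remote(url) for url in ["https://example.com/a", "https://example.com/b"]]
score_ref = similarity.remote(ray.get(page_refs))  # resolve refs before passing the list
print(ray.get(score_ref))
```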

The common theme across all of these cons was a lack of simplicity. I wanted the benefits of well-defined background processing, but something easier to spin up for smaller projects. I wanted something:

  • Python-native: Client code should follow modern Python best practices: type hinting, linting, etc.
  • Lightweight: One simple Python dependency, self-contained, with code that's easy to read end to end to figure out what's going on.
  • Zero configuration: Works out of the box and scales from 0-1.

This led me to write a proof of concept of Dagorama. It works with simple @actions decorators that can be attached to any function, with data flowing from one function to the next. It requires a single-machine broker (similar to the Redis model) to track the state of the computation. It's easy to deploy in Docker and to mirror the remote setup locally.
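
To make the pattern concrete, here's a toy sketch of the decorator-driven flow I'm describing. The `action` decorator below is a stand-in I wrote for this post to illustrate the idea; it is not Dagorama's actual API, and a real broker-backed version would enqueue the call and hand back a future rather than running the function inline.

```python
# Toy illustration of the decorator pattern described above. The `action`
# decorator here is a stand-in for illustration, NOT Dagorama's real API:
# a broker-backed version would enqueue the call instead of running it inline.
from functools import wraps


def action(func):
    """Mark a function as a unit of background work."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)  # inline execution keeps the sketch self-contained
    return wrapper


@action
def crawl(urls: list[str]) -> list[str]:
    return [f"<html for {url}>" for url in urls]  # placeholder crawl


@action
def similarity(pages: list[str]) -> float:
    return 0.87 if len(pages) > 1 else 1.0  # placeholder comparison


if __name__ == "__main__":
    pages = crawl(["https://example.com/a", "https://example.com/b"])
    print(similarity(pages))  # data flows from one action into the next
```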

### When to ship?

One question I've debated is when to officially ship. The repo is certainly at an MVP stage: it works in local testing and on the small cluster I'm using for development. I've decided my requirements for an official launch post are going to be:

  • More robust unit test coverage
  • Deployment in a production environment (to find any obvious bugs)
  • Documentation of the core concepts
  • Documentation of the core APIs

I'm also going to try to put together a simple demo of Dagorama in action.


