# Independent work: November recap

December 22, 2022

Quick update for the second month of independent work. I've been working on many of the same projects as last month, in addition to a couple new ones.

## Dagorama

Dagorama is a new open source project I'm starting to use in more deployments. It's not quite ready for prime-time but I see its promise.

### The Motivation

Background processing jobs often depend on one another. Maybe you need to bundle up multiple datapoints for GPU batch processing, or crawl a group of webpages and then run a similarity measurement across them.

Whenever I ran into these situations, I'd usually shove them into an existing Celery or Kafka system. But those tools are built around queues. Once you escape the queue and start feeding the output of one task into the input of another, you're really looking at a fully fledged computation graph, not a task queue.

**Airflow**: I've used it extensively at a previous job. You specify workflows as programmed flows, but they have to be curated ahead of time; something like dynamically branching jobs depending on the current state isn't possible. The requirement to run a full scheduler, webserver, and metadata database also makes it too heavy for smaller deployments. The Python API, while improved in recent versions, still feels more like configuration than actual programming.
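
For a concrete sense of what I mean by "configuration more than programming", a minimal Airflow 2.x DAG looks roughly like this (the task bodies are placeholders; the point is that the graph is declared statically up front):

```python
# Minimal Airflow 2.x sketch: the task graph is wired up ahead of time,
# in configuration-style code, rather than branching on runtime state.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_pages(**context):
    ...  # placeholder task body


def compare_pages(**context):
    ...  # placeholder task body


with DAG(
    dag_id="crawl_and_compare",
    start_date=datetime(2022, 11, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_pages", python_callable=fetch_pages)
    compare = PythonOperator(task_id="compare_pages", python_callable=compare_pages)

    fetch >> compare  # dependencies are fixed before the DAG ever runs
```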

**Dask**: It works, but it's a relatively heavy requirement. It feels like you're programming a computation graph because, well, you are. It's solid for parallel computing and has great NumPy/Pandas integration, but it's primarily focused on data processing rather than general workflow orchestration. The distributed scheduler is also relatively complex to configure and monitor. Good in development, harder in production.
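
As a point of comparison, the crawl-then-compare example above might look like this with `dask.delayed` (placeholder bodies again): you assemble an explicit task graph, then ask a scheduler to run it.

```python
# Sketch of the crawl-then-compare flow with dask.delayed: you build an
# explicit task graph, then hand it to a scheduler with compute().
from dask import delayed


@delayed
def crawl(url: str) -> str:
    return f"<html for {url}>"  # placeholder


@delayed
def similarity(pages: list) -> float:
    return 0.87 if len(pages) > 1 else 1.0  # placeholder


pages = [crawl(url) for url in ["https://example.com/a", "https://example.com/b"]]
score = similarity(pages)  # still lazy: this is a graph node, not a value
print(score.compute())     # nothing actually runs until compute()
```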

**Ray**: Powerful for distributed computing, but clearly designed for enterprise use cases. It requires complex cluster setup and brings along a heavy dependency footprint. The actor-based model also often requires restructuring existing code to fit its paradigm. For simpler workflows, it feels like using a sledgehammer to crack a nut.
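
For reference, the same toy flow in Ray's remote-task model looks roughly like this (placeholder bodies): even a tiny job means standing up a Ray runtime and restructuring call sites around `.remote()` handles.

```python
# Sketch of the same flow in Ray: functions become remote tasks and
# every call site is rewritten around .remote() and object refs.
import ray

ray.init()  # starts (or connects to) a Ray runtime, even for a toy job


@ray.remote
def crawl(url: str) -> str:
    return f"<html for {url}>"  # placeholder


@ray.remote
def similarity(pages: list) -> float:
    return 0.87 if len(pages) > 1 else 1.0  # placeholder


page_refs = [crawl.remote(url) for url in ["https://example.com/a", "https://example.com/b"]]
score_ref = similarity.remote(ray.get(page_refs))  # resolve refs before passing the list
print(ray.get(score_ref))
```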

The common theme across all of these cons was a lack of simplicity. I wanted the benefits of well-defined background processing, but something easier to spin up for smaller projects. I wanted something:

  • Python-native: Client code should follow modern Python best practices: type hinting, linting, etc.
  • Lightweight: One simple Python dependency, self-contained, with code that's easy to read end to end to figure out what's going on.
  • Zero configuration: Works out of the box and scales from 0-1.

This led me to write a proof of concept of Dagorama. It works with simple @actions decorators that can be attached to any function, with data flowing from one function to the next. It requires a single-machine broker (similar to the Redis model) to track the state of the computation. It's easy to deploy in Docker and to mirror the remote setup locally.
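
To make the pattern concrete, here's a toy sketch of the decorator-driven flow I'm describing. The `action` decorator below is a stand-in I wrote for this post to illustrate the idea; it is not Dagorama's actual API, and a real broker-backed version would enqueue the call and hand back a future rather than running the function inline.

```python
# Toy illustration of the decorator pattern described above. The `action`
# decorator here is a stand-in for illustration, NOT Dagorama's real API:
# a broker-backed version would enqueue the call instead of running it inline.
from functools import wraps


def action(func):
    """Mark a function as a unit of background work."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)  # inline execution keeps the sketch self-contained
    return wrapper


@action
def crawl(urls: list[str]) -> list[str]:
    return [f"<html for {url}>" for url in urls]  # placeholder crawl


@action
def similarity(pages: list[str]) -> float:
    return 0.87 if len(pages) > 1 else 1.0  # placeholder comparison


if __name__ == "__main__":
    pages = crawl(["https://example.com/a", "https://example.com/b"])
    print(similarity(pages))  # data flows from one action into the next
```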

### When to ship?

One question I've debated is when to officially ship. The repo is certainly at an MVP stage: it works in local testing and on the small cluster I'm using for development. I've decided my requirements for an official launch post are going to be:

  • More robust unit test coverage
  • Deployment in a production environment (to find any obvious bugs)
  • Documentation of the core concepts
  • Documentation of the core APIs

I'm also going to try to put together a simple demo of Dagorama in action.


