Independent work: November recap

# December 22, 2022

Quick update for the second month of independent work. I've been working on many of the same projects as last month, plus a couple of new ones.

## Dagorama

Dagorama is a new open source project I'm starting to use in more deployments. It's not quite ready for prime time, but I see its promise.

## The Motivation

Background processing jobs are often dependent on one another. Maybe you need to bundle up multiple datapoints for GPU batch processing, or crawl a group of webpages and then compute a similarity measurement between them.
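To make the shape of these dependencies concrete, here's a minimal sketch of the crawl-then-compare case using only the standard library. The `crawl` and `similarity` functions are stand-ins I made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def crawl(url: str) -> set[str]:
    # Stand-in for a real fetch: pretend a page's "content" is its path segments.
    return set(url.strip("/").split("/"))

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard similarity between two token sets.
    return len(a & b) / len(a | b)

urls = ["site.com/a/b", "site.com/a/c", "site.com/d/e"]

# Fan out: crawl every page in parallel.
with ThreadPoolExecutor() as pool:
    pages = list(pool.map(crawl, urls))

# Fan in: the pairwise comparison depends on *all* crawls finishing --
# exactly the kind of dependency a plain task queue doesn't express.
scores = {
    (i, j): similarity(pages[i], pages[j])
    for i, j in combinations(range(len(pages)), 2)
}
```

The fan-out step fits a queue just fine; it's the fan-in step, which has to wait on every upstream job, that pushes you toward a graph.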

Whenever I ran into these situations, I'd usually shove them into an existing Celery or Kafka system. But these tools are built around queues. Once you escape the queue and start expressing dependencies between datapoints, you're looking at a fully fledged computation graph, not a task queue.

Airflow: I used it extensively at a previous job. You specify pipelines as programmed flows, but these have to be curated ahead of time, so something like dynamically branching jobs based on the current state isn't possible. The requirement to run a full scheduler, webserver, and metadata database also makes it too heavy for smaller deployments. And the Python API, while improved in recent versions, still feels more like configuration than actual programming.

Dask: Works, but it's a relatively heavy dependency. It feels like you're programming a computation graph because, well, you are. While it's solid for parallel computing and has great NumPy/Pandas integration, it's primarily focused on data processing rather than general workflow orchestration. The distributed scheduler is also relatively complex to configure and monitor. Good in development; harder in production.
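For reference, the graph-building style Dask encourages looks roughly like this (a minimal `dask.delayed` example; the functions are placeholders, nothing here is specific to my workloads):

```python
from dask import delayed

@delayed
def fetch(i: int) -> int:
    # Stand-in for real work; each call becomes a node in the task graph.
    return i * 10

@delayed
def combine(values: list[int]) -> int:
    return sum(values)

# Nothing runs yet -- this just wires up the computation graph.
total = combine([fetch(i) for i in range(4)])

# The graph executes only when you ask for the result.
result = total.compute()
```

This is elegant for data pipelines, but you're explicitly constructing and materializing a graph, which is more ceremony than a small background-job setup usually wants.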

Ray: Powerful for distributed computing but clearly designed for enterprise use cases. It requires complex cluster setup and brings along a heavy dependency footprint. The actor-based model also often requires restructuring existing code to fit its paradigm. For simpler workflows, it feels like using a sledgehammer to crack a nut.

The common theme across all of these cons was complexity. I wanted the benefits of well-defined background processing, but something easier to spin up for smaller projects. I wanted something:

  • Python-native: Client code should work with modern Python best practices, type-hinting, linting, etc.
  • Lightweight: One simple Python dependency, self-contained, with code you can read end to end to figure out what's going on.
  • Zero configuration: Works out of the box and scales from 0-1.

This led me to write a POC of Dagorama. It works with simple @actions decorators that can be attached to any function, with data flowing from one to the next. It requires a single-machine broker (similar to the Redis model) to track the state of the computation. It's easy to deploy in Docker and to mirror the remote setup locally.
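To make the decorator idea concrete, here's a toy, in-process sketch of the pattern: a decorator that registers functions as graph nodes and a runner that executes them in dependency order. This is my own illustration of the concept, not Dagorama's actual API; in the real project a broker would track this state across machines:

```python
from __future__ import annotations
from typing import Any, Callable

# Registry of graph nodes: name -> (function, names of upstream dependencies).
_GRAPH: dict[str, tuple[Callable[..., Any], list[str]]] = {}

def action(*deps: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a function as a graph node fed by the named dependencies."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        _GRAPH[fn.__name__] = (fn, list(deps))
        return fn
    return register

def run(target: str) -> Any:
    """Resolve a node by recursively running its dependencies first (memoized)."""
    cache: dict[str, Any] = {}

    def resolve(name: str) -> Any:
        if name not in cache:
            fn, deps = _GRAPH[name]
            cache[name] = fn(*[resolve(d) for d in deps])
        return cache[name]

    return resolve(target)

@action()
def extract() -> list[int]:
    return [1, 2, 3]

@action("extract")
def transform(items: list[int]) -> list[int]:
    return [i * 2 for i in items]

@action("transform")
def load(items: list[int]) -> int:
    return sum(items)

print(run("load"))  # 12
```

The appeal of this style is that the graph is implicit in ordinary decorated functions, so the client code stays plain Python with full type-hinting and linting support.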

## When to ship?

One question I've debated is when to officially ship. The repo is certainly at an MVP stage: it works in local testing and on a small cluster I'm using for development. I've decided my requirements for an official launch post are going to be:

  • More robust unit test coverage
  • Deployment in a production environment (to find any obvious bugs)
  • Documentation of the core concepts
  • Documentation of the core APIs

I'm also going to try to put together a simple demo of Dagorama in action.



A couple of years ago I built our internal crawling platform at Globality, which needed to be capable of scaling to billions of pages each crawl. The two main types of crawlers that are deployed in the wild are typically raw or headless. We ended up implementing a hybrid architecture. Hybrid crawling can make use of the strengths of both while trying to minimize their weaknesses.