
The provenance of copy and paste

December 19, 2022

Much of the American corporate world runs on manual workflows. Excel, email, Slack, and CRMs are still the workhorses of getting things done. They also share some core similarities, namely that they're tools prized for their flexibility. They are the hammers that make everything a nail.

I've seen my fair share of complex workflows across the above applications. Many approximate the functionality of a full-fledged computation graph. But instead of nodes and edges, they have people running hard-to-automate processes. These workflows are usually seen as pretty brittle, and for good reason. The common engineering concern is at the application layer: missing schemas and the lack of value typechecking can lead to unexpected issues downstream.

A topic that receives less attention, however, is the provenance of the data that flows into and out of these different tools. I often find myself going through documents that I've written or that were written by colleagues, and I almost inevitably have to wonder where in the world some of the data came from. What website informed the citation? What open source project provided this reference snippet? In which email did the client identify their new requirement? The statement came from somewhere, but which somewhere is unknown. The link between the original reference and the new one was lost the second it was taken out of context.

This comes down to an inherent limitation in how most data moves around: copy and paste. The flexibility of these tools to store manually inputted data leads to poor data hygiene and cascading work down the road to re-derive what was used as an input source in the first place.

Why is this a problem? It makes verification of data accuracy or future automation near impossible. Issues include:

  • Checking whether the underlying data has changed in some way; an encyclopedia statistic being updated, a database table refreshing, or a client CRM status changing.
  • Auditing that analogue->digital transcriptions have been completed successfully, like extracting a numerical field from a scanned PDF.
  • Training ML models to automate extraction from the original source. Without the (input, output) contract of the original data source, this won't be possible in a classic supervised setting.

The early days of ⌘C

The old model of computing often switched from analogue to digital and back again, which made provenance near impossible to track over time.

Writing -> Digital Input -> Print -> Digital Input

This input interface only has access to raw strings provided through stdin. How would someone even begin to track provenance? There are MLA citations and the Dewey Decimal System, but no meaningful equivalent when inputting day-to-day data. The best solution was the simplest: support raw inputs of textual fields and disregard origin.

At some point this workflow changed with the full embrace of digital compute.

Digital Output -> Digital Input -> Digital Output

Output here could be an email, a website, or some sensor reading. We're taking digital bytes and utilizing them far away from their origin. But at the end of the day there is usually a single centralized source of truth that was the origin of that datapoint, no matter how far it spreads.

For formally built pipelines, there are ETL flows that connect from databases to data warehouses to feature stores to other output layers. There's usually some traceability of data signals through primary key references. But even so - the terminus of these systems is usually a tool that humans can manipulate. Most backoffice data capture systems still allow some form of copy and paste of data because of the ubiquity of having to input content manually.

Copy and paste was developed when computer storage was a rare and slow commodity. You wanted to discourage more complex utilization of the disk and only write necessary bytes, for fear of saturating the floppy disk you were running on or paying the cost of hydrating spinning-disk data into memory.

Data mutability became the name of the game. File saves overwrite previous versions. You have one working draft and no more, absent backups or manual file duplications. Pasting was built for this space-limited world. It copied raw values only and allowed you to migrate them to any location.

A modern build-out

But what if things were different? What if copy+paste didn't default to pasting static values, but instead pasted a pointer to an archived piece of content and persisted this pointer whenever its value is copied and pasted again? Let's consider a possible specification given the state of current technology and storage, but no already-implemented copy and paste. This is obviously more of a thought experiment than an actual tech design.

Requirements:

  • Each application supports copy+paste with a contextual payload (see the sketch after this list). This payload must include the copied value itself and an optional link to the source. The source should be communicated in a globally indexable way, possibly in URI format via a domain namespace like http://salesforce.com/profile/1234.
  • Maintain version control to recover older data revisions for sources that are not under the user's control, like Internet pages.
  • Copies can include an arbitrary number of original links. If a specific datapoint (like a number) is embedded within a broader context (a paragraph), the original numerical reference should be persisted even when the entire context is selected.
  • Application-based page archival. Since applications control their own database schema and business logic, they are best positioned to define what is necessary to hydrate the user's link state in the future.
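
To make the payload requirement concrete, here's a minimal sketch of what the contextual payload could look like. It's written in Python purely for illustration; the type names and fields are assumptions, not an existing clipboard API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceRef:
    """A globally indexable pointer to where a value was copied from."""
    uri: str                        # e.g. "http://salesforce.com/profile/1234"
    revision: Optional[str] = None  # logical revision of the source at copy time


@dataclass
class CopyPayload:
    """What lands on the clipboard instead of a bare string."""
    value: str                                               # the raw copied text, as today
    sources: list[SourceRef] = field(default_factory=list)   # zero or more origin links


# A snippet copied out of a CRM record keeps a pointer back to the record
# (and could keep additional, finer-grained references alongside it).
payload = CopyPayload(
    value="Q3 revenue came in at $1.2M, up 14% quarter over quarter.",
    sources=[SourceRef(uri="http://salesforce.com/profile/1234", revision="r42")],
)
```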

Some questions that come to mind once you start to think about an actual implementation:

Level of immutability: Do you maintain the whole filesystem history in an immutable form? Or something more like a git-like store where only diffs are captured? Alternatively, should archives be created on demand, where a new commit on the index is only made when copies happen explicitly? Can applications be relied on to maintain pointers to copy state, or do we need to perform a manual, local archival of each original source?

File size: Just because disk size is now plentiful doesn't mean it's unlimited, especially on edge devices. How can the provenance chain be stored in an optimized way? Should it be stored locally or somewhere on remote servers?

Versioning schema: How can platforms handle versioning of data or schema changes? Is there some way to track a mapping of URLs over time to the same underlying object without this becoming too much of an operational burden? On the web we see this all the time with broken links to old static pages. The problem is magnified even further during application redesigns. We ideally want to point the datapoint to the same logical origin. How can we do this in a stable way?

Archive availability: Documents are frequently shared across devices. A provenance chain should be visible to the maximum number of individuals that have context on the underlying source document. On the internet, this is everyone. In an enterprise context it may only be individuals who are part of the current team. How do we solve for widely shareable documents that maintain the provenance chain?

There's likely not a one-size-fits-all solution to this problem. Some assumptions might be useful:

  • For consistency, we'll call the raw original format "rich text". This might be the webpage, email, or slide deck from which the user wants to extract a smaller piece of text. A "snippet" is the smaller piece of extracted text that is copied.
  • The main provenance concern is written text and not images or videography, where file sizes are still prohibitively large. Almost all useful data for corporations is text.
  • We mostly care about tracking the journey of snippets to other places in the system. This creates a chain from the original copy to subsequent references. Recording how we got snippets from rich text is important but a secondary concern.

With those assumptions, here's one possible proposal:

  • Each application (or website) must respond to an archiveState request. This request expects pages to return a rich text representation of the current page state, perhaps with some standard rendering markup like GitHub Flavored Markdown.
    • As we can see with websites that hijack the print dialog functionality, third parties are not always a reliable source for these archives. There could be some system-level user configuration under which the display renderer is leveraged to produce the page archive instead.
  • The page has the ability to return a SHA digest that corresponds to the current logical revision. As a backup, the client can derive this revision identifier itself from the page contents.
  • The local client always creates a SHA digest of the returned page contents, which can later be used to validate other archives of this page against the value we captured at copy time.
  • When copying data from an application, the archive state will be requested and saved locally. It can optionally be uploaded to a cloud host to act as the store of document revisions.
  • Copying includes a context payload of {uri: document#revision, contents_hash: sha, raw_contents: bytes} (sketched after this list). The URI and contents hash correspond to the values captured at copy time. The raw contents mirror today's implementation of copying.
  • Each application in the OS must support pasting in the above format. Even if the context is invisible (and in many cases it should be, like for a cell in a spreadsheet), this context payload is linked with the underlying textual data in the data model.

We recognize that when a document is shared with context links, other machines might want to retrieve the original provenance of the links as well. We should support this use-case.

  • Devices on a local network (or perhaps the entire internet, as in a P2P network) respond to a stateForHash request. This request communicates the desired URI and asks any hosts with a cached version of the page to deliver it to the requestor. This endpoint should optionally allow authentication tokens to be provided in case users need to prove their identity to hosts. Users (or automated in-house systems) with access to the raw archive could choose whether to allow or deny the request.
  • These raw page archives are checked against the stored contents_hash to help ensure that the host has not tampered with the page state (a verification sketch follows). In theory it's possible to serve spoofed contents, but finding a conflicting input for a strong one-way hash like SHA-256 requires a lot of raw compute power and likely isn't worth the investment.
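
And a corresponding sketch of the receiving side, where an archive returned by a stateForHash peer is verified against the hash recorded at copy time. The peer transport is left abstract, and the helper names here are assumptions for illustration.

```python
import hashlib


def verify_remote_archive(payload: dict, remote_archive: str) -> bool:
    """Check that a peer-supplied archive matches the hash stored in the copy
    payload, i.e. that the host has not tampered with the page state."""
    digest = hashlib.sha256(remote_archive.encode("utf-8")).hexdigest()
    return digest == payload["contents_hash"]


# Hypothetical usage: ask peers for the URI in the payload and accept the
# first archive whose contents hash back to the value recorded at copy time.
#
# for peer in discovered_peers:
#     archive = peer.state_for_hash(payload["uri"], auth_token=my_token)
#     if archive is not None and verify_remote_archive(payload, archive):
#         render(archive)
#         break
```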

Closing

Perhaps more feasible than the above would be introducing contextualized copy+paste into specific business applications. Airtable or Excel could do this today to great effect, even if they just wrap in-platform objects in some provenance graph. Data flowing through different spreadsheets would still have its provenance tracked and accessible via API for use in automation scripts or a manual traceback. This still suffers from a lack of visibility into data copied from other platforms (like CRM solutions), but once data is already in the system its route can be stored.

Alternatively, it might be possible to get quite a bit of mileage by simply recording the contents of the local clipboard alongside the application that produced them. Important values are usually relatively unique, so it would be easy to trace back where the original source came from. It doesn't seem too hard to set up AppleScripts that run on copy events (a rough sketch follows). If it's a webpage, archive the current contents. If it's an email, export the contents to some known folder. Keep a global mapping of text->archive location. If you're ever curious about an annotation that you made, just consult the archive. Some friends recommended Hooksmart, which does something similar for local copies, but I haven't yet given it a try.
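
As a rough sketch of that approach, assuming macOS and its built-in pbpaste and osascript command line tools (plus the usual automation permissions): poll the clipboard, note which application is frontmost, and append each new value to a local log. Actually archiving the source page or email is left as a stub.

```python
import json
import subprocess
import time
from pathlib import Path

# Assumed location for the text -> source log.
LOG = Path.home() / ".clipboard_provenance.jsonl"


def clipboard_text() -> str:
    # pbpaste ships with macOS and prints the current clipboard contents.
    return subprocess.run(["pbpaste"], capture_output=True, text=True).stdout


def frontmost_app() -> str:
    # Ask System Events (via AppleScript) which application is frontmost.
    script = 'tell application "System Events" to get name of first process whose frontmost is true'
    return subprocess.run(["osascript", "-e", script], capture_output=True, text=True).stdout.strip()


def watch(poll_seconds: float = 1.0) -> None:
    """Append a {text, app, time} record whenever the clipboard changes."""
    last = clipboard_text()
    while True:
        time.sleep(poll_seconds)
        current = clipboard_text()
        if current and current != last:
            record = {"text": current, "app": frontmost_app(), "time": time.time()}
            with LOG.open("a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")
            # TODO: archive the source itself here (save the webpage,
            # export the email to a known folder, etc.).
            last = current


if __name__ == "__main__":
    watch()
```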

Might have to file that under future weekend projects.

Thanks to Austin Ray and Jeff Zoch for their proofread and thoughts on the draft.
