We will soon be needing a blob service to store archives of systrees. Later on, we will want to use this for storing build artifacts.

I intend the blob service to be quite simple. The first version should be able to manage with the following API (there's a sketch of how a client would use it after the list):

  • POST /id with an application/json body containing the pipeline parameters that were used to build the blob. Computes a blob id from the body and returns it.

  • PUT /blobs/id where id is a blob id calculated by the previous request. Body is the blob to be stored. The blob service stores the body, ignoring content type for now. The service does not check the id, at least for now.

  • GET /blobs/id where id is again a blob id. Returns the blob, with a content type of application/octet-stream.
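
For illustration, here's roughly how a client would drive these three calls. This is only a sketch: it assumes Python with the requests library, and the base URL, parameter values, and file names are made up.

    import requests

    BASE = "http://localhost:8080"  # assumed address of the blob service

    # Pipeline parameters that were used to build the blob (example values).
    params = {"image": "debian-12", "packages": ["gcc", "make"]}

    # 1. Ask the service to compute a blob id from the parameters.
    blob_id = requests.post(f"{BASE}/id", json=params).text.strip()

    # 2. Upload the blob (here, a systree archive read from disk) under that id.
    with open("systree.tar.gz", "rb") as f:
        requests.put(f"{BASE}/blobs/{blob_id}", data=f)

    # 3. Later, recompute the id from the same parameters and fetch the blob.
    blob_id = requests.post(f"{BASE}/id", json=params).text.strip()
    response = requests.get(f"{BASE}/blobs/{blob_id}")
    with open("retrieved.tar.gz", "wb") as f:
        f.write(response.content)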

This is enough to implement systree building and storing. It's not good enough long-term but we don't need to worry about that yet.

The id is computed by normalising the incoming JSON body and calculating the SHA256 of that. It is explicitly not the SHA256 of the blob itself, since that would mean you'd need to know the blob's hash to retrieve it, and we don't. We do know the pipeline parameters, though, so that's an easy way of identifying the blob we need.
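
Here's a minimal sketch of that computation, assuming the normalisation is "serialise the JSON with sorted keys and no insignificant whitespace"; the exact normalisation rules aren't fixed yet.

    import hashlib
    import json

    def blob_id(params: dict) -> str:
        # Normalise: sorted keys, no extra whitespace, then hash the result.
        canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    # Two bodies that differ only in key order or whitespace get the same id.
    a = json.loads('{"image": "debian-12", "arch": "amd64"}')
    b = json.loads('{ "arch": "amd64", "image": "debian-12" }')
    assert blob_id(a) == blob_id(b)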

Longer term plans

I'm thinking that longer term the blob service needs to start doing things like these:

  • store content type, creation timestamp, last use timestamp
  • possibly allow clients to specify arbitrary metadata, such as "filename" and a project/pipeline instance id
  • list all blobs, with metadata, including size of each blob
  • delete blobs
  • possibly have a way of specifying a retention policy, such as "keep the most recently used blobs up to N gigabytes of data for each project", and a way of deleting anything that doesn't fit the policy
  • do aggressive de-duplication of blob content: it's likely there's chunks of data that are identical across blobs, and there's no need to store those chunks more than once
    • de-duplicating backup software can offer tips here, such as calculating a weak rolling checksum and ending a chunk whenever its lowest N bits are 0 (with N=16, chunks will on average be around 64 KiB); there's a sketch of this after the list
    • that kind of chunking doesn't care about the type of data, and doesn't mind if chunks are at different offsets in different blobs
    • it may be worthwhile to have the client do the chunking so that it can avoid uploading chunks that already exist in the service; likewise, possibly the client should download chunks directly, so it can avoid downloading chunks it already has; we'll see
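
To illustrate the chunking idea, here's a rough sketch. The rolling checksum here is a simple Buzhash over a 64-byte window, chosen only for illustration; a real implementation might use a different rolling hash and would also enforce minimum and maximum chunk sizes.

    import hashlib
    import random

    WINDOW = 64    # bytes of context the rolling checksum covers
    MASK = 0xFFFF  # lowest 16 bits zero => chunks average around 64 KiB

    # Fixed pseudo-random table mapping each byte value to a 32-bit word.
    random.seed(0)
    TABLE = [random.getrandbits(32) for _ in range(256)]

    def _rotl(x, n):
        n %= 32
        return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

    def chunks(data: bytes):
        """Yield content-defined chunks of data as bytes."""
        h = 0
        start = 0
        for i, byte in enumerate(data):
            h = _rotl(h, 1) ^ TABLE[byte]
            if i >= WINDOW:
                # The byte that just left the window no longer contributes.
                h ^= _rotl(TABLE[data[i - WINDOW]], WINDOW)
            if (h & MASK) == 0:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    # Each chunk can then be identified by its SHA256 and stored only once,
    # no matter which blob it came from or at what offset it occurred.
    def chunk_ids(data: bytes):
        return [hashlib.sha256(c).hexdigest() for c in chunks(data)]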

All of this is tentative, and not meant for the first version, which will be as simple as possible so we get it into use earlier rather than later.