
Proposal, Extension to Blob Upload, Dedupe API #18

Closed
sargun opened this issue Jun 25, 2018 · 9 comments

sargun (Contributor) commented Jun 25, 2018

Layer upload de-duplication is explicitly listed as part of the specification.

Today, this is scoped to a given image, as opposed to the whole registry. The mechanism of de-duplication is to issue a HEAD request for each blob to see whether it already exists; only if it doesn't is the blob uploaded.
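
Concretely, a client can perform this existence check with curl (the -I flag issues a HEAD request; the host below is just the example host from the traces further down, and a real registry may additionally require an auth token):

curl -I https://myregistry.com:7002/v2/image-1/blobs/sha256:4f1bb8b6572003ca3526ab23c80c385c5eadd1748e3996e444b66d47ce0ee7af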

Unfortunately, this doesn't work in the case where you have two images built on top of a common base image. This is a common pattern, à la:

                      +-----------+
                      |           |
                      |  image-1  |
+---------------+     |           |
|               |     +-----^-----+
| Ubuntu:latest +-----------+
|               |     +-----v-----+
+---------------+     |           |
                      |  image-2  |
                      |           |
                      +-----------+

See:

> HEAD /v2/image-1/blobs/sha256:4f1bb8b6572003ca3526ab23c80c385c5eadd1748e3996e444b66d47ce0ee7af HTTP/1.1
> Host: myregistry.com:7002
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 404 Not Found


> HEAD /v2/image-2/blobs/sha256:4f1bb8b6572003ca3526ab23c80c385c5eadd1748e3996e444b66d47ce0ee7af HTTP/1.1
> Host: myregistry.com:7002
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 307 Temporary Redirect
< Date: Mon, 25 Jun 2018 01:06:59 GMT
< Server: Apache
< Docker-Distribution-Api-Version: registry/2.0
< Content-Type: application/octet-stream
< Docker-Distribution-Api-Version: registry/2.0
< Location: http://<INSERT REDIRECT HERE>

On upload, the underlying registry may deduplicate these layers.

Although there exists a mechanism for image-1 and image-2 to have "cross-mounted" layers, it requires that the owners of image-1 and image-2 know of one another.

My proposal is to extend the Starting An Upload API:

POST /v2/<name>/blobs/uploads/

to accept an optional query parameter, digest. If the digest query parameter is present, and the registry can find the layer in another image, it returns status code 201 and links the layer in. Given that this parameter would be optional, this would be a backwards-compatible change, and registries could choose whether to implement it.
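
As a sketch, the proposed exchange might look like the following (reusing the digest from the trace above; the exact response headers are illustrative, chosen by analogy with the existing blob APIs, not settled behavior):

> POST /v2/image-2/blobs/uploads/?digest=sha256:4f1bb8b6572003ca3526ab23c80c385c5eadd1748e3996e444b66d47ce0ee7af HTTP/1.1
> Host: myregistry.com:7002
> Content-Length: 0
>
< HTTP/1.1 201 Created
< Location: /v2/image-2/blobs/sha256:4f1bb8b6572003ca3526ab23c80c385c5eadd1748e3996e444b66d47ce0ee7af

If the registry cannot find the layer anywhere, it falls back to the normal 202 Accepted response that starts a regular upload, which is what keeps the change backwards compatible.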

Opinion?

jonjohnsonjr (Contributor) commented

Something worth considering: I don't think this works well with the (current, docker/distribution) auth model. In order to implement this, you'll need to determine if a layer is present in any images that the caller has access to pull. For certain registries, this might be impossible to calculate; for others, this might be a slow operation.

sargun (Contributor, Author) commented Jun 25, 2018

@jonjohnsonjr That's a good point. I guess the way to solve this would be via challenge-response? i.e., the server responds with some block (say, 4 kilobytes), and the client responds with a hash of the layer plus that 4K block. That way, only 4K has to be transmitted, as opposed to the entire layer.
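
A rough sketch of such an exchange (the header names and flow here are hypothetical, invented purely to illustrate the idea; the digest is abbreviated):

> POST /v2/image-2/blobs/uploads/?digest=sha256:4f1bb8b657... HTTP/1.1
>
< HTTP/1.1 401 Unauthorized
< Blob-Challenge: <random 4K block chosen by the server>

> POST /v2/image-2/blobs/uploads/?digest=sha256:4f1bb8b657... HTTP/1.1
> Blob-Challenge-Response: sha256(layer bytes || challenge block)
>
< HTTP/1.1 201 Created

Only a client that actually possesses the layer bytes can compute the response, so knowing the digest alone would no longer be enough to claim the layer.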

amouat (Contributor) commented Jun 25, 2018 via email

jonjohnsonjr (Contributor) commented

a user that somehow finds a hash can use it as a base layer and then pull it without actually having the file

@amouat essentially, yes. We don't want layer digests to have to be secrets. You can imagine someone pulling an image that contains secret, proprietary code during a recorded presentation. Anyone watching that presentation could then upload an image with the same layers using the dedupe mechanism, then immediately re-pull that image, effectively bypassing the auth mechanism.

I guess the way to solve this would be via challenge-response?

@sargun I don't think that completely solves the problem, since you would still be leaking the fact that a layer existed in a repository you don't have access to. I'm not sure how much we care about that.

sargun (Contributor, Author) commented Jun 26, 2018

@jonjohnsonjr Is there really a problem with leaking the fact that a layer with that hash exists? What does that tell you?

jonjohnsonjr (Contributor) commented

What does that tell you?

That someone else has uploaded this layer before.

Is there really a problem

Perhaps not, but it's worth pausing to reflect on it, since we would be leaking more information than before.

As a strawman example:

Imagine a situation where someone (let's say Alice) might want to secretly upload an image to a private registry. Perhaps this image contains some proprietary code, and perhaps Alice's coworker (let's say Bob) also has access to this image. If Bob is suspicious that Alice has been uploading this image to, say, private-registry.example.com, he can start probing that registry via this dedupe mechanism to confirm his suspicions.

Now... do we care about that? I don't know. I'm sure more security-minded people can come up with a better example (or perhaps a better dedupe mechanism), but that's all I've got 😄

jzelinskie (Member) commented

There isn't really a problem with the registry auth model; it's more a result of the implementation of docker-distribution. Quay stores all of its metadata in a relational database, so we actually can perform the lookup for "automatic blob mounting", and we already do this automatically with the Docker v1 protocol. I'm pretty sure we just never got around to implementing it for v2.

Either way, I don't think you even need to change the protocol to support this functionality. If you can do it, just do it transparently.

stevvooe (Contributor) commented

The original API had this parameter, but we took it out for a few reasons:

  1. We don't want to require uploaders to know the digest before uploading layers. This allows cut-through generation of the content directly into the registry (see the sketch after this list).
  2. If two uploaders exist concurrently for the same digest, you cannot let them block each other, or you create a vector for a DoS attack.
  3. Terminating uploads for layers that the registry already knows about in unrelated repositories can expose private information. @jonjohnsonjr described this in detail, and yes, @sargun, you should care ;). I don't have my notes describing the exact attack, but there are a few that can be enabled if there are other exploits in the registry implementation (ACL, auth, etc.).
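
To illustrate reason 1: in the current v2 flow the digest is a commit-time parameter, so a client can stream content it is still generating and only name it at the end. Roughly:

POST  /v2/<name>/blobs/uploads/            -> 202 Accepted, Location: /v2/<name>/blobs/uploads/<uuid>
PATCH /v2/<name>/blobs/uploads/<uuid>         (stream chunks as they are produced)
PUT   /v2/<name>/blobs/uploads/<uuid>?digest=<digest computed at the end>  -> 201 Created

Requiring the digest up front on the initial POST would foreclose this pattern unless the parameter stays optional.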

In general, I don't see a huge issue with this. In CAS upload models, I've always struggled with whether the digest should be a pre-declared or a commit-time parameter, but it turns out it depends on the context.

The key to making this work is defining the behavior, when the parameter is present, in a way that won't break old clients. I think the proposal is fine, except it should probably follow the mount behavior exactly.
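
For reference, the existing cross-repository mount looks like this; a registry that can't or won't perform the mount falls back to returning a normal 202 Accepted upload session:

POST /v2/<name>/blobs/uploads/?mount=<digest>&from=<source-repository>  -> 201 Created on success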

We can leave the policy up to the implementation.

SteveLasker (Contributor) commented

De-duping, and how and where it is implemented, seems like a registry implementation detail. Some may want to share "public" layers across registries to give a better first impression, or to reduce overall storage costs. Some may be concerned about the security implications of sharing layers (please don't start a rant here :)).
There are other scenarios where we may not want to share layers, for performance reasons.
I'd suggest we close this one and leave de-duping as a registry-level optimization.
