Stable ETag for Multipart Uploads #6785

@decoursin

Describe the feature

As I described here, the current etag scheme for multipart uploads is bad. In summary, the etag for a multipart upload is the MD5 hash of the concatenated MD5 hashes of the uploaded parts (with the part count appended). As a result, it depends on how the file was split into parts, not only on the file's content. Because the parts can be almost any size, the etag is pretty much just an arbitrary string.
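To make the instability concrete, here is a minimal sketch that reproduces the commonly observed multipart etag form locally. It assumes the "MD5 of the concatenated part MD5 digests, plus `-<part count>`" layout described above, which is what S3 returns for unencrypted or SSE-S3 objects as far as I know; the function name `multipart_etag` and the payload are made up for illustration.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Approximate a multipart etag: MD5 over the concatenated binary MD5
    digests of each part, followed by "-<number of parts>".

    Assumption: this mirrors the observed (not contractually documented)
    format; it is a local illustration, not an SDK call.
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


MIB = 1024 * 1024
payload = b"x" * (20 * MIB)  # the same 20 MiB "file" in both cases

etag_8mib = multipart_etag(payload, 8 * MIB)  # split into 3 parts
etag_5mib = multipart_etag(payload, 5 * MIB)  # split into 4 parts

print(etag_8mib)
print(etag_5mib)
print("same etag?", etag_8mib == etag_5mib)
```

The two calls hash identical bytes and only the part size differs, yet the resulting strings do not match. That is the whole problem: a client cannot predict the etag unless it also knows the part size used by whoever uploaded the object.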

It would be far better if the etag was meaningful.

Let me demonstrate why this is important:

The problem is that when the etag is not determined by the file's content alone (so it is effectively just an arbitrary string), you can have the following scenario:

SomeCompany uploads a huge file (uploaded via the multipart-upload
process) to S3 using a program. SomeCompany wants to reupload
the file later on, but only if it's not already there. However,
since the etag is pretty much a completely arbitrary string, the
same file uploaded one day could have a completely different etag
when uploaded a different day. That means the etags could be
different even for the same exact file!

Now, this SomeCompany could circumvent the unstable etag by using 2 queries (for example, HeadObject followed by MultipartUpload):

  1. HeadObject - see if the file is there by comparing one of the full-object checksums to make sure.
  2. MultipartUpload - only if the file is not there, upload the file. Use the etag found in the HeadObject call as part of the condition.

But having to make 2 queries just to do a CAS operation sucks! If the etag were based on stable data, those 2 queries wouldn't be necessary: the client could compute the expected etag locally and send a single conditional request.


This would also improve the multipart upload process. Currently, as documented here, the developer has to do the following to complete the multipart upload:

You first initiate the multipart upload and then upload all parts using the UploadPart operation or the UploadPartCopy operation. After successfully uploading all relevant parts of an upload, you call this CompleteMultipartUpload operation to complete the upload. Upon receiving this request, Amazon S3 concatenates all the parts in ascending order by part number to create a new object. In the CompleteMultipartUpload request, you must provide the parts list and ensure that the parts list is complete. The CompleteMultipartUpload API operation concatenates the parts that you provide in the list. For each part in the list, you must provide the PartNumber value and the ETag value that are returned after that part was uploaded.

Why does the developer have to "provide the PartNumber value and the ETag value that are returned after that part was uploaded" to complete the multipart upload? This seems like a completely unnecessary step for the developer.

Now that the full object checksum is being computed, there shouldn't be any need for the developer to pass the PartNumber and ETag values to complete the multipart upload.
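For context, the bookkeeping the quoted documentation asks for looks roughly like the sketch below. It only builds the parts list locally; the `PartNumber`/`ETag` field names come from the quoted documentation, while `build_parts_list`, the sample chunks, and the use of a quoted MD5 hex string as a stand-in for the per-part etag are assumptions for illustration.

```python
import hashlib


def build_parts_list(uploaded):
    """Turn {part_number: etag} into the ordered list that
    CompleteMultipartUpload expects (ascending part number)."""
    return [{"PartNumber": n, "ETag": uploaded[n]} for n in sorted(uploaded)]


# Stand-in for the etags the service would return from each UploadPart call.
# Parts may finish out of order, so the caller has to track and sort them.
chunks = {2: b"world", 1: b"hello ", 3: b"!"}
uploaded = {n: f'"{hashlib.md5(c).hexdigest()}"' for n, c in chunks.items()}

parts = build_parts_list(uploaded)
for part in parts:
    print(part)
```

Every client has to carry this state from the first UploadPart call through to completion, which is the step I am arguing should be optional.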

Use Case

There's a need for atomic CAS operations on calls like MultipartUpload, PutObject, DeleteObject, and CopyObject, yet since the etag is unstable, the developer always has to call HeadObject first to get the etag, then use that etag in the CAS operation.

Proposed Solution

As @bhoradc wrote here, S3 now supports full object checksums, such as the CRC64NVME checksum algorithm.

My suggestion is to allow the developer to have the file's etag be based on the CRC64NVME checksum algorithm.

This could be provided as an option when uploading/putting the file. The developer can choose whether the etag is based on the current methodology (the MD5 hash of the concatenated MD5 hashes of the uploaded parts), or elect for the etag to be derived from the CRC64NVME checksum (for example, the base64 encoding of the checksum).
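As a rough illustration of why a full-object checksum would make a stable etag, the sketch below computes a CRC-64/NVME value incrementally over differently sized chunks and base64-encodes it. The polynomial and init/xor-out values are the published CRC-64/NVME parameters; the 8-byte big-endian base64 encoding is my assumption about how the checksum would be rendered, and `checksum_tag` is a hypothetical name, not an existing S3 or SDK feature.

```python
import base64

_POLY = 0x9A6C9329AC4BC9B5  # bit-reflected CRC-64/NVME polynomial
_MASK = 0xFFFFFFFFFFFFFFFF  # initial value and final XOR


def crc64nvme(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-64/NVME. Pass the previous result as `crc` to continue
    the computation across consecutive chunks."""
    crc ^= _MASK
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
    return crc ^ _MASK


def checksum_tag(chunks) -> str:
    """Hypothetical stable tag: base64 of the 8-byte big-endian checksum."""
    crc = 0
    for chunk in chunks:
        crc = crc64nvme(chunk, crc)
    return base64.b64encode(crc.to_bytes(8, "big")).decode("ascii")


def split(data: bytes, size: int):
    return [data[i:i + size] for i in range(0, len(data), size)]


payload = bytes(range(256)) * 64  # 16 KiB sample "file"

tag_a = checksum_tag(split(payload, 4096))  # 4 parts
tag_b = checksum_tag(split(payload, 1000))  # 17 parts
tag_whole = checksum_tag([payload])         # single part

print(tag_a, tag_b, tag_whole)
```

All three tags come out identical because the checksum depends only on the byte stream, not on where the part boundaries fall. A client could therefore compute the expected value before uploading, without a prior HeadObject call.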

Other Information

No response

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

AWS Java SDK version used

Any of them

JDK version used

Any of them

Operating System and version

Any of them

Labels: feature-request, needs-triage
