Description
Describe the feature
As I described here, the current etag scheme for multi-part uploads is bad. In summary, the etag for a multi-part upload is the MD5 hash of the concatenated MD5 hashes of the uploaded parts. Because the parts can be any arbitrary size, the same file can produce different etags, so the etag for multi-part uploads is pretty much just an arbitrary string.
It would be far better if the etag was meaningful.
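To make the problem concrete, here is a minimal sketch of the multipart etag scheme as summarized above: the MD5 of the concatenated per-part MD5 digests, with a dash and the part count appended. The function name `multipart_etag` and the tiny part sizes are only for illustration (real parts are much larger), and this models the scheme locally rather than calling S3.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Model of a multipart etag: MD5 over the concatenated part MD5 digests."""
    # Hash each part separately, exactly as it would be split for upload.
    digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    # Hash the concatenated digests, then append "-<number of parts>".
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


payload = bytes(range(256)) * 64  # 16 KiB of sample data

etag_a = multipart_etag(payload, 4096)  # same bytes, 4 parts
etag_b = multipart_etag(payload, 8192)  # same bytes, 2 parts
print(etag_a)
print(etag_b)
```

The two printed values differ even though `payload` is identical in both calls, and neither matches the plain MD5 of the whole payload.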
Let me demonstrate why this is important:
- The PutObject uses the etag for the CAS `If-Match` and `If-None-Match` operators.
- The DeleteObject uses the etag for `If-Match`.
- The CopyObject uses the etag for `If-Match` and `If-None-Match` for both the source objects and the destination objects.
- The CompleteMultipartUpload uses the etag for the CAS `If-Match` and `If-None-Match` operators.
The problem is that when the etag is based on completely meaningless data (or just an arbitrary string), then you can have the following scenario:
SomeCompany uploads a huge file (uploaded via the multipart-upload
process) to S3 using a program. SomeCompany wants to reupload
the file later on, but only if it's not already there. However,
since the etag is pretty much a completely arbitrary string, the
same file uploaded one day could have a completely different etag
when uploaded a different day. That means the etags could be
different even for the same exact file!
Now, SomeCompany could work around the unstable etag by using 2 queries (for example, HeadObject followed by MultipartUpload):
- HeadObject - see if the file is there by comparing one of the full-object checksums to make sure.
- MultipartUpload - only if the file is not there, upload the file. Use the etag found in the HeadObject call as part of the condition.
But having to make 2 queries just to do a CAS operation sucks! If the etag were based on stable data, those 2 queries wouldn't be necessary.
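The two-query workaround can be sketched with a toy in-memory model. Everything here is hypothetical and is not the AWS SDK: `ToyBucket`, `upload_if_changed`, and `PreconditionFailed` are made-up names, `zlib.crc32` stands in for a full-object checksum, and a random `uuid` stands in for an etag that carries no information about the content.

```python
import uuid
import zlib
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


class PreconditionFailed(Exception):
    """Raised when a conditional write does not match the stored state."""


@dataclass
class ToyBucket:
    # key -> (etag, full-object checksum); a stand-in for the real service
    objects: Dict[str, Tuple[str, int]] = field(default_factory=dict)
    round_trips: int = 0

    def head_object(self, key: str) -> Optional[Tuple[str, int]]:
        self.round_trips += 1
        return self.objects.get(key)

    def upload(self, key: str, data: bytes,
               if_match: Optional[str] = None,
               if_none_match: Optional[str] = None) -> str:
        self.round_trips += 1
        current = self.objects.get(key)
        if if_none_match == "*" and current is not None:
            raise PreconditionFailed(key)
        if if_match is not None and (current is None or current[0] != if_match):
            raise PreconditionFailed(key)
        etag = uuid.uuid4().hex  # models an etag unrelated to the content
        self.objects[key] = (etag, zlib.crc32(data))
        return etag


def upload_if_changed(bucket: ToyBucket, key: str, data: bytes) -> str:
    head = bucket.head_object(key)  # query 1: fetch etag and checksum
    if head is None:
        bucket.upload(key, data, if_none_match="*")  # query 2
        return "uploaded"
    etag, checksum = head
    if checksum == zlib.crc32(data):
        return "skipped"  # same content, nothing to do
    bucket.upload(key, data, if_match=etag)  # query 2, guarded by the etag
    return "replaced"


bucket = ToyBucket()
first = upload_if_changed(bucket, "big.bin", b"same bytes")
second = upload_if_changed(bucket, "big.bin", b"same bytes")
print(first, second, bucket.round_trips)
```

In this model the extra `head_object` call exists only because the etag cannot be predicted from the local file. With a content-derived etag, the client could compute the expected value itself and send a single conditional request.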
This would also improve the multipart upload process. Currently, as documented here, the developer has to do the following to complete the multipart upload:
You first initiate the multipart upload and then upload all parts using the UploadPart operation or the UploadPartCopy operation. After successfully uploading all relevant parts of an upload, you call this CompleteMultipartUpload operation to complete the upload. Upon receiving this request, Amazon S3 concatenates all the parts in ascending order by part number to create a new object. In the CompleteMultipartUpload request, you must provide the parts list and ensure that the parts list is complete. The CompleteMultipartUpload API operation concatenates the parts that you provide in the list. For each part in the list, you must provide the PartNumber value and the ETag value that are returned after that part was uploaded.
Why does the developer have to "provide the PartNumber value and the ETag value that are returned after that part was uploaded" to complete the multipart upload? This seems like a completely unnecessary step for the developer.
Now that the full object checksum is being computed, there shouldn't be any need for the developer to have to pass the PartNumber and Etag values to complete the multipart upload.
Use Case
There's a need for atomic CAS operations on MultipartUpload, PutObject, DeleteObject, and CopyObject, yet since the etag is unstable, the developer always has to call HeadObject first to get the etag, then use that etag in the CAS operation.
Proposed Solution
As @bhoradc wrote here, S3 now supports full object checksums, such as the CRC64NVME checksum algorithm.
My suggestion is to allow the developer to have the file's etag be based on the CRC64NVME checksum algorithm.
This could be provided as an option when uploading/putting the file. The developer could choose whether the etag is based on the current methodology (the MD5 hash of the MD5 hashes of the uploaded parts) or on the CRC64NVME checksum of the full object (for example, the base64 encoding of the CRC64NVME value).
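A rough sketch of why a checksum-derived etag would be stable: a CRC over the full object does not depend on how the bytes were split into parts. The code below is a slow, bitwise CRC-64/NVME written for illustration, using the parameters as I understand them from the published definition (reflected polynomial `0x9A6C9329AC4BC9B5`, initial value and final XOR of all ones). The name `checksum_etag` and the choice of base64 over the 8 big-endian bytes are assumptions about what such an etag might look like, not an existing S3 feature.

```python
import base64

_POLY_REFLECTED = 0x9A6C9329AC4BC9B5  # CRC-64/NVME polynomial, bit-reflected
_MASK = 0xFFFFFFFFFFFFFFFF            # used as both initial value and final XOR


def crc64nvme_update(crc: int, chunk: bytes) -> int:
    """Feed one chunk into the running CRC register (bit by bit, LSB first)."""
    for byte in chunk:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (_POLY_REFLECTED if crc & 1 else 0)
    return crc


def checksum_etag(data: bytes, part_size: int) -> str:
    """Hypothetical etag: base64 of the full-object CRC-64/NVME."""
    crc = _MASK
    # Parts are streamed in order here; the split points do not affect the result.
    for i in range(0, len(data), part_size):
        crc = crc64nvme_update(crc, data[i:i + part_size])
    crc ^= _MASK
    return base64.b64encode(crc.to_bytes(8, "big")).decode("ascii")


payload = bytes(range(256)) * 16  # 4 KiB of sample data

stable_a = checksum_etag(payload, 512)
stable_b = checksum_etag(payload, 1024)
print(stable_a == stable_b)
```

Unlike the MD5-of-MD5s scheme, the value is the same for any part size, so a client could compute it from the local file and use it directly in `If-Match` or `If-None-Match` without a prior HeadObject call.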
Other Information
No response
Acknowledgements
- I may be able to implement this feature request
- This feature might incur a breaking change
AWS Java SDK version used
Any of them
JDK version used
Any of them
Operating System and version
Any of them