Stable ETag for Multipart Uploads

### Describe the feature

As I described [here](https://github.com/aws/aws-sdk-java-v2/issues/4705), the current etag scheme for multi-part uploads is problematic. In summary, the etag for multi-part uploads is based on [the md5 hash of the concatenated md5 hashes of each uploaded part](https://stackoverflow.com/a/19896823/1938094), followed by the number of parts; as a result, the etag depends on how the file was split into parts rather than only on its content, so it is pretty much just an arbitrary string (because the upload parts could be any arbitrary size).

It would be far better if the etag was meaningful.
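To make the instability concrete, here is a minimal sketch of the scheme described in the linked answer. It assumes the commonly documented behavior for uploads without SSE-KMS/SSE-C encryption; `multipart_etag` is a hypothetical helper, not an SDK function, and the tiny part sizes are for illustration only (real parts other than the last must be at least 5 MiB).

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Emulate the multipart etag: md5 over the concatenated per-part md5 digests."""
    # Split the payload the way an uploader with this part size would.
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    # Concatenate the raw 16-byte md5 digest of every part, then md5 the result.
    combined = hashlib.md5(b"".join(hashlib.md5(p).digest() for p in parts))
    # The suffix is the number of parts.
    return f"{combined.hexdigest()}-{len(parts)}"


payload = b"x" * 1000  # stand-in for a large file
etag_small_parts = multipart_etag(payload, 100)  # 10 parts
etag_large_parts = multipart_etag(payload, 250)  # 4 parts

# Same bytes, different part size, different etag.
print(etag_small_parts)
print(etag_large_parts)
```

The two printed values differ even though `payload` is identical, which is exactly why the etag cannot be predicted from the file content alone.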

Let me demonstrate why this is important:

- The [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html#API_PutObject_RequestSyntax) uses the etag for the compare-and-swap (CAS) [If-Match](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/If-Match) and [If-None-Match](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/If-None-Match) operators.
- The [DeleteObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObject.html#API_DeleteObject_RequestSyntax) uses the etag for `If-Match`.
- The [CopyObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObject.html#API_CopyObject_RequestSyntax) uses the etag for `If-Match` and `If-None-Match` for both the source objects and the destination objects.
- The [CompleteMultipartUpload](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html) uses the etag for the CAS `If-Match` and `If-None-Match` operators.

The problem is that when the etag is based on completely meaningless data (or just an arbitrary string), then you can have the following scenario:

```
SomeCompany uploads a huge file (uploaded via the multipart-upload
process) to S3 using a program. SomeCompany wants to reupload
the file later on, but only if it's not already there. However,
since the etag is pretty much a completely arbitrary string, the
same file uploaded one day could have a completely different etag
when uploaded a different day. That means the etags could be
different even for the same exact file!
```

Now, SomeCompany could circumvent the unstable etag by using 2 queries (for example, HeadObject followed by MultipartUpload):

1. HeadObject - see if the file is there by comparing one of the full-object checksums to make sure.
2. MultipartUpload - only if the file is not there, upload the file. Use the etag found in the HeadObject call as part of the condition.

But having to make 2 queries just to do a CAS operation sucks! If the etag were based on stable data, those 2 queries wouldn't be necessary.

--------------------------------------------

This would also improve the multipart upload process. Currently, as documented [here](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html), the developer has to do the following to complete the multipart upload:

> You first initiate the multipart upload and then upload all parts using the [UploadPart](https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPart.html) operation or the [UploadPartCopy](https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPartCopy.html) operation. After successfully uploading all relevant parts of an upload, you call this CompleteMultipartUpload operation to complete the upload. Upon receiving this request, Amazon S3 concatenates all the parts in ascending order by part number to create a new object. In the CompleteMultipartUpload request, you must provide the parts list and ensure that the parts list is complete. The CompleteMultipartUpload API operation concatenates the parts that you provide in the list. For each part in the list, you must provide the PartNumber value and the ETag value that are returned after that part was uploaded.

Why does the developer have to "provide the PartNumber value and the ETag value that are returned after that part was uploaded" to complete the multipart upload? This seems like a completely unnecessary step for the developer.

Now that the full object checksum is being computed, there shouldn't be any need for the developer to have to pass the PartNumber and Etag values to complete the multipart upload.

### Use Case

There's a need for atomic CAS operations on `MultipartUpload`, `PutObject`, `DeleteObject`, and `CopyObject`, yet since the etag is unstable, the developer always has to call `HeadObject` first to get the etag, then use that etag in the CAS operation.

### Proposed Solution

As @bhoradc wrote [here](https://github.com/aws/aws-sdk-java-v2/issues/4705#issuecomment-4041768594), S3 now supports full object checksums, such as the `CRC64NVME` checksum algorithm.

My suggestion is to allow the developer to have the file's etag be based on the `CRC64NVME` checksum algorithm.

This could be provided as an option when uploading/putting the file. The developer can choose whether the etag is based on the current methodology (the md5 hash of the concatenated md5 hashes of each uploaded part), or whether the etag is derived from the `CRC64NVME` full object checksum (for example, the base64-encoded `CRC64NVME` value).
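As a rough illustration of why a full object checksum would make a stable etag, the sketch below feeds the parts through a running CRC and shows that the result does not depend on the part size. It uses CRC32 from the Python standard library as a stand-in, since `CRC64NVME` is not available there; the base64-of-big-endian-bytes encoding and the helper names are assumptions made for the example, not a proposed wire format.

```python
import base64
import zlib
from typing import Iterable, List


def split(data: bytes, part_size: int) -> List[bytes]:
    """Cut the payload into parts, as a multipart uploader would."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def full_object_checksum(parts: Iterable[bytes]) -> str:
    """Running CRC32 over all parts in order (stand-in for CRC64NVME)."""
    crc = 0
    for part in parts:
        # Passing the previous value continues the same CRC across parts.
        crc = zlib.crc32(part, crc)
    # Assumed encoding: base64 of the 4 big-endian checksum bytes.
    return base64.b64encode(crc.to_bytes(4, "big")).decode("ascii")


payload = b"x" * 1000  # stand-in for a large file
checksum_a = full_object_checksum(split(payload, 100))  # 10 parts
checksum_b = full_object_checksum(split(payload, 250))  # 4 parts

# Same bytes give the same value regardless of how they were split.
print(checksum_a == checksum_b)
```

An etag derived this way could be computed by the client before the upload starts, so a single conditional request would be enough for the CAS use cases above.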

### Other Information

_No response_

### Acknowledgements

- [ ] I may be able to implement this feature request
- [ ] This feature might incur a breaking change

### AWS Java SDK version used

Any of them

### JDK version used

Any of them

### Operating System and version

Any of them
