-
Notifications
You must be signed in to change notification settings - Fork 842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement bulk_delete_request for Azure #5681
Conversation
Sorry for the delay, I'll try to get to this next week, it's a spicy one 😅 Azure's APIs are truly something else 🙃 |
@tustvold Don't tell me! It's sad to see the amount of gymnastics their official clients have to do, like encoding and decoding HTTP manually just because someone thought it would be a good idea to send quasi HTTP requests within a multipart request... 🫠 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've gone through this, and whilst I don't see anything obviously incorrect, I am quite nervous about it. It is implementing a fairly complex specification, https://www.ietf.org/rfc/rfc2046, and I worry there may be incorrect assumptions, implementation oversights, etc... If this were being used for a read-only request, that's one thing, but the potential impact of a malformed bulk delete is high.
I don't know if there are mature Rust implementations of multipart mime messages, but if there are, one idea might be to add this as an optional dependency, potentially gated on its own feature (e.g. azure_batch), and use that to implement this functionality. That would delegate the maintenance, correctness burden onto a crate focused on implementing that particular specification correctly
@@ -240,6 +268,157 @@ impl<'a> PutRequest<'a> { | |||
} | |||
} | |||
|
|||
#[inline] | |||
fn extend(dst: &mut Vec<u8>, data: &[u8]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a shorthand to avoid typing extend_from_slice
too much, I saw it in hyper's source code and I think it makes it a little more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think it obfuscates what the code is doing, but I don't feel strongly
} | ||
|
||
// Write header names as title case. The header name is assumed to be ASCII. | ||
fn title_case(dst: &mut Vec<u8>, name: &[u8]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this may not be necessary as headers are case insensitive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Azure rejects requests if some headers aren't using a very specific casing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 Wow...
object_store/src/azure/client.rs
Outdated
}; | ||
|
||
// Parse part response headers | ||
let mut headers = [httparse::EMPTY_HEADER; 10]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment to above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
httparse
expects a slice of headers and does not attempt to grow it, so we would need to make the code more complex to support arbitrary headers. I chose to increase the amount to a more conservative one and explain it in a comment
My opinion is that having this feature as a part of object store (with a caveat in the doc and gated behind a non default feature flag) would be a good idea. My rationale is that while I agree with the technical concerns, unless there is some alternate strategy for implementing this feature today this seems like the best way to get experience / testing for this feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest adding dedicated tests for the multipart parse and generate logic, similar to: https://github.com/apache/opendal/blob/be6f359d15b965147d2c0ecc4de93a95863167aa/core/src/raw/http_util/multipart.rs#L748
@tustvold @konjac I still have to take the time to go through all comments here but thanks in advance for taking the time to review it. I can understand the hesitation of approving something that does manual encoding/decoding like this and I would personally hope that we could rely on something else that handle the trickiest bits for us but my searches for a crate that could handle this were unfruitful:
But what really ended my search was looking at how the official clients do this and realizing they are also doing it manually: |
@tustvold After going through the comments, I think this is ready for another review pass. |
I'm afraid I do not have capacity in the foreseeable future for reviewing this, as I no longer work on arrow-rs. One of the other maintainers may be able to help out |
Maybe @alamb could help me find a reviewer for this? |
Hi @andrebsguedes, I'm willing to help review it later this week, although I'm not a maintainer of arrow-rs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, this looks good overall. The only thing missing from my side is unit tests for the multipart parse logic.
The bulk delete request builder and response parsing logic are much more complex compared to other requests. I suggest we have dedicated unit tests specifically for them.
50dfb18
to
e38ee50
Compare
e38ee50
to
412c3b2
Compare
Hi @Xuanwo, thanks for the review, finally got some time to add the tests for this, would appreciate another pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @andrebsguedes a lot for your hardwork!
@tustvold Do you know anyone that can help with merging this now that we have got the approval? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I think this is fine as is, but I left some comments on how we could reduce the dependency footprint. This could also be done as a follow on.
} | ||
|
||
// Write header names as title case. The header name is assumed to be ASCII. | ||
fn title_case(dst: &mut Vec<u8>, name: &[u8]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 Wow...
object_store/src/azure/client.rs
Outdated
let credential = self.get_credential().await?; | ||
|
||
// https://www.ietf.org/rfc/rfc2046 | ||
let boundary = format!("batch_{}", uuid::Uuid::new_v4()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this actually needs to be a UUID or if we could just generate 128-bits of random data and avoid the additional dependency
|
||
// Encode end marker | ||
extend(&mut body_bytes, b"--"); | ||
extend(&mut body_bytes, boundary.as_bytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we validate that boundary doesn't appear in the encoded body?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't because the first thing we add to body_bytes
is the boundary itself (within serialize_part_delete_request
)
object_store/src/azure/client.rs
Outdated
|
||
let stream = batch_response.bytes_stream(); | ||
|
||
let mut multipart = multer::Multipart::new(stream, boundary); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to go back and forth on this, but looking at multer it has a fairly monstrous dependency footprint, including https://crates.io/crates/encoding_rs/0.8.35 which is 1MB on its own...
Perhaps we could just revert to the custom parsing logic you had before
@@ -240,6 +268,157 @@ impl<'a> PutRequest<'a> { | |||
} | |||
} | |||
|
|||
#[inline] | |||
fn extend(dst: &mut Vec<u8>, data: &[u8]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think it obfuscates what the code is doing, but I don't feel strongly
@tustvold No problem! Ready for another pass |
Thank you for sticking with this |
Which issue does this PR close?
Closes #5680.