
chore(architecture): consolidate sink I/O driver logic into reusable component #9215

Merged: 14 commits into master, Oct 5, 2021

Conversation


@tobz tobz commented Sep 17, 2021

This PR moves the logic previously baked into the run_io method for the S3 sink into a reusable component called Driver. Simply put: you give it a Stream of items that can be used as requests for a Service, and it handles making a call for each item, as well as providing finalization and acking as the responses come through.

As it stands, we've got a lot of changes going on here as part of trying to make the sink-specific boilerplate as simple and flexible as possible. To name a few:

  • a new trait, Encoder, that allows defining a type which can "encode" a single event (EncodingConfiguration + Encoding enums)
  • a new generalized StandardEncodings enum that implements Encoder for the most common encodings
  • Compression tweaked slightly to depend more on the compression levels defined by flate2 itself
  • a new trait, RequestBuilder, to provide a starting point for a generic "how do we take events and make a request?" interface
  • new stream combinators, from SinkBuilderExt, that wrap streams of events with common building blocks, such as batching, request building, and Stream -> Service driving, for building a complete sink
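To make the Driver idea concrete, here is a deliberately simplified, synchronous sketch of the pattern: pull items from a stream, issue one service call per item, and acknowledge each item as its response comes back. All names here are hypothetical and the real component is async, driving a `Stream` into a `tower::Service`; this only illustrates the shape of the control flow.

```rust
/// Simplified, synchronous sketch of a "driver": for each (seq_no, payload)
/// item, call the service, and ack successful items with their payload size.
/// Returns the number of items acknowledged.
fn drive<I, S, F>(items: I, mut service: S, mut ack: F) -> usize
where
    I: IntoIterator<Item = (u64, Vec<u8>)>, // (sequence number, payload)
    S: FnMut(Vec<u8>) -> Result<(), ()>,    // stand-in for a Service call
    F: FnMut(u64, usize),                   // ack(seq_no, payload size)
{
    let mut acked = 0;
    for (seq_no, payload) in items {
        let size = payload.len();
        if service(payload).is_ok() {
            ack(seq_no, size);
            acked += 1;
        }
    }
    acked
}

fn main() {
    let items = vec![(0u64, vec![1, 2, 3]), (1, vec![4, 5])];
    let mut acks = Vec::new();
    let n = drive(items, |_req| Ok(()), |seq, size| acks.push((seq, size)));
    assert_eq!(n, 2);
    assert_eq!(acks, vec![(0u64, 3usize), (1, 2)]);
    println!("ok");
}
```

The key property the sketch captures is the separation of concerns: the driver owns the loop and the acking, while encoding and request construction live behind the Encoder and RequestBuilder traits described above.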


/// return an error for a legitimate reason in the future.
pub async fn run(self) -> Result<(), ()> {
    let in_flight = FuturesUnordered::new();
    let mut pending_acks = HashMap::new();
Contributor:

Small note that we could probably parameterize this with twox-hash or some such if we wanted.

Contributor Author:

I had the exact same thought. There's actually a nohash-hasher crate since our keys are already u64. I guess if we've both had this thought, maybe I should just go ahead and do it.

Contributor:

TIL about nohash-hasher.
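Since the keys are already u64, a no-op hasher sidesteps hashing entirely. Below is an illustrative sketch of the idea using only the standard library; the nohash-hasher crate mentioned above provides a vetted version of this, and this is not Vector's actual code.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// A hasher that passes the u64 key through untouched.
/// (Illustrative sketch of what nohash-hasher does; not the crate itself.)
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("only u64 keys are supported");
    }
    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

/// A HashMap keyed by u64 that skips hashing entirely.
type PendingAckMap = HashMap<u64, usize, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let mut pending: PendingAckMap = PendingAckMap::default();
    pending.insert(42, 128);
    assert_eq!(pending.get(&42), Some(&128));
    println!("ok");
}
```

This works because `Hash for u64` feeds the key through `write_u64`, so `finish` returns the key itself as the "hash".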

Member:

This is almost a glorified VecDeque. I wonder if it could be emulated with a wrapper around VecDeque<Option<(u64, u64)>> where an insert adds "blanks" if the entry isn't actually sequentially next.

Contributor Author:

That's a good point regarding VecDeque. I'll have to think about your idea... off the top of my head, I can't visualize it, so might just be one of those things I need to write down first to grok.

@tobz (Contributor Author), Oct 4, 2021:

In 05bc5f3, I reworked this as a BinaryHeap because I didn't like the idea of having to insert blanks that we would then have to find with a linear scan when a call finishes and we want to mark its pending acknowledgement as completed.

Admittedly, though, the code using BinaryHeap is longer than the HashMap approach.
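A rough sketch of the heap-based approach being discussed (a hypothetical simplification, not the PR's actual code): keep completed (seq_no, ack_size) pairs in a min-heap and drain them only while the smallest entry is sequentially next, so acknowledgements are released strictly in order without inserting blanks or scanning.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Tracks out-of-order completions and releases acks strictly in sequence.
/// (Hypothetical simplification of the approach described above.)
struct PendingAcks {
    next_seq: u64,
    completed: BinaryHeap<Reverse<(u64, usize)>>, // min-heap by seq_no
}

impl PendingAcks {
    fn new() -> Self {
        Self { next_seq: 0, completed: BinaryHeap::new() }
    }

    /// Record a finished call; return the total ack size now releasable.
    fn complete(&mut self, seq_no: u64, ack_size: usize) -> usize {
        self.completed.push(Reverse((seq_no, ack_size)));
        let mut releasable = 0;
        // Drain entries while the smallest seq_no is the next expected one.
        while let Some(Reverse((seq, size))) = self.completed.peek().copied() {
            if seq != self.next_seq {
                break;
            }
            self.completed.pop();
            self.next_seq += 1;
            releasable += size;
        }
        releasable
    }
}

fn main() {
    let mut acks = PendingAcks::new();
    assert_eq!(acks.complete(1, 10), 0); // seq 0 not done yet: hold
    assert_eq!(acks.complete(2, 20), 0);
    assert_eq!(acks.complete(0, 5), 35); // seq 0 arrives: release 0, 1, 2
    println!("ok");
}
```

Compared to the VecDeque-of-blanks idea, each completion costs O(log n) instead of a linear scan, at the price of somewhat more code.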

@blt

blt commented Sep 17, 2021

I really like where this is headed.

@bruceg (Member) left a comment:

A big 👍🏻 to the reusable component in vector-core.

@blt blt mentioned this pull request Sep 20, 2021
40 tasks
@tobz tobz added the ci-condition: integration tests enable Run integration tests on this PR label Sep 24, 2021
@tobz tobz force-pushed the tobz/streamify-new-sink-io-task branch from a8a2430 to d24f943 Compare September 24, 2021 15:17
@tobz tobz force-pushed the tobz/streamify-new-sink-io-task branch from d24f943 to 45d95cd Compare September 24, 2021 15:25
@tobz tobz force-pushed the tobz/streamify-new-sink-io-task branch from 1f30482 to 021a974 Compare September 24, 2021 18:50
@lukesteensen (Member) left a comment:

Still working through this, but a couple quick notes.

tobz added 3 commits October 4, 2021 12:10
Signed-off-by: Toby Lawrence <[email protected]>
Signed-off-by: Toby Lawrence <[email protected]>
Signed-off-by: Toby Lawrence <[email protected]>
Signed-off-by: Toby Lawrence <[email protected]>
@lukesteensen (Member) left a comment:

Looks great to me! Just one note about a log message.

let result: Result<(u64, usize), JoinError> = result;
match result {
    Ok((seq_no, ack_size)) => {
        trace!(message = "Sending request.", seq_no, ack_size);
Member:

Is this message accurate? It seems like we've already sent the request and gotten a response at this point.

Contributor Author:

Oh! Good catch.

@@ -220,7 +220,7 @@ mod integration_tests {

     let (_lines, events, receiver) = make_events_batch(1, 1);
     sink.run(events).await.unwrap();
-    assert_eq!(receiver.await, BatchStatus::Errored);
+    assert_eq!(receiver.await, BatchStatus::Failed);
Contributor:

It's not clear to me why this switch happens, but I'm not opposed; mostly I want to flag that there's some logic behind this change that I'm missing. I have to look up the difference between 'failed' and 'errored' every time: they mean "at least one event in the batch had a permanent failure" and "at least one event in the batch had a transient error in delivery", respectively. These definitions raise tricky questions about priority when a batch contains both an event that suffered a permanent failure and one that hit a transient error, so I'd imagine the line between them is blurry in practice in Vector today.
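One way to resolve the priority question raised here is to fold per-event outcomes into a batch status where the "worst" outcome wins. The sketch below uses hypothetical types for illustration only; Vector's actual BatchStatus/EventStatus types and merge rules may differ.

```rust
/// Per-event delivery outcome, ordered from best to worst.
/// (Hypothetical enum for illustration; not Vector's real types.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum EventOutcome {
    Delivered,
    Errored, // transient delivery error
    Failed,  // permanent failure
}

/// Fold per-event outcomes into one batch-level status, letting the
/// worst outcome win: Failed > Errored > Delivered.
fn batch_status(outcomes: &[EventOutcome]) -> EventOutcome {
    outcomes
        .iter()
        .copied()
        .max()
        .unwrap_or(EventOutcome::Delivered)
}

fn main() {
    use EventOutcome::*;
    assert_eq!(batch_status(&[Delivered, Errored, Delivered]), Errored);
    // A permanent failure outranks a transient error in the same batch.
    assert_eq!(batch_status(&[Errored, Failed, Delivered]), Failed);
    assert_eq!(batch_status(&[]), Delivered);
    println!("ok");
}
```

Deriving Ord on the enum makes the priority rule explicit in the variant order, so the "permanent beats transient" decision is encoded in one place.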

@blt (Contributor) left a comment:

I noticed you've left the Datadog log sink's run_io in place here. I'd be happy to see it converted in this PR or in a follow-up issue, whichever you prefer. I also had a question about a follow-up issue for moving some of this code into core.

That said, this is excellent and I look forward to it being merged up.

@tobz

tobz commented Oct 5, 2021

@blt I'll be working on an immediate follow-up PR where I handle some of the code movement into vector_core as well as converting the Datadog Logs sink. I just wanted to get this merged in since now other folks have PRs rebased on my branch, and it's all starting to get a bit too long in the tooth. 😅

@blt

blt commented Oct 5, 2021

> @blt I'll be working on an immediate follow-up PR where I handle some of the code movement into vector_core as well as converting the Datadog Logs sink. I just wanted to get this merged in since now other folks have PRs rebased on my branch, and it's all starting to get a bit too long in the tooth. 😅

I'm for it. Merge this thing.

@tobz tobz merged commit 6b5141b into master Oct 5, 2021
@tobz tobz deleted the tobz/streamify-new-sink-io-task branch October 5, 2021 00:43
jszwedko pushed a commit that referenced this pull request Oct 8, 2021
…component (#9215)

* chore(architecture): consolidate sink I/O driver logic into reusable component

Signed-off-by: Toby Lawrence <[email protected]>