[SDK-parquet] parquet sized buffer and gcs handler #602

Merged

Conversation

yuunlimm
Contributor

@yuunlimm yuunlimm commented Nov 12, 2024

Add a Parquet-specific step that stores Parquet structs in a buffer and triggers uploads using gcs_handler, which can handle any of the Parquet types defined in the enum.
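
A minimal sketch of the size-triggered buffering described above. Everything here is an assumption for illustration: the `ParquetTypeEnum` variants, the `SizedBufferStep` name, the `max_buffer_size_bytes` threshold, and rows modelled as raw bytes. The PR's actual step is async and hands the drained rows to the GCS uploader; this sketch only returns them to the caller.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the PR's `ParquetTypeEnum`; the variants are illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ParquetTypeEnum {
    Transactions,
    Events,
}

/// One buffer per Parquet type: the buffered rows plus their total size in bytes.
#[derive(Default)]
pub struct ParquetBuffer {
    rows: Vec<Vec<u8>>,
    current_size_bytes: usize,
}

/// Hypothetical step that accumulates rows and drains a buffer once it is full.
pub struct SizedBufferStep {
    internal_buffers: HashMap<ParquetTypeEnum, ParquetBuffer>,
    max_buffer_size_bytes: usize,
}

impl SizedBufferStep {
    pub fn new(max_buffer_size_bytes: usize) -> Self {
        Self {
            internal_buffers: HashMap::new(),
            max_buffer_size_bytes,
        }
    }

    /// Appends `rows` to the buffer for `ty`. If the buffer has then reached the
    /// size limit, drains it and returns the rows that should be uploaded.
    pub fn push(&mut self, ty: ParquetTypeEnum, rows: Vec<Vec<u8>>) -> Option<Vec<Vec<u8>>> {
        let buffer = self.internal_buffers.entry(ty).or_default();
        for row in rows {
            buffer.current_size_bytes += row.len();
            buffer.rows.push(row);
        }
        if buffer.current_size_bytes >= self.max_buffer_size_bytes {
            buffer.current_size_bytes = 0;
            Some(std::mem::take(&mut buffer.rows))
        } else {
            None
        }
    }
}
```

Each Parquet type gets its own buffer, so a large batch of one type does not force an upload of another type's partially filled buffer.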

Screenshot 2024-11-13 at 5.41.29 PM.png

@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch 3 times, most recently from a0cf889 to 1f2a615 Compare November 12, 2024 22:51
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from 9bff5f3 to 9c6dc91 Compare November 13, 2024 19:25
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch 2 times, most recently from 0592350 to 4fc6f8c Compare November 13, 2024 23:30
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from 472b366 to daf3f6e Compare November 14, 2024 00:09
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 4fc6f8c to 0d05560 Compare November 14, 2024 00:09
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from daf3f6e to f77aaac Compare November 14, 2024 00:10
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 0d05560 to 9a4a1e2 Compare November 14, 2024 00:10
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from f77aaac to eb396de Compare November 14, 2024 00:15
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 9a4a1e2 to 2514fd7 Compare November 14, 2024 00:15
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch 3 times, most recently from 2f91ffa to 2dfdb15 Compare November 14, 2024 00:25
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch 2 times, most recently from c2e7723 to 6f12cd4 Compare November 14, 2024 00:56
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from 2dfdb15 to 1bc8672 Compare November 14, 2024 00:59
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch 2 times, most recently from 4d7551c to c59ea51 Compare November 14, 2024 01:38
@yuunlimm yuunlimm marked this pull request as ready for review November 14, 2024 01:42
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch 2 times, most recently from ef311b0 to 8f43e54 Compare November 14, 2024 17:49
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from c59ea51 to f1ca3f9 Compare November 14, 2024 17:49
@yuunlimm yuunlimm requested review from rtso and dermanyang November 14, 2024 17:52
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from f1ca3f9 to 4d56db7 Compare November 14, 2024 17:59

#[async_trait]
pub trait Uploadable {
    async fn handle_buffer(
Collaborator

Should this be called upload_buffer? handle is very generic-sounding

Contributor Author

yeah, upload makes more sense!

Collaborator

Doesn't look like this was addressed so opening back up!

Contributor Author

oops I think I missed it, updated!
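
For reference, a rough sketch of the trait after the rename. The PR's trait is async (via `#[async_trait]`); this version is synchronous only so that it compiles without extra crates. The parameter list, the return type, and the `InMemoryUploader` type are assumptions, since the diff excerpt above is cut off after the method name.

```rust
/// Synchronous stand-in for the PR's async `Uploadable` trait (an assumption made
/// so the sketch builds without `async_trait` or a runtime).
pub trait Uploadable {
    /// Uploads the buffered rows and returns how many were uploaded.
    /// The name says what the method does, rather than the generic "handle".
    fn upload_buffer(&mut self, rows: Vec<Vec<u8>>) -> Result<usize, String>;
}

/// Hypothetical in-memory uploader that records every batch it is asked to upload.
#[derive(Default)]
pub struct InMemoryUploader {
    pub uploaded_batches: Vec<Vec<Vec<u8>>>,
}

impl Uploadable for InMemoryUploader {
    fn upload_buffer(&mut self, rows: Vec<Vec<u8>>) -> Result<usize, String> {
        // Nothing to upload: skip instead of recording an empty batch.
        if rows.is_empty() {
            return Ok(0);
        }
        let count = rows.len();
        self.uploaded_batches.push(rows);
        Ok(count)
    }
}
```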

@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch 2 times, most recently from d0a5380 to 946b726 Compare November 15, 2024 18:21

#[tokio::test]
#[allow(clippy::needless_return)]
async fn test_parquet_buffer_step_trigger_upload() {
Contributor

nice

pub fn update_current_batch_metadata(&mut self, cur_batch_metadata: &TransactionMetadata) {
    if let Some(buffer_metadata) = &mut self.current_batch_metadata {
        // Update metadata fields with the current batch's end information
        buffer_metadata.end_version = cur_batch_metadata.end_version;
Collaborator

Before we set the new versions, should we add some validation that there are no gaps in the data (buffer_metadata.end_version + 1 == cur_batch_metadata.start_version)? It seems like everything is parsed in order 🤔

Contributor Author

yeah it shouldn't happen but it's good to add a safety check here!

Collaborator

Let's try to avoid panics if possible and return a Result
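
One way the gap check could look when it returns a `Result` instead of panicking. This is a sketch under assumptions: the `u64` field types, the `VersionGapError` type, and the trimmed-down `TransactionMetadata` and `ParquetBuffer` structs are illustrative and not taken from the PR.

```rust
use std::fmt;

/// Trimmed-down stand-in for the PR's `TransactionMetadata`; field types are assumed.
#[derive(Clone, Debug, PartialEq)]
pub struct TransactionMetadata {
    pub start_version: u64,
    pub end_version: u64,
}

/// Hypothetical error returned when a batch does not start right after the buffer ends.
#[derive(Debug, PartialEq)]
pub struct VersionGapError {
    pub expected_start: u64,
    pub actual_start: u64,
}

impl fmt::Display for VersionGapError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "version gap: expected batch to start at {}, got {}",
            self.expected_start, self.actual_start
        )
    }
}

impl std::error::Error for VersionGapError {}

pub struct ParquetBuffer {
    pub current_batch_metadata: Option<TransactionMetadata>,
}

impl ParquetBuffer {
    /// Extends the buffered range with the current batch, or reports a gap to the
    /// caller. On error the buffered metadata is left unchanged.
    pub fn update_current_batch_metadata(
        &mut self,
        cur_batch_metadata: &TransactionMetadata,
    ) -> Result<(), VersionGapError> {
        if let Some(buffer_metadata) = self.current_batch_metadata.as_mut() {
            let expected_start = buffer_metadata.end_version + 1;
            if cur_batch_metadata.start_version != expected_start {
                return Err(VersionGapError {
                    expected_start,
                    actual_start: cur_batch_metadata.start_version,
                });
            }
            buffer_metadata.end_version = cur_batch_metadata.end_version;
        } else {
            // Empty buffer: the current batch defines the whole range.
            self.current_batch_metadata = Some(cur_batch_metadata.clone());
        }
        Ok(())
    }
}
```

Returning the error lets the caller decide whether to log, retry, or stop the pipeline, rather than aborting the whole process from inside the buffer.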

{
    internal_buffers: HashMap<ParquetTypeEnum, ParquetBuffer>,
    pub poll_interval: Duration,
    pub buffer_uploader: U,
Collaborator

Does buffer_uploader need to be a trait? It looks like the type will always be GCSUploader.

Contributor Author

right, we can remove the generics
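
A sketch of what dropping the type parameter might look like. The `GCSUploader` field, its `upload_buffer` signature, and the `ParquetBufferStep` name are assumptions for illustration; only the `poll_interval` and `buffer_uploader` field names come from the snippet above.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the PR's `GCSUploader`; the field is illustrative only.
pub struct GCSUploader {
    pub bucket_name: String,
}

impl GCSUploader {
    /// Pretends to upload and reports how many rows it was given.
    pub fn upload_buffer(&mut self, rows: Vec<Vec<u8>>) -> usize {
        rows.len()
    }
}

/// The step stores the concrete uploader directly: no `U` type parameter and no
/// trait bound to carry through every `impl` block.
pub struct ParquetBufferStep {
    pub poll_interval: Duration,
    pub buffer_uploader: GCSUploader,
}

impl ParquetBufferStep {
    pub fn new(poll_interval: Duration, buffer_uploader: GCSUploader) -> Self {
        Self {
            poll_interval,
            buffer_uploader,
        }
    }
}
```

The trade-off is that tests can no longer swap in a mock uploader through the type parameter, so they need a real `GCSUploader` instance.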

Comment on lines 149 to 160
// if it wasn't uploaded -> we update only end_version, size, and last timestamp
if file_uploaded {
    if let Some(buffer_metadata) = &mut buffer.current_batch_metadata {
        buffer_metadata.start_version = cur_batch_metadata.start_version;
        buffer_metadata.start_transaction_timestamp =
            cur_batch_metadata.start_transaction_timestamp.clone();
    }
}
Collaborator

Is this still necessary if we already did line 147?

Contributor Author

yeah, it's slightly different b/c here we are updating the start_version and start_transaction_timestamp, which only need to be updated after we upload the file

Collaborator

Wouldn't the 2nd part of the function (lines 45-46) take care of updating after uploading the file too?

Contributor Author

oh right, if we set it to None after upload, it will be handled when we actually handle the current batch.
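
A small sketch of the behavior this thread converges on, assuming the metadata is held in an `Option` and cleared after each upload. The field types and the `reset_after_upload` helper are hypothetical.

```rust
/// Trimmed-down stand-in for the PR's `TransactionMetadata`; field types are assumed.
#[derive(Clone, Debug, PartialEq)]
pub struct TransactionMetadata {
    pub start_version: u64,
    pub end_version: u64,
    pub start_transaction_timestamp: Option<String>,
}

#[derive(Default)]
pub struct ParquetBuffer {
    pub current_batch_metadata: Option<TransactionMetadata>,
}

impl ParquetBuffer {
    pub fn update_current_batch_metadata(&mut self, cur_batch_metadata: &TransactionMetadata) {
        if let Some(buffer_metadata) = self.current_batch_metadata.as_mut() {
            // Buffer already has data: only the end of the range moves forward.
            buffer_metadata.end_version = cur_batch_metadata.end_version;
        } else {
            // Buffer is empty (first batch, or just uploaded): the start fields
            // are taken from the current batch, so no extra post-upload fix-up
            // of start_version or start_transaction_timestamp is needed.
            self.current_batch_metadata = Some(cur_batch_metadata.clone());
        }
    }

    /// Hypothetical helper called after a successful upload: hands back the
    /// metadata of the uploaded range and leaves `None` behind.
    pub fn reset_after_upload(&mut self) -> Option<TransactionMetadata> {
        self.current_batch_metadata.take()
    }
}
```

Because the `None` branch already copies the start fields from the next batch, the separate `if file_uploaded { ... }` block becomes redundant.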

@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from e939fee to 2198641 Compare November 18, 2024 21:33
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch 2 times, most recently from c7c2586 to 168ac53 Compare November 18, 2024 21:52
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch 2 times, most recently from 4455b67 to 61ecc68 Compare November 18, 2024 21:55
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 168ac53 to 40b32d4 Compare November 18, 2024 21:58
Collaborator

@rtso rtso left a comment

looks good!!

@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 40b32d4 to 191f514 Compare November 18, 2024 23:13
@yuunlimm yuunlimm changed the base branch from 11-12-_sdk-parquet_parquet_default_processor_extractor_step to graphite-base/602 November 18, 2024 23:15
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 191f514 to 55389e1 Compare November 18, 2024 23:17
@yuunlimm yuunlimm changed the base branch from graphite-base/602 to main November 18, 2024 23:18
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 55389e1 to 6c2a18e Compare November 18, 2024 23:18
@yuunlimm
Contributor Author

updated test with concrete GCSUploader instance, verified that it won't upload anything since data is empty.

@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 6c2a18e to d8954bd Compare November 18, 2024 23:26
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from d8954bd to b482e4e Compare November 18, 2024 23:32
@yuunlimm yuunlimm merged commit 5160738 into main Nov 19, 2024
7 checks passed
@yuunlimm yuunlimm deleted the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch November 19, 2024 00:22
3 participants