Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDK-parquet] parquet default processor extractor step #601

Merged

Conversation

yuunlimm
Copy link
Contributor

@yuunlimm yuunlimm commented Nov 12, 2024

Description

This is to add an extractor step for parquet default processor.

Changes invovled:

  • added a map to filter parquet structs.
  • added a new model RawTableItem and a trait TableItemConvertible which is implemented for both Parquet and Postgres TableItem.
  • updated a (Postgres) TableItem::from_write_table_item to return table_items only

Context on why we added a new RawTableItem model:
tl;dr
This is to reduce the duplicated code as much as possible and reuse the common parsing logic. RawTableItem represents parsed transaction data. This struct will act as the unified output from the parsing function. Parquet TableItem and Postgres TableItem will implement a TableItemConvertible trait to convert RawTableItem into the appropriate format.

With this approach, parsing logic is centralized, and each processor (Postgres or Parquet) can convert RawTableItem to its respective TableItem type, keeping code DRY and maintainable while supporting format-specific needs.

@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch 2 times, most recently from b4f46db to 9bff5f3 Compare November 12, 2024 19:55
@yuunlimm yuunlimm marked this pull request as ready for review November 12, 2024 22:03
rust/sdk-processor/src/steps/mod.rs Outdated Show resolved Hide resolved
Comment on lines 97 to 139
pub fn process_transactions(
transactions: Vec<Transaction>,
) -> (
Vec<MoveResource>,
Vec<WriteSetChangeModel>,
Vec<TransactionModel>,
Vec<TableItem>,
Vec<MoveModule>,
) {
// this will be removed in the future.
let mut transaction_version_to_struct_count = AHashMap::new();
let (txns, _, write_set_changes, wsc_details) = TransactionModel::from_transactions(
&transactions,
&mut transaction_version_to_struct_count,
);

let mut move_modules = vec![];
let mut move_resources = vec![];
let mut table_items = vec![];

for detail in wsc_details {
match detail {
WriteSetChangeDetail::Module(module) => {
move_modules.push(module);
},
WriteSetChangeDetail::Resource(resource) => {
move_resources.push(resource);
},
WriteSetChangeDetail::Table(item, _, _) => {
table_items.push(item);
},
}
}

(
move_resources,
write_set_changes,
txns,
table_items,
move_modules,
)
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to re-use the process_transactions from the Parquet default processor?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better yet, resuse this one from the Postgres default processor:

Copy link
Contributor Author

@yuunlimm yuunlimm Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we can reuse the the process_transactions from the parquet default processor!

just sharing the context why initially I decided not to re-use the process_transactions from the parquet processor because there is that we wouldn't need the struct count map. having new code in this file allows us to simply remove the code later after migrated all. but it already re-uses some of the functions from the existing parquet processor lol.

Or better yet, resuse this one from the Postgres default processor:

I don't think we can reuse the current version of process_transactions from the postgres default processor since they are kind of diverged due to Parquet migration. Keeping them as separate functions could simplify the code by avoiding conditionals for deprecated tables, making each version cleaner and more maintainable.

Alternatively, if there’s significant overlap, we could refactor shared logic into helper functions. What do you think—is it better to keep them separate or refactor shared portions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to understand the context: what part of the Postgres and Parquet diverged during the migration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unresolving, also curious about the above ^ @yuunlimm

@yuunlimm yuunlimm force-pushed the restored_branch branch 2 times, most recently from 7e0016b to 58ac581 Compare November 13, 2024 19:25
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from 9bff5f3 to 9c6dc91 Compare November 13, 2024 19:25
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from 472b366 to daf3f6e Compare November 14, 2024 00:09
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch 3 times, most recently from eb396de to 16f0a25 Compare November 14, 2024 00:20
@yuunlimm yuunlimm changed the base branch from restored_branch to graphite-base/601 November 14, 2024 00:24
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from 16f0a25 to 2f91ffa Compare November 14, 2024 00:25
@yuunlimm yuunlimm changed the base branch from graphite-base/601 to main November 14, 2024 00:25
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch 3 times, most recently from 1bc8672 to ef311b0 Compare November 14, 2024 17:48
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from ef311b0 to 8f43e54 Compare November 14, 2024 17:49
@yuunlimm yuunlimm requested a review from rtso November 14, 2024 17:52
Copy link
Contributor

@dermanyang dermanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving to unblock with some questions

_parquet_table_items,
move_modules,
),
_transaction_version_to_struct_count,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious what the convention is in general for var names with leading underscore?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means the variables are unused, so the compiler doesn't mark this as a lint error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ +1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would you do the leading underscore instead of just the underscore like in

 let (txns, _, write_set_changes, wsc_details) = TransactionModel::from_transactions(
        &transactions,
        &mut transaction_version_to_struct_count,
    );

Comment on lines 97 to 139
pub fn process_transactions(
transactions: Vec<Transaction>,
) -> (
Vec<MoveResource>,
Vec<WriteSetChangeModel>,
Vec<TransactionModel>,
Vec<TableItem>,
Vec<MoveModule>,
) {
// this will be removed in the future.
let mut transaction_version_to_struct_count = AHashMap::new();
let (txns, _, write_set_changes, wsc_details) = TransactionModel::from_transactions(
&transactions,
&mut transaction_version_to_struct_count,
);

let mut move_modules = vec![];
let mut move_resources = vec![];
let mut table_items = vec![];

for detail in wsc_details {
match detail {
WriteSetChangeDetail::Module(module) => {
move_modules.push(module);
},
WriteSetChangeDetail::Resource(resource) => {
move_resources.push(resource);
},
WriteSetChangeDetail::Table(item, _, _) => {
table_items.push(item);
},
}
}

(
move_resources,
write_set_changes,
txns,
table_items,
move_modules,
)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unresolving, also curious about the above ^ @yuunlimm

Copy link
Contributor

@dermanyang dermanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. thanks for the detailed decription.

Copy link
Collaborator

@rtso rtso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with comments!

@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch 2 times, most recently from 2198641 to 4455b67 Compare November 18, 2024 21:53
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch from 4455b67 to 61ecc68 Compare November 18, 2024 21:55
@yuunlimm yuunlimm merged commit 552ecc7 into main Nov 18, 2024
7 checks passed
@yuunlimm yuunlimm deleted the 11-12-_sdk-parquet_parquet_default_processor_extractor_step branch November 18, 2024 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants