Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small bug in recursiveFix func #91

Open
julianedwards opened this issue Jul 18, 2022 · 1 comment
Open

Small bug in recursiveFix func #91

julianedwards opened this issue Jul 18, 2022 · 1 comment

Comments

@julianedwards
Copy link

Describe the bug
I uncovered a small bug in schema.go, the recursiveFix func overwrites the passed in colPath by using it in the append function to add the column name to the column path on L684. Since go slices may or may not reuse the underlying array depending on capacity, this does not always necessarily return a slice that points to a newly allocated array (I have been bitten by this bug before, it's one of go's "gotchas"). This resulted in the go schema to have the same column path for all fields in a list element and thus invalid parquet.

message test_result {
  required binary version (STRING);
  required binary variant (STRING);
  required binary task_name (STRING);
  optional binary display_task_name (STRING);
  required binary task_id (STRING);
  optional binary display_task_id (STRING);
  required int32 execution;
  required binary request_type (STRING);
  required int64 created_at (TIMESTAMP(MILLIS,true));
  required group results (LIST) {
    repeated group list {
      required group element {
        required binary test_name (STRING);
        optional binary display_test_name (STRING);
        optional binary group_id (STRING);
        required int32 trial;
        required binary status (STRING);
        optional binary log_test_name (STRING);
        optional binary log_url (STRING);
        optional binary raw_log_url (STRING);
        optional int32 line_num;
        required int64 task_create_time (TIMESTAMP(MILLIS,true));
        required int64 test_start_time (TIMESTAMP(MILLIS,true));
        required int64 test_end_time (TIMESTAMP(MILLIS,true));
      }
    }
  }
}

Using this schema, all the column paths for the list elements were ["results", "list", "element", "test_end_time"] because test_end_time is the last field in the group and the recursive function continuously overwrites the colPath argument and assigns it to each column's path field.

I can put up a PR for the fix, it is a one-line change:

col.path = append(append(col.path, colPath...), col.name)

Let me know if I can provide any more info, thanks!

@csimplestring
Copy link

i already made a pr but nobody replies yet, here is the repo: https://github.com/csimplestring/parquet-go if you want to use

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants