Add support for INT96 columns and large decimal columns #109

Closed
davidtsai wants to merge 2 commits

Conversation

davidtsai

Problem

When reading parquet files:

  • INT96 columns are not being parsed correctly and are being truncated.
  • Files with large DECIMAL columns stored as FIXED_LEN_BYTE_ARRAY are not supported.

Solution

  • Parses INT96 columns into a BigInt to return untruncated values.
  • Adds support for parsing FIXED_LEN_BYTE_ARRAY columns into arbitrary precision DECIMALs.
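
As an illustration of the DECIMAL bullet above, here is a minimal sketch of how such a value could be decoded. It is not the code in this PR: `decodeUnscaledDecimal` and `formatDecimal` are hypothetical helper names. It relies on the Parquet convention that a DECIMAL stored in FIXED_LEN_BYTE_ARRAY is the unscaled value as a big-endian two's-complement integer, with the scale taken from the column's schema.

```typescript
// Hypothetical helper (not from this PR): decode a big-endian
// two's-complement byte array into the unscaled decimal value.
function decodeUnscaledDecimal(bytes: Uint8Array): bigint {
  let value = BigInt(0);
  for (let i = 0; i < bytes.length; i++) {
    value = (value << BigInt(8)) | BigInt(bytes[i]);
  }
  // If the sign bit is set, subtract 2^(8 * length) to get the negative value.
  if (bytes.length > 0 && (bytes[0] & 0x80) !== 0) {
    value -= BigInt(1) << BigInt(8 * bytes.length);
  }
  return value;
}

// Hypothetical helper: render the unscaled value with `scale` fractional
// digits as a string, so no precision is lost to floating point.
function formatDecimal(unscaled: bigint, scale: number): string {
  const negative = unscaled < BigInt(0);
  const digits = (negative ? -unscaled : unscaled)
    .toString()
    .padStart(scale + 1, "0");
  const intPart = digits.slice(0, digits.length - scale);
  const fracPart = digits.slice(digits.length - scale);
  return (negative ? "-" : "") + intPart + (scale > 0 ? "." + fracPart : "");
}

// Usage: bytes 0x00003039 with scale 2 represent 123.45
const example = formatDecimal(
  decodeUnscaledDecimal(new Uint8Array([0x00, 0x00, 0x30, 0x39])),
  2
);
```

Returning a string (or a bigint plus scale) rather than a number is a design choice in this sketch: it avoids rounding for precisions that a double cannot hold.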

@davidtsai
Author

This PR likely still needs tests etc. to be mergeable, but it contains a proof of concept for parsing parquet files with large integers/decimals properly.

@keen85

keen85 commented Jan 6, 2024

@davidtsai, will this PR also bring support for parquet files where TIMESTAMP columns are encoded as INT96?
For an example file, see: part-00000-43831db6-19d5-4964-a8c8-cb8d6d1664b3-c000.snappy.parquet

@davidtsai
Author

@keen85 it's halfway there to being able to support it. Right now I'm handling it in our application code:

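import { DateTime } from 'luxon';

function parquetInt96DateToLuxon(int96: bigint, timezone?: string) {
  // Extract nanoseconds and Julian day number
  // (assumes the 12 bytes were read as one little-endian integer)
  const nanoseconds = int96 & BigInt('0xFFFFFFFFFFFFFFFF'); // first 8 bytes: nanoseconds within the day
  const julianDay = int96 >> BigInt(64); // last 4 bytes: Julian day number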

  // Julian day number for Unix epoch (January 1, 1970)
  const unixEpochJulianDay = BigInt(2440588);

  // Calculate the difference in days between the Julian day and the Unix epoch
  const daysSinceEpoch = julianDay - unixEpochJulianDay;

  // Convert days to milliseconds
  // 86400000 milliseconds in a day
  const millisecondsSinceEpoch = daysSinceEpoch * BigInt(86400000);

  // Convert nanoseconds to milliseconds and add to the Unix timestamp
  const totalMilliseconds = millisecondsSinceEpoch + (nanoseconds / BigInt(1000000));

  // Create a DateTime object in UTC
  const date = DateTime.fromMillis(Number(totalMilliseconds), { zone: 'utc' });

  if (timezone) {
    // Switch to the specified timezone, keeping the same wall-clock time
    return date.setZone(timezone, { keepLocalTime: true });
  }
  return date;
}

I'm not sure when we can reliably assume the INT96 column is a date in a parquet file. If there is a documented convention for that, it would be easy enough to add the above function to this library itself.
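
For readers starting from the raw 12 bytes rather than a bigint, the same conversion can be sketched as follows. This assumes the common INT96 timestamp layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day, both little-endian); `splitInt96Timestamp` and `int96BytesToEpochMillis` are hypothetical names, not functions from this library.

```typescript
// Hypothetical helper (not part of this library): split a raw 12-byte
// INT96 timestamp into its two little-endian fields.
function splitInt96Timestamp(bytes: Uint8Array): { julianDay: number; nanosOfDay: bigint } {
  if (bytes.length !== 12) {
    throw new Error(`expected 12 bytes, got ${bytes.length}`);
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, 12);
  const nanosOfDay = view.getBigUint64(0, true); // first 8 bytes
  const julianDay = view.getInt32(8, true); // last 4 bytes
  return { julianDay, nanosOfDay };
}

// Julian day number of the Unix epoch (January 1, 1970)
const UNIX_EPOCH_JULIAN_DAY = 2440588;
const MILLIS_PER_DAY = 86400000;

// Hypothetical helper: convert to milliseconds since the Unix epoch.
// Sub-millisecond precision is truncated.
function int96BytesToEpochMillis(bytes: Uint8Array): number {
  const { julianDay, nanosOfDay } = splitInt96Timestamp(bytes);
  const dayMillis = (julianDay - UNIX_EPOCH_JULIAN_DAY) * MILLIS_PER_DAY;
  return dayMillis + Number(nanosOfDay / BigInt(1000000));
}
```

The resulting millisecond value can be passed to `DateTime.fromMillis` or `new Date(...)`. Whether a given INT96 column actually holds a timestamp still has to come from the schema or from the caller.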

@keen85

keen85 commented Jan 6, 2024

@davidtsai awesome, thanks for your work!

> If there is documented convention for that

...not sure if this is proof enough 😅

@wilwade wilwade self-assigned this Jan 8, 2024
@wilwade
Member

wilwade commented Jan 8, 2024

@davidtsai Thanks for the PR!

I enabled the tests and am seeing some fail, but looking closer, I think some of them might have been bad tests to begin with, or it might be a mix. Tests should now run for this PR automatically. (At least I think that's what I told GitHub.)

Let me know if you want or need some help, although you likely understand this part of the codebase better than I do now.

@@ -874,9 +874,13 @@ async function decodeDictionaryPage(cursor: Cursor, header: parquet_thrift.PageH
};
}

return decodeValues(opts.column!.primitiveType!, opts.column!.encoding!, dictCursor, (header.dictionary_page_header!).num_values, opts)
.map((d:Array<unknown>) => d.toString());
Member

I think removing this `.map(...)` call has caused some of the test errors.

@davidtsai
Author

davidtsai commented Jan 18, 2024 via email

@wilwade
Member

wilwade commented Jan 18, 2024

@davidtsai JavaScript numbers can only represent integers exactly up to 53 bits, so doesn't a 96-bit value need to be something other than a native number?
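
A small sketch of the precision limit raised above: it checks whether a bigint survives a round trip through a native number. `roundTripsThroughNumber` is a hypothetical name used only for illustration.

```typescript
// Hypothetical helper: true if `value` can be stored in a native number
// and converted back without losing precision.
function roundTripsThroughNumber(value: bigint): boolean {
  return BigInt(Number(value)) === value;
}

// 2^53 - 1, the largest integer a number can represent safely
const maxSafe = BigInt(Number.MAX_SAFE_INTEGER);

// Largest unsigned 96-bit value, 2^96 - 1
const maxInt96 = (BigInt(1) << BigInt(96)) - BigInt(1);

const safeOk = roundTripsThroughNumber(maxSafe); // true
const int96Ok = roundTripsThroughNumber(maxInt96); // false: rounds to 2^96
```

This is why the values would have to stay as `bigint` (or a string) all the way through decoding rather than being converted with `Number(...)`.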

@wilwade
Member

wilwade commented Feb 13, 2024

Closing as stale. Please reopen if this is not so.

@wilwade wilwade closed this Feb 13, 2024