-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for INT96 columns and large decimal columns #109
Conversation
PR likely still needs tests etc to be mergeable, but contains proof of concept to parse parquet files with large integers/decimals properly. |
@davidtsai, will this PR also bring support for parquet files where |
@keen85 it's halfway there to being able to support it. Right now I'm handling it in our application code: function parquetInt96DateToLuxon(int96: bigint, timezone?: string) {
// Extract nanoseconds and Julian day number
const nanoseconds = int96 & BigInt('0xFFFFFFFFFFFFFFFF'); // first 8 bytes
const julianDay = int96 >> BigInt(64); // last 4 bytes
// Julian day number for Unix epoch (January 1, 1970)
const unixEpochJulianDay = BigInt(2440588);
// Calculate the difference in days between the Julian day and the Unix epoch
const daysSinceEpoch = julianDay - unixEpochJulianDay;
// Convert days to milliseconds
// 86400000 milliseconds in a day
const millisecondsSinceEpoch = daysSinceEpoch * BigInt(86400000);
// Convert nanoseconds to milliseconds and add to the Unix timestamp
const totalMilliseconds = millisecondsSinceEpoch + (nanoseconds / BigInt(1000000));
// Create a DateTime object in UTC
const date = DateTime.fromMillis(Number(totalMilliseconds), { zone: 'utc' });
if (timezone) {
// // Convert to the specified timezone
return date.setZone(timezone, { keepLocalTime: true });
}
return date;
} I'm not sure when we can reliably assume the INT96 column is a date in a parquet file. If there is documented convention for that, would be easy enough to add the above function to this library itself. |
@davidtsai awesome, thanks for your work!
...not sure if this is proof enough 😅 |
@davidtsai Thanks for the PR! Enabled tests, and seeing some fail, but looking closer, I think they might have been bad tests before perhaps? Or might be a mix. Tests should now run for this PR automatically. (At least I think that's what I told GitHub) Let me know if you want/need some help, although you likely understand this part of the codebase better than I now. |
@@ -874,9 +874,13 @@ async function decodeDictionaryPage(cursor: Cursor, header: parquet_thrift.PageH | |||
}; | |||
} | |||
|
|||
return decodeValues(opts.column!.primitiveType!, opts.column!.encoding!, dictCursor, (header.dictionary_page_header!).num_values, opts) | |||
.map((d:Array<unknown>) => d.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think removing this has caused some of the test errors
Yes, the test was looking for strings, so they would need to be updated to
match numbers and bigints. It is a breaking API change though, but in my
opinion the correct behavior to not always return strings. I can work on
this PR more in about a week to get it over the finish line if it'll be
helpful!
- David
…On Thu, Jan 18, 2024 at 11:57 AM Wil Wade ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In lib/reader.ts
<#109 (comment)>
:
> @@ -874,9 +874,13 @@ async function decodeDictionaryPage(cursor: Cursor, header: parquet_thrift.PageH
};
}
- return decodeValues(opts.column!.primitiveType!, opts.column!.encoding!, dictCursor, (header.dictionary_page_header!).num_values, opts)
- .map((d:Array<unknown>) => d.toString());
I think removing this has caused some of the test errors
—
Reply to this email directly, view it on GitHub
<#109 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJMWIFSZQ5MGLDXQOPSYUTYPF5B5AVCNFSM6AAAAABBPLE3N2VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTQMZQGQ3DAMBUGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@davidtsai Javascript is limited to 53 bit numbers, so doesn't it need to be something besides a native number for 96? |
Closing as stale. Please reopen if this is not so. |
Problem
When reading parquet files:
Solution
Change summary:
Steps to Verify: