reading and writing bigint pdarrays from files #2032
Comments
I think we have some ideas on this. However, I would like to wait until we have some of the creation speeds resolved.
Did a little bit of digging on this. Adding a note here for HDF5 with something to try:

```c
hid_t new_type = H5Tcopy(H5T_NATIVE_INT);
H5Tset_precision(new_type, 128);       // set precision to max_bits
H5Tset_order(new_type, H5T_ORDER_LE);  // set endian type based on what the system is
```

This brings up a few questions that we will need to figure out. Do we need to indicate the endian type or max bits in an attribute to be able to read the data correctly, or is this able to be determined by the type?
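For reference, here is a minimal standalone sketch of how that copied type might be used to create and write a dataset. The file name, dataset name, and the use of GCC's `__int128` (on a little-endian machine) for the in-memory buffer are placeholder assumptions for illustration, not Arkouda code.

```c
#include <hdf5.h>

int main(void) {
    __int128 data[4] = {1, 2, 3, 4};   /* toy values; __int128 is a GCC extension */
    hsize_t  dims[1] = {4};

    /* Custom 128-bit integer type, as in the comment above.  H5Tset_precision
     * enlarges the type so the extra bits fit. */
    hid_t new_type = H5Tcopy(H5T_NATIVE_INT);
    H5Tset_precision(new_type, 128);
    H5Tset_order(new_type, H5T_ORDER_LE);

    hid_t file  = H5Fcreate("bigint.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/bigint_col", new_type, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Write the 16-byte elements using the same type for memory and file. */
    H5Dwrite(dset, new_type, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    H5Tclose(new_type);
    return 0;
}
```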
The code in the previous comment works great for writing the bigint data. If using the custom type does not work out, for HDF5 we can convert the bigint arrays to uint64 arrays of the limbs and write those arrays out as datasets to a group.
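A rough sketch of that fallback layout is below, assuming a fixed two-limb (128-bit) case. The group and dataset names (`/bigint_col`, `limb_0`, `limb_1`) are made up for illustration; a real layout would need to handle an arbitrary number of limbs plus any attributes required to reassemble the values.

```c
#include <hdf5.h>
#include <stdint.h>

/* Store one bigint column as a group of uint64 limb datasets
 * (low limb and high limb written separately). */
void write_limbs(hid_t file, const uint64_t *lo, const uint64_t *hi, hsize_t n) {
    hsize_t dims[1] = {n};
    hid_t grp   = H5Gcreate2(file, "/bigint_col", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);

    hid_t d0 = H5Dcreate2(grp, "limb_0", H5T_NATIVE_UINT64, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(d0, H5T_NATIVE_UINT64, H5S_ALL, H5S_ALL, H5P_DEFAULT, lo);

    hid_t d1 = H5Dcreate2(grp, "limb_1", H5T_NATIVE_UINT64, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(d1, H5T_NATIVE_UINT64, H5S_ALL, H5S_ALL, H5P_DEFAULT, hi);

    H5Dclose(d0);
    H5Dclose(d1);
    H5Sclose(space);
    H5Gclose(grp);
}
```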
After some testing, I discovered that reading a dataset created using a copy of the native integer type results in the dataset being read as a 64-bit integer, without any modification to our current read implementation.
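To make that concrete, here is an illustrative check (placeholder dataset path, not Arkouda code) of what the file-side type reports versus what comes back when reading into a plain 64-bit buffer:

```c
#include <hdf5.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Inspect the stored type's precision, then read into int64 the way the
 * existing read path would. */
void inspect_and_read(hid_t file, hsize_t n) {
    hid_t dset      = H5Dopen2(file, "/bigint_col", H5P_DEFAULT);
    hid_t file_type = H5Dget_type(dset);
    printf("stored precision: %zu bits\n", H5Tget_precision(file_type));

    /* HDF5 converts each element to the 64-bit memory type, so the values
     * come back as plain int64 -- matching the behavior described above. */
    int64_t *buf = malloc(n * sizeof *buf);
    H5Dread(dset, H5T_NATIVE_INT64, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    free(buf);
    H5Tclose(file_type);
    H5Dclose(dset);
}
```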
Another odd issue that is coming up with this is that the server dies when reading a bigint array. This appears to happen in the to_ndarray call. It does not happen if the data is read just as an int array. We will need to investigate what is happening here a bit further. Listing the error for reference:
Another issue is that the metadata appears correct for the bigint, but we are not getting all the bits read out. This happens even when manually indicating the configuration of the data type for the read. Based on this, I am not sure what our options will be. I will be digging into these issues a bit more.
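For anyone following along, a manually configured read of the 128-bit type might look roughly like the sketch below (again using `__int128` and a placeholder dataset path purely for illustration). Per the comment above, even an explicit memory type like this did not recover all of the bits.

```c
#include <hdf5.h>
#include <stdlib.h>

/* Attempted read with an explicitly configured 128-bit memory type. */
void read_with_custom_type(hid_t file, hsize_t n) {
    hid_t mem_type = H5Tcopy(H5T_NATIVE_INT);
    H5Tset_precision(mem_type, 128);
    H5Tset_order(mem_type, H5T_ORDER_LE);

    __int128 *buf = malloc(n * sizeof *buf);
    hid_t dset = H5Dopen2(file, "/bigint_col", H5P_DEFAULT);
    H5Dread(dset, mem_type, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset);
    H5Tclose(mem_type);
    free(buf);
}
```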
After doing a bit more digging, I think the issue with storing the bigint directly is actually by design. HDF5 requires the read and write datasets to be in the same format. Since we support bigint through GMP with Chapel, it does not appear that we will be able to take advantage of the configuration options from HDF5, because the data being written into the file is not seen as an N-bit value. I want to get @pierce314159 to weigh in, but I believe the best option is to use a group and store the individual uint64 arrays rather than a single array.
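The read side of that group-based layout would then reassemble each value from its limbs, roughly as below (same placeholder names and fixed two-limb assumption as the earlier write sketch; in Arkouda the recombination would happen on the Chapel bigint side rather than in C):

```c
#include <hdf5.h>
#include <stdint.h>
#include <stdlib.h>

/* Read the uint64 limb datasets back and recombine them into 128-bit values. */
void read_limbs(hid_t file, hsize_t n, unsigned __int128 *out) {
    uint64_t *lo = malloc(n * sizeof *lo);
    uint64_t *hi = malloc(n * sizeof *hi);

    hid_t d0 = H5Dopen2(file, "/bigint_col/limb_0", H5P_DEFAULT);
    hid_t d1 = H5Dopen2(file, "/bigint_col/limb_1", H5P_DEFAULT);
    H5Dread(d0, H5T_NATIVE_UINT64, H5S_ALL, H5S_ALL, H5P_DEFAULT, lo);
    H5Dread(d1, H5T_NATIVE_UINT64, H5S_ALL, H5S_ALL, H5P_DEFAULT, hi);

    /* value = (high limb << 64) | low limb */
    for (hsize_t i = 0; i < n; i++)
        out[i] = ((unsigned __int128)hi[i] << 64) | lo[i];

    H5Dclose(d0);
    H5Dclose(d1);
    free(lo);
    free(hi);
}
```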
I did a bit more digging on this to see how Pandas handles the case.
I have configured bigint support; once that is merged, SegArray handling (specifically in the read case) should be updated to work in a similar way. All writes and reads are configured to expect bigint values to be converted to a list of uint64 arrays.
Figure out a good way to read and write bigint pdarrays from files using hdf5 and parquet. We might be able to just convert to a list of uint64 arrays and use existing functionality.