reading and writing bigint pdarrays from files #2032

Closed · Tracked by #2030 · Fixed by #2460
stress-tess opened this issue Jan 6, 2023 · 8 comments

@stress-tess (Member) commented Jan 6, 2023

Figure out a good way to read and write bigint pdarrays from files using HDF5 and Parquet. We might be able to just convert to a list of uint64 arrays and use existing functionality.

@Ethan-DeBandi99 (Contributor)

I think we have some ideas on this. However, I would like to wait until some of the creation speed issues are resolved.

@Ethan-DeBandi99 (Contributor)

Did a little bit of digging on this. Adding a note here for HDF5 with something to try:

#include <hdf5.h>

hid_t new_type = H5Tcopy(H5T_NATIVE_INT);
H5Tset_precision(new_type, 128);       // set precision to max_bits
H5Tset_order(new_type, H5T_ORDER_LE);  // set byte order to match the host system

This brings up a few questions that we will need to figure out: do we need to record the endianness or max_bits in an attribute to read the data back correctly, or can that be determined from the type itself?
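
If we do end up needing max_bits (and possibly the byte order) alongside the data, one option would be a scalar attribute on the dataset. Here's a rough sketch with the plain HDF5 C API; the attribute name "max_bits" and the helper name are just placeholders, not something we've settled on:

#include <hdf5.h>
#include <stdint.h>

// Attach a scalar "max_bits" attribute to an already-open dataset.
// dset_id is assumed to come from H5Dcreate/H5Dopen elsewhere.
void write_max_bits_attr(hid_t dset_id, uint64_t max_bits) {
    hid_t space_id = H5Screate(H5S_SCALAR);   // dataspace for a single scalar value
    hid_t attr_id  = H5Acreate2(dset_id, "max_bits", H5T_NATIVE_UINT64,
                                space_id, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr_id, H5T_NATIVE_UINT64, &max_bits);
    H5Aclose(attr_id);
    H5Sclose(space_id);
}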

@Ethan-DeBandi99 (Contributor)

The code in the previous comment works great for writing the bigint datasets. However, reading them back does not work: HDF5 has no reference for the custom type, so it cannot be identified. I am going to look into creating an extern reference directly in C that describes the type, which may allow us to handle bigints.

If the custom type does not work out, for HDF5 we can convert the bigint arrays to uint64 limb arrays and write those out as datasets under a group.
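
For reference, the limb conversion itself is roughly this at the GMP level. The server side actually goes through Chapel's bigint, so treat this C/GMP version and the bigint_to_limbs name as an illustration of the idea rather than the implementation:

#include <stdint.h>
#include <stdlib.h>
#include <gmp.h>

// Split one bigint into little-endian uint64 limbs. With a NULL first argument,
// mpz_export allocates the output buffer; the caller frees it (with free()
// under the default GMP allocator).
uint64_t *bigint_to_limbs(const mpz_t value, size_t *num_limbs) {
    // order = -1: least-significant word first, size = 8 bytes per word,
    // endian = 0: native byte order within each word, nails = 0: use all 64 bits
    return (uint64_t *)mpz_export(NULL, num_limbs, -1, sizeof(uint64_t), 0, 0, value);
}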

@Ethan-DeBandi99 (Contributor)

After some testing, I discovered that reading a dataset created with a copy of the native integer type results in the data being read back as a 64-bit integer, without any modification to our current read implementation.
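
One way to confirm what is actually ending up in the file is to ask HDF5 what precision it reports for the stored type. A quick sketch (the helper name is made up):

#include <hdf5.h>
#include <stdio.h>

// Print the precision (in bits) that HDF5 reports for a dataset's on-disk type.
void report_precision(hid_t dset_id) {
    hid_t type_id = H5Dget_type(dset_id);      // datatype as stored in the file
    size_t bits = H5Tget_precision(type_id);   // significant bits per element
    printf("stored precision: %zu bits\n", bits);
    H5Tclose(type_id);
}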

@Ethan-DeBandi99 (Contributor)

Another odd issue coming up with this is that the server dies when reading a bigint array. This appears to happen on the to_ndarray call and does not happen if the data is read as a plain int array. We will need to investigate what is happening here a bit further.

Listing the error for reference:

arkouda_server(21416,0x16f1c8000) malloc: *** error for object 0x600000390000: pointer being freed was not allocated
arkouda_server(21416,0x16f1c8000) malloc: *** set a breakpoint in malloc_error_break to debug
[1]    21416 abort      ./arkouda_server -nl 1 --logLevel=DEBUG

Another issue is that the metadata appears correct for the bigint, but we are not getting all of the bits read back out. This happens even when the datatype configuration for the read is specified manually. Based on this, I am not sure what our options will be.

I will be digging into these issues a bit more.

@Ethan-DeBandi99 (Contributor)

After doing a bit more digging, I think the issue with storing the bigint directly is actually by design. HDF5 requires the read and write datasets to use the same format. Since we support bigint through GMP with Chapel, it does not appear that we will be able to take advantage of HDF5's configuration options, because the data being written into the file is not represented as a fixed N-bit value. I want to get @pierce314159 to weigh in, but I believe the best option is to use a group and store the individual uint64 arrays rather than a single array.
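
To make that concrete, the layout I have in mind is a group per bigint array, one uint64 dataset per limb, and a max_bits attribute on the group. The sketch below is just that idea in the plain HDF5 C API; the names (bigint_array, limb_0, ...) and the helper signature are hypothetical, not a final convention:

#include <hdf5.h>
#include <stdint.h>
#include <stdio.h>

// Write num_limbs uint64 datasets ("limb_0", "limb_1", ...) under one group.
// limbs[i] holds the i-th limb of every element and has length num_elems.
void write_bigint_group(hid_t file_id, const uint64_t *const *limbs,
                        size_t num_limbs, hsize_t num_elems, uint64_t max_bits) {
    hid_t group_id = H5Gcreate2(file_id, "bigint_array", H5P_DEFAULT,
                                H5P_DEFAULT, H5P_DEFAULT);
    hid_t space_id = H5Screate_simple(1, &num_elems, NULL);

    for (size_t i = 0; i < num_limbs; i++) {
        char name[32];
        snprintf(name, sizeof(name), "limb_%zu", i);
        hid_t dset_id = H5Dcreate2(group_id, name, H5T_NATIVE_UINT64, space_id,
                                   H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset_id, H5T_NATIVE_UINT64, H5S_ALL, H5S_ALL, H5P_DEFAULT, limbs[i]);
        H5Dclose(dset_id);
    }

    // Record max_bits on the group so the reader can rebuild the bigint dtype.
    hid_t attr_space = H5Screate(H5S_SCALAR);
    hid_t attr_id = H5Acreate2(group_id, "max_bits", H5T_NATIVE_UINT64,
                               attr_space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr_id, H5T_NATIVE_UINT64, &max_bits);

    H5Aclose(attr_id);
    H5Sclose(attr_space);
    H5Sclose(space_id);
    H5Gclose(group_id);
}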

@Ethan-DeBandi99 (Contributor)

I did a bit more digging to see how pandas handles this case. The pandas.to_hdf call converts any BigInt to a string and then stores a uint8 array in HDF5. Based on this, it seems like it would be best to store bigints as a group containing datasets for the uint64 limbs.

@Ethan-DeBandi99 (Contributor)

I have configured bigint support for pdarray, ArrayView, and GroupBy. I am waiting for PR #2439 to be merged before handling SegArray; there are a lot of changes that would cause merge conflicts with that PR, so waiting limits any potential rework after the merge.

Once that is merged, SegArray handling (specifically in the read case) should be updated to work similarly to GroupBy, where we call readPdarrayFromFile. This will limit code duplication and ensure the same handling across the data structures.

All writes and reads are configured to expect bigint values to be converted to their uint64 limb-array representation.
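
For the read path, the inverse of the limb export is GMP's mpz_import. Again, this is only a C-level sketch of the idea (the server actually works with Chapel's bigint), assuming the same little-endian limb order as the write side; limbs_to_bigint is a made-up helper name:

#include <stdint.h>
#include <gmp.h>

// Rebuild a bigint from little-endian uint64 limbs read out of the limb datasets.
// result must already be initialized with mpz_init.
void limbs_to_bigint(mpz_t result, const uint64_t *limbs, size_t num_limbs) {
    // Same order/size/endian/nails arguments as the export side.
    mpz_import(result, num_limbs, -1, sizeof(uint64_t), 0, 0, limbs);
}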
