reading and writing bigint pdarrays from files #2032

Closed · Tracked by #2030 · Fixed by #2460
stress-tess opened this issue Jan 6, 2023 · 8 comments

@stress-tess (Member) commented Jan 6, 2023

Figure out a good way to read and write bigint pdarrays from files using HDF5 and Parquet. We might be able to just convert to a list of uint64 arrays and use existing functionality.

@Ethan-DeBandi99 (Contributor)

I think we have some ideas on this. However, I would like to wait until some of the creation speed issues are resolved.

@Ethan-DeBandi99 (Contributor)

Did a little bit of digging on this. Adding a note here for HDF5 with something to try:

#include <hdf5.h>

hid_t new_type = H5Tcopy(H5T_NATIVE_INT);
H5Tset_precision(new_type, 128);       // set precision to max_bits
H5Tset_order(new_type, H5T_ORDER_LE);  // set byte order to match the host system

This brings up a few questions that we will need to figure out: do we need to record the endianness or max_bits in an attribute to read the data back correctly, or can that be determined from the type itself?
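
If we do end up needing max_bits (and possibly the byte order) alongside the data, one option would be a scalar attribute on the dataset. Here's a rough sketch with the plain HDF5 C API; the attribute name "max_bits" and the helper name are just placeholders, not something we've settled on:

#include <hdf5.h>
#include <stdint.h>

// Attach a scalar "max_bits" attribute to an already-open dataset.
// dset_id is assumed to come from H5Dcreate/H5Dopen elsewhere.
void write_max_bits_attr(hid_t dset_id, uint64_t max_bits) {
    hid_t space_id = H5Screate(H5S_SCALAR);   // dataspace for a single scalar value
    hid_t attr_id  = H5Acreate2(dset_id, "max_bits", H5T_NATIVE_UINT64,
                                space_id, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr_id, H5T_NATIVE_UINT64, &max_bits);
    H5Aclose(attr_id);
    H5Sclose(space_id);
}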

@Ethan-DeBandi99 (Contributor)

The code in the previous comment works great for writing the bigint datasets. However, reading them back does not work: HDF5 has no reference for the custom type, so it cannot be identified. I am going to look into creating an extern reference directly in C that describes the type, which may allow us to handle bigints.

If the custom type does not work out, for HDF5 we can convert the bigint arrays to uint64 limb arrays and write those out as datasets under a group.
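
For reference, the limb conversion itself is roughly this at the GMP level. The server side actually goes through Chapel's bigint, so treat this C/GMP version and the bigint_to_limbs name as an illustration of the idea rather than the implementation:

#include <stdint.h>
#include <stdlib.h>
#include <gmp.h>

// Split one bigint into little-endian uint64 limbs. With a NULL first argument,
// mpz_export allocates the output buffer; the caller frees it (with free()
// under the default GMP allocator).
uint64_t *bigint_to_limbs(const mpz_t value, size_t *num_limbs) {
    // order = -1: least-significant word first, size = 8 bytes per word,
    // endian = 0: native byte order within each word, nails = 0: use all 64 bits
    return (uint64_t *)mpz_export(NULL, num_limbs, -1, sizeof(uint64_t), 0, 0, value);
}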

@Ethan-DeBandi99 (Contributor)

After some testing, I discovered that reading a dataset created with a copy of the native integer type results in the data being read back as a 64-bit integer, without any modification to our current read implementation.
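
One way to confirm what is actually ending up in the file is to ask HDF5 what precision it reports for the stored type. A quick sketch (the helper name is made up):

#include <hdf5.h>
#include <stdio.h>

// Print the precision (in bits) that HDF5 reports for a dataset's on-disk type.
void report_precision(hid_t dset_id) {
    hid_t type_id = H5Dget_type(dset_id);      // datatype as stored in the file
    size_t bits = H5Tget_precision(type_id);   // significant bits per element
    printf("stored precision: %zu bits\n", bits);
    H5Tclose(type_id);
}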

@Ethan-DeBandi99 (Contributor)

Another odd issue coming up with this is that the server dies when reading a bigint array. This appears to happen on the to_ndarray call and does not happen if the data is read as a plain int array. We will need to investigate what is happening here a bit further.

Listing the error for reference:

arkouda_server(21416,0x16f1c8000) malloc: *** error for object 0x600000390000: pointer being freed was not allocated
arkouda_server(21416,0x16f1c8000) malloc: *** set a breakpoint in malloc_error_break to debug
[1]    21416 abort      ./arkouda_server -nl 1 --logLevel=DEBUG

Another issue is that the metadata appears correct for the bigint, but we are not getting all of the bits read back out. This happens even when the datatype configuration for the read is specified manually. Based on this, I am not sure what our options will be.

I will be digging into these issues a bit more.

@Ethan-DeBandi99 (Contributor)

After doing a bit more digging, I think the issue with storing the bigint directly is actually by design. HDF5 requires the read and write datasets to use the same format. Since we support bigint through GMP with Chapel, it does not appear that we will be able to take advantage of HDF5's configuration options, because the data being written into the file is not represented as a fixed N-bit value. I want to get @pierce314159 to weigh in, but I believe the best option is to use a group and store the individual uint64 arrays rather than a single array.
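
To make that concrete, the layout I have in mind is a group per bigint array, one uint64 dataset per limb, and a max_bits attribute on the group. The sketch below is just that idea in the plain HDF5 C API; the names (bigint_array, limb_0, ...) and the helper signature are hypothetical, not a final convention:

#include <hdf5.h>
#include <stdint.h>
#include <stdio.h>

// Write num_limbs uint64 datasets ("limb_0", "limb_1", ...) under one group.
// limbs[i] holds the i-th limb of every element and has length num_elems.
void write_bigint_group(hid_t file_id, const uint64_t *const *limbs,
                        size_t num_limbs, hsize_t num_elems, uint64_t max_bits) {
    hid_t group_id = H5Gcreate2(file_id, "bigint_array", H5P_DEFAULT,
                                H5P_DEFAULT, H5P_DEFAULT);
    hid_t space_id = H5Screate_simple(1, &num_elems, NULL);

    for (size_t i = 0; i < num_limbs; i++) {
        char name[32];
        snprintf(name, sizeof(name), "limb_%zu", i);
        hid_t dset_id = H5Dcreate2(group_id, name, H5T_NATIVE_UINT64, space_id,
                                   H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset_id, H5T_NATIVE_UINT64, H5S_ALL, H5S_ALL, H5P_DEFAULT, limbs[i]);
        H5Dclose(dset_id);
    }

    // Record max_bits on the group so the reader can rebuild the bigint dtype.
    hid_t attr_space = H5Screate(H5S_SCALAR);
    hid_t attr_id = H5Acreate2(group_id, "max_bits", H5T_NATIVE_UINT64,
                               attr_space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr_id, H5T_NATIVE_UINT64, &max_bits);

    H5Aclose(attr_id);
    H5Sclose(attr_space);
    H5Sclose(space_id);
    H5Gclose(group_id);
}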

@Ethan-DeBandi99 (Contributor)

I did a bit more digging to see how pandas handles this case. The pandas.to_hdf call converts any BigInt to a string and then stores a uint8 array in HDF5. Based on this, it seems like it would be best to store bigints as a group containing datasets for the uint64 limbs.

@Ethan-DeBandi99 (Contributor)

I have configured bigint support for pdarray, ArrayView, and GroupBy. I am waiting for PR #2439 to be merged before handling SegArray; there are a lot of changes that would cause merge conflicts with that PR, so waiting limits any potential rework after the merge.

Once that is merged, SegArray handling (specifically in the read case) should be updated to work similarly to GroupBy, where we call readPdarrayFromFile. This will limit code duplication and ensure the same handling across the data structures.

All writes and reads are configured to expect bigint values to be converted to their uint64 limb-array representation.
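
For the read path, the inverse of the limb export is GMP's mpz_import. Again, this is only a C-level sketch of the idea (the server actually works with Chapel's bigint), assuming the same little-endian limb order as the write side; limbs_to_bigint is a made-up helper name:

#include <stdint.h>
#include <gmp.h>

// Rebuild a bigint from little-endian uint64 limbs read out of the limb datasets.
// result must already be initialized with mpz_init.
void limbs_to_bigint(mpz_t result, const uint64_t *limbs, size_t num_limbs) {
    // Same order/size/endian/nails arguments as the export side.
    mpz_import(result, num_limbs, -1, sizeof(uint64_t), 0, 0, limbs);
}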
