Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hdf5] Reading with BytePointer stops on 0 values (zeros) #1311

Closed
calvertdw opened this issue Jan 17, 2023 · 37 comments
Closed

[hdf5] Reading with BytePointer stops on 0 values (zeros) #1311

calvertdw opened this issue Jan 17, 2023 · 37 comments

Comments

@calvertdw
Copy link
Contributor

calvertdw commented Jan 17, 2023

When 0s are in the data, I find I must use a ShortPointer to get the full data for some reason. I'm not sure why. The top test case passes, native bytes and be written and read correctly, the second test fails.

import org.bytedeco.hdf5.*;
import org.bytedeco.hdf5.global.hdf5;
import org.bytedeco.javacpp.*;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

import java.nio.ByteOrder;
import java.util.Arrays;

public class HDF5TestForReport
{
   static
   {
      System.out.println(String.format("Native byte order: %s", ByteOrder.nativeOrder()));
   }

   @Test // Passes
   public void testBytesWithoutZeros()
   {
      String filePath = "NativeBytesWithoutZeros.hdf5";
      H5File h5File = new H5File(filePath, hdf5.H5F_ACC_TRUNC);
      String datasetId = "bytes";

      byte[] writeData = new byte[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'f',
                                      -2, -1, 1, 2, 3, 4, 5, Byte.MIN_VALUE, Byte.MAX_VALUE };
      BytePointer dataPointer = new BytePointer(writeData);

      int rank = 1;
      long[] dimensions = { dataPointer.limit() };
      DataSpace fileSpace = new DataSpace(rank, dimensions);
      DataType dataType = new DataType(PredType.NATIVE_B8());
      DataSet dataSet = h5File.createDataSet(datasetId, dataType, fileSpace);

      dataSet.write(dataPointer, dataType);

      dataSet.close();
      fileSpace.close();
      h5File.close();

      h5File = new H5File(filePath, hdf5.H5F_ACC_RDONLY);
      dataSet = h5File.openDataSet(datasetId);

      dataPointer = new BytePointer(writeData.length);
      dataSet.read(dataPointer, dataType);

      byte[] readData = new byte[(int) dataPointer.limit()];
      dataPointer.get(readData);

      dataSet.close();
      h5File.close();

      System.out.printf("Wrote: %s%n", Arrays.toString(writeData));
      System.out.printf("Read:  %s%n", Arrays.toString(readData));
      Assertions.assertArrayEquals(writeData, readData);
   }

   @Test
   public void testBytesWithZeros()
   {
      String filePath = "NativeBytesWithZeros.hdf5";
      H5File h5File = new H5File(filePath, hdf5.H5F_ACC_TRUNC);
      String datasetId = "bytes";

      // There is a 0 in here that causes the problem
      byte[] writeData = new byte[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'f',
                                      0, -2, -1, 1, 0, 2, 3, 4, Byte.MIN_VALUE, Byte.MAX_VALUE };
      BytePointer dataPointer = new BytePointer(writeData);

      byte[] beforeWriteData = new byte[(int) dataPointer.limit()];
      dataPointer.get(beforeWriteData);
      System.out.printf("Before: %s%n", Arrays.toString(beforeWriteData));
      System.out.printf("Writing %s bytes%n", writeData.length);

      int rank = 1;
      long[] dimensions = { dataPointer.limit() };
      DataSpace fileSpace = new DataSpace(rank, dimensions);
      DataType dataType = new DataType(PredType.NATIVE_B8());
      DataSet dataSet = h5File.createDataSet(datasetId, dataType, fileSpace);

      dataSet.write(dataPointer, dataType);
      System.out.printf("Wrote:  %s%n", Arrays.toString(writeData));

      dataSet.close();
      fileSpace.close();
      h5File.close();

      h5File = new H5File(filePath, hdf5.H5F_ACC_RDONLY);
      dataSet = h5File.openDataSet(datasetId);

      // Workaround wrapping with ShortPointer
      dataPointer = new BytePointer(writeData.length);
      dataSet.read(new ShortPointer(dataPointer), dataType);
      System.out.printf("Read    %s bytes%n", dataPointer.limit());
      byte[] workaroundReadData = new byte[(int) dataPointer.limit()];
      dataPointer.get(workaroundReadData);
      System.out.printf("Read:   %s%n", Arrays.toString(workaroundReadData));

      // I would expect this to work, but it stops before the 0 value
      dataPointer = new BytePointer(writeData.length);
      dataSet.read(dataPointer, dataType);                            // <-- Problem line
      System.out.printf("Read    %s bytes%n", dataPointer.limit());
      byte[] readData = new byte[(int) dataPointer.limit()];
      dataPointer.get(readData);
      System.out.printf("Read:   %s%n", Arrays.toString(readData));

      dataSet.close();
      h5File.close();

      Assertions.assertArrayEquals(writeData, workaroundReadData); // Passes
      Assertions.assertArrayEquals(writeData, readData);                    // Fails
   }
}
@saudet
Copy link
Member

saudet commented Jan 17, 2023

That sounds like an issue with the API of HDF5? I mean, a different function gets called, right?

@calvertdw
Copy link
Contributor Author

I'm having trouble tracking it down. That read call is:

public native void read(Pointer buf, @Const @ByRef DataType mem_type);

So, they are calling the same method since they both extend Pointer. I'm not sure what lives on the other side of that native call.

@calvertdw
Copy link
Contributor Author

Oh , actually, there is also a

public native void read(@StdString @ByRef BytePointer buf, @Const @ByRef DataType mem_type);

So that one is a different method. All other pointers would be calling the Pointer one.

@calvertdw
Copy link
Contributor Author

Would the @StdString cause some kind of null termination issue? Maybe we should remove that.

@calvertdw
Copy link
Contributor Author

calvertdw commented Jan 17, 2023

Ah, yeah. In hdf5's H5DataSet.cpp, there are two functions:

void DataSet::read(void *buf, const DataType &mem_type, const Data
              const DSetMemXferPropList &xfer_plist) const

and

void DataSet::read(H5std_string &strg, const DataType &mem_type, const DataSpace &mem_space,
              const DataSpace &file_space, const DSetMemXferPropList &xfer_plist) const

So when we use a BytePointer, we are accidentally calling the string ones.

Note this would also be an issue for the cooresponding write methods, which somehow work, but we should still fix it,

@calvertdw
Copy link
Contributor Author

Btw, until this gets fixed, for writing/reading all you have to do is cast to Pointer to get it to call the correct method, like this:

dataSet.read((Pointer) dataPointer, dataType); 

No need to muck around with ShortPointer or anything.

@saudet
Copy link
Member

saudet commented Jan 17, 2023

Yes, that's how we need to do it. What kind of fix do you have in mind?

@calvertdw
Copy link
Contributor Author

calvertdw commented Jan 19, 2023

So I was thinking of re-mapping the read(H5std_string &strg and write(H5std_string &strg C++ signatures to readString(String and writeString(String on the Java side. Is there a way to do that? (Java side readString(BytePointer would be just fine as well)

@saudet
Copy link
Member

saudet commented Jan 19, 2023

Yeah, it should be possible see https://github.com/bytedeco/javacpp/wiki/Mapping-Recipes#specifying-names-to-use-in-java, but that's a problem with the C++ API. If you call read() or write() with a char* in C++, you're probably going to end up calling the "wrong" function as well.

@saudet
Copy link
Member

saudet commented Jan 19, 2023

In my opinion, that's something you should be reporting upstream:
https://www.hdfgroup.org/community/community-development/

@calvertdw
Copy link
Contributor Author

Good thing we have a good workaround, I could see that one taking a while to get changed 😆.

@calvertdw
Copy link
Contributor Author

I reported a bug there: https://jira.hdfgroup.org/browse/HDFVIEW-297

@mkitti
Copy link
Contributor

mkitti commented Feb 23, 2023

@calvertdw why not create a Github issue here: https://github.com/HDFGroup/hdf5/issues ?

@calvertdw
Copy link
Contributor Author

Great point. The other was getting no activity. Done!

@calvertdw
Copy link
Contributor Author

@saudet It seems that this won't be addressed upstream unfortunately.

Would the re-mapping solution mentioned in the comment above work?

@saudet
Copy link
Member

saudet commented Jun 1, 2023

We can add those as helper methods if you want. Sounds good?

@mkitti
Copy link
Contributor

mkitti commented Jun 2, 2023

Are the strings encoded according to Java's modified UTF-8 encoding? https://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

If so, isn't the correct way to read this via the JNI function NewStringUTF?
https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#NewStringUTF

That's how hdf.hdf5lib.H5Dread_string is implemented. So a simple implementation should just be passing the id field of the C++ H5DataSet to hdf.hdf5lib.H5Dread_string, right?
https://github.com/HDFGroup/hdf5/blob/81bc34ac4c3f42532edee32095651f6bcd5e55a2/java/src/jni/h5dImp.c#L993-L1054

@calvertdw
Copy link
Contributor Author

@mkitti You are misundertanding the issue I was having. I was trying to read raw data (encoded JPEG images) from an HDF5 file. I had a BytePointer JavaCPP object and called the HDF5 read method. What I got was only a handful of bytes until the first '0' byte occurred. This was a major head scratcher and I had to dig in to find out I was accidentally calling the String function.

The problem is that to read/write literally just raw data, HDF5 is making that risky by inviting the opportunity to accidentally call a version of read that truncates the read on the first '0'. As that is the most basic of data storage functions, I am a bit dissappointed that HDF5 is not super clear about what those functions do and also super clear about the data truncation in the string case. There is not even documentation on those most basic of functions in the code, which is what I would read as a user of the C++ API.

@mkitti
Copy link
Contributor

mkitti commented Jul 11, 2023

I think HDFGroup/hdf5#3083 still fixes the situation. Even if you read it as a std::string by accident that fix will read the entire JPEG into the std::string. It will not stop at the first null byte since we no longer use the C string convention.

@saudet
Copy link
Member

saudet commented Jul 24, 2023

@mkitti Sounds like a fix to me. When does it get released?

@mkitti
Copy link
Contributor

mkitti commented Jul 24, 2023

Well it got merged into the main development branch, so now it is just a matter of the HDF5 release schedule in their README.
https://github.com/HDFGroup/hdf5/blob/develop/doc/img/release-schedule.png

@calvertdw
Copy link
Contributor Author

It's still not the fix I was hoping for, because the correct way to read/write bytes is to use the standard read and write functions. The C++ code does seem reasonable, it seems like the issue is now that JavaCPP loses the type difference between a String and ByteBuffer by using BytePointer for both. So for all C++ libraries that have signatures where the only difference is a void* vs a std::string would have an issue.

@calvertdw
Copy link
Contributor Author

calvertdw commented Aug 30, 2023

Even if keeping using the string version, I think users would still sometimes get errors, from the changlog of HDFGroup/hdf5#3083:

Fixed length datasets are now read into H5std_string as a fixed length string of the appropriate size. Variable length datasets will still be truncated at the first null byte.

A JPEG image is variable length, after all.

@saudet
Copy link
Member

saudet commented Aug 30, 2023

It still sounds like an issue with HDF5 to me. The read() function has no idea how much memory was allocated in the pointer it was given, so that's a safety issue. There should be a size parameter somewhere to indicate the maximum size. That's not an issue with std::string since that can allocate as much memory as we need.

@mkitti
Copy link
Contributor

mkitti commented Aug 31, 2023

A JPEG image is variable length, after all.

How so? Its length is not determined by some token in the byte sequence.

When you write the bytes of the image, you tell HDF5 how many bytes you want to write.

@mkitti
Copy link
Contributor

mkitti commented Aug 31, 2023

The read() function has no idea how much memory was allocated in the pointer it was given, so that's a safety issue.

The memory space argument is meant to indicate how many bytes have been allocated.

See the docs for the C function H5Dread.

https://docs.hdfgroup.org/hdf5/v1_14/group___h5_d.html#ga8287d5a7be7b8e55ffeff68f7d26811c

@saudet
Copy link
Member

saudet commented Sep 5, 2023

Ok, so inversely, why do we need to that memory space argument for std::string??

@mkitti
Copy link
Contributor

mkitti commented Sep 5, 2023

Ok, so inversely, why do we need to that memory space argument for std::string??

You do not need it in C++. It defaults to DataSpace::ALL.

https://github.com/hdfgroup/hdf5/blob/02c71bd65f3b9d62b2be167eda3adf80668f0ee4/c%2B%2B/src/H5DataSet.h#L79-L84

That appears to get translated to this via JavaCPP:

public void read(@StdString @ByRef
BytePointer buf,
@Const @ByRef
DataType mem_type)

http://bytedeco.org/javacpp-presets/hdf5/apidocs/org/bytedeco/hdf5/DataSet.html#read-org.bytedeco.javacpp.BytePointer-org.bytedeco.hdf5.DataType-

However, you can specify a file dataspace if you only want to read some of the bytes and copy it into a specific location of a buffer via memory space.

@saudet
Copy link
Member

saudet commented Sep 16, 2023

BTW, does it make sense to map H5std_string to BytePointer at all? Or should that be reserved for char*?

@mkitti
Copy link
Contributor

mkitti commented Sep 17, 2023

BytePointer does not make very much sense to me a peer to std::string.

Perhaps java.lang.StringBuffer or StringBuilder would be better?

@saudet
Copy link
Member

saudet commented Sep 19, 2023

There isn't anything interesting about StringBuilder from a JNI point of view, we might as well just pass the internal char[] in that case, which already works with std::u16string, but I don't think HDF5 supports that. I think it only supports std::string, which doesn't support UTF-16, so that's probably not an interesting idea either.

@calvertdw
Copy link
Contributor Author

It seems like JavaCPP should have a new type to match with a std::string, then? I know that'd be quite some work.

@mkitti
Copy link
Contributor

mkitti commented Sep 19, 2023

Java has String, StringBuffer, and StringBuilder.

  • String is immutable.
  • StringBuffer is mutable and synchronized
  • StringBuilder is mutable and not synchronized

std::string is mutable and not synchronized.

@saudet
Copy link
Member

saudet commented Sep 20, 2023

It seems like JavaCPP should have a new type to match with a std::string, then? I know that'd be quite some work.

No, it's not difficult. I've added support for that in commit bytedeco/javacpp@7227ec6, so we can add something like that to HDF5's presets:

infoMap.put(new Info("std::basic_string<char>", "std::string", "H5std_string").pointerTypes("H5std_string").define());

But it's not exactly a nice (efficient) interface for a string and that replaces all instances of std::string... Probably not what you want.

Java has String, StringBuffer, and StringBuilder.

We can access String efficiently from JNI, not so with StringBuffer or StringBuilder.

@calvertdw
Copy link
Contributor Author

That looks great! That sounds like it will absolutely solve this issue, since this is about saving raw data with HDF5 and avoiding accidental calls to the string functions.

@saudet
Copy link
Member

saudet commented Nov 7, 2023

Looking at this again, it seems like everything works like we want when casting explicitly to Pointer, and if we don't cast, it just goes through a std::string, which might be less efficient, but it still works correctly with the latest fixes in HDF5, so I don't see any problems the way things are now. Can we close this? @calvertdw

@calvertdw
Copy link
Contributor Author

Yes. Thank you!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants