H5DataSet read and write H5std_string functions are likely to get called accidentally #2501
Comments
Just to clarify, you are using the JavaCPP org.bytedeco bindings, which are automatically generated from the C++ bindings, rather than the hdf.hdf5lib bindings, right? |
I'll check about the H5DataSet read and write |
Yes that is correct @mkitti, I am referring to and JavaCPP binds to the C++ API. |
Thank you! |
Isn't this a bug for JavaCPP then? |
I agree the 0 termination issue is strange if you were using a Java API such as the one that HDF Group provides. However, you are using an automatically generated Java API from a C/C++ API where null terminated strings are normal. Your requested fix is to change the C/C++ API. Wouldn't it make sense to use the canonical Java API instead to resolve this? |
Specifically, the JNI function NewStringUTF (Line 1033 in 248045e) would use the modified UTF-8 encoding documented here: https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/DataInput.html#modified-utf-8 avoiding the internal null bytes. |
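For readers unfamiliar with modified UTF-8, here is a minimal C++ sketch of the one property that matters in this thread: the character U+0000 is written as the two bytes 0xC0 0x80, so the encoded form never contains a 0 byte. The helper name `to_modified_utf8_7bit` is hypothetical and is not part of HDF5, JNI, or JavaCPP. It only handles 7-bit input; real modified UTF-8 also covers multi-byte characters and surrogate pairs, which are left out here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration only (not an HDF5, JNI, or JavaCPP API).
// Copies 7-bit input unchanged, except that each 0x00 byte is replaced by the
// two-byte modified UTF-8 sequence 0xC0 0x80.
std::string to_modified_utf8_7bit(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c >= 0x80) {
            throw std::invalid_argument("sketch handles 7-bit input only");
        }
        if (c == 0x00) {
            out.push_back(static_cast<char>(0xC0));
            out.push_back(static_cast<char>(0x80));
        } else {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}

// Small self-check: the encoded form is longer and has no 0 byte left in it.
void demo_modified_utf8() {
    const std::string raw("ab\0cd", 5);  // 5 bytes, one of them 0x00
    const std::string enc = to_modified_utf8_7bit(raw);
    assert(enc.size() == 6);
    assert(enc.find('\0') == std::string::npos);
}
```

Because the encoded bytes contain no 0, code that stops at the first null terminator would still see the whole payload.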
The JNI that HDF5 provides isn't sufficient for our large team and codebase. The Bytedeco JavaCPP library provides many crucial benefits like portability and seamless use of many AI related C/C++ libraries. It is available on Maven Central with prebuilt binaries for virtually every platform included in the artifacts. I would consider HDF5 to be a very "enterprise" level library, such that the biggest users are likely to be organizations which would value the features that Bytedeco provides. With that said, is there some reason for not changing the names of the string functions to |
As of bytedeco/javacpp-presets#1327 you can use For now you can use the instructions at http://bytedeco.org/builds/ to pull the snapshots via maven or ask saudet when the release will occur:
There probably would be better traction in clearly stating at that reading a fixed length string into Line 732 in 79bb60c
|
Thanks for the detailed reply. The statements "It probably would have be [...] as to not break existing applications" and "The main focus of HDF Group is the C API" seem to conflict. If you aren't focusing on the C++ API anyway, why spend extra effort maintaining backwards compatibility?
Well this wouldn't work, because the old string method having the same signature is the actual issue here. Additionally, the fixed length string reader doesn't help in the situation when you don't know beforehand how long it is. I guess you could find that out using other methods, but that's laborious for no reason. |
To be clear, I am not HDF Group.
I disagree. The problem is that the C++ API always assumes that it is reading a null-terminated C string, even when the true length is known. Rather the C++ API should create a C++ There are three ways to encode a string in HDF5 as detailed in the following URL.
Do you have a variable length string with one or more embedded nulls? If so, how do you know the actual length of the string? |
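The difference between assuming a null-terminated C string and using a known length can be shown with the standard library alone. This is only a sketch: `from_c_string` and `from_known_length` are hypothetical stand-ins, not functions from the HDF5 C++ API, and no HDF5 calls are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a read that treats the buffer as a C string:
// std::string(const char*) stops copying at the first 0 byte.
std::string from_c_string(const char* buf) {
    return std::string(buf);
}

// Length-aware alternative: std::string(const char*, size_t) copies exactly
// `len` bytes, including any embedded 0 bytes.
std::string from_known_length(const char* buf, std::size_t len) {
    return std::string(buf, len);
}

// Small self-check on a 5-byte payload with a 0 byte in the middle.
void demo_string_length() {
    const char buf[] = {'a', 'b', '\0', 'c', 'd', '\0'};
    assert(from_c_string(buf).size() == 2);         // truncated at the 0 byte
    assert(from_known_length(buf, 5).size() == 5);  // all payload bytes kept
}
```

The second form is what preserves embedded nulls, provided the caller already knows the true length.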
I filed #3034 in relation to this post. If we resolved that, then we should be able to read embedded null bytes into C++'s |
An interesting conversation: Lib HDF5 defines strings as a sequence of ASCII or utf8 characters, neither of the standards allow
|
I believe it may be possible to have a fixed length |
Some implementations allow non-conforming output, if I understood you right Fortran can insert a null in the middle of a sequence of bytes (which are not UTF8 or ASCII strings); but other conforming libraries will ignore the data after the Yes, you are correct to say that OTOH: Opaque is a good candidate to save binary content, and personally this is what I do. Binary to string encoding such as base64 and variants are alternatives, but will look unusual. Probably this could make a good topic for @gheber Call the Doctor ? |
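As a rough illustration of the base64 alternative mentioned above, the sketch below encodes arbitrary bytes, including 0 bytes, into printable characters that are safe to store as an ordinary string. It uses the standard RFC 4648 alphabet with `=` padding. The function name `base64_encode` is just a local helper for this example, not something HDF5 provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Local helper for illustration: standard base64 (RFC 4648) with '=' padding.
std::string base64_encode(const std::vector<std::uint8_t>& data) {
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((data.size() + 2) / 3) * 4);
    std::size_t i = 0;
    // Each full group of 3 input bytes becomes 4 output characters.
    for (; i + 2 < data.size(); i += 3) {
        const std::uint32_t n = (static_cast<std::uint32_t>(data[i]) << 16) |
                                (static_cast<std::uint32_t>(data[i + 1]) << 8) |
                                static_cast<std::uint32_t>(data[i + 2]);
        out.push_back(table[(n >> 18) & 63]);
        out.push_back(table[(n >> 12) & 63]);
        out.push_back(table[(n >> 6) & 63]);
        out.push_back(table[n & 63]);
    }
    // A trailing group of 1 or 2 bytes is padded with '='.
    const std::size_t rest = data.size() - i;
    if (rest == 1) {
        const std::uint32_t n = static_cast<std::uint32_t>(data[i]) << 16;
        out.push_back(table[(n >> 18) & 63]);
        out.push_back(table[(n >> 12) & 63]);
        out += "==";
    } else if (rest == 2) {
        const std::uint32_t n = (static_cast<std::uint32_t>(data[i]) << 16) |
                                (static_cast<std::uint32_t>(data[i + 1]) << 8);
        out.push_back(table[(n >> 18) & 63]);
        out.push_back(table[(n >> 12) & 63]);
        out.push_back(table[(n >> 6) & 63]);
        out.push_back('=');
    }
    return out;
}

// Small self-check: bytes containing zeros map to a null-free string.
void demo_base64() {
    const std::string enc = base64_encode({0x00, 0x01, 0x00});
    assert(enc == "AAEA");
    assert(enc.find('\0') == std::string::npos);
}
```

The trade-off is the one already noted: the stored value grows by about a third and no longer looks like the original data, which is why an opaque datatype may be the more natural fit for binary content.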
@mkitti convinced me to reconsider, and from then on I can't un-see this: Since C++ |
The H5DataSet read and write functions have options for doing so with either a void* or a H5std_string.
We ran into an issue with a Java binding for HDF5, documented here.
The issue was that in trying to read and write bytes to an H5DataSet, we were accidentally calling the H5std_string versions without knowing it, and the read function in that case will terminate the read on any 0 bytes.
It took a while to figure out why we couldn't read and write bytes when other types (ints, floats, etc) worked fine. Our team spent several days scratching our heads.
I was wondering if the same issue might apply to all users of HDF5, who might be wishing to read char* with 0s in it and getting unexpected results.
One possible solution would be to rename the H5std_string versions to readString and writeString. What do you guys think?
Thanks!