Support Object IDs in NWBFile #991

Merged
merged 25 commits into dev from enh/data_id on Jul 30, 2019

Conversation

ajtritt
Member

@ajtritt ajtritt commented Jul 1, 2019

Motivation

Fix #989

Checklist

  • Have you checked our Contributing document?
  • Have you ensured the PR description clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using #XXX notation, where XXX is the issue number?

@ajtritt ajtritt requested review from rly and bendichter July 1, 2019 21:51
@ajtritt ajtritt requested a review from lawrence-mbf July 1, 2019 21:53
@codecov

codecov bot commented Jul 29, 2019

Codecov Report

Merging #991 into dev will increase coverage by 0.07%.
The diff coverage is 60.86%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #991      +/-   ##
==========================================
+ Coverage   72.29%   72.37%   +0.07%     
==========================================
  Files          36       36              
  Lines        2718     2722       +4     
  Branches      513      514       +1     
==========================================
+ Hits         1965     1970       +5     
+ Misses        632      627       -5     
- Partials      121      125       +4
Impacted Files       Coverage Δ
src/pynwb/spec.py    86.31% <ø> (ø) ⬆️
src/pynwb/file.py    76.4% <50%> (-1.2%) ⬇️
src/pynwb/core.py    71.92% <69.23%> (+0.54%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bd603f...4852325.

@ajtritt ajtritt merged commit 4cd273e into dev Jul 30, 2019
@rly rly deleted the enh/data_id branch July 31, 2019 00:58
@sgratiy

sgratiy commented Aug 16, 2019

@oruebel
The object_id changes when re-creating an NWB file from the same data. This is problematic for regression testing, where you want to create an NWB file from test data and compare it to an existing reference NWB file. I do the comparison with h5diff, and it fails because of the differences in object_id; h5diff does not have an option to exclude specific attributes when comparing files.
Is there a way to opt out of saving object_id into the NWB file?

@oruebel
Contributor

oruebel commented Aug 19, 2019

@sgratiy I'm not sure it's a good idea to give folks the option to change object_id. The intent of object_id is to provide a unique identifier for data. The object_id is not a hash of the data: objects generated at different times (i.e., even if you make a copy) are, and should be, considered different for the purpose of the object_id. I am afraid that allowing folks to set object_id would cause more trouble than benefit.

If I understand your use case correctly, what you are doing is:

  1. Save a file as a reference
  2. In your test, create a new file with the same data
  3. Run h5diff between the file created in the test and the one you stored as a reference

I think for this specific use case of regression testing it would be best to fix the issue in the regression test itself. A few possible solutions:

  • You could ignore any diffs for object_id in the test, since differences in object_id are intentional. h5diff has an --exclude-path option, but I'm not sure whether it works for attributes.

  • You could implement a custom diff function using h5py (e.g., as a static method on the HDF5IO backend in HDMF) with an option to ignore object_id; a rough sketch of this idea appears after this list.

  • You could overwrite the object_id value after you write the file and before running h5diff. Using h5py this would probably look something like this:

    import h5py

    def overwrite_object_id(name, h5obj):
        if 'object_id' in h5obj.attrs:
            h5obj.attrs['object_id'] = 'Undefined'

    # open in read/write mode ('r+'); mode 'w' would truncate the existing file
    with h5py.File('mfile', 'r+') as f:
        overwrite_object_id('/', f)          # visititems() does not visit the root group
        f.visititems(overwrite_object_id)

  • You could monkey patch the object_id property of the Container class in HDMF (https://github.com/hdmf-dev/hdmf/blob/e13fc0bbfcdb927fba291db5a9de2798036ea6ac/src/hdmf/container.py#L35-L39) in your regression tests so that it returns some fixed value, with something like:

    from hdmf import Container

    def fixed_property(self):
        return "Undefined"

    Container.object_id = property(fixed_property)

  • https://github.com/NeurodataWithoutBorders/diff could be useful. I don't know the tool and I don't think it is being actively maintained, but it may provide you with a starting point for what you need.
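
As a rough sketch of the custom h5py-based diff mentioned above (the function name h5_equal_ignoring_object_id, the numpy dependency, and the choice to compare only attribute values and dataset contents are illustrative assumptions, not HDMF or pynwb API), it could look something like:

    import h5py
    import numpy as np

    def h5_equal_ignoring_object_id(path_a, path_b):
        """Return True if two HDF5 files match, ignoring 'object_id' attributes.

        Simplified: reads datasets fully into memory and does not handle
        HDF5 object references, links, or the root group's attributes.
        """
        def collect(path):
            items = {}
            def visit(name, obj):
                attrs = {k: v for k, v in obj.attrs.items() if k != 'object_id'}
                data = obj[()] if isinstance(obj, h5py.Dataset) else None
                items[name] = (attrs, data)
            with h5py.File(path, 'r') as f:
                f.visititems(visit)
            return items

        a, b = collect(path_a), collect(path_b)
        if a.keys() != b.keys():
            return False
        for name in a:
            (attrs_a, data_a), (attrs_b, data_b) = a[name], b[name]
            if attrs_a.keys() != attrs_b.keys():
                return False
            if any(not np.array_equal(attrs_a[k], attrs_b[k]) for k in attrs_a):
                return False
            if not np.array_equal(data_a, data_b):
                return False
        return True

A regression test could then assert h5_equal_ignoring_object_id(reference_path, generated_path) instead of shelling out to h5diff.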

@sgratiy

sgratiy commented Aug 20, 2019

@oruebel Yes, you correctly interpreted my use case. The --exclude-path option in h5diff does not apply to attributes. Thanks for the detailed response and for outlining possible solutions; I truly appreciate it. I have decided to go with the monkey-patching suggestion, as it was the easiest to implement.
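
For example, applied during test setup (a sketch only; the fixture name and the use of pytest's monkeypatch are assumptions, not something from this thread):

    import pytest
    from hdmf import Container

    @pytest.fixture
    def fixed_object_id(monkeypatch):
        # Replace the object_id property for the duration of a test so that
        # files written while the fixture is active always carry the same value.
        monkeypatch.setattr(Container, 'object_id', property(lambda self: 'Undefined'))

    def test_roundtrip(fixed_object_id):
        ...  # write the NWB file here and compare it against the reference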

This solution will work for me, but I am still not quite sure about your statement:

The intent of object_id is to provide a unique identifier for data.

By "data" I understand the actual data that is stored in the file and could be used by scientists. In that case, the current object_id is not a unique identifier for the data. Instead, it is an identifier for an object created in memory. I am not sure that a uuid would be as useful as a data hash.

@oruebel
Contributor

oruebel commented Aug 20, 2019

Instead, it is an identifier for an object created in memory.

It identifies a specific object in a specific file.

I am not sure that a uuid would be as useful as a data hash.

The problem with a data hash is that you can easily have different data with the same hash. Also, generating a good hash for large data is time-consuming. A data hash and an object_id serve different purposes.
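
To make the distinction concrete (a minimal sketch that only illustrates the general behavior of UUIDs versus content hashes, not pynwb internals):

    import hashlib
    import uuid

    import numpy as np

    data = np.arange(1_000_000, dtype='int64')

    # Identity: a fresh UUID for every created object, so re-creating the
    # "same" data yields a different identifier each time.
    assert str(uuid.uuid4()) != str(uuid.uuid4())

    # Content: a hash over the bytes is identical for identical data, but
    # computing it requires reading every byte.
    assert (hashlib.sha256(data.tobytes()).hexdigest()
            == hashlib.sha256(data.tobytes()).hexdigest())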
