Draft: Introduce REDUCTION and ERRATA file sections #473

Closed

Conversation

steffenhauf
Member

@philsmt - a draft that introduces the additional sections in the H5 structure we had discussed yesterday. We can shape and rename this as needed. For now this branch is to get the CI on the DAQ going again with files that contain additional sections.

@steffenhauf steffenhauf changed the title Introduce REDUCTION and ERRATA file sections Draft: Introduce REDUCTION and ERRATA file sections Dec 15, 2023
@philsmt
Contributor

philsmt commented Dec 15, 2023

Some context from recent floor discussions:

  • As with exdf-tools, we were aiming to also have the DAQ introduce metadata about any (online) data reduction performed. The former tool has so far done this with custom datasets in METADATA/reduction and INDEX, if applicable.
  • Looking into this from a DAQ perspective, Steffen suggested it would be easiest for their side to have a new root category and treat any such entries as fast data under a (semi-)virtual source name.
  • It looks to me like we should also be able to integrate the work done so far on the exdf-tools side under this umbrella.
  • This should also simplify integration with EXtra-data: instead of requiring explicit support via properties and methods, it leverages all the existing read machinery, just under different names.
  • For compatibility and to avoid confusion, DataCollection.all_sources and other functionality like item access should probably continue to only include CONTROL and INSTRUMENT sources.
  • We could go for a similar solution as with aliases and use a custom indexer attribute for access, i.e. run.reduction[<source>] (see the sketch after this list).
  • The same applies to ERRATA, which will be used to record anything that went wrong with data saving, e.g. a dropped train.
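
For illustration, here is a minimal sketch of what such an indexer attribute could look like; this is not EXtra-data's actual API, and the helper methods _sources_in_section and _get_source_data are hypothetical:

    class AuxiliaryIndexer:
        """Index-style access to sources in an auxiliary root section."""

        def __init__(self, data, section):
            self._data = data        # a DataCollection-like object
            self._section = section  # e.g. 'REDUCTION' or 'ERRATA'

        def __iter__(self):
            # Hypothetical helper listing source names in this section.
            return iter(self._data._sources_in_section(self._section))

        def __getitem__(self, source):
            # Reuse the existing read machinery, just rooted under
            # REDUCTION or ERRATA instead of CONTROL/INSTRUMENT.
            return self._data._get_source_data(source, section=self._section)

    # Hypothetical usage: run.reduction = AuxiliaryIndexer(run, 'REDUCTION')
    # run.reduction['PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf']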

@philsmt
Contributor

philsmt commented Dec 19, 2023

@takluyver @tmichela

@@ -146,6 +146,8 @@ def __init__(self, filename, _cache_info=None):
self.train_ids = _cache_info['train_ids']
self.control_sources = _cache_info['control_sources']
self.instrument_sources = _cache_info['instrument_sources']
self.reduction_data = _cache_info['reduction_data']
Contributor

This should either not be cached at all or use conditional access like validity_flag further below. Since it's only auxiliary information and may often be comparably small, it may be fine to always generate it on demand. If this changes in the future, we can add machinery to augment existing indices, or simply re-create them.
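
A rough sketch of the conditional-access pattern suggested here, assuming a hypothetical _read_reduction_sources() helper that scans the file on demand (names are illustrative, not the final code):

    if 'reduction_data' in _cache_info:
        self.reduction_data = _cache_info['reduction_data']
    else:
        # Cache files written by older versions won't have this key;
        # fall back to reading the (usually small) auxiliary info
        # directly from the file.
        self.reduction_data = self._read_reduction_sources()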

Member Author

Makes sense. I'll remove it from caching.

@@ -418,6 +427,10 @@ def index_groups(self, source):
return {''}
elif source in self.instrument_sources:
return set(self.file[f'/INDEX/{source}'].keys())
elif source in self.reduction_data:
return set(self.file[f'/INDEX/{source}'].keys())
elif source in self.errata:
Contributor

I understand from our example that REDUCTION sources are always prefixed with some category (e.g. REDUCTION/PULSE_SLICING/<XTDF-source>/...). What would ERRATA sources look like? All the reader machinery assumes that source names are unique across all root sections.

Member Author

There's nothing concrete implemented in the DAQ, but some form of source identification is required. Think of a train that came outside the buffer range and couldn't be stored. You'd want to know what the data was, so it would be prefixed with the source name in the path.

Also, technically all this additional data is FAST data for the DAQ - i.e. it will be stored only if a train contains it.

Contributor

Certainly, my question was indeed more concretely about how you would pick the source name. Say SA3_XTD10_XGM/XGM/DOOCS:output has a hiccup and sends a late train: under what source name (i.e. the path below ERRATA) would you save this? It must be different from SA3_XTD10_XGM/XGM/DOOCS:output by at least a prefix or suffix.

Furthermore, I had a look at your example file for data reduction. There are INDEX entries for

  • PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf
  • PCLAYER_CI/LFF/DATA_TEST_1:reductionOutput
  • PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf

with a corresponding entry in INSTRUMENT for the first and in REDUCTION for the latter two. Is PCLAYER_CI/LFF/DATA_TEST_1 here a real device used as input for the data aggregator, or also some virtual source used for metadata tracking? Ideally, sources in REDUCTION or ERRATA would never be identical to actual sources potentially occurring in CONTROL or INSTRUMENT.

Member Author

So in the reduction case, I prefix all source names with PULSE_REDUCTION to make them distinct from the "real" sources; one could do the same for ERRATA, e.g. ERRATA/LATE_TRAINS/source/name/and/data/path/.

Hence, for what you observe in the index section:

  • PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf is the real source
  • PCLAYER_CI/LFF/DATA_TEST_1:reductionOutput is the real source
  • PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf is injected by the pulse dim reducer, i.e. it is an artificial source (see the sketch below).
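
To make the prefixing scheme concrete, an illustrative look at how such paths would sit next to the real source in one file (the file name and the LATE_TRAINS entry are hypothetical examples):

    import h5py

    with h5py.File('RAW-R0001-DA01-S00000.h5', 'r') as f:
        # The real instrument source:
        real = f['INSTRUMENT/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf']
        # The artificial source injected by the pulse dim reducer:
        reduced = f['REDUCTION/PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf']
        # A hypothetical ERRATA entry for a late train could be prefixed similarly:
        # f['ERRATA/LATE_TRAINS/SA3_XTD10_XGM/XGM/DOOCS:output/...']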

elif category == 'REDUCTION':
reduction_info.add(h5_source)
elif category == 'ERRATA':
errata.add(h5_source)
elif category == 'Karabo_TimerServer':
# Ignore virtual data source used only in file format
# version 1.1 / pclayer-1.10.3-2.10.5.
pass
else:
raise ValueError("Unknown data category %r" % category)
Contributor

@philsmt Dec 19, 2023

So this bit is probably the big sore point for this proposal - pretty much all prior versions of EXtra-data (or even karabo_data for that matter) will break with these new files.

Since this only affects the METADATA/dataSources entry but not the actual existence of the root groups, one way around this would be to not list these sources here and instead do one of the following:

  1. Not list them at all and discover them by iterating over the root group (sketched below)
  2. List them in a distinct dataset only for these virtual metadata sources
  3. List both the real and virtual sources in a new dataset while preserving the old one for compatibility
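
For option 1, a rough sketch of what discovery could look like, assuming only that the new root groups exist (file name hypothetical):

    import h5py

    with h5py.File('RAW-R0001-DA01-S00000.h5', 'r') as f:
        reduction_entries = list(f['REDUCTION'].keys()) if 'REDUCTION' in f else []
        errata_entries = list(f['ERRATA'].keys()) if 'ERRATA' in f else []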

Member Author

Yes, I don't see any way around it though, except if we add it below METADATA after all. Generally, one should perhaps make this check less restrictive in my view, as it could hit us again in the future.

@takluyver
Member

Do you have some examples of the structure of what goes in the new sections?

To be honest, this doesn't look like a simple solution from my perspective. We need significant changes to EXtra-data just to let it open files with these additions (I don't think the changes here are actually sufficient), we're stuffing an extra layer of info into "source names", and we likely want to introduce some category of hidden sources, since these aren't really like the experimental data. 😕

Putting it into metadata made more sense to me, but I can envisage other ways to do it. Could I have a look at some examples and come back with ideas?

@steffenhauf
Member Author

@takluyver - there are sample files on Maxwell: /gpfs/exfel/data/scratch/haufs/daq_online_reduction/

@steffenhauf
Member Author

Btw, putting it in METADATA is fine for me; I would then nevertheless like to introduce a substructure as a second level underneath. However, having a convenient way to read the info via EXtra-data is beneficial in my view. Technically, it's like fast data in how the DAQ stores it, so much of the existing machinery there could be reused.

@takluyver
Member

Main point: We'd really like these additions to happen in a way that doesn't break old versions of EXtra-data (we know people use old versions and will complain to us), and we think that should be easy with a small adjustment to the structure. We can add the new top-level sections, but rather than adding them to METADATA/dataSources, create a new subgroup like METADATA/reductionSources to list them in. Existing versions of EXtra-data should just ignore this, and that gives us more time to work out how to expose them.

I appreciate you're trying to get this in very soon, but if it is possible to change this, we'd really appreciate it.


Secondary points

  1. Having two kinds of sources in REDUCTION - some corresponding to Karabo pipelines, and others which are to be understood as {reduction_technique}/{source_name} - seems confusing to me. Would it make sense to put the 'real' Karabo sources in INSTRUMENT/CONTROL, so we can assume that everything under REDUCTION fits the same 'reduction technique' structure?
  2. If there are multiple reductions for the same source, do we define the order in which they were applied? Probably this can just be the order that they appear in reductionSources.

@philsmt
Contributor

philsmt commented Jan 9, 2024

Minor correction after discussing it with Steffen: Let's call the new source listing METADATA/auxiliarySources to also account for the ERRATA section.

@steffenhauf
Member Author

Just also had an offline discussion with @philsmt - I took from this that

  • I'll keep the actual data under the high level keys REDUCTION and ERRATA
  • The sources will not be added to METADATA/dataSources but rather to METADATA/auxiliarySources/reduction and METADATA/auxiliarySources/errata

EXtra-data could then provide access to these via an interface to be developed, which would closely resemble iterating through INSTRUMENT-like data sources.
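
For example, reading could look roughly like the sketch below. This assumes METADATA/auxiliarySources/reduction and .../errata are simple string datasets of source names; their exact layout isn't fixed by this discussion, and the file name is hypothetical:

    import h5py

    with h5py.File('RAW-R0001-DA01-S00000.h5', 'r') as f:
        aux = f.get('METADATA/auxiliarySources')
        if aux is not None:
            reduction_sources = [s.decode() for s in aux['reduction'][:]]
            errata_sources = [s.decode() for s in aux['errata'][:]]
            # Data for each listed source lives under the new root sections:
            for src in reduction_sources:
                keys = list(f[f'REDUCTION/{src}'].keys())
        # Older EXtra-data versions never look at this group or the new
        # root sections, so they simply ignore the additions.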

Would this be in line with the points you've outlined above, @takluyver?

@takluyver
Member

Yep, I think that works. I've just copied your example file and modified it (in /gpfs/exfel/data/scratch/kluyvert/daq_online_reduction) to do a quick mock-up of this idea, and confirmed that EXtra-data can open the file and read data from it, ignoring the additions.

@steffenhauf
Member Author

Thanks. I've added a v2 file to /gpfs/exfel/data/scratch/haufs/daq_online_reduction as an example of what the DAQ would output after refactoring to this new schema.

@takluyver
Member

It looks like the INDEX sections for the instrument source in that v2 file are all zeroed out, so EXtra-data sees the instrument data as empty (length 0 on the first dimension). Is that a consequence of these changes, or just an artifact of how it was tested?

@steffenhauf
Member Author

I think it's a bug in my changes. I'll check.

@steffenhauf
Member Author

Should be fixed now. I've uploaded an updated version of the same file.

@takluyver
Member

Thanks!

I might close this PR now that we have a design that doesn't require changes in EXtra-data. We'll still want to provide a nice API to get at the reduction information (& errata - I need to better understand what goes there), but it's somewhat lower priority now. We probably won't add the reduction sources to the JSON cache file. But on the other hand we probably should expose them through virtual overview files, as otherwise they'll be somewhat hidden for runs with a virtual overview file.


@takluyver
Member

I've opened #477 to track creating APIs for these new sections.
