Draft: Introduce REDUCTION and ERRATA file sections #473

Closed

Conversation

steffenhauf
Member

@philsmt - a draft that introduces the additional sections in the H5 structure we had discussed yesterday. We can shape and rename this as needed. For now this branch is to get the CI on the DAQ going again with files that contain additional sections.

@steffenhauf steffenhauf changed the title Introduce REDUCTION and ERRATA file sections Draft: Introduce REDUCTION and ERRATA file sections Dec 15, 2023
@philsmt
Contributor

philsmt commented Dec 15, 2023

Some context from recent floor discussions:

  • As with exdf-tools, we were aiming to also have the DAQ introduce metadata about any (online) data reduction performed. The former tool has so far done this with custom datasets in METADATA/reduction and INDEX, if applicable.
  • Looking into this from a DAQ perspective, Steffen suggested it would be easiest for their side to have a new root category and treat any such entries as fast data under a (semi-)virtual source name.
  • It looks to me like we should also be able to integrate the work done so far on the exdf-tools side under this umbrella.
  • This should also simplify integration with EXtra-data: instead of requiring explicit support via properties and methods, it leverages all the existing read machinery, just under different names.
  • For compatibility and to avoid confusion, DataCollection.all_sources and other functionality like item access should probably continue to only include CONTROL and INSTRUMENT sources.
  • We could go for a similar solution as with aliases and use a custom indexer attribute for access, i.e. run.reduction[<source>] (see the sketch after this list).
  • The same applies to ERRATA, which will be used to record anything that went wrong with data saving, e.g. a dropped train.
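
For illustration, here is a minimal sketch of what such an indexer attribute could look like; this is not EXtra-data's actual API, and the helper methods _sources_in_section and _get_source_data are hypothetical:

    class AuxiliaryIndexer:
        """Index-style access to sources in an auxiliary root section."""

        def __init__(self, data, section):
            self._data = data        # a DataCollection-like object
            self._section = section  # e.g. 'REDUCTION' or 'ERRATA'

        def __iter__(self):
            # Hypothetical helper listing source names in this section.
            return iter(self._data._sources_in_section(self._section))

        def __getitem__(self, source):
            # Reuse the existing read machinery, just rooted under
            # REDUCTION or ERRATA instead of CONTROL/INSTRUMENT.
            return self._data._get_source_data(source, section=self._section)

    # Hypothetical usage: run.reduction = AuxiliaryIndexer(run, 'REDUCTION')
    # run.reduction['PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf']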

@philsmt
Contributor

philsmt commented Dec 19, 2023

@takluyver @tmichela

@@ -146,6 +146,8 @@ def __init__(self, filename, _cache_info=None):
self.train_ids = _cache_info['train_ids']
self.control_sources = _cache_info['control_sources']
self.instrument_sources = _cache_info['instrument_sources']
self.reduction_data = _cache_info['reduction_data']
Contributor

This should either not be cached at all or use conditional access like validity_flag further below. Since it's only auxiliary information and may often be comparably small, it may be fine to always generate it on demand. If this changes in the future, we can add machinery to augment existing indices, or simply re-create them.
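
A rough sketch of the conditional-access pattern suggested here, assuming a hypothetical _read_reduction_sources() helper that scans the file on demand (names are illustrative, not the final code):

    if 'reduction_data' in _cache_info:
        self.reduction_data = _cache_info['reduction_data']
    else:
        # Cache files written by older versions won't have this key;
        # fall back to reading the (usually small) auxiliary info
        # directly from the file.
        self.reduction_data = self._read_reduction_sources()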

Member Author

Makes sense. I'll remove it from caching.

@@ -418,6 +427,10 @@ def index_groups(self, source):
return {''}
elif source in self.instrument_sources:
return set(self.file[f'/INDEX/{source}'].keys())
elif source in self.reduction_data:
return set(self.file[f'/INDEX/{source}'].keys())
elif source in self.errata:
Contributor

I understand from our example that REDUCTION sources are always prefixed with some category (e.g. REDUCTION/PULSE_SLICING/<XTDF-source>/...). What would ERRATA sources look like? All the reader machinery assumes that source names are unique across all root sections.

Member Author

There's nothing concrete implemented in the DAQ, but some form of source identification is required. Think of a train that came outside the buffer range and couldn't be stored. You'd want to know what the data was, so it would be prefixed with the source name in the path.

Also, technically all this additional data is FAST data for the DAQ - i.e. it will be stored only if a train contains it.

Contributor

Certainly, my question was indeed more concretely about how you would pick the source name. Say SA3_XTD10_XGM/XGM/DOOCS:output has a hiccup and sends a late train: under what source name (i.e. the path below ERRATA) would you save this? It must be different from SA3_XTD10_XGM/XGM/DOOCS:output by at least a prefix or suffix.

Furthermore, I had a look at your example file for data reduction. There are INDEX entries for

  • PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf
  • PCLAYER_CI/LFF/DATA_TEST_1:reductionOutput
  • PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf

with a corresponding entry in INSTRUMENT for the first and in REDUCTION for the latter two. Is PCLAYER_CI/LFF/DATA_TEST_1 here a real device used as input for the data aggregator, or also some virtual source used for metadata tracking? Ideally, sources in REDUCTION or ERRATA would never be identical to actual sources potentially occurring in CONTROL or INSTRUMENT.

Member Author

So in the reduction case, I prefix all source names with PULSE_REDUCTION to make them distinct from the "real" sources; one could do the same for ERRATA, e.g. ERRATA/LATE_TRAINS/source/name/and/data/path/.

Hence, for what you observe in the index section:

  • PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf is the real source
  • PCLAYER_CI/LFF/DATA_TEST_1:reductionOutput is the real source
  • PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf is injected by the pulse dim reducer, i.e. it is an artificial source (see the sketch below).
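
To make the prefixing scheme concrete, an illustrative look at how such paths would sit next to the real source in one file (the file name and the LATE_TRAINS entry are hypothetical examples):

    import h5py

    with h5py.File('RAW-R0001-DA01-S00000.h5', 'r') as f:
        # The real instrument source:
        real = f['INSTRUMENT/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf']
        # The artificial source injected by the pulse dim reducer:
        reduced = f['REDUCTION/PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf']
        # A hypothetical ERRATA entry for a late train could be prefixed similarly:
        # f['ERRATA/LATE_TRAINS/SA3_XTD10_XGM/XGM/DOOCS:output/...']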

elif category == 'REDUCTION':
reduction_info.add(h5_source)
elif category == 'ERRATA':
errata.add(h5_source)
elif category == 'Karabo_TimerServer':
# Ignore virtual data source used only in file format
# version 1.1 / pclayer-1.10.3-2.10.5.
pass
else:
raise ValueError("Unknown data category %r" % category)
Contributor

@philsmt Dec 19, 2023

So this bit is probably the big sore point for this proposal - pretty much all prior versions of EXtra-data (or even karabo_data for that matter) will break with these new files.

Since this only affects the METADATA/dataSources entry but not the actual existence of the root groups, one way around this would be to not list these sources here and instead do one of the following:

  1. Not list them at all and discover them by iterating over the root group (sketched below)
  2. List them in a distinct dataset only for these virtual metadata sources
  3. List both the real and virtual sources in a new dataset while preserving the old one for compatibility
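
For option 1, a rough sketch of what discovery could look like, assuming only that the new root groups exist (file name hypothetical):

    import h5py

    with h5py.File('RAW-R0001-DA01-S00000.h5', 'r') as f:
        reduction_entries = list(f['REDUCTION'].keys()) if 'REDUCTION' in f else []
        errata_entries = list(f['ERRATA'].keys()) if 'ERRATA' in f else []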

Member Author

Yes, I don't see any way around it though, except if we add it below METADATA after all. Generally, one should perhaps make this check less restrictive in my view, as it could hit us again in the future.

@takluyver
Member

Do you have some examples of the structure of what goes in the new sections?

To be honest, this doesn't look like a simple solution from my perspective. We need significant changes to EXtra-data just to let it open files with these additions (I don't think the changes here are actually sufficient), we're stuffing an extra layer of info into "source names", and we likely want to introduce some category of hidden sources, since these aren't really like the experimental data. 😕

Putting it into metadata made more sense to me, but I can envisage other ways to do it. Could I have a look at some examples and come back with ideas?

@steffenhauf
Member Author

@takluyver - there are sample files on Maxwell: /gpfs/exfel/data/scratch/haufs/daq_online_reduction/

@steffenhauf
Member Author

Btw, putting it in METADATA is fine for me; I would then nevertheless like to introduce a substructure as a second level underneath. However, having a convenient way to read the info via EXtra-data is beneficial in my view. Technically, it's like fast data in how the DAQ stores it, so much of the existing machinery there could be reused.

@takluyver
Member

Main point: We'd really like these additions to happen in a way that doesn't break old versions of EXtra-data (we know people use old versions and will complain to us), and we think that should be easy with a small adjustment to the structure. We can add the new top-level sections, but rather than adding them to METADATA/dataSources, create a new subgroup like METADATA/reductionSources to list them in. Existing versions of EXtra-data should just ignore this, and that gives us more time to work out how to expose them.

I appreciate you're trying to get this in very soon, but if it is possible to change this, we'd really appreciate it.


Secondary points

  1. Having two kinds of sources in REDUCTION - some corresponding to Karabo pipelines, and others which are to be understood as {reduction_technique}/{source_name} - seems confusing to me. Would it make sense to put the 'real' Karabo sources in INSTRUMENT/CONTROL, so we can assume that everything under REDUCTION fits the same 'reduction technique' structure?
  2. If there are multiple reductions for the same source, do we define the order in which they were applied? Probably this can just be the order that they appear in reductionSources.

@philsmt
Contributor

philsmt commented Jan 9, 2024

Minor correction after discussing it with Steffen: Let's call the new source listing METADATA/auxiliarySources to also account for the ERRATA section.

@steffenhauf
Member Author

Just also had an offline discussion with @philsmt - I took from this that

  • I'll keep the actual data under the high level keys REDUCTION and ERRATA
  • The sources will not be added to METADATA/dataSources but rather to METADATA/auxiliarySources/reduction and METADATA/auxiliarySources/errata

EXtra-data could then provide access to these via an interface to be developed, which would closely resemble iterating through INSTRUMENT-like data sources.
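
For example, reading could look roughly like the sketch below. This assumes METADATA/auxiliarySources/reduction and .../errata are simple string datasets of source names; their exact layout isn't fixed by this discussion, and the file name is hypothetical:

    import h5py

    with h5py.File('RAW-R0001-DA01-S00000.h5', 'r') as f:
        aux = f.get('METADATA/auxiliarySources')
        if aux is not None:
            reduction_sources = [s.decode() for s in aux['reduction'][:]]
            errata_sources = [s.decode() for s in aux['errata'][:]]
            # Data for each listed source lives under the new root sections:
            for src in reduction_sources:
                keys = list(f[f'REDUCTION/{src}'].keys())
        # Older EXtra-data versions never look at this group or the new
        # root sections, so they simply ignore the additions.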

Would this be in line with the points you've outlined above, @takluyver?

@takluyver
Member

Yep, I think that works. I've just copied your example file and modified it (in /gpfs/exfel/data/scratch/kluyvert/daq_online_reduction) to do a quick mock-up of this idea, and confirmed that EXtra-data can open the file and read data from it, ignoring the additions.

@steffenhauf
Member Author

Thanks. I've added a v2 file to /gpfs/exfel/data/scratch/haufs/daq_online_reduction as an example of what the DAQ would output after refactoring to this new schema.

@takluyver
Member

It looks like the INDEX sections for the instrument source in that v2 file are all zeroed out, so EXtra-data sees the instrument data as empty (length 0 on the first dimension). Is that a consequence of these changes, or just an artifact of how it was tested?

@steffenhauf
Member Author

I think it's a bug in my changes. I'll check.

@steffenhauf
Member Author

Should be fixed now. I've uploaded an updated version of the same file.

@takluyver
Member

Thanks!

I might close this PR now that we have a design that doesn't require changes in EXtra-data. We'll still want to provide a nice API to get at the reduction information (& errata - I need to better understand what goes there), but it's somewhat lower priority now. We probably won't add the reduction sources to the JSON cache file. But on the other hand we probably should expose them through virtual overview files, as otherwise they'll be somewhat hidden for runs with a virtual overview file.


@takluyver
Member

I've opened #477 to track creating APIs for these new sections.
