Draft: Introduce REDUCTION and ERRATA file sections #473
Conversation
Some context from recent floor discussions:
@@ -146,6 +146,8 @@ def __init__(self, filename, _cache_info=None):
        self.train_ids = _cache_info['train_ids']
        self.control_sources = _cache_info['control_sources']
        self.instrument_sources = _cache_info['instrument_sources']
        self.reduction_data = _cache_info['reduction_data']
This should either not be cached at all or use conditional access like `validity_flag` further below. Since it's only auxiliary information and may often be comparably small, it may be fine to always generate it on demand. If this changes in the future, we can add machinery to augment existing indices, or simply re-create them.
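To make the on-demand variant concrete, here is a minimal sketch that re-derives the REDUCTION source set each time a file is opened instead of persisting it in the JSON cache. It assumes the format-1.x `METADATA/dataSources` layout (`root` and `deviceId` columns) used in this draft; the helper name is made up and this is not the actual EXtra-data implementation.

```python
import h5py

def reduction_sources_from_metadata(f: h5py.File) -> frozenset:
    """Hypothetical helper: rebuild the REDUCTION source set on demand
    from METADATA/dataSources rather than storing it in the cache."""
    sources = set()
    md = f['METADATA/dataSources']
    for category, h5_source in zip(md['root'][:], md['deviceId'][:]):
        category = category.decode() if isinstance(category, bytes) else category
        h5_source = h5_source.decode() if isinstance(h5_source, bytes) else h5_source
        if h5_source and category == 'REDUCTION':
            sources.add(h5_source)
    return frozenset(sources)
```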
Makes sense. I'll remove it from caching.
@@ -418,6 +427,10 @@ def index_groups(self, source):
            return {''}
        elif source in self.instrument_sources:
            return set(self.file[f'/INDEX/{source}'].keys())
        elif source in self.reduction_data:
            return set(self.file[f'/INDEX/{source}'].keys())
        elif source in self.errata:
I understand from our example that REDUCTION sources are always prefixed with some category (e.g. `REDUCTION/PULSE_SLICING/<XTDF-source>/...`). What would ERRATA sources look like? All the reader machinery assumes that source names are unique across all root sections.
There's nothing concrete implemented in the DAQ yet, but some form of source identification is required. Think of a train that arrived outside the buffer range and couldn't be stored: you'd want to know what the data was, so it would be prefixed by the source in the path.
Also, technically all this additional data is FAST data for the DAQ, i.e. it will be stored only if a train contains it.
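As a rough illustration of what "stored only if a train contains it" means on the reader side, the sketch below checks per-train presence via the usual INDEX layout (`INDEX/trainId` plus `first`/`count` per source). The source and index-group names are placeholders, not part of any agreed format.

```python
import h5py

def trains_with_entries(path, source, index_group):
    """Return the train IDs for which `source` (e.g. an ERRATA or REDUCTION
    entry) actually recorded data, based on its per-train INDEX counts."""
    with h5py.File(path, 'r') as f:
        train_ids = f['INDEX/trainId'][:]
        counts = f[f'INDEX/{source}/{index_group}/count'][:]
    return train_ids[counts > 0]
```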
Certainly, my question was indeed more concretely about how you would pick the source name. Say `SA3_XTD10_XGM/XGM/DOOCS:output` has a hiccup and sends a late train; under what source name (i.e. the path below ERRATA) would you save this? It must be different from `SA3_XTD10_XGM/XGM/DOOCS:output` by at least a prefix or suffix.
Furthermore, I had a look into your example file for data reduction. There are INDEX entries for
- `PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf`
- `PCLAYER_CI/LFF/DATA_TEST_1:reductionOutput`
- `PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf`

with a corresponding entry in INSTRUMENT for the first and in REDUCTION for the latter two. Is `PCLAYER_CI/LFF/DATA_TEST_1` here a real device used as input for the data aggregator, or also some virtual source used for metadata tracking? Ideally, sources in REDUCTION or ERRATA would never be identical to actual sources potentially occurring in CONTROL or INSTRUMENT.
So in the reduction case, I prefix all source names with PULSE_REDUCTION to make them distinct from the "real" sources; one could do the same for ERRATA, e.g. `ERRATA/LATE_TRAINS/source/name/and/data/path/`.
Hence, for what you observe in the index section:
- `PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf` is a real source
- `PCLAYER_CI/LFF/DATA_TEST_1:reductionOutput` is a real source
- `PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf` is injected by the pulse dimension reducer, i.e. it is an artificial source.
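For illustration, a small sketch of how a reader could separate real from DAQ-injected sources purely by this prefix convention; the prefix list is taken from the examples in this thread and is not a finalised set.

```python
# Illustrative only: split a source name into (artificial prefix, remainder).
ARTIFICIAL_PREFIXES = ('PULSE_REDUCTION/', 'ERRATA/')

def split_source(name: str):
    """Return (prefix, remainder); prefix is '' for real sources."""
    for prefix in ARTIFICIAL_PREFIXES:
        if name.startswith(prefix):
            # ERRATA entries may carry a further category, e.g. LATE_TRAINS/.
            return prefix.rstrip('/'), name[len(prefix):]
    return '', name

split_source('PULSE_REDUCTION/PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf')
# -> ('PULSE_REDUCTION', 'PCLAYER_CI/DAQ/DET_DATA_TEST_1:xtdf')
```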
        elif category == 'REDUCTION':
            reduction_info.add(h5_source)
        elif category == 'ERRATA':
            errata.add(h5_source)
        elif category == 'Karabo_TimerServer':
            # Ignore virtual data source used only in file format
            # version 1.1 / pclayer-1.10.3-2.10.5.
            pass
        else:
            raise ValueError("Unknown data category %r" % category)
So this bit is probably the big sore point of this proposal: pretty much all prior versions of EXtra-data (or even karabo_data, for that matter) will break with these new files.
Since this only affects the METADATA/dataSources entry but not the actual existence of the root groups, one way around this would be to change how these sources are listed:
- Not list them at all and instead discover them by iterating over the root group (a rough sketch of this option follows below this list)
- List them in a distinct dataset only for these virtual metadata sources
- List both the real and virtual sources in a new dataset while preserving the old one for compatibility
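A rough sketch of the first option, assuming only that old readers follow METADATA/dataSources and never scan the root group themselves; the function name and the set of "known" sections are mine, not EXtra-data code.

```python
import h5py

KNOWN_SECTIONS = {'CONTROL', 'RUN', 'INSTRUMENT', 'INDEX', 'METADATA'}

def discover_extra_sections(path):
    """Find root groups beyond the classic sections (e.g. REDUCTION, ERRATA)
    and list their top-level entries, without touching METADATA."""
    with h5py.File(path, 'r') as f:
        return {name: sorted(f[name]) for name in f if name not in KNOWN_SECTIONS}
```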
Yes, though I don't see any way around it, except if we add it below METADATA after all. Generally, one should possibly make this check less restrictive in my view, as it could hit us again in the future.
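A sketch of what "less restrictive" could look like, mirroring the dispatch in the diff above but warning on unknown categories instead of raising; variable and function names are illustrative and this is not the merged code.

```python
import warnings

def classify_source(category, h5_source, control, instrument, reduction, errata):
    """Permissive version of the category dispatch: unknown categories are
    skipped with a warning, so a future root section does not break readers."""
    if category == 'CONTROL':
        control.add(h5_source)
    elif category == 'INSTRUMENT':
        instrument.add(h5_source)
    elif category == 'REDUCTION':
        reduction.add(h5_source)
    elif category == 'ERRATA':
        errata.add(h5_source)
    elif category == 'Karabo_TimerServer':
        pass  # Virtual source only present in file format version 1.1.
    else:
        warnings.warn("Ignoring unknown data category %r" % category)
```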
Do you have some examples of the structure of what goes in the new sections? To be honest, this doesn't look like a simple solution from my perspective. We need significant changes to EXtra-data just to let it open files with these additions (I don't think the changes here are actually sufficient), we're stuffing an extra layer of info into "source names", and we likely want to introduce some category of hidden sources, since these aren't really like the experimental data. 😕 Putting it into metadata made more sense to me, but I can envisage other ways to do it. Could I have a look at some examples and come back with ideas?
@takluyver - there are sample files on Maxwell: /gpfs/exfel/data/scratch/haufs/daq_online_reduction/
Btw, putting it in METADATA is fine for me; I would then nevertheless like to introduce a substructure as a second level underneath. However, having a convenient way to read the info via EXtra-data is beneficial in my view. Technically, it's like fast data in how the DAQ stores it, so in my view much of the existing machinery there could be reused.
Main point: We'd really like these additions to happen in a way that doesn't break old versions of EXtra-data (we know people use old versions and will complain to us), and we think that should be easy with a small adjustment to the structure. We can add the new top-level sections, but rather than adding them to METADATA/dataSources, they should be listed separately, so that older readers simply ignore them. I appreciate you're trying to get this in very soon, but if it is possible to change this, we'd really appreciate it.
Secondary points
Minor correction after discussing it with Steffen: Let's call the new source listing
Just also had an offline discussion with @philsmt - I took from this that EXtra-data could then provide access to these with a to-be-developed interface, which closely resembles iterating through INSTRUMENT-like data sources. Would this be within the points you've outlined above, @takluyver?
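Purely as a hypothetical usage sketch of such an interface: neither `reduction_sources` nor anything like it exists in EXtra-data at this point, and the run path is a placeholder.

```python
from extra_data import RunDirectory

run = RunDirectory('/path/to/run')          # placeholder path
for name in sorted(run.reduction_sources):  # hypothetical attribute
    print(name)                             # e.g. PULSE_REDUCTION/<source>:xtdf
```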
Yep, I think that works. I've just copied your example file and modified it (in …)
Thanks. I've added a …
It looks like the …
I think it's a bug in my changes. I'll check.
Should be fixed now. I've uploaded an updated version of the same file.
Thanks! I might close this PR now that we have a design that doesn't require changes in EXtra-data. We'll still want to provide a nice API to get at the reduction information (& errata - I need to better understand what goes there), but it's somewhat lower priority now. We probably won't add the reduction sources to the JSON cache file. But on the other hand we probably should expose them through virtual overview files, as otherwise they'll be somewhat hidden for runs with a virtual overview file.
I've opened #477 to track creating APIs for these new sections.
@philsmt - a draft that introduces the additional sections in the H5 structure we had discussed yesterday. We can shape and rename this as needed. For now this branch is to get the CI on the DAQ going again with files that contain the additional sections.