Requirements Engineering for CAP
Version: V0.1
Status: beginning of March 2015, first round of requirements engineering
Authors V0.1: GS-SIS, IT-CIS and LHC collaborations
Preamble: The CERN Analysis Preservation Framework (CAP) aims at capturing the individual components of an analysis, from the primary datasets to the final publication. This includes the materials used and produced within the individual analysis steps: code, derived datasets, additional information/metadata, presentations, publications. It will provide an easy-to-use interface to insert metadata and to “plug in” (harvest, link) existing resources such as experiment-internal databases. That way it can be integrated easily into the analysis workflow and reduces the burden on the researchers’ side. The tool will be set up with a search interface (internal or public, depending on the access restrictions defined by the respective collaboration).
Note: This is the first draft describing the specifications and expectations for the Analysis Preservation Framework. More iterations will follow based on the forthcoming plenary and parallel one-to-one meetings. Please suggest changes to this document by using the “comment” functionality.
ALICE: The LEGO train system that ALICE uses for analysis is restricted to ALICE users. It stores several kinds of (persistent) metadata. Anyone can restart a previous analysis with exactly the same configuration and then change the input datasets or the version of the software. A snapshot with the available information for a given train ID is available.
ID of analysis trains from the ALICE LEGO train system corresponding to a given analysis. These trains fully document the input/output datasets, analysis code and macros used, versions, and processing details (statistics, triggers, processing time, …).
Explicit analysis procedure:
- physics background, observables, cuts, dependencies on other analyses, datasets used (real data and simulations), conditions used for simulated datasets
- Documentation, relevant discussions and decisions from the responsible physics working group (PWG) for the given analysis.
- Links to publications and conference proceedings
ATLAS:
- [work in progress, more details and pointers to come]
- high-level analysis metadata from Glance
- author information (from Glance and/or CDS)
- analysis information
- abstract of paper/note
- editorial board membership for the conference note and/or paper
- links to public documents (conference notes and papers)
- links to internal supporting notes
- link to egroup mailing list
- link to indico for approval talks
- Production system information (multiple entries)
- dataset IDs for the final derived datasets from the production system (usually D3PDs [= plain ROOT files, containing only ROOT TTrees and/or histograms])
- via AMI database, the dataset IDs in the provenance of those datasets can be obtained
- via AMI database, the tag for the software used in the Grid Production system can be obtained (both are demoed in Lukas’s tool)
- information about lumiblocks, streams/triggers, data quality flags, etc.
- k-factors, cross-sections, etc. used for Monte Carlo samples (these come from a variety of sources: they are in AMI, but that is often not the authoritative source; instead, they often come from twiki pages etc.). This should be an optional field for each dataset ID.
- Post-production system and event selection information (multiple entries)
- e.g. the user generated content
- a text summary of additional steps after main production system since this can vary significantly
- details (ideally the code) of any additional transformations that produce an analysis-specific format, e.g. AOD -> custom ntuple or D3PD -> custom ntuple, that were not produced in the standard production system
- List of tags for ROOTCore packages used in end-user analysis (for scale factors, calibration, systematic evaluation, etc.)
- the actual custom ntuples
- final event selection and histogram filling code
- Data at the level of Statistical analysis
- HistFactory / HistFitter configuration and input files if applicable
- RooFit workspace fed into the RooStats tools for limits, significance, etc.
Ideally the production-system and post-production-system parts can be made into a module/data structure, so that one analysis might have several of these; a sketch of such a structure follows below. Often one analysis has a team of people studying different topics that only come together at the level of the statistical analysis. During those steps, people might be using totally different software, etc.
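As an illustration only (not an agreed ATLAS schema), a minimal Python sketch of such a repeatable module; every field name here is hypothetical and simply mirrors the items listed above:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProductionInfo:
    """Production-system part of one analysis stream (field names are hypothetical)."""
    derived_dataset_ids: List[str] = field(default_factory=list)    # final derived datasets (e.g. D3PDs)
    provenance_dataset_ids: List[str] = field(default_factory=list) # obtained via the AMI database
    software_tags: List[str] = field(default_factory=list)          # Grid production software tags (via AMI)
    run_conditions: Optional[str] = None                            # lumiblocks, streams/triggers, DQ flags
    mc_normalisation: dict = field(default_factory=dict)            # optional k-factors, cross-sections per dataset ID


@dataclass
class PostProductionInfo:
    """Post-production / event-selection part of one analysis stream (field names are hypothetical)."""
    summary: str = ""                                                # free-text summary of additional steps
    transformation_code: List[str] = field(default_factory=list)    # e.g. AOD -> custom-ntuple code
    rootcore_package_tags: List[str] = field(default_factory=list)  # packages for scale factors, calibration, ...
    custom_ntuples: List[str] = field(default_factory=list)         # locations of the actual custom ntuples
    selection_code: List[str] = field(default_factory=list)         # final event selection / histogram filling


@dataclass
class AnalysisStream:
    """One production + post-production 'stream'; an analysis record may hold several."""
    production: ProductionInfo = field(default_factory=ProductionInfo)
    post_production: PostProductionInfo = field(default_factory=PostProductionInfo)
```

A statistical-analysis block (HistFactory/HistFitter configuration, RooFit workspace) would then sit alongside a list of such streams in the full analysis record.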
CMS:
- Analysis information for old analyses: batch import from CADI once we have access to its API; we will see what we get out of it, and K. will try to fill in examples based on her Excel sheet
- analysis code (GROUPCODE-YEAR-NUMBER)
- title (may differ from the title of the final publication)
- description
- links to the publications (PAS or paper) in CDS, arXiv
- other: contact person, analysis review committee, twiki (usually used for tracking the comments and answers during the approval procedure), HyperNews discussion during the approval...
- Requirements for the Physics Information box:
- start with the final state particles:
- expand the list of particles to include, e.g., cosmic muons, dimuons, etc.
- remove the number of particles
- add information on Opposite Sign/Same Sign
- add an invariant mass range (can be inclusive or exclusive)
- include so-called "vetoes" (attributes that the final-state particles must not have)
- how to represent that, e.g., e/mu as final-state particles is actually two analyses combined (with different attributes) and not one?
- we can have two or more parallel entries, each belonging to the analysis but having different attributes; e.g. a di-lepton analysis contains an e+e- analysis and a mu+mu- analysis with different cuts on the final state, different primary datasets and different triggers (see the sketch after this list)
- this division of an entry into two (or more) parts will persist throughout the subsequent information boxes (i.e. post-AOD processing)
- option to clone a final-state particle entry, and thus less typing if a lot of attributes are the same
- how to include the option of having just tracks (to cover especially old performance papers) without confusing the menu for the “normal” papers
- do we need different forms for different kinds of papers (performance, search, measurement)? Is using them as keywords enough? This aspect will need to be tested with users.
- The primary dataset can be queried from DAS and offered as a drop-down, as it is a defined list.
- Depending on the primary dataset chosen, the trigger selection can be narrowed down as well and presented in a drop-down
- however, an analysis can use different sorts of triggers depending on the datasets used (e.g. electron triggers or muon triggers) or on the luminosity conditions (i.e. run number)
- Requirements for the post-AOD processing boxes:
- the data management team in CMS would be happy to use this area to record information about analysis practices and data movements
- information for the different processing steps is readily available when the jobs are run with CRAB and could be automatically collected
- input file names, output file names (and location), configuration file...
- this would amount to a fair amount of information about unsuccessful, repeated or superseded steps in the analysis, which are not necessarily of interest for analysis preservation but would be very useful input for data management. This should not be a problem as long as the files themselves are not harvested
- having this information automatically filled in would make end-users happier, as the “analysis preservation” action in the end would consist not of entering the information about all processing steps after they were completed, but rather of selecting, from the existing automatically filled information, the parts relevant for analysis preservation
- for the CRAB jobs, users would be asked to indicate the analysis code so that the information can be directed to the right entry
- include links to presentations and papers
- links to code/software, twiki, HyperNews, publications (CDS, INSPIRE) and talks (Indico)
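As an illustration of the parallel final-state entries mentioned above (the sketch referenced in the Physics Information box list), one possible shape of such a record; all field names and values are hypothetical:

```python
# Hypothetical sketch of a CMS physics-information record with parallel
# final-state entries (e+e- and mu+mu-), each carrying its own cuts,
# primary dataset and triggers.  Field names and values are illustrative only.
dilepton_analysis = {
    "cadi_id": "GROUPCODE-YEAR-NUMBER",
    "final_state_entries": [
        {
            "particles": ["e+", "e-"],
            "sign": "opposite-sign",
            "invariant_mass_range_gev": [60, 120],     # illustrative range
            "vetoes": ["additional isolated lepton"],  # illustrative veto
            "primary_dataset": "",                     # chosen from the DAS-backed drop-down
            "triggers": ["electron triggers"],
        },
        {
            "particles": ["mu+", "mu-"],
            "sign": "opposite-sign",
            "invariant_mass_range_gev": [60, 120],
            "vetoes": ["additional isolated lepton"],
            "primary_dataset": "",                     # chosen from the DAS-backed drop-down
            "triggers": ["muon triggers"],
        },
    ],
}
```

Cloning the first entry and changing only the differing attributes would cover the “less typing” requirement, and the same two-entry division would then be carried through the post-AOD processing boxes.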
LHCb
- The information leading to the following items should be captured (a sketch of the resulting record follows the list below)…
- [dataset DST selection]
  - description
  - trigger
  - stripping line
  - input data
    - data
      - year
      - reconstruction software
      - stripping software
      - location [bookkeeping path or upload file list]
    - MC
      - MC production
      - reconstruction software
      - stripping software
      - location [bookkeeping path or upload file list]
- code
  - platform
  - LHCb code/version
  - user’s code
    - Link
    - How to’s, instructions, config
- Output data
  - Data
  - MC
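A minimal sketch of the outline above expressed as a nested record (purely illustrative; the field names are hypothetical and not an agreed LHCb schema):

```python
# Illustrative only: the LHCb capture outline above expressed as a nested record.
lhcb_analysis_record = {
    "dst_selection": {
        "description": "",
        "trigger": "",
        "stripping_line": "",
        "input_data": {
            "data": {
                "year": None,
                "reconstruction_software": "",
                "stripping_software": "",
                "location": "",   # bookkeeping path or uploaded file list
            },
            "mc": {
                "mc_production": "",
                "reconstruction_software": "",
                "stripping_software": "",
                "location": "",   # bookkeeping path or uploaded file list
            },
        },
    },
    "code": {
        "platform": "",
        "lhcb_code_version": "",
        "user_code": {
            "link": "",
            "instructions": "",   # how-to's, instructions, configuration
        },
    },
    "output_data": {
        "data": [],
        "mc": [],
    },
}
```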
ALICE:
- ALICE analysis trains LEGO system [detailed specs to come after meeting with SDT and PH]. These trains fully document the input/output datasets, analysis code and macros used, versions, and processing details (statistics, triggers, processing time, …)
- Publications and their available metadata: A lot of this information is held by the physics working groups that performed the given analysis, but there are surely other levels in the publication procedure
- Public or internal wiki pages or PWG mailing lists
⇒ Question: we need to understand if this content is referenced/linked in publications, presentations, approval procedures…. This will be done in the next face-to-face meeting with M.
ATLAS:
[work in progress, more details and pointers to come; see also details above]
- AMI
- New: derivation framework
- CVMFS, SVN
- INDICO
- CDS, INSPIRE/HEPdata
CMS:
- CADI: PI information, basic analysis information (ID), publication information
- DAS: metadata about data
- INDICO: talks (little metadata, access restricted in many parts)
- different locations for code, e.g. github
- trigger information can be extracted from various sources
- intermediate data files in various locations (user defined)
- see also doi:10.1088/1742-6596/119/7/072013 for details on individual systems
LHCb:
- Analysis wiki page
- Analysis internal notes (CDS)
- Analysis mailing list
- The code is in user areas or archived in Urania (high-level physics analysis software repository for the LHCb experiment).
- Presentations at the working groups (Indico)
ALICE: Essentially documentation, links to datasets, tables, lists (e.g. good runs for the analysis) and web pages. The information will be essentially generic metadata, but also outputs (such as histograms in ROOT format).
⇒ Remark: We need metadata examples. This will be done in a separate meeting with M.
ATLAS:
[work in progress, more details and pointers to come]
- For the tools mentioned above:
- metadata to data, links to “primary” datasets
- standardized xAOD
- A workflow is needed to easily integrate user-generated files
CMS:
- DAS: with JSON output and an API (a harvesting sketch follows after this list)
- contains the metadata of all CMS datasets
- this is where the metadata for the public primary datasets were extracted (manually, but an API is available)
- in the analysis information in DAPF, the primary datasets should be (or should have been, if deleted) entries in this catalogue
- in CMS, it gives the necessary information to access the datasets (the listing of filenames)
- see also doi:10.1016/j.procs.2010.04.172
- access limited to CMS members
- CADI with API
- user-defined data files: 10k per year, small sizes (1 GB max per person; 200-500 persons)
- could be in a non-EDM format, but certainly ROOT-based
- code:
- cmssw: open source, on GitHub
- the CMS software package (which does not necessarily contain the code used for the actual analysis)
- c++, python, documentation
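As a sketch of what automatic harvesting from DAS could look like (the use of the dasgoclient command-line tool and its -query/-json flags is an assumption here; the actual CMS client and options in use should be checked):

```python
import json
import subprocess


def fetch_primary_dataset_metadata(dataset_name: str) -> list:
    """Query DAS for the metadata of a primary dataset.

    Assumes a DAS command-line client (here dasgoclient) is available and
    that the -query/-json options behave as sketched; adjust to the client
    actually deployed.
    """
    result = subprocess.run(
        ["dasgoclient", "-query", f"dataset dataset={dataset_name}", "-json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


# Hypothetical usage: pre-fill the primary-dataset entry of a CAP record.
# records = fetch_primary_dataset_metadata("/SomePrimaryDataset/SomeEra/AOD")
```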
LHCb:
- DST datasets are tens of TB in size; final ntuples are tens of GB.
- What metadata could be extracted automatically? For the DSTs, information on the software version used to process the data can be extracted via the link to the bookkeeping. Other information could be extracted from the analysis wikis, but the structure is not always the same.
ALICE:
- Keep all information, which is currently scattered, in a self-contained way
- A way to output all analysis metadata in a common format, allowing experiment-specific plugins to read it and perform specific actions (for ALICE we could automatically generate a LEGO analysis train, pointing to the right software and datasets). In some particular cases we could have such a plugin reading the metadata in our open-access VM environment.
⇒ Question: could you give an example of a specific plug-in, or of how a use case would look? [separate meeting will follow]
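Purely as an illustration of the plugin idea, and not an answer to the question above: a minimal sketch of what an experiment-specific plugin reading a common metadata record might look like; all class, method and field names are invented here:

```python
# Illustrative sketch only: a hypothetical plugin interface for acting on a
# common-format CAP metadata record.
class CapPlugin:
    """Base interface: read a common-format metadata record, perform an action."""

    def run(self, record: dict) -> None:
        raise NotImplementedError


class AliceLegoTrainPlugin(CapPlugin):
    """Hypothetical ALICE plugin: turn a CAP record into a LEGO train configuration."""

    def run(self, record: dict) -> None:
        train_config = {
            "software_version": record.get("software_version"),
            "input_datasets": record.get("input_datasets", []),
            "analysis_macros": record.get("analysis_macros", []),
        }
        # In a real plugin this configuration would be submitted to the LEGO
        # train system (or instantiated inside the open-access VM environment).
        print("Would configure a LEGO train with:", train_config)
```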
ATLAS:
[work in progress, more details and pointers to come]
- as automated as possible
- “user clicks buttons to connect CAP with existing ATLAS databases and so content is automatically retrieved”
- possibly the process could be started from the shell
CMS:
- search functions, e.g. over intermediate data
- a central access point for user-created content (which is currently only available upon request, if you know where it is or whom to ask)
- a prominent use case for CMS is these “user generated files”: they could be made much more accessible and “reusable” if they were preserved and searchable via CAP
- we would like to have automatic retrieval of the main contents
- integration with CRAB and the CMS dashboard to start capturing information as early in the process as possible; this potentially also means capturing analysis steps that are not followed up
- cataloguing options
LHCb:
- Same as CMS.
- We would also like to archive (and then easily retrieve) the computing environment of the analysis (or at least of the final steps).
- It would be wonderful if I could click on “Rerun this analysis” and open a VM image with all of the analysis environment (platform, LHCb code, ROOT version, ...) properly set up, where I could launch the scripts/code prepared by the authors of the analysis (and archived in this framework).
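As a sketch of the kind of environment information such a “Rerun this analysis” button would need to have archived (field names and values are hypothetical; how the VM image is actually instantiated is left open):

```python
# Illustrative only: an environment descriptor a CAP record could store so
# that the analysis environment can be re-instantiated later.
environment_snapshot = {
    "platform": "",                   # e.g. the platform string the analysis was built for
    "lhcb_software": {},              # LHCb applications and versions used
    "root_version": "",               # ROOT version used
    "vm_or_container_image": "",      # reference to the archived VM/container image
    "entry_point": "",                # hypothetical script archived with the analysis
}
```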
ALICE:
- The access for committing to the site should be restricted to ALICE.
- For reading, it depends:
- for already published analyses (which is what I expect), access should be open to everyone.
- if we submit materials used in ongoing analyses, access should be restricted.
ATLAS:
- Access to content is usually restricted.
- However, if CAP supports CERN SSO and Grid certificates, this should be fairly compatible with existing ATLAS access schemes.
CMS:
- access restrictions apply to most content, tools and functionalities on DAPF
- possibly even access restrictions within the collaboration
- it would be good to have an easy option to push content to CODP
LHCb:
- access restrictions apply to most content (even if an analysis is published, the code, internal notes, etc. are usually not public). Material which is already public (papers, talks) should instead be visible to everybody
ATLAS:
- should follow ATLAS analysis practices
- Auto-import functionality is appreciated, to reduce the burden of filling in information manually
CMS:
- Need to make sure there is no extra burden on the researcher; the tool thus has to follow existing analysis practices. Integrate as much as possible into the research and approval tools and workflows
- Autocomplete whenever possible, including more complex routines, i.e. when selecting xyz, the following fields already pre-select certain elements…
- Show controlled vocabularies if available
LHCb:
- The interface needs to follow the flow of the analysis; otherwise it is difficult to compile the information and for a user to reproduce the analysis.
- LHCb example:
- step 1 selection of events on DST
- step 2 train BDT and add BDT variables to the ntuple
- step 3 fit
- Steps n+1 depend on the specific analysis; the framework should be flexible enough to accommodate this. They will have the same structure. The user shall be able to add them upon request (+1)
- helpful to start early in the research process, e.g. when an analysis is started
- should be embedded into the normal research practices
- Need to investigate how the tool could be embedded into individual analysis or publication approval procedures
- there is a lot of meta-information in the experiments’ tools which can be used to pre-populate CAP; e.g. CMS-DAS, ALICE-LEGO, LHCb-?, ATLAS-AMI or Derivation Framework?
- challenge: map the data models (see the sketch after this list)
- automated extraction?
- versioning?
- a search interface is needed to provide an “extra” service on top of the existing databases
- i.e. to know who has done what with the data/software/MC
- i.e. for data derived from primary datasets
- VM: everyone needs/wants a snapshot of the current VM. How to implement this?
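To make the data-model mapping challenge concrete (referenced in the list above), a hypothetical sketch of how a few common CAP fields might map onto the experiment-specific sources mentioned in this document; the mapping is illustrative, not agreed:

```python
# Illustrative only: candidate mapping of common CAP fields to the
# experiment tools that could pre-populate them.
field_sources = {
    "primary_datasets":  {"CMS": "DAS", "ATLAS": "AMI", "ALICE": "LEGO train metadata", "LHCb": "bookkeeping"},
    "software_versions": {"CMS": "CRAB job configuration", "ATLAS": "AMI tags", "ALICE": "LEGO train metadata", "LHCb": "bookkeeping"},
    "publication_links": {"CMS": "CADI", "ATLAS": "Glance/CDS", "ALICE": "PWG pages", "LHCb": "CDS"},
}
```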
In conclusion, CAP needs to highlight the “value added” on top of existing tools: a powerful and usable search, and making "user generated" content accessible and findable. It should be easy to track from the primary datasets to the user-generated content, linking primary data with user-generated content and potentially feeding this back to the experiment-specific tools.
⇒ large improvement in discoverability
Potentially an easier approval process: all the relevant information needed is available via "one click".
ALICE LEGO: need to understand what can be extracted automatically and what needs to be added manually by the user or by us [functionality similar to CMS DAS? - this will be studied in a separate meeting with M.]
LHCb [bookkeeping database or train system] : need to understand what can be extracted automatically and what needs to be added manually by the user or us
- is the functionality similar to CMS DAS?
- ROOT everywhere - but where exactly? Interest in having “common” visualization tools on top?
- Where do we make public content accessible? Via CODP or via an open search on CAP?
- API: need to investigate how “pipes” can be built
- technical (IT-CIS)
- metadata (GS-SIS): standard example from each product/pipe needed
- once we have investigated the pipes in more detail, we will do a round of submission-form adjustments
- what will be prefilled, how?
- check dropdowns (how to update the controlled vocabularies?), check the order [for each of the experiments] [early summer]
- run first usability tests with power users and “preservation experts” on clickable prototypes [mid summer]. Note that this does not mean that this will be a production-ready system by then.
- test with first “new” users from the experiments on clickable prototypes [late summer/autumn]
- specification of back end functionalities and corresponding tasks.
In particular, it would be great to have additional support from you on the following tasks in the next weeks:
- what are the core elements of the analysis (steps) across experiments? [spring, with DASPOS and DPHEP?]
- are there simple “core elements” which are common across the collaborations?
- what is different?
- develop data models, which take into account the above [late spring, with DASPOS?]
- provide feedback on data models [with collaborations]
- workshop to build and discuss the core analysis elements and data model?
- API questions and potential adjustments [DPHEP?]
We are happy to support any internal discussions in regard to the CAP service.