Requirements Engineering for CAP
Version: V0.1
Status: beginning of March 2015, first round of requirements engineering
Authors V0.1: GS-SIS, IT-CIS and LHC collaborations
Preamble: The CERN Analysis Preservation Framework (CAP) aims at capturing the individual components of an analysis, from the primary datasets to the final publication. This includes the materials used and produced within the individual analysis steps: code, derived datasets, additional information/metadata, presentations, publications. It will provide an easy-to-use interface to insert metadata and to “plug in” (harvest, link) existing resources such as experiment-internal databases. That way it can be integrated easily into the analysis workflow and reduces the burden on the researchers’ side. The tool will be set up with a search interface (internal or public, depending on the access restrictions defined by the respective collaboration).
Note: This is the first draft describing the specifications and expectations for the Analysis Preservation Framework. More iterations will follow based on the forthcoming plenary and parallel one-to-one meetings. Please suggest changes to this document by using the “comment” functionality.
ALICE: The LEGO train system that ALICE uses for analysis is restricted to ALICE users. It stores several kinds of (persistent) metadata. Anyone can restart a previous analysis with exactly the same configuration and then change the input datasets or the version of the software. A snapshot with the available information for a given train ID is available.
ID of analysis trains from the ALICE LEGO train system corresponding to a given analysis. These trains fully document the input/output datasets, analysis code and macros used, versions, and processing details (statistics, triggers, processing time, …).
Explicit analysis procedure:
- physics background, observables, cuts, dependencies on other analyses, datasets used (real data and simulations), conditions used for simulated datasets
- Documentation, relevant discussions and decisions from the responsible physics working group (PWG) for the given analysis.
- Links to publications and conference proceedings
ATLAS:
- [work in progress, more details and pointers to come]
- high-level analysis metadata from Glance
- author information (from Glance and/or CDS)
- analysis information
- abstract of paper/note
- editorial board membership for the conference note and/or paper
- links to public documents (conference notes and papers)
- links to internal supporting notes
- link to egroup mailing list
- link to indico for approval talks
- Production system information (multiple entries)
- dataset IDs for the final derived datasets from the production system (usually D3PDs [= plain ROOT files, containing only ROOT TTrees and/or histograms])
- via AMI database, the dataset IDs in the provenance of those datasets can be obtained
- via AMI database, the tag for the software used in the Grid Production system can be obtained (both are demoed in Lukas’s tool)
- information about lumiblocks, streams/triggers, data quality flags, etc.
- k-factors, cross-sections, etc. used for Monte Carlo samples (these come from a variety of sources: they are in AMI, but that is often not the authoritative source; instead, they often come from twiki pages etc.). This should be an optional field for each dataset ID.
- Post-production system and event selection information (multiple entries)
- e.g. the user generated content
- a text summary of additional steps after main production system since this can vary significantly
- details (ideally the code) of any additional transformations that produce an analysis-specific format, e.g. AOD -> custom ntuple or D3PD -> custom ntuple, that were not produced in the standard production system
- List of tags for ROOTCore packages used in end-user analysis (for scale factors, calibration, systematic evaluation, etc.)
- the actual custom ntuples
- final event selection and histogram filling code
- Data at the level of Statistical analysis
- HistFactory / HistFitter configuration and input files if applicable
- RooFit workspace fed into the RooStats tools for limits, significance, etc.
Ideally the production-system and post-production-system parts can be made into a module/data structure, so that one analysis might have several of these; a sketch of such a structure follows below. Often one analysis has a team of people studying different topics that only come together at the level of the statistical analysis. During those steps, people might be using totally different software, etc.
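As an illustration only (not an agreed ATLAS schema), a minimal Python sketch of such a repeatable module; every field name here is hypothetical and simply mirrors the items listed above:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProductionInfo:
    """Production-system part of one analysis stream (field names are hypothetical)."""
    derived_dataset_ids: List[str] = field(default_factory=list)    # final derived datasets (e.g. D3PDs)
    provenance_dataset_ids: List[str] = field(default_factory=list) # obtained via the AMI database
    software_tags: List[str] = field(default_factory=list)          # Grid production software tags (via AMI)
    run_conditions: Optional[str] = None                            # lumiblocks, streams/triggers, DQ flags
    mc_normalisation: dict = field(default_factory=dict)            # optional k-factors, cross-sections per dataset ID


@dataclass
class PostProductionInfo:
    """Post-production / event-selection part of one analysis stream (field names are hypothetical)."""
    summary: str = ""                                                # free-text summary of additional steps
    transformation_code: List[str] = field(default_factory=list)    # e.g. AOD -> custom-ntuple code
    rootcore_package_tags: List[str] = field(default_factory=list)  # packages for scale factors, calibration, ...
    custom_ntuples: List[str] = field(default_factory=list)         # locations of the actual custom ntuples
    selection_code: List[str] = field(default_factory=list)         # final event selection / histogram filling


@dataclass
class AnalysisStream:
    """One production + post-production 'stream'; an analysis record may hold several."""
    production: ProductionInfo = field(default_factory=ProductionInfo)
    post_production: PostProductionInfo = field(default_factory=PostProductionInfo)
```

A statistical-analysis block (HistFactory/HistFitter configuration, RooFit workspace) would then sit alongside a list of such streams in the full analysis record.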
CMS:
- Analysis information for old analyses: batch import from CADI once we have access to its API; we will see what we get out of it, and K. will try to fill in examples based on her Excel sheet
- analysis code (GROUPCODE-YEAR-NUMBER)
- title (may differ from the title of the final publication)
- description
- links to the publications (PAS or paper) in CDS, arXiv
- other: contact person, analysis review committee, twiki (usually used for tracking the comments and answers during the approval procedure), HyperNews discussion during the approval...
- Requirements for the Physics Information box:
- start with the final state particles:
- expand the list of particles to include, e.g., cosmic muons, dimuons, etc.
- remove the number of particles
- add information on Opposite Sign/Same Sign
- add an invariant mass range (can be inclusive or exclusive)
- include so-called "vetoes" (attributes that the final-state particles must not have)
- how to represent that, e.g., e/mu as final-state particles is actually two analyses combined (with different attributes) and not one?
- we can have two or more parallel entries, each belonging to the analysis but having different attributes; e.g. a di-lepton analysis contains an e+e- analysis and a mu+mu- analysis with different cuts on the final state, different primary datasets and different triggers (see the sketch after this list)
- this division of an entry into two (or more) parts will persist throughout the subsequent information boxes (i.e. post-AOD processing)
- option to clone a final-state particle entry, and thus less typing if a lot of attributes are the same
- how to include the option of having just tracks (to cover especially old performance papers) without confusing the menu for the “normal” papers
- do we need different forms for different kinds of papers (performance, search, measurement)? Is using them as keywords enough? This aspect will need to be tested with users.
- The primary dataset can be queried from DAS and offered as a drop-down, as it is a defined list.
- Depending on the primary dataset chosen, the trigger selection can be narrowed down as well and presented in a drop-down
- however, an analysis can use different sorts of triggers depending on the datasets used (e.g. electron triggers or muon triggers) or on the luminosity conditions (i.e. run number)
- Requirements for the post-AOD processing boxes:
- the data management team in CMS would be happy to use this area to record information about analysis practices and data movements
- information for the different processing steps is readily available when the jobs are run with CRAB and could be automatically collected
- input file names, output file names (and location), configuration file...
- this would amount to a fair amount of information about unsuccessful, repeated or superseded steps in the analysis, which are not necessarily of interest for analysis preservation but would be very useful input for data management. This should not be a problem as long as the files themselves are not harvested
- having this information automatically filled in would make end-users happier, as the “analysis preservation” action in the end would consist not of entering the information about all processing steps after they were completed, but rather of selecting, from the existing automatically filled information, the parts relevant for analysis preservation
- for the CRAB jobs, users would be asked to indicate the analysis code so that the information can be directed to the right entry
- include links to presentations and papers
- links to code/software, twiki, HyperNews, publications (CDS, INSPIRE) and talks (Indico)
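As an illustration of the parallel final-state entries mentioned above (the sketch referenced in the Physics Information box list), one possible shape of such a record; all field names and values are hypothetical:

```python
# Hypothetical sketch of a CMS physics-information record with parallel
# final-state entries (e+e- and mu+mu-), each carrying its own cuts,
# primary dataset and triggers.  Field names and values are illustrative only.
dilepton_analysis = {
    "cadi_id": "GROUPCODE-YEAR-NUMBER",
    "final_state_entries": [
        {
            "particles": ["e+", "e-"],
            "sign": "opposite-sign",
            "invariant_mass_range_gev": [60, 120],     # illustrative range
            "vetoes": ["additional isolated lepton"],  # illustrative veto
            "primary_dataset": "",                     # chosen from the DAS-backed drop-down
            "triggers": ["electron triggers"],
        },
        {
            "particles": ["mu+", "mu-"],
            "sign": "opposite-sign",
            "invariant_mass_range_gev": [60, 120],
            "vetoes": ["additional isolated lepton"],
            "primary_dataset": "",                     # chosen from the DAS-backed drop-down
            "triggers": ["muon triggers"],
        },
    ],
}
```

Cloning the first entry and changing only the differing attributes would cover the “less typing” requirement, and the same two-entry division would then be carried through the post-AOD processing boxes.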
LHCb
- The information leading to the following items should be captured (a sketch of the resulting record follows the list below)…
- [dataset DST selection]
  - description
  - trigger
  - stripping line
  - input data
    - data
      - year
      - reconstruction software
      - stripping software
      - location [bookkeeping path or upload file list]
    - MC
      - MC production
      - reconstruction software
      - stripping software
      - location [bookkeeping path or upload file list]
- code
  - platform
  - LHCb code/version
  - user’s code
    - Link
    - How to’s, instructions, config
- Output data
  - Data
  - MC
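A minimal sketch of the outline above expressed as a nested record (purely illustrative; the field names are hypothetical and not an agreed LHCb schema):

```python
# Illustrative only: the LHCb capture outline above expressed as a nested record.
lhcb_analysis_record = {
    "dst_selection": {
        "description": "",
        "trigger": "",
        "stripping_line": "",
        "input_data": {
            "data": {
                "year": None,
                "reconstruction_software": "",
                "stripping_software": "",
                "location": "",   # bookkeeping path or uploaded file list
            },
            "mc": {
                "mc_production": "",
                "reconstruction_software": "",
                "stripping_software": "",
                "location": "",   # bookkeeping path or uploaded file list
            },
        },
    },
    "code": {
        "platform": "",
        "lhcb_code_version": "",
        "user_code": {
            "link": "",
            "instructions": "",   # how-to's, instructions, configuration
        },
    },
    "output_data": {
        "data": [],
        "mc": [],
    },
}
```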
ALICE:
- ALICE analysis trains LEGO system [detailed specs to come after meeting with SDT and PH]. These trains fully document the input/output datasets, analysis code and macros used, versions, and processing details (statistics, triggers, processing time, …)
- Publications and their available metadata: A lot of this information is held by the physics working groups that performed the given analysis, but there are surely other levels in the publication procedure
- Public or internal wiki pages or PWG mailing lists
⇒ Question: we need to understand if this content is referenced/linked in publications, presentations, approval procedures…. This will be done in the next face-to-face meeting with M.
ATLAS:
[work in progress, more details and pointers to come; see also details above]
- AMI
- New: derivation framework
- CVMFS, SVN
- INDICO
- CDS, INSPIRE/HEPdata
CMS:
- CADI: PI information, basic analysis information (ID), publication information
- DAS: metadata about data
- INDICO: talks (little metadata, access restricted in many parts)
- different locations for code, e.g. github
- trigger information can be extracted from various sources
- intermediate data files in various locations (user defined)
- see also doi:10.1088/1742-6596/119/7/072013 for details on individual systems
LHCb:
- Analysis wiki page
- Analysis internal notes (CDS)
- Analysis mailing list
- The code is in user areas or archived in Urania (high-level physics analysis software repository for the LHCb experiment).
- Presentations at the working groups (Indico)
ALICE: Essentially documentation, links to datasets, tables, lists (e.g. good runs for the analysis) and web pages. The information will be essentially generic metadata, but also outputs (such as histograms in ROOT format).
⇒ Remark: We need metadata examples. This will be done in a separate meeting with M.
ATLAS:
[work in progress, more details and pointers to come]
- For the tools mentioned above:
- metadata to data, links to “primary” datasets
- standardized xAOD
- A workflow is needed to easily integrate user-generated files
CMS:
- DAS: with JSON output and an API (a harvesting sketch follows after this list)
- contains the metadata of all CMS datasets
- this is where the metadata for the public primary datasets were extracted (manually, but an API is available)
- in the analysis information in DAPF, the primary datasets should be (or should have been, if deleted) entries in this catalogue
- in CMS, it gives the necessary information to access the datasets (the listing of filenames)
- see also doi:10.1016/j.procs.2010.04.172
- access limited to CMS members
- CADI with API
- user-defined data files: 10k per year, small sizes (1 GB max per person; 200-500 persons)
- could be in a non-EDM format, but certainly ROOT-based
- code:
- cmssw: open source, on GitHub
- the CMS software package (which does not necessarily contain the code used for the actual analysis)
- c++, python, documentation
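As a sketch of what automatic harvesting from DAS could look like (the use of the dasgoclient command-line tool and its -query/-json flags is an assumption here; the actual CMS client and options in use should be checked):

```python
import json
import subprocess


def fetch_primary_dataset_metadata(dataset_name: str) -> list:
    """Query DAS for the metadata of a primary dataset.

    Assumes a DAS command-line client (here dasgoclient) is available and
    that the -query/-json options behave as sketched; adjust to the client
    actually deployed.
    """
    result = subprocess.run(
        ["dasgoclient", "-query", f"dataset dataset={dataset_name}", "-json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


# Hypothetical usage: pre-fill the primary-dataset entry of a CAP record.
# records = fetch_primary_dataset_metadata("/SomePrimaryDataset/SomeEra/AOD")
```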
LHCb:
- DST datasets are tens of TB in size; final ntuples are tens of GB.
- What metadata could be extracted automatically? For the DSTs, information on the software version used to process the data can be extracted via the link to the bookkeeping. Other information could be extracted from the analysis wikis, but the structure is not always the same.
ALICE:
- Keep all information, which is currently scattered, in a self-contained way
- A way to output all analysis metadata in a common format, allowing experiment-specific plugins to read it and perform specific actions (for ALICE we could automatically generate a LEGO analysis train, pointing to the right software and datasets). In some particular cases we could have such a plugin reading the metadata in our open-access VM environment.
⇒ Question: could you give an example of a specific plug-in, or of how a use case would look? [separate meeting will follow]
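Purely as an illustration of the plugin idea, and not an answer to the question above: a minimal sketch of what an experiment-specific plugin reading a common metadata record might look like; all class, method and field names are invented here:

```python
# Illustrative sketch only: a hypothetical plugin interface for acting on a
# common-format CAP metadata record.
class CapPlugin:
    """Base interface: read a common-format metadata record, perform an action."""

    def run(self, record: dict) -> None:
        raise NotImplementedError


class AliceLegoTrainPlugin(CapPlugin):
    """Hypothetical ALICE plugin: turn a CAP record into a LEGO train configuration."""

    def run(self, record: dict) -> None:
        train_config = {
            "software_version": record.get("software_version"),
            "input_datasets": record.get("input_datasets", []),
            "analysis_macros": record.get("analysis_macros", []),
        }
        # In a real plugin this configuration would be submitted to the LEGO
        # train system (or instantiated inside the open-access VM environment).
        print("Would configure a LEGO train with:", train_config)
```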
ATLAS:
[work in progress, more details and pointers to come]
- as automated as possible
- “user clicks buttons to connect CAP with existing ATLAS databases and so content is automatically retrieved”
- possibly the process could be started from the shell
CMS:
- search functions, e.g. over intermediate data
- a central access point for user-created content (which is currently only available upon request, if you know where it is or whom to ask)
- a prominent use case for CMS is these “user generated files”: they could be made much more accessible and “reusable” if they were preserved and searchable via CAP
- we would like to have automatic retrieval of the main contents
- integration with CRAB and the CMS dashboard to start capturing information as early in the process as possible; this potentially also means capturing analysis steps that are not followed up
- cataloguing options
LHCb:
- Same as CMS.
- We would also like to archive (and then easily retrieve) the computing environment of the analysis (or at least of the final steps).
- It would be wonderful if I could click on “Rerun this analysis” and open a VM image with all of the analysis environment (platform, LHCb code, ROOT version, ...) properly set up, where I could launch the scripts/code prepared by the authors of the analysis (and archived in this framework).
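As a sketch of the kind of environment information such a “Rerun this analysis” button would need to have archived (field names and values are hypothetical; how the VM image is actually instantiated is left open):

```python
# Illustrative only: an environment descriptor a CAP record could store so
# that the analysis environment can be re-instantiated later.
environment_snapshot = {
    "platform": "",                   # e.g. the platform string the analysis was built for
    "lhcb_software": {},              # LHCb applications and versions used
    "root_version": "",               # ROOT version used
    "vm_or_container_image": "",      # reference to the archived VM/container image
    "entry_point": "",                # hypothetical script archived with the analysis
}
```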
ALICE:
- The access for committing to the site should be restricted to ALICE.
- For reading, it depends:
- for already published analyses (which is what I expect), access should be open to everyone.
- if we submit materials used in ongoing analyses, access should be restricted.
ATLAS:
- Access to content is usually restricted.
- However, if CAP supports CERN SSO and Grid certificates, this should be fairly compatible with existing ATLAS access schemes.
CMS:
- access restrictions apply to most content, tools and functionalities on DAPF
- possibly even access restrictions within the collaboration
- it would be good to have an easy option to push content to CODP
LHCb:
- access restrictions apply to most content (even if an analysis is published, the code, internal notes, etc. are usually not public). Material which is already public (papers, talks) should instead be visible to everybody
ATLAS:
- should follow ATLAS analysis practices
- Auto-import functionality is appreciated, to reduce the burden of filling in information manually
CMS:
- Need to make sure there is no extra burden on the researcher; the tool thus has to follow existing analysis practices. Integrate as much as possible into the research and approval tools and workflows
- Autocomplete whenever possible, including more complex routines, i.e. when selecting xyz, the following fields already pre-select certain elements…
- Show controlled vocabularies if available
LHCb:
- The interface needs to follow the flow of the analysis; otherwise it is difficult to compile the information and for a user to reproduce the analysis.
- LHCb example:
- step 1 selection of events on DST
- step 2 train BDT and add BDT variables to the ntuple
- step 3 fit
- Steps n+1 depend on the specific analysis; the framework should be flexible enough to accommodate this. They will have the same structure. The user shall be able to add them upon request (+1)
- helpful to start early in the research process, e.g. when an analysis is started
- should be embedded into the normal research practices
- Need to investigate how the tool could be embedded into individual analysis or publication approval procedures
- there is a lot of meta-information in the experiments’ tools which can be used to pre-populate CAP; e.g. CMS-DAS, ALICE-LEGO, LHCb-?, ATLAS-AMI or Derivation Framework?
- challenge: map the data models (see the sketch after this list)
- automated extraction?
- versioning?
- a search interface is needed to provide an “extra” service on top of the existing databases
- i.e. to know who has done what with the data/software/MC
- i.e. for data derived from primary datasets
- VM: everyone needs/wants a snapshot of the current VM. How to implement this?
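To make the data-model mapping challenge concrete (referenced in the list above), a hypothetical sketch of how a few common CAP fields might map onto the experiment-specific sources mentioned in this document; the mapping is illustrative, not agreed:

```python
# Illustrative only: candidate mapping of common CAP fields to the
# experiment tools that could pre-populate them.
field_sources = {
    "primary_datasets":  {"CMS": "DAS", "ATLAS": "AMI", "ALICE": "LEGO train metadata", "LHCb": "bookkeeping"},
    "software_versions": {"CMS": "CRAB job configuration", "ATLAS": "AMI tags", "ALICE": "LEGO train metadata", "LHCb": "bookkeeping"},
    "publication_links": {"CMS": "CADI", "ATLAS": "Glance/CDS", "ALICE": "PWG pages", "LHCb": "CDS"},
}
```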
In conclusion, CAP needs to highlight the “value added” on top of existing tools: a powerful and usable search, and making "user generated" content accessible and findable. It should be easy to track from the primary datasets to the user-generated content, linking primary data with user-generated content and potentially feeding this back to the experiment-specific tools.
⇒ large improvement in discoverability
Potentially an easier approval process: all the relevant information needed is available via "one click".
ALICE LEGO: need to understand what can be extracted automatically and what needs to be added manually by the user or by us [functionality similar to CMS DAS? - this will be studied in a separate meeting with M.]
LHCb [bookkeeping database or train system] : need to understand what can be extracted automatically and what needs to be added manually by the user or us
- is the functionality similar to CMS DAS?
- ROOT everywhere - but where exactly? Interest in having “common” visualization tools on top?
- Where do we make public content accessible? Via CODP or via an open search on CAP?
- API: need to investigate how “pipes” can be built
- technical (IT-CIS)
- metadata (GS-SIS): standard example from each product/pipe needed
- once we have investigated the pipes in more detail, we will do a round of submission-form adjustments
- what will be prefilled, how?
- check dropdowns (how to update the controlled vocabularies?), check the order [for each of the experiments] [early summer]
- run first usability tests with power users and “preservation experts” on clickable prototypes [mid summer]. Note that this does not mean that this will be a production-ready system by then.
- test with first “new” users from the experiments on clickable prototypes [late summer/autumn]
- specification of back end functionalities and corresponding tasks.
In particular, it would be great to have additional support from you on the following tasks in the next weeks:
- what are the core elements of the analysis (steps) across experiments? [spring, with DASPOS and DPHEP?]
- are there simple “core elements” which are common across the collaborations?
- what is different?
- develop data models, which take into account the above [late spring, with DASPOS?]
- provide feedback on data models [with collaborations]
- workshop to build and discuss the core analysis elements and data model?
- API questions and potential adjustments [DPHEP?]
We are happy to support any internal discussions in regard to the CAP service.