Fifth CAP meeting
The CERN Analysis Preservation fifth meeting took place on Thursday, 12 May 2016, 4-6 PM CEST / 10-12 AM EDT.
https://indico.cern.ch/event/526921/
Prototype available at: https://analysis-preservation-qa.cern.ch/
With focus on the following features and interfaces:
- New backend (Invenio 3) underlying CAP
- Revised analysis schemas (JSON) for LHCb, CMS and ATLAS; testing in progress.
- Integration with more experiments’ databases for autocomplete functionality
- New functionalities around an analysis: sub-schemas, permissions (works with CERN e-groups), a files tab to quickly assess preservation readiness
- Search (results) and API
- Workflow integration with RECAST
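As a rough illustration of how the revised JSON (sub)schemas might be exercised, the sketch below checks an analysis record against a list of required top-level sections. The section names and the LHCb-style sub-schema are invented for illustration; they are not the actual CAP schemas.

```python
# Sketch only: check an analysis record against required top-level
# sections, in the spirit of the JSON (sub)schemas discussed above.
# All field names below are invented for illustration.

def missing_fields(record, required):
    """Return the required top-level sections absent from the record."""
    return [field for field in required if field not in record]

# A hypothetical LHCb-style sub-schema: required top-level sections.
LHCB_REQUIRED = ["basic_info", "stripping_line", "ntuple_production"]

record = {
    "basic_info": {"title": "Example analysis"},
    "stripping_line": "SomeStrippingLine",
}

print(missing_fields(record, LHCB_REQUIRED))  # ['ntuple_production']
```

In a production system this per-experiment list of required sections would live in the JSON Schema itself, so that the same validation runs in the submission interface and in the backend.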
CAP will be rolled out this year. Different pillars need to be worked on:
1st pillar: Connecting the content sources and standardizing the content
- Increase JSON schema testing with targeted working groups to refine the (sub)schemas and use cases. Meet the evolution of an analysis with a usable interface and JSON schema.
- Standardize JSON metadata schema (Ontology).
- Connect the databases from the experiments.
2nd pillar: Aggregate and grab the content
- Grab files/content from experiments’ databases, GitHub, GitLab.
- Intelligent search.
3rd pillar: Containerizing and rerunning analyses
- Containerize analyses by using Docker with various workflow engines and running environments e.g. GitLab CI (LHCb).
- Rerun analyses on OpenStack Magnum, allowing first steps towards reproducible research.
The long-term goal is to capture the analysis, the descriptive metadata, the physics information, and all the relevant data, containers, and software so that an analysis can be reproduced locally. This is why the 3rd pillar needs the 2nd pillar to build upon.
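A minimal sketch of what rerunning one containerized analysis step under the 3rd pillar could look like, assuming the preserved record names a Docker image and a command. The image name and command here are placeholders, not a real CAP workflow.

```python
import subprocess


def docker_command(image, command, workdir="/analysis"):
    """Assemble the `docker run` invocation for one preserved step.

    Placeholder sketch: a real rerun would pull the image recorded in
    CAP and mount the preserved inputs alongside it.
    """
    return ["docker", "run", "--rm", "-w", workdir, image] + list(command)


cmd = docker_command("gitlab-registry.cern.ch/example/analysis:v1",
                     ["python", "fit.py"])
print(" ".join(cmd))
# To actually execute the step (requires a Docker daemon):
# subprocess.run(cmd, check=True)
```

The same invocation could be scheduled by a workflow engine or a GitLab CI job instead of being run directly, which is where the LHCb GitLab CI work fits in.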
The discussion showed a preference to work on these pillars in parallel rather than in consecutive steps. In particular, a fast track for the execution of workflows should be facilitated; there was overall consensus on this, though it still needs to be understood how to enable that fast track. Facilitating the execution of workflows without hosting the content can be done faster, so the two pillars can be worked on in parallel. A proper balance between short-term and long-term goals is needed.
Val. (CMS) and others request a policy document with a roadmap. The CAP team agrees that this should be prepared, i.e. to lay out the release plans more clearly. This will be helpful to enable the internal discussions for integrations, application and approval procedures. A draft of such a document will be circulated to this group for comments by the end of June. Please note also that the CAP team will refine its development milestones on GitHub.
Discussion of containerization vs. capturing all the (meta)data needed to see how the analysis was/is done. How should CAP be set up:
- CAP as a container of a full analysis, or
- CAP as a store for the detailed components of an analysis, i.e. the JSON schemas and additional information
There is consensus that this is a discussion that happens across disciplines. CAP, however, facilitates both approaches.
Serving the “long tail of files”: Need for a generic file store for additional files, such as plots or other supplementary materials (see HEPData, which already does this). General consensus to support this.
CAP integration with HEPData: CAP should smoothly integrate with HEPData (an open data publishing platform), which lowers the submission burden on the researchers’/experiments’ side. General consensus to support this.
Access to API:
- Open for now, but access granularity should be revisited later, as more content will have been stored
- Different use cases for each experiment
Edit/Deletion of an analysis:
- Versioning in CAP allows “fall back options”
- CMS needs flexibility during the active analysis steps
- Agreement: after an approval “stamp” has been given to an analysis, it should not be possible to alter/delete it.
Representation
- What is a good representation of an analysis
- DASPOS has done work on that and can share their experience
ATLAS
- Internal policy documents are entering the final phase
- Interested to use CAP when analysis is in a more final phase, i.e. before publication approval.
- Technical problem to solve: need for a reference/link from a Glance record to an AMI record. Both are needed for CAP, but the information is not linked.
CMS
- Interested to use CAP from the beginning of the analysis and throughout the progress to capture all the needed information in all important steps. This means it is expected to be used before the approval phase.
- Started to test the analysis submission interface with a specific WG, i.e. the Heavy-Ion groups. The result: additions to the general schema. See for example: https://github.com/cernanalysispreservation/analysis-preservation.cern.ch/issues/132
LHCb
- Interested to use CAP as part of the publication approval procedures
- Interested to know how a Docker container (and post n-tuple analysis steps run via GitLab CI - S. N. is working on this) can be integrated into CAP.
- Asking for feedback from its working groups. Testing is under way.
- Next collaboration meeting in June, where the prototype will be presented
DASPOS: Collaboration with CAP planned, focus on computational workflows to rerun analysis and ontology descriptions.
RECAST: Update: the execution backend will move towards CERN’s new container technology (OpenStack Magnum).
EVERWARE: well suited to running workflows with several containers.
Next steps
- Forthcoming weeks: testing analysis forms. More detailed use cases and examples are needed from each experiment so that workflows can be developed accordingly. CMS and LHCb are already reaching out to their working groups. Once ATLAS is set up and integrated with their platforms, testing should begin there as well.
- Need for clarification regarding working on the 3rd pillar without requiring the 2nd pillar (see above): ask RECAST whether they want to use CAP only to store their JSON schemas and run the analyses on their side, or whether they will give CAP access to their data so that CAP can rerun the analyses.
- G. to circulate DASPOS examples of analysis representation
- CAP team to refine milestones on GitHub (publicly) and prepare policy document for circulation in this group
Link to GitHub repo: https://github.com/cernanalysispreservation
Previous meeting notes: Fourth CAP meeting