
Requirements

Nicolas Harraudeau edited this page Jun 2, 2017 · 11 revisions

Introduction

Requirement Gathering

These requirements were gathered from different Invenio-based services, i.e.:

  • EUDAT B2SHARE
  • CERN Document Server
  • Zenodo
  • Archivematica/SIP Store integration

Goal

These requirements cover all mentioned statistics-related features. The goal is not to implement everything listed here, but rather to keep a record of all this input so that different teams can contribute to this project without losing any past knowledge.

This document will evolve with time as more input comes in.

Some features will be marked as "out of scope", which means that they will not be implemented in this module. They might however be implemented in another module.

Rules

Features should be prioritized as well as possible. For each feature, the services expecting it should be listed with their deadlines. When no deadline is given, the priority will be "nice to have".

Requirements

Required Statistics

Aggregated statistics like counters, "Top N" lists, maps... High-level statistics need to be computed from low-level events. This processing can be done either at run time or with partial preprocessing.

Examples:

Dashboard Statistic CDS B2SHARE Zenodo Archivematica integration
Record # page views of the record page ASAP ASAP ASAP
Record # downloads of the record (counting all file downloads as 1 download) Nice
File # downloads per file in the record. ASAP ASAP Medium
Collection # new records for the entire collection. Medium
Collection # new submissions for the entire collection. Medium
Collection Particular features usage (e.g. # of comments, # of alerts, usage of recommendations, etc.) August?
Collection # record views for the entire collection Medium
Collection # file downloads for the entire collection Medium
Collection Top uploaders in collection ASAP
Collection Open access vs closed access ASAP
Collection World map with where users are coming from Medium
User # record views for all the user's records Medium
Circulation # renewals, loans, overdue... Medium
SIP Store #/% of SIP packages created, globally or within a ‘collection’ Low
SIP Store #/% of records/files sent to Archivematica store Low
SIP Store #/% of records with failures reported by Archivematica Low
SIP Store Average delay for the processing of a single SIP Low
SIP Store Total used size of the AM store for one Invenio instance / multiple collections; the history of these numbers is worth preserving to see the evolution of the Archivematica store. Low

Note: Collection ~= community
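
As an illustration of how counters and "Top N" lists like those in the table could be derived from low-level events, here is a minimal sketch. The event structure and field names (`type`, `record_id`) are assumptions made for this example, not a format defined by this document.

```python
from collections import Counter

# Hypothetical low-level events; the field names are illustrative assumptions.
events = [
    {"type": "record-view", "record_id": "1"},
    {"type": "record-view", "record_id": "2"},
    {"type": "record-view", "record_id": "1"},
    {"type": "file-download", "record_id": "1"},
]


def count_by_record(events, event_type):
    """Count the events of one type for each record."""
    return Counter(e["record_id"] for e in events if e["type"] == event_type)


# Counter of page views per record.
views = count_by_record(events, "record-view")

# "Top N" list: the N most viewed records with their counts.
top_records = views.most_common(1)
```

The same counting could be done at run time, as here, or ahead of time on stored events.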

Some services would need a history of the statistic. Example: # file downloads per day/month, for the past year, possibly since the beginning.

This history would need to be queried and filtered so that only a given range is displayed.
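
A minimal sketch of such a range filter, assuming the history is available as per-day counts (the data and the `history_in_range` helper are hypothetical):

```python
from datetime import date

# Hypothetical per-day download counts for one file.
daily_downloads = {
    date(2017, 5, 30): 4,
    date(2017, 5, 31): 7,
    date(2017, 6, 1): 3,
    date(2017, 6, 2): 5,
}


def history_in_range(history, start, end):
    """Return (day, count) pairs with start <= day <= end, sorted by day."""
    return sorted(
        (day, count) for day, count in history.items() if start <= day <= end
    )


# Only the entries for June 2017 are kept.
june = history_in_range(daily_downloads, date(2017, 6, 1), date(2017, 6, 30))
```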

New statistics would be added from time to time, but this won't happen very often.

The format of the statistics and events should change very rarely.

Comments by Comment
Archivematica The first three SIP Store statistics can be derived directly from the Archivematica-Invenio table. Delay and total size would probably need more logging.

Use cases based on fine grained events

Creating a dedicated UI presenting fine grained events is OUT OF SCOPE. The querying could be done by another module or directly via Kibana.

Admin investigation

The goal is to help administrators investigate low level events.

Example of data: One record view event with information about the user who performed it, the time and date, geoip, etc...

The current version of the CDS server (based on Invenio 1) pushes all events to a "statistics/log" Elasticsearch server. These events can later be investigated. The resulting cluster has about 750GB of data.

Priority Medium

Audit Log

Zenodo mentioned a need for an audit log enabling the following kind of queries:

  • List events that a user has performed.
  • List events that are performed on an organisation.
Priority OUT OF SCOPE

Use case: record ranking

Use the statistics to change the ranking of records.

Needed by Zenodo
Priority Low

Presentation/Access

Dashboard

Widgets showing multiple statistics from the point of view of a Record, a Community, a User... Those widgets would be on a dedicated page or added to existing pages like "Community page" or "Record page".

Needed by CDS, Zenodo, Archivematica integration
Priority Blocking.

Dedicated REST API via AJAX queries

AJAX queries to the REST API returning statistics. This would make statistics filtering more dynamic from the UI point of view.

Needed by Zenodo, B2SHARE and CDS
Priority Important but not blocking

Statistics added to existing endpoints

Statistics could be added when returning resources on existing endpoints. Example:

$ curl -XGET /api/records/123
{
  "metadata": {...},
  "links": {...},
  "stats": {
    "views": 42,
    "downloads": 10
  }
}

This would be done as a first approach by custom B2SHARE code, thus not part of this module.

Priority OUT OF SCOPE
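
For illustration only, the enrichment of an existing resource could look like the sketch below. The `with_stats` helper is hypothetical and stands in for whatever custom service code would produce the response shown above.

```python
def with_stats(resource, views, downloads):
    """Return a copy of a serialized resource with a "stats" section added."""
    enriched = dict(resource)  # shallow copy, the original is left untouched
    enriched["stats"] = {"views": views, "downloads": downloads}
    return enriched


# Hypothetical serialized record, mirroring the response shape shown above.
record = {"metadata": {}, "links": {}}
response = with_stats(record, views=42, downloads=10)
```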

Kibana/Elasticsearch

Statistics would be accessed via Kibana. This is only for administrators.

Needed by CDS
Priority Nothing needs to be done as long as the statistics are in Elasticsearch

Features

Access control

Access to some statistics could be restricted to some users.

Needed by CDS
Priority Blocking for CDS
Comments Zenodo and B2SHARE don't need access control for now. Every statistic is public.
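
One possible shape for such a restriction is sketched below. The statistic names, role names and the `can_view` helper are assumptions for this example. Statistics without an entry are treated as public, which matches the Zenodo and B2SHARE situation described above.

```python
# Hypothetical mapping from a statistic name to the roles allowed to read it.
# A statistic that is not listed here is public.
RESTRICTED_STATISTICS = {
    "circulation-loans": {"librarian", "admin"},
}


def can_view(statistic, user_roles):
    """Return True if a user with the given roles may read the statistic."""
    allowed_roles = RESTRICTED_STATISTICS.get(statistic)
    if allowed_roles is None:
        return True  # public statistic
    return bool(allowed_roles & set(user_roles))
```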

Removing old data

Automatic removal of old events or statistics.

Needed by B2SHARE, Zenodo
Priority Low. The data can be deleted manually in the meantime.
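
A minimal sketch of age-based removal, assuming each event carries a `timestamp` field (the field name, the retention period and the `drop_old_events` helper are all hypothetical):

```python
from datetime import datetime, timedelta


def drop_old_events(events, now, max_age):
    """Keep only the events that are not older than max_age."""
    cutoff = now - max_age
    return [e for e in events if e["timestamp"] >= cutoff]


# Hypothetical events; only the second one is recent enough to be kept.
events = [
    {"id": 1, "timestamp": datetime(2016, 1, 1)},
    {"id": 2, "timestamp": datetime(2017, 5, 1)},
]
kept = drop_old_events(events, now=datetime(2017, 6, 2), max_age=timedelta(days=365))
```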

Preprocessing and aggregation of statistics

Automatic aggregation of low-level statistics and events into higher-level statistics.

Example: aggregate record view events into "record views per day" documents.

This would improve performance, as old events could be removed more easily.

Needed by B2SHARE, Zenodo
Priority Medium
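
The "record views per day" example could be sketched as follows. The event fields (`record_id`, `timestamp`) and the layout of the resulting documents are assumptions made for this illustration.

```python
from collections import Counter
from datetime import datetime


def views_per_day(events):
    """Aggregate record view events into one document per record and day."""
    counts = Counter(
        (e["record_id"], e["timestamp"].date().isoformat()) for e in events
    )
    return [
        {"record_id": record_id, "date": day, "views": views}
        for (record_id, day), views in sorted(counts.items())
    ]


# Hypothetical record view events.
view_events = [
    {"record_id": "1", "timestamp": datetime(2017, 6, 1, 9, 30)},
    {"record_id": "1", "timestamp": datetime(2017, 6, 1, 17, 5)},
    {"record_id": "1", "timestamp": datetime(2017, 6, 2, 8, 0)},
]
documents = views_per_day(view_events)
```

Once such documents exist, the three original events are no longer needed to answer "views per day" queries.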

COUNTER compatibility

Be compatible with COUNTER (see https://www.projectcounter.org/guides/)

Needed by Zenodo
Priority Low

Sending anonymized events to the OpenAIRE Piwik HTTP API instance

Needed by Zenodo
Priority Low
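
This document does not specify how events would be anonymized. As one possible approach, the sketch below truncates the client IP address before an event leaves the service. The prefix lengths (/24 for IPv4, /48 for IPv6) and the `anonymize_ip` helper are assumptions, and the call to the Piwik HTTP API itself is not shown.

```python
import ipaddress


def anonymize_ip(ip):
    """Zero the host part of an IP address (assumed /24 for IPv4, /48 for IPv6)."""
    address = ipaddress.ip_address(ip)
    prefix = 24 if address.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)
```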

Constraints

Downtime

By downtime we mean that the statistics are not available but Invenio is still up. This can happen for long processing or migrations.

For all services a long downtime seems acceptable.

Performance and resources

Amount of data which will be stored:

CDS Our current ES cluster holds 750GB of data / 2’414 Mil objects (including Apache logs). The statistics alone should be close to 100GB of data.
Zenodo Small - it’s in Piwik

Deadlines

Here is the list of deadlines per project:

What CDS Zenodo B2SHARE Archivematica
invenio-stats alpha July, a student working on Frontend for statistics for 10 weeks End of June
invenio-stats integrated in service Q3 July Only a nice to have