Requirements
These requirements were gathered from different Invenio-based services, i.e.:
- EUDAT B2SHARE
- CERN Document Server
- Zenodo
- Archivematica/SIP Store integration
These requirements cover all the statistics-related features mentioned by these services. The goal is not to implement everything listed here, but rather to keep a record of all this input so that different teams can contribute to this project without losing any past knowledge.
This document will evolve with time as more input comes in.
Some features will be marked as "out of scope", which means that they will not be implemented in this module. They might however be implemented in another module.
Features should be prioritized as well as possible. For each feature, the services expecting it should be listed with their deadlines. When no deadline is given, the priority will be "nice to have".
Aggregated statistics like counters, "Top N" lists, maps... High-level statistics need to be computed from low-level events. This processing can be done either at run time or with partial preprocessing.
Examples:
Dashboard | Statistic | CDS | B2SHARE | Zenodo | Archivematica integration |
---|---|---|---|---|---|
Record | # page views of the record page () | ASAP | ASAP | ASAP | |
Record | # downloads of the record (counting all file downloads as 1 download) | Nice | |||
File | # downloads per file in the record. | ASAP | ASAP | Medium | |
Collection | # new records for the entire collection. | Medium | |||
Collection | # new submissions for the entire collection. | Medium | |||
Collection | Particular features usage (e.g. # of comments, # of alerts, usage of recommendations, etc.) | August? | |||
Collection | # record views for the entire collection | Medium | |||
Collection | # file downloads for the entire collection | Medium | |||
Collection | Top uploaders in collection | ASAP | |||
Collection | Open access vs closed access | ASAP | |||
Collection | World map with where users are coming from | Medium | |||
User | # record views for all the user's records | Medium | |||
Circulation | # renewals, loans, overdue... | Medium | |||
SIP Store | #/% of SIP packages created, globally or within a ‘collection’ | Low | |||
SIP Store | #/% of records/files sent to Archivematica store | Low | |||
SIP Store | #/% of records with failures reported by Archivematica | Low | |||
SIP Store | Average delay for the processing of a single SIP | Low | |||
SIP Store | Total used size of the AM store for one Invenio instance / multiple collections; the history of these numbers is worth preserving to see the evolution of the Archivematica store. | Low |
Note: Collection ~= community
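As an illustration of how counters and "Top N" lists such as those in the table above could be computed from low-level events at run time, here is a minimal sketch. The event types and field names (`type`, `record_id`, `uploader`) are hypothetical and not part of any agreed event format.

```python
from collections import Counter

# Hypothetical low-level events; the field names are assumptions for illustration.
events = [
    {"type": "record-view", "record_id": "r1"},
    {"type": "record-view", "record_id": "r1"},
    {"type": "record-view", "record_id": "r2"},
    {"type": "file-download", "record_id": "r1"},
    {"type": "record-created", "record_id": "r3", "uploader": "alice"},
    {"type": "record-created", "record_id": "r4", "uploader": "alice"},
    {"type": "record-created", "record_id": "r5", "uploader": "bob"},
]


def count_by(events, event_type, key):
    """Count events of one type, grouped by the value of the given field."""
    return Counter(e[key] for e in events if e["type"] == event_type)


# Counter: number of page views per record.
views_per_record = count_by(events, "record-view", "record_id")

# "Top N" list: the single most active uploader.
top_uploaders = count_by(events, "record-created", "uploader").most_common(1)
```

Computing everything at run time like this is simple but scans every event; partial preprocessing trades storage for faster queries.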
Some services would need a history of a statistic. Example: # file downloads per day/month, over the past year, possibly since the beginning.
It must be possible to query and filter this history so that only a given range is displayed.
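A range query over such a history could look like the following sketch. It assumes the history is already available as one count per day; the `daily_downloads` data and both helper names are hypothetical.

```python
from datetime import date

# Hypothetical pre-aggregated history: download count per day for one file.
daily_downloads = {
    date(2016, 5, 30): 4,
    date(2016, 5, 31): 7,
    date(2016, 6, 1): 3,
    date(2016, 6, 2): 5,
}


def history_in_range(history, start, end):
    """Return (day, count) pairs with start <= day <= end, oldest first."""
    return sorted((day, n) for day, n in history.items() if start <= day <= end)


def monthly_totals(history):
    """Roll daily counts up into {(year, month): total}."""
    totals = {}
    for day, n in history.items():
        key = (day.year, day.month)
        totals[key] = totals.get(key, 0) + n
    return totals


selected = history_in_range(daily_downloads, date(2016, 5, 31), date(2016, 6, 1))
per_month = monthly_totals(daily_downloads)
```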
New statistics would be added from time to time, but this won't happen very often.
The format of the statistics and events should change very rarely.
Comments by | Comment |
---|---|
Archivematica | The first three SIP Store statistics can be derived directly from the Archivematica-Invenio table. Delay and total size would probably need more logging. |
Creating a dedicated UI presenting fine-grained events is OUT OF SCOPE. The querying could be done by another module or directly via Kibana.
The goal is to help administrators investigate low-level events.
Example of data: One record view event with information about the user who performed it, the time and date, geoip, etc...
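To make this concrete, a single record view event might be represented as below. This is only a sketch: the class name and every field name are assumptions, not an agreed schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class RecordViewEvent:
    """Hypothetical shape of one low-level "record view" event."""

    record_id: str
    user_id: str
    timestamp: str  # ISO 8601, UTC
    country: str  # result of a GeoIP lookup on the visitor's address


event = RecordViewEvent(
    record_id="123",
    user_id="42",
    timestamp=datetime(2016, 6, 1, 12, 30, tzinfo=timezone.utc).isoformat(),
    country="CH",
)

# Serialized form, e.g. for indexing into a search cluster.
document = json.dumps(asdict(event), sort_keys=True)
```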
The current version of the CDS server (based on Invenio 1) pushes all events into a "statistics/log" Elasticsearch server. These events can later be investigated. The resulting cluster holds about 750GB of data.
Priority | Medium |
---|
Zenodo mentioned a need for an audit log enabling the following kinds of queries:
- List events that a user has performed.
- List events that are performed on an organisation.
Priority | OUT OF SCOPE |
---|
Use the statistics to change the ranking of records.
Needed by | Zenodo |
---|---|
Priority | Low |
Widgets showing multiple statistics from the point of view of a Record, a Community, a User... Those widgets would be on a dedicated page or added on existing pages like "Community page", "Record page"
Needed by | CDS, Zenodo, Archivematica integration |
---|---|
Priority | Blocking. |
AJAX queries to the REST API returning statistics. This would make statistics filtering more dynamic from the UI point of view.
Needed by | Zenodo, B2SHARE and CDS |
---|---|
Priority | Important but not blocking |
Statistics could be added when returning resources on existing endpoints. Example:
    $ curl -XGET /api/records/123
    {
      "metadata": {...},
      "links": {...},
      "stats": {
        "views": 42,
        "downloads": 10
      }
    }
This would be done as a first approach by custom B2SHARE code, thus not part of this module.
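Such custom code could be as small as the sketch below, which adds a `stats` key to an already serialized record. The function name and its arguments are hypothetical; how the counts are obtained is left open.

```python
def with_stats(resource, views, downloads):
    """Return a copy of a serialized record with a "stats" key added.

    The input dictionary is left untouched.
    """
    enriched = dict(resource)
    enriched["stats"] = {"views": views, "downloads": downloads}
    return enriched


record = {"metadata": {"title": "Example"}, "links": {"self": "/api/records/123"}}
response = with_stats(record, views=42, downloads=10)
```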
Priority | OUT OF SCOPE |
---|
Statistics would be accessed via Kibana. This is only for administrators.
Needed by | CDS |
---|---|
Priority | Nothing needs to be done as long as the statistics are in Elasticsearch |
Access to some statistics could be restricted to some users.
Needed by | CDS |
---|---|
Priority | Blocking for CDS |
Comments | Zenodo and B2SHARE don't need access control for now. Every statistic is public. |
Automatic removal of old events or statistics.
Needed by | B2SHARE, Zenodo |
---|---|
Priority | Low. The data can be deleted manually in the meantime. |
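A retention rule of this kind could be sketched as follows, assuming each event carries a timezone-aware `timestamp`. The function name and the one-year limit are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def drop_expired(events, now, max_age):
    """Keep only the events that are not older than max_age at time `now`."""
    cutoff = now - max_age
    return [e for e in events if e["timestamp"] >= cutoff]


now = datetime(2016, 6, 30, tzinfo=timezone.utc)
events = [
    {"id": 1, "timestamp": datetime(2015, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "timestamp": datetime(2016, 6, 1, tzinfo=timezone.utc)},
]

# Hypothetical policy: discard events older than one year.
kept = drop_expired(events, now, max_age=timedelta(days=365))
```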
Automatic aggregation of low-level statistics and events into higher-level statistics.
Example: aggregate record view events into "record views per day" documents.
This would improve performance, as old events can be removed once they have been aggregated.
Needed by | B2SHARE, Zenodo |
---|---|
Priority | Medium |
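The aggregation described above could be sketched as follows: individual view events are grouped by record and calendar day, producing one summary document per pair. The event fields and the output document shape are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def views_per_day(events):
    """Aggregate view events into one document per (record, day)."""
    counts = Counter(
        (e["record_id"], e["timestamp"].date().isoformat()) for e in events
    )
    return [
        {"record_id": record_id, "date": day, "views": n}
        for (record_id, day), n in sorted(counts.items())
    ]


# Hypothetical raw events.
events = [
    {"record_id": "r1", "timestamp": datetime(2016, 6, 1, 9, 0)},
    {"record_id": "r1", "timestamp": datetime(2016, 6, 1, 17, 30)},
    {"record_id": "r1", "timestamp": datetime(2016, 6, 2, 8, 15)},
    {"record_id": "r2", "timestamp": datetime(2016, 6, 1, 10, 0)},
]

daily_documents = views_per_day(events)
```

Once the daily documents are stored, the raw events for those days are no longer needed to answer "views per day" queries.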
Be compatible with COUNTER (see https://www.projectcounter.org/guides/)
Needed by | Zenodo |
---|---|
Priority | Low |
Needed by | Zenodo |
---|---|
Priority | Low |
By downtime we mean that the statistics are not available but Invenio is still up. This can happen for long processing or migrations.
For all services a long downtime seems acceptable.
Amount of data which will be stored:
CDS | Our current ES cluster holds 750GB of data / 2’414 Mil objects (including apache logs). Just the statistics should be close to ~100GB of data. |
---|---|
Zenodo | Small - it’s in Piwik |
Here is the list of deadlines per project:
What | CDS | Zenodo | B2SHARE | Archivematica |
---|---|---|---|---|
invenio-stats alpha | July, a student working on Frontend for statistics for 10 weeks | End of June | ||
invenio-stats integrated in service | Q3 | July | Only a nice-to-have |