TIP: Collect statistics on how people are using it #679

Open
Tracked by #1345
odeimaiz opened this issue Sep 7, 2022 · 4 comments

Labels: Feedback (Feedback through frontend), PO issue (Created by Product owners), TIP (Temporal Interference Planning)


odeimaiz commented Sep 7, 2022

  • Usage time
  • Number of permutations for the optimizer
  • Reports created
  • Number of times s4l is used
  • ...

related to #670

@odeimaiz added the enhancement and Feedback (Feedback through frontend) labels on Sep 7, 2022
@odeimaiz changed the title from "Collect statistics on how people are using it" to "TIP: Collect statistics on how people are using it" on Sep 7, 2022
@sanderegg commented:

Prometheus/Grafana

  • track log in/log out requests
  • track open request per study per user
  • track close request per study per user
  • track computational resources of the optimizer [per user]
    --> very nice plots


mrnicegyu11 commented Oct 5, 2022

I have done some preliminary work on this and will share some thoughts and PromQL snippets here for now. I suggest we discuss this briefly at the review, or within the team, before we move forward and wrap these into Grafana dashboards etc.

These PromQL queries can be tried on monitoring.tip.itis.swiss/prometheus/

Point by point:

  • track [the number of] log in/log out requests
    This is possible with the PromQL queries sum(http_requests_total{endpoint="/v0/auth/login"}) and sum(http_requests_total{endpoint=~"/v0/auth/logout"}). They do not resolve by userID, so it is currently not possible to track this per user, only in total. Note that the underlying scrape of the webserver was affected by the same bug (not taking swarm services' replicas into account) as osparc.io until early this morning (05 Oct 2022).

  • track open request per study per user / track close request per study per user
    This is currently not possible and deserves some words. It is possible to track the open/close requests per study in total, via sum(http_requests_total{endpoint="/v0/projects/{project_id}:open"}) and sum(http_requests_total{endpoint="/v0/projects/{project_id}:close"}). However, this does not allow resolving the open/close requests by user. This is by design: in the past, the simcore webserver exposed userIDs and projectUUIDs, which resulted in high Prometheus cardinality because a new timeseries was created for each unique combination of userID & projectUUID. Earlier this year, this ground Prometheus to a halt due to huge memory usage. For this reason, the userIDs/UUIDs have been omitted.
    In Graylog, calls to webserver endpoints such as endpoint="/v0/projects/{project_id}:open" are in principle logged fully, so the necessary information for tracking this metric is (in principle) in Graylog. Graylog allows one to create aggregations (i.e. metrics) from logs, but for clear performance reasons it does not allow aggregations over the log message or mutation via regex; only aggregations over the log's metadata are allowed. So one very hacky way, which I would not recommend, to get these metrics would be to export logs from Graylog and write a script that counts and aggregates the calls to these endpoints, resolved by user and project.
    If we want to solve this problem in a sustainable way with maintainable code that scales, I suggest the following: a Prometheus exporter in simcore (like the webserver, or potentially a dedicated exporter microservice) could expose metrics such as the ones below (a minimal exporter sketch follows after this list):

  • projects_open{groupid=111}

  • projects_closed{groupid=555,wasGarbageCollected=True/False}

  • projects_created{group_id=123}

  • projects_shared_with{groupid=987}
    Since our users and their groups are (presumably) finite, these might be high-cardinality metrics, but they would not grow on their own, only as more users join the platform.
    These metrics would contain the information requested here, but some of them likely require querying the Postgres database, which is slow and a costly, potentially impactful operation.

  • track computational resources of the optimizer [per user]
    This is traceable as "percentage of one CPU" (so, 100 = 100% means 1 full CPU used, 400 means 4 full CPUs used) using the PromQL query sum by (service_name,instance) (label_replace(rate(container_cpu_usage_seconds_total{image=~"^registry.tip.itis.swiss.*comp.*opt.*", id=~"/docker/.*"}[3m]), "service_name", "$1", "image", "^registry.*/simcore/services/(.*)")) * 100 OR on() vector(0)
    However, this is again not resolved by user, which is good w.r.t. keeping Prometheus happy and the labels' cardinality small. If per-user insight is truly needed, this requires some work. For example, the user_id could be added to the optimizer container as a docker container label, e.g. by the dask-scheduler. cAdvisor exposes every docker container label as a Prometheus label on the container timeseries, so in that case we could easily aggregate or filter by user (see the sketch at the end of this comment).
    Currently, it seems that only very negligible CPU resources are used by the optimizer. Here is the graph for the last 2 days:
    [screenshot: CPU usage graph of the optimizer over the last 2 days]
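
Regarding the exporter suggested above, here is a minimal sketch of what such a dedicated exporter microservice could look like. This is an illustration only, not the actual implementation: the metric/label names, the port and the postgres query are assumptions; a real exporter would query simcore's projects tables.

    # Minimal sketch of the exporter idea above (illustrative only).
    # Requires: pip install prometheus_client
    import time
    from prometheus_client import Gauge, start_http_server

    # group_id is the only label, so cardinality stays bounded by the (finite)
    # number of user groups on the platform.
    PROJECTS_OPEN = Gauge("projects_open", "Currently open projects", ["group_id"])

    def fetch_open_projects_per_group():
        """Placeholder for the (potentially costly) postgres query mentioned above,
        e.g. counting open projects grouped by owner group."""
        return {"111": 3, "555": 1}  # fake data, for the sketch only

    if __name__ == "__main__":
        start_http_server(9101)  # hypothetical scrape target for Prometheus
        while True:
            for group_id, n_open in fetch_open_projects_per_group().items():
                PROJECTS_OPEN.labels(group_id=group_id).set(n_open)
            time.sleep(60)  # refresh once a minute to keep the DB load low

The other proposed metrics (projects_closed, projects_created, projects_shared_with) would follow the same pattern.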

It has to be noted that we currently run tip.itis.swiss fully distributed over 4 manager machines, so it is hard to clearly state how much "free" CPU we have for the optimizer at any point in time, since the optimizer also competes with platform and ops services on the TIP machines.

Please share your thoughts on this if you like, potentially before the review: @elisabettai @sanderegg @pcrespov @Surfict @odeimaiz
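
To make the per-user resource idea above concrete, here is a sketch of how one could query per-user optimizer CPU usage through the Prometheus HTTP API, assuming the dask-scheduler (or whoever starts the container) sets a user_id docker label and cAdvisor therefore exposes it as container_label_user_id (both label names are assumptions; they do not exist yet):

    # Sketch: per-user optimizer CPU usage, assuming a user_id container label exists.
    # Requires: pip install requests
    import requests

    PROMETHEUS = "https://monitoring.tip.itis.swiss/prometheus"
    QUERY = (
        'sum by (container_label_user_id) ('
        'rate(container_cpu_usage_seconds_total{'
        'image=~"^registry.tip.itis.swiss.*comp.*opt.*", id=~"/docker/.*"}[3m])'
        ') * 100'
    )

    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        user = series["metric"].get("container_label_user_id", "<unknown>")
        print(f"user {user}: {float(series['value'][1]):.1f}% of one CPU")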

@pcrespov pcrespov mentioned this issue Oct 5, 2022

mrnicegyu11 commented Oct 6, 2022

Using a more hacky approach, the number of optimizer runs can be determined using the PromQL query

sum(label_replace(sum without(cpu) (rate(container_cpu_usage_seconds_total{image=~"^registry.tip.itis.swiss.*comp.*opt.*", id=~"/docker/.*"}[3m])), "service_name", "$1", "image", "^registry.*/simcore/services/(.*)") OR on() vector(0))

only in Grafana, choosing the Stat visualization with options as shown below:
[screenshot: Grafana Stat visualization options]

Then, the number of optimizer runs multiplied by two is printed. This is a bit hacky and will likely stop working if many asynchronous, overlapping optimizer runs occur, or if the analyzed time interval is too long.

[screenshot: resulting Stat panel showing the run count]

Note: to the best of my knowledge, Prometheus offers no robust way to count the number of times a timeseries has (or does not have) values, which is what would be needed here. Using manual visual inspection, the number of runs can be deduced from the PromQL query given earlier.
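
As a rough scripted alternative to visual inspection, here is a sketch that pulls the raw series over a time range via the Prometheus HTTP API and counts contiguous active stretches. It assumes the series disappears (or drops to ~0) between runs and that runs do not overlap, so it only approximates the number of runs:

    # Sketch: approximate the number of optimizer runs by counting idle->active
    # transitions of the summed CPU-usage series. Breaks for overlapping runs.
    # Requires: pip install requests
    import time
    import requests

    PROMETHEUS = "https://monitoring.tip.itis.swiss/prometheus"
    QUERY = (
        'sum(rate(container_cpu_usage_seconds_total{'
        'image=~"^registry.tip.itis.swiss.*comp.*opt.*", id=~"/docker/.*"}[3m]))'
    )
    STEP = 60  # seconds between evaluation points

    def count_runs(start: int, end: int) -> int:
        resp = requests.get(
            f"{PROMETHEUS}/api/v1/query_range",
            params={"query": QUERY, "start": start, "end": end, "step": STEP},
            timeout=60,
        )
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        samples = result[0]["values"] if result else []
        active = {int(ts) for ts, val in samples if float(val) > 0.01}
        runs, was_active = 0, False
        for ts in range(start, end + 1, STEP):
            is_active = ts in active
            if is_active and not was_active:  # idle -> active transition = new run
                runs += 1
            was_active = is_active
        return runs

    now = int(time.time())
    print(count_runs(now - 2 * 24 * 3600, now), "optimizer runs in the last 2 days")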

@Konohana0608 added the TIP (Temporal Interference Planning) and PO issue (Created by Product owners) labels on Oct 18, 2023
@Konohana0608 commented:

This function will also be very useful for evaluating the results of the feature voting functionality (#1146). Thank you for the work done so far!
