TIP: Collect statistics on how people are using it #679

Open
Tracked by #1345
odeimaiz opened this issue Sep 7, 2022 · 4 comments

Labels: Feedback (Feedback through frontend), PO issue (Created by Product owners), TIP (Temporal Interference Planning)


odeimaiz commented Sep 7, 2022

  • Usage time
  • Number of permutations for the optimizer
  • Reports created
  • Number of times s4l is used
  • ...

related to #670

@odeimaiz added the enhancement and Feedback (Feedback through frontend) labels on Sep 7, 2022
@odeimaiz changed the title from "Collect statistics on how people are using it" to "TIP: Collect statistics on how people are using it" on Sep 7, 2022
@sanderegg commented:

Prometheus/Grafana

  • track log in/log out requests
  • track open request per study per user
  • track close request per study per user
  • track computational resources of the optimizer [per user]
    --> very nice plots


mrnicegyu11 commented Oct 5, 2022

I have done some preliminary work on this and will share some thoughts and PromQL snippets here for now. I suggest we discuss this briefly at the review, or within the team, before we move forward and wrap these into Grafana dashboards etc.

These PromQL queries can be tried on monitoring.tip.itis.swiss/prometheus/

Point by point:

  • track [the number of] log in/log out requests
    This is possible with the PromQL queries sum(http_requests_total{endpoint="/v0/auth/login"}) and sum(http_requests_total{endpoint=~"/v0/auth/logout"}). They do not resolve by userID, so it is currently not possible to track this per user, only in total. Note that the underlying scrape of the webserver was affected by the same bug (not taking swarm services' replicas into account) as osparc.io until early this morning (05 Oct 2022).

  • track open request per study per user / track close request per study per user
    This is currently not possible and deserves some words. It is possible to track the open/close requests per study in total, via sum(http_requests_total{endpoint="/v0/projects/{project_id}:open"}) and sum(http_requests_total{endpoint="/v0/projects/{project_id}:close"}). However, this does not allow resolving the open/close requests by user. This is by design: in the past, the simcore webserver exposed userIDs and projectUUIDs, which resulted in high Prometheus cardinality because a new timeseries was created for each unique combination of userID & projectUUID. Earlier this year, this ground Prometheus to a halt due to huge memory usage. For this reason, the userIDs/UUIDs have been omitted.
    In Graylog, calls to webserver endpoints such as endpoint="/v0/projects/{project_id}:open" are in principle logged fully, so the necessary information for tracking this metric is (in principle) in Graylog. Graylog allows one to create aggregations (i.e. metrics) from logs, but for clear performance reasons it does not allow aggregations over the log message or mutation via regex; only aggregations over the log's metadata are allowed. So one very hacky way, which I would not recommend, to get these metrics would be to export logs from Graylog and write a script that counts and aggregates the calls to these endpoints, resolved by user and project.
    If we want to solve this problem in a sustainable way with maintainable code that scales, I suggest the following: a Prometheus exporter in simcore (like the webserver, or potentially a dedicated exporter microservice) could expose metrics such as the ones below (a minimal exporter sketch follows after this list):

  • projects_open{groupid=111}

  • projects_closed{groupid=555,wasGarbageCollected=True/False}

  • projects_created{group_id=123}

  • projects_shared_with{groupid=987}
    Since our users and their groups are (presumably) finite, these might be high-cardinality metrics, but they would not grow on their own, only as more users join the platform.
    These metrics would contain the information requested here, but some of them likely require querying the Postgres database, which is slow and a costly, potentially impactful operation.

  • track computational resources of the optimizer [per user]
    This is traceable as "percentage of one CPU" (so, 100 = 100% means 1 full CPU used, 400 means 4 full CPUs used) using the PromQL query sum by (service_name,instance) (label_replace(rate(container_cpu_usage_seconds_total{image=~"^registry.tip.itis.swiss.*comp.*opt.*", id=~"/docker/.*"}[3m]), "service_name", "$1", "image", "^registry.*/simcore/services/(.*)")) * 100 OR on() vector(0)
    However, this is again not resolved by user, which is good w.r.t. keeping Prometheus happy and the labels' cardinality small. If per-user insight is truly needed, this requires some work. For example, the user_id could be added to the optimizer container as a docker container label, e.g. by the dask-scheduler. cAdvisor exposes every docker container label as a Prometheus label on the container timeseries, so in that case we could easily aggregate or filter by user (see the sketch at the end of this comment).
    Currently, it seems that only very negligible CPU resources are used by the optimizer. Here is the graph for the last 2 days:
    [screenshot: CPU usage graph of the optimizer over the last 2 days]
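
Regarding the exporter suggested above, here is a minimal sketch of what such a dedicated exporter microservice could look like. This is an illustration only, not the actual implementation: the metric/label names, the port and the postgres query are assumptions; a real exporter would query simcore's projects tables.

    # Minimal sketch of the exporter idea above (illustrative only).
    # Requires: pip install prometheus_client
    import time
    from prometheus_client import Gauge, start_http_server

    # group_id is the only label, so cardinality stays bounded by the (finite)
    # number of user groups on the platform.
    PROJECTS_OPEN = Gauge("projects_open", "Currently open projects", ["group_id"])

    def fetch_open_projects_per_group():
        """Placeholder for the (potentially costly) postgres query mentioned above,
        e.g. counting open projects grouped by owner group."""
        return {"111": 3, "555": 1}  # fake data, for the sketch only

    if __name__ == "__main__":
        start_http_server(9101)  # hypothetical scrape target for Prometheus
        while True:
            for group_id, n_open in fetch_open_projects_per_group().items():
                PROJECTS_OPEN.labels(group_id=group_id).set(n_open)
            time.sleep(60)  # refresh once a minute to keep the DB load low

The other proposed metrics (projects_closed, projects_created, projects_shared_with) would follow the same pattern.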

It has to be noted that we currently run tip.itis.swiss fully distributed over 4 manager machines, so it is hard to clearly state how much "free" CPU we have for the optimizer at any point in time, since the optimizer also competes with platform and ops services on the TIP machines.

Please share your thoughts on this if you like, potentially before the review: @elisabettai @sanderegg @pcrespov @Surfict @odeimaiz
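
To make the per-user resource idea above concrete, here is a sketch of how one could query per-user optimizer CPU usage through the Prometheus HTTP API, assuming the dask-scheduler (or whoever starts the container) sets a user_id docker label and cAdvisor therefore exposes it as container_label_user_id (both label names are assumptions; they do not exist yet):

    # Sketch: per-user optimizer CPU usage, assuming a user_id container label exists.
    # Requires: pip install requests
    import requests

    PROMETHEUS = "https://monitoring.tip.itis.swiss/prometheus"
    QUERY = (
        'sum by (container_label_user_id) ('
        'rate(container_cpu_usage_seconds_total{'
        'image=~"^registry.tip.itis.swiss.*comp.*opt.*", id=~"/docker/.*"}[3m])'
        ') * 100'
    )

    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        user = series["metric"].get("container_label_user_id", "<unknown>")
        print(f"user {user}: {float(series['value'][1]):.1f}% of one CPU")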

@pcrespov pcrespov mentioned this issue Oct 5, 2022

mrnicegyu11 commented Oct 6, 2022

Using a more hacky approach, the number of optimizer runs can be determined using the PromQL query

sum(label_replace(sum without(cpu) (rate(container_cpu_usage_seconds_total{image=~"^registry.tip.itis.swiss.*comp.*opt.*", id=~"/docker/.*"}[3m])), "service_name", "$1", "image", "^registry.*/simcore/services/(.*)") OR on() vector(0))

only in Grafana, choosing the Stat visualization with options as shown below:
[screenshot: Grafana Stat visualization options]

Then, the number of optimizer runs multiplied by two is printed. This is a bit hacky and will likely stop working if many asynchronous, overlapping optimizer runs occur, or if the analyzed time interval is too long.

[screenshot: resulting Stat panel showing the run count]

Note: to the best of my knowledge, Prometheus offers no robust way to count the number of times a timeseries has (or does not have) values, which is what would be needed here. Using manual visual inspection, the number of runs can be deduced from the PromQL query given earlier.
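
As a rough scripted alternative to visual inspection, here is a sketch that pulls the raw series over a time range via the Prometheus HTTP API and counts contiguous active stretches. It assumes the series disappears (or drops to ~0) between runs and that runs do not overlap, so it only approximates the number of runs:

    # Sketch: approximate the number of optimizer runs by counting idle->active
    # transitions of the summed CPU-usage series. Breaks for overlapping runs.
    # Requires: pip install requests
    import time
    import requests

    PROMETHEUS = "https://monitoring.tip.itis.swiss/prometheus"
    QUERY = (
        'sum(rate(container_cpu_usage_seconds_total{'
        'image=~"^registry.tip.itis.swiss.*comp.*opt.*", id=~"/docker/.*"}[3m]))'
    )
    STEP = 60  # seconds between evaluation points

    def count_runs(start: int, end: int) -> int:
        resp = requests.get(
            f"{PROMETHEUS}/api/v1/query_range",
            params={"query": QUERY, "start": start, "end": end, "step": STEP},
            timeout=60,
        )
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        samples = result[0]["values"] if result else []
        active = {int(ts) for ts, val in samples if float(val) > 0.01}
        runs, was_active = 0, False
        for ts in range(start, end + 1, STEP):
            is_active = ts in active
            if is_active and not was_active:  # idle -> active transition = new run
                runs += 1
            was_active = is_active
        return runs

    now = int(time.time())
    print(count_runs(now - 2 * 24 * 3600, now), "optimizer runs in the last 2 days")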

@Konohana0608 added the TIP (Temporal Interference Planning) and PO issue (Created by Product owners) labels on Oct 18, 2023
@Konohana0608 commented:

This function will also be very useful for evaluating the results of the feature voting functionality (#1146). Thank you for the work done so far!
