sim4life.io - WP3: Tracking of resource usage #922

Closed
62 of 64 tasks
Tracked by #878
drniiken opened this issue Apr 4, 2023 · 17 comments
Comments

@drniiken
Member

drniiken commented Apr 4, 2023

Description

To ensure transparency and accuracy in billing, it is essential to collect and keep track of all resource usage from users, including compute time for simulation hours and S3 storage per user as well as egress costs for download of files. This information must be synchronized with what the customer has paid for, and the data must be stored permanently to accommodate the pay-per-use model.

To make this data accessible to all relevant parties, it will be integrated into multiple systems, including the webpage, product, and API, as well as potentially the finance department for billing.

Tasks

  1. YuryHrytsuk mrnicegyu11
  2. matusdrobuliak66
  3. t:enhancement
    matusdrobuliak66 sanderegg
  4. matusdrobuliak66
  5. observability t:enhancement wontfix
    mrnicegyu11

Schoggilebe

  1. matusdrobuliak66
  2. 3 of 4
    GitHK matusdrobuliak66
    odeimaiz

ThisIsSparta

  1. matusdrobuliak66
  2. matusdrobuliak66

Kobayashi Maru

  1. matusdrobuliak66
  2. matusdrobuliak66

Microhistory

  1. t:enhancement
    matusdrobuliak66 sanderegg
  2. matusdrobuliak66
  3. a:resource-usage-tracker
    sanderegg
  4. 1 of 1
    a:dask-service a:director-v2 a:dynamic-sidecar a:webserver
    GitHK matusdrobuliak66
    sanderegg
  5. bisgaard-itis
  6. matusdrobuliak66

Quilmes

  1. matusdrobuliak66
  2. matusdrobuliak66
  3. matusdrobuliak66

Baklava

  1. matusdrobuliak66
  2. GitHK matusdrobuliak66
    sanderegg
  3. 1 of 1
    PO issue y6
    YuryHrytsuk matusdrobuliak66
    mrnicegyu11
  4. matusdrobuliak66

Sundae

  1. matusdrobuliak66
  2. matusdrobuliak66 sanderegg
  3. a:director-v2
    matusdrobuliak66 sanderegg

Watermelon

  1. matusdrobuliak66
  2. matusdrobuliak66
  3. matusdrobuliak66
  4. a:dask-service a:director-v2
    matusdrobuliak66 sanderegg
  5. a:director-v2 a:models-library a:webserver
    sanderegg
  6. 6 of 6
    t:enhancement t:maintenance
    GitHK matusdrobuliak66
    mrnicegyu11 pcrespov sanderegg
  7. a:frontend changelog:✨new-feature
    odeimaiz
  8. 3 of 3
    observability t:enhancement
    mrnicegyu11
@drniiken drniiken changed the title Billing & Tracking/ User Metrics Billing & Tracking: User Metrics Apr 4, 2023
@drniiken drniiken added PO issue Created by Product owners s4l:web sim4life product in osparc.io labels Apr 4, 2023
@drniiken drniiken mentioned this issue Apr 4, 2023
@sanderegg sanderegg added this to the Jelly Beans milestone Apr 5, 2023
@sanderegg
Member

sanderegg commented Apr 6, 2023

Goal for sprint Jelly Beans

  • Use jsmash e2e to collect data (cpu/gpu hours, s3 storage, egress network)
    • see what is missing.

@mrnicegyu11
Member

mrnicegyu11 commented Apr 26, 2023

Update for sprint Jelly Beans

We have a very rough PoC python script that:

  • gets service running time per user per service
  • gets service actual cpu usage per user per service
  • gets service CPU reservation/limitation per user per service

Prometheus is used as a datasource. Potentially, this can serve as a basis for a new osparc-service that collects and exposes resource usage.
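
As a rough illustration of the kind of aggregation described above (this is not the PoC from the linked PR), the sketch below takes the `result` list of a Prometheus range-query response of type `matrix` and estimates running time per user and service by counting samples. The label names `user_id` and `service_key`, and the idea of one sample per fixed step, are assumptions made for the example.

```python
from collections import defaultdict


def running_seconds(matrix_result, step_seconds,
                    user_label="user_id", service_label="service_key"):
    """Estimate running time per (user, service) from a Prometheus matrix result.

    Each series contributes one `step_seconds` interval per returned sample.
    The label names are hypothetical; adapt them to the labels actually exported.
    """
    totals = defaultdict(float)
    for series in matrix_result:
        labels = series.get("metric", {})
        key = (labels.get(user_label, "unknown"),
               labels.get(service_label, "unknown"))
        totals[key] += len(series.get("values", [])) * step_seconds
    return dict(totals)


# Example payload shaped like `data.result` of a range query (made-up values).
sample = [
    {"metric": {"user_id": "1", "service_key": "jupyter-smash"},
     "values": [[1682500000, "1"], [1682500060, "1"], [1682500120, "1"]]},
    {"metric": {"user_id": "2", "service_key": "isolve"},
     "values": [[1682500000, "1"]]},
]
usage = running_seconds(sample, step_seconds=60)
```

Counting samples only approximates the running time to within one step; a production service would probably also want explicit start and stop events.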

To make it robust, we need:

  • Wrap PoC in robust tested osparc code
  • Add a secondary Prometheus that is robustly set up (federated or separate, ideally a managed Prometheus from AWS)
  • Add regular dumps of usage reports to file / S3 / backup for billing and storing these numbers at Z43. This needs to be reliable and tested
  • Potentially interface with postgres to get list of valid userids or resolve userID to email, etc.

The PoC script is found here ITISFoundation/osparc-simcore#4168

@mguidon mguidon changed the title Billing & Tracking: User Metrics sim4life.io - Tracking of resource usage May 9, 2023
@mguidon mguidon changed the title sim4life.io - Tracking of resource usage sim4life.io - WP3: Tracking of resource usage May 9, 2023
@mrnicegyu11
Member

mrnicegyu11 commented May 11, 2023

Question from my side for the sprint planning PastelDeNata (I will be absent on PM2a):

  • Who is billed in case of unscheduled or scheduled maintenance? Imagine a long-running solver job that is killed due to an incident or scheduled downtime. How do we detect it inside our (to be devised) cost system?

  • Is it truly necessary that we track additional metrics apart from simulation container seconds? It is at least conceivable to me to have a business model in which we charge users a "flat rate" price based only on simulation hours, which also includes some additional charge for egress and S3 costs on our side. It would make the whole pipeline much smoother. I don't see value in devising a complicated system to track, for example, egress costs per user at this point (personal opinion). If a user needs to download a file for their scientific project, they are going to do it anyway. A straightforward billing model might be easier for end users to comprehend. I doubt scientists even want to think about whether they cause egress or not. ---> I propose a flat-rate charge based only on simulation seconds.

@mguidon
Member

mguidon commented May 12, 2023

  • Failed jobs will not be billed (we will need to make sure that we disallow starting jobs prior to a downtime). For iSolve it might be possible to estimate the runtime
  • We will have two scenarios:
    1. Cloud users will be billed by simulation hours
    2. Desktop users that submit simulations will be billed by 100 + X % of our costs (ec2 + egress, egress could be tracked by API)
      In general, even if we do not bill it, I would like to know how much S3 storage a user consumes. I would also like to know how much egress cost the video streaming causes (not on a per-user basis). The latter is to get an idea of the cost and whether we should go for an AWS solution for webrtc.
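
To make the "100 + X %" rule for desktop users concrete, here is a small worked sketch. The function name and the numbers are purely illustrative; the thread does not fix a value for X.

```python
def desktop_user_charge(ec2_cost, egress_cost, markup_percent):
    """Charge (100 + X) % of the underlying costs (EC2 + egress).

    `markup_percent` is the X from the rule above; its value is an assumption.
    """
    if markup_percent < 0:
        raise ValueError("markup_percent must be non-negative")
    return (ec2_cost + egress_cost) * (100 + markup_percent) / 100


# Illustrative numbers only: 8.0 for EC2, 2.0 for egress, X = 20.
charge = desktop_user_charge(ec2_cost=8.0, egress_cost=2.0, markup_percent=20)
```

With these made-up inputs the underlying cost is 10.0 and the charge is 120 % of that.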

@pcrespov
Member

pcrespov commented May 12, 2023

Goal for sprint Pastel de Nata

  • Have a script that collects all container seconds per user group for jupyter-smash in AWS prod @mrnicegyu11
  • Do the same with iSolve @mrnicegyu11

@pcrespov pcrespov self-assigned this May 12, 2023
@pcrespov pcrespov modified the milestones: Jelly Beans, Pastel de Nata May 12, 2023
@matusdrobuliak66
Contributor

matusdrobuliak66 commented Jun 23, 2023

Update Watermelon

Working comments

Notes & Discussions:

  • As PromQL doesn't support pagination when querying, we need to come up with a list of regex expressions that covers the containers we would like to monitor (each fetching a reasonable amount of data), and we will run multiple background tasks, one for each item in the list
    • OPEN DISCUSSION (with @mrnicegyu11): should we run all background tasks in one container, or should each background task run as a separate container in our docker swarm?
  • NOTE: potentially introduce a daily aggregated table which will aggregate the data needed for the billing service (so that we can keep the amount of data in the resource_tracker_container table under control).
  • NOTE: points of failure
    • Resource tracking service is unavailable -> The scheduled task in this service can also run retroactively; the only important thing is that Prometheus and its database are available
    • Prometheus outage -> There is a time range where we do not have data because Prometheus was not scraping it -> We will run at least 2 replicas of Prometheus on different nodes
    • There should be another source of data! -> We will store when services were started and stopped from the application.
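
A minimal sketch of the partitioning idea from the notes above: one query per regex pattern and per time window, so that each request stays small and past windows can be replayed after an outage of the tracking service. The patterns, the window size, and the use of the cAdvisor metric `container_cpu_usage_seconds_total` with its `image` label are assumptions for illustration, not the actual configuration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical patterns; a real list would be tuned so each query stays small.
IMAGE_PATTERNS = [".*jupyter-smash.*", ".*isolve.*"]


def build_tasks(patterns, start, end, window):
    """Return one (promql, window_start, window_end) task per pattern and window."""
    tasks = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + window, end)
        for pattern in patterns:
            # Doubled braces produce literal { } in the PromQL selector.
            promql = f'container_cpu_usage_seconds_total{{image=~"{pattern}"}}'
            tasks.append((promql, cursor, window_end))
        cursor = window_end
    return tasks


start = datetime(2023, 6, 1, tzinfo=timezone.utc)
tasks = build_tasks(IMAGE_PATTERNS, start,
                    start + timedelta(hours=3), timedelta(hours=1))
```

Because every task carries its own time range, the same list can be generated for a past interval, which is what makes backfilling possible as long as Prometheus still holds the data.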

@matusdrobuliak66
Contributor

Update Baklava

Done:

  • User center dashboard
  • Collecting and storing events/resources now from all running services (Computational & Dynamic)
  • Created core backend for credit computation/mapping credits to pricing plans
  • Expose top-up credit endpoint to Payment Service

In progress:

  • Expose available credit information to the frontend
  • Provide wallet_id (where to bill) and pricing_plan_id (what to bill) when starting a service via UI or API.

To do:

  • Add a mechanism to close the running status of the service if the resource tracker stops receiving heartbeat events
  • Stop the running service if there are not enough credits
  • Connect simcore to AWS-managed RabbitMQ service
  • Connect resource usage tracker to new AWS-managed Postgres database
    ...

@matusdrobuliak66
Contributor

matusdrobuliak66 commented Oct 5, 2023

Update Quilmes / TheNameless

Done:

  • Showing real credits in the wallet endpoint
  • Sending updates on current available credits in wallets to frontend via websocket
  • Wallet constrained to the specific product (example: osparc, s4l, ...)
  • User default wallet
  • Defining and listing pricing plans

In progress/To do:

  • API for storing the selected pricing plan for user service
  • Extend computation backend to use selected pricing plans
  • Introduce the lost heartbeat mechanism in RUT (to close the transaction if we lose the heartbeat)
  • e2e
  • Send a message to the RabbitMQ exchange when there are no credits left.
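
The lost-heartbeat mechanism mentioned in this list could look roughly like the sketch below. The `ServiceRun` record, the timeout, and the choice to close a run at its last received heartbeat are all assumptions for illustration; the actual RUT implementation may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ServiceRun:
    """Hypothetical record of a tracked service run."""
    run_id: str
    last_heartbeat_at: datetime
    stopped_at: Optional[datetime] = None


def close_stale_runs(runs, now, timeout):
    """Close every open run whose last heartbeat is older than `timeout`.

    Assumption: the run is closed at its last heartbeat, so time that was
    never confirmed by a heartbeat is not counted.
    """
    closed = []
    for run in runs:
        if run.stopped_at is None and now - run.last_heartbeat_at > timeout:
            run.stopped_at = run.last_heartbeat_at
            closed.append(run.run_id)
    return closed


now = datetime(2023, 10, 5, 12, 0, tzinfo=timezone.utc)
runs = [
    ServiceRun("a", now - timedelta(minutes=10)),   # heartbeat lost
    ServiceRun("b", now - timedelta(seconds=30)),   # still alive
]
closed = close_stale_runs(runs, now, timeout=timedelta(minutes=5))
```

A periodic background task would call such a check and then finalize the corresponding credit transaction for each closed run.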

@matusdrobuliak66
Contributor

matusdrobuliak66 commented Oct 31, 2023

Update Microhistory

Done and working

  • Selection of default wallet and pricing plan in the backend if not provided
  • Emit message when 0 credits are reached
  • Do not allow services to be started when 0 credits are reached
  • Default wallet endpoint in the webserver
  • one-time-payment
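
The two zero-credit rules in the list above can be sketched as follows. The function names, the message payload, and the `notify` callback are hypothetical stand-ins; in the real system the message would go to a message broker rather than a Python list.

```python
def can_start_service(available_credits):
    """A service may only start while the wallet still has credits."""
    return available_credits > 0


def apply_usage(available_credits, used_credits, notify):
    """Deduct usage and emit one message when the balance reaches 0 or below.

    `notify` is any callable taking a dict; the payload shape is an assumption.
    """
    remaining = available_credits - used_credits
    if available_credits > 0 and remaining <= 0:
        notify({"event": "credits_exhausted", "remaining": remaining})
    return remaining


messages = []
remaining = apply_usage(available_credits=5, used_credits=5,
                        notify=messages.append)
allowed = can_start_service(remaining)
```

Checking `available_credits > 0` before emitting keeps the message from being sent again on every later deduction once the wallet is already empty.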

should work in 3 weeks

  • Mechanism for handling lost heartbeat (almost done)
  • Billing data in the second database (if there is time left)
  • Admin endpoint for creation of pricing plans/units (if there is time left)

should not be available in 3 weeks

  • Optimization to have an "Incremental SUM of credits" table, so we do not always need to compute the whole history, which may get slow in the long term

@matusdrobuliak66
Contributor

Update ThisIsSparta

Done

Improvements to be done (not blocking release)

  • Create an API for creating pricing plans / pricing details for "ADMIN" users #1133
  • "Incremental SUM" of credits (Currently, when computing credits on the fly, we sum the whole history of transactions. In the long term this is unsustainable, and we need to compute credits incrementally)
  • Continually bill while the transaction is running (Currently we consider the transaction billed only when it is successfully finished)
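
To illustrate the "Incremental SUM" improvement, the sketch below contrasts summing the full transaction history with keeping a running total that is updated on every transaction. The `WalletBalance` class is a hypothetical in-memory stand-in; in practice this would be a database table or column.

```python
from decimal import Decimal


class WalletBalance:
    """Hypothetical wallet that tracks credits both ways for comparison."""

    def __init__(self):
        self.transactions = []
        self.running_total = Decimal("0")

    def add(self, amount):
        """Record a transaction and update the running total in O(1)."""
        self.transactions.append(amount)
        self.running_total += amount

    def balance_from_history(self):
        """Current approach: sum the whole history, O(n) per lookup."""
        return sum(self.transactions, Decimal("0"))

    def balance_incremental(self):
        """Proposed approach: read the stored running total."""
        return self.running_total


wallet = WalletBalance()
for amount in (Decimal("100"), Decimal("-12.5"), Decimal("-7.5")):
    wallet.add(amount)
```

Both methods return the same balance; the incremental one simply avoids re-reading every past transaction as the history grows.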

10 participants