
Speed-test/health-check for Stratum-1's #151

Open
ocaisa opened this issue May 23, 2023 · 18 comments

Comments

@ocaisa
Member

ocaisa commented May 23, 2023

Last week I gave an EESSI tutorial, and running the examples on a vanilla AWS instance was lightning fast from a cold start. In contrast, my runs inside a fresh Magic Castle cluster I brought up today were very slow: the initial run of TensorFlow took 10 minutes (and 36 s when repeating the run).

The main difference I can think of is the response time of the different S1s. Is there any way we can do a speed check for our Stratum-1s to make sure they are operating as fast as we expect them to?
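
A minimal sketch of such a speed check (assuming the goal is just raw HTTP throughput from each S1; only the two hostnames that appear later in this thread are listed, and .cvmfspublished is a tiny file, so fetching a larger object would give a more realistic number):

# Time a small fetch from each Stratum-1 (hostnames taken from this thread; add others as needed)
for S1 in rug-nl.stratum1.cvmfs.eessi-infra.org bgo-no.stratum1.cvmfs.eessi-infra.org; do
  echo "== ${S1} =="
  curl -s -o /dev/null \
    -w 'time_total: %{time_total}s  speed: %{speed_download} B/s\n' \
    "http://${S1}/cvmfs/pilot.eessi-hpc.org/.cvmfspublished"
done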

@ocaisa
Member Author

ocaisa commented May 24, 2023

There is some ongoing discussion about this in Slack, and the (unsurprising) conclusion is that the closer you are to the S1 you use, the faster things are. In the case of AWS, we have an S1 in the same zone, so we get nice fast speeds. My Magic Castle instance has to make plenty of hops to get to RUG...and there may be limitations being imposed by the network.

The Alliance configuration uses a CDN for cases like Magic Castle, and we should probably do something similar. We may even want multiple CDNs: one for use inside Azure, one for AWS, and one for everyone else (Cloudflare). Managing the CDNs would help us control any associated costs (and boost speed where we can).
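
For illustration, switching a client over to such a CDN would just be a matter of overriding the server URL; the hostname below is hypothetical and only shows the shape of the client-side change:

# Hypothetical CDN endpoint in front of the Stratum-1s; the hostname is made up,
# this only illustrates what the client-side override would look like.
echo 'CVMFS_SERVER_URL="http://cdn.eessi-infra.org/cvmfs/@fqrn@"' | \
  sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
sudo cvmfs_config setup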

@terjekv
Member

terjekv commented May 24, 2023

Some timings. AWS VM to AWS S1:

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    0m20.198s
user    0m21.628s
sys     0m3.143s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    0m19.488s
user    0m21.364s
sys     0m3.317s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    0m19.732s
user    0m21.638s
sys     0m3.138s

AWS VM trying to talk to RUG:

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    8m52.507s
user    0m22.010s
sys     0m3.103s

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    19m14.079s
user    0m21.985s
sys     0m3.117s

Nethogs here reports 10KB/s:

 NetHogs version 0.8.7-23-gf281ca3

    PID USER     PROGRAM               DEV         SENT      RECEIVED      
  27525 cvmfs    /usr/bin/cvmfs2       ens5        0.564      10.464 KB/s
  29331 ec2-us.. sshd: ec2-user@pts/1  ens5        0.252       0.103 KB/s
      ? root     unknown TCP           0.000       0.000 KB/s

  TOTAL                                            0.816      10.567 KB/s

Targeting the S0 (also at RUG), we also see poor performance. This is after 5 minutes, and the cache has only been populated with 20 MB...

[ec2-user@ip-172-31-1-106 ~]$ cvmfs_config stat

Running /usr/bin/cvmfs_config stat cvmfs-config.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 27078 48 31996 22 1 1 20735 10240000 0 130560 0 2 33.333 43 318 http://cernvmfs.gridpp.rl.ac.uk/cvmfs/cvmfs-config.cern.ch DIRECT 1

Running /usr/bin/cvmfs_config stat pilot.eessi-hpc.org:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 27525 40 39020 492 1 3 20735 10240000 43 130560 0 280 47.350 9882 55 http://cvmfs-s0.eessi-infra.org/cvmfs/pilot.eessi-hpc.org DIRECT 1

@ocaisa
Member Author

ocaisa commented May 24, 2023

Result from RUG S0 (after S1 tests):

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    50m41.673s
user    0m21.844s
sys     0m3.271s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)   SPEED(K/S)  HOST                                                       PROXY   ONLINE
2.10.1.0  27525  93         41144   492       3           42          796371       10240001     11       130560   0        3208    12.031      211025  63          http://cvmfs-s0.eessi-infra.org/cvmfs/pilot.eessi-hpc.org  DIRECT  1

@ocaisa
Member Author

ocaisa commented May 24, 2023

Tests from the UiB S1 show consistent results:

[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)  SPEED(K/S)  HOST                                                                    PROXY   ONLINE
2.10.1.0  27525  96         39948   492       3           2           23402        10240000     11       130560   0        267     83.806      12568  3863        http://bgo-no.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org  DIRECT  1
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    2m10.015s
user    0m21.881s
sys     0m3.107s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    2m9.182s
user    0m22.165s
sys     0m2.852s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    2m11.770s
user    0m22.078s
sys     0m2.867s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)   SPEED(K/S)  HOST                                                                    PROXY   ONLINE
2.10.1.0  27525  104        41316   492       1           42          796371       10240001     11       130560   0        3208    12.031      211017  1642        http://bgo-no.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org  DIRECT  1

So it does appear to point to something shaping the network traffic at RUG.

@ocaisa
Member Author

ocaisa commented May 24, 2023

The total cache needed to run the TensorFlow example is about 800 MB.

@ocaisa
Member Author

ocaisa commented May 24, 2023

@bedroge We need to identify what is causing the traffic issues at RUG, as this (likely) also impacts the speed of updates to our S1. It's also another reason to push for eessi.io so we can start configuring a CDN.

@boegel
Contributor

boegel commented May 24, 2023

@ocaisa Can you also mention how you enforce using a particular Stratum-1, just in case others want to do some testing too?

@ocaisa
Member Author

ocaisa commented May 25, 2023

Basically, I am editing the EESSI configuration to only give one option and then reconfiguring CVMFS:

# Edit the config file to point to a single S1 option, e.g.,
# CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo vi /etc/cvmfs/domain.d/eessi-hpc.org.conf
# Reconfigure CVMFS 
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
# Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

You can then check the (averaged) bandwidth and cache usage with

cvmfs_config stat pilot.eessi-hpc.org | column -t
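
To double-check which server the client actually ended up talking to (CVMFS can fail over to another host from its list), something like this should work:

# Show the host list and the currently active Stratum-1 for the mounted repository
sudo cvmfs_talk -i pilot.eessi-hpc.org host info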

@terjekv
Member

terjekv commented May 25, 2023

I was wondering if we could add this to some monitoring. I'll see what I can come up with, but it's tricky when the jobs take tens of minutes inside a container; putting timeouts on them may be non-trivial.

@ocaisa
Member Author

ocaisa commented May 25, 2023

We could just do this in the eessi-demo repo with GitHub Actions: one job per S1, run the example three times, and give the jobs a time limit (say 12 minutes for 3 runs). We'd run that every couple of days.
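
Roughly, each per-S1 job would boil down to something like this (a sketch only; S1_HOST would be set per job, and the 720 s budget matches the 12-minute limit mentioned above):

# Pin the client to a single Stratum-1 (S1_HOST is set per monitoring job)
echo "CVMFS_SERVER_URL=\"http://${S1_HOST}/cvmfs/@fqrn@\"" | \
  sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
sudo cvmfs_config setup
# Three cold-cache runs of the TensorFlow demo, with an overall 12-minute budget
timeout 720 bash -c '
  for i in 1 2 3; do
    sudo cvmfs_config wipecache >& /dev/null
    time ./run.sh > /dev/null
  done
'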

@terjekv
Member

terjekv commented May 25, 2023

I did look at that, but from https://docs.github.com/en/site-policy/github-terms/github-terms-for-additional-products-and-features#5-actions-and-packages:

Actions and any elements of the Actions product or service may not be used in violation of the Agreement, the GitHub Acceptable Use Policies, or the GitHub Actions service limitations set forth in the Actions documentation. Additionally, regardless of whether an Action is using self-hosted runners, Actions should not be used for:

My emphasis. This includes:

  • any activity that places a burden on our servers, where that burden is disproportionate to the benefits provided to users (for example, don't use Actions as a content delivery network or as part of a serverless application, but a low benefit Action could be ok if it’s also low burden); or
  • if using GitHub-hosted runners, any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used.

See https://medium.com/average-coder/can-you-use-github-actions-for-monitoring-e9c6cfe79ef4 for a neat idea though.

@bedroge
Collaborator

bedroge commented May 25, 2023

Basically, I am editing the EESSI configuration to only give one option and then reconfiguring CVMFS:

# Edit the config file to point to a single S1 option, e.g.,
# CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo vi /etc/cvmfs/domain.d/eessi-hpc.org.conf
# Reconfigure CVMFS 
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
# Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

You can then check the (averaged) bandwidth and cache usage with

cvmfs_config stat pilot.eessi-hpc.org | column -t

Probably a bit cleaner/easier: you can also make a local config file, in this case /etc/cvmfs/domain.d/eessi-hpc.org.local (instead of .conf), where you can override that server list parameter (or anything else).

@ocaisa
Member Author

ocaisa commented May 25, 2023

Update on the instructions to reproduce:

# Edit the config file to point to a single S1 option, e.g.,
# CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
# Reconfigure CVMFS 
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
# Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

You can then check the (averaged) bandwidth and cache usage with

cvmfs_config stat pilot.eessi-hpc.org | column -t

@ocaisa
Member Author

ocaisa commented May 25, 2023

Something appears to have changed at RUG today, and I am no longer seeing performance issues on either the S0 or the S1 (indeed, performance is significantly better than in previous best-case scenarios):

[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://ssr4cc.hpc.rug.nl/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
CVMFS_SERVER_URL="http://ssr4cc.hpc.rug.nl/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m14.691s
user    0m21.103s
sys     0m3.339s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m14.965s
user    0m21.537s
sys     0m2.890s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m14.755s
user    0m21.270s
sys     0m3.228s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)   SPEED(K/S)  HOST                                                PROXY   ONLINE
2.10.1.0  26515  6          39500   492       2           42          796371       10240001     11       130560   0        3208    12.031      211017  3186        http://ssr4cc.hpc.rug.nl/cvmfs/pilot.eessi-hpc.org  DIRECT  1
[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m18.020s
user    0m21.655s
sys     0m2.996s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m9.987s
user    0m21.722s
sys     0m2.761s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m12.603s
user    0m21.473s
sys     0m3.354s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)   SPEED(K/S)  HOST                                                                    PROXY   ONLINE
2.10.1.0  26515  12         41504   492       1           42          796304       10239934     11       130560   0        3208    12.031      211017  3275        http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org  DIRECT  1

A previous result for the RUG S1 was taken just an hour before these:

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    2m3.672s
user    0m21.503s
sys     0m3.282s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    26m44.860s
user    0m21.737s
sys     0m3.181s

So it's natural to assume that something changed on the RUG side when the issue was raised with the network team there...we just need to figure out what.

EDIT: The assumption about networking at RUG seems to be wrong, leaving us in the unfortunate position of not having a clue why things improved 😞

@ocaisa
Member Author

ocaisa commented May 26, 2023

As it happens, there seems to have been a DDoS attack on CVMFS services around the time we were seeing reduced performance; this may be connected.

@boegel
Contributor

boegel commented May 26, 2023

@ocaisa You saw the bad performance again today though, right? So it wasn't a temporary fluke?

@ocaisa
Member Author

ocaisa commented May 26, 2023

Yes, the problem reoccurred today.

@ocaisa
Member Author

ocaisa commented Jun 8, 2023

PR for this is open in EESSI/eessi-demo#24 (not sure if it's the right location, but it works for now).
