Speed-test/health-check for Stratum-1s #151
There is some discussion ongoing for this in Slack, and the (unsurprising) conclusion is that the closer you are to the S1 you use, the faster things are. In the case of AWS, we have an S1 in the same zone, so we get nice fast speeds. My Magic Castle instance has to make plenty of hops to get to RUG... and there may be limitations being imposed by the network. The Alliance configuration uses a CDN for cases like Magic Castle, and we should probably do something similar. We may even want multiple CDNs: one for use inside Azure, one for AWS, one for everyone else (Cloudflare). Managing CDNs will help us control any associated costs (and boost speed where we can).
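As a rough illustration only (the CDN hostnames below are hypothetical placeholders, not existing EESSI endpoints), a multi-server/CDN setup could use the fact that CVMFS accepts a semicolon-separated list of server URLs and can geosort them:

```bash
# Hypothetical sketch of /etc/cvmfs/domain.d/eessi-hpc.org.local
# The CDN hostnames below are placeholders, not real EESSI endpoints.
CVMFS_SERVER_URL="http://azure-cdn.example.org/cvmfs/@fqrn@;http://aws-cdn.example.org/cvmfs/@fqrn@;http://cloudflare-cdn.example.org/cvmfs/@fqrn@"
# Let the client order the servers by geographic proximity
CVMFS_USE_GEOAPI=yes
```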
Some timings. AWS VM to the AWS S1:
AWS VM trying to talk to RUG:
Nethogs here reports 10 KB/s:
Targeting the S0 (also at RUG) we also see poor performance. This is after 5 minutes, and the cache has only been populated with 20 MB...
Result from RUG S0 (after S1 tests):
Tests from the UiB S1 show consistent results:
so it does appear to point to something shaping the network traffic at RUG.
The total cache needed to run the Tensorflow example is about 800 MB.
@bedroge We need to identify what is causing the traffic issues at RUG, as this (likely) also impacts the speed of updates to our S1. It is also another reason to push ...
@ocaisa Can you also mention how you enforce using a particular Stratum-1, just in case others want to do some testing too?
Basically, I am editing the EESSI configuration to only give one option and then reconfiguring CVMFS:

# Edit the config file to point to a single S1 option, e.g.,
# CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo vi /etc/cvmfs/domain.d/eessi-hpc.org.conf
# Reconfigure CVMFS
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
# Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

You can then check the (averaged) bandwidth and cache usage with cvmfs_config stat.
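For example, something like the following; the awk field numbers are an assumption based on the column order of the cvmfs_config stat output shown further down in this thread:

```bash
# Full statistics for the EESSI pilot repository
cvmfs_config stat pilot.eessi-hpc.org | column -t
# Pull out just the cache usage, bytes received and average download speed
# (field positions follow the CACHEUSE(K), RX(K) and SPEED(K/S) columns)
cvmfs_config stat pilot.eessi-hpc.org | awk 'NR==2 {print "cache(K):", $8, "rx(K):", $15, "speed(K/s):", $16}'
```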
I was wondering if we could add this to some monitoring. I'll see what I can come up with, but it's tricky when the jobs take tens of minutes within a container; adding timeouts to them may be non-trivial.
We could just do this in the eessi-demo repo with GitHub Actions: one job per S1, run the example three times, and give the jobs a time limit (say 12 minutes for 3 runs). We would then run that every couple of days.
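A rough sketch of what such a per-S1 job step could look like (the S1 hostname and the run.sh example follow the commands elsewhere in this thread, and the 12-minute budget is just the number suggested above; this is not an agreed workflow):

```bash
#!/bin/bash
# Sketch of a per-Stratum-1 benchmark step for a CI job.
# Assumes the working directory is an eessi-demo example (e.g. Tensorflow)
# and that the runner has passwordless sudo.
set -e

S1_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"

# Pin the client to a single Stratum-1 via a local override file
echo "CVMFS_SERVER_URL=\"${S1_URL}\"" | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
sudo cvmfs_config setup

# Run the example three times from a cold cache, aborting the whole
# step if it exceeds the suggested 12-minute budget
timeout 12m bash -c '
  for i in 1 2 3; do
    echo "run ${i}"
    sudo cvmfs_config wipecache > /dev/null 2>&1
    time ./run.sh > /dev/null
  done
'

# Report the averaged bandwidth and cache usage for this Stratum-1
cvmfs_config stat pilot.eessi-hpc.org | column -t
```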
I did look at that, but from https://docs.github.com/en/site-policy/github-terms/github-terms-for-additional-products-and-features#5-actions-and-packages:
My emphasis. This includes:
See https://medium.com/average-coder/can-you-use-github-actions-for-monitoring-e9c6cfe79ef4 for a neat idea, though.
Probably a bit cleaner/easier: you can also make a local config file; in this case that would be /etc/cvmfs/domain.d/eessi-hpc.org.local.
Update on the instructions to reproduce:

# Edit the config file to point to a single S1 option, e.g.,
# CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
# Reconfigure CVMFS
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
# Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

You can then check the (averaged) bandwidth and cache usage with cvmfs_config stat pilot.eessi-hpc.org, as before.
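Not covered in the thread, but presumably to undo the pinning after testing you would just remove the local override again; a minimal sketch:

```bash
# Remove the local override and restore the default EESSI server list
sudo rm /etc/cvmfs/domain.d/eessi-hpc.org.local
sudo cvmfs_config setup
# Optionally reload the running client so the change takes effect immediately
sudo cvmfs_config reload
```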
Something appears to have changed at RUG today and I am no longer seeing performance issues on either the S0 or the S1 (indeed, performance is significantly better than in previous best-case scenarios):

[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://ssr4cc.hpc.rug.nl/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
CVMFS_SERVER_URL="http://ssr4cc.hpc.rug.nl/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m14.691s
user 0m21.103s
sys 0m3.339s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m14.965s
user 0m21.537s
sys 0m2.890s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m14.755s
user 0m21.270s
sys 0m3.228s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 26515 6 39500 492 2 42 796371 10240001 11 130560 0 3208 12.031 211017 3186 http://ssr4cc.hpc.rug.nl/cvmfs/pilot.eessi-hpc.org DIRECT 1
[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m18.020s
user 0m21.655s
sys 0m2.996s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m9.987s
user 0m21.722s
sys 0m2.761s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m12.603s
user 0m21.473s
sys 0m3.354s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 26515 12 41504 492 1 42 796304 10239934 11 130560 0 3208 12.031 211017 3275 http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org DIRECT 1

A previous result for the RUG S1 was taken just an hour before these:

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 2m3.672s
user 0m21.503s
sys 0m3.282s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 26m44.860s
user 0m21.737s
sys 0m3.181s

So it is natural to assume that something changed on the RUG side when the issue was raised with the network team there... we just need to figure out what.

EDIT: The assumption about networking at RUG seems to be wrong, leaving us in the unfortunate position of not having a clue why things improved 😞
As it happens, there seems to have been a DDoS attack on CVMFS services around the time we were seeing reduced performance; this may be connected.
@ocaisa You saw the bad performance again today though, right? So it wasn't a temporary fluke?
Yes, the problem reoccurred today.
A PR for this is open in EESSI/eessi-demo#24 (not sure if it's the right location, but it works for now).
Last week I gave an EESSI tutorial, and running the examples on a vanilla instance on AWS was lightning fast from a cold start. In contrast, my runs inside a fresh Magic Castle cluster I brought up today were very slow: it took 10 minutes for the initial run of Tensorflow (and 36 s when repeating the run).
The main difference I can think of is the response time from the different S1s. Is there any way we can do a speed check for our Stratum-1s to make sure they are operating as fast as we expect them to?
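As a starting point for such a check (only a rough sketch, not an agreed approach; the server list below just reuses the hostname mentioned elsewhere in this thread), one could time a small, always-present request against each Stratum-1, e.g. fetching the repository's .cvmfspublished manifest:

```bash
#!/bin/bash
# Rough responsiveness probe for EESSI Stratum-1 servers.
# Add further server URLs to the list as needed.
REPO="pilot.eessi-hpc.org"
SERVERS=(
  "http://rug-nl.stratum1.cvmfs.eessi-infra.org"
)

for s in "${SERVERS[@]}"; do
  # .cvmfspublished sits at the root of every CVMFS repository,
  # so fetching it gives a cheap end-to-end response-time check
  t=$(curl -s -o /dev/null -w '%{time_total}' "${s}/cvmfs/${REPO}/.cvmfspublished")
  echo "${s}: ${t}s"
done
```

Note that this only checks responsiveness, not sustained bandwidth; for the latter, timing a cold-cache run of one of the eessi-demo examples is more representative.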