-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pyout might be dragging us behind on large dandisets #1549
Labels
performance
Improve performance of an existing feature
Comments
just notes: looking at the shared log, where process was 100% busy with just only with a handful of retries and errors overall reported❯ grep 'and retry' 2024.11.27-16.02.28Z-3234289.log
2024-12-04T23:47:56-0500 [DEBUG ] dandi 3234289:140581367957248 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Calretinin_SPIM.ome.zarr/1/0/32/28 - download failed on attempt #1: HTTPSConnectionPool(host='api.dandiarchive.org', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-05T00:54:30-0500 [DEBUG ] dandi 3234289:140581393135360 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-NeuN_SPIM.ome.zarr/0/0/35/68 - download failed on attempt #1: HTTPSConnectionPool(host='api.dandiarchive.org', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-05T06:09:59-0500 [DEBUG ] dandi 3234289:140583565764352 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Nuclei_SPIM.ome.zarr/0/0/65/79 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out., will sleep a bit and retry
2024-12-05T10:50:28-0500 [DEBUG ] dandi 3234289:140582978569984 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Somatostatin_SPIM.ome.zarr/0/0/88/46 - download failed on attempt #1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), will sleep a bit and retry
2024-12-05T19:34:15-0500 [DEBUG ] dandi 3234289:140582450091776 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS27_stain-NeuN_SPIM.ome.zarr/0/1/22/78 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-05T20:04:17-0500 [DEBUG ] dandi 3234289:140582450091776 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS27_stain-NeuN_SPIM.ome.zarr/0/1/48/45 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-05T21:08:22-0500 [DEBUG ] dandi 3234289:140582450091776 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS27_stain-NeuN_SPIM.ome.zarr/1/0/3/31 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-06T01:01:15-0500 [DEBUG ] dandi 3234289:140583590942464 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS27_stain-Nuclei_SPIM.ome.zarr/1/0/19/14 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-08T19:29:04-0500 [DEBUG ] dandi 3234289:140580269037312 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-NeuN_SPIM.ome.zarr/0/0/32/24 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out., will sleep a bit and retry
2024-12-09T17:51:01-0500 [DEBUG ] dandi 3234289:140582986962688 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-Somatostatin_SPIM.ome.zarr/1/0/22/20 - download failed on attempt #1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), will sleep a bit and retry
❯ grep Error 2024.11.27-16.02.28Z-3234289.log
2024-12-05T06:09:59-0500 [DEBUG ] dandi 3234289:140583565764352 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Nuclei_SPIM.ome.zarr/0/0/65/79.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.ConnectionError'>, HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out.
2024-12-05T10:50:27-0500 [DEBUG ] dandi 3234289:140582978569984 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Somatostatin_SPIM.ome.zarr/0/0/88/46.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.ConnectionError'>, ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2024-12-05T10:50:28-0500 [DEBUG ] dandi 3234289:140582978569984 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Somatostatin_SPIM.ome.zarr/0/0/88/46 - download failed on attempt #1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), will sleep a bit and retry
2024-12-06T20:37:16-0500 [DEBUG ] dandi 3234289:140582466877184 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS28_stain-Nuclei_SPIM.ome.zarr/0/0/76/11.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.HTTPError'>, 500 Server Error: Internal Server Error for url: https://dandiarchive.s3.amazonaws.com/zarr/69fc58ee-92b8-4cfe-a9fa-7e3ce4c163fc/0/0/76/11?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20241207%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20241207T013711Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=2cd25ddb59736529ff4b32489a809ab57ef87a9702c6d090ca1e89fb2c02d879
2024-12-06T20:37:16-0500 [DEBUG ] dandi 3234289:140582466877184 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS28_stain-Nuclei_SPIM.ome.zarr/0/0/76/11 - download failed: 500 Server Error: Internal Server Error for url: https://dandiarchive.s3.amazonaws.com/zarr/69fc58ee-92b8-4cfe-a9fa-7e3ce4c163fc/0/0/76/11?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20241207%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20241207T013711Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=2cd25ddb59736529ff4b32489a809ab57ef87a9702c6d090ca1e89fb2c02d879
AssertionError
2024-12-08T19:29:04-0500 [DEBUG ] dandi 3234289:140580269037312 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-NeuN_SPIM.ome.zarr/0/0/32/24.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.ConnectionError'>, HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out.
2024-12-09T17:51:01-0500 [DEBUG ] dandi 3234289:140582986962688 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-Somatostatin_SPIM.ome.zarr/1/0/22/20.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.ConnectionError'>, ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2024-12-09T17:51:01-0500 [DEBUG ] dandi 3234289:140582986962688 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-Somatostatin_SPIM.ome.zarr/1/0/22/20 - download failed on attempt #1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), will sleep a bit and retry
and there was an interesting singular AssertionError triggered -- so code path there needs analysis/ fixup
edit 1: overall, among 130sec of total time for 'main' we get only 30 sec for reading data (download etc), 69sec for interleave::get (! based on Own) and around 47sec in jsonschema validation (based on Total but there should be no children within our code there).sh -J 1:1 https://api.dandiarchive.org/api/assets/f0da6521-b9c7-4fa5-9405-c48527aebe90/download/' (python v3.12.7)
Total Samples 97317
GIL: 13.00%, Active: 21.00%, Threads: 2
%Own %Total OwnTime TotalTime Function (filename)
0.00% 17.00% 0.000s 131.2s main (click/core.py)
0.00% 17.00% 0.010s 131.2s download (dandi/download.py)
0.00% 17.00% 0.000s 131.2s download (dandi/cli/cmd_download.py)
0.00% 17.00% 0.000s 131.2s <module> (dandi)
0.00% 17.00% 0.000s 131.2s __call__ (click/core.py)
0.00% 17.00% 0.000s 131.2s wrapper (dandi/cli/base.py)
0.00% 17.00% 0.000s 131.2s invoke (click/core.py)
0.00% 17.00% 0.000s 131.2s new_func (click/decorators.py)
0.00% 4.00% 0.000s 70.35s _download_generator_guard (dandi/download.py)
0.00% 4.00% 0.010s 70.35s _download_zarr (dandi/download.py)
0.00% 4.00% 0.000s 70.35s download_generator (dandi/download.py)
0.00% 4.00% 0.770s 70.35s <genexpr> (dandi/download.py)
0.00% 4.00% 0.000s 69.09s __next__ (interleave/__init__.py)
4.00% 4.00% 69.08s 69.09s get (interleave/__init__.py)
0.00% 13.00% 0.010s 60.76s wrapped (pyout/interface.py)
0.00% 13.00% 0.000s 60.75s __call__ (pyout/interface.py)
0.00% 13.00% 0.020s 60.52s _write (pyout/interface.py)
0.00% 13.00% 0.020s 60.43s _write_update (pyout/interface.py)
0.00% 13.00% 0.050s 59.70s update (pyout/common.py)
0.00% 13.00% 0.060s 59.11s render (pyout/common.py)
0.00% 13.00% 0.010s 54.55s _render (pyout/common.py)
0.00% 13.00% 0.000s 54.54s join (pyout/common.py)
0.00% 9.00% 0.100s 47.53s _proc_group (pyout/common.py)
0.00% 4.00% 0.000s 47.46s _bootstrap (threading.py)
0.00% 4.00% 0.000s 47.46s run (threading.py)
0.00% 4.00% 0.000s 47.46s _bootstrap_inner (threading.py)
0.00% 4.00% 0.010s 47.46s _worker (concurrent/futures/thread.py)
0.00% 4.00% 0.010s 47.45s run (concurrent/futures/thread.py)
0.00% 4.00% 0.030s 47.43s _process (interleave/__init__.py)
0.00% 4.00% 0.020s 47.39s pairing (dandi/download.py)
2.00% 4.00% 10.13s 47.37s _download_file (dandi/download.py)
0.00% 9.00% 0.010s 47.30s validate (pyout/elements.py)
0.00% 9.00% 0.060s 47.29s validate (jsonschema/validators.py)
0.00% 9.00% 0.090s 46.83s iter_errors (jsonschema/validators.py)
0.00% 8.00% 5.09s 46.48s descend (jsonschema/validators.py)
0.00% 8.00% 0.340s 45.32s additionalProperties (jsonschema/_keywords.py)
2.00% 9.00% 2.26s 45.13s properties (jsonschema/_keywords.py)
0.00% 8.00% 1.04s 44.88s ref (jsonschema/_keywords.py)
0.00% 9.00% 0.010s 38.98s check_schema (jsonschema/validators.py)
0.00% 2.00% 0.000s 32.52s downloader (dandi/dandiapi.py)
0.00% 1.00% 0.040s 29.69s readinto (socket.py)
1.00% 1.00% 29.65s 29.65s read (ssl.py)
0.00% 1.00% 0.000s 29.65s recv_into (ssl.py) Numbers do not necessarily add up nicely but seems do need investigation on why
|
yarikoptic
added a commit
that referenced
this issue
Dec 11, 2024
Somehow, with some prior version of the code, but we did get this assertion triggered as shown in #1549 (comment) although it remains unclear how that could have happened if type checking is correct since we must not end up there with None or even 0. Unfortunately with a regular assert we just get an announcement of the fact and no details. After this change we hope to be able to see more details which might hint on the original issue.
This was referenced Dec 11, 2024
yarikoptic
added a commit
to pyout/pyout
that referenced
this issue
Dec 12, 2024
While troubleshooting slow "dandi download" we saw that considerable amount of time is spent on jsonschema validation (and logging, not considered here). See dandi/dandi-cli#1549 It is happening since we be providing styling per each row, which might differ for summary line etc. And we might be re-validating over and over again. The obvious solution, without trying to fix logic upstairs, is to use @lru_cache. But it could be not used as-is since style is a dict (mutable), nested, and can have function callbacks. Conversion of such style to `str` is relatively fast and should be good enough for jsonschema validation. Unfortunately an initial attempt to just wrap into some "Hashable" instance lead to jsonschema pukes. So I decided to wrap/unwrap the value and use has of md5 over str representation. Seems to work nicely and I hope it would eliminate some elevate performance issue a little (it is not an ultimate solution in original issue -- likely need to "slow down" records reporting).
yarikoptic
added a commit
to pyout/pyout
that referenced
this issue
Dec 12, 2024
While troubleshooting slow "dandi download" we saw that considerable amount of time is spent on jsonschema validation (and logging, not considered here). See dandi/dandi-cli#1549 It is happening since we be providing styling per each row, which might differ for summary line etc. And we might be re-validating over and over again. The obvious solution, without trying to fix logic upstairs, is to use @lru_cache. But it could be not used as-is since style is a dict (mutable), nested, and can have function callbacks. Conversion of such style to `str` is relatively fast and should be good enough for jsonschema validation. Unfortunately an initial attempt to just wrap into some "Hashable" instance lead to jsonschema pukes. So I decided to wrap/unwrap the value and use has of md5 over str representation. Seems to work nicely and I hope it would eliminate some elevate performance issue a little (it is not an ultimate solution in original issue -- likely need to "slow down" records reporting).
yarikoptic
added a commit
to pyout/pyout
that referenced
this issue
Dec 12, 2024
While troubleshooting slow "dandi download" we saw that considerable amount of time is spent on jsonschema validation (and logging, not considered here). See dandi/dandi-cli#1549 It is happening since we be providing styling per each row, which might differ for summary line etc. And we might be re-validating over and over again. The obvious solution, without trying to fix logic upstairs, is to use @lru_cache. But it could be not used as-is since style is a dict (mutable), nested, and can have function callbacks. Conversion of such style to `str` is relatively fast and should be good enough for jsonschema validation. Unfortunately an initial attempt to just wrap into some "Hashable" instance lead to jsonschema pukes. So I decided to wrap/unwrap the value and use has of md5 over str representation. Seems to work nicely and I hope it would eliminate some elevate performance issue a little (it is not an ultimate solution in original issue -- likely need to "slow down" records reporting).
yarikoptic
added a commit
to pyout/pyout
that referenced
this issue
Dec 13, 2024
While troubleshooting slow "dandi download" we saw that considerable amount of time is spent on jsonschema validation (and logging, not considered here). See dandi/dandi-cli#1549 It is happening since we be providing styling per each row, which might differ for summary line etc. And we might be re-validating over and over again. The obvious solution, without trying to fix logic upstairs, is to use @lru_cache. But it could be not used as-is since style is a dict (mutable), nested, and can have function callbacks. Conversion of such style to `str` is relatively fast and should be good enough for jsonschema validation. Unfortunately an initial attempt to just wrap into some "Hashable" instance lead to jsonschema pukes. So I decided to wrap/unwrap the value and use has of md5 over str representation. Seems to work nicely and I hope it would eliminate some elevate performance issue a little (it is not an ultimate solution in original issue -- likely need to "slow down" records reporting).
yarikoptic
added a commit
that referenced
this issue
Dec 16, 2024
…r file or when done This is primarily to resolve the problem of pyout "dragging us behind" which slows down entire download process. #1549
yarikoptic
added a commit
that referenced
this issue
Dec 16, 2024
…r file or when done This is primarily to resolve the problem of pyout "dragging us behind" which slows down entire download process. #1549
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
While troubleshooting with @Kevancic the download of 000026 at MGH we observed that process is at 100% while not downloading much
py-spy top gave following
invocation was
note: we still were logging at DEBUG level in our log file, there were no pyout (?) logs there, but still were spending time validating using jsonschema every bloody record...
In general it should not be a problem but I think it is that within long running process and some "summary" or alike events start to pile up.
The text was updated successfully, but these errors were encountered: