Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyout might be dragging us behind on large dandisets #1549

Open
yarikoptic opened this issue Dec 11, 2024 · 1 comment
Open

pyout might be dragging us behind on large dandisets #1549

yarikoptic opened this issue Dec 11, 2024 · 1 comment
Labels
performance Improve performance of an existing feature

Comments

@yarikoptic
Copy link
Member

yarikoptic commented Dec 11, 2024

While troubleshooting with @Kevancic the download of 000026 at MGH we observed that process is at 100% while not downloading much

py-spy top gave following

image

invocation was

DANDI_DEVEL_AGGRESSIVE_RETRY=1 dandi --log-level=INFO download -e refresh -J 1:1 https://dandiarchive.org/dandiset/000026/draft/files?location=sub-I48

note: we still were logging at DEBUG level in our log file, there were no pyout (?) logs there, but still were spending time validating using jsonschema every bloody record...

In general it should not be a problem but I think it is that within long running process and some "summary" or alike events start to pile up.

@yarikoptic yarikoptic added the performance Improve performance of an existing feature label Dec 11, 2024
@yarikoptic
Copy link
Member Author

yarikoptic commented Dec 11, 2024

just notes: looking at the shared log, where process was 100% busy with just -J 1:1 we see one individual file in zarr at a time about each second

only with a handful of retries and errors overall reported
❯ grep 'and retry' 2024.11.27-16.02.28Z-3234289.log
2024-12-04T23:47:56-0500 [DEBUG   ] dandi 3234289:140581367957248 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Calretinin_SPIM.ome.zarr/1/0/32/28 - download failed on attempt #1: HTTPSConnectionPool(host='api.dandiarchive.org', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-05T00:54:30-0500 [DEBUG   ] dandi 3234289:140581393135360 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-NeuN_SPIM.ome.zarr/0/0/35/68 - download failed on attempt #1: HTTPSConnectionPool(host='api.dandiarchive.org', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-05T06:09:59-0500 [DEBUG   ] dandi 3234289:140583565764352 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Nuclei_SPIM.ome.zarr/0/0/65/79 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out., will sleep a bit and retry
2024-12-05T10:50:28-0500 [DEBUG   ] dandi 3234289:140582978569984 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Somatostatin_SPIM.ome.zarr/0/0/88/46 - download failed on attempt #1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), will sleep a bit and retry
2024-12-05T19:34:15-0500 [DEBUG   ] dandi 3234289:140582450091776 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS27_stain-NeuN_SPIM.ome.zarr/0/1/22/78 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-05T20:04:17-0500 [DEBUG   ] dandi 3234289:140582450091776 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS27_stain-NeuN_SPIM.ome.zarr/0/1/48/45 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-05T21:08:22-0500 [DEBUG   ] dandi 3234289:140582450091776 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS27_stain-NeuN_SPIM.ome.zarr/1/0/3/31 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-06T01:01:15-0500 [DEBUG   ] dandi 3234289:140583590942464 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS27_stain-Nuclei_SPIM.ome.zarr/1/0/19/14 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out. (read timeout=30), will sleep a bit and retry
2024-12-08T19:29:04-0500 [DEBUG   ] dandi 3234289:140580269037312 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-NeuN_SPIM.ome.zarr/0/0/32/24 - download failed on attempt #1: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out., will sleep a bit and retry
2024-12-09T17:51:01-0500 [DEBUG   ] dandi 3234289:140582986962688 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-Somatostatin_SPIM.ome.zarr/1/0/22/20 - download failed on attempt #1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), will sleep a bit and retry
❯ grep Error 2024.11.27-16.02.28Z-3234289.log
2024-12-05T06:09:59-0500 [DEBUG   ] dandi 3234289:140583565764352 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Nuclei_SPIM.ome.zarr/0/0/65/79.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.ConnectionError'>, HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out.
2024-12-05T10:50:27-0500 [DEBUG   ] dandi 3234289:140582978569984 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Somatostatin_SPIM.ome.zarr/0/0/88/46.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.ConnectionError'>, ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2024-12-05T10:50:28-0500 [DEBUG   ] dandi 3234289:140582978569984 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS26_stain-Somatostatin_SPIM.ome.zarr/0/0/88/46 - download failed on attempt #1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), will sleep a bit and retry
2024-12-06T20:37:16-0500 [DEBUG   ] dandi 3234289:140582466877184 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS28_stain-Nuclei_SPIM.ome.zarr/0/0/76/11.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.HTTPError'>, 500 Server Error: Internal Server Error for url: https://dandiarchive.s3.amazonaws.com/zarr/69fc58ee-92b8-4cfe-a9fa-7e3ce4c163fc/0/0/76/11?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20241207%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20241207T013711Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=2cd25ddb59736529ff4b32489a809ab57ef87a9702c6d090ca1e89fb2c02d879
2024-12-06T20:37:16-0500 [DEBUG   ] dandi 3234289:140582466877184 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS28_stain-Nuclei_SPIM.ome.zarr/0/0/76/11 - download failed: 500 Server Error: Internal Server Error for url: https://dandiarchive.s3.amazonaws.com/zarr/69fc58ee-92b8-4cfe-a9fa-7e3ce4c163fc/0/0/76/11?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20241207%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20241207T013711Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=2cd25ddb59736529ff4b32489a809ab57ef87a9702c6d090ca1e89fb2c02d879
AssertionError
2024-12-08T19:29:04-0500 [DEBUG   ] dandi 3234289:140580269037312 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-NeuN_SPIM.ome.zarr/0/0/32/24.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.ConnectionError'>, HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Read timed out.
2024-12-09T17:51:01-0500 [DEBUG   ] dandi 3234289:140582986962688 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-Somatostatin_SPIM.ome.zarr/1/0/22/20.dandidownload - entered __exit__ with position 0 with exception: <class 'requests.exceptions.ConnectionError'>, ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2024-12-09T17:51:01-0500 [DEBUG   ] dandi 3234289:140582986962688 sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS25_stain-Somatostatin_SPIM.ome.zarr/1/0/22/20 - download failed on attempt #1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), will sleep a bit and retry
and there was an interesting singular AssertionError triggered -- so code path there needs analysis/ fixup
2024-12-06T20:59:54-0500 [ERROR   ] dandi 3234289:140586896269824 Caught while downloading sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS28_stain-Nuclei_SPIM.ome.zarr:
Traceback (most recent call last):
  File "/autofs/cluster/exvivo3/dandi_env_2024/lib/python3.9/site-packages/dandi/download.py", line 372, in _download_generator_guard
    yield from generator
  File "/autofs/cluster/exvivo3/dandi_env_2024/lib/python3.9/site-packages/dandi/download.py", line 974, in _download_zarr
    for path, status in it:
  File "/autofs/cluster/exvivo3/dandi_env_2024/lib/python3.9/site-packages/interleave/__init__.py", line 346, in __next__
    return self.get()
  File "/autofs/cluster/exvivo3/dandi_env_2024/lib/python3.9/site-packages/interleave/__init__.py", line 380, in get
    self._end()
  File "/autofs/cluster/exvivo3/dandi_env_2024/lib/python3.9/site-packages/interleave/__init__.py", line 400, in _end
    e.get()
  File "/autofs/cluster/exvivo3/dandi_env_2024/lib/python3.9/site-packages/interleave/__init__.py", line 110, in get
    raise e.with_traceback(tb) from None
  File "/autofs/cluster/exvivo3/dandi_env_2024/lib/python3.9/site-packages/interleave/__init__.py", line 274, in _process
    x = next(it)
  File "/autofs/cluster/exvivo3/dandi_env_2024/lib/python3.9/site-packages/dandi/download.py", line 1117, in pairing
    for d in gen:
  File "/autofs/cluster/exvivo3/dandi_env_2024/lib/python3.9/site-packages/dandi/download.py", line 761, in _download_file
    assert isinstance(attempts_allowed_or_not, int)
AssertionError
edit 1: overall, among 130sec of total time for 'main' we get only 30 sec for reading data (download etc), 69sec for interleave::get (! based on Own) and around 47sec in jsonschema validation (based on Total but there should be no children within our code there).
sh -J 1:1 https://api.dandiarchive.org/api/assets/f0da6521-b9c7-4fa5-9405-c48527aebe90/download/' (python v3.12.7)
Total Samples 97317
GIL: 13.00%, Active: 21.00%, Threads: 2

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                            
  0.00%  17.00%   0.000s    131.2s   main (click/core.py)
  0.00%  17.00%   0.010s    131.2s   download (dandi/download.py)
  0.00%  17.00%   0.000s    131.2s   download (dandi/cli/cmd_download.py)
  0.00%  17.00%   0.000s    131.2s   <module> (dandi)
  0.00%  17.00%   0.000s    131.2s   __call__ (click/core.py)
  0.00%  17.00%   0.000s    131.2s   wrapper (dandi/cli/base.py)
  0.00%  17.00%   0.000s    131.2s   invoke (click/core.py)
  0.00%  17.00%   0.000s    131.2s   new_func (click/decorators.py)
  0.00%   4.00%   0.000s    70.35s   _download_generator_guard (dandi/download.py)
  0.00%   4.00%   0.010s    70.35s   _download_zarr (dandi/download.py)
  0.00%   4.00%   0.000s    70.35s   download_generator (dandi/download.py)
  0.00%   4.00%   0.770s    70.35s   <genexpr> (dandi/download.py)
  0.00%   4.00%   0.000s    69.09s   __next__ (interleave/__init__.py)
  4.00%   4.00%   69.08s    69.09s   get (interleave/__init__.py)
  0.00%  13.00%   0.010s    60.76s   wrapped (pyout/interface.py)
  0.00%  13.00%   0.000s    60.75s   __call__ (pyout/interface.py)
  0.00%  13.00%   0.020s    60.52s   _write (pyout/interface.py)
  0.00%  13.00%   0.020s    60.43s   _write_update (pyout/interface.py)
  0.00%  13.00%   0.050s    59.70s   update (pyout/common.py)
  0.00%  13.00%   0.060s    59.11s   render (pyout/common.py)
  0.00%  13.00%   0.010s    54.55s   _render (pyout/common.py)
  0.00%  13.00%   0.000s    54.54s   join (pyout/common.py)
  0.00%   9.00%   0.100s    47.53s   _proc_group (pyout/common.py)
  0.00%   4.00%   0.000s    47.46s   _bootstrap (threading.py)
  0.00%   4.00%   0.000s    47.46s   run (threading.py)
  0.00%   4.00%   0.000s    47.46s   _bootstrap_inner (threading.py)
  0.00%   4.00%   0.010s    47.46s   _worker (concurrent/futures/thread.py)
  0.00%   4.00%   0.010s    47.45s   run (concurrent/futures/thread.py)
  0.00%   4.00%   0.030s    47.43s   _process (interleave/__init__.py)
  0.00%   4.00%   0.020s    47.39s   pairing (dandi/download.py)
  2.00%   4.00%   10.13s    47.37s   _download_file (dandi/download.py)
  0.00%   9.00%   0.010s    47.30s   validate (pyout/elements.py)
  0.00%   9.00%   0.060s    47.29s   validate (jsonschema/validators.py)
  0.00%   9.00%   0.090s    46.83s   iter_errors (jsonschema/validators.py)
  0.00%   8.00%    5.09s    46.48s   descend (jsonschema/validators.py)
  0.00%   8.00%   0.340s    45.32s   additionalProperties (jsonschema/_keywords.py)
  2.00%   9.00%    2.26s    45.13s   properties (jsonschema/_keywords.py)
  0.00%   8.00%    1.04s    44.88s   ref (jsonschema/_keywords.py)
  0.00%   9.00%   0.010s    38.98s   check_schema (jsonschema/validators.py)
  0.00%   2.00%   0.000s    32.52s   downloader (dandi/dandiapi.py)
  0.00%   1.00%   0.040s    29.69s   readinto (socket.py)
  1.00%   1.00%   29.65s    29.65s   read (ssl.py)
  0.00%   1.00%   0.000s    29.65s   recv_into (ssl.py)

Numbers do not necessarily add up nicely but seems do need investigation on why

  • interleave::get is "expensive" #1551
  • and jsonschema consuming so much for pyout. May be we could/should completely disable validation there since records are not obtained from external sources

yarikoptic added a commit that referenced this issue Dec 11, 2024
Somehow, with some prior version of the code, but we did
get this assertion triggered as shown in
#1549 (comment)
although it remains unclear how that could have happened if type checking
is correct since we must not end up there with None or even 0.

Unfortunately with a regular assert we just get an announcement of the fact
and no details.  After this change we hope to be able to see more details
which might hint on the original issue.
yarikoptic added a commit to pyout/pyout that referenced this issue Dec 12, 2024
While troubleshooting slow "dandi download" we saw
that considerable amount of time is spent on jsonschema
validation (and logging, not considered here). See
dandi/dandi-cli#1549

It is happening since we be providing styling per each row, which might
differ for summary line etc. And we might be re-validating over and
over again. The obvious solution, without trying to fix logic upstairs,
is to use @lru_cache.  But it could be not used as-is since style is a
dict (mutable), nested, and can have function callbacks.  Conversion of
such style to `str` is relatively fast and should be good enough for
jsonschema validation.  Unfortunately an initial attempt to just wrap
into some "Hashable" instance lead to jsonschema pukes.

So I decided to wrap/unwrap the value and use has of md5 over str
representation. Seems to work nicely and I hope it would eliminate some
elevate performance issue a little (it is not an ultimate solution
in original issue -- likely need to "slow down" records reporting).
yarikoptic added a commit to pyout/pyout that referenced this issue Dec 12, 2024
While troubleshooting slow "dandi download" we saw
that considerable amount of time is spent on jsonschema
validation (and logging, not considered here). See
dandi/dandi-cli#1549

It is happening since we be providing styling per each row, which might
differ for summary line etc. And we might be re-validating over and
over again. The obvious solution, without trying to fix logic upstairs,
is to use @lru_cache.  But it could be not used as-is since style is a
dict (mutable), nested, and can have function callbacks.  Conversion of
such style to `str` is relatively fast and should be good enough for
jsonschema validation.  Unfortunately an initial attempt to just wrap
into some "Hashable" instance lead to jsonschema pukes.

So I decided to wrap/unwrap the value and use has of md5 over str
representation. Seems to work nicely and I hope it would eliminate some
elevate performance issue a little (it is not an ultimate solution
in original issue -- likely need to "slow down" records reporting).
yarikoptic added a commit to pyout/pyout that referenced this issue Dec 12, 2024
While troubleshooting slow "dandi download" we saw
that considerable amount of time is spent on jsonschema
validation (and logging, not considered here). See
dandi/dandi-cli#1549

It is happening since we be providing styling per each row, which might
differ for summary line etc. And we might be re-validating over and
over again. The obvious solution, without trying to fix logic upstairs,
is to use @lru_cache.  But it could be not used as-is since style is a
dict (mutable), nested, and can have function callbacks.  Conversion of
such style to `str` is relatively fast and should be good enough for
jsonschema validation.  Unfortunately an initial attempt to just wrap
into some "Hashable" instance lead to jsonschema pukes.

So I decided to wrap/unwrap the value and use has of md5 over str
representation. Seems to work nicely and I hope it would eliminate some
elevate performance issue a little (it is not an ultimate solution
in original issue -- likely need to "slow down" records reporting).
yarikoptic added a commit to pyout/pyout that referenced this issue Dec 13, 2024
While troubleshooting slow "dandi download" we saw
that considerable amount of time is spent on jsonschema
validation (and logging, not considered here). See
dandi/dandi-cli#1549

It is happening since we be providing styling per each row, which might
differ for summary line etc. And we might be re-validating over and
over again. The obvious solution, without trying to fix logic upstairs,
is to use @lru_cache.  But it could be not used as-is since style is a
dict (mutable), nested, and can have function callbacks.  Conversion of
such style to `str` is relatively fast and should be good enough for
jsonschema validation.  Unfortunately an initial attempt to just wrap
into some "Hashable" instance lead to jsonschema pukes.

So I decided to wrap/unwrap the value and use has of md5 over str
representation. Seems to work nicely and I hope it would eliminate some
elevate performance issue a little (it is not an ultimate solution
in original issue -- likely need to "slow down" records reporting).
yarikoptic added a commit that referenced this issue Dec 16, 2024
…r file or when done

This is primarily to resolve the problem of pyout "dragging us behind" which
slows down entire download process.

#1549
yarikoptic added a commit that referenced this issue Dec 16, 2024
…r file or when done

This is primarily to resolve the problem of pyout "dragging us behind" which
slows down entire download process.

#1549
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Improve performance of an existing feature
Projects
None yet
Development

No branches or pull requests

1 participant