
Intermittent https failures can cause large-scale multi-fov starfish runs to crash #1277

Closed
ambrosejcarr opened this issue Apr 25, 2019 · 3 comments · Fixed by spacetx/slicedimage#99 or #1343
Labels
bug An issue with an existing feature

Comments

@ambrosejcarr
Member


  0%|          | 0/16 [00:00<?, ?it/s]
 12%|█▎        | 2/16 [00:00<00:04,  3.17it/s]
 19%|█▉        | 3/16 [00:01<00:04,  2.62it/s]
 25%|██▌       | 4/16 [00:01<00:05,  2.24it/s]
 31%|███▏      | 5/16 [00:02<00:05,  2.13it/s]
 38%|███▊      | 6/16 [00:03<00:06,  1.62it/s]
 44%|████▍     | 7/16 [00:03<00:05,  1.68it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/slicedimage/backends/_caching.py", line 48, in __enter__
    file_data = self.cache.read(cache_key)
  File "/usr/local/lib/python3.7/site-packages/diskcache/core.py", line 1066, in read
    raise KeyError(key)
KeyError: 'v1-71e214583641581807584a857a585b156c22d3c60923dd00290ba810df724e38'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 13, in <module>
  File "/src/starfish/core/experiment/experiment.py", line 183, in get_image
    return ImageStack.from_tileset(self._images[item], crop_parameters=crop_params)
  File "/src/starfish/core/imagestack/imagestack.py", line 238, in from_tileset
    return cls(tile_data)
  File "/src/starfish/core/imagestack/imagestack.py", line 170, in __init__
    data = tile.numpy_array
  File "/src/starfish/core/imagestack/parser/crop.py", line 204, in numpy_array
    return self.cropping_parameters.crop_image(self.backing_tile_data.numpy_array)
  File "/src/starfish/core/imagestack/parser/tileset/_parser.py", line 49, in numpy_array
    self._load()
  File "/src/starfish/core/imagestack/parser/tileset/_parser.py", line 37, in _load
    self._numpy_array = self._wrapped_tile.numpy_array
  File "/usr/local/lib/python3.7/site-packages/slicedimage/_tile.py", line 49, in numpy_array
    result = self._numpy_array_future()
  File "/usr/local/lib/python3.7/site-packages/slicedimage/io.py", line 234, in _actual_future
    with _source_fh_contextmanager as fh:
  File "/usr/local/lib/python3.7/site-packages/slicedimage/backends/_caching.py", line 52, in __enter__
    self.name, self.checksum_sha256) as sfh:
  File "/usr/local/lib/python3.7/site-packages/slicedimage/backends/_http.py", line 27, in __enter__
    resp.raise_for_status()
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://d2nhj9g34unfro.cloudfront.net/xiaoyan_qian/ISS_human_HCA_07_MultiFOV/main_files/primary_images-fov_248-Z0-H3-C1.tiff

In this example, 538 of 539 fields of view ran successfully. The last one failed because of a transient download failure (the 504 above).

It may make sense for slicedimage/starfish to retry in this case. I will also follow up with the Cromwell team.

@ambrosejcarr
Member Author

ambrosejcarr commented Apr 26, 2019

@rexwangcc suggested tenacity for these kinds of retries:

https://github.com/jd/tenacity
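
A minimal sketch of what that could look like (illustrative only; fetch_tile and the retry parameters here are not starfish/slicedimage API):

import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Hypothetical helper: retry a single tile download on transient HTTP errors,
# backing off exponentially between attempts. A real implementation would
# likely restrict this to 5xx responses (see the next comment).
@retry(
    retry=retry_if_exception_type(requests.exceptions.HTTPError),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, max=30),
)
def fetch_tile(url: str) -> bytes:
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()  # a 504 raises HTTPError, which triggers a retry
    return resp.content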

@ttung
Collaborator

ttung commented Apr 26, 2019

python-requests has retry support, and it can be configured to skip retries for certain response status codes. Retrying a 404, for instance, is unlikely to ever make sense.
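
A sketch of that approach with a plain requests session (the parameter values are illustrative, not necessarily what slicedimage ended up using):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry only transient server-side failures (502/503/504) with exponential
# backoff; a 404 is not in the list, so it fails immediately.
retry_policy = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry_policy))
session.mount("http://", HTTPAdapter(max_retries=retry_policy))

url = "https://d2nhj9g34unfro.cloudfront.net/xiaoyan_qian/ISS_human_HCA_07_MultiFOV/main_files/primary_images-fov_248-Z0-H3-C1.tiff"
resp = session.get(url)
resp.raise_for_status()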

@ambrosejcarr
Member Author

I've followed up with the Cromwell team; these types of retries are not something they cover. Cromwell will retry when data localization fails for inputs declared with the Cromwell File type. Since we do the localization inside the command via requests, it is not covered by the workflow engine.

@rexwangcc please correct me if I have this wrong.

Given this, Tony's solution is the correct and simplest path. In case it's useful, we ran into one failure over hundreds of thousands of images.

@neuromusic added the feature (New work) and bug (An issue with an existing feature) labels and removed the feature (New work) label on May 7, 2019
@shanaxel42 added this to the SpaceTX milestone on May 7, 2019
ttung pushed a commit to spacetx/slicedimage that referenced this issue May 10, 2019
In the HTTP backend, we create a urllib3 retry policy and attach it to a python-requests session.

We test this by monkeypatching the code to retry on 404 errors (which we normally do not).  We attempt to fetch a file that's not present, but start a thread that creates the file after a short delay.  It should initially not find the file, and then eventually succeed.

Fixes spacetx/starfish#1277
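
A rough, hypothetical sketch of that test setup (none of these names come from the slicedimage test suite): serve a directory over HTTP, request a file that does not exist yet, and create it from a background thread after a short delay, so that a client retrying 404s initially fails and then succeeds:

import functools
import http.server
import threading
import time
from pathlib import Path

import requests

serve_dir = Path("served")
serve_dir.mkdir(exist_ok=True)

# Serve serve_dir on an ephemeral local port.
handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=str(serve_dir))
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def create_file_later(delay: float = 2.0) -> None:
    # Create the requested file only after a short delay.
    time.sleep(delay)
    (serve_dir / "tile.tiff").write_bytes(b"fake tile data")

threading.Thread(target=create_file_later, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/tile.tiff"
# Stand-in for a backend that retries 404s: poll until the 404 becomes a 200.
for _ in range(20):
    resp = requests.get(url)
    if resp.ok:
        break
    time.sleep(0.5)
print(resp.status_code, len(resp.content))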
ttung pushed a commit that referenced this issue May 20, 2019
This deploys the fix in spacetx/slicedimage#99 to resolve #1277
ttung pushed a commit that referenced this issue May 21, 2019