[goes-glm]: Try to fix with socket timeout #183

Merged 1 commit on Apr 24, 2023

TomAugspurger
In a recent run, we experienced the hanging goes-glm task again.
Here's the backtrace from py-spy on the hanging process.

```
Thread 1 (active): "MainThread"
    read (ssl.py:1134)
    recv_into (ssl.py:1278)
    readinto (socket.py:706)
    read (http/client.py:466)
    _fp_read (urllib3/response.py:533)
    read (urllib3/response.py:567)
    stream (urllib3/response.py:628)
    generate (requests/models.py:816)
    __next__ (core/pipeline/transport/_requests_basic.py:173)
    process_content (blob/_download.py:52)
    _initial_request (blob/_download.py:435)
    __init__ (blob/_download.py:349)
    download_blob (blob/_blob_client.py:848)
    wrapper_use_tracer (core/tracing/decorator.py:76)
    <lambda> (core/storage/blob.py:514)
    with_backoff (core/utils/backoff.py:142)
    download_file (core/storage/blob.py:513)
    create_item (goes_glm.py:38)
    create_items (dataset/items/task.py:117)
    run (dataset/items/task.py:153)
    parse_and_run (task/task.py:53)
    run_task (task/run.py:138)
    run_cmd (task/_cli.py:32)
    run_cmd (task/cli.py:50)
    new_func (click/decorators.py:26)
    invoke (click/core.py:760)
    invoke (click/core.py:1404)
    invoke (click/core.py:1657)
    invoke (click/core.py:1657)
    main (click/core.py:1055)
    __call__ (click/core.py:1130)
    cli (cli/cli.py:140)
    <module> (pctasks:8)
Thread 13 (idle): "fsspecIO"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1884)
    run_forever (asyncio/base_events.py:607)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)
Thread 14 (active): "asyncio_0"
    _worker (concurrent/futures/thread.py:81)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)

```

I think we only care about the `MainThread`, which is blocked in an SSL
socket read while streaming the blob body. Setting `timeout_seconds` on
the `download_blob` request was apparently not sufficient. I don't know
whether setting the default socket timeout will be sufficient either, but
it's worth a shot.
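
For illustration, here is a minimal sketch of what a default socket timeout does, using only the standard library. It is not the change made in this PR: the helper names and the timeout value are hypothetical, and a local server that never sends data stands in for the stalled blob download. Note that `socket.setdefaulttimeout` only affects sockets created after the call, and a library that sets an explicit per-socket timeout overrides it, so this may or may not reach the socket in the backtrace above.

```python
import socket
from typing import Optional


def apply_default_socket_timeout(seconds: Optional[float]) -> Optional[float]:
    """Set the process-wide default timeout for new sockets.

    Returns the previous default so the caller can restore it.
    (Hypothetical helper, not part of pctasks.)
    """
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    return previous


def read_from_silent_server(timeout_seconds: float) -> str:
    """Connect to a local server that never sends data and try to read.

    Without a timeout, recv() would block forever, much like the hang in
    the backtrace. With a default timeout, it raises socket.timeout.
    """
    previous = apply_default_socket_timeout(timeout_seconds)
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = None
    try:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        # The client socket is created after setdefaulttimeout, so it
        # inherits the default timeout.
        client = socket.create_connection(server.getsockname())
        try:
            client.recv(1024)  # the server never writes, so this stalls
            return "received data"
        except socket.timeout:
            return "timed out"
    finally:
        if client is not None:
            client.close()
        server.close()
        # Restore the previous default so other code is unaffected.
        socket.setdefaulttimeout(previous)


if __name__ == "__main__":
    print(read_from_silent_server(0.2))
```

In a long-running task one would presumably call `socket.setdefaulttimeout` once at startup rather than restoring the old value; the restore here only keeps the example self-contained.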

TomAugspurger marked this pull request as ready for review April 24, 2023 16:06
TomAugspurger merged commit f68832c into main Apr 24, 2023
TomAugspurger deleted the tom/fix/goes-glm-timeout-again branch April 24, 2023 18:49