Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ModelXbrl.load doesn't handle concurrency well #1128

Open
jdangerx opened this issue Mar 19, 2024 · 3 comments
Open

ModelXbrl.load doesn't handle concurrency well #1128

jdangerx opened this issue Mar 19, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@jdangerx
Copy link

What happened?

OS: Mac OS 14 Sonoma (which is missing from your issue template, FYI), Ubuntu 22.04 (which is what I think the GHA ubuntu-latest images correspond to),

Hello! Thanks for all the work you've done to make XBRL parsing straightforward, it's been a great help to us at Catalyst!

Unfortunately, when running some automated XBRL ingestion processes we found that the taxonomy loading seems to run into a race condition when run with concurrency.

Specifically, the error occurs in these lines of WebCache.py:

        if reload or not filepathExists:
            return filepath if self._downloadFile(url, filepath) else None
  1. P1 and P2 both see that not filepathExists, so they initialize the file download
  2. P1 successfully downloads the file first
  3. P2 tries to download the file and finds that there is already a file in the cache! Oops.

I think we can work around it in our use cases (catalyst-cooperative/pudl#3449 and catalyst-cooperative/pudl-archiver#285) but I wonder if this would be a good thing to fix upstream!

Traceback inside ✨
[Errno 2] No such file or directory: '/Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/schedules/ScheduleTemporaryCashInvestments/sched-105_2
022-01-01_pre.xml.tmp' -> '/Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/schedules/ScheduleTemporaryCashInvestments/sched-105_2022-01-01_p
re.xml'                                                                                                                                                                                         
Unsuccessful renaming of downloaded file to active file /Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/schedules/ScheduleTemporaryCashInves
tments/sched-105_2022-01-01_pre.xml                                                                                                                                                             
Please remove with file manager.                                                                                                                                                                
concurrent.futures.process._RemoteTraceback:                                                                                                                                                    
"""                                                                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                              
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker                                                            
    r = call_item.fn(*call_item.args, **call_item.kwargs)                                                                                                                                       
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                       
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/concurrent/futures/process.py", line 210, in _process_chunk                                                             
    return [fn(*args) for args in chunk]                                                                                                                                                        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/concurrent/futures/process.py", line 210, in <listcomp>
    return [fn(*args) for args in chunk]
            ^^^^^^^^^                           
  File "/Users/dazhong-catalyst/scratch/test_arelle.py", line 11, in load_tax
    taxonomy = ModelXbrl.load(model_manager, taxonomy_url)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/arelle/ModelXbrl.py", line 88, in load
    modelXbrl.modelDocument = ModelDocument.load(modelXbrl, url, base, isEntry=True, **kwargs)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/arelle/ModelDocument.py", line 108, in load
    filepath = modelXbrl.modelManager.cntlr.webCache.getfilename(mappedUri, reload=reloadCache, checkModifiedTime=kwargs.get("checkModifiedTime",False))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/arelle/WebCache.py", line 426, in getfilename
    return filepath if self._downloadFile(url, filepath) else None
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/arelle/WebCache.py", line 508, in _downloadFile
    os.makedirs(filedir)
  File "<frozen os>", line 225, in makedirs
FileExistsError: [Errno 17] File exists: '/Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/form/form60'

Here's a snippet that I've been using to reproduce it fairly reliably:

from arelle import Cntlr, ModelManager, ModelXbrl, WebCache

from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp


def load_tax(_i):
    cntlr = Cntlr.Cntlr()
    model_manager = ModelManager.initialize(cntlr)
    taxonomy_url = "https://eCollection.ferc.gov/taxonomy/form60/2022-01-01/form/form60/form-60_2022-01-01.xsd"
    taxonomy = ModelXbrl.load(model_manager, taxonomy_url)
    return


if __name__ == '__main__':
    cntlr = Cntlr.Cntlr()
    cache = WebCache.WebCache(cntlr, None)
    cache.clear()
    with ProcessPoolExecutor(max_workers=10, mp_context=mp.get_context('fork')) as executor:
        taxonomies = [t for t in executor.map(load_tax, range(5))]

Documents

No response

If running from the command line, what command did you run?

No response

Interface

Python library (pip install)

Version

2.23.13

Download

pip install

Operating System

Other (please specify)

@jdangerx jdangerx added the bug Something isn't working label Mar 19, 2024
@github-project-automation github-project-automation bot moved this to Triage in Arelle Mar 19, 2024
@austinmatherne-wk
Copy link
Contributor

Hi @jdangerx,

Thanks for reporting this and glad to hear you're getting good use out of Arelle!

Regarding concurrency and filesystem operations. We'll work a ticket to avoid the race condition you mentioned, but there will still be other scenarios where multiple Arelle processes can run into race conditions writing to the web cache.

There are two ways to avoid this issues:

  1. Pre-populate the web cache so files don't have to be downloaded.
  2. Load standard XBRL taxonomy packages so files don't have to be downloaded and the web cache isn't required.

Of these two, taxonomy packages are typically preferred because most major taxonomies provide a package that can be loaded by XBRL processors. For both security and performance reasons, most production users of Arelle operate in offline mode, and download filings to the local system before processing them with Arelle using one of the two methods mentioned above.

@austinmatherne-wk
Copy link
Contributor

austinmatherne-wk commented Mar 25, 2024

Ticket requirements:

The Arelle web cache doesn't handle concurrency well. Multiple Arelle processes can run into race conditions downloading and populating the web cache with the same files.

To avoid these issues, we should:

  • Instead of appending ".tmp" to the filename of downloads, we should use a unique NamedTemporaryFile (with delete=False and dir set to the cache directory) and then move it into the cache location once downloaded.
  • Avoid TOCTOU race conditions when creating or deleting files and directories within the cache (os.makedirs(filedir, exist_ok=True) instead of if not os.path.exists(filedir): os.makedirs(filedir), etc.)

@austinmatherne-wk austinmatherne-wk moved this from Triage to Accepted in Arelle Mar 25, 2024
@brettkail-wk
Copy link
Contributor

Instead of appending ".tmp" to the filename of downloads, we should use a unique NamedTemporaryFile and then move it into the cache location once downloaded.

I suspect drives/mounts will cause complications (e.g., Windows temp directory on C: but cache on D:, Linux tmpfs vs cache directory, etc.), so you'll probably want to use dir to target the same directory anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Accepted
Development

No branches or pull requests

3 participants