ModelXbrl.load doesn't handle concurrency well #1128

jdangerx · 2024-03-19T21:22:43Z

What happened?

OS: Mac OS 14 Sonoma (which is missing from your issue template, FYI), Ubuntu 22.04 (which is what I think the GHA ubuntu-latest images correspond to),

Hello! Thanks for all the work you've done to make XBRL parsing straightforward, it's been a great help to us at Catalyst!

Unfortunately, when running some automated XBRL ingestion processes we found that the taxonomy loading seems to run into a race condition when run with concurrency.

Specifically, the error occurs in these lines of WebCache.py:

        if reload or not filepathExists:
            return filepath if self._downloadFile(url, filepath) else None

P1 and P2 both see that not filepathExists, so they initialize the file download
P1 successfully downloads the file first
P2 tries to download the file and finds that there is already a file in the cache! Oops.

I think we can work around it in our use cases (catalyst-cooperative/pudl#3449 and catalyst-cooperative/pudl-archiver#285) but I wonder if this would be a good thing to fix upstream!

Traceback inside ✨

[Errno 2] No such file or directory: '/Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/schedules/ScheduleTemporaryCashInvestments/sched-105_2
022-01-01_pre.xml.tmp' -> '/Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/schedules/ScheduleTemporaryCashInvestments/sched-105_2022-01-01_p
re.xml'                                                                                                                                                                                         
Unsuccessful renaming of downloaded file to active file /Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/schedules/ScheduleTemporaryCashInves
tments/sched-105_2022-01-01_pre.xml                                                                                                                                                             
Please remove with file manager.                                                                                                                                                                
concurrent.futures.process._RemoteTraceback:                                                                                                                                                    
"""                                                                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                              
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker                                                            
    r = call_item.fn(*call_item.args, **call_item.kwargs)                                                                                                                                       
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                       
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/concurrent/futures/process.py", line 210, in _process_chunk                                                             
    return [fn(*args) for args in chunk]                                                                                                                                                        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/concurrent/futures/process.py", line 210, in <listcomp>
    return [fn(*args) for args in chunk]
            ^^^^^^^^^                           
  File "/Users/dazhong-catalyst/scratch/test_arelle.py", line 11, in load_tax
    taxonomy = ModelXbrl.load(model_manager, taxonomy_url)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/arelle/ModelXbrl.py", line 88, in load
    modelXbrl.modelDocument = ModelDocument.load(modelXbrl, url, base, isEntry=True, **kwargs)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/arelle/ModelDocument.py", line 108, in load
    filepath = modelXbrl.modelManager.cntlr.webCache.getfilename(mappedUri, reload=reloadCache, checkModifiedTime=kwargs.get("checkModifiedTime",False))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/arelle/WebCache.py", line 426, in getfilename
    return filepath if self._downloadFile(url, filepath) else None
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/arelle/WebCache.py", line 508, in _downloadFile
    os.makedirs(filedir)
  File "<frozen os>", line 225, in makedirs
FileExistsError: [Errno 17] File exists: '/Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/form/form60'

Here's a snippet that I've been using to reproduce it fairly reliably:

from arelle import Cntlr, ModelManager, ModelXbrl, WebCache

from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp


def load_tax(_i):
    cntlr = Cntlr.Cntlr()
    model_manager = ModelManager.initialize(cntlr)
    taxonomy_url = "https://eCollection.ferc.gov/taxonomy/form60/2022-01-01/form/form60/form-60_2022-01-01.xsd"
    taxonomy = ModelXbrl.load(model_manager, taxonomy_url)
    return


if __name__ == '__main__':
    cntlr = Cntlr.Cntlr()
    cache = WebCache.WebCache(cntlr, None)
    cache.clear()
    with ProcessPoolExecutor(max_workers=10, mp_context=mp.get_context('fork')) as executor:
        taxonomies = [t for t in executor.map(load_tax, range(5))]

Documents

No response

If running from the command line, what command did you run?

No response

Interface

Python library (pip install)

Version

2.23.13

Download

pip install

Operating System

Other (please specify)

The text was updated successfully, but these errors were encountered:

austinmatherne-wk · 2024-03-25T16:31:07Z

Hi @jdangerx,

Thanks for reporting this and glad to hear you're getting good use out of Arelle!

Regarding concurrency and filesystem operations. We'll work a ticket to avoid the race condition you mentioned, but there will still be other scenarios where multiple Arelle processes can run into race conditions writing to the web cache.

There are two ways to avoid this issues:

Pre-populate the web cache so files don't have to be downloaded.
Load standard XBRL taxonomy packages so files don't have to be downloaded and the web cache isn't required.

Of these two, taxonomy packages are typically preferred because most major taxonomies provide a package that can be loaded by XBRL processors. For both security and performance reasons, most production users of Arelle operate in offline mode, and download filings to the local system before processing them with Arelle using one of the two methods mentioned above.

austinmatherne-wk · 2024-03-25T16:42:07Z

Ticket requirements:

The Arelle web cache doesn't handle concurrency well. Multiple Arelle processes can run into race conditions downloading and populating the web cache with the same files.

To avoid these issues, we should:

Instead of appending ".tmp" to the filename of downloads, we should use a unique NamedTemporaryFile (with delete=False and dir set to the cache directory) and then move it into the cache location once downloaded.
Avoid TOCTOU race conditions when creating or deleting files and directories within the cache (os.makedirs(filedir, exist_ok=True) instead of if not os.path.exists(filedir): os.makedirs(filedir), etc.)

brettkail-wk · 2024-03-25T18:51:13Z

Instead of appending ".tmp" to the filename of downloads, we should use a unique NamedTemporaryFile and then move it into the cache location once downloaded.

I suspect drives/mounts will cause complications (e.g., Windows temp directory on C: but cache on D:, Linux tmpfs vs cache directory, etc.), so you'll probably want to use dir to target the same directory anyway.

jdangerx added the bug Something isn't working label Mar 19, 2024

saasops-wk added this to Arelle Mar 19, 2024

github-project-automation bot moved this to Triage in Arelle Mar 19, 2024

austinmatherne-wk moved this from Triage to Accepted in Arelle Mar 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ModelXbrl.load doesn't handle concurrency well #1128

ModelXbrl.load doesn't handle concurrency well #1128

jdangerx commented Mar 19, 2024

austinmatherne-wk commented Mar 25, 2024

austinmatherne-wk commented Mar 25, 2024 •

edited

Loading

brettkail-wk commented Mar 25, 2024

ModelXbrl.load doesn't handle concurrency well #1128

ModelXbrl.load doesn't handle concurrency well #1128

Comments

jdangerx commented Mar 19, 2024

What happened?

Documents

If running from the command line, what command did you run?

Interface

Version

Download

Operating System

austinmatherne-wk commented Mar 25, 2024

austinmatherne-wk commented Mar 25, 2024 • edited Loading

brettkail-wk commented Mar 25, 2024

austinmatherne-wk commented Mar 25, 2024 •

edited

Loading