Prefer newer JSON cache files to older #524

takluyver · 2024-06-07T14:31:01Z

The 'run map' JSON cache is used to speed up opening a run when possible, by listing the sources & trains to be found in each file. There are a couple of possible locations for this: it can be inside the run folder, where only privileged processes can write, or in a hidden world-writable folder under proposal scratch. In master, a cache in the run folder is preferred both for reading & writing.

In some cases, the calibration pipeline creates a partial cache file in the run folder, by opening a subset of files with the include= parameter. When someone later opens the whole run folder, a complete cache file is written in scratch, but then this is never used, because the one in the run folder takes priority.

This change looks for the newest available run map file when reading.

Testing: I was pointed to p5733 r120 as an example. Using the script below, I get a minimum of about 0.4 s to open the run on master, and around 0.1 s on this branch.

Using strace, I can also see that it opens 320 .h5 files on master, and 0 on this branch, because the information is all cached.

import time

from extra_data import open_run

t0 = time.perf_counter()
run = open_run(5733, 120, data='proc')
t1 = time.perf_counter()

print(f"{len(run.train_ids)} trains")
print(f"{t1 - t0:.3f} s to open")

extra_data/run_files_map.py

+            try:
+                st = os.stat(path)
+                paths_mtimes.append((path, st.st_mtime))
+            except (FileNotFoundError, PermissionError):


extra_data/run_files_map.py

philsmt · 2024-06-10T10:36:34Z

I can imagine this was a tricky one. LGTM for the implementation apart from aesthetic question.

Should be avoid creating the partial cache in calibration, since it hardly serves a purpose?

takluyver · 2024-06-10T13:53:25Z

Should be avoid creating the partial cache in calibration, since it hardly serves a purpose?

I think it still speeds up the first full open, because it should use the cache for the files in it. But it would arguably be tidier not to create an incomplete cache file.

Prefer newer JSON cache files to older

abe1712

takluyver added the bug Something isn't working label Jun 7, 2024

takluyver added this to the 1.18 milestone Jun 7, 2024

github-advanced-security bot found potential problems Jun 7, 2024

View reviewed changes

extra_data/run_files_map.py

try:

st = os.stat(path)

paths_mtimes.append((path, st.st_mtime))

except (FileNotFoundError, PermissionError):

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

philsmt reviewed Jun 10, 2024

View reviewed changes

extra_data/run_files_map.py Show resolved Hide resolved

takluyver merged commit 44fbe8b into master Jun 10, 2024
8 checks passed

takluyver deleted the fix/newest-json-cache branch June 10, 2024 13:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prefer newer JSON cache files to older #524

Prefer newer JSON cache files to older #524

takluyver commented Jun 7, 2024

philsmt commented Jun 10, 2024

takluyver commented Jun 10, 2024

Prefer newer JSON cache files to older #524

Prefer newer JSON cache files to older #524

Conversation

takluyver commented Jun 7, 2024

philsmt commented Jun 10, 2024

takluyver commented Jun 10, 2024