Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefer newer JSON cache files to older #524

Merged
merged 1 commit into from
Jun 10, 2024
Merged

Conversation

takluyver
Copy link
Member

The 'run map' JSON cache is used to speed up opening a run when possible, by listing the sources & trains to be found in each file. There are a couple of possible locations for this: it can be inside the run folder, where only privileged processes can write, or in a hidden world-writable folder under proposal scratch. In master, a cache in the run folder is preferred both for reading & writing.

In some cases, the calibration pipeline creates a partial cache file in the run folder, by opening a subset of files with the include= parameter. When someone later opens the whole run folder, a complete cache file is written in scratch, but then this is never used, because the one in the run folder takes priority.

This change looks for the newest available run map file when reading.


Testing: I was pointed to p5733 r120 as an example. Using the script below, I get a minimum of about 0.4 s to open the run on master, and around 0.1 s on this branch.

Using strace, I can also see that it opens 320 .h5 files on master, and 0 on this branch, because the information is all cached.

import time

from extra_data import open_run

t0 = time.perf_counter()
run = open_run(5733, 120, data='proc')
t1 = time.perf_counter()

print(f"{len(run.train_ids)} trains")
print(f"{t1 - t0:.3f} s to open")

@takluyver takluyver added the bug Something isn't working label Jun 7, 2024
@takluyver takluyver added this to the 1.18 milestone Jun 7, 2024
try:
st = os.stat(path)
paths_mtimes.append((path, st.st_mtime))
except (FileNotFoundError, PermissionError):

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.
@philsmt
Copy link
Contributor

philsmt commented Jun 10, 2024

I can imagine this was a tricky one. LGTM for the implementation apart from aesthetic question.

Should be avoid creating the partial cache in calibration, since it hardly serves a purpose?

@takluyver
Copy link
Member Author

Should be avoid creating the partial cache in calibration, since it hardly serves a purpose?

I think it still speeds up the first full open, because it should use the cache for the files in it. But it would arguably be tidier not to create an incomplete cache file.

@takluyver takluyver merged commit 44fbe8b into master Jun 10, 2024
8 checks passed
@takluyver takluyver deleted the fix/newest-json-cache branch June 10, 2024 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants