Renaming path before savers finished #373

Closed
JoranAngevaare opened this issue Jan 2, 2021 · 3 comments
Labels
bug Something isn't working wontfix This will not be worked on

Comments

@JoranAngevaare
Contributor

JoranAngevaare commented Jan 2, 2021

On run 011534 (eb4) the directory was renamed before the savers finished. If this kind of error persists, we might have to rethink some logic.

This is most likely related to poor performance of the disk being used.

The traceback is:

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/xedaq/miniconda/envs/py38/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/xedaq/software/npshmex/npshmex.py", line 193, in shm_wrap_f
    result = f(*args, **kwargs)
  File "/home/xedaq/software/strax/strax/plugin.py", line 950, in do_compute
    s.save(chunk=results[d], chunk_i=chunk_i)
  File "/home/xedaq/software/strax/strax/storage/common.py", line 614, in save
    bonus_info, future = self._save_chunk(
  File "/home/xedaq/software/strax/strax/storage/files.py", line 297, in _save_chunk
    filesize = strax.save_file(fn, **kwargs)
  File "/home/xedaq/software/strax/strax/io.py", line 77, in save_file
    with open(temp_fn, mode='wb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/xenonnt_processed/011534-raw_records_he-rfzvpzj4mf_temp/raw_records_he-rfzvpzj4mf-000595_temp'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/xedaq/software/straxen/bin/bootstrax", line 1209, in run_strax
    st_make()
  File "/home/xedaq/software/straxen/bin/bootstrax", line 1187, in st_make
    st.make(run_id, 'multiple',
  File "/home/xedaq/software/strax/strax/context.py", line 1099, in make
    for _ in self.get_iter(run_ids[0], targets,
  File "/home/xedaq/software/strax/strax/context.py", line 1008, in get_iter
    generator.throw(e)
  File "/home/xedaq/software/strax/strax/context.py", line 969, in get_iter
    for n_chunks, result in enumerate(strax.continuity_check(generator), 1):
  File "/home/xedaq/software/strax/strax/chunk.py", line 266, in continuity_check
    for s in chunk_iter:
  File "/home/xedaq/software/strax/strax/processor.py", line 270, in iter
    raise exc.with_traceback(traceback)
  File "/home/xedaq/software/strax/strax/mailbox.py", line 486, in divide_outputs
    result = next(source)
  File "/home/xedaq/software/strax/strax/mailbox.py", line 401, in _read
    res = msg.result(timeout=self.timeout)
  File "/home/xedaq/miniconda/envs/py38/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/xedaq/miniconda/envs/py38/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/data/xenonnt_processed/011534-raw_records_he-rfzvpzj4mf_temp/raw_records_he-rfzvpzj4mf-000595_temp'
@JelleAalbers
Member

JelleAalbers commented Jan 2, 2021

I remember these annoying race conditions, I thought/hoped we had eliminated them...

Is there an error like this somewhere in the traceback? If so, perhaps some saver took more than 300 seconds to save a chunk because your disk is overloaded. If not, something is wrong with how futures are being passed around / waited upon. (Still, it would be odd even if it crashed with that error -- how did it rename the folder then? Were multiple processes trying to close one saver?)
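
To make the suspected race concrete, here is a minimal sketch of how it could play out. This is not strax code: `slow_save`, `demo_race`, the directory names, and the 0.1 s timeout (standing in for the 300 s one) are all made up for illustration, and it uses a thread pool instead of the process pool seen in the traceback. A wait on the save future times out, the folder gets renamed anyway, and the still-pending save then fails with the same `FileNotFoundError`.

```python
import concurrent.futures
import os
import shutil
import tempfile
import threading


def slow_save(temp_dir, name, release):
    """Hypothetical saver: blocks (simulating an overloaded disk), then writes."""
    release.wait()  # stay pending until the main thread lets us continue
    with open(os.path.join(temp_dir, name), mode="wb") as f:
        f.write(b"chunk data")


def demo_race():
    base = tempfile.mkdtemp()
    temp_dir = os.path.join(base, "run_temp")  # hypothetical temp folder
    final_dir = os.path.join(base, "run")      # hypothetical final folder
    os.mkdir(temp_dir)
    release = threading.Event()
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(slow_save, temp_dir, "chunk-000000_temp", release)
            try:
                future.result(timeout=0.1)  # stand-in for the 300 s timeout
                timed_out = False
            except concurrent.futures.TimeoutError:
                timed_out = True
            # The problem being illustrated: rename while the save is still pending.
            os.rename(temp_dir, final_dir)
            release.set()               # the "disk" finally responds
            error = future.exception()  # waits for the saver, returns its exception
        return timed_out, error
    finally:
        shutil.rmtree(base, ignore_errors=True)


if __name__ == "__main__":
    print(demo_race())
```

In this toy version the timeout is the only way the rename can overtake the save, which is why it matters whether the timeout error shows up anywhere in the logged traceback.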

@JoranAngevaare
Contributor Author

JoranAngevaare commented Jan 3, 2021

Thanks Jelle, we see the linked traceback every now and then, but it wasn't part of the traceback that got logged (it may have been part of the full traceback without being written to the log). The issue wasn't reproducible, so we will simply have to keep an eye out for this kind of error.

Were multiple processes trying to close one saver?

Not as far as I know. There was nothing other than bootstrax running.

@JoranAngevaare JoranAngevaare added the bug Something isn't working label Feb 6, 2021
@JoranAngevaare JoranAngevaare added the wontfix This will not be worked on label Apr 10, 2021
@JoranAngevaare
Contributor Author

Seems mostly solved in #394 and related to bad disk performance.
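
For reference, the general pattern that avoids this kind of race is to block on every pending save before renaming. The sketch below shows only that pattern, under assumptions: it is not taken from #394 or from strax, and `save_chunk`, `close_saver`, and `demo_safe_close` are hypothetical names.

```python
import concurrent.futures
import os
import shutil
import tempfile
import time


def save_chunk(temp_dir, chunk_i):
    """Hypothetical saver: a short sleep stands in for slow disk I/O."""
    time.sleep(0.05)
    path = os.path.join(temp_dir, f"chunk-{chunk_i:06d}")
    with open(path, mode="wb") as f:
        f.write(b"chunk data")
    return path


def close_saver(temp_dir, final_dir, pending):
    """Rename the temp folder only after every pending save has completed."""
    done, _ = concurrent.futures.wait(pending)  # no timeout: wait for all futures
    for future in done:
        future.result()  # re-raise any save error instead of renaming
    os.rename(temp_dir, final_dir)


def demo_safe_close():
    base = tempfile.mkdtemp()
    temp_dir = os.path.join(base, "run_temp")
    final_dir = os.path.join(base, "run")
    os.mkdir(temp_dir)
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
            pending = [pool.submit(save_chunk, temp_dir, i) for i in range(4)]
            close_saver(temp_dir, final_dir, pending)
        return sorted(os.listdir(final_dir))
    finally:
        shutil.rmtree(base, ignore_errors=True)


if __name__ == "__main__":
    print(demo_safe_close())
```

The trade-off is that waiting without a timeout can hang on a truly stuck disk, so a real implementation would probably still want a timeout that aborts the run rather than one that lets the rename go ahead.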

2 participants