Datarate dependent compressor #358
Changes from 6 commits
919e541
a033c1b
62309d4
9eb0466
96364e9
0b50666
b9c6220
ee5277a
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -73,7 +73,7 @@ what bootstrax is thinking of at the moment. | |
- **disk_used**: used part of the disk that this bootstrax instance | ||
is writing to (in percent). | ||
""" | ||
__version__ = '1.0.3' | ||
__version__ = '1.0.4' | ||
|
||
import argparse | ||
from datetime import datetime, timedelta, timezone | ||
|
@@ -161,8 +161,8 @@ print(f'---\n bootstrax version {__version__}\n---') | |
|
||
# The folder that can be used for testing bootstrax (i.e. non production | ||
# mode). It will be written to: | ||
test_data_folder = ('/nfs/scratch/bootstrax/' if | ||
os.path.exists('/nfs/scratch/bootstrax/') | ||
test_data_folder = ('/data/test_processed/' if | ||
os.path.exists('/data/test_processed/') | ||
else './bootstrax/') | ||
|
||
# Timeouts in seconds | ||
|
@@ -202,7 +202,9 @@ timeouts = { | |
# Bootstrax writes its state to the daq-database. To have a backlog we store this | ||
# state using a TTL collection. To prevent too many entries in this backlog, only | ||
# create new entries if the previous entry is at least this old (in seconds). | ||
'min_status_interval': 60 | ||
'min_status_interval': 60, | ||
# Maximum time we can take to infer the datarate (s). | ||
'max_data_rate_infer_time': 30, | ||
} | ||
|
||
# The disk that the eb is writing to may fill up at some point. The data should | ||
|
@@ -709,19 +711,24 @@ def infer_mode(rd): | |
uncompressed redax rate. Estimating save parameters for running | ||
bootstrax from: | ||
https://xe1t-wiki.lngs.infn.it/doku.php?id=xenon:xenonnt:dsg:daq:eb_speed_tests_2021update | ||
:returns: dictionary of how many cores and max_messages should be used based on an | ||
estimated data rate. | ||
:returns: dictionary of how many cores, max_messages and compressor | ||
should be used based on an estimated data rate. | ||
""" | ||
# Get data rate from dispatcher | ||
try: | ||
docs = ag_stat_coll.aggregate([ | ||
{'$match': {'number': rd['number']}}, | ||
{'$group': {'_id': '$detector', 'rate': {'$max': '$rate'}}} | ||
]) | ||
data_rate = int(sum([d['rate'] for d in docs])) | ||
data_rate = None | ||
started_looking = time.time() | ||
while data_rate is None: | ||
docs = ag_stat_coll.aggregate([ | ||
{'$match': {'number': rd['number']}}, | ||
{'$group': {'_id': '$detector', 'rate': {'$max': '$rate'}}} | ||
]) | ||
data_rate = int(sum([d['rate'] for d in docs])) | ||
if time.time() - started_looking > timeouts['max_data_rate_infer_time']: | ||
raise RuntimeError | ||
except Exception as e: | ||
log_warning(f'infer_mode ran into {e}. Cannot infer mode, using default mode.', | ||
run_id=f'{rd["number"]:06}', priority='info') | ||
log_warning(f'infer_mode ran into {e}. Cannot infer datarate, using default mode.', | ||
run_id=f'{rd["number"]:06}', priority='warning') | ||
data_rate = None | ||
|
||
# Find out if eb is new (eb3-eb5): | ||
|
@@ -754,16 +761,48 @@ def infer_mode(rd): | |
if n_fails: | ||
# Exponentially lower resources & increase timeout | ||
result = dict( | ||
cores=np.clip(result['cores']/(1.1**n_fails), 4, 40), | ||
max_messages=np.clip(result['max_messages']/(1.1**n_fails), 4, 100), | ||
timeout=np.clip(result['timeout']*(1.1**n_fails), 500, 3600) | ||
cores=np.clip(result['cores']/(1.1**n_fails), 4, 40).astype(int), | ||
max_messages=np.clip(result['max_messages']/(1.1**n_fails), 4, 100).astype(int), | ||
timeout=np.clip(result['timeout']*(1.1**n_fails), 500, 3600).astype(int), | ||
) | ||
log_warning(f'infer_mode::\tRepeated failures on {rd["number"]}@{hostname}. ' | ||
f'Lowering to {result}', | ||
priority='info', | ||
run_id=f'{rd["number"]:06}') | ||
else: | ||
result = {k: int(v) for k, v in result.items()} | ||
result['records_compressor'] = infer_records_compressor(rd, data_rate, n_fails) | ||
log.info(f'infer_mode::\tInferred mode for {rd["number"]}\t{result}') | ||
return {k: int(v) for k, v in result.items()} | ||
return result | ||
|
||
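The failure back-off in the hunk above can be tried in isolation. This is a minimal sketch, not the bootstrax code: `scale_down` and its `factor` argument are hypothetical names, and the clip bounds are copied from the diff. It casts with `int()` instead of `.astype(int)`, so the values come back as plain Python integers.

```python
import numpy as np


def scale_down(result, n_fails, factor=1.1):
    """Lower cores/max_messages and raise the timeout after n_fails failures.

    Hypothetical helper: the bounds mirror the np.clip calls in the diff.
    """
    scale = factor ** n_fails
    return dict(
        cores=int(np.clip(result['cores'] / scale, 4, 40)),
        max_messages=int(np.clip(result['max_messages'] / scale, 4, 100)),
        timeout=int(np.clip(result['timeout'] * scale, 500, 3600)),
    )
```

With enough failures every value ends up pinned at its clip bound, so the resources never drop below 4 cores and the timeout never exceeds one hour.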
|
||
def infer_records_compressor(rd, datarate, n_fails): | ||
""" | ||
Get a compressor for the (raw)records. This takes two things into consideration: | ||
1. Do we store the data fast enough (high write speed) | ||
2. Does the data fit into the buffer | ||
|
||
Used compressors: | ||
bz2: slow but very good compression -> use for low datarate | ||
zstd: fast & decent compression, max chunk size of ??? GB | ||
lz4: fast & no chunk size limit, use if all else fails | ||
""" | ||
if n_fails or datarate is None: | ||
# Cannot infer datarate or failed before, go for fast & safe | ||
return 'lz4' if n_fails > 1 else 'zstd' | ||
|
||
chunk_length = (rd['daq_config']['strax_chunk_overlap'] + | ||
rd['daq_config']['strax_chunk_length']) | ||
chunk_size_mb = datarate*chunk_length | ||
if datarate < 50: | ||
# Low datarate, we can do very large compression | ||
return 'bz2' | ||
if chunk_size_mb > 1000: | ||
This is a very conservative value. We should be able to go to 1.8 GB and still have 15% overhead between us and the 31-bit issue. Given that zstd is squeezier (and has higher throughput), I think we should try to use that as much as possible.

Fair, you are right:
import numpy as np
import strax
a = np.zeros(int(2e8), dtype=np.int64)
print(f'Buffer of {a.nbytes/(1e9)} GB')
strax.save_file('test.test', a, compressor='zstd')
However, we need to keep in mind that we don't want to run into issues where just one chunk is chunkier than the others, preventing us from saving the file. Nevertheless, I agree, let's set it to 1.8 GB.

Solved in |
||
# Extremely large chunks, let's use LZ4 because we know that it | ||
# can handle this. | ||
return 'lz4' | ||
# High datarate and reasonable chunk size. | ||
return 'zstd' | ||
|
||
|
||
## | ||
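The decision tree of `infer_records_compressor` can be checked without a run document. The sketch below assumes the data rate is in MB/s and the chunk length in seconds, which is what the name `chunk_size_mb` suggests. `pick_compressor` and its keyword arguments are hypothetical names. The 1800 MB default is the value agreed on in the review thread, not the 1000 shown in this revision of the diff.

```python
def pick_compressor(datarate_mb_s, chunk_length_s, n_fails=0,
                    low_rate_mb_s=50, max_chunk_mb=1800):
    """Return 'bz2', 'zstd' or 'lz4' following the logic in the diff.

    Hypothetical stand-alone version: units and thresholds are assumptions.
    """
    if n_fails or datarate_mb_s is None:
        # Unknown rate or earlier failures: prefer fast and safe
        return 'lz4' if n_fails > 1 else 'zstd'
    if datarate_mb_s < low_rate_mb_s:
        # Low rate: we can afford slow, strong compression
        return 'bz2'
    if datarate_mb_s * chunk_length_s > max_chunk_mb:
        # Very large chunks: lz4 has no chunk size limit
        return 'lz4'
    return 'zstd'
```

For example, `pick_compressor(400, 5.5)` gives a 2200 MB chunk estimate and falls back to `'lz4'`, while `pick_compressor(100, 5.5)` stays on `'zstd'`.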
|
@@ -1169,7 +1208,7 @@ def manual_fail(*, mongo_id=None, number=None, reason=''): | |
def run_strax(run_id, input_dir, targets, readout_threads, compressor, | ||
run_start_time, samples_per_record, cores, max_messages, timeout, | ||
daq_chunk_duration, daq_overlap_chunk_duration, post_processing, | ||
debug=False): | ||
records_compressor, debug=False): | ||
# Check mongo connection | ||
ping_dbs() | ||
# Clear the swap memory used by npshmmex | ||
|
@@ -1193,6 +1232,10 @@ def run_strax(run_id, input_dir, targets, readout_threads, compressor, | |
timeout=timeout, | ||
targets=targets) | ||
|
||
for t in ('raw_records', 'records'): | ||
# Set the (raw)records compressor to the inferred one | ||
st._plugin_class_registry[t].compressor = records_compressor | ||
|
||
# Make a function for running strax, call the function to process the run | ||
# This way, it can also be run inside a wrapper to profile strax | ||
def st_make(): | ||
|
@darrylmasson I think this is why we might have seen more failures than usual at the eb lately. Before this nice aggregation we had a check to see if the data_rate actually returned something. Now we should be good again.
Alternately, if we make sure the run hasn't started within the last 10 or 15 seconds, we can be sure that the dispatcher has been through a few update cycles.
Sure, this is what I had first but then decided against it because it would lead to unnecessary waiting time: 96364e9
This was also because I set the time to 1 minute rather than 10 s :)
Solved in
b9c6220
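One detail worth noting about the polling loop at this revision: `int(sum([]))` evaluates to `0`, not `None`, so a loop conditioned on `data_rate is None` exits after the first query even when the dispatcher has not reported anything yet. A minimal sketch of a wait-with-timeout that checks for an empty result explicitly could look like the block below. `wait_for_rate` and `fetch_rates` are hypothetical names, with `fetch_rates` standing in for the `ag_stat_coll.aggregate(...)` query. This is not necessarily how the later commit resolved it.

```python
import time


def wait_for_rate(fetch_rates, timeout_s=30, poll_s=1.0):
    """Poll fetch_rates() until it yields at least one rate, or time out.

    fetch_rates is a hypothetical callable returning an iterable of
    per-detector rates; an empty iterable means "nothing reported yet".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        rates = list(fetch_rates())
        if rates:
            return int(sum(rates))
        if time.monotonic() >= deadline:
            raise RuntimeError(f'no data rate reported within {timeout_s} s')
        time.sleep(poll_s)
```

Raising with a message also keeps the `infer_mode ran into {e}` warning informative, since a bare `RuntimeError` formats to an empty string.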