Reduce memory consumption during SEG-Y export #34

Closed
tasansal opened this issue Sep 1, 2022 · 2 comments · Fixed by #109
tasansal commented Sep 1, 2022

The distributed workers flatten the chunks along the first dimension to write to SEG-Y.

For huge files (>2 TB), this uses a lot of memory during export.

The output sharding strategy needs to be optimized:

# We must unify chunks with "trc_chunks" here because
# headers and live mask may have different chunking.
# We don't take the time axis for headers / live
# Still lazy computation
traces_seq = traces.rechunk(seq_trc_chunks)
headers_seq = headers.rechunk(seq_trc_chunks[:-1])
live_seq = live_mask.rechunk(seq_trc_chunks[:-1])

# Build a Dask graph to do the computation
# Name of task. Using uuid1 is important because
# we could potentially generate these from different machines
task_name = "block-to-sgy-part-" + str(uuid.uuid1())

trace_keys = flatten(traces_seq.__dask_keys__())
header_keys = flatten(headers_seq.__dask_keys__())
live_keys = flatten(live_seq.__dask_keys__())

all_keys = zip(trace_keys, header_keys, live_keys)

# tmp file root
out_dir = path.dirname(output_segy_path)

task_graph_dict = {}
block_file_paths = []
for idx, (trace_key, header_key, live_key) in enumerate(all_keys):
    block_file_name = f".{idx}_{uuid.uuid1()}._segyblock"
    block_file_path = path.join(out_dir, block_file_name)
    block_file_paths.append(block_file_path)

    block_args = (
        block_file_path,
        trace_key,
        header_key,
        live_key,
        num_samp,
        sample_format,
        endian,
    )

    task_graph_dict[(task_name, idx)] = (write_block_to_segy,) + block_args

# Make actual graph
task_graph = HighLevelGraph.from_collections(
    task_name,
    task_graph_dict,
    dependencies=[traces_seq, headers_seq, live_seq],
)

# Note this doesn't work with distributed.
tqdm_kw = dict(unit="block", dynamic_ncols=True)
block_progress = TqdmCallback(desc="Step 1 / 2 Writing Blocks", **tqdm_kw)

with block_progress:
    block_exists = compute_as_if_collection(
        cls=Array,
        dsk=task_graph,
        keys=list(task_graph_dict),
        scheduler=client,
    )

merge_args = [output_segy_path, block_file_paths, block_exists]
if client is not None:
    _ = client.submit(merge_partial_segy, *merge_args).result()
else:
    merge_partial_segy(*merge_args)

and the final merge step:

def merge_partial_segy(output_segy_path, block_file_paths, block_exists):
    ...
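The two-step pattern above (write each chunk to its own temp file, then concatenate the files in order) can be sketched in a minimal, self-contained form. The `write_block` and `merge_blocks` helpers below are hypothetical stand-ins for `write_block_to_segy` and `merge_partial_segy`, and a small 1-D array stands in for the flattened traces; the point is only that each chunk is materialized and written independently, so no worker ever needs the whole array in memory.

```python
import os
import tempfile
import uuid

import dask
import dask.array as da
import numpy as np


def write_block(block_path, block):
    """Write one materialized chunk to its own temp file (stand-in helper)."""
    block.astype("float32").tofile(block_path)
    return True


def merge_blocks(output_path, block_paths, block_exists):
    """Concatenate the per-block files, in order, into the final file."""
    with open(output_path, "wb") as out:
        for block_path, ok in zip(block_paths, block_exists):
            assert ok, f"block missing: {block_path}"
            with open(block_path, "rb") as f:
                out.write(f.read())
            os.remove(block_path)


arr = da.arange(16, chunks=4)  # stand-in for the flattened traces

out_dir = tempfile.mkdtemp()
output_path = os.path.join(out_dir, "out.bin")

# One delayed write task per chunk, mirroring the per-key task graph above.
block_paths, tasks = [], []
for idx, block in enumerate(arr.to_delayed().ravel()):
    block_path = os.path.join(out_dir, f".{idx}_{uuid.uuid1()}._block")
    block_paths.append(block_path)
    tasks.append(dask.delayed(write_block)(block_path, block))

block_exists = dask.compute(*tasks)            # step 1: write blocks
merge_blocks(output_path, block_paths, block_exists)  # step 2: merge

result = np.fromfile(output_path, dtype="float32")
```

Each task holds only one chunk at a time, but note that the intermediate files temporarily double the on-disk footprint, and the merge itself streams the blocks sequentially.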

@tasansal tasansal added the performance Performance label Sep 1, 2022
tasansal commented Oct 7, 2022

ref Dask Community Post

tasansal commented Nov 3, 2022

This brings significant improvements to memory usage:

dask/distributed#7128

import dask
import distributed

# worker-saturation expects a number (or the string "inf"), not "1.0"
with dask.config.set({"distributed.scheduler.worker-saturation": 1.0}):
    client = distributed.Client(...)
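The effect of the context manager can be checked without spinning up a cluster by reading the value back with `dask.config.get` (a minimal sketch; the setting only influences scheduling once a `distributed` scheduler is actually created inside the block):

```python
import dask

# Inside the block the override is visible; outside, the default is restored.
with dask.config.set({"distributed.scheduler.worker-saturation": 1.0}):
    saturation = dask.config.get("distributed.scheduler.worker-saturation")
```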
