Lofar performance improvements #218

Open
wants to merge 6 commits into
base: main

sstansill (Collaborator) commented Aug 5, 2024

Adds a new method, read_col_conversion_dask, that allows larger-than-memory columns to be converted. Changes:

  1. xarray Dataset encoding has been cleaned up and now skips DataArrays that are backed by dask arrays
  2. lofar and lofar_read_size arguments have been added to convert_msv2_to_processing_set
  3. A TableManager class has been added so that multi-thread/multi-process conversion can happen without serializing casacore table objects. It replaces open_table_ro and open_query in convert_and_write_partition
  4. read_col_conversion_dask uses dask's map_blocks to create one task per chunk of a DataArray; each task reads data from an MSv2 column and reshapes it
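The map_blocks pattern in item 4 can be sketched roughly as follows. This is an illustration rather than the PR's implementation: an in-memory NumPy array stands in for the MSv2 column, and `read_rows`, `load_block`, and the dimension sizes are hypothetical names and values chosen for the example. It assumes rows are ordered by time and then by baseline, so that each time chunk maps to one contiguous run of rows.

```python
import dask.array as da
import numpy as np

# Stand-in for an MSv2 column with rows ordered by time, then baseline.
# In a real conversion this would be a casacore table column on disk.
N_TIME, N_BASELINE, N_CHAN = 6, 3, 4
fake_column = np.arange(N_TIME * N_BASELINE * N_CHAN, dtype=np.float64).reshape(
    N_TIME * N_BASELINE, N_CHAN
)


def read_rows(start_row, n_rows):
    """Hypothetical contiguous read; a real reader would query the table here."""
    return fake_column[start_row:start_row + n_rows]


def load_block(block_info=None):
    """Read the rows for one output chunk and reshape to (time, baseline, chan)."""
    # block_info[None] describes the output block; "array-location" holds one
    # (start, stop) pair per dimension.
    (t0, t1), _, _ = block_info[None]["array-location"]
    rows = read_rows(t0 * N_BASELINE, (t1 - t0) * N_BASELINE)
    return rows.reshape(t1 - t0, N_BASELINE, N_CHAN)


# Chunk only along time so every task maps to one contiguous run of rows.
times_per_chunk = 2
lazy = da.map_blocks(
    load_block,
    chunks=((times_per_chunk,) * (N_TIME // times_per_chunk), (N_BASELINE,), (N_CHAN,)),
    dtype=fake_column.dtype,
    meta=np.empty((0, 0, 0), dtype=fake_column.dtype),
)

result = lazy.compute()  # no rows are read until this call
```

Because each task holds only its own chunk, peak memory scales with the chunk size rather than with the full column.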

This has been used to convert 9 TB of LOFAR data in ~4.5 hours, which was previously impossible without a compute node with more than 9 TB of memory.
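The idea behind TableManager (item 3) can be illustrated with a small standard-library sketch. It is not the class from this PR: `LazyTableHandle` and `read_payload` are hypothetical names, and a plain file stands in for a casacore table. The object stores only the path and an optional query string, and every worker opens its own handle when it needs one.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class LazyTableHandle:
    """Holds only plain data (path, query); the handle is opened on demand.

    Hypothetical stand-in for a table manager: no open handle is ever stored
    on the object, so it is cheap and safe to hand to other workers.
    """

    def __init__(self, path, query=None):
        self.path = path
        self.query = query

    @contextmanager
    def open(self):
        # A real implementation would open a casacore table (and apply the
        # query) here; a plain file keeps the sketch runnable anywhere.
        handle = open(self.path, "rb")
        try:
            yield handle
        finally:
            handle.close()


def read_payload(manager):
    """Worker-side function: opens its own handle, reads, then closes it."""
    with manager.open() as handle:
        return handle.read()


with tempfile.NamedTemporaryFile(delete=False, suffix=".tab") as tmp:
    tmp.write(b"visibility data")

manager = LazyTableHandle(tmp.name)
with ThreadPoolExecutor(max_workers=4) as pool:
    # Each call opens and closes an independent handle on its own thread.
    results = list(pool.map(read_payload, [manager] * 4))
os.unlink(tmp.name)
```

Since the object's state is just strings, it can also be sent to other processes without touching the underlying table object.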

The else clause is never reached because that case is handled in line 2
Ensure chunk encoding isn't specified when the DataArray is a dask array
Create TableManager class so casacore tables are opened on each thread/process as needed, which avoids serialization issues
Add read_col_conversion_dask, which creates a lazy representation of a DataArray. Each lazy chunk combines a column read and a reshaping step. The input argument lofar_read_size controls both the size of contiguous reads from the measurement set and the size of chunks in the zarr store
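The chunk-encoding commit above can be pictured with a short sketch. It is only an assumption about the shape of the logic, not the code in this PR: `build_encoding` and `chunk_shapes` are hypothetical, and the variable names are examples. The point is that a dask-backed variable gets no explicit chunk encoding, so the store chunks can follow the array's own dask chunks.

```python
import dask.array as da
import numpy as np
import xarray as xr


def build_encoding(ds, chunk_shapes):
    """Return a to_zarr-style encoding dict, skipping dask-backed variables.

    `chunk_shapes` maps variable name -> desired chunk tuple (hypothetical
    input used only for this sketch).
    """
    encoding = {}
    for name, var in ds.data_vars.items():
        if isinstance(var.data, da.Array):
            # Dask-backed: leave chunking to the array's own dask chunks.
            continue
        if name in chunk_shapes:
            encoding[name] = {"chunks": chunk_shapes[name]}
    return encoding


ds = xr.Dataset(
    {
        "WEIGHT": (("time", "chan"), np.ones((4, 8))),
        "VISIBILITY": (("time", "chan"), da.ones((4, 8), chunks=(2, 8))),
    }
)
encoding = build_encoding(ds, {"WEIGHT": (4, 8), "VISIBILITY": (4, 8)})
```

Here only WEIGHT receives a chunk encoding; VISIBILITY is left out because it is already chunked lazily.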
sstansill requested a review from Jan-Willem August 5, 2024 14:23
sstansill added the enhancement (New feature or request) and optimisation (The computation time has been decreased) labels Aug 5, 2024
sstansill marked this pull request as ready for review August 5, 2024 14:25
2 participants