refactor(convert): Full refactor for the usage of disk during memory expansion [all tests ci] #1185

Merged
merged 53 commits into from
Nov 7, 2023
53 commits
e2a0029
short-circuit _read_datagrams to save all power/angle data to parser.…
leewujung Sep 27, 2023
2d33b9c
add tests/convert/conftest.py with mock ping_data_dict
leewujung Sep 27, 2023
89dba5b
add examples to generate different types of ping_data_dict
leewujung Sep 27, 2023
f5039e8
update conftest mock ping data generation
leewujung Sep 27, 2023
50fb839
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 27, 2023
12528b5
add mock_ping_data_dict_complex
leewujung Sep 29, 2023
29d460b
refactor(convert): Initial full refactor of the swap feature
lsetiawan Oct 2, 2023
e8f1e50
refactor: More updates for parse base
lsetiawan Oct 2, 2023
3104d8d
docs: Add more comments to code
lsetiawan Oct 2, 2023
13bc47e
refactor: Change to direct array creation and clean up
lsetiawan Oct 2, 2023
061b8a9
refactor: Merge remote-tracking branch 'leewujung/p2z-mock-data' into…
lsetiawan Oct 2, 2023
f752f1c
fix: Fix bug in retrieving the expanded shapes
lsetiawan Oct 3, 2023
94db308
refactor: Bring back the echodata destructor
lsetiawan Oct 3, 2023
04a18c0
refactor: Utilize python built-in for temp files
lsetiawan Oct 4, 2023
608aa3b
fix: Fix getting OS temp directory
lsetiawan Oct 4, 2023
617f510
fix: Fix ping_time dimension size conflicts
lsetiawan Oct 9, 2023
86bcab3
refactor: Merge branch 'dev' into refac_swap
lsetiawan Oct 9, 2023
f64c014
fix: Fix type hint for ping_data_dict
lsetiawan Oct 10, 2023
7422dfe
test: Add test for ParseBase Class
lsetiawan Oct 10, 2023
6b603e9
refactor(testing): Move common testing utilities from tests
lsetiawan Oct 10, 2023
5dc7d20
chore(deps): Add pytest-mock dependency
lsetiawan Oct 11, 2023
f359a8c
refactor: Merge branch 'dev' into refac_swap
lsetiawan Oct 25, 2023
9ae7605
fix: Fix imaginary part values being 0 not NaN
lsetiawan Oct 26, 2023
9b7c818
test: Improve test for parser rectangularize data
lsetiawan Oct 26, 2023
4e8e16f
refactor: Remove _calc_max_dim_shape, not needed
lsetiawan Oct 27, 2023
cd53204
fix: Fix dest_storage_options failure and add test
lsetiawan Oct 27, 2023
65bc882
test: Add no zarr_root test for parse and pad
lsetiawan Oct 27, 2023
932ea7b
fix: Fix compute max shapes from tuples
lsetiawan Oct 27, 2023
deaac11
test: Add shape testing for swap and no swap
lsetiawan Oct 27, 2023
7a44dc1
test: Remove simple mock object
lsetiawan Oct 27, 2023
97dbf88
test: Add test_api for convert
lsetiawan Oct 27, 2023
5b9c8ba
test: Add more integration test files and remove P2Z test
lsetiawan Oct 27, 2023
0a8c6cd
refactor: Update to only use disk swap
lsetiawan Oct 30, 2023
5fe6866
refactor: Update api to reflect use_swap changes
lsetiawan Oct 30, 2023
9a566bf
test: Update test to use_swap
lsetiawan Oct 30, 2023
207ea8b
refactor: Set reshape to be more upstream to parser
lsetiawan Oct 31, 2023
4a9b477
fix: Fix bug in calc final shape
lsetiawan Oct 31, 2023
5c2ecd7
fix: Don't reshape if there's no sector
lsetiawan Oct 31, 2023
d74e095
fix: Fix zero division error since there's no t-sectors
lsetiawan Oct 31, 2023
3ea3698
refactor: Update t-sector check to see if channel is there
lsetiawan Oct 31, 2023
77dae50
test: Add jitter ping time xr.concat test
lsetiawan Oct 31, 2023
fd39a98
Merge branch 'dev' into refac_swap
lsetiawan Nov 6, 2023
6b0a4d6
refac: Default open_raw use swap to False
lsetiawan Nov 6, 2023
b478854
docs: Improve api docstring
lsetiawan Nov 6, 2023
36b1223
refactor: Change __should_use_swap behaviour
lsetiawan Nov 6, 2023
f914046
docs: Add comment about power and complex pre-proc
lsetiawan Nov 6, 2023
2d5f56f
chore: Clean up old unused function
lsetiawan Nov 6, 2023
2ebc868
refactor: Move backscatter_r assign to rest of var_dict
lsetiawan Nov 6, 2023
1c67698
refactor: Renamed duplicate test names
lsetiawan Nov 7, 2023
731f096
test: Update echopype/tests/convert/test_convert_api.py
lsetiawan Nov 7, 2023
0a0aa2f
docs: Add TODO doc to move reshaping
lsetiawan Nov 7, 2023
3b50f3d
docs: Add comment about getting the chunk sizes
lsetiawan Nov 7, 2023
bf895a1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 7, 2023
106 changes: 26 additions & 80 deletions echopype/convert/api.py
@@ -1,13 +1,12 @@
from pathlib import Path
from typing import TYPE_CHECKING, Dict, Optional, Tuple
from typing import TYPE_CHECKING, Dict, Literal, Optional, Tuple, Union

import fsspec
from datatree import DataTree

# fmt: off
# black and isort have conflicting ideas about how this should be formatted
from ..core import SONAR_MODELS
from .parsed_to_zarr import Parsed2Zarr

if TYPE_CHECKING:
from ..core import EngineHint, PathHint, SonarModelsHint
@@ -17,7 +16,6 @@
from ..utils.coding import COMPRESSION_SETTINGS
from ..utils.log import _init_logger
from ..utils.prov import add_processing_level
from .utils.ek import should_use_swap

BEAM_SUBGROUP_DEFAULT = "Beam_group1"

@@ -315,11 +313,9 @@ def open_raw(
xml_path: Optional["PathHint"] = None,
convert_params: Optional[Dict[str, str]] = None,
storage_options: Optional[Dict[str, str]] = None,
use_swap: bool = False,
destination_path: Optional[str] = None,
destination_storage_options: Optional[Dict[str, str]] = None,
max_mb: int = 100,
) -> Optional[EchoData]:
use_swap: Union[bool, Literal["auto"]] = False,
max_chunk_size: str = "100MB",
) -> EchoData:
"""Create an EchoData object containing parsed data from a single raw data file.

The EchoData object can be used for adding metadata and ancillary data
Expand Down Expand Up @@ -347,19 +343,11 @@ def open_raw(
and need to be added to the converted file
storage_options : dict, optional
options for cloud storage
use_swap: bool
**DEPRECATED: This flag is ignored**
If True, variables with a large memory footprint will be
written to a temporary zarr store at ``~/.echopype/temp_output/parsed2zarr_temp_files``
destination_path: str
The path to a swap directory in a case of a large memory footprint.
This path can be a remote path like s3://bucket/swap_dir.
By default, it will create a temporary zarr store at
``~/.echopype/temp_output/parsed2zarr_temp_files`` if needed,
when set to "auto".
destination_storage_options: dict, optional
Options for remote storage for the swap directory ``destination_path``
argument.
use_swap: bool or "auto", default False
Flag to use disk swap in case of a large memory footprint.
When set to ``True`` (or when set to "auto" and a large memory footprint is needed),
this function will create a temporary zarr store in the operating system's
temporary directory.
max_mb : int
The maximum data chunk size in Megabytes (MB), when offloading
variables with a large memory footprint to a temporary zarr store
@@ -382,23 +370,18 @@ def open_raw(

Notes
-----
In a case of a large memory footprint, the program will determine if using
In case of a large memory footprint, the program will determine if using
a temporary swap space is needed. If so, it will use that space
during conversion to prevent out of memory errors.
Users can override this behaviour by either passing ``"swap"`` or ``"no_swap"``
into the ``destination_path`` argument.
This feature is only available for the following
echosounders: EK60, ES70, EK80, ES80, EA640. Additionally, this feature
is currently in beta.
"""

# Initially set use_swap False
use_swap = False

# Set initial destination_path of "no_swap"
if destination_path is None:
destination_path = "no_swap"
Users can override this behaviour by either passing
``use_swap=True`` or ``use_swap=False``. If a keyword "auto" is
used for the ``use_swap`` parameter, echopype will determine the usage of
swap space automatically.

This feature is only available for the following
echosounders: EK60, ES70, EK80, ES80, EA640.
"""
if raw_file is None:
raise FileNotFoundError("The path to the raw data file must be specified.")
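
Review note: the updated docstring describes three accepted values for `use_swap` (`True`, `False`, and `"auto"`). A minimal sketch of how such a tri-state flag could be resolved is given here. The helper name `resolve_use_swap` and its byte-count arguments are hypothetical, and the `mem_mult=0.4` default is only borrowed from the `should_use_swap(...)` call that this PR removes. This is not echopype's actual implementation.

```python
from typing import Literal, Union


def resolve_use_swap(
    use_swap: Union[bool, Literal["auto"]],
    expanded_nbytes: int,
    available_nbytes: int,
    mem_mult: float = 0.4,
) -> bool:
    """Decide whether expanded data should go to disk swap.

    Hypothetical helper for illustration only.
    """
    if isinstance(use_swap, bool):
        # An explicit True/False always overrides the automatic decision.
        return use_swap
    if use_swap == "auto":
        # Swap only when the expanded arrays would exceed a fraction
        # (mem_mult) of the memory reported as available.
        return expanded_nbytes > mem_mult * available_nbytes
    raise ValueError(f"use_swap must be True, False, or 'auto', got {use_swap!r}")
```

With this shape, legacy string values such as `"swap"` or `"no_swap"` are rejected instead of being silently accepted.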

@@ -435,59 +418,25 @@ def open_raw(
else:
params = "ALL" # reserved to control if only wants to parse a certain type of datagram

# obtain dict associated with directly writing to zarr
dgram_zarr_vars = SONAR_MODELS[sonar_model]["dgram_zarr_vars"]

# Parse raw file and organize data into groups
parser = SONAR_MODELS[sonar_model]["parser"](
file_chk,
params=params,
storage_options=storage_options,
dgram_zarr_vars=dgram_zarr_vars,
sonar_model=sonar_model,
)

# Actually parse the raw datagrams from source file
parser.parse_raw()

# Direct offload to zarr and rectangularization only available for some sonar models
# No rectangularization for other sonar models not listed below
if sonar_model in ["EK60", "ES70", "EK80", "ES80", "EA640"]:
swap_map = {
"swap": True,
"no_swap": False,
}
if destination_path == "auto":
# Overwrite use_swap if it's True below
# Use local swap directory
use_swap = should_use_swap(parser.zarr_datagrams, dgram_zarr_vars, mem_mult=0.4)
elif destination_path in swap_map:
use_swap = swap_map[destination_path]
else:
# TODO: Add docstring about swap path
use_swap = True
if "://" in destination_path and destination_storage_options is None:
raise ValueError(
(
"Please provide storage options for remote destination. ",
"If access is already configured locally, ",
"simply put an empty dictionary.",
)
)

if use_swap:
# Create sonar_model-specific p2z object
p2z = SONAR_MODELS[sonar_model]["parsed2zarr"](parser)
p2z.datagram_to_zarr(
dest_path=destination_path,
dest_storage_options=destination_storage_options,
max_mb=max_mb,
)
else:
p2z = Parsed2Zarr(parser) # Create general p2z object
parser.rectangularize_data()

else:
# No rectangularization for other sonar models
p2z = Parsed2Zarr(parser) # Create general p2z object
# Perform rectangularization and offload to zarr
# if the data expansion is too large to fit in memory
parser.rectangularize_data(
use_swap=use_swap,
max_chunk_size=max_chunk_size,
)

setgrouper = SONAR_MODELS[sonar_model]["set_groups"](
parser,
@@ -496,7 +445,6 @@ def open_raw(
output_path=None,
sonar_model=sonar_model,
params=_set_convert_params(convert_params),
parsed2zarr_obj=p2z,
)

# Setup tree dictionary
@@ -547,9 +495,7 @@ def open_raw(
# Create tree and echodata
# TODO: make the creation of tree dynamically generated from yaml
tree = DataTree.from_dict(tree_dict, name="root")
echodata = EchoData(
source_file=file_chk, xml_path=xml_chk, sonar_model=sonar_model, parsed2zarr_obj=p2z
)
echodata = EchoData(source_file=file_chk, xml_path=xml_chk, sonar_model=sonar_model)
echodata._set_tree(tree)
echodata._load_tree()
