Merging current refactored EchoPro code for software renaming #199

Merged
merged 56 commits into from
Mar 6, 2024

brandynlucca
Collaborator

No description provided.

brandynlucca and others added 30 commits January 5, 2024 13:14
Updated `core.py` to contain the expected nested dictionary data structures of
the imported data attributes.

The initialization of `Survey` includes importing
the configuration files (`initialization_config.yml` and `survey_year_2019_config.yml`)
and all associated data. This includes a utility function `populate_tree` that
maps out the current data attribute dictionary paths (e.g. where all of the
biological data is stored within the `Survey` class object). This is called by
the user via `Survey.summary`.
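
As a rough illustration of what mapping out the data attribute dictionary paths can look like, the sketch below walks a nested dictionary and lists the key path to every leaf. The function name `list_attribute_paths` and the example layout are hypothetical and are not taken from `populate_tree` or the `Survey` class.

```python
def list_attribute_paths(tree, prefix=()):
    """Return the key path to every leaf of a nested dictionary."""
    paths = []
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict) and value:
            # Descend into non-empty sub-dictionaries
            paths.extend(list_attribute_paths(value, path))
        else:
            paths.append(path)
    return paths


# Hypothetical layout, only loosely modeled on attribute names used in this PR
survey_data = {
    "biology": {"length_df": None, "specimen_df": None},
    "acoustics": {"nasc": {"nasc_df": None}},
}

for path in list_attribute_paths(survey_data):
    print(" -> ".join(path))
```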

Minor adjustments were made to the configuration files.

Some support scripts have been added within `data_structure_utils.py` that
aid in pushing/pulling files from nested dictionaries.
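
A minimal sketch of pushing to and pulling from a nested dictionary by key path is shown here. The helper names `push_nested` and `pull_nested` are assumptions made for illustration and may not match what `data_structure_utils.py` actually defines.

```python
from functools import reduce


def pull_nested(tree, path):
    """Fetch the value stored at `path` (a sequence of keys)."""
    return reduce(lambda node, key: node[key], path, tree)


def push_nested(tree, path, value):
    """Store `value` at `path`, creating intermediate dictionaries as needed."""
    node = tree
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value


data = {}
push_nested(data, ["biology", "weight", "weight_strata_df"], "placeholder")
print(pull_nested(data, ["biology", "weight", "weight_strata_df"]))
```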
- Reduced the number of recursive functions
-- Maintained the `populate_tree()` function solely for debugging purposes. Its functionality is completely independent of the rest of the module in its current state.
--- This concerns: `PARASITE_TREE`, `pushed_nested_dict`, `add_trees`, `populate_tree`

- Factored out the column validation
-- Now located at `EchoPro.utils.data_file_validation`
--- This concerns: `validate_data_columns`

- Factored out data type validation and import
--- This concerns: `read_validated_data`
--- This can probably be further consolidated with a little more hard-coding given that it started from a recursive-heavy state.
* `core.py`
--- Amended `LAYER_NAME_MAP` API to include a template with expected data attribute dictionary paths

* `survey.py`
--- Reframed the `load_survey_day` loop to iterate by expected data attribute rather than encountered file name
--- Updated arguments for `read_validated_data()`
--- Adjustments made for filepath handling with `load_configuration()`

* `data_file_validation`
--- Reworked argument inputs into `validate_data_columns()` to match the updated arguments for `read_validated_data()`
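
The column validation idea can be sketched roughly as follows. This is not the project's `validate_data_columns()`; the function names, the expected schema, and the file name are all hypothetical.

```python
def find_missing_columns(columns, expected):
    """Return the expected column names absent from `columns`, in order."""
    present = set(columns)
    return [name for name in expected if name not in present]


def check_columns(columns, expected, source="<unknown file>"):
    """Raise a ValueError naming any required columns that are missing."""
    missing = find_missing_columns(columns, expected)
    if missing:
        raise ValueError(f"{source} is missing required columns: {missing}")


# Hypothetical expected schema: column name -> expected data type
EXPECTED = {"haul_num": int, "length": float, "sex": str}

# Extra columns are tolerated; only missing ones raise
check_columns(["haul_num", "length", "sex", "extra"], EXPECTED, "length.xlsx")
```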
…oader_changes

Update core.py and begin refactoring data loader
* Built up the `strata_mean_sigma_bs` and `impute_missing_sigma_bs` functions
    - `self.strata_mean_sigma_bs`
        --- This adds the `sigma_bs` dictionary to `self.acoustics`
        --- Within this dictionary, three sets of values are stored:
            1) `length_binned`: `sigma_bs` values from `specimen_df` and `length_df`
            2) `haul_mean`: mean `sigma_bs` across each region-specific haul ID
            3) `strata_mean`: mean `sigma_bs` across each stratum layer
    - `self.impute_missing_sigma_bs`
        --- This imputes the mean `sigma_bs` from the closest strata values in cases where strata are missing
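
One possible reading of the stratum averaging and imputation steps is sketched below with `pandas`. The column names, the example values, and the interpretation of "closest strata" as the nearest populated stratum on each side are assumptions; the actual `impute_missing_sigma_bs` logic may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical haul-level values; column names are assumptions
haul_mean = pd.DataFrame({
    "stratum_num": [1, 1, 2, 4],
    "haul_num": [10, 11, 12, 13],
    "sigma_bs_mean": [2.0e-5, 4.0e-5, 5.0e-5, 9.0e-5],
})

# Mean sigma_bs within each stratum
strata_mean = haul_mean.groupby("stratum_num")["sigma_bs_mean"].mean()


def impute_from_nearest(strata_mean, all_strata):
    """Fill missing strata with the mean of the nearest populated neighbors."""
    full = strata_mean.reindex(all_strata)  # missing strata become NaN
    known = strata_mean.index.to_numpy()
    for stratum in full.index[full.isna().to_numpy()]:
        below = known[known < stratum]
        above = known[known > stratum]
        neighbors = []
        if below.size:
            neighbors.append(below.max())
        if above.size:
            neighbors.append(above.min())
        full.loc[stratum] = strata_mean.loc[neighbors].mean()
    return full


imputed = impute_from_nearest(strata_mean, [1, 2, 3, 4, 5])
print(imputed)
```

Here stratum 3 takes the average of strata 2 and 4, while stratum 5 has only one populated neighbor (stratum 4) and inherits its value.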
The length and age bin parameters were originally moved within the
config attribute via `biometric_distributions`. This has now been
moved to `load_configuration`.
Amended typos, grammar, etc., in the function doc strings and in-line comments
Amended functions like `discretize_variable` to be simpler and to directly describe
the actual (or intended) outputs, since these functions serve very specific
tasks. For instance, `quantize_variable` does describe what the function broadly
does, but the actual functionality is very narrow in scope. This also
aligns with the argument names.
There was an issue with how `sigma_bs_impute` was being constructed when
concatenating the original `strata_mean` dataframe with a newly generated
dataframe containing the missing strata, with `np.nan` as placeholders for the
missing `mean_sigma_bs` values.
Reformatted the "noun_modifier" formats for variable/column naming in
generated dataframes for consistency with the rest of the module.
…a-WIP-refactor-compute_transect_results

Brandynlucca-WIP-refactor-compute_transect_results
There was an issue where I had tested the code using an already defined
global version of a certain variable and function that allowed the code to run.
However, when running in a clean instance the code does not work (as expected).
This commit has amended that issue, specifically for `strata_age_binned_weight_proportions`.
…a-WIP-age_weighting

Refactor apportioning of weights/counts to age, sex, and intersecting age-sex bins
Incorporated the EPSG datum into initialization_config that is used for
defining the projection and other spatial features for georeferenced
NASC measurements.
* add test_data folder, pytest skip all existing tests

* add skeleton test_data_loader

* rename test to test_load_configuration

* add test_data/temp to .gitignore

* fix potential problem with test_data/temp not existing

* use pytest.tmp_path for temp re-written config_init

* note: test_data/input_files does not exist yet
* Create pr.yaml for running tests on PR

* update requirements to see how pip does

* remove nb_conda_kernels from requirements

* add scipy
* move all test_*.py out from subfolders

* rename old test modules with _OLD
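
A test that rewrites a configuration file into pytest's `tmp_path` fixture might look roughly like the sketch below. The template text, the `data_root_dir` key, and the `write_temp_config` helper are hypothetical and are not copied from the repository's test suite.

```python
from pathlib import Path

# Hypothetical template text; the key name is an assumption
CONFIG_TEMPLATE = "data_root_dir: {root}\n"


def write_temp_config(tmp_dir: Path, data_root: Path) -> Path:
    """Write a copy of the config that points at `data_root`."""
    config_path = tmp_dir / "config_init.yml"
    config_path.write_text(CONFIG_TEMPLATE.format(root=data_root.as_posix()))
    return config_path


def test_load_configuration(tmp_path):
    # `tmp_path` is pytest's built-in per-test temporary directory fixture
    config_path = write_temp_config(tmp_path, Path("test_data/input_files"))
    assert config_path.parent == tmp_path
    assert "data_root_dir: test_data/input_files" in config_path.read_text()
```

Writing the rewritten config into `tmp_path` avoids depending on a `test_data/temp` folder existing in the checkout.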
A new function (`stretch`) has been added to `operations.py`
to reduce the amount of cluttered and repetitive code contained
within the `nasc_to_biomass_conversion` function. I expect this
function to be re-used elsewhere, as well. The `stretch` function
leverages the built-in `pandas.wide_to_long` 'gather/melt' method
that ultimately re-indexes the data by consolidating the separate
data columns (e.g. `rho_a` for `male`, `female`, `unsexed`, and `total`)
into a single index (e.g. `sex`) and data (e.g. `rho_a`) column. This can
help provide a more intuitive way of filtering out specific groups/contrasts
in downstream functions and methods.
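
The reshaping described above can be sketched with `pandas.wide_to_long` as follows. The example dataframe and its column naming pattern (`rho_a_<sex>`) are assumptions, and this is not the `stretch` function itself.

```python
import pandas as pd

# Hypothetical wide table: one `rho_a` column per sex group
wide = pd.DataFrame({
    "transect_num": [1, 2],
    "rho_a_male": [0.4, 0.1],
    "rho_a_female": [0.5, 0.2],
    "rho_a_total": [0.9, 0.3],
})

# Collapse the `rho_a_*` columns into a `sex` index column and one `rho_a` data column
long_df = pd.wide_to_long(
    wide,
    stubnames="rho_a",
    i="transect_num",
    j="sex",
    sep="_",
    suffix=r"[a-z]+",
).reset_index()

# Filtering a single group becomes a one-line row selection
females = long_df[long_df["sex"] == "female"]
print(females)
```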
The previous commit/push missed the doc string defined for the
`stretch` function.
An additional utility function `group_merge` has been
added to reduce the amount of repetition in cases where
multiple dataframes are being merged in the same step/pipeline/chain.
This doesn't change the previous output/result of the code, but it is
expected to be used in later calculations/steps, where it will enable
more consistent formatting and ensure that grouped merges are
performed the same way every time. This is particularly
important so that the 'how' and 'on' arguments are applied appropriately
and are less vulnerable to errant typos.
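
A minimal sketch of chaining several merges with `functools.reduce` is given below. The helper name `merge_all` and the example dataframes are hypothetical; the real `group_merge` signature may differ.

```python
from functools import reduce

import pandas as pd


def merge_all(dataframes, on, how="outer"):
    """Merge a sequence of dataframes left to right with shared `on`/`how`."""
    return reduce(
        lambda left, right: left.merge(right, on=on, how=how),
        dataframes,
    )


a = pd.DataFrame({"stratum_num": [1, 2], "count": [10, 20]})
b = pd.DataFrame({"stratum_num": [1, 2], "weight": [1.5, 2.5]})
c = pd.DataFrame({"stratum_num": [1, 2], "area": [100.0, 200.0]})

merged = merge_all([a, b, c], on="stratum_num", how="inner")
print(merged)
```

Because `on` and `how` are passed once, every merge in the chain uses the same arguments.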
The `load_configuration` function was previously included as a
static method within `Survey`; however, this isn't necessary since
`load_configuration` never uses `self` as an argument. Consequently,
it has been moved to `EchoPro.utils.data_file_validation`.
Various changes were made to enable the INPFC strata
from the `INPFC` sheet to be validated (alongside `stratification1`), read,
and incorporated into the `Survey` object. This replaces the previous
hard-coded `pandas.DataFrame` that was generated in the `stratified_summary`
method. In `survey_year_2019_config.yml`, this is represented by
`sheetname: [ INPFC , stratification1 ]` associated with the
`geo_strata` configuration setting. So now the data validating and reading
functions can handle multiple `.xlsx` sheetnames from the same file.
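
Handling either a single sheet name or a list of them could be sketched as below. The `read_sheets` helper, the `reader` parameter, and the file name are hypothetical; the stand-in reader is used only so the example runs without an actual `.xlsx` file.

```python
import pandas as pd


def read_sheets(path, sheetname, reader=pd.read_excel):
    """Read one or several sheets, returning a dict keyed by sheet name."""
    # Normalize a single name into a one-element list
    names = [sheetname] if isinstance(sheetname, str) else list(sheetname)
    return {name: reader(path, sheet_name=name) for name in names}


def fake_reader(path, sheet_name):
    """Stand-in for `pd.read_excel` that only records what was requested."""
    return f"{path}:{sheet_name}"


sheets = read_sheets(
    "geo_strata.xlsx", ["INPFC", "stratification1"], reader=fake_reader
)
print(sheets)
```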
As mentioned in Issue OSOceanAcoustics#177, the location of
`load_configuration` within `EchoPro` has changed. When run locally,
the test passes. This commit also pushes changes to the included
test-related files that worked from this branch.
The line `from functools import reduce` was missing from
`operations.py` to enable the `reduce(...)` function used
within `group_merge(...)`.
brandynlucca and others added 26 commits February 9, 2024 09:49
Renamed `calculate_bounds` to `calculate_start_end_coordinates` to reflect
that the function is not drawing a true geospatial boundary box/rectangle
around the transect coordinates.
Renamed the dataframe column with strata numbers within
 `self.biology[ 'weight' ][ 'weight_strata_df' ]` from `stratum` to
 `stratum_num`.
Code within the `nasc_to_biomass_conversion(...)` function was
refactored to create `index_sex_weight_proportions(...)` and
`index_transect_age_sex_proportions(...)`. These functions will
yield the following variables: `sex_indexed_weight_proportions` and
`nasc_fraction_total_df`.
Added preliminary doc strings to `index_sex_weight_proportions` and
`index_transect_age_sex_proportions`. Small edits were also made to the
corresponding `nasc_to_biomass_conversion(...)` code and imported
modules in `biology.py`.
Missing modules located in `EchoPro.computation.biology` were
appropriately added into `survey.py`.
Amended the doc string associated with `calculate_start_end_coordinates`
…a-nasc_to_biomass_plus_jolly_hampton

Add population-level calculations and stratified statistics
Added the semivariogram functions defined in the original Matlab code
(cases 1 through 13). These have not been fully fleshed out and have
not yet been fully documented. WIP.
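
For orientation, a generic exponential semivariogram model is sketched below. It is a textbook form chosen for illustration and is not claimed to correspond to any of the 13 cases ported from Matlab; the function and parameter names are assumptions.

```python
import numpy as np


def exponential_semivariogram(lag, sill, nugget, correlation_range):
    """Exponential model: rises from `nugget` at zero lag toward `sill`."""
    lag = np.asarray(lag, dtype=float)
    decay = 1.0 - np.exp(-lag / correlation_range)
    return nugget + (sill - nugget) * decay


lags = np.linspace(0.0, 50.0, 6)
gamma = exponential_semivariogram(lags, sill=1.0, nugget=0.1, correlation_range=5.0)
print(gamma)
```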
- Added folder for doc images
- Added placeholder markdown files for documentation discussing theoretical
information and mathematical equations
- Added example image to `core_data_structure` which was renamed
- `glossary.md` was added to contain a list of symbols and variable names
that can be found throughout the workflow of the software both programmatically
and in the mathematical equations
@leewujung
Member

Thanks @brandynlucca -- awesome work. Super excited to get this merged!

@leewujung leewujung merged commit f18321c into OSOceanAcoustics:main Mar 6, 2024
2 checks passed
2 participants