
v0.8.0 bug fixes and doc string fixes #327

Merged
merged 21 commits into develop on Jun 18, 2024

Conversation

@akeeste (Contributor) commented May 14, 2024

This PR contains several bug fixes and doc string fixes for v0.8.0. I think it's critical that we make these fixes before our workshop next Thursday.

I tested all of our Python examples on a clean install of MHKiT with Python 3.11. Several examples have errors:

  • ADCP_Delft3D_TRTS_example.ipynb - Scipy interpolation error
  • cdip_example.ipynb - Error indexing by YYYY-MM. Passes in 3.9 but not 3.11
  • directional_waves.ipynb - Error reading cached data. Passes in 3.9 but not 3.11
  • metocean_example.ipynb - Pandas date indexing error. Passes in 3.9 but not 3.11
  • PacWave_resource_characterization_example.ipynb - Folium map error
  • qc_example.ipynb - mhkit.utils.index_to_datetime is removed/renamed
  • tidal_example.ipynb - Error slicing by time. Passes in 3.9 but not 3.11
  • wave_example.ipynb - Warning about pd.read_csv inferring a date format. Error defining the "mean" statistic
  • WPTO_hindcast_example.ipynb - Error getting the variable significant_wave_height_0

Incorporates several minor documentation fixes:

  • updates links in NOAA functions
  • fixes formatting in the mooring example
  • mitigates warnings due to doc string issues, spacing, etc.; remaining items will be addressed in future linting

@akeeste akeeste added the documentation label (Improvements or additions to documentation) May 14, 2024
@akeeste akeeste changed the base branch from master to develop May 14, 2024 21:27
@akeeste (Contributor, Author) commented May 15, 2024

@ssolson is it possible we can change our automated PyPI release to be triggered by GitHub releases instead of commits to master? The current automation prevents us from making minor documentation updates after a release. We typically don't get to this in time, and the documentation is built on master, which we only see after a release.

@akeeste (Contributor, Author) commented May 15, 2024

> @ssolson is it possible we can change our automated pypi release to be triggered by Github releases instead of commits to master? The current automation prevents us from making minor documentation updates after a release. We typically don't get to this in time and the documentation is built on master which we only see after a release.

Fortunately I found a workaround that allows us to get our documentation updated quickly before OREC, so we don't need these changes pulled in immediately. But long term we should still look at how our development workflows on each repo tie into doc updates.

@akeeste akeeste changed the title from "Minors fixes for doc strings items" to "v0.8.0 bug fixes and doc string fixes" May 17, 2024
@akeeste akeeste added the bug label (Something isn't working) May 17, 2024
@akeeste (Contributor, Author) commented May 17, 2024

@ssolson I identified issues within several Python examples and am working on getting these fixed ASAP. I reworked this PR to focus on bugs, not just minor documentation items. I'd like to make this a bugfix release as soon as possible.

The notebook vs script runs may have inadvertently been an environment issue (3.9 vs 3.11). Checking on that now.

@akeeste (Contributor, Author) commented May 17, 2024

Update on WPTO_hindcast_example.ipynb:
In #220 there was a slight change to the variable naming: variable_{i} --> variable_{gid}. GitHub was not comparing files well in that PR, and we probably missed that change. Unless there's a good reason to keep gid appended to the name, we should revert to appending i. The old tests reflect that. It's unclear why the tests passed in that PR but later broke because of the renaming.

@akeeste (Contributor, Author) commented May 17, 2024

Update on qc_example.ipynb:
In #276, index_to_datetime was accidentally removed from mhkit.utils.time_utils. It has been re-added, fixing the qc example.

@ssolson (Contributor) commented May 20, 2024

> @ssolson is it possible we can change our automated pypi release to be triggered by Github releases instead of commits to master? The current automation prevents us from making minor documentation updates after a release. We typically don't get to this in time and the documentation is built on master which we only see after a release.

The pypi.yml workflow currently works exactly the way you request here:

  • on push we create a release to test.pypi.org
  • on release we create a release on pypi.org

@ssolson ssolson mentioned this pull request May 20, 2024
@ssolson (Contributor) commented May 20, 2024

directional_waves.ipynb works fine for me in Python 3.11. I have cleared the cache and re-run multiple times.

@ssolson (Contributor) commented May 20, 2024

> PacWave_resource_characterization_example.ipynb - Folium map error

This is not an MHKiT error

You just need to pip install folium

folium is only used to add context to the example and is not a requirement of MHKiT

@ssolson (Contributor) commented May 20, 2024

> PacWave_resource_characterization_example.ipynb - Folium map error
>
> This is not an MHKiT error
>
> You just need to pip install folium
>
> folium is only used to add context to the example and is not a requirement of MHKiT

Apologies, never mind.

I see the issue is the basemap was showing up gray. I have now fixed this in akeeste#4

@ssolson (Contributor) commented May 21, 2024

I take responsibility for not catching this, but I did not realize that the wave module's xarray conversion made the functions so slow that they are now unusable.

E.g., in the PacWave resource characterization, this block now takes more than 30 minutes... unless you convert Tp back to pandas, which takes only 9 minutes.

@akeeste I know the wave module is still in a bit of transition, but do you expect the slowdowns to be significantly reduced when the transition is complete? I am questioning our approach of always converting pandas to xarray.

# Initialize empty lists to store the results from each year
Hm0_list = []
Te_list = []
J_list = []
Tp_list = []
Tz_list = []

# Iterate over each year and save the results in the initialized lists
for year in ndbc_data:
    data_raw = ndbc_data[year]
    year_data = data_raw[data_raw != 999.0].dropna()
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    J_list.append(resource.energy_flux(year_data.T, h=399.0))

    # Tp_list.append(resource.peak_period(year_data.T))
    fp = year_data.T.idxmax(axis=0)
    Tp = 1/fp
    Tp = pd.DataFrame(Tp, index=year_data.T.columns, columns=["Tp"])
    Tp_list.append(Tp)

    Tz_list.append(resource.average_zero_crossing_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
J = pd.concat(J_list, axis=0)
Tz = pd.concat(Tz_list, axis=0)
data = pd.concat([Hm0, Te, Tp, J, Tz], axis=1)

# Calculate wave steepness
data["Sm"] = data.Hm0 / (9.81 / (2 * np.pi) * data.Tz**2)

# Drop any NaNs created from the calculation of Hm0 or Te
data.dropna(inplace=True)
# Sort the DateTime index
data.sort_index(inplace=True)

@akeeste (Contributor, Author) commented May 21, 2024

> @ssolson is it possible we can change our automated pypi release to be triggered by Github releases instead of commits to master? The current automation prevents us from making minor documentation updates after a release. We typically don't get to this in time and the documentation is built on master which we only see after a release.
>
> The pypi.yml release currently works exactly the way you request it to work here.
>
> • on push we create a release to test.pypi.org
> • on release we create a release on pypi.org

@ssolson Thanks for clarifying. I missed that when skimming the pypi.yml workflow.

* fix cdip

* D3D TRTS eg fix

* fix pylint error

* pylint

* fix too many branches
@akeeste (Contributor, Author) commented May 21, 2024

@ssolson thanks for addressing the cdip and D3D examples in akeeste#4.

I was also able to get the directional_waves example to run once the cache was cleared. Maybe there's an issue using pickle to read data into Python 3.11 when it was cached in 3.9? That's the situation I had locally. Not something we need to address, but something to be aware of.

On the note of computational speed--
In my experience the delay is not coming from performing calculations on xarray data vs on pandas data, but rather from the actual conversion between pandas<-->xarray. Some methods are faster than others. I noticed that using xr.Dataset(data) is significantly faster than ds = data.to_xarray(). From digging into it briefly, the to_xarray method calls a large number of checks on the data being converted from pandas, which becomes expensive when there are lots of variables and timesteps--as is the case when we pull wave spectra over long periods of time.

I implemented a few changes to our type handling to address this, but perhaps there's still an instance that uses to_xarray() on pandas data. Have you noticed a specific part of the PacWave example that is slowing down?
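The conversion-cost gap described above can be sketched with a synthetic frame (hypothetical sizes, not MHKiT data; absolute timings vary by machine):

```python
import time

import numpy as np
import pandas as pd
import xarray as xr

# Synthetic wide frame mimicking spectral data: many timesteps x many frequency columns.
df = pd.DataFrame(
    np.random.rand(5000, 50),
    index=pd.date_range("2020-01-01", periods=5000, freq="h"),
    columns=[f"f{i}" for i in range(50)],
)
df.index.name = "time"

# Path 1: pass the DataFrame straight to the Dataset constructor.
t0 = time.perf_counter()
ds_ctor = xr.Dataset(df)
t_ctor = time.perf_counter() - t0

# Path 2: pandas' to_xarray, which runs extra per-column validation.
t0 = time.perf_counter()
ds_to_xr = df.to_xarray()
t_to_xr = time.perf_counter() - t0

print(f"xr.Dataset(df): {t_ctor:.4f} s  |  df.to_xarray(): {t_to_xr:.4f} s")

# Both routes produce equivalent variables over the 'time' dimension.
assert set(ds_ctor.data_vars) == set(ds_to_xr.data_vars)
```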

@ssolson (Contributor) commented May 21, 2024

> @ssolson thanks for addressing the cdip and D3D examples in akeeste#4.
>
> I was also able to get the directional_waves example to run when the cache was cleared. Maybe there's an issue using pickle to read data into Python 3.11 when it was cached in 3.9? That's the situation I had locally. Not something we need to address, but something to be aware of.
>
> On the note of computational speed: In my experience the delay is not coming from performing calculations on xarray data vs. pandas data, but rather from the actual conversion between pandas and xarray. Some methods are faster than others. I noticed that using xr.Dataset(data) is significantly faster than ds = data.to_xarray(). From digging into it briefly, the to_xarray method runs a large number of checks on the pandas data being converted, which becomes expensive when there are lots of variables and timesteps, as is the case when we pull wave spectra over long periods of time.
>
> I implemented a few changes to our type handling to address this, but perhaps there's still an instance that uses to_xarray() on pandas data. Have you noticed a specific part of the PacWave example that is slowing down?

Apologies, my previous message was not clear.

Yes, there are issues with the computation (not just the conversion) on the converted xarrays, specifically the calculation of Tp.

This morning I created the following script and found Tp to be the primary bottleneck. I commented out the built-in MHKiT Tp and used the MHKiT v0.7.0 pandas version inline, which was significantly faster (~60%).

Time results using built-in Tp

  • Average time for Hm0: 3.2555 seconds
  • Average time for Te: 4.0896 seconds
  • Average time for J: 3.3126 seconds
  • Average time for Tp: 17.2465 seconds
  • Average time for Tz: 4.1233 seconds
  • Average time to calculate metrics in a year: 32.0306 seconds
  • Total time for all metrics: 160.1376 seconds

Time results using previous pandas Tp

Look at the results for the Tp calculation vs the others:

  • Average time for Hm0: 3.2567 seconds
  • Average time for Te: 4.0718 seconds
  • Average time for J: 3.2766 seconds
  • Average time for Tp: 0.0022 seconds
  • Average time for Tz: 4.1557 seconds
  • Average time to calculate metrics in a year: 14.7664 seconds
  • Total time for all metrics: 73.8152 seconds
import mhkit
from mhkit.wave import resource, performance, graphics, contours
from mhkit.wave.io import ndbc
import matplotlib.pyplot as plt
import numpy as np
import time
import xarray as xr
import pandas as pd


# Get buoy metadata
buoy_number = "46050"
buoy_metadata = ndbc.get_buoy_metadata(buoy_number)

# Spectral wave density for buoy 46050
parameter = "swden"

# Request list of available files
ndbc_available_data = ndbc.available_data(parameter, buoy_number)

# Pass file names to NDBC and request the data
filenames = ndbc_available_data["filename"]
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)


# Initialize empty lists to store the results from each year
Hm0_list = []
Te_list = []
J_list = []
Tp_list = []
Tz_list = []

# Dictionaries to store the time taken by each function
timing_data = {
    "Hm0": [],
    "Te": [],
    "J": [],
    "Tp": [],
    "Tz": []
}

# List to store the total time for each year
yearly_times = []

for year in list(ndbc_data.keys())[:5]:
    year_start_time = time.time()  # Start time for the year

    data_raw = ndbc_data[year]
    year_data = data_raw[data_raw != 999.0].dropna()
    
    start_time = time.time()
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    timing_data["Hm0"].append(time.time() - start_time)
    
    start_time = time.time()
    Te_list.append(resource.energy_period(year_data.T))
    timing_data["Te"].append(time.time() - start_time)
    
    start_time = time.time()
    J_list.append(resource.energy_flux(year_data.T, h=399.0))
    timing_data["J"].append(time.time() - start_time)

    start_time = time.time()
    # Tp_list.append(resource.peak_period(year_data.T))
    fp = year_data.T.idxmax(axis=0)
    Tp = 1/fp
    Tp = pd.DataFrame(Tp, index=year_data.T.columns, columns=["Tp"])
    timing_data["Tp"].append(time.time() - start_time)
    
    start_time = time.time()
    Tz_list.append(resource.average_zero_crossing_period(year_data.T))
    timing_data["Tz"].append(time.time() - start_time)
   
    yearly_times.append(time.time() - year_start_time) 

# Print the timing data for each function
total_time = 0
for func_name, times in timing_data.items():
    avg_time = np.mean(times)
    sum_time = np.sum(times)
    print(f"Average time for {func_name}: {avg_time:.4f} seconds")
    # print(f"Total time for {func_name}: {sum_time:.4f} seconds")
    total_time += sum_time



# Calculate and print the average time to calculate metrics in a year
avg_yearly_time = np.mean(yearly_times)
print(f"Average time to calculate metrics in a year: {avg_yearly_time:.4f} seconds")

# Print the total time for all metrics
print(f"Total time for all metrics: {total_time:.4f} seconds")

@akeeste (Contributor, Author) commented May 21, 2024

@ssolson Okay understood. Seems odd as the methodology is essentially the same now, but applied on an xarray dataset (and with type handling). I wonder if the delay is mostly in the type conversion or xarray's idxmax function.

How does the expense compare if the spectrum is already an xarray Dataset (type handling eliminated) and we calculate Tp with:

   fp = S.idxmax(dim=frequency_dimension)  # Hz
   Tp = 1 / fp

@ssolson (Contributor) commented May 21, 2024

> @ssolson Okay understood. Seems odd as the methodology is essentially the same now, but applied on an xarray dataset (and with type handling). I wonder if the delay is mostly in the type conversion or xarray's idxmax function.
>
> How does the expense compare if the spectrum is already an xarray Dataset (type handling eliminated) and we calculate Tp with:
>
>    fp = S.idxmax(dim=frequency_dimension)  # Hz
>    Tp = 1 / fp

That is what I found: finding the argmax is a very expensive operation on the xarray objects.

I spent some time trying to fix this but I realized I was getting outside the scope of this PR pretty quickly.

So I wanted to circle back with you and see whether time gains are to be expected from a full xarray implementation, or whether we should reconsider our approach.

@akeeste (Contributor, Author) commented May 22, 2024

I don't expect a significant improvement in computational expense in the future just by using xarray. Let's discuss approaches at our next monthly meeting.

@akeeste akeeste marked this pull request as ready for review May 22, 2024 14:52
@ssolson (Contributor) commented May 22, 2024

@akeeste can you merge the fixes in akeeste#5?

* fix cdip

* D3D TRTS eg fix

* fix pylint error

* pylint

* fix too many branches

* fix map and Tp

* folium fix & Tp using pandas

* up to python 3.10

* py version as string

* fix variable check for new xarray

* undo changes
Comment on lines 322 to 324
LM = convert_to_dataset(LM).to_array()
JM = convert_to_dataset(JM).to_array()
frequency = convert_to_dataset(frequency).to_array()
Contributor:

This change is causing the tests to fail. I'm looking into it but if you are around @akeeste I know you are more familiar with these changes.

Contributor:

@akeeste can you tell me the reasoning behind this change? Is it simply a performance improvement to go Dataset-to-array instead of directly to an array?

Contributor:

The easiest solution is to skip the conversion if it is already a DataArray. Otherwise we are going to need to pass the name around to fix this.

Do you have a preference?

Contributor Author:

@ssolson I fixed these few lines. To be honest, I can't recall at the moment why I made this change, but I reverted it and the power_performance_workflow tests are passing now.

@jmcvey3 (Contributor) commented Jun 4, 2024

If I can jump in, part of this problem might be because y'all are using Datasets for single variables rather than DataArrays. The line fp = S.idxmax(dim=frequency_dimension) will run a lot faster on a DataArray using fp = S["Frequency"].where(S == S.max("Frequency"), drop=True).

If you want to keep the Datasets, that line gets more complicated but is still a touch faster: fp = S["0"]["Frequency"].where(S["0"] == S["0"].max("Frequency"), drop=True)
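As a rough illustration of the DataArray path under discussion, a minimal sketch with synthetic spectra (hypothetical peak shapes and dimension names; not MHKiT code):

```python
import numpy as np
import xarray as xr

# Synthetic spectra: one Gaussian peak per timestep at a known frequency.
freq = np.linspace(0.05, 0.5, 50)
times = np.arange(200)
rng = np.random.default_rng(0)
peak_idx = rng.integers(5, 45, size=times.size)
spec = np.exp(-((freq[:, None] - freq[peak_idx][None, :]) ** 2) / 1e-3)

S = xr.DataArray(
    spec,
    coords={"Frequency": freq, "time": times},
    dims=("Frequency", "time"),
)

# On a DataArray, idxmax returns the peak frequency for each timestep directly,
# so the peak period follows in one step.
fp = S.idxmax(dim="Frequency")
Tp = 1.0 / fp

# The recovered peak frequencies match the known ones.
assert np.allclose(fp.values, freq[peak_idx])
```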

@ssolson (Contributor) commented Jun 11, 2024

> If I can jump in, part of this problem might be because y'all are using Datasets for single variables rather than DataArrays. The line fp = S.idxmax(dim=frequency_dimension) will run a lot faster on a DataArray using fp = S["Frequency"].where(S == S.max("Frequency"), drop=True).
>
> If you want to keep the Datasets, that line gets more complicated but is still a touch faster: fp = S["0"]["Frequency"].where(S["0"] == S["0"].max("Frequency"), drop=True)

@jmcvey3 thank you for jumping in. I have not tried this yet, but it looks promising.

From our meeting yesterday, @akeeste is going to address the outstanding issues to pass tests and get these bug fixes through.

Because the proposed solution changes the general approach, we are going to address it in a follow-on PR focused on the larger xarray strategy.

@akeeste (Contributor, Author) commented Jun 13, 2024

There are still a couple of tests failing. I'm not sure why the mooring test is failing, since I only updated the docstring (for better formatting in the built docs). I'll continue looking into this and narrow in on the issue in the metocean example.

@akeeste (Contributor, Author) commented Jun 13, 2024

A part of matplotlib that is called in the mooring animation, unrelated to this PR, is deprecated in matplotlib 3.9+, causing mooring_test to fail here. I'm guessing that if we re-ran tests on dev, it would fail there too. @hivanov-nrel any insight on how we can fix this?

@hivanov-nrel (Contributor) commented
Hi Adam, it looks like the issue is on lines 89 and 129. The inputs need to be sequences, so the fix is to put brackets around them, like so:

line.set_data([x_data], [y_data])

Do you mind implementing this on your PR or should I do this fix separately?

@akeeste (Contributor, Author) commented Jun 13, 2024

> Hi Adam, it looks like the issue is on line 89 and line 129. The inputs needs to be a sequence so the fix for this is putting brackets around them like so:
>
> line.set_data([x_data], [y_data])
>
> Do you mind implementing this on your PR or should I do this fix separately?

Thanks for taking a quick look @hivanov-nrel. That fixed mooring_test, I'll add it to this PR.
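For reference, a minimal sketch of the set_data fix (headless Agg backend; the scalar x_data/y_data values stand in for a single animation frame and are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([], [])

x_data, y_data = 1.0, 2.0  # scalars produced each animation frame

# Newer Matplotlib requires sequences here, so wrap the scalars in lists:
line.set_data([x_data], [y_data])

xs, ys = line.get_data()
print(list(xs), list(ys))  # [1.0] [2.0]
```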

@ssolson (Contributor) left a comment

Adam, this looks good to me. If this is ready to merge, it would help me because #330 depends on these changes and I am currently being forced to run hindcast tests on that PR.

@ssolson ssolson merged commit 37254f7 into MHKiT-Software:develop Jun 18, 2024
20 checks passed
@ssolson ssolson mentioned this pull request Jun 28, 2024
ssolson added a commit that referenced this pull request Jun 30, 2024
# MHKiT v0.8.1
MHKiT v0.8.1 includes bug fixes in the example notebooks and pins dependency requirements to versions prior to NumPy 2.0.0.

Fixes MHKiT v0.8.0 installation issues (#334) by pinning the dependencies.
- #335

Fixes bugs in MHKiT example notebooks
- #327