Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch/spectrum bin peaks #19

Merged
merged 5 commits into from
Sep 3, 2024
Merged

Conversation

singjc
Copy link
Collaborator

@singjc singjc commented Aug 23, 2024

PR Type

enhancement


Description

  • Added peak binning functionality to enhance spectrum plots, allowing for better visualization and analysis.
  • Introduced new parameters bin_peaks and num_x_bins to control the binning process.
  • Implemented a new method _bin_peaks to perform binning based on x-axis values and calculate mean intensity within each bin.
  • Updated the _prepare_data method to integrate the binning process when required.

Changes walkthrough 📝

Relevant files
Enhancement
_core.py
Implement peak binning for spectrum plots                               

pyopenms_viz/_core.py

  • Added peak binning functionality for spectrum plots.
  • Introduced new parameters bin_peaks and num_x_bins.
  • Implemented _bin_peaks method to handle peak binning.
  • Updated _prepare_data method to incorporate peak binning.
  • +52/-2   

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Copy link
    Contributor

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Potential Performance Issue
    The _bin_peaks method creates new DataFrame objects and performs multiple operations, which might be inefficient for large datasets. Consider optimizing this method for better performance.

    Incomplete Condition
    The condition for binning peaks in the _prepare_data method is incomplete. It checks if self.bin_peaks is True or "auto", but doesn't specify when "auto" should trigger binning.

    Copy link
    Contributor

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Add a check for positive number of bins before performing peak binning

    The condition for binning peaks is incomplete. It should also check if
    self.num_x_bins > 0 to avoid potential division by zero or other issues with
    non-positive bin numbers.

    pyopenms_viz/_core.py [732-736]

     # Bin peaks if required
    -if self.bin_peaks == True or (self.bin_peaks == "auto"
    -):
    +if self.bin_peaks == True or (self.bin_peaks == "auto" and self.num_x_bins > 0):
         spectrum = self._bin_peaks(spectrum, x, y)
         if reference_spectrum is not None:
             reference_spectrum = self._bin_peaks(reference_spectrum, x, y)
     
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: This suggestion addresses a potential issue by ensuring that the number of bins is positive, preventing division by zero or other errors, which is crucial for the robustness of the code.

    9
    Enhancement
    Implement a more robust method for handling missing values in binned data

    The _bin_peaks method is filling NaN values with 0, which might not be appropriate
    for all cases. Consider allowing the user to specify how to handle NaN values, or
    use a more robust method like interpolation.

    pyopenms_viz/_core.py [704-706]

     data[x] = data[x].apply(lambda interval: interval.mid).astype(float)
    -data = data.fillna(0)
    +data = data.interpolate().fillna(method='bfill').fillna(method='ffill')
     return data
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: This suggestion improves the handling of NaN values by using interpolation and backfill/forward fill methods, which is a significant enhancement for data integrity.

    8
    Use quantile-based discretization for more balanced bin distribution

    The _bin_peaks method uses cut function which may create bins with no data. Consider
    using qcut instead, which creates bins with equal frequencies, potentially providing
    a more balanced representation of the data.

    pyopenms_viz/_core.py [689-703]

    -data[x] = cut(data[x], bins=self.num_x_bins)
    +data[x] = pd.qcut(data[x], q=self.num_x_bins, duplicates='drop')
     if self.by is not None:
         # Group by x bin and by column and calculate the mean intensity within each bin
         data = (
             data.groupby([x, self.by], observed=True)
             .agg({y: "mean"})
             .reset_index()
         )
     else:
         # Group by x bins and calculate the mean intensity within each bin
         data = (
             data.groupby([x], observed=True)
             .agg({y: "mean"})
             .reset_index()
         )
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Using qcut for quantile-based discretization can provide a more balanced representation of data, enhancing the method's effectiveness, but it is not a critical change.

    7
    Best practice
    Use more descriptive variable names for better code readability

    Consider using a more descriptive variable name instead of x and y. For example, mz
    for mass-to-charge ratio and intensity for intensity values. This would make the
    code more self-explanatory and easier to understand.

    pyopenms_viz/_core.py [596-606]

    -def plot(self, x, y, **kwargs):
    +def plot(self, mz, intensity, **kwargs):
         """Standard spectrum plot with m/z on x-axis, intensity on y-axis and optional mirror spectrum."""
         
         # Prepare data
         spectrum, reference_spectrum = self._prepare_data(
    -        self.data, x, y, self.reference_spectrum
    +        self.data, mz, intensity, self.reference_spectrum
         )
         kwargs.pop("fig", None)  # remove figure from **kwargs if exists
     
    -    entries = {"m/z": x, "intensity": y}
    +    entries = {"m/z": mz, "intensity": intensity}
     
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: The suggestion to use more descriptive variable names like mz and intensity improves code readability and self-explanatory nature, but it is not crucial for functionality.

    6

    @singjc singjc requested a review from axelwalter August 23, 2024 05:02
    Copy link
    Collaborator

    @axelwalter axelwalter left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Nice addition, would be nice if the user could choose the binning method.

    @@ -558,6 +558,8 @@ def __init__(
    reference_spectrum: DataFrame | None = None,
    mirror_spectrum: bool = False,
    relative_intensity: bool = False,
    bin_peaks: Union[Literal["auto"], bool] = "auto",
    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Perhaps the user could choose the method to calculate number of bins.
    Adding a bin_method parameter such as Literal["none", "sturges", "freedman diaconis"] ?

    "auto" was useful in the peak map binning, where the y dimension was considered as well.

    @singjc singjc merged commit ab2e684 into OpenMS:main Sep 3, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants