Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Api rework #34

Merged
merged 64 commits into from
Feb 28, 2024
Merged

Conversation

KronosTheLate
Copy link
Contributor

@KronosTheLate KronosTheLate commented Sep 22, 2023

This draft PR represents an effort to rework the API of Peaks.jl, based on the disuccion in #24. As of the time of writing, the PR is mainly a proof of concept and test of the user facing API. A lot of changes are still missing, which have been listed below:

ToDo

  • Agree on API
    • Which functions should be exported, and what should they be named? Is possible to maintain compatability with old API for a transition phase?
    • Do we need a non-mutating API? I do not see the use-case, and if needed users can call peakproms!(deepcopy(pks)).
    • Should Peaks.jl provide minima-finding functionality? I feel like it should be sufficient to advise users to reverse the sign of the values.
  • Create great docstrings for all exported functions
    • Signatures, explanitory text, examples and references to other relevant functions for all
  • Clean up and optimize internals
    • Review the efficiency of algorithms.
    • Make checks on inputs to provide more helpful errors for imagined common erroneous function calls.
  • Split new and old API into seperate dosctrings?
  • Make new API appear last in docstrings
  • Test that new intended workflows work and don't produce errors
  • Add/update tests
    • New tests are needed to confirm that the new intended workflows work and don't produce errors.
    • The filterpeaks functions will need correctness tests.
  • Make all docstrings use # Optional keyword arguments instead of [].
  • Add keywords min and max to old API docstring
  • Because hasfield(::NamedTuple) is not defined, we should change field to property or feature everywhere. We could also tak of keys and values
    • hasfield(::NamedTuple, ::Symbol) exists/works from Julia v1.6?
  • Make a documentation page (Documenter.jl is great), ensuring that all relevant functions are discoverable.
    • Do you mean a general/tutorial page? There is already a documenter setup?
    • I imagine that there will be some confusion on wether or not the functions mutate the input. It is important to make clear that the named tuples can not be mutated, but the vectors stored as the values will be mutated.
  • Update readme

Copy link
Owner

@halleysfifthinc halleysfifthinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this; I like the new piping workflow this will enable.

A couple overall comments before addressing the bullet points you raised:

  • Where possible, please use/replicate the existing language/terminology I've chosen in the rest of the package, e.g peaks for referring to the indices of peaks, minprom/maxprom (instead of just min/max, which is less descriptive and shadows the min/max functions), etc.
  • My goals for this package/API are to be maximally flexible in how it can be and is used; this directs my responses to your questions.
  • Which functions should be exported, and what should they be named? Is possible to maintain compatability with old API for a transition phase?

The existing API should continue to be exported, for backwards compatibility and flexibility. Not everyone likes piping syntax, and the piping API depends on the existing functions, so there is no point to not exporting them. Additionally, some of the exported functions are not being "replaced" by the new piping functions (e.g. findnextmaxima, ismaxima, etc). The new functions can be added as methods to the existing functions.

  • Do we need a non-mutating API?

In the context of piping, I agree it doesn't seem useful. But in keeping with the existing API, I think it makes sense to have copying/non-mutating options.

  • Should Peaks.jl provide minima-finding functionality?

Yes. Although reversing the sign is easy and fairly obvious, that adds a copy/allocation that can be avoided with minima focused functions.

Create great docstrings for all exported functions

Signatures, explanations and examples should be added to the existing docstrings (since they will be methods of the main function names)

For tests, I suggest chaining together existing tests/values.

src/api_rework.jl Outdated Show resolved Hide resolved
src/api_rework.jl Outdated Show resolved Hide resolved
src/api_rework.jl Outdated Show resolved Hide resolved
src/api_rework.jl Outdated Show resolved Hide resolved
@KronosTheLate
Copy link
Contributor Author

Thanks for the thorough feedback. It is nice to have my suggestions taken seriously ^_^

I have (quite) a few responses to your initial comment 😅. I don't know if is the best format, but I will reply to them one by one. My suspicion is that for some points we will simply disagree. I am in more of a "tear it down and build a new one" mindset when it comes to the API, whereas you want to avoid breakage. Your approach is more sensible, but I would also find it a shame if the package will permanently make some suboptimal API choice(s) to avoid breakage. Particularly now that we will make significant changes anyway.

But let's discuss particulars, and see if we can reach a conclusion on the way. For the points raised so far, I believe we can have our cake and eat it too, reaching both our goals.

Responses to concrete points

Where possible, please use/replicate the existing language/terminology I've chosen in the rest of the package, e.g peaks for referring to the indices of peaks, minprom/maxprom (instead of just min/max, which is less descriptive and shadows the min/max functions), etc.

I changed the terminology intentionally, as I feel like I made improvement. I thought this would have to be significantly breaking, but see now that a non-breaking rework implementing my changes may actually be possible. I will now argue for why the changes you exemplified with are valuable improvements IMO:

  • I feel like a peak is characterized by more than its location (mainly height, but also prominence, width etc). It is therefore not intuitive to me that "peaks" would mean the indices of the peak location, particularly now that we will have it as a field of a named tuple. Conceptually the named tuple represents the peaks, so to me, pks.peaks would read like "the peaks of the peaks". I feel like a field inds, belonging to an object containing peaks, is extremely explicit and therefore better. This is mainly a matter of internal variable names and documentation, and will only reach the public API as the names of the fields of the returned named tuple. It will therefore be non-breaking to change peaks to inds internally in the same release that adds named tuple methods.
  • minprom/maxprom is more verbose than min/max, and when we are talking about the keyword arguments for a function findproms!, it is also clear that it is redundant information. When it comes to the functions by the same name, I do not see the issue. Programmatically keywords do not shadow function names, and mentally the context (kwarg vs function name) distinguishes them sufficiently. Finally, keyword arguments min/max can be the same for all the functions that calculate and filter peak features, meaning that the API is more consistent and streamlined this way. This can be made non-breaking by making a signature peaksproms!(pks; minprom=nothing, maxprom=nothing, min=minprom, max=maxprom), and using min and max internally. This would allow a smooth transition phase, and heck, it would even be free to keep both versions around. The only downside is that it may be confusing to the user to have multiple kwarg for the same thing, but I feel like the gain is greater than that cost.

My goals for this package/API are to be maximally flexible in how it can be and is used; this directs my responses to your questions.

My goals with the rework is to make it more intuitive and user friendly. If we both manage to reach our goals, this will be the best peak finding package the world has seen 😄

The existing API should continue to be exported, for backwards compatibility and flexibility.

I feel like the new function names will need to be the same as the old, as those are often the best names. But perhaps you are saying that we can add methods that take a named tuple as the first argument, and "unpack" the named tuple so into the old method of the same function? So that what is added here is mainly new methods, which would make it non-breaking? That sounds great for compatibility, so perfect for a transition phase.

I do feel like documenting two very different API's extra work, and likely to be confusing to new users. It will also complicate/duplicate the internals slightly, increasing maintenance burden. Do we agree that whatever API we agree on, the old one will be deprecated in the long term?

Not everyone likes piping syntax,

My proposal is not about the piping syntax, that was more an afterthought. In the example code I posted in the original issue, the first examples show a non-piping syntax, which was the original idea. The main point of the API rework is working with named tuples instead of individual vectors.

Yes. Although reversing the sign is easy and fairly obvious, that adds a copy/allocation that can be avoided with minima focused functions.

I would easily pay the price (on behalf of the user) of an extra allocation, if the alternative is the all functionality needs to be duplicated for the minima focus functions. I just do not feel motivated to write that much internal code and bloat the codebase to avoid a single initial allocation for a small fraction of uses. Furthermore, I can hardly see the application where it is critical to avoid that allocation. With that said, if you are motivated to put in that work, the code bloat is likely not a large problem, so I would not have an issue with having the functionality.

@halleysfifthinc
Copy link
Owner

I feel like a peak is characterized by more than its location (mainly height, but also prominence, width etc) ... Conceptually the named tuple represents the peaks, so to me, pks.peaks would read like "the peaks of the peaks".

I agree, that was always a oversimplification that I compromised on; a location specific word is more correct. However, I do prefer indices over inds. (An extra couple characters is not the end of the world)

Programmatically keywords do not shadow function names

That is incorrect.

julia> f(x; min) = min(x, min)
f (generic function with 1 method)

julia> f(2; min=1)
ERROR: MethodError: objects of type Int64 are not callable
Maybe you forgot to use an operator such as *, ^, %, / etc. ?
Stacktrace:
 [1] f(x::Int64; min::Int64)
   @ Main .\REPL[1]:1
 [2] top-level scope
   @ REPL[4]:1

However, peakproms is the only function where the shadowing would be a problem, so we can prefix with Base and fix the issue.

This can be made non-breaking by making a signature peaksproms!(pks; minprom=nothing, maxprom=nothing, min=minprom, max=maxprom)

Yes please. Way back in the git history, you can find an example of how I deprecated a kwarg (strict used to have a different, more verbose, name iirc).

But perhaps you are saying that we can add methods that take a named tuple as the first argument, and "unpack" the named tuple so into the old method of the same function?

Yes exactly.

Furthermore, I can hardly see the application where it is critical to avoid that allocation.

Any time peak-finding is done in a (relatively) hot loop, an extra allocation is an avoidable performance cost. As the maintainer, I accept the cost of dealing with the additional code. 🤪

Do we agree that whatever API we agree on, the old one will be deprecated in the long term?

No. While they are redundant in some sense, there is no runtime cost to the additional methods, and I don't think that the additional code will be too much work to maintain. I also don't think users will be overly confused by the documentation of the NamedTuple methods, because it is common for Julia to have functions with multiple methods where each method has slightly different argument types.

@KronosTheLate
Copy link
Contributor Author

Conclusions:

  • The field that contains the indices will be named indices
  • We will make the keyword arguments min and max
    • We need to remember to prefix the base function with Base. to avoid internal shadowing
    • We will use a signature like minprom=nothing, maxprom=nothing, min=minprom, max=maxprom to allow the old keyword arguments to work seamlessly
  • The old functions should not be changed. The new NamedTuple based API should only add methods to the original functions (both mutating and non-mutating).
  • The old API will stay around indefinitly. Extra care will have to be taken to properly document both API's in an understandable manner for users, in particular the docstrings.
    • Do we agree that the new API is an improvement, and should be "fronted"? I.e. it should appear first in docstrings and the documentation.
    • Perhaps the best is to define the docstrings seperately for the different methods, and advice users to give argument type information to find the correct docstring (? peakproms!(::NamedTuple<tab>).
  • This realease should all-in-all contain major changes, but be non breaking

@halleysfifthinc
Copy link
Owner

I'm ambivalent about method docstring order; I'm not sure how configurable that is. I guess it is order dependent? You are welcome to convert the examples in the readme to the new API.

The next release will need to be breaking, as the return type of the old/current API functions will need to be changed to NamedTuples. In practice, this should be unnoticed for most cases, since I expect users are already using syntax that will destructure the return values. But, it is technically breaking.

@KronosTheLate
Copy link
Contributor Author

The next release will need to be breaking, as the return type of the old/current API functions will need to be changed to NamedTuples. In practice, this should be unnoticed for most cases, since I expect users are already using syntax that will destructure the return values. But, it is technically breaking.

Wait, so you imagine that the functions will always return named tuples? I imagine this will be massively breaking, as

  • The NamedTuple will always return the data first, not indices
  • The order of the named tuple fields/values changes based on the order it called in
  • The named tuples will have variable length, based on how many fields are added thus far.

In other words neither the order nor the length of the named tuple will match the old returned tuple. This will essentially break all code as far I as can tell. We could try to reorder the named tuple in every function to match the old API, but that still has the issue of having variable length (old tuple is same length always, new named tuple varies depending on previously added fields). If we are going to this level of breakage, then the ideal/value of "don't break things" goes completely out the window, and we might as well change anything we feel is better.

What I has in mind was that if you pass in distrinct vectors, a tuple is returned, maintaining the old API. If you pass in a named tuple, a named tuple is returned. This should allow both API's to stay around side by side.

@halleysfifthinc
Copy link
Owner

Hmmmm yeah it seems I hadn't fully thought that through. NamedTuples iterate/destructure by order, but they can destructure by name.

The order of the NamedTuple should be consistent, and except for the peak heights/values atm, as the PR stands now, it has a required order for proms and widths of (:proms, :widths, :edges), since the width calculation requires the prominences.

Here's what I suggest:

You currently have data as the first field, but I suggest we standardize on an order of (inds, vals, data, proms, widths, edges).

Then we can update findmaxima/findminima to return a NamedTuple (;inds, vals, data). That is non breaking since it keeps the current order and the extra data field will be ignored by any current syntax (either slurping then indexing, where index 3 doesn't currently exist so won't conflict, or by destructuring, where slurping is opt in, e.g. idxs, vals = findmaxima(x, w; strict) destructures by order and ignores everything after the first two indices). This also solves the entry point for the NamedTuple API when saving findpeaks for a separate PR.

The remaining old API will be left as-is.

src/api_rework.jl Outdated Show resolved Hide resolved
@KronosTheLate
Copy link
Contributor Author

KronosTheLate commented Sep 27, 2023

I was very surprised to see that any remaining elements are simply ignored. That solves the problem I thought we had, which was that the length of the returned (named) tuple would matter to the variables being assigned by destructuring.

But there is another issue. The current signatures are as follows:

peakproms -> (peaks, proms)
peakwidths -> (peaks, widths, leftedge, rightedge)
peakheights -> (peaks, heights)

If we settle on an established order of (indices, heights, data, proms, widths, edges), only peakheights still works (the other cases would, with the new "established order", return heights where proms or widths, leftedge, rightedge was expected).
To fix this, we would have to change the order of the named tuple as needed to match the old API, which just seems a bit like bending the new API out of shape to match the old API. By "bending out of shape" is refer to the constant reordering, and the following 2 choices I don't like that we would have to make:

  1. Put data on the third index, in between different peak features. It just feels off.
  2. Keep leftedge and rightedge separate. I liked the change of putting them both into edges

It just seems more simple to just make it so that if you pass separate vectors in, you get a tuple of vectors out. And if you pass a NamedTuple in, you get a named tuple out. In other words, let the old and new API live side by side without crossover. It seems minimally disruptive to code using the old API, while allowing the new API to be "done right", or at least without historic constraints. Would that be so bad?

@KronosTheLate
Copy link
Contributor Author

Any new thoughts? Any options you feel very strongly about?

@halleysfifthinc
Copy link
Owner

I'm not following the first bit of your comment, so perhaps I wasn't clear when talking about a standard order; I was meaning only for the NamedTuple. So (;indices, heights, data, proms, widths, edges) for the NamedTuple API.

Other than being consistent (for type reasons), I really don't think the NamedTuple order is important, since the main interaction should normally be via named indexing or property access. Therefore, compromising by having data as the third argument1 allows transparently converting findmaxima/minima to return a NamedTuple (;inds, vals, data) like I suggested in my previous comment. And again, the rest of the old API can be left as is, and the new NamedTuple API can append the various new data following the new standard order.

Re: your point 2 on leaving the edges separate or zipping them, I'm ambivalent. On balance, it does seem likely that any use for edges will want both sides; but I've never needed the edges before so I don't have any personal experience to have more of an opinion.

Footnotes

  1. As I see it, the only reason to include data at all is to enable piping. I don't imagine the data field will be directly accessed by users (vs by the functions) very often.

@KronosTheLate
Copy link
Contributor Author

I think we may have talked partially past each other, but I think I understand now. The current suggestions as I understand it:

  1. findmaxima and findminima go from returning (idxs, vals) to returning (indices=idxs, heights=vals, data=data). Adding the field data is crucial for the named tuple based API. While technically breaking, it will not break usage as in retval = findmaxima(signal); retval[1]; retval[2], nor in a, b = findmaxima(signal), so breakage should be minimal.
  2. The functions return tuples when given individual vectors, and named tuples when given named tuples. In the showcase below, all arguments can be appended with "!" with the same signature:
    1. peakproms(peaks, x) -> (peaks, proms) and peakproms(pks::NamedTuple)->pks
    2. peakwidths(peaks, x, proms) -> (peaks, widths, leftedge, rightedge) and peakproms(pks::NamedTuple)->pks
    3. peakheights(peaks, heights) -> (peaks, heights) and peakheights(pks::NamedTuple)->pks
  3. The named tuple pks will always be initially created by findmaxima/findminima, initializing it with the fields in the order indices, heights, data. Subsequent functions will append fields "to the end", meaning that the order of the named tuple is dependent on the order in which other fields, such as proms, are added.

I like everything about this proposal. It is minimally breaking, while allowing both APIs for a smooth transition. I would like to define findpeaks, but it makes sense to separate that into its own PR, and agree not to implement it here.

@halleysfifthinc
Copy link
Owner

Yep that is all correct.

@KronosTheLate
Copy link
Contributor Author

Great! So that means that we are free to go ahead with implementation ^_^

@KronosTheLate
Copy link
Contributor Author

As a general note, we seem to be quite misaligned in what we think is the best docstring or code. I value readability quite a lot for both, where you seem to value technical precision and performance quite a bit over readability. I can just say that as a user, I was put off by the amount of documented functions in the API reference, and the length of each docstring. It made it seem like an hour-long project to learn how to calculate a peak, and filter by it's width. I think that it is actually very easy to get started, and that this is lost in the amount of very good technical documentation.

I also feel like a docstring does not have to inform the user about everything - that is what the online documentation is for. The docstring is, to me, a quick way to look up method signatures, and what each argument does exactly. It is more of a how to use, rather than how it works, and all finicky details. So simple and short language, and signatures without syntax that I only learned after more than a year of normal use, are really important in making a package accessible, the way I see it.

With that off my chest, I am ready to wrap this up, hoping that you can take some of it into consideration. I do not think that we will agree on the details, so discussing them is often a little frustrating and takes a lot of effort. I have already put about 5x more work into this than I was prepared for, and it has dragged on for more than half a year. This discussion is extremely scattered, and I am not really finding it that productive to work out every discussion. So feel free to wrap this up in any way you see fit - my goal is just to get a named tuple API, so that I do not have to juggle multiple vectors and pass them in correctly myself.

@KronosTheLate
Copy link
Contributor Author

It has been 24 days since my last comment. My goal with it was to allow this PR to reach some conclusion, and be merged. Are you okay driving it home from here, or would you need something more from me?

@halleysfifthinc
Copy link
Owner

Hi! Yes I can take it from here; I've just been quite busy recently. I really appreciate and respect your effort throughout this PR! Sorry that we have such diametrically opposed ideals regarding docstrings/documentation.

@KronosTheLate
Copy link
Contributor Author

Ah okay, no problem! After all, it is your package, and so if there are disagreements it is only fair that you have the final say. I feel like the discussion was productive, it just took more effort than I was prepared for xD

Yhea it seems like we have some different ideals, but I absolutely see where you come from, and so I have no issues letting you make the remaining calls from here. Good luck!

@halleysfifthinc halleysfifthinc marked this pull request as ready for review February 28, 2024 02:34
@halleysfifthinc halleysfifthinc self-requested a review February 28, 2024 02:34
@halleysfifthinc halleysfifthinc merged commit a3f2ddd into halleysfifthinc:master Feb 28, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants