Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify and fix parametrized metrics + add missing keywords to tests #135

Closed
wants to merge 7 commits into from

Conversation

devmotion
Copy link
Member

@devmotion devmotion commented May 28, 2019

As correctly noticed in #134 (review), the current implementation of PeriodicEuclidean is a bit inconsistent and can be improved, it seems.

As an example, before the last PR #134 was merged,

julia> evaluate(PeriodicEuclidean([4]), 1, 3)
2

julia> evaluate(PeriodicEuclidean([4]), [1], [3])
2.0

julia> evaluate(PeriodicEuclidean([4,5]), 1, 3)
2

julia> evaluate(PeriodicEuclidean([4,5]), [1], [3])
ERROR: DimensionMismatch("arrays have length 1 but parameters have length 2.")
Stacktrace:
 [1] evaluate(::PeriodicEuclidean{Array{Int64,1}}, ::Array{Int64,1}, ::Array{Int64,1}) at /home/david/.julia/dev/Distances/src/metrics.jl:183
 [2] top-level scope at none:0

and now currently on master

julia> evaluate(PeriodicEuclidean([4]), 1, 3)
2.0

julia> evaluate(PeriodicEuclidean([4]), [1], [3])
2.0

julia> evaluate(PeriodicEuclidean([4,5]), 1, 3)
2.0

julia> evaluate(PeriodicEuclidean([4,5]), [1], [3])
ERROR: DimensionMismatch("arrays have length 1 but parameters have length 2.")
Stacktrace:
 [1] evaluate(::PeriodicEuclidean{Array{Int64,1}}, ::Array{Int64,1}, ::Array{Int64,1}) at /home/david/.julia/dev/Distances/src/metrics.jl:183
 [2] top-level scope at none:0

So the incorrect result types were fixed but since the scalar evaluation just takes first(d.periods) even incorrect lengths of d.periods do not throw errors for scalar input whereas they do if arrays are used as inputs.

I tried to fix this problem in this PR by checking the size of d.periods also for scalar inputs:

julia> evaluate(PeriodicEuclidean([4]), 1, 3)
2.0

julia> evaluate(PeriodicEuclidean([4]), [1], [3])
2.0

julia> evaluate(PeriodicEuclidean([4,5]), 1, 3)
ERROR: DimensionMismatch("inputs are scalars but parameters have length 2.")
Stacktrace:
 [1] evaluate(::PeriodicEuclidean{Array{Int64,1}}, ::Int64, ::Int64) at /home/david/.julia/dev/Distances/src/metrics.jl:266
 [2] top-level scope at none:0

julia> evaluate(PeriodicEuclidean([4,5]), [1], [3])
ERROR: DimensionMismatch("arrays have length 1 but parameters have length 2.")
Stacktrace:
 [1] evaluate(::PeriodicEuclidean{Array{Int64,1}}, ::Array{Int64,1}, ::Array{Int64,1}) at /home/david/.julia/dev/Distances/src/metrics.jl:180
 [2] top-level scope at none:0

Adding a result_type implementation that specializes on PeriodicEuclidean and does not directly call evaluate with scalar inputs allows to get rid of the special case for empty periods as well (which, as mentioned in #134 (comment)), was the reason for setting p to oneunit(eltype(d.periods)) for empty periods before). Moreover, I removed all occurrences of parameters() since it was only used to check if d is of type PeriodicEuclidean and made these checks explicit. Maybe one could implement two methods of evaluate for arrays that dispatch on PeriodicEuclidean to make it even more explicit and easier to read - currently half of the implementation of these methods are just handling PeriodicEuclidean.

@ararslan ararslan requested a review from nalimilan May 28, 2019 19:56
@dkarrasch
Copy link
Member

I like this very much, but I'd ask to leave the parameters trait. This was designed for allowing also all weighted UnionMetrics being handled by the same evaluate method. It just hasn't been done yet. Along those lines, we could have a "unified" scalar evaluate for UnionMetrics, which branches over p === nothing. Similarly, one could combine the two result_type functions using the same branching (as it used to be before #134). I could do it, if you prefer.

@devmotion
Copy link
Member Author

Ah, to combine it with the weighted metrics is a neat idea! I think it would increase readability if the resulting evaluate logic for both parametrized and non-parametrized metrics is kept separated, i.e., forward parametrized metrics (using the trait) to the methods in wmetrics (with some slight generalizations such as replacing d.weights with parameters(d)) and the other ones to "old" implementation in metrics.

When I quickly checked the weighted metrics, I noted some issues, highlighted by the following example:

julia> evaluate(WeightedSqEuclidean([3]), 1, 3)
4

julia> evaluate(WeightedSqEuclidean([3]), [1], [3])
12

julia> evaluate(WeightedSqEuclidean([3,4]), 1, 3)
4

julia> evaluate(WeightedSqEuclidean([3,4]), [1], [3])
ERROR: DimensionMismatch("arrays have length 1 but weights have length 2.")
Stacktrace:
 [1] evaluate(::WeightedSqEuclidean{Array{Int64,1}}, ::Array{Int64,1}, ::Array{Int64,1}) at /home/david/.julia/dev/Distances/src/wmetrics.jl:60
 [2] top-level scope at none:0

Basically, the weighted metrics require the same updates as PeriodicEuclidean, and in particular https://github.com/JuliaStats/Distances.jl/blob/master/src/wmetrics.jl#L43 is just plain wrong. Moreover, I noted that there's no constructor such as WeightedSqEuclidean(). That seems reasonable to me, since it's completely unclear what a reasonable initialization would be - if the weights are not of the same length as the evaluated numbers/vectors the computation will fail. Due to the same reason I think the constructor PeriodicEuclidean() should be removed, since the resulting metric can only be used to evaluate empty vectors.

I will try to add these changes later today.

@KristofferC
Copy link
Member

Please try the #107 (comment) benchmark as well. It is a good benchmark to see that things get correctly optimized for a simple case.

@dkarrasch
Copy link
Member

I guess there is a trade-off between readability and code duplication. @KristofferC made me work hard to not add another evaluate method for PeriodicEuclidean. One reason being the benchmark problem @KristofferC mentioned here #135 (comment). That is at least an argument in favor of having only "one loop", that can be tracked and maintained. OTOH, since we currently have the weighted evaluate anyway, we could put the whole parametrized logic in there. I guess there is no right or wrong here, it's just a design decision. Within the current framework, I have a working version already.

@devmotion
Copy link
Member Author

OTOH, since we currently have the weighted evaluate anyway, we could put the whole parametrized logic in there.

That was exactly what I proposed, but I guess I was not clear enough 😃

@dkarrasch
Copy link
Member

No, you were clear. I should have added "as you said". I didn't mean to steal your argument, I meant to give your argument against mine. 😄

@devmotion
Copy link
Member Author

The last commit fixes the issues with weighted metrics that I mentioned above, and unifies the way in which parametrized metrics (i.e., at the moment weighted metrics and PeriodicEuclidean) are evaluated. I ran both @KristofferC's short benchmark script and the full benchmark suite and could not observe any regressions.

@dkarrasch
Copy link
Member

Wow, amazing work! I see you have included the generic eval_reduce proposal. You could include the minor test "fixes" from #136, and then we can close that one. And you may consider to rename your PR for future reference.

@devmotion devmotion changed the title Simplify PeriodicEuclidean Unify and fix parametrized metrics + add missing keywords to tests May 29, 2019
@test pairwise(dist, x, y) ≈ rxy
@test pairwise(dist, x) ≈ rxx
@test pairwise(dist, x, y, dims=2) ≈ rxy
@test pairwise(dist, x, dims=2) ≈ rxx
Copy link
Contributor

@johnnychen94 johnnychen94 May 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose these two lines don't need change since pairwise(dist, x, y) still works (although with depwarn).

P.S. L500-L501 is duplicated to L502-L503; we can just remove it if we really need to change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yes, either remove, or leave to test for behavior without dims keyword.

Copy link
Member Author

@devmotion devmotion May 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why you would want to keep something that throws a depwarn, so I removed these duplicate lines.

@simd for I in 1:length(a)
ai = a[I]
bi = b[I]
pi = p[I]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any issue with overwriting the Base constant pi? I was experimenting with that and wasn't sure, so I used pI and aI etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Since we don't want to access the mathematical constant here, it should not matter that it is masked by the local variable.

@dkarrasch
Copy link
Member

Bump. It would be cool if we could review, merge, and tag this, to make the PeriodicEuclidean available.

@nalimilan
Copy link
Member

Is there any chance you could split this PR into smaller parts, or move out some fixes which can be applied separately first? It's very hard to review as-is as it touches many things. In particular, it would be nice to have a commit that moves code without changing it, so that we can review what actually changes.

@devmotion
Copy link
Member Author

Hmm... I'm not sure what would be a good way to split this PR. The first five commits unify and fix the parametrized metrics and fix and extend the tests to account for these changes; maybe the last two commits which fix and simplify different distance implementations could be separated.

it would be nice to have a commit that moves code without changing it

What do you mean?

@nalimilan
Copy link
Member

maybe the last two commits which fix and simplify different distance implementations could be separated.

Then yes, please do. Anything that isn't strictly required should be separated for clarity.

it would be nice to have a commit that moves code without changing it

What do you mean?

That the "Fix weighted metrics and handle parametrized metrics in unified way" commit moves lots of things (like the docstring for Euclidean([thresh])) from one place to another, making it impossible to spot what changed and want was just moved.

Also notice how the diff for metrics.jl makes it looks like you added a new _pairwise!(r::AbstractMatrix, dist::Euclidean, ...) method somewhere and modified the existing method with that signature to be _pairwise!(r::AbstractMatrix, dist::WeightedEuclidean, ...): this is almost impossible to review. If you moved the code without modifying it in a separate commit, then the diff would show what changed in the existing ::Euclidean and ::WeightedEuclidean methods (if anything). Also, does the code really needs to move?

@devmotion
Copy link
Member Author

Sorry for the delay! I tried to figure out how to split the PR, but even splitting the last two commits and rebasing them on master didn't work out well since they already require the restructuring of the weighted distances. Overall I got the impression that all changes are very closely related to each other.

I assumed that by unifying the parametrized metrics and getting rid of the separation between the weighted metrics and the periodic Euclidean metric, which were already defined in metrics.jl, there's no need for a separate file wmetrics.jl anymore, and so I moved the remaining parts of that file to metrics.jl.

I don't know how to do any more advanced rewriting of the git history, so I'm a bit lost here. Maybe it helps if you look at a smaller subset of changes in the Github review interface by, e.g., selecting the last two commits separately?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants