[WIP] Fix AD issues with various kernels #154
I would assume it should be possible to vectorize this code? What's the mathematical formula that you use here?
It is the same set of equations you mentioned earlier: `d((x-y)'*Q*(x-y))/dx = (Q + Q') * (x - y)`, `d((x-y)'*Q*(x-y))/dy = -(Q + Q') * (x - y)`, and `d((x-y)'*Q*(x-y))/dQ = (x - y) * (x - y)'`. But this is being done for all pairwise combinations together using `map`. It later sums these differences to get `δB` and others.

Please note that the current implementation is not correct; I am still debugging it (it only partially matches the intended result). If you happen to find any obvious mistakes, please let me know. I am having trouble reducing the results of the individual pairwise pullbacks to the final pullback. The way I am summing them is probably wrong.
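A quick way to sanity-check the three single-pair gradients quoted above is to compare them with central finite differences. The sketch below is only an illustration: `sqmahal`, `grad_x`, `grad_y`, `grad_Q`, and `fd_gradient` are hypothetical helper names, not functions from the PR, Distances.jl, or FiniteDifferences.jl. The gradient with respect to `Q` is written as the outer product `(x - y) * (x - y)'` so that it has the same shape as `Q`.

```julia
using LinearAlgebra

# Squared Mahalanobis form for a single pair of vectors (hypothetical helper).
sqmahal(Q, x, y) = (x - y)' * Q * (x - y)

# Candidate analytic gradients for a single pair.
grad_x(Q, x, y) = (Q + Q') * (x - y)
grad_y(Q, x, y) = -(Q + Q') * (x - y)
grad_Q(Q, x, y) = (x - y) * (x - y)'   # outer product, same size as Q

# Central finite differences, one entry of X at a time (hypothetical helper).
function fd_gradient(f, X; h=1e-5)
    G = zeros(size(X))
    for k in eachindex(X)
        E = zeros(size(X))
        E[k] = h
        G[k] = (f(X + E) - f(X - E)) / (2h)
    end
    return G
end

# Deliberately non-symmetric Q, so that Q + Q' differs from 2Q.
Q = [2.0 0.5; 0.1 1.0]
x = [1.0, -2.0]
y = [0.5, 3.0]

err_x = maximum(abs, grad_x(Q, x, y) - fd_gradient(z -> sqmahal(Q, z, y), x))
err_y = maximum(abs, grad_y(Q, x, y) - fd_gradient(z -> sqmahal(Q, x, z), y))
err_Q = maximum(abs, grad_Q(Q, x, y) - fd_gradient(M -> sqmahal(M, x, y), Q))
println((err_x, err_y, err_Q))
```

Since the form is linear in `Q` and quadratic in `x` and `y`, central differences are exact up to rounding, so the three errors should be close to machine precision.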
@devmotion isn't this wrong, or have I done something silly? They are equal in the Euclidean case. I feel this is the root of the problem.
It should work if `dist.qmat` is positive definite: JuliaStats/Distances.jl#174
This still does not solve the differences in the computed adjoints for the covariance matrix Q. My current implementation matches the second adjoint.
Yes, that would be the most natural way to ensure that it is always positive semi-definite (if the diagonal is non-negative) and optimization is performed in the correct space. So I guess users would want to use this parameterization even if it is not enforced by KernelFunctions and not directly supported by SqMahalanobis by using something like
Of course, it would be nice if (Sq)Mahalanobis would support specifying e.g. a Cholesky decomposition or PDMat directly (it could even be used for simplifying the computations since `x'*Q*x = (L'*x)'*(L'*x)` in this case), but can't we work around this by checking gradients of the `mykernel` setup instead of computing `Q -> MahalanobisKernel(Q)` directly? That's at least how we do it in DistributionsAD, e.g. in https://github.com/TuringLang/DistributionsAD.jl/blob/a96b159ab25aab67d1a2076726e8b9c392eb6fc7/test/ad/distributions.jl#L18-L34.
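To make the factor-based idea concrete, here is a minimal sketch that assumes `Q = L * L'` for a square factor `L`. It is not the setup from this thread: `sqmahal_factor`, `grad_factor`, and `fd_gradient` are hypothetical names, and no KernelFunctions or Distances API is used. It checks the identity `x'*Q*x = (L'*x)'*(L'*x)` and compares the gradient with respect to `L` against finite differences, which is the kind of check that goes through the parameterization rather than through `Q` directly.

```julia
using LinearAlgebra

# Squared Mahalanobis form expressed through a factor L, assuming Q = L * L'.
sqmahal_factor(L, x, y) = sum(abs2, L' * (x - y))

# Analytic gradient with respect to L: 2 * v * v' * L with v = x - y.
# For a triangular factor, only the lower-triangular part (tril) is free.
grad_factor(L, x, y) = 2 * (x - y) * (x - y)' * L

# Central finite differences, one entry of X at a time (hypothetical helper).
function fd_gradient(f, X; h=1e-5)
    G = zeros(size(X))
    for k in eachindex(X)
        E = zeros(size(X))
        E[k] = h
        G[k] = (f(X + E) - f(X - E)) / (2h)
    end
    return G
end

L = [1.5 0.0; -0.3 0.8]   # example lower-triangular factor
x = [1.0, -2.0]
y = [0.5, 3.0]
Q = L * L'                # symmetric positive semi-definite by construction

dense_value  = (x - y)' * Q * (x - y)
factor_value = sqmahal_factor(L, x, y)
err_L = maximum(abs, grad_factor(L, x, y) - fd_gradient(M -> sqmahal_factor(M, x, y), L))
println((dense_value, factor_value, err_L))
```

Because every perturbed `L` still yields a valid `Q`, the finite-difference check never leaves the set of symmetric positive semi-definite matrices, unlike perturbing the entries of `Q` individually.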
Yeah that should work. Will try that out.
Regarding the issue with the pairwise implementation, which messes up the `FiniteDifferences` results, do you suggest I override the implementation for the time being?
If you test the suggested parameterization, the implementation of `pairwise` shouldn't matter (since we do not test the intermediate step which might be affected by it).
True. Could we also change our side of the parameterization, i.e., the way it is stored in the struct? We could continue to allow initialization using a full matrix. This should allow for seamless AD regardless of how the user decides to initialize it.
I'm not sure if we want to do that; I think this deserves some discussion first (and then possibly a separate PR). Ideally, Distances would just support arbitrary matrices and contain optimized implementations for specific array types. We just forward `P` to SqMahalanobis, so ideally we wouldn't perform any transformations or computations. I'm also a bit worried that focusing on a specific parameterization might make it difficult for users who would like to use a different one (but still no dense matrix) or might lead to confusing behaviour.
There is some discrepancy between the simple case above and this pullback. Intuitively, from the simple case above, I would assume that `δB = sum_{i, j} (a_i - b_j) * (a_i - b_j)^T * Δ_{i,j}`. However, here you compute `δB = sum_{i, j} (a_i - b_j) * (a_i - b_j)^T * Δ_{i,j}^2`. Probably one of them is incorrect (table 7 in https://notendur.hi.is/jonasson/greinar/blas-rmd.pdf indicates that the pairwise one is incorrect). Can we add the derivation of the adjoints according to https://www.juliadiff.org/ChainRulesCore.jl/dev/arrays.html as docstrings or comments, or maybe even have a separate PR for the Mahalanobis fixes?
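One way to settle which reduction is right is to write the pairwise pullback in vectorized form and compare it with finite differences of the scalar `sum(Δ .* D)`. The sketch below assumes that the forward pass is `D[i, j] = (a_i - b_j)' * Q * (a_i - b_j)` for the columns `a_i` of `A` and `b_j` of `B`, and that the `δB` in the comment above denotes the adjoint of the matrix argument, called `δQ` here. The names `pairwise_sqmahal`, `pairwise_pullback`, and `fd_gradient` are hypothetical and are not taken from the PR or from Distances.jl.

```julia
using LinearAlgebra

# Forward pass: D[i, j] = (a_i - b_j)' * Q * (a_i - b_j), columns are points.
function pairwise_sqmahal(Q, A, B)
    n, m = size(A, 2), size(B, 2)
    return [(A[:, i] - B[:, j])' * Q * (A[:, i] - B[:, j]) for i in 1:n, j in 1:m]
end

# Pullback for a cotangent Δ of size n × m, with Δ entering linearly:
#   δQ = sum_{i,j} Δ[i,j] * (a_i - b_j) * (a_i - b_j)'
#   δA[:, i] =  (Q + Q') * sum_j Δ[i,j] * (a_i - b_j)
#   δB[:, j] = -(Q + Q') * sum_i Δ[i,j] * (a_i - b_j)
function pairwise_pullback(Q, A, B, Δ)
    r = vec(sum(Δ; dims=2))   # row sums, length n
    c = vec(sum(Δ; dims=1))   # column sums, length m
    S = Q + Q'
    δQ = A * Diagonal(r) * A' - A * Δ * B' - B * Δ' * A' + B * Diagonal(c) * B'
    δA = S * (A * Diagonal(r) - B * Δ')
    δB = -S * (A * Δ - B * Diagonal(c))
    return δQ, δA, δB
end

# Central finite differences, one entry of X at a time (hypothetical helper).
function fd_gradient(f, X; h=1e-5)
    G = zeros(size(X))
    for k in eachindex(X)
        E = zeros(size(X))
        E[k] = h
        G[k] = (f(X + E) - f(X - E)) / (2h)
    end
    return G
end

Q = [2.0 0.5; 0.1 1.0]
A = [1.0 2.0 3.0; 0.5 -1.0 2.0]    # three points in R^2
B = [0.0 1.0; 1.0 -2.0]            # two points in R^2
Δ = [1.0 2.0; -1.0 0.5; 0.3 0.7]   # cotangent of the 3 × 2 distance matrix

δQ, δA, δB = pairwise_pullback(Q, A, B, Δ)
fdQ = fd_gradient(M -> sum(Δ .* pairwise_sqmahal(M, A, B)), Q)
fdA = fd_gradient(X -> sum(Δ .* pairwise_sqmahal(Q, X, B)), A)
fdB = fd_gradient(X -> sum(Δ .* pairwise_sqmahal(Q, A, X)), B)
println((maximum(abs, δQ - fdQ), maximum(abs, δA - fdA), maximum(abs, δB - fdB)))
```

In this sketch the version that is linear in `Δ` agrees with the finite-difference gradients, while passing `Δ .^ 2` instead does not, which is consistent with the single-pair case.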
Thanks for pointing this out. I think a separate PR for the Mahalanobis fixes makes more sense.