-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Matrix Stats Aggregation module #18300
Conversation
f3b1159
to
bce9bbf
Compare
This PR has been updated to include MultiValuesSource classes (per comment in #18285) @colings86, @jpountz could one (or both) of you have a look if you get a chance? |
@@ -263,6 +265,7 @@ public final void readFrom(StreamInput in) throws IOException { | |||
public static final String FROM_AS_STRING = "from_as_string"; | |||
public static final String TO = "to"; | |||
public static final String TO_AS_STRING = "to_as_string"; | |||
public static final ParseField VALUE_TYPE = new ParseField("value_type"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should actually be in ValuesSourceAggregatorBuilder since this CommonFields
interface is for the response side of the aggregations but value_type
is for the request side, so it doesn't really belong here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so is InternalAggregation
specifically for the response side? I'm not as intimate with the aggs framework so I thought it was just an internal framework class (request / response agnostic).
So then what do you think about having a CommonFields
static class to AggregatorBuilder
for common fields used on the request side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to having CommonFields
class in AggregatorBuilder
@nknize I left some comments on the aggregation type stuff but I don't know enough about the actual statistics we are calculating here to review those bits. Maybe @brwe or @polyfractal would be able to review those bits? |
|
||
public class MatrixAggregationPlugin extends Plugin { | ||
static { | ||
InternalMatrixStats.registerStreams(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we rather do it in onModule? cc @colings86
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I tried, and largely failed, to grok the stats calculations. My math skills are pretty rusty and the streaming formulas look different from what I'm used to. I'd try to get a second opinion from @brwe :) |
89d8c39
to
2fa3320
Compare
Great suggestions! I incorporated all feedback and the PR is ready for another review. Here are the main changes:
/cc @colings86, @jpountz |
/** kurtosis values (fourth moment) */ | ||
protected HashMap<String, Double> kurtosis; | ||
/** covariance values */ | ||
protected HashMap<String, HashMap<String, Double>> covariances; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make this class more efficient, we might want to give an ordinal to every field and then use field ordinals as keys rather than field names? (I don't mind it being done as a follow-up, I'm just suggesting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for followup PR. I'll open an issue
Thanks for the latest round of review @jpountz. I went ahead and made the minor code changes, added support for missing values, and updated documentation. Let me know what you think. |
2a811b5
to
f33b04e
Compare
} | ||
-------------------------------------------------- | ||
|
||
The aggregation type is `matrix_stats` and the `field` setting defines the set of fields (as an array) for computing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/field/fields/
I'd like @colings86 to have a final look, at least for the core refactorings that you did. But otherwise this looks good to me for a first iteration! |
@@ -76,6 +76,21 @@ public boolean valid() { | |||
return this; | |||
} | |||
|
|||
public ValuesSourceConfig<VS> format(final DocValueFormat format) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make all the fields of this class private now we have setters/getters for them so there is only one way of accessing/modifying the values? (maybe in a followup PR?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and did it in this PR but will keep it as a separate commit when I merge.
@jpountz @nknize I took a look at the core changes and they LGTM. I also took a brief look at the module, there are unit and REST tests but we are missing integration tests for this. Is there a reason we have not done an integration tests for this aggregation? All other aggregations have integration tests. |
@colings86 I had asked @rjernst about integration testing in modules. I could be wrong but I think to avoid the overhead of starting/stopping integration test clusters in plugins and modules the |
f33b04e
to
2f66665
Compare
REST tests are real integration tests. No need for fantasy land tests. |
b85644a
to
1706782
Compare
This commit adds a new aggs-matrix-stats module. The module presents a new class of aggregations called Matrix Aggregations. Matrix aggregations work on multiple fields and produce a matrix as output. The first matrix aggregation provided by the module is matrix_stats aggregation. This aggregation computes the following statistics over a set of fields: * Covariance * Correlation For completeness (and interpretation purposes) the following per-field statistics are also provided: * sample count * population mean * population variance * population skewness * population kurtosis
1706782
to
54575e5
Compare
This PR adds a new aggs-matrix-stats module. The module introduces a new class of aggregations called
Matrix Aggregations
. Matrix aggregations work on multiple fields and produce a matrix as output. The matrix aggregation provided by this module is thematrix_stats
aggregation which computes the following descriptive statistics over a set of fields:For completeness (and interpretation purposes) the following per-field statistics are also provided:
Example request:
Example response:
This PR replaces #16826. The differences include:
multifield_stats
tomatrix_stats