New Matrix Stats Aggregation module #18300

Merged
merged 3 commits into from
Jun 1, 2016

Contributor

@nknize nknize commented May 12, 2016

This PR adds a new aggs-matrix-stats module. The module introduces a new class of aggregations called Matrix Aggregations. Matrix aggregations work on multiple fields and produce a matrix as output. The matrix aggregation provided by this module is the matrix_stats aggregation, which computes the following descriptive statistics over a set of fields:

  • Covariance
  • Correlation

For completeness (and interpretation purposes) the following per-field statistics are also provided:

  • sample count
  • population mean
  • population variance
  • population skewness
  • population kurtosis
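
For readers who want the statistics listed above spelled out, here is a minimal two-pass sketch of the textbook population definitions. It is not the module's code: the class and method names are hypothetical, and dividing by n (rather than n - 1) is an assumption about normalisation that the PR description does not state.

```java
/** Textbook two-pass descriptive statistics; an illustration, not the module's implementation. */
final class DescriptiveStatsSketch {
    private DescriptiveStatsSketch() {}

    static double mean(double[] v) {
        double sum = 0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    /** k-th central moment, normalised by n (population form, an assumption). */
    static double centralMoment(double[] v, int k) {
        double m = mean(v);
        double sum = 0;
        for (double d : v) {
            sum += Math.pow(d - m, k);
        }
        return sum / v.length;
    }

    static double variance(double[] v) {
        return centralMoment(v, 2);
    }

    /** Third standardised moment. */
    static double skewness(double[] v) {
        return centralMoment(v, 3) / Math.pow(centralMoment(v, 2), 1.5);
    }

    /** Fourth standardised moment (not excess kurtosis). */
    static double kurtosis(double[] v) {
        double m2 = centralMoment(v, 2);
        return centralMoment(v, 4) / (m2 * m2);
    }

    static double covariance(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - mx) * (y[i] - my);
        }
        return sum / x.length;
    }

    static double correlation(double[] x, double[] y) {
        return covariance(x, y) / Math.sqrt(variance(x) * variance(y));
    }
}
```

The covariance of a field with itself equals its variance, and its correlation with itself is 1, which is why the diagonal of each output matrix repeats the per-field values.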

Example request:

"aggs": {
    "matrixstats": {
        "matrix_stats": {
            "field": ["poverty", "income"]
        }
    }
}

Example response:

{
    "aggregations": {
        "matrixstats": {
            "fields": [{
                "name": "income",
                "count": 50,
                "mean": 51985.1,
                "variance": 7.383377037755103E7,
                "skewness": 0.5595114003506483,
                "kurtosis": 2.5692365287787124,
                "covariance": {
                    "income": 7.383377037755103E7,
                    "poverty": -21093.65836734694
                },
                "correlation": {
                    "income": 1.0,
                    "poverty": -0.8352655256272504
                }
            }, {
                "name": "poverty",
                "count": 50,
                "mean": 12.732000000000001,
                "variance": 8.637730612244896,
                "skewness": 0.4516049811903419,
                "kurtosis": 2.8615929677997767,
                "covariance": {
                    "income": -21093.65836734694,
                    "poverty": 8.637730612244896
                },
                "correlation": {
                    "income": -0.8352655256272504,
                    "poverty": 1.0
                }
            }]
        }
    }
}
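
As a sanity check on how to read the response above, the correlation entry can be recomputed from the covariance and the two variances using the standard identity corr(x, y) = cov(x, y) / sqrt(var(x) * var(y)). The class name below is hypothetical, and the sketch only uses numbers copied from the example response.

```java
/** Recomputes the income/poverty correlation from the example response values. */
final class CorrelationCheck {
    private CorrelationCheck() {}

    static double correlation(double covariance, double varianceX, double varianceY) {
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        double covIncomePoverty = -21093.65836734694;
        double varIncome = 7.383377037755103E7;
        double varPoverty = 8.637730612244896;
        // Prints a value close to the reported correlation of about -0.835.
        System.out.println(correlation(covIncomePoverty, varIncome, varPoverty));
    }
}
```

The identity holds regardless of whether the variances and covariance are normalised by n or by n - 1, as long as both use the same convention.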

This PR replaces #16826. The differences include:

  • renames aggregation from multifield_stats to matrix_stats
  • adds aggregation code as a module instead of directly to core
  • refactors Integration Tests as REST Tests

@nknize
Contributor Author

nknize commented May 13, 2016

This PR has been updated to include MultiValuesSource classes (per comment in #18285).

@colings86, @jpountz could one (or both) of you have a look if you get a chance?

@@ -263,6 +265,7 @@ public final void readFrom(StreamInput in) throws IOException {
public static final String FROM_AS_STRING = "from_as_string";
public static final String TO = "to";
public static final String TO_AS_STRING = "to_as_string";
public static final ParseField VALUE_TYPE = new ParseField("value_type");
Contributor

I think this should actually be in ValuesSourceAggregatorBuilder since this CommonFields interface is for the response side of the aggregations but value_type is for the request side, so it doesn't really belong here

Contributor Author

Ah, so is InternalAggregation specifically for the response side? I'm not that familiar with the aggs framework, so I thought it was just an internal framework class (request / response agnostic).

So then what do you think about adding a CommonFields static class to AggregatorBuilder for common fields used on the request side?

Contributor

+1 to having a CommonFields class in AggregatorBuilder

@colings86
Contributor

@nknize I left some comments on the aggregation type stuff but I don't know enough about the actual statistics we are calculating here to review those bits. Maybe @brwe or @polyfractal would be able to review those bits?


public class MatrixAggregationPlugin extends Plugin {
    static {
        InternalMatrixStats.registerStreams();
Contributor

Should we do this in onModule instead? cc @colings86

Contributor

+1

@polyfractal
Contributor

I tried, and largely failed, to grok the stats calculations. My math skills are pretty rusty and the streaming formulas look different from what I'm used to. I'd try to get a second opinion from @brwe :)

@nknize nknize force-pushed the feature/MatrixStats-Module branch from 89d8c39 to 2fa3320 Compare May 23, 2016 18:48
@nknize
Contributor Author

nknize commented May 23, 2016

Great suggestions! I incorporated all feedback and the PR is ready for another review. Here are the main changes:

  • Added MultiValueMode support for handling multi-valued fields (it was easier to add it here than to postpone it to a later PR)
  • Added a MultiValuesSource class to encapsulate field names and ValuesSource arrays; it is now used to retrieve document field values instead of creating a Map inside collect
  • Updated doReduce to merge all shard results into an empty RunningStats object
  • Added REST tests for MultiValueMode support

/cc @colings86, @jpountz

/** kurtosis values (fourth moment) */
protected HashMap<String, Double> kurtosis;
/** covariance values */
protected HashMap<String, HashMap<String, Double>> covariances;
Contributor

to make this class more efficient, we might want to give an ordinal to every field and then use field ordinals as keys rather than field names? (I don't mind it being done as a follow-up, I'm just suggesting)
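
One way the ordinal idea could look, purely as a sketch: each field name is mapped to an integer once, and the symmetric covariance matrix is stored in a packed triangular double array indexed by ordinal pairs, so the hot path avoids hashing strings. The class and method names are hypothetical and are not part of this PR or any follow-up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Symmetric matrix keyed by field ordinals instead of field names; an illustration only. */
final class OrdinalCovarianceSketch {
    private final Map<String, Integer> ordinals = new LinkedHashMap<>();
    private final int size;
    private final double[] values; // packed upper triangle, diagonal included

    OrdinalCovarianceSketch(String... fieldNames) {
        for (String name : fieldNames) {
            ordinals.putIfAbsent(name, ordinals.size());
        }
        size = ordinals.size();
        values = new double[size * (size + 1) / 2];
    }

    /** Resolves a field name to its ordinal; done once per field, not once per document. */
    int ordinal(String fieldName) {
        Integer ord = ordinals.get(fieldName);
        if (ord == null) {
            throw new IllegalArgumentException("unknown field [" + fieldName + "]");
        }
        return ord;
    }

    private int index(int i, int j) {
        int lo = Math.min(i, j);
        int hi = Math.max(i, j);
        // Row lo starts after the lo previous rows, each one element shorter than the last.
        return lo * size - lo * (lo - 1) / 2 + (hi - lo);
    }

    void set(int i, int j, double value) {
        values[index(i, j)] = value;
    }

    double get(int i, int j) {
        return values[index(i, j)];
    }
}
```

Because (i, j) and (j, i) map to the same slot, the symmetric entry is stored once rather than twice as in a nested map.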

Contributor Author

+1 for a follow-up PR. I'll open an issue.

@nknize
Contributor Author

nknize commented May 26, 2016

Thanks for the latest round of review @jpountz. I went ahead and made the minor code changes, added support for missing values, and updated documentation. Let me know what you think.

@nknize nknize force-pushed the feature/MatrixStats-Module branch from 2a811b5 to f33b04e Compare May 27, 2016 18:58
}
--------------------------------------------------

The aggregation type is `matrix_stats` and the `field` setting defines the set of fields (as an array) for computing
Contributor

s/field/fields/

@jpountz
Contributor

jpountz commented May 31, 2016

I'd like @colings86 to have a final look, at least for the core refactorings that you did. But otherwise this looks good to me for a first iteration!

@@ -76,6 +76,21 @@ public boolean valid() {
return this;
}

public ValuesSourceConfig<VS> format(final DocValueFormat format) {
Contributor

Should we make all the fields of this class private now that we have setters/getters for them, so there is only one way of accessing/modifying the values? (maybe in a follow-up PR?)

Contributor Author

I went ahead and did it in this PR but will keep it as a separate commit when I merge.

@colings86
Contributor

@jpountz @nknize I took a look at the core changes and they LGTM. I also took a brief look at the module, there are unit and REST tests but we are missing integration tests for this. Is there a reason we have not written integration tests for this aggregation? All other aggregations have integration tests.

@nknize
Contributor Author

nknize commented May 31, 2016

@colings86 I asked @rjernst about integration testing in modules. I could be wrong, but I think the ESIntegTestCase tests should be replaced by simple unit and REST tests, to avoid the overhead of starting and stopping integration test clusters in plugins and modules. It was a bit tricky, but I refactored the integration tests (defined by AbstractNumericTestCase) from the original PR (#16826) into the YAML REST definitions included in this PR. I also split out testing of RunningStats and MatrixStatsResults into a separate RunningStatsTests unit test class to ensure the RunningStats (and corresponding results) are computed as expected.

@nknize nknize force-pushed the feature/MatrixStats-Module branch from f33b04e to 2f66665 Compare May 31, 2016 16:26
@rjernst
Member

rjernst commented Jun 1, 2016

> there are unit and REST tests but we are missing integration tests for this

REST tests are real integration tests. No need for fantasy land tests.

@nknize nknize force-pushed the feature/MatrixStats-Module branch 4 times, most recently from b85644a to 1706782 Compare June 1, 2016 21:38
@nknize nknize removed the review label Jun 1, 2016
nknize added 3 commits June 1, 2016 16:39
This commit adds a new aggs-matrix-stats module. The module presents a new class of aggregations called Matrix Aggregations. Matrix aggregations work on multiple fields and produce a matrix as output. The first matrix aggregation provided by the module is matrix_stats aggregation. This aggregation computes the following statistics over a set of fields:

* Covariance
* Correlation

For completeness (and interpretation purposes) the following per-field statistics are also provided:

* sample count
* population mean
* population variance
* population skewness
* population kurtosis