[ML] Bad values for the variance scale #24

Closed
tveasey opened this issue Mar 26, 2018 · 2 comments · Fixed by #150

Comments

@tveasey
Contributor

tveasey commented Mar 26, 2018

A user data set has exposed two problems with the variance scale calculation in version 6.2.2 of the analytics:

  1. it is sometimes negative(!),
  2. it is sometimes infinite.

In particular, we are seeing the following error messages logged:
Error calculating joint distribution: Bad variance scale -5.75
Error calculating joint distribution: Bad variance scale inf
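
Both logged values fail the basic requirement that a variance scale, as a multiplier applied to a variance, must be finite and strictly positive. A minimal Python sketch of such a check follows. The function name `is_valid_variance_scale` is hypothetical and this is not the ml-cpp code, only an illustration of why -5.75 and inf are rejected.

```python
import math


def is_valid_variance_scale(scale: float) -> bool:
    """Return True only for finite, strictly positive scales (hypothetical helper)."""
    return math.isfinite(scale) and scale > 0.0


# The two values from the log messages, plus a well-formed scale for comparison.
for value in (1.0, -5.75, float("inf")):
    print(value, is_valid_variance_scale(value))
```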

There is no prospect of getting hold of the data set; however, the data characteristics sound benign. There were two detectors:

  • detector high_mean(x) over y influencers y,z
  • detector high_median(x) over y influencers y,z

For x we have min: 0, max: 4.34571, avg: 2.0736 and cardinality of y is 430.
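
For anyone trying to reproduce this with synthetic data, the two detectors could be written roughly as the `analysis_config` of an Elasticsearch anomaly detection job, shown here as a Python dictionary. This is a sketch under assumptions: the issue does not say whether both detectors lived in one job, and the `bucket_span` value is a placeholder because the real one is not given. The field names `x`, `y` and `z` are the anonymised names used above.

```python
import json

# Sketch of a job analysis_config matching the detectors described above.
analysis_config = {
    "bucket_span": "15m",  # assumption: the actual bucket span is not stated in the issue
    "detectors": [
        {"function": "high_mean", "field_name": "x", "over_field_name": "y"},
        {"function": "high_median", "field_name": "x", "over_field_name": "y"},
    ],
    # Influencers are configured at the job level and apply to both detectors.
    "influencers": ["y", "z"],
}

print(json.dumps(analysis_config, indent=2))
```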

This issue is to investigate routes by which this problem could occur. The initial areas to investigate are CTimeSeriesDecomposition::scale and the calculation of the count variance scale, particularly for influencers.

cc @LucaWintergerst.

@LucaWintergerst

The errors do not occur if influencer z is removed.
I also replaced z (which was the hostname) with z_ip, and the same errors occurred.

@sophiec20 sophiec20 changed the title from "Bad values for the variance scale" to "[ML] Bad values for the variance scale" Mar 28, 2018
@hendrikmuhs hendrikmuhs self-assigned this Jun 27, 2018
@hendrikmuhs

Update:

The root cause has been identified: the code that counts influencer occurrences per bucket has a bug. The fix is simple (one line of code), but the change affects results:

[image: screenshot_20180706_160457]

Hopefully for the better; I will analyze the diff to be sure.

I followed the code history back to version 5.5, so this is not a recent regression; it was likely introduced by PR 144 in the old repo.

As said, the fix is simple, but I plan to take some more time for related code improvements and test cases.
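
To illustrate why a wrong per-bucket count matters for a variance scale (the fixing commit message states that the count had always been set to 1), here is a small Python sketch. It relies only on the standard result that the variance of the mean of n independent samples is the sample variance divided by n. The function name and the numbers are hypothetical, and this reproduces neither the ml-cpp calculation nor the route to the negative or infinite values.

```python
def variance_of_bucket_mean(sample_variance: float, count: int) -> float:
    """Variance of the mean of `count` independent samples (hypothetical helper)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return sample_variance / count


sample_variance = 2.0  # arbitrary illustrative value
true_count = 8         # occurrences actually seen in the bucket (made up)
recorded_count = 1     # what a count stuck at 1 would record

correct = variance_of_bucket_mean(sample_variance, true_count)
buggy = variance_of_bucket_mean(sample_variance, recorded_count)

# With the count stuck at 1, the variance is overstated by a factor of true_count.
print(buggy / correct)  # 8.0
```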

hendrikmuhs pushed a commit that referenced this issue Jul 11, 2018
Fix counting of influencer per bucket for metric population analyses, prior this fix the count has always 
been set to 1.

Fixes #24
hendrikmuhs pushed a commit to hendrikmuhs/ml-cpp that referenced this issue Jul 11, 2018
hendrikmuhs pushed a commit that referenced this issue Jul 12, 2018