Could you tell me why I replaced the value with 1? #176

Closed
Youngsoo77 opened this issue Dec 5, 2016 · 10 comments

Comments

@Youngsoo77

gitmostwanted/tasks/repo_status.py:43

1 if variance(chunk) >= 1000 

I cannot understand this substitution.

Does it mean more activity?

Thank you for your kindness.

youngsoo

@kkamkou
Owner

kkamkou commented Dec 5, 2016

It is a primitive noise cleanup. Normally, the variance is in the range > 0 and < 1. Sometimes there is a burst of activity (marketing advertisements, "oh! I found a nice library" articles, and so forth), and we need to clean it up somehow. The way I see it, we just replace it with the normal maximum value, which is 1. If you have any ideas on how we can improve this logic, you're welcome to share them.

More logic is here

@kkamkou kkamkou closed this as completed Dec 5, 2016
@Youngsoo77
Author

Youngsoo77 commented Dec 5, 2016 via email

@kkamkou
Owner

kkamkou commented Dec 5, 2016

You can run this query with any repository for 28 days. Then, just call variance.

@Youngsoo77
Author

Youngsoo77 commented Dec 6, 2016 via email

@kkamkou
Owner

kkamkou commented Dec 6, 2016

Yes, if the variance is abnormal, we replace it with 1, and keep it as is otherwise.

@Youngsoo77
Author

Youngsoo77 commented Dec 6, 2016 via email

@kkamkou
Owner

kkamkou commented Dec 6, 2016

Let's assume we have a repo. Every day it gets some number of stars or forks. For example:
1: 3, 2: 5, 3: 0, 4: 9999, 5: 87, 6: 15, 7: 4, ..., 28: 7

Next, we split the 28 days into chunks of 7: [3, 5, 0, 9999, 87, 15, 4], [...], [...], [...], and check the variance of each chunk. If the variance is huge, we assume that the mean value of that chunk is equal to 1. At the end, we calculate the mean of all 4 chunk means, e.g. [1, 5, 8, 10]. It is primitive logic and should be improved.
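For readers following along, the procedure described above can be sketched roughly as follows. This is not the code from `repo_status.py`. The function names, the use of `statistics.variance` (sample variance), and the assumption that a non-noisy chunk contributes its own mean are illustrative guesses based on this thread. Only the `>= 1000` threshold and the replacement value `1` come from the quoted line.

```python
from statistics import mean, variance

NOISE_THRESHOLD = 1000  # threshold from the line quoted in this issue
CHUNK_SIZE = 7          # days per chunk, as described in the comment above


def chunk_means(daily_counts, size=CHUNK_SIZE, threshold=NOISE_THRESHOLD):
    """Hypothetical helper: one value per chunk of `size` days.

    A chunk whose variance reaches the threshold is treated as noise and
    contributes 1; otherwise it contributes its own mean (an assumption).
    Each chunk needs at least two values for `variance` to be defined.
    """
    chunks = [daily_counts[i:i + size] for i in range(0, len(daily_counts), size)]
    return [1 if variance(chunk) >= threshold else mean(chunk) for chunk in chunks]


def smoothed_mean(daily_counts):
    """Hypothetical helper: mean of the per-chunk values."""
    return mean(chunk_means(daily_counts))


# Made-up 28-day series: the first week contains a spike of 9999.
days = (
    [3, 5, 0, 9999, 87, 15, 4]
    + [4, 5, 6, 5, 4, 6, 5]
    + [8] * 7
    + [10] * 7
)
print(chunk_means(days))
print(smoothed_mean(days))
```

With this made-up series the spiky first week is flattened to 1, so the spike of 9999 no longer dominates the overall mean.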

@Youngsoo77
Author

Youngsoo77 commented Dec 6, 2016 via email

@kkamkou
Owner

kkamkou commented Dec 6, 2016

What was important for me was to clean up the noise. 1 is just not zero :) So there is huge room to improve the logic. Maybe we could use min(), but this is not the best idea, because it might be high as well.
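A small illustration of the min() concern, using made-up numbers: in a week where every day is busy, the chunk still trips the variance check, but its minimum stays large, so min() would not bring the value back into the usual range. `statistics.variance` is assumed here only for the sake of the example.

```python
from statistics import variance

# Hypothetical week for a popular repository with one spike on day 3.
busy_week = [1200, 1500, 9999, 1300, 1100, 1400, 1250]

is_noisy = variance(busy_week) >= 1000  # same threshold as the quoted line
replacement = min(busy_week)            # the min() idea from the comment above

print(is_noisy, replacement)
```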

@Youngsoo77
Author

Thank you for your reply.
I will try hard to improve that logic.

Have a good day.

youngsoo
