Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No feedback if iteration does not reach exact result in utils.sampling.sample_rows #178

Closed
moritz-kirmse-forcity opened this issue Nov 10, 2016 · 14 comments

Comments

@moritz-kirmse-forcity
Copy link

Function utils.sampling.sample rows includes a loop which shall converge to an exact pick of samples.
In some cases, this loop cannot converge.
Example:

import pandas as pd
import urbansim.utils.sampling
df = pd.DataFrame({'tot': [2,2,2]})
sample = urbansim.utils.sampling.sample_rows(total=3, data=df, replace=False, accounting_columns='tot')
sample.sum()  # = 2 and not = 3 as expected

Even if an exact pick is possible, the random permutation l.51 causes a further problem. If the values needed for an exact pick are at the head of the randomized list sample_idx, they are not available in the loop to adjust for an inexact pick, the loop cannot converge.
This aspect depends on the random seed, so it can make integration tests fail randomly.

Suggested solution:
there should be some feedback from the sample_rows function whenever convergence to an exact pick is not achieved. The caller may use this information if he wishes. In this way a more robust test suite can be obtained.

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 10, 2016

Hi, this is not a part of the code I am familiar with. Can you link to the loop?

As always just starting a discussion, and willing to help implement any fix.

@moritz-kirmse-forcity
Copy link
Author

the loop is at sampling.py:59

what sampling does is:

  • reorder the dataset (random permutation)
  • compute mean weight (accounting_column) of each element
  • pick the n first elements according to mean weight
  • run a loop to adjust the sample which can:
    • remove several elements from the end of the sample if the total is too high
    • add some elements which have not been picked yet if total is too low

however, the permutation of the list is only done once.
Imagine the following case:

  • dataset = [1, 5, 1, 5, 1, 5]
  • target sum value = 10 (possible exact solution : [5, 5]
  • the random permutation puts all elements with a small value/weight at the top of the list.
    • for example [1, 1, 1, 5, 5, 5]
    • problem: sum(1, 1, 1, 5) = 8 is too small and sum(1, 1, 1, 5, 5) = 13 is too big
  • In order to reach the exact target value a small adjustment is needed
  • but all remaining elements at the bottom of the list are too big to reach the target

==> in this case the loop cannot converge

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 10, 2016

So, As used in the code, What is the intended behavior of sampling from the distribution [1, 5, 1, 5, 1, 5] with a target of 10? What shud 20 draws from that distribution look like?

What is the intended behavior of sampling from the distribution [1, 5, 1, 5, 1, 5] with a target of 2? What shud 20 draws from that distribution look like?

@moritz-kirmse-forcity
Copy link
Author

Well, sorry for being less than clear.
the function sample_rows offers two modes: if accounting_column == None, n rows shall be picked. If accounting_column != None, the function shall pick as many rows as needed so that the sum of values in the column accounting_column gets as close as possible to the target value.
For the example if we pick [5, 5] the sum is exactly 10
However if multiple draws are not allowed (replace=False) it is not possible to achieve a total of 20 (since the sum is 18) and we raise this exception

@bridwell
Copy link
Contributor

I actually wrote this method. Providing notification if the total is not met is a good idea. Something I just never implemented.

Per the removal not converging, I think this issue is in lines 83- 84:

 if not replace:
                np.append(sample_idx, curr_rows.index.values)

The intent here was to add the removed items back to the end of the list, to avoid this. But it was implemented improperly, this should probably be:

 if not replace:
               sample_idx = np.append(sample_idx, curr_rows.index.values)

Sorry, I can test this a little further to see if this fixes the problem.

Having said that, I could still see cases where the total cannot be matched exactly, so it makes sense to provide feedback as you suggest.

@moritz-kirmse-forcity
Copy link
Author

hello,
thanks for your feedback.
I agree with you, there are cases where the total cannot be met (ex: only even weights, but target is an odd number)

I just stumbled upon the problem while updating tests and in some bad luck random cases, the test would pass while in others, it would fail.

Most of the time, the model will not care if the exact value is reached or not. However the behavior becomes unpredictable for automatic tests. That's why a notification could be useful when the target is approximated.

I'll be glad to further discuss the topic if necessary.

@bridwell
Copy link
Contributor

Per the testing issue, I've been wrapping my calls inside a function that 1st sets up the environment:

https://github.com/AZMAG/smartpy_core/blob/master/smartpy_core/sampling.py#L74

So instead of

sample_rows(20, my_df, replace=False, accounting_column='my_col')

Your call would then be:

seed = 123
func = sample_rows
seeded_call(seed, func, my_df, replace=False, accounting_col='my_col')

There's perhaps a better way to do this, but I've found that it provides consistent test results without too much overhead. Using pyest fixtures are another way to handle this, this how the tests for the samle_rows method were written:

https://github.com/UDST/urbansim/blob/master/urbansim/utils/tests/test_sampling.py#L8

The non-converging issue you describe is still a concern, I'll look into it.

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 10, 2016

So in use, a sample that does not exactly add up is correct, but for testing it is annoying?

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 10, 2016

For the case when replace=False I think the loop can be removed. the code looks like

sample = data.loc[np.random.permutation(data.index.values)]
csum = sample[accounting_column].values.cumsum()
index = np.searchsorted(csum, total, 'right')
return sample[:index].copy()

Untested as there isn't test code for this yet, as per this discussion.

@bridwell
Copy link
Contributor

I think you're right. That's a lot cleaner. But maybe after doing the search sorted, sample a household with size equal to the overage (assuming one exists)? That way the total could be hit exactly.

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 10, 2016

Such a fix over samples the small elements. That may be the intended behavior, but I'd find it surprising givan the current dock string.

@bridwell
Copy link
Contributor

Good point. I have a new version that does the cumsum and then iterates over each remaining element until the total is met. I'll put in a PR after I do some more testing.

@bridwell
Copy link
Contributor

See the above pull request. Let me know if you see any issues with these changes or if this doesn't adequately address the issue.

@janowicz
Copy link
Contributor

janowicz commented Dec 6, 2016

Closing this issue for now as #180 is merged.

@janowicz janowicz closed this as completed Dec 6, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants