No feedback if iteration does not reach exact result in utils.sampling.sample_rows #178

moritz-kirmse-forcity · 2016-11-10T14:16:16Z

Function utils.sampling.sample rows includes a loop which shall converge to an exact pick of samples.
In some cases, this loop cannot converge.
Example:

import pandas as pd
import urbansim.utils.sampling
df = pd.DataFrame({'tot': [2,2,2]})
sample = urbansim.utils.sampling.sample_rows(total=3, data=df, replace=False, accounting_columns='tot')
sample.sum()  # = 2 and not = 3 as expected

Even if an exact pick is possible, the random permutation l.51 causes a further problem. If the values needed for an exact pick are at the head of the randomized list sample_idx, they are not available in the loop to adjust for an inexact pick, the loop cannot converge.
This aspect depends on the random seed, so it can make integration tests fail randomly.

Suggested solution:
there should be some feedback from the sample_rows function whenever convergence to an exact pick is not achieved. The caller may use this information if he wishes. In this way a more robust test suite can be obtained.

The text was updated successfully, but these errors were encountered:

Eh2406 · 2016-11-10T15:14:17Z

Hi, this is not a part of the code I am familiar with. Can you link to the loop?

As always just starting a discussion, and willing to help implement any fix.

moritz-kirmse-forcity · 2016-11-10T15:29:31Z

the loop is at sampling.py:59

what sampling does is:

reorder the dataset (random permutation)
compute mean weight (accounting_column) of each element
pick the n first elements according to mean weight
run a loop to adjust the sample which can:
- remove several elements from the end of the sample if the total is too high
- add some elements which have not been picked yet if total is too low

however, the permutation of the list is only done once.
Imagine the following case:

dataset = [1, 5, 1, 5, 1, 5]
target sum value = 10 (possible exact solution : [5, 5]
the random permutation puts all elements with a small value/weight at the top of the list.
- for example [1, 1, 1, 5, 5, 5]
- problem: sum(1, 1, 1, 5) = 8 is too small and sum(1, 1, 1, 5, 5) = 13 is too big
In order to reach the exact target value a small adjustment is needed
but all remaining elements at the bottom of the list are too big to reach the target

==> in this case the loop cannot converge

Eh2406 · 2016-11-10T16:06:13Z

So, As used in the code, What is the intended behavior of sampling from the distribution [1, 5, 1, 5, 1, 5] with a target of 10? What shud 20 draws from that distribution look like?

What is the intended behavior of sampling from the distribution [1, 5, 1, 5, 1, 5] with a target of 2? What shud 20 draws from that distribution look like?

moritz-kirmse-forcity · 2016-11-10T16:30:07Z

Well, sorry for being less than clear.
the function sample_rows offers two modes: if accounting_column == None, n rows shall be picked. If accounting_column != None, the function shall pick as many rows as needed so that the sum of values in the column accounting_column gets as close as possible to the target value.
For the example if we pick [5, 5] the sum is exactly 10
However if multiple draws are not allowed (replace=False) it is not possible to achieve a total of 20 (since the sum is 18) and we raise this exception

bridwell · 2016-11-10T16:34:53Z

I actually wrote this method. Providing notification if the total is not met is a good idea. Something I just never implemented.

Per the removal not converging, I think this issue is in lines 83- 84:

 if not replace:
                np.append(sample_idx, curr_rows.index.values)

The intent here was to add the removed items back to the end of the list, to avoid this. But it was implemented improperly, this should probably be:

 if not replace:
               sample_idx = np.append(sample_idx, curr_rows.index.values)

Sorry, I can test this a little further to see if this fixes the problem.

Having said that, I could still see cases where the total cannot be matched exactly, so it makes sense to provide feedback as you suggest.

moritz-kirmse-forcity · 2016-11-10T16:45:21Z

hello,
thanks for your feedback.
I agree with you, there are cases where the total cannot be met (ex: only even weights, but target is an odd number)

I just stumbled upon the problem while updating tests and in some bad luck random cases, the test would pass while in others, it would fail.

Most of the time, the model will not care if the exact value is reached or not. However the behavior becomes unpredictable for automatic tests. That's why a notification could be useful when the target is approximated.

I'll be glad to further discuss the topic if necessary.

bridwell · 2016-11-10T16:59:32Z

Per the testing issue, I've been wrapping my calls inside a function that 1st sets up the environment:

https://github.com/AZMAG/smartpy_core/blob/master/smartpy_core/sampling.py#L74

So instead of

sample_rows(20, my_df, replace=False, accounting_column='my_col')

Your call would then be:

seed = 123
func = sample_rows
seeded_call(seed, func, my_df, replace=False, accounting_col='my_col')

There's perhaps a better way to do this, but I've found that it provides consistent test results without too much overhead. Using pyest fixtures are another way to handle this, this how the tests for the samle_rows method were written:

https://github.com/UDST/urbansim/blob/master/urbansim/utils/tests/test_sampling.py#L8

The non-converging issue you describe is still a concern, I'll look into it.

Eh2406 · 2016-11-10T17:05:56Z

So in use, a sample that does not exactly add up is correct, but for testing it is annoying?

Eh2406 · 2016-11-10T17:47:11Z

For the case when replace=False I think the loop can be removed. the code looks like

sample = data.loc[np.random.permutation(data.index.values)]
csum = sample[accounting_column].values.cumsum()
index = np.searchsorted(csum, total, 'right')
return sample[:index].copy()

Untested as there isn't test code for this yet, as per this discussion.

bridwell · 2016-11-10T18:13:49Z

I think you're right. That's a lot cleaner. But maybe after doing the search sorted, sample a household with size equal to the overage (assuming one exists)? That way the total could be hit exactly.

Eh2406 · 2016-11-10T18:18:50Z

Such a fix over samples the small elements. That may be the intended behavior, but I'd find it surprising givan the current dock string.

bridwell · 2016-11-10T22:49:25Z

Good point. I have a new version that does the cumsum and then iterates over each remaining element until the total is met. I'll put in a PR after I do some more testing.

bridwell · 2016-11-14T18:55:13Z

See the above pull request. Let me know if you see any issues with these changes or if this doesn't adequately address the issue.

janowicz · 2016-12-06T21:29:04Z

Closing this issue for now as #180 is merged.

moritz-kirmse-forcity mentioned this issue Nov 10, 2016

20161109 bugfix in models.transition.remove_rows #177

Open

bridwell mentioned this issue Nov 14, 2016

Fix sample rows no replace #180

Merged

janowicz closed this as completed Dec 6, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No feedback if iteration does not reach exact result in utils.sampling.sample_rows #178

No feedback if iteration does not reach exact result in utils.sampling.sample_rows #178

moritz-kirmse-forcity commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

moritz-kirmse-forcity commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

moritz-kirmse-forcity commented Nov 10, 2016

bridwell commented Nov 10, 2016

moritz-kirmse-forcity commented Nov 10, 2016

bridwell commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

bridwell commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

bridwell commented Nov 10, 2016

bridwell commented Nov 14, 2016

janowicz commented Dec 6, 2016

No feedback if iteration does not reach exact result in utils.sampling.sample_rows #178

No feedback if iteration does not reach exact result in utils.sampling.sample_rows #178

Comments

moritz-kirmse-forcity commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

moritz-kirmse-forcity commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

moritz-kirmse-forcity commented Nov 10, 2016

bridwell commented Nov 10, 2016

moritz-kirmse-forcity commented Nov 10, 2016

bridwell commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

bridwell commented Nov 10, 2016

Eh2406 commented Nov 10, 2016

bridwell commented Nov 10, 2016

bridwell commented Nov 14, 2016

janowicz commented Dec 6, 2016