
Slow performance & redundant tests #16

Open · fscottfoti opened this issue Mar 27, 2017 · 10 comments

@fscottfoti commented Mar 27, 2017
First off, this library is very cool. Thanks for doing this, @smmaurer!

Has anyone else noticed that orca_test is pretty slow? Our simulation takes about 73 minutes, and when I added the UAL code it slowed down dramatically. At this point I've found the two causes of the problem, and the actual new code is fairly quick.

Basically, the orca_test code adds 25 minutes to the simulation, and that's with schemas only being verified in a few places.

My guess at the cause: I see merge_tables called with all columns in the code. Maybe it should be called with only the columns that are being verified in the specific orca_test spec. We have lots of computed columns, and it's well known that asking for all of them is an expensive operation.

If using all the columns is necessary, perhaps it's not necessary to use all the rows to verify the schemas? For verification purposes, I imagine we only need a few hundred rows from each table?

Barring all that, a simple on-off switch would seem essential so that it's not required to merge all the tables when not in debug mode...

@Eh2406 (Contributor) commented Mar 27, 2017

Note that the current algorithm does not check whether the assertions are redundant. As I recall, the assertions we experimented with were very redundant; we often had is_numeric called 3-7 times on many of the columns. We could have reduced this with better research into which assertions imply the others, but we ran out of time.

@smmaurer (Member)

Interesting! Thanks for digging into this.

I agree that we shouldn't be generating computed columns unless they're explicitly mentioned in the data spec. Can you tell what's triggering that? It doesn't look like merge_tables() is called directly by orca_test.

Here are some things that come to mind:

  1. If a table is a DataFrameWrapper, orca_test runs assert_table_can_be_generated() on it before testing any column characteristics. This is supposed to check that there are no errors in loading the core DataFrame (like missing data files, etc). If I'm accidentally causing DataFrames to be generated multiple times, this could be unexpectedly costly.

https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L250-L271

  2. As you said, orca_test does have to generate computed columns in order to verify the data spec. This shouldn't be too costly in aggregate, so long as the columns listed in the spec would all be generated eventually anyway. But maybe there's a bug or sequencing issue causing them to be generated multiple times? Here's the code that triggers evaluation of computed columns:

https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L320-L354
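For context, triggering evaluation of a computed column is essentially just requesting it, along these lines (table and column names are hypothetical):

    import orca

    # each request recomputes the column unless caching is enabled
    zone_id = orca.get_table('households').get_column('zone_id')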

I don't think the primary key / foreign key tests do any merging, just set algebra.

If we can figure out what the source of the inefficiency is, I'm happy to take a crack at fixing it. Adding 30% to execution time is definitely not ideal.

@fscottfoti (Author)

Sorry, not merge_tables; I misspoke. I think it's this to_frame call. Does that get called a lot? Some columns might not be cached, so they might get computed every time...

@smmaurer (Member)

Ah, got it.

That line gets called for columns that are indexes of the underlying DataFrame. It's an attempt to get around a limitation in orca where DataFrameWrapper.get_column() fails if you request an index by name. (Is that intended? I've been meaning to open an issue about it.)

Originally we were using pd.Series(t.index) to get the index of DataFrameWrapper t, but @Eh2406 pointed out that this doesn't work for MultiIndexes. So #13 replaced it with

t.to_frame([]).reset_index()[column_name]

The intention is to pass an empty list of columns to to_frame() so that only the indexes are returned, but maybe that's not supported. Here's the relevant orca code:

https://github.com/UDST/orca/blob/master/orca/orca.py#L193-L218

If that's the cause of the slowdown, it would explain why I didn't notice it with Bay Area UrbanSim last August. Any ideas about a more efficient way to pull MultiIndexes out of a DataFrameWrapper?

@smmaurer (Member)

I dug into this a bit more.

DataFrameWrapper.to_frame([]) returns the whole DataFrame, which is not what we want.

DataFrameWrapper.to_frame(['']) returns just the index(es), but without their names.

DataFrameWrapper.index will return either an Index or a MultiIndex, including names. So we'll stick with that, and just need some logic to distinguish between the single- and multi-index cases.
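For illustration, that logic could look something like this rough sketch (the helper name is mine, not the actual fix; t is a DataFrameWrapper):

    import pandas as pd

    def index_as_series(t, column_name):
        # t.index returns a plain Index or a MultiIndex, names included
        idx = t.index
        if isinstance(idx, pd.MultiIndex):
            # MultiIndex: recover the requested level by name
            return pd.Series(idx.get_level_values(column_name))
        # single index: there is only one level, so return it directly
        return pd.Series(idx, name=idx.name)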

Working on a fix!

@fscottfoti (Author)

Honestly, I try to avoid MultiIndexes, so no brilliant ideas here, but your explanation certainly makes sense. Hopefully that takes care of it!

@Eh2406 (Contributor) commented Mar 29, 2017

Thanks for finding this!

On a different note, we have performance issues because our spec looks like:

...
ColumnSpec('zone_id', registered=True, numeric=True, missing=False, min=1, max=2899)
...

orca_test processes the arguments in order:

  1. registered test
  2. numeric, but numeric implies registered so:
    1. registered test
    2. numeric test
  3. missing test
  4. min, but min implies numeric so:
    1. registered test
    2. numeric test
    3. min test
  5. max, but max implies numeric so:
    1. registered test
    2. numeric test
    3. max test

A little redundant, yes? But I did not have time to remove the redundant tests in our spec, nor do I see an easy way for orca_test to remove them automatically.

@smmaurer (Member)

(Update: per comments in PR #17, the to_frame() call does not seem to be the source of the slowdown after all.)

You're right about the redundancy, @Eh2406. Could you do a test of execution time with orca_test vs. without orca_test, to get a sense of the performance hit?

I'd expect most of the tests to be practically costless, unless there's a bad cache setting and columns are being regenerated. But maybe I'm wrong -- and the redundancy could also be annoying if it makes the debug output harder to read, or just on a conceptual level.

Right now the hierarchy is specified within the assertion functions. For example, assert_table_can_be_generated() calls assert_table_is_registered() as its first action. But if the redundancy is a problem, we could move that logic up to the function that parses a spec, and call each relevant assertion only once.
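For illustration, a rough sketch of how the spec parser could collapse the implications (the implication map below is illustrative, not orca_test's actual internals):

    # map each check to the checks it already implies
    IMPLIES = {
        'numeric': ['registered'],
        'missing': ['registered'],
        'min': ['numeric'],
        'max': ['numeric'],
    }

    def atomic_checks(requested):
        # expand requested checks into atomic ones, each listed
        # exactly once, with prerequisites first
        ordered = []
        def visit(check):
            for dep in IMPLIES.get(check, []):
                visit(dep)
            if check not in ordered:
                ordered.append(check)
        for check in requested:
            visit(check)
        return ordered

    # The zone_id spec above would go from 10 assertion calls to 5:
    # atomic_checks(['registered', 'numeric', 'missing', 'min', 'max'])
    # -> ['registered', 'numeric', 'missing', 'min', 'max']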

@Eh2406 (Contributor) commented Mar 30, 2017

We have never run orca_test during our model runs; it lives in its own workbook. As I recall, that workbook takes approximately 15 minutes to run. To be clear, that spec was generated from some prior art by @semcogli and an automatic scan of a functional model's data, so it is long, redundant, and not optimized.

I see three good outcomes for dealing with the redundancy:

  1. orca_test detects slow checks and suggests targeted ways to improve cache settings, e.g. "consider improving the cache settings on nearest_schools".
  2. orca_test detects redundancy in the specifications and suggests targeted ways to remove it, e.g. "you specified registered=True and numeric=True; this is the same as, but slower than, just numeric=True".
  3. orca_test detects the redundancy but just doesn't run the redundant checks. As you suggested, we could do that in the parsing code; or we could use some kind of memoization cache, so that redundant calls become no-ops. I'm sure there are other options.
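For what it's worth, the memoization idea could be a small decorator over the existing assertion functions; a sketch only (the cache would need to be cleared between spec runs):

    import functools

    _already_run = set()

    def run_once(assert_fn):
        # make repeated calls with the same arguments no-ops
        @functools.wraps(assert_fn)
        def wrapper(*args):
            key = (assert_fn.__name__,) + args
            if key not in _already_run:
                _already_run.add(key)
                assert_fn(*args)
        return wrapper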

@smmaurer changed the title from "performance issues?" to "Slow performance & redundant tests" on Mar 31, 2017
@smmaurer (Member)

Makes sense. Looking at the orca documentation, the default is for computed data not to be cached at all, which does seem like it would lend itself to situations where orca_test slows down performance by re-generating things.

https://udst.github.io/orca/core.html#caching

As a first step, let's add a test that warns users if a computed object has cache=False.
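For reference, opting into caching looks something like this (table and column names are hypothetical):

    import orca
    import pandas as pd

    @orca.table(cache=True)
    def households():
        # loaded once and cached, rather than re-read on every access
        return pd.read_csv('households.csv', index_col='household_id')

    @orca.column('households', cache=True, cache_scope='iteration')
    def income_quartile(households):
        # recomputed at most once per iteration instead of on every request
        return pd.qcut(households.income, 4, labels=False)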
