Slow performance & redundant tests #16
Note that the current algorithm does not check whether the assertions are redundant. As I recall, the assertions we experimented with were very redundant. We often had …
Interesting! Thanks for digging into this. I agree that we shouldn't be generating computed columns unless they're explicitly mentioned in the data spec. Can you tell what's triggering that? It doesn't look like … Here are some things that come to mind:
https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L250-L271
https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L320-L354

I don't think the primary key / foreign key tests do any merging, just set algebra. If we can figure out what the source of the inefficiency is, I'm happy to take a crack at fixing it. Adding 30% to execution time is definitely not ideal.
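For illustration, the kind of set-algebra check described there looks roughly like this; the table, column, and function names below are made up, and this is a sketch rather than the actual orca_test code:

```python
import orca
import numpy as np

# Sketch of a set-algebra foreign key check (illustrative names only):
# every value in the child column should appear in the parent table's index,
# so no merge of the two tables is needed.
def check_foreign_key(child_table, fk_column, parent_table):
    fk_values = orca.get_table(child_table).to_frame([fk_column])[fk_column]
    parent_ids = orca.get_table(parent_table).index
    unmatched = np.setdiff1d(fk_values.dropna().unique(), parent_ids.values)
    assert len(unmatched) == 0, "unmatched foreign keys: %s" % unmatched[:10]

# e.g. check_foreign_key('households', 'building_id', 'buildings')
```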
Sorry, not merge_tables, I misspoke. I think it's this to_frame call. Does that get called a lot? Some columns might not be cached, so they might get computed every time...
Ah, got it. That line gets called for columns that are indexes of the underlying DataFrame. It's an attempt to get around a limitation in orca where … Originally we were using …
The intention is to pass an empty list of columns to https://github.com/UDST/orca/blob/master/orca/orca.py#L193-L218

If that's the cause of the slowdown, it would explain why I didn't notice it with Bay Area UrbanSim last August. Any ideas about a more efficient way to pull MultiIndexes out of a DataFrameWrapper?
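One option, sketched below: DataFrameWrapper exposes the index directly, so the index (or MultiIndex) can be read without evaluating any computed columns. The table name here is just an example, and this is not necessarily how orca_test ended up handling it.

```python
import orca
import pandas as pd

# A sketch: read the (Multi)Index straight off the wrapper instead of
# calling to_frame(), which evaluates computed columns.
wrapper = orca.get_table('buildings')   # example table name

idx = wrapper.index                     # plain Index or MultiIndex
index_names = list(idx.names)           # e.g. ['parcel_id', 'building_id']

# If the index values are needed as columns, convert just the index:
index_df = pd.DataFrame(index=idx).reset_index()
```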
I dug into this a bit more.
Working on a fix!
Honestly, I try to avoid multi-indexes, so no brilliant ideas here, but your explanation certainly makes sense. Hopefully that takes care of it!
Thanks for finding this! On a different note, we have performance issues because our spec looks like:

    ....
    ColumnSpec('zone_id', registered=True, numeric=True, missing=False, min=1, max=2899)
    ...

It processes them in order. A little redundant, yes? But I did not have time to remove the redundant tests in our spec, nor do I see an easy way for OT to remove them automatically.
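For what it's worth, one rough way to collapse that kind of redundancy before the spec is handed to orca_test is sketched below. It assumes the constraints start out as plain (column name, keyword dict) pairs rather than already-built ColumnSpec objects, and the import path is a guess:

```python
from collections import OrderedDict
from orca_test.orca_test import ColumnSpec   # import path may differ

# Hypothetical raw spec with redundant entries for the same column;
# later entries override earlier ones, mirroring "it processes them in order".
raw_spec = [
    ('zone_id', dict(registered=True)),
    ('zone_id', dict(numeric=True, missing=False)),
    ('zone_id', dict(registered=True, min=1, max=2899)),   # 'registered' repeated
]

merged = OrderedDict()
for name, kwargs in raw_spec:
    merged.setdefault(name, {}).update(kwargs)

# One ColumnSpec per column, carrying the union of its constraints
column_specs = [ColumnSpec(name, **kwargs) for name, kwargs in merged.items()]
```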
(Update: per comments in PR #17, the …)

You're right about the redundancy, @Eh2406. Could you do a test of execution time with orca_test vs. without orca_test, to get a sense of the performance hit? I'd expect most of the tests to be practically costless, unless there's a bad cache setting and columns are being regenerated. But maybe I'm wrong -- and the redundancy could also be annoying if it makes the debug output harder to read, or just on a conceptual level.

Right now the hierarchy is specified within the assertion functions. For example, …
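Roughly, the pattern is that a higher-level assertion enforces its prerequisites itself, so a spec that also lists the prerequisite explicitly ends up running it twice. The function names and bodies below are illustrative, not the exact orca_test code:

```python
import orca

def assert_column_is_registered(table_name, column_name):
    # illustrative prerequisite check
    assert column_name in orca.get_table(table_name).columns

def assert_column_is_numeric(table_name, column_name):
    # the prerequisite is enforced inside the higher-level assertion
    assert_column_is_registered(table_name, column_name)
    series = orca.get_table(table_name).to_frame([column_name])[column_name]
    assert series.dtype.kind in 'biuf'   # bool / int / uint / float
```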
We have never run orca_test during our model runs; it lives in its own workbook. As I recall, that workbook takes approximately 15 minutes to run. To be clear, that spec was generated from some prior art by @semcogli and an automatic scan of a functional model's data, so it is long, redundant, and not optimized. I see 3 good outcomes for dealing with the redundancy:
…
Makes sense. Looking at the orca documentation, the default is for computed data not to be cached at all, which does seem like it would lend itself to situations where orca_test slows down performance by re-generating things.

https://udst.github.io/orca/core.html#caching

As a first step, let's add a test that warns users if a computed object has cache=False.
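For reference, caching is opt-in when a computed column is registered. A minimal example of the pattern (the table and column names are made up):

```python
import orca

# With cache=True the column is computed once and reused, so repeated
# orca_test assertions don't trigger recomputation; cache_scope controls
# when the cached value is invalidated ('step', 'iteration', or 'forever').
@orca.column('buildings', cache=True, cache_scope='iteration')
def unit_price(buildings):
    return buildings.improvement_value / buildings.residential_units
```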
First off, this library is very cool. @smmaurer, thanks for doing this!

Has anyone else noticed that orca_test is pretty slow? I mean, our simulation takes about 73 minutes, and when I added the UAL code it slowed down dramatically. At this point I've found the two causes of the problem, and the actual new code is fairly quick.

Basically, the orca_test code adds 25 minutes to the simulation, and that's only verifying schemas in a few places.

My guess: I see merge_tables called with all columns in the code. Maybe it should be called with only the columns that are being verified by the specific orca_test spec. I mean, we have lots of computed columns, and it's well known that asking for all of them is an expensive operation.
If using all the columns is necessary, perhaps it's not necessary to use all the rows to verify the schemas? For verification purposes, I imagine we only need a few hundred rows from each table?
Barring all that, a simple on-off switch would seem essential so that it's not required to merge all the tables when not in debug mode...
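To illustrate the two ideas above, here is a sketch that merges only the columns a spec actually checks and then samples rows before running assertions. The table and column names are examples, and it assumes orca.merge_tables accepts a columns argument as documented:

```python
import orca

# Merge only the columns that the spec verifies...
cols_to_verify = ['zone_id', 'residential_units']
merged = orca.merge_tables(target='buildings',
                           tables=['buildings', 'parcels'],
                           columns=cols_to_verify)

# ...and run the schema assertions on a row sample, not the full table.
sample = merged.sample(n=min(500, len(merged)), random_state=0)
```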