It's been almost two months since our last meeting regarding technical progress on activitysim, so it's good to get back into it. We've been a bit light on commits this week, but there are good reasons for that, which I'll cover below and during our meeting tomorrow. So let's recap recent progress.
First, we are moving forward with the CSV configuration and Python coding approach, which seems to work well for the models we've encountered thus far. This should come as no surprise, but I am increasingly convinced that this approach is a good fit for this project, and we will start work to document and test it more formally.
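To make this concrete, here's a minimal sketch of the general pattern, with an invented spec and invented column names (our actual files differ): utility expressions live in a CSV spec, get evaluated against a table of choosers with pandas, and a dot product with the coefficient columns yields utilities.

```python
import pandas as pd

# Stand-in for a CSV spec read with pd.read_csv: one row per utility term,
# an expression evaluated against the choosers table, and one coefficient
# column per alternative. All names here are invented for illustration.
spec = pd.DataFrame({
    "Expression": ["income > 50000", "persons"],
    "cars0": [-0.5, -0.2],
    "cars1": [0.1, 0.3],
})

households = pd.DataFrame({"income": [30000, 90000], "persons": [1, 3]})

# Evaluate each expression against the choosers, then take the dot
# product with the coefficients: one utility per household per alternative.
values = pd.concat(
    [households.eval(e) for e in spec["Expression"]], axis=1
).astype(float)
utilities = values.dot(spec[["cars0", "cars1"]].to_numpy())
utilities.columns = ["cars0", "cars1"]
print(utilities)
```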
We have taken steps to integrate with the separate OMX repository. The branch still exists in this repo should we ever need any of those changes again, but we're moving forward with the assumption that OMX will remain a separate repo. It's worth noting that there is an open pull request on the OMX repository with some cosmetic changes, and to date we've gotten no response to either the PR or the open issue that we've posted to that repository. We did however get the updates we needed to install OMX for activitysim.
After we got the necessary changes in OMX, we've been able to close some open issues and PRs on activitysim, which is good because we don't want too many outstanding PRs at one time; the changes would eventually begin to conflict with each other.
While we were waiting for licensing issues to resolve, we spent a fair amount of time trying to parallelize UrbanSim (which has ramifications for the parallelization of activitysim). We were able to make some simple changes (mainly caching) which increased performance by 20-30% in some cases, but our opinion at this time is that UrbanSim's highly interdependent nature makes it hard to parallelize: each model is dependent on the previous step, and there are few subproblems easily amenable to parallelization (e.g., a person can move from one side of the region to the other, so segmenting the problem geographically is difficult).
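To give a flavor of the kind of caching we mean (a toy sketch, not the actual UrbanSim code): memoize a merged table that several model steps would otherwise rebuild identically.

```python
import functools

import pandas as pd

HOUSEHOLDS = pd.DataFrame({"household_id": [1, 2], "income": [40000, 90000]})
PERSONS = pd.DataFrame({"household_id": [1, 1, 2], "age": [34, 8, 51]})

# Memoize the merged choosers table so repeated model steps reuse it
# instead of re-running the merge. lru_cache keys must be hashable, so we
# key on a simple label rather than on the DataFrames themselves.
@functools.lru_cache(maxsize=None)
def merged_choosers(segment="all"):
    return PERSONS.merge(HOUSEHOLDS, on="household_id")

first = merged_choosers()   # runs the merge
second = merged_choosers()  # served from the cache
assert first is second
```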
We looked at the multiprocessing module, Cython, Continuum's Blaze, and columnar databases, but our opinion is that the amount of work necessary to reframe the problem using these tools would be extensive, and even then the gains are likely to be modest. We've profiled the code extensively at this point, and the greatest portion of the time is spent in database-like operations: merges, lookups, groupbys, aggregations, and the like. This is where the time should be spent, and there's not a great deal we can do on our end to speed these things up. Doing these operations in C with all the data in memory (what we're doing now) is state-of-the-art for this sort of workload, but we will continue to monitor computational frameworks for advances in this area.
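For reference, the profiling itself is nothing exotic; a sketch along these lines (synthetic data standing in for the real tables) shows the time concentrated in pandas' C-level merge and groupby internals.

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

# Synthetic stand-ins for the real model tables.
n = 1_000_000
df = pd.DataFrame({
    "zone": np.random.randint(0, 1500, n),
    "value": np.random.random(n),
})

def hot_path():
    # The database-like operations that dominate our runtime.
    zone_stats = df.groupby("zone")["value"].agg(["mean", "sum"])
    return df.merge(zone_stats.reset_index(), on="zone")

cProfile.run("hot_path()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)
```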
That is not to say that activitysim will necessarily have the same limitations as UrbanSim. For instance, it's easy to imagine that if households in activitysim can be processed independently (they're not co-dependent and don't need to synchronize with other households), then you can split them into batches across computing hardware and gather the results at the end. We will certainly evaluate this when the time comes, but it is likely true that it will be challenging to parallelize the computation within an individual model. We can talk about this in some detail if you would like, and we're happy to answer questions about it.
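A rough sketch of the batching idea, under the assumption that households really are independent (`simulate_batch` here is a placeholder, not a real activitysim function):

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

def simulate_batch(households):
    # Placeholder for running a model step on one independent batch.
    out = households.copy()
    out["choice"] = np.random.randint(0, 3, len(out))
    return out

if __name__ == "__main__":
    households = pd.DataFrame({"household_id": range(10_000)})
    # Split into contiguous batches, simulate each in its own process,
    # and gather the results at the end.
    n_batches = 4
    size = -(-len(households) // n_batches)  # ceiling division
    batches = [households.iloc[i:i + size]
               for i in range(0, len(households), size)]
    with mp.Pool(processes=n_batches) as pool:
        results = pool.map(simulate_batch, batches)
    combined = pd.concat(results, ignore_index=True)
    print(len(combined))
```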
The bulk of our time in the last two weeks or so has been spent on CDAP, the coordinated daily activity pattern model. This has been a very interesting problem! Initially I coded up a prototype of CDAP which was going well for one- and two-person households, and we've even converted the CSV specs over to the new framework for these households. Through several emails with Dave Ory, I realized that I had interpreted the methodology incompletely and that the real method is a bit more complicated. Basically, there are contributions to the utility that come from every 1-person, 2-person, and 3-person combination of a multi-person household, so there are, for instance, 4 passes through the core MNL code for a 2-person household, etc.
At any rate, we were able to quickly get the information we needed from Dave, go back to the drawing board, and work out the best way to tackle the problem. Why was this so challenging, you might ask? What's interesting is that in Python, all numeric computation has to be vectorized (for loops are incredibly slow), and for a problem like this, vectorization is not trivial. In other words, we have to build large dataframes in which a given household appears multiple times, once for each of the combinations of the people in that household.
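To give a flavor of what that expanded table looks like, here is a toy sketch (made-up column names, and only the 2-person combinations shown): the household index is built once in plain Python, and the expensive utility expressions are then evaluated over the whole expanded frame in one vectorized pass.

```python
from itertools import combinations

import pandas as pd

persons = pd.DataFrame({
    "household_id": [1, 1, 1, 2, 2],
    "person_id":    [10, 11, 12, 20, 21],
})

# One row per 2-person combination within each household. Building this
# index uses a small Python loop, but the utility expressions are then
# evaluated against the whole expanded frame at once, not per household.
rows = [
    {"household_id": hh, "person_1": p1, "person_2": p2}
    for hh, group in persons.groupby("household_id")
    for p1, p2 in combinations(group["person_id"], 2)
]
pairs = pd.DataFrame(rows)
print(pairs)
# Joining person attributes onto person_1 / person_2 yields the big
# dataframe the MNL expressions run over in a single vectorized pass.
```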
Again, we can go into more depth on the call, but we do have a clear plan to move forward with this; Matt is currently working on it, and it will be fully documented and tested, which requires significantly more attention than the prototyping we were doing before the break. We can cover the rest, including more detail on why vectorizing CDAP is hard, at the meeting tomorrow.
It's probably not hugely relevant to this group, but if you're interested, in January we put together an OpenStreetMap importer for our accessibility engine which is now ready to go. GTFS (transit) will be next when we get the chance.
We should also talk about next steps in the call as I hope to have some cycles freeing up next week to work on this. It is my hope that Matt will continue with CDAP and I will start looking at the next model in the list (which is tour generation).
@jiffyclub feel free to add anything I have overlooked. Thanks and talk to you tomorrow!