Project Meeting 2024.06.20
Michelle Bina edited this page Jun 20, 2024
- Admin
- Contracting
- Phase 9b Updates
- Phase 9a Updates
- CS to test SANDAG model, full sample, sharrow on, single process in AWS cloud using Intel vs AMD hardware, keeping the image same (holding all other factors common)
- Jeff to test with shared memory for skims completely disabled (single process only)
- RSG to rerun the MTC model, holding everything constant except the code in the sharrow fix all branch.
- WSP to test SFCTA run that is crashing due to insufficient resources with a small sample run, to test the hypothesis that it has something to do with disk space.
- AMPO Contracting
- Agencies should have received Agreement MOUs from AMPO
- Typically give agencies 3 months to get everything executed and transmitted
- Drafting of TOs for Phase 9b
- Joe to follow up with Jeff, Sijia, and David to discuss details
- Latest Run Results
- Compared to the start of Phase 9, many changes were made to resolve egregious run time and memory usage problems. There have been a lot of successes, but a few outstanding issues still call the stability of the ActivitySim code into question.
- One outstanding issue is not resolved: there have been successful runs of the SANDAG model with sharrow on/off, single process, full sample. One of those tests ran very well (on WSP’s machine), but other attempts to do the exact same thing had very different (much worse) performance. Hypotheses include:
- Could be hardware
- Success on a machine with AMD hardware
- No success on machines with Intel hardware
- CS to test this hypothesis on AWS, using different instance types to vary AMD and Intel hardware
- Could be the version of numba
- RSG did a test with a numba version change and it still performed poorly, so that’s not it
- Could be a different hardware-related thing - not the CPU but the bandwidth between the CPU and RAM, but this is harder to test
- Could be related to a shared memory process in sharrow. Sharrow uses multiprocess shared memory for xarray, even when running in single process.
- Jeff is creating code to test running without any shared memory. Jeff doesn’t know why this would be a problem but is trying anything.
- Another outstanding issue: when we attempted to run multiprocess on SFCTA’s server, it crashed with a cryptic "insufficient resources" report. We can’t figure out which resource is insufficient. There’s 1 TB of RAM and presumably plenty of disk space.
- WSP to test SFCTA run that is crashing due to insufficient resources with a small sample run, to test the hypothesis that it has something to do with disk space.
- Longer-term consideration: We may want to find a way to track disk usage/requirements if we get into very large multiprocess runs.
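One possible way to track disk usage during a long run is sketched below. It is only an illustration built on the Python standard library and is not part of ActivitySim; the `DiskUsageTracker` class, its parameters, and the sampling interval are all hypothetical names and values chosen for this example.

```python
import shutil
import threading
import time


class DiskUsageTracker:
    """Sample free disk space on a path from a background thread.

    Hypothetical helper for illustration; not an ActivitySim API.
    """

    def __init__(self, path=".", interval_seconds=5.0):
        self.path = path
        self.interval_seconds = interval_seconds
        self.samples = []  # list of (elapsed_seconds, free_bytes)
        self._stop = threading.Event()
        self._thread = None
        self._start_time = None

    def _sample(self):
        # shutil.disk_usage returns a named tuple (total, used, free) in bytes
        usage = shutil.disk_usage(self.path)
        self.samples.append((time.monotonic() - self._start_time, usage.free))

    def _run(self):
        # Event.wait returns False on timeout, so keep sampling until stopped
        while not self._stop.wait(self.interval_seconds):
            self._sample()

    def __enter__(self):
        self._start_time = time.monotonic()
        self._sample()  # record the starting point
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        self._sample()  # record the end point, even if the run crashed
        return False  # never swallow the exception from the run itself

    def min_free_bytes(self):
        """Lowest free space seen, i.e. the tightest point of the run."""
        return min(free for _, free in self.samples)
```

Wrapping a model run in `with DiskUsageTracker(path=output_dir) as tracker:` (where `output_dir` is whatever directory the run writes to) and logging `tracker.min_free_bytes()` afterwards would show how close a run came to filling the disk, including runs that crash partway through.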
- RSG ran the SEMCOG model with and without sharrow. The SEMCOG model is taking longer to run with 1.3 beta
- With sharrow there’s a reduction of run time from 6.1 hours to 4.2 hours. However, the workplace location choice model takes longer (this was seen with the MWCOG model as well, before Phase 9 work).
- We did see the same pattern in the SANDAG model (see Issue #6)
- Rerunning with updated code, there’s an increase in run time with sharrow. Maybe there’s something in the sharrow fix all branch that’s causing this. RSG to re-run MTC model with the sharrow fix all branch to see if it’s showing worse times; if so, then there’s something in that PR that’s slowing things down. We need to do a new baseline for the MTC model.
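Because several of the hypotheses above turn on hardware (AMD vs. Intel) and package versions (numba), and the MTC comparison depends on knowing exactly which code produced which timing, it may help to save a small record of the environment next to each run's timing results. The following is a minimal sketch using only the Python standard library; the function name `environment_fingerprint` and the default package list are assumptions for illustration, not an existing ActivitySim feature.

```python
import importlib.metadata
import os
import platform


def environment_fingerprint(packages=("numba", "sharrow", "xarray")):
    """Collect hardware and package details to store alongside a run's timings.

    Hypothetical helper for illustration; the package list is an assumption.
    """
    info = {
        "machine": platform.machine(),
        # Vendor/model string; may be empty on some platforms
        "processor": platform.processor(),
        "system": platform.system(),
        "cpu_count": os.cpu_count(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            info[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            info[name] = None  # package is not installed in this environment
    return info
```

Writing this dictionary to a JSON file in the output directory of each test run would make it easier to line up results from different machines and branches when the next baseline is established.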