
Slowing down of planning performance after prolonged OTP2 execution #3273

Closed
sven4all opened this issue Dec 3, 2020 · 7 comments

@sven4all
Contributor

sven4all commented Dec 3, 2020

Expected behavior

I expect the performance of planning with OTP to be more or less the same whether you run it for 1 hour or for 48 hours in a row. I also expect memory usage to remain more or less stable (at least not increasing indefinitely).

Observed behavior

The performance of planning is slowing down over time, which is also reflected in the debugOutput.

This is the output for planning exactly the same trip on an OTP instance that has been running for 5 minutes (two different machines with exactly the same hardware configuration were used):

  "metadata": {
    "searchWindowUsed": 5400,
    "nextDateTime": 1606990260000,
    "prevDateTime": 1606979460000
  },
  "debugOutput": {
    "precalculationTime": 0,
    "directStreetRouterTime": 17,
    "transitRouterTime": 419,
    "filteringTime": 1,
    "renderingTime": 0,
    "totalTime": 437,
    "transitRouterTimes": {
      "tripPatternFilterTime": 62,
      "accessEgressTime": 48,
      "raptorSearchTime": 301,
      "itineraryCreationTime": 8
    }
  }

and this is the result for a machine that has been running for approximately two days:

 "metadata": {
    "searchWindowUsed": 5400,
    "nextDateTime": 1606990260000,
    "prevDateTime": 1606979460000
  },
  "debugOutput": {
    "precalculationTime": 0,
    "directStreetRouterTime": 17,
    "transitRouterTime": 1337,
    "filteringTime": 0,
    "renderingTime": 0,
    "totalTime": 1354,
    "transitRouterTimes": {
      "tripPatternFilterTime": 175,
      "accessEgressTime": 33,
      "raptorSearchTime": 1126,
      "itineraryCreationTime": 3
    }
  }

When changing the requested date from today (3 December) to 7 December (a Monday with more or less the same schedule as today), the performance of the machine that has been running for 5 minutes is more or less the same as that of the machine that has been running for almost two days.

5 minutes:

"metadata": {
    "searchWindowUsed": 5400,
    "nextDateTime": 1607335860000,
    "prevDateTime": 1607325060000
  },
  "debugOutput": {
    "precalculationTime": 0,
    "directStreetRouterTime": 11,
    "transitRouterTime": 361,
    "filteringTime": 1,
    "renderingTime": 0,
    "totalTime": 373,
    "transitRouterTimes": {
      "tripPatternFilterTime": 42,
      "accessEgressTime": 38,
      "raptorSearchTime": 274,
      "itineraryCreationTime": 7
    }
  }

2 days:

"metadata": {
    "searchWindowUsed": 5400,
    "nextDateTime": 1607335860000,
    "prevDateTime": 1607325060000
  },
  "debugOutput": {
    "precalculationTime": 0,
    "directStreetRouterTime": 17,
    "transitRouterTime": 343,
    "filteringTime": 0,
    "renderingTime": 0,
    "totalTime": 360,
    "transitRouterTimes": {
      "tripPatternFilterTime": 38,
      "accessEgressTime": 32,
      "raptorSearchTime": 270,
      "itineraryCreationTime": 3
    }
  }

The memory consumption of the machine that has been running for 2 days is 4 GB higher than on the machine that has been running for 5 minutes, and it has happened multiple times that memory usage kept increasing until no memory was available at all.

Version of OTP used (exact commit hash or JAR name)

https://repo1.maven.org/maven2/org/opentripplanner/otp/2.0.0/otp-2.0.0-shaded.jar

Data sets in use (links to GTFS and OSM PBF files)

https://data.moopmobility.nl/gtfs/new/
http://download.geofabrik.de/europe/netherlands-latest.osm.pbf

GTFS-RT config in router-config and build-config

Command line used to start OTP

/usr/bin/java -Djava.io.tmpdir= -XX:NewRatio=1 -Xms124G -Xmx124G -server -jar /home/otp/otp-2.x-SNAPSHOT-shaded.jar --load /home/otp/graphBuildFiles/newest/graphs --serve --maxThreads 16

Router config and graph build config JSON

https://gist.github.com/sven4all/08e46bdacc32699a91684d72638bef7e

Steps to reproduce the problem

  1. Start an OTP instance and let it run for 2 days.
  2. Start another OTP instance.
  3. Run exactly the same calls on both instances and compare the debugOutput timings (a sketch of such a comparison follows below).
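
For step 3, a minimal sketch of the kind of comparison used here, assuming the default OTP REST endpoint on port 8080; the host names and query parameters are placeholders, not the actual requests from this report:

  # Compare debugOutput timings for one identical request on two instances.
  # Host names and query parameters below are placeholders; substitute your own.
  QUERY='fromPlace=52.37,4.89&toPlace=52.08,4.31&date=2020-12-03&time=10:00&mode=TRANSIT,WALK'
  for HOST in otp-fresh:8080 otp-2days:8080; do
    echo "== $HOST =="
    curl -s "http://$HOST/otp/routers/default/plan?$QUERY" \
      | jq '.debugOutput | {totalTime, transitRouterTime, transitRouterTimes}'
  done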
@t2gran t2gran added this to the 2.1 milestone Dec 3, 2020
@sven4all
Contributor Author

sven4all commented Dec 3, 2020

Based on the discussion this morning I installed jvmtop. On the instance that has been running for 2 days, HPCUR is constantly cycling between 20 GB and 82 GB with a period of 50-55 s.

PID MAIN-CLASS HPCUR HPMAX NHCUR NHMAX CPU GC VM USERNAME #T DL
5371 SHOT-shaded.jar 47079m 126976m 114m n/a 1.86% 0.00% 11.0. otp 223

The instance that has been running for 2 hours now cycles HPCUR between 8 GB and 70 GB in approximately 4 minutes.

If you want to run jvmtop on your own machine, patric-r/jvmtop#109 (comment) is a handy reference (the master version does not support Java 11).
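
For anyone who wants to sample the same numbers without jvmtop, a rough equivalent using jstat from the JDK (a sketch only; 5371 is the PID from the table above, substitute your own):

  # Sample heap occupancy and GC counters every 5 seconds (5000 ms) for the OTP process.
  jstat -gcutil 5371 5000
  # The E (eden) and O (old generation) utilisation columns together roughly track HPCUR;
  # the YGC/FGC counters show how many young/full collections have run so far.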

@gmellemstrand
Contributor

I see that you are running with -Xms124G -Xmx124G. That is quite a lot of memory. Does the machine you are running it on actually have that amount of memory available? Maybe this affects how often garbage collection is performed.

Running a graph built from the provided files in IntelliJ, I see 3.5 GB of used memory after about 45 minutes. This is with the provided realtime feeds running, but without any requests against the OTP instance. (Currently using -Xmx18G.)

I do see the number of TripPatternForDate objects increasing steadily, from about 10k to about 14k in this timeframe. I am not sure it is a problem, but it might be worth looking into whether updates are matched correctly. In our deployment we use SiriTripPatternCache to do some custom matching, because the realtime data provider does not provide matching IDs. The following warning might also be caused by this:

13:42:07.920 WARN (TimetableSnapshotSource.java:244) Failed to apply TripUpdate.
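
A rough way to check whether TripPatternForDate instances keep accumulating on a long-running instance is a class histogram from jmap; this is only a sketch, and the process id is a placeholder:

  # Print a class histogram of live objects and count TripPatternForDate instances.
  # Note: the :live option forces a full GC on the target JVM, so expect a pause.
  jmap -histo:live <otp-pid> | grep TripPatternForDate

Repeating this every few hours should show whether the count grows without bound as realtime updates are applied.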

@sven4all
Contributor Author

sven4all commented Dec 8, 2020

The machines do have 126 GB of RAM available; that could be a reason why garbage collection is done less often.

Is that 3.5 GB with complete functionality (including the street graph), or is the 3.5 GB only related to the realtime part?
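
One way to verify how often collections actually run with the 124G heap would be to enable GC logging; a sketch of the Java 11 unified-logging flag that could be added to the start command (the file path and rotation settings here are arbitrary examples):

  # Appended to the java start command; writes timestamped GC events to rotating log files.
  -Xlog:gc*:file=/home/otp/otp-gc.log:time,uptime:filecount=5,filesize=20m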

@gmellemstrand
Contributor

The 3.5 GB is with a graph built using netherlands-latest.osm.pbf, gtfs-nl.zip, and gtfs-iff-nl.zip (which gave me a graph size of 693 MB), with all the updaters provided in the router-config running. I made no routing requests, so those would require additional memory.

@gmellemstrand
Contributor

@sven4all Do you still see the issue using the newest dev-2.x?

@sven4all
Contributor Author

sven4all commented May 3, 2021

The performance is fine now after running for a longer time; I still have to investigate the memory usage further. It's still using a lot of memory, but I also provision that much. It doesn't seem to run out of memory.

@t2gran
Member

t2gran commented Jul 19, 2021

I will close this issue now, since it seems that the memory leak is fixed. If you, @sven4all, find the memory usage to be too greedy or have any other problems with it, then please do not hesitate to reopen the issue.

@t2gran t2gran closed this as completed Jul 19, 2021