Profiling Memory Usage and Object Creation #1204
Comments
The most likely source of garbage is the …
The …
I'd also like to help out here, since it's crucial for our use cases as well.
Thanks @daschl, I'd appreciate your help. Profiling and identifying hot spots is what we need most right now.
I did some GC profiling of my test workloads and I'd also like to nominate:
The bad news is that I had to fall back from Observables to plain execution on the hot code path (aside from the overall wrapping Observable), because using Rx inside that path produces way too much garbage. Moving away from Rx in the hot code path improved my throughput (as reported in the GC logs) from 20% to 80%, and that correlates with my earlier findings: I could not sustain constant IO throughput because full GCs were happening far too frequently.
Not surprised by this. Were you able to identify what the garbage is?
We can definitely improve on the …
I think that many …
I went back in history to 0.16.1 to compare performance of the basic … Here is the code for the test:
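(A representative sketch only: this assumes the same JMH `UseCaseInput` harness used in the `observeOn` and `mapTransformation` snippets later in this thread; the benchmark body itself is an assumption, not the original test code.)

```java
// Sketch of a basic-usage benchmark in the UseCaseInput style used
// elsewhere in this thread; not the original code behind these results.
@GenerateMicroBenchmark
public void basicUsage(UseCaseInput input) throws InterruptedException {
    // Subscribe with no operators to measure the base
    // Observable/Subscriber overhead between versions.
    input.observable.subscribe(input.observer);
    input.awaitCompletion();
}
```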
Results
0.16
Master
GC
On the master branch test I'm seeing GC results like this:
versus 0.16:
Summary
Unless I'm mistaken, current code is better:
I'll start profiling this and improve ... but this does not reveal the source of the problems seen. Possibly it's related to schedulers, or it's a specific operator. I exercised …
The …

```java
@GenerateMicroBenchmark
public void observeOn(UseCaseInput input) throws InterruptedException {
    input.observable.observeOn(Schedulers.computation()).subscribe(input.observer);
    input.awaitCompletion();
}
```

Thus, with an …
By the way, all testing is just being done on my Mac laptop ... so these numbers are all relative and not representative of proper server hardware.
Converting from … to this: …
@benjchristensen I suppose the …
If you want me to run a specific workload or type of test, let me know so we can compare results.
I've been experimenting with FieldUpdaters and Unsafe for the …
@akarnokd since RxJava also runs on Android, I'm not sure how good/standard the support is there. I know the Netty folks have the same issues, and they wrap those Unsafe calls in a PlatformDependent util class.
This sounds like a valid approach for us. As we mature Rx, we'll want to squeeze as much performance out of it as we can while still remaining portable.
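For reference, a minimal sketch of the field-updater approach being discussed: an `AtomicLongFieldUpdater` performs atomic updates on a plain `volatile` field, so no separate `AtomicLong` has to be allocated per instance. The class and field names below are illustrative, not RxJava internals:

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Illustrative only; not RxJava's actual class layout.
final class RequestCounter {
    // Plain volatile field -- avoids allocating an AtomicLong per instance.
    private volatile long requested;

    private static final AtomicLongFieldUpdater<RequestCounter> REQUESTED =
            AtomicLongFieldUpdater.newUpdater(RequestCounter.class, "requested");

    long add(long n) {
        return REQUESTED.addAndGet(this, n);
    }
}
```

Because the field updaters are standard `java.util.concurrent.atomic` API, they also work on Android, whereas direct `sun.misc.Unsafe` access is exactly the kind of call a PlatformDependent-style wrapper would need to guard.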
Testing with:
May 21st 0efda07
May 26th a34cba2
According to these results we got slower (though it appears to be within the margin of error, so if not slower, then at least no better).
Which Java version is this? Java 6's intrinsics aren't as good as the newer versions'. Maybe the …
/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk
Master branch with /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk
Here is a simple test without JMH (but using the same coding pattern) that shows significant increases in throughput from 0.16 -> 0.17 -> 0.18 -> the current master branch for this code:

```java
public void mapTransformation(UseCaseInput input) throws InterruptedException {
    input.observable.map(i -> {
        return String.valueOf(i);
    }).map(i -> {
        return Integer.parseInt(i);
    }).subscribe(input.observer);
    input.awaitCompletion();
}
```

master
Version 0.18.3
Version 0.17.6 (using …
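The loop-based harness itself is not shown above; a minimal sketch of what such a non-JMH throughput test could look like, assuming RxJava on the classpath (the class name, iteration count, and timing approach are assumptions, not the original code):

```java
import java.util.concurrent.CountDownLatch;
import rx.Observable;

// Rough throughput loop in the spirit of "same coding pattern, no JMH".
// Numbers from a loop like this are only indicative; the JMH results
// above remain the more rigorous measurement.
public class MapTransformationLoop {
    public static void main(String[] args) throws InterruptedException {
        final int count = 1_000_000; // assumed iteration count
        final CountDownLatch latch = new CountDownLatch(1);
        long start = System.nanoTime();
        Observable.range(0, count)
                .map(i -> String.valueOf(i))
                .map(s -> Integer.parseInt(s))
                .subscribe(i -> { }, Throwable::printStackTrace, latch::countDown);
        latch.await();
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println("ops/sec: " + (long) (count / seconds));
    }
}
```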
Very good progress! I'll get back to profiling from master next week.
I ran some benchmarks with …
Well, that's odd, and it doesn't help much when two different ways of measuring give contradictory results :-(
This is creating lots of …
Those were from 0.18.2 ... now with Master, plus a modified …
The … Then the master branch with … The issue is definitely the …
I've added some logging to our production instances and discovered that the large Subscription arrays we see in practice are caused by a prefetching operation which generates many (> 500) … This seems like a valid case to support, and any work that improves performance for large Subscription arrays would be a meaningful improvement.
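As a rough illustration of why the size matters (this is a sketch of an array-backed composite, not RxJava's actual `CompositeSubscription` code): with ~500 entries, each removal is a linear scan plus an array copy, and each addition past capacity triggers a reallocation, which is how large Subscription arrays end up dominating allocation profiles.

```java
import java.util.Arrays;

// Sketch of an array-backed composite; not RxJava's implementation.
final class ArrayComposite<T> {
    private Object[] items = new Object[16];
    private int size;

    synchronized void add(T item) {
        if (size == items.length) {
            items = Arrays.copyOf(items, size * 2); // reallocation + copy
        }
        items[size++] = item;
    }

    synchronized void remove(T item) {
        for (int i = 0; i < size; i++) {            // linear scan
            if (items[i] == item) {
                // shift the tail down, producing more array churn
                System.arraycopy(items, i + 1, items, i, size - i - 1);
                items[--size] = null;
                return;
            }
        }
    }
}
```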
- significant reduction in object allocations - details on research available at ReactiveX#1204
I have submitted a pull request for this: #1281. We will be testing the code in our environment shortly.
For anyone wanting to dig into this, Java Flight Recorder has been very helpful, and far better than the other profiling tools I've tried for this.
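For anyone reproducing this on the Oracle JDKs of that era (7u40+ / 8), Flight Recorder has to be unlocked with the commercial-feature flags; the invocation looks roughly like this (the recording duration, file name, and application jar are placeholders):

```
java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
     -XX:StartFlightRecording=duration=120s,filename=rxjava-alloc.jfr \
     -jar my-benchmark.jar
```

In Java Mission Control, the allocation views ("in new TLAB" / "outside TLAB") are what surface the object-creation hot spots discussed in this issue.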
Backporting to 0.18.x in #1283
Superb work, guys. I've held off on 0.18.x on Android as I noticed an increase in GC; really glad you take this seriously!
Thanks @chrisjenx ... it looks like the most glaring issues are resolved and the low-hanging fruit taken care of. There are a few other things for us to improve on, but I think we'll release 0.19 early next week. I would appreciate your feedback on whether you see an improvement. I have also opened #1299 to document our attempts at blocking vs non-blocking implementations and to seek input from anyone who can provide better solutions.
@akarnokd Is there anything else that stands out to you that we should fix before closing this issue? I'll continue doing some profiling, but it seems the obvious ones are done. We'll continue working on performance going forward, and those efforts can have their own issues and pull requests, so if nothing else obvious stands out, let's close this issue and not leave it open-ended.
The history List in ReplaySubject: since ArrayList may use more memory than the actual items require, it might be worth compacting it on a terminal state (a one-time operation, but it might be costly and could run out of memory). Alternatively, it could use a fixed-increment expansion strategy. A third option is a cache() overload that passes in a capacity hint to reduce reallocation and wasted space.
I think the object-allocation penalty of resizing after a terminal event would be worse. A cache() overload that takes a capacity hint may be valuable, particularly in the single-item case, where it could use just a single volatile reference instead of an array.
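A minimal sketch of the capacity-hint idea (the constructor and names below are illustrative; this is not ReplaySubject's actual internals or the eventual API): pre-sizing the backing list avoids the repeated grow-and-copy cycles and the wasted tail space, and a known size of one could skip the list entirely in favor of a single volatile reference.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the capacity-hint idea only; not ReplaySubject's code.
final class History<T> {
    private final List<T> values;

    History(int capacityHint) {
        // Pre-sizing avoids ArrayList's incremental grow-and-copy while
        // the history fills, and avoids over-allocated tail space.
        this.values = new ArrayList<T>(capacityHint);
    }

    void add(T value) {
        values.add(value);
    }
}
```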
I have opened a new issue for the cache() overload: #1303
I'm closing this issue out as I believe we have handled the most glaring problems, and I don't want this to be a never-ending issue. We will of course continue seeking to improve performance, but let's use individual issues for each improvement or problem we find. Thanks, everyone, for your involvement on this one; it was rather significant and important. @Xorlev and @daschl, I would appreciate feedback once you've had a chance to try the changes in the master branch (or the portion that was backported to 0.18.4), to know whether you see the improvements or still have issues. @Xorlev, in particular I'd like to know whether the issue you had was only the GC pressure, or if you still see signs of a memory leak (which I have not seen yet).
@benjchristensen Hystrix 1.3.16 w/ RxJava 0.18.4 has been in prod for about a day now, and I'm happy to report a decrease in garbage (and CPU usage in general). I believe the pressure and the suboptimal subscription removal were causing the leak-like behavior. @mattrjacobs's use case matches a few of our own (fan out commands, wait on all), which is likely the source of the large numbers of subscriptions. I'll keep an eye out for any similar issues that might crop up. Thanks a lot for all the help and dedication to improving RxJava.
Excellent, thank you @Xorlev for the confirmation. I'll release Hystrix 1.3.17 in a few days, hopefully with RxJava 0.19 as a dependency and at least one performance optimization I found I can make in Hystrix directly.
We need to spend time profiling memory and object allocation and finding places where we can improve.
I would really appreciate help diving into this and finding problem areas. Even if you don't fix them but just identify use cases, operators, etc., that would be very valuable.
This is partly a result of the fact that in Netflix production we have seen an increase in YoungGen GCs since 0.17.x.
The areas to start should probably be:
If you can or want to get involved in this, please comment here so we can all collaborate.
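Since the motivation above is an increase in YoungGen GCs, a note on reproducing that observation: the JDK 7/8 GC logging flags typically used for this kind of before/after comparison look roughly like the following (the log file name and application jar are placeholders):

```
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:gc.log -jar my-app.jar
```

Comparing young-generation collection frequency between RxJava versions under the same workload gives the kind of GC results referenced earlier in the thread.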