
[Dashboard] Performance Review #14615

Closed
alonsogarciapablo opened this issue Jan 16, 2019 · 20 comments

@alonsogarciapablo
Contributor

We have to make sure the New Dashboard performs 👌 in accounts with loads of maps and datasets. @andy-esch recently explained how the performance of the current Dashboard is pretty bad in his account (~1000 datasets).

The goal of this issue is to assess the performance of the New Dashboard in accounts like @andy-esch 's and investigate what's causing the UI to be slow, so that we can decide how to fix some or all of those issues in the near future.

A more performant Dashboard will add a lot of value to heavy CARTO users 💪

cc: @javitonino @ethervoid

@javitonino
Contributor

Randomly came across this which should help: #14621

@javitonino
Contributor

Some more info that might be relevant: #13953

@gonzaloriestra
Contributor

Another place where with_presenter_cache could help:

VisualizationPresenter.new(dependent_visualization, @current_viewer, @context).to_summarized_poro
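For context, the idea behind a presenter cache is to memoize presenter instances per model, so repeated visualizations in a single response don't rebuild them. A minimal, hypothetical sketch of that idea (not the project's actual with_presenter_cache implementation):

  # Hypothetical memoization sketch; the real cache lives in the CartoDB codebase
  class PresenterCache
    def initialize
      @presenters = {}
    end

    # Build a presenter once per (class, id) pair and reuse it afterwards
    def get_poro(model)
      presenter = (@presenters[[model.class, model.id]] ||= yield)
      presenter.to_summarized_poro
    end
  end

  # Usage (same call as above, but going through the cache):
  # cache.get_poro(dependent_visualization) do
  #   VisualizationPresenter.new(dependent_visualization, @current_viewer, @context)
  # end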

ethervoid self-assigned this Jan 22, 2019
@alonsogarciapablo
Contributor Author

@jesusbotella sent me this issue about the (poor) performance of the datasets view:
https://groups.google.com/forum/#!msg/cartodb/K7dG4p0Tpyw/NP5ncVshCQAJ
This is similar to what @andy-esch was describing. I see @javitonino already replied. If we can fix this in the New Dashboard, it will be a super-win for our users 💪

@alonsogarciapablo
Contributor Author

@ethervoid do you have any insights regarding this so far? Thanks!

@ethervoid
Contributor

@alonsogarciapablo I'm reading all the issues, understanding the measures taken so far and so on. I finished the Likes cleanup yesterday, so I'm on it :)

@alonsogarciapablo
Contributor Author

Perfect! Thanks for the update! 💪

@ethervoid
Contributor

ethervoid commented Jan 29, 2019

Update

I've been uploading data and trying to reproduce the problem in my local environment, but it's not the same and I don't see the same latency problems as in production. Why?

  • The metadata and the user database are in the same machine
  • We don't use pgbouncer
  • We don't have hundreds/thousands of users in the same database

Then I tested it in the staging cloud with similar results, which could be due to the same factors as in the local environment.

The next step is to provision a new RUI, redirect @andy-esch 's user to that RUI, and start profiling.

For example, for this request: https://team.carto.com/u/eschbacher/api/v1/viz?tag_name=&q=&page=1&type=derived&exclude_shared=false&per_page=1&locked=true&shared=no&only_liked=false&api_key=apikey the average response time is 4 seconds, so it's a good place to start measuring.
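For reference, a quick way to time that request from a Ruby console (standard Net::HTTP and Benchmark, nothing CARTO-specific):

  require 'benchmark'
  require 'net/http'

  # Same request as above (api_key is a placeholder)
  url = URI('https://team.carto.com/u/eschbacher/api/v1/viz?tag_name=&q=&page=1&type=derived&exclude_shared=false&per_page=1&locked=true&shared=no&only_liked=false&api_key=apikey')

  # Wall-clock time of a single request, in seconds
  puts Benchmark.measure { Net::HTTP.get_response(url) }.real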

@alonsogarciapablo
Contributor Author

Thanks for the update @ethervoid! Keep 🕵️ and you will find something! 💪

@ethervoid
Contributor

ethervoid commented Jan 29, 2019

Update

I've been benchmarking and here are my findings so far.

TL;DR: It looks like either the connection to the database or the size of the database is hurting performance. We have to take into account that the dbd-team pg data disk is 2.7TB.

I've added benchmarking in every visualization object to check where the time goes:

"timeviz": [
                {
                    "label": "to_poro",
                    "real": 0.0069547109997074585,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                },
                {
                    "label": "load_related",
                    "real": 0.000002701999619603157,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                },
                {
                    "label": "load_dependent",
                    "real": 0.0000027089990908280015,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                },
                {
                    "label": "liked",
                    "real": 0.001709437001409242,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0.009999999999999787,
                    "total": 0.009999999999999787
                },
                {
                    "label": "permission",
                    "real": 3.5094704250004725,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0.15999999999999992,
                    "utime": 2.4799999999999995,
                    "total": 2.6399999999999997
                },
                {
                    "label": "stats",
                    "real": 0.003290009999545873,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                },
                {
                    "label": "user_table",
                    "real": 0.00007167700096033514,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                }
            ]
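For reference, a minimal sketch of how per-step timings like the ones above can be collected with Ruby's standard Benchmark module (the labels match the output above; exactly which presenter methods each label wraps is an assumption):

  require 'benchmark'

  # Collects one entry per step, mirroring the "timeviz" structure above
  def measure_step(timings, label)
    result = nil
    tms = Benchmark.measure { result = yield }
    timings << { 'label' => label, 'real' => tms.real,
                 'cstime' => tms.cstime, 'cutime' => tms.cutime,
                 'stime' => tms.stime, 'utime' => tms.utime, 'total' => tms.total }
    result
  end

  timings = []
  poro       = measure_step(timings, 'to_poro')    { visualization.to_poro }    # hypothetical presenter calls
  permission = measure_step(timings, 'permission') { visualization.permission }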

As you can see, a huge amount of time, 3.5s, goes to the permission part. Why? Because permissions is the first place where the user presenter is loaded, without cache, and that's where the db_size call comes in.

So it looks like we still have the same problem. Measuring the time it takes to retrieve the db size for @andy-esch 's user, you can see where the time goes (2.96s):

irb(main):002:0> Benchmark.measure { Carto::User.find_by_username('eschbacher').db_size_in_bytes }
[pool] Creating a new connection for 10.60.255.215:cartodb_user_6b304b59-2c86-4d7b-a8a2-dc98a5a35d48_db:6432:postgres:ar (1)
[pool] Found a connection for 10.60.255.215:cartodb_user_6b304b59-2c86-4d7b-a8a2-dc98a5a35d48_db:6432:postgres:ar (1)
=> #<Benchmark::Tms:0x0000000c8f7378 @label="", @real=3.636054564267397, @cstime=0.0, @cutime=0.0, @stime=0.15000000000000002, @utime=2.8100000000000005, @total=2.9600000000000004>

Interestingly, @javitonino was taking a look too and he noticed that, from the console, just the connection to the db takes ~3s:

Benchmark.measure { db = u.in_database(as: :superuser) } -> 3s
Benchmark.measure { u.db_size_in_bytes } -> 0.25s

@javitonino also ran another test: ActiveRecord::ConnectionAdapters::PostgreSQL::OID::TypeMapInitializer#run <4.620301>. This method loads all the different types in the DB and took 4.6s.

Tomorrow I'll keep investigating :)

@ethervoid
Contributor

Update

Well, we finally have a winner here and a patch for it

ActiveRecord in Rails 4.0 loads all the types from the database when a connection is made, and in our team database this takes ~2.5 seconds because it loads 145k types.
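For reference, the type count can be checked directly against the pg_type catalog of the affected user database (standard PostgreSQL, no CARTO specifics):

  -- Total number of types, plus how many are composite (table row) types and array types
  SELECT count(*)                                  AS total_types,
         count(CASE WHEN typtype = 'c' THEN 1 END) AS composite_types,
         count(CASE WHEN typelem <> 0 THEN 1 END)  AS array_types
  FROM pg_type;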

I did a test using two different user models, one from Sequel and the other from AR, doing the same operation:

  0.000000   0.000000   0.000000 (  0.235332) --> Sequel
  1.920000   0.150000   2.070000 (  2.689540) --> AR
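A rough sketch of how that comparison can be run from a Rails console, assuming the Sequel-backed User model and the AR-backed Carto::User both expose db_size_in_bytes:

  require 'benchmark'

  Benchmark.bm(8) do |x|
    # Sequel-backed model (lookup syntax assumed)
    x.report('Sequel:') { User.where(username: 'eschbacher').first.db_size_in_bytes }
    # ActiveRecord-backed model (same call as in the measurement above)
    x.report('AR:')     { Carto::User.find_by_username('eschbacher').db_size_in_bytes }
  end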

You can see the problematic method is this one:

ActiveRecord::ConnectionAdapters::PostgreSQL::OID::TypeMapInitializer#run <4.615232>
                                                ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#execute_and_clear <5.136224>
                                              ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#load_additional_types <5.136321>
                                            ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#initialize_type_map <5.136989>
                                          ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#initialize <5.142473>
                                        Class#new <5.142497>
                                      ActiveRecord::ConnectionHandling#postgresql_connection <5.142617>
                                    ActiveRecord::ConnectionAdapters::ConnectionPool#new_connection <5.142634>
                                  ActiveRecord::ConnectionAdapters::ConnectionPool#checkout_new_connection <5.142647>
                                ActiveRecord::ConnectionAdapters::ConnectionPool#acquire_connection <5.142675>
                              MonitorMixin#mon_synchronize <5.143279>
                            ActiveRecord::ConnectionAdapters::ConnectionPool#checkout <5.143303>
                          MonitorMixin#mon_synchronize <5.143356>
                        ActiveRecord::ConnectionAdapters::ConnectionPool#connection <5.143382>
                      Carto::Db::Connection.get_database <5.144334>
                    CartoDB::ConnectionPool#create_new_connection <5.144612>
                  CartoDB::ConnectionPool#fetch <5.144657>
                Carto::Db::Connection.connect <5.144780>
              Carto::UserService#in_database <5.145184>
            Carto::UserService#cartodb_extension_version <5.146027>
          Carto::UserService#cartodb_extension_version_pre_mu? <5.146076>
        Carto::UserService#db_size_in_bytes <5.672939>
      Kernel#eval <5.833719>
    IRB::WorkSpace#evaluate <5.833742>
  IRB::Context#evaluate <5.833836>
IRB::Irb#signal_status <5.833899>

As you can see, the AR model takes 2.6 seconds to do the same operation as Sequel. Why? Basically, when AR opens a connection to a database it fetches all the types from pg_type, which in our database means 145k different types.

We can divide the problem into data gathering and processing:

Executing the query and retrieving all the data takes 1.3 seconds:

  1.240000   0.090000   1.330000 (  1.331408)

Processing the data takes 0.5s:

  0.530000   0.010000   0.540000 (  0.537235)

We could improve the data processing a bit, but that wouldn't address the real problem: the huge number of types.

But why do we have that huge number of types? Taking a look at the table, we can see most of the types are table record types (one composite type per table) and arrays of those record types. So we have decided to filter those out in this query; if our app ever needs one of them, there is no problem, because AR will fetch it with an extra query.

But... are you saying that every time AR asks for a type we don't have in the cache it's going to run a query? Yes, but those types are not commonly asked for. Our app mostly deals with the metadata database, which has only a few custom types (207), so we're probably not going to notice that overhead.

It's not a perfect solution, but it's going to save us ~2 seconds on operations against huge user databases.

@ethervoid
Contributor

With the patch applied, we're now doing it in 0.6s:

ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#execute_and_clear(name="SCHEMA", binds=[]) <0.598322>, with sql:

  SELECT t.oid, t.typname, t.typelem, t.typdelim, t.typinput, r.rngsubtype, t.typtype, t.typbasetype
  FROM pg_type as t
  LEFT JOIN pg_range as r ON oid = rngtypid
  LEFT JOIN pg_type as tt ON (tt.typtype = 'c' AND tt.typarray = t.oid AND tt.typinput = 'record_in(cstring,oid,integer)'::regprocedure)
  WHERE
    tt.oid is null
    AND (t.typname IN ('int2', 'int4', 'int8', 'oid', 'float4', 'float8', 'text', 'varchar', 'char', 'name', 'bpchar', 'bool', 'bit', 'varbit', 'timestamptz', 'date', 'time', 'money', 'bytea', 'point', 'hstore', 'json', 'jsonb', 'cidr', 'inet', 'uuid', 'xml', 'tsvector', 'macaddr', 'citext', 'ltree', 'interval', 'path', 'line', 'polygon', 'circle', 'lseg', 'box', 'timestamp', 'numeric')
    OR t.typtype IN ('r', 'e', 'd')
    OR t.typinput = 'array_in(cstring,oid,integer)'::regprocedure
    OR t.typelem != 0)

I've run some tests, and when a new type is needed AR asks for it without any problem.

@ethervoid
Contributor

ethervoid commented Feb 4, 2019

Update

We've re-routed team users to our test platform in production and here are the results using the patch:

Without the patch activated

screenshot from 2019-02-04 11-28-52

With the patch activated

screenshot from 2019-02-04 11-39-58

As you can see we have improved the time for the dashboard:

  • The /me endpoint has improved from 4.3s to 1.4s
  • The /viz endpoints have improved from ~5s to ~3s

In summary, we have gone from ~26s to load the dashboard down to ~12.8s.

I'll leave the patch active and keep the team users routed through it in order to test for undesired side effects. If we don't see any issue in a couple of days (2-3), I'll put the patch into production.

Hope this improvement is good enough :)

@ethervoid
Contributor

Now I'm going to start looking at the order-by-size part, which could maybe be improved to avoid making N queries for size data. Stay tuned.
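One possible direction (just a sketch, not the implemented fix): get all table sizes in a single catalog query instead of one query per table, e.g.:

  -- Size of every ordinary table, biggest first (standard PostgreSQL catalogs)
  SELECT c.relname AS table_name,
         pg_total_relation_size(c.oid) AS size_bytes
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relkind = 'r'
    AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY size_bytes DESC;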

@alonsogarciapablo
Contributor Author

Great news @ethervoid! It's a great improvement!

@ethervoid
Contributor

@alonsogarciapablo asked about the performance with the new dashboard. I've switched @andy-esch 's user to the new dashboard and here you can see the results:

Without the new patch

screenshot from 2019-02-04 14-17-30

With the new patch

screenshot from 2019-02-04 14-15-33

More or less the same improvement: from ~12s to ~6s.

@ethervoid
Contributor

Update

About the size sort, we have a bit of a problem there:

screenshot from 2019-02-04 14-24-57

In accounts with a huge number of tables we get a 504 (timeout) after 180s

@ethervoid
Contributor

The ordering issue is going to continue here

ethervoid added a commit that referenced this issue Feb 6, 2019
* Patch for load times with pg_type loading in AR

- See #14615
- See rails/rails#19578
@ethervoid
Contributor

Deployed this part. I'll continue here

@alonsogarciapablo
Contributor Author

Great @ethervoid 👏👏👏
