
Skip updating certain secondary indexes during replay #683

Closed
abitmore opened this issue Feb 20, 2018 · 10 comments


@abitmore
Member

Some (if not all) secondary indexes can be generated from the current chain state alone, so they don't need to be continuously updated during replay.

@abitmore abitmore added this to the Future Non-Consensus-Changing Release milestone Feb 20, 2018
@jmjatlanta
Contributor

jmjatlanta commented Mar 19, 2018

Note: I am building requirements. I am not claiming this issue. Please comment on this post, and I'll update it with your changes:

According to libraries/chain/db_init.cpp, the account_index has two secondary indexes: account_member_index and account_referrer_index. The proposal_index has one secondary index, required_approval_index.

The grouped_orders plugin also adds a secondary index to limit_order_index called limit_order_group_index.

These indexes are kept up to date during the replay process, but this is unnecessary: they are only used once the chain state is current. Deferring updates to these indexes should therefore improve replay performance.

It has yet to be proven that none of these four indexes is used during the replay process. Each index should be examined to verify that it is not consulted while replay is in progress; only if it is unused should updates to it be skipped during replay.

It appears that most of the replay process is contained in libraries/chain/db_management.cpp. The process should be modified to:

  1. Be aware that the replay process is in progress.
  2. Skip updating the secondary indexes that are not used during the replay process.
  3. At the end of the replay process, build these secondary indexes.

Edit: Added limit_order_group_index, added the fact that we must verify that each index is not used within the replay process.

@abitmore
Member Author

I didn't check how many secondary indexes there are in the code. However, in the grouped_orders plugin I did add one more.

We do need to make sure that they're not used in replay.

@pmconrad
Contributor

If possible, make a test run without secondary indexes first, to see how big the savings would be.

@jmjatlanta
Contributor

jmjatlanta commented Mar 19, 2018

If possible, make a test run without secondary indexes first, to see how big the savings would be.

@pmconrad I will attempt to.

We do need to make sure that they're not used in replay.

@abitmore Do you have suggestions for what should happen if they are? I'm thinking of the scenario where a replay is running and a client connects and makes an API call that requires a secondary index. Can that happen (I think clients can connect during replay, but I'm unsure)? If so, should the call block until replay completes, or return an error?

@abitmore
Member Author

Clients can't connect during replay. For a simple test, we can remove the related code from db_init.cpp, then try a replay and compare the elapsed time with the result from running the old code. If the indexes are needed, the replay should fail.

@jmjatlanta
Contributor

jmjatlanta commented Mar 19, 2018

Clients can't connect during replay.

Awesome. Here are my numbers:
Started witness node as:
witness_node --data-dir data/my_datadir --replay --rpc-endpoint "127.0.0.1:8090" --max-ops-per-account 1000 --partial-operations true

After 2 runs with secondary indexes, 3154084 blocks: average 241.1075 secs, with the two runs differing by less than 0.5 secs.

After 2 runs without secondary indexes, 3154084 blocks: average 234.9055 secs, with the two runs differing by less than 1.19 secs.

Between the runs with and without secondary indexes, the difference is about 2.6%. So replaying 3154084 blocks from genesis until the first of February, 2016 costs an extra 6 seconds.

Note: As indexes grow, insertion times can be longer (although usually not linearly, heavily dependent on implementation). So interpolating based on number of current total blocks may not be accurate.

Therefore, I tested again with a larger number of blocks (see further down):

Here are the details with the smaller number of blocks:
Try 1 (with secondary indexes):
99.5535% 3140000 of 3154084
2188164ms th_a db_management.cpp:78 reindex ] Writing database to disk at block 3144084
2188380ms th_a db_management.cpp:80 reindex ] Done
99.8705% 3150000 of 3154084
2189139ms th_a db_management.cpp:122 reindex ] Done reindexing, elapsed time: 240.82684299999999666 sec
Try 2 (with secondary indexes):
99.5535% 3140000 of 3154084
2540865ms th_a db_management.cpp:78 reindex ] Writing database to disk at block 3144084
2541077ms th_a db_management.cpp:80 reindex ] Done
99.8705% 3150000 of 3154084
2541826ms th_a db_management.cpp:122 reindex ] Done reindexing, elapsed time: 241.33825699999999870 sec
Try 3 (without secondary indexes):
99.5535% 3140000 of 3154084
3540592ms th_a db_management.cpp:78 reindex ] Writing database to disk at block 3144084
3540819ms th_a db_management.cpp:80 reindex ] Done
99.8705% 3150000 of 3154084
3541556ms th_a db_management.cpp:122 reindex ] Done reindexing, elapsed time: 235.50080600000001141 sec
Try 4 (without secondary indexes):
99.5535% 3140000 of 3154084
225939ms th_a db_management.cpp:78 reindex ] Writing database to disk at block 3144084
226161ms th_a db_management.cpp:80 reindex ] Done
99.8705% 3150000 of 3154084
226901ms th_a db_management.cpp:122 reindex ] Done reindexing, elapsed time: 234.31027000000000271 sec

Try 1 with secondary indexes and a larger number of blocks (25380257 blocks):
99.999% 25380000 of 25380257
1115270ms th_a db_management.cpp:122 reindex ] Done reindexing, elapsed time: 6762.67697399999997288 sec
Try 2 without secondary indexes and a larger number of blocks (25380257 blocks):
99.999% 25380000 of 25380257
1144517ms th_a db_management.cpp:122 reindex ] Done reindexing, elapsed time: 6601.85906400000021677 sec

A difference of 160.8179 seconds, which is 2.38%.

@abitmore
Member Author

Just found that I have some statistics about replay here: bitshares/bitshares-fc#20 (comment).

Replay time with and without the grouped_orders plugin (which has a secondary index) is 5951 seconds vs 5603 seconds, a difference of about 6%.

@jmjatlanta
Contributor

jmjatlanta commented Mar 21, 2018

My interpretation of the tests above:

  • Running with 3 secondary indexes increases the replay time by 2 to 3 percent. At the current blockchain size, running on my machine, it was demonstrated to add an extra 2 minutes and 40 seconds to a 110 minute process.
  • Running without the secondary indexes will require an additional step at the end to generate those indexes (not tested, so I am unsure how long that will take).
  • Moving some indexes to a plugin (as was done in the grouped_orders plugin, and as suggested in issue Move account_member_index to a plugin #682) is another way to mitigate the performance issue for some end-users.

With respect to the results above, I look forward to your comments, questions, and advice on how to proceed.

@pmconrad
Contributor

IMO a 2-3% performance gain does not justify the risks associated with getting it wrong.

@abitmore
Member Author

Fixed by #1918.
