
Events: Migrating from v1 to v2 is slow #11056

Closed
Stebalien opened this issue Jul 6, 2023 · 7 comments · Fixed by #11064
@Stebalien (Member)

We should:

  1. Figure out what is slow. We might just be missing some indices, or maybe we need smaller transactions? (A transaction-batching sketch follows this list.)
  2. Parallelize.
  3. Consider doing this migration in the background? This will be kind of annoying but there's no reason it can't be done.
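
For point 1, one way to experiment with transaction sizing is to batch the per-height work into explicit transactions and compare batch sizes. A minimal sketch, not the actual Lotus migration code; `migrateHeight` is a hypothetical stand-in:

```go
// Hypothetical sketch: batch per-height migration work into explicit
// transactions so different batch sizes (1, 100, 1000 heights per commit)
// can be compared. Not the actual Lotus code.
package eventmigration

import "database/sql"

func migrateInBatches(db *sql.DB, heights []int64, batchSize int,
	migrateHeight func(*sql.Tx, int64) error) error {
	for start := 0; start < len(heights); start += batchSize {
		end := start + batchSize
		if end > len(heights) {
			end = len(heights)
		}
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		for _, h := range heights[start:end] {
			if err := migrateHeight(tx, h); err != nil {
				tx.Rollback() // discard the partial batch
				return err
			}
		}
		if err := tx.Commit(); err != nil {
			return err
		}
	}
	return nil
}
```
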
@arajasek (Contributor) commented Jul 10, 2023

My guess is we're spending most of our time walking up the chain and computing TSK CIDs for heights with no entries in the DB.

Proposal:

  • Identify the heights that actually have entries in the events DB
  • chainStore.GetTipsetByHeight() those heights
  • if the tsk doesn't match, stmtDeleteOffChainEvent them
  • if it does match, select-and-delete? (rough sketch after this list)
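
Roughly, that flow could look like the sketch below. Everything here is a stand-in: the real chainStore.GetTipsetByHeight and stmtDeleteOffChainEvent have different signatures, and the event table's column names (height, tipset_key_cid) are assumptions about the schema. The heights slice would come from the "heights that actually have entries" query.

```go
// Hypothetical sketch of the proposal above; names and schema are assumptions.
package eventmigration

import (
	"context"
	"database/sql"
)

// canonicalTSKCID stands in for "chainStore.GetTipsetByHeight() those heights,
// then take the tipset key CID"; the real API looks different.
type canonicalTSKCID func(ctx context.Context, height int64) ([]byte, error)

func migrateByHeight(ctx context.Context, db *sql.DB, heights []int64, tskAt canonicalTSKCID) error {
	for _, h := range heights {
		want, err := tskAt(ctx, h)
		if err != nil {
			return err
		}
		// Drop events whose tipset key CID doesn't match the canonical chain
		// (the stmtDeleteOffChainEvent step).
		if _, err := db.ExecContext(ctx,
			"DELETE FROM event WHERE height = ? AND tipset_key_cid != ?", h, want); err != nil {
			return err
		}
		// The matching rows would then go through the select-and-rewrite step
		// of the v1 -> v2 migration (elided here).
	}
	return nil
}
```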

@Stebalien (Member, Author)

> My guess is we're spending most of our time walking up the chain and computing TSK CIDs for heights with no entries in the DB.

Not a second per tipset, unless that's an exaggeration.

> Proposal:

That's going to repeatedly walk the chain, right? Although I guess the cache should help. But I'm pretty sure we have events at most heights.

@arajasek (Contributor) commented Jul 10, 2023

@Stebalien

> > My guess is we're spending most of our time walking up the chain and computing TSK CIDs for heights with no entries in the DB.

> Not a second per tipset, unless that's an exaggeration.

I ran some quick diagnostics; it was pretty consistently 0.5-1 seconds per tipset, and the vast majority of those tipsets had no events. My guess is that it's because we run 2 statements on every epoch (the off-chain delete and the select), even though most are no-ops.

> > Proposal:

> That's going to repeatedly walk the chain, right? Although I guess the cache should help. But I'm pretty sure we have events at most heights.

I was proposing doing that using the database itself, not the chain: writing some sort of SELECT heights FROM event THAT ACTUALLY HAVE ENTRIES WE WANT TO MIGRATE / DELETE, or whatever that is in database speak.
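
In database speak, that first step could be a single DISTINCT scan over the event table, along these lines (a sketch; the table and column names are assumptions, not copied from the actual schema):

```go
// Hypothetical sketch: "heights that actually have entries we want to
// migrate/delete" as a single DISTINCT query. Schema names are assumptions.
package eventmigration

import (
	"context"
	"database/sql"
)

func heightsWithEvents(ctx context.Context, db *sql.DB) ([]int64, error) {
	rows, err := db.QueryContext(ctx,
		"SELECT DISTINCT height FROM event ORDER BY height DESC")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var heights []int64
	for rows.Next() {
		var h int64
		if err := rows.Scan(&h); err != nil {
			return nil, err
		}
		heights = append(heights, h)
	}
	return heights, rows.Err()
}
```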

@Stebalien (Member, Author)

Ah, I see. So one query to find the relevant epochs? Yeah, that should work. Also, it really sounds like we don't have the right indices. These statements should take milliseconds.
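
One quick way to check the indices hypothesis is to ask SQLite for its query plan and look for "SCAN event" (a full table scan) versus "SEARCH event USING INDEX ...". A sketch, using a stand-in statement rather than the real delete/select from filter/index.go:

```go
// Hypothetical sketch: print SQLite's query plan for a statement, e.g.
//   explainPlan(ctx, db, "SELECT * FROM event WHERE height = 3023717")
// "SCAN event" in the output means a full table scan; an index lookup shows
// up as "SEARCH event USING INDEX ...".
package eventmigration

import (
	"context"
	"database/sql"
	"fmt"
)

func explainPlan(ctx context.Context, db *sql.DB, stmt string) error {
	rows, err := db.QueryContext(ctx, "EXPLAIN QUERY PLAN "+stmt)
	if err != nil {
		return err
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		return err
	}
	vals := make([]any, len(cols))
	ptrs := make([]any, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	for rows.Next() {
		if err := rows.Scan(ptrs...); err != nil {
			return err
		}
		// The last column holds the human-readable plan detail.
		fmt.Printf("%s\n", vals[len(vals)-1])
	}
	return rows.Err()
}
```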

@arajasek (Contributor)

Here's a commit that adds some logging, as well as a shed command that starts the migration, for easy reproduction.

Command: lotus-shed eth migrate-events

Just eyeballing some results, most tipsets are taking ~1s, with ~500ms devoted to the off-chain delete statement and another ~500ms to the select statement:

2023-07-10T17:28:14.302-0400	INFO	filter	filter/index.go:180	delete time: 533.835943ms
2023-07-10T17:28:14.774-0400	INFO	filter	filter/index.go:185	select time: 471.039961ms
2023-07-10T17:28:14.774-0400	INFO	filter	filter/index.go:166	Migrating height 3023717

2023-07-10T17:28:15.322-0400	INFO	filter	filter/index.go:180	delete time: 548.036118ms
2023-07-10T17:28:15.781-0400	INFO	filter	filter/index.go:185	select time: 459.305833ms
2023-07-10T17:28:15.781-0400	INFO	filter	filter/index.go:166	Migrating height 3023716

2023-07-10T17:28:16.300-0400	INFO	filter	filter/index.go:180	delete time: 518.672855ms
2023-07-10T17:28:16.757-0400	INFO	filter	filter/index.go:185	select time: 456.932488ms
2023-07-10T17:28:16.757-0400	INFO	filter	filter/index.go:166	Migrating height 3023715

^CReceived interrupt, shutting down... Press CTRL+C again to force shutdown
2023-07-10T17:28:17.278-0400	INFO	filter	filter/index.go:180	delete time: 520.756515ms
2023-07-10T17:28:17.743-0400	INFO	filter	filter/index.go:185	select time: 464.845068ms
2023-07-10T17:28:17.744-0400	INFO	filter	filter/index.go:166	Migrating height 3023714

2023-07-10T17:28:18.280-0400	INFO	filter	filter/index.go:180	delete time: 536.074573ms
2023-07-10T17:28:18.755-0400	INFO	filter	filter/index.go:185	select time: 474.758714ms
2023-07-10T17:28:18.755-0400	INFO	filter	filter/index.go:166	Migrating height 3023713

2023-07-10T17:28:19.311-0400	INFO	filter	filter/index.go:180	delete time: 556.234014ms
2023-07-10T17:28:19.780-0400	INFO	filter	filter/index.go:185	select time: 468.380992ms

@fridrik01 (Contributor)

The issue was slow SQL statements performed during the migration. I added some temporary indices for the purpose of the migration in #11064, and that brought the total migration down to 55 seconds on an events.db going back to the FEVM launch. That should be fast enough.
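
The shape of that fix, roughly: create indices that the migration statements can use, run the migration, then drop them so the steady-state database layout is unchanged. The index definitions below are assumptions about the schema, not the exact DDL from #11064:

```go
// Hypothetical sketch of temporary migration-only indices; not the exact DDL
// from #11064.
package eventmigration

import (
	"context"
	"database/sql"
)

func withMigrationIndices(ctx context.Context, db *sql.DB, migrate func(context.Context) error) error {
	create := []string{
		"CREATE INDEX IF NOT EXISTS tmp_event_height ON event (height)",
		"CREATE INDEX IF NOT EXISTS tmp_event_tipset_key_cid ON event (tipset_key_cid)",
	}
	drop := []string{
		"DROP INDEX IF EXISTS tmp_event_height",
		"DROP INDEX IF EXISTS tmp_event_tipset_key_cid",
	}
	for _, ddl := range create {
		if _, err := db.ExecContext(ctx, ddl); err != nil {
			return err
		}
	}
	// Drop the temporary indices afterwards so the steady-state database
	// layout is unchanged (best effort).
	defer func() {
		for _, ddl := range drop {
			db.ExecContext(ctx, ddl)
		}
	}()
	return migrate(ctx)
}
```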

@TippyFlitsUK (Contributor)

I have just tested this on my production miner and I am fully synced with EnableEthRPC enabled!!
Took slightly longer for me in production but still only a couple of minutes!! 👍
I will keep monitoring but so far, everything is looking very solid!! 🚀

Daemon:  1.23.4-dev+mainnet+git.0d7621be5+api1.5.0
Local: lotus version 1.23.4-dev+mainnet+git.0d7621be5
2023-07-11T12:33:31.918+0100	INFO	filter	filter/index.go:304	upgrading event index from version 1 to version 2
2023-07-11T12:36:12.157+0100	INFO	filter	filter/index.go:262	Successfully migrated events to version 2 in 2m40.239250597s

lotus-daemon (copy).log
