exp/services/ledgerexporter: Extend tool to work on ledger ranges in parallel. #4426

Closed
Tracked by #4571
Shaptic opened this issue Jun 6, 2022 · 9 comments

Shaptic (Contributor) commented Jun 6, 2022:

We need to extend the ledgerexporter tool to export ledgers in parallel (e.g. so that it is AWS Batch-able), because single-process exporting is absolutely time-prohibitive for large (and dense) ledger ranges.

Shaptic mentioned this issue Jun 6, 2022
Shaptic changed the title from "Extend ledgerexporter to work in parallel (e.g. AWS Batch-able), because single-process exporting is absolutely time-prohibitive for large ledger ranges." to "exp/services/ledgerexporter: Extend tool to work on ledger ranges in parallel." Jun 6, 2022
bartekn self-assigned this Jun 6, 2022
2opremio (Contributor) commented:

I started implementing a single-process parallel version using goroutines but, considering the current memory requirements of Captive Core, I am thinking that's probably a bad idea.

Then, I thought about simply adding a new --end-ledger flag, which we could use to split subranges across different processes, but that poses another challenge: there is a race condition on the /latest path. I don't think we can get around using some sort of distributed lock.

Additionally, I am wondering whether it's OK to update the /latest path before all the prior ledgers are successfully exported (e.g. update /latest when ledger 35 is exported but ledger 34 hasn't finished yet). Depending on the answer, we may want to choose different parallelization strategies.
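To make that second question concrete, the bookkeeping for the strict option (only advance /latest over a contiguous prefix of exported ledgers) could look like the following in-process sketch. The type and method names are invented for illustration; coordinating this across processes is exactly where the distributed lock would come in.

```go
package ledgerexport

import "sync"

// latestTracker is a hypothetical helper: it only lets /latest
// advance across a contiguous prefix of exported ledgers, so
// /latest never points past a gap.
type latestTracker struct {
	mu       sync.Mutex
	latest   uint32          // last ledger already reflected in /latest
	exported map[uint32]bool // done ledgers not yet contiguous with latest
}

func newLatestTracker(latest uint32) *latestTracker {
	return &latestTracker{latest: latest, exported: map[uint32]bool{}}
}

// markExported records a finished ledger and returns the highest
// ledger that is safe to publish as /latest.
func (t *latestTracker) markExported(seq uint32) uint32 {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.exported[seq] = true
	for t.exported[t.latest+1] { // advance only while there is no gap
		t.latest++
		delete(t.exported, t.latest)
	}
	return t.latest
}
```

With this scheme, exporting ledger 35 before 34 simply parks 35 in the map; /latest stays put until 34 lands.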

2opremio (Contributor) commented:

@bartekn any thoughts about this? Using a distributed lock is going to be a bit of a pain in the ass. If we need to do so, I think I would rather use Kubernetes instead of AWS Batch (where we can use things like https://github.com/werf/lockgate).

2opremio (Contributor) commented Jun 28, 2022:

Actually, we could use a lock in the archive backend itself. For S3, we have Object Lock.

bartekn removed their assignment Jun 29, 2022
bartekn (Contributor) commented Jun 29, 2022:

I'm not sure any kind of locking is needed, because this issue is about exporting old ledgers. So we can split any range into smaller subranges and ingest them in parallel.
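A sketch of that splitting, with a hypothetical helper (names and types are illustrative, not part of the tool):

```go
package ledgerexport

// ledgerRange is a closed range of ledger sequence numbers.
type ledgerRange struct{ From, To uint32 }

// splitRange divides [start, end] into n contiguous subranges,
// one per parallel worker, spreading any remainder over the
// first workers.
func splitRange(start, end uint32, n int) []ledgerRange {
	total := end - start + 1
	size := total / uint32(n)
	rem := total % uint32(n)
	out := make([]ledgerRange, 0, n)
	from := start
	for i := uint32(0); i < uint32(n); i++ {
		to := from + size - 1
		if i < rem {
			to++
		}
		out = append(out, ledgerRange{From: from, To: to})
		from = to + 1
	}
	return out
}
```

For instance, splitRange(10001, 14000, 4) yields 10001–11000, 11001–12000, 12001–13000, and 13001–14000.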

2opremio (Contributor) commented Jun 29, 2022:

Yes, it doesn't make sense to parallelize exporting online ledgers. However, how do you suggest we handle the update of the /latest path? For instance:

  1. The network is at ledger 50000
  2. The ledger archive is at, say, ledger 10000 (as indicated by the /latest path)
  3. We want to use e.g. 4 workers (1k ledgers each), each in its own process
    • worker A: range 10001 to 11000
    • worker B: range 11001 to 12000
    • ...

How does /latest get updated? If we (wrongly) decided that each process updates /latest on its own, we have a race condition (e.g. if worker A finishes last, /latest will end up with the value 11000, which is incorrect).

Maybe we could cut corners and keep it simple, e.g.:

  • we disable updating the /latest endpoint through a flag (which we supply to the exporter workers when running in parallel; see the sketch below)
  • we allow having gaps in the archive (so that we don't have to worry about only updating /latest when all the prior ledgers have been exported)

Thoughts?
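If we accepted that, the driver for the parallel run stays small. A sketch, reusing the ledgerRange type from the earlier sketch; the flags shown (--start-ledger, --end-ledger, --skip-latest-update) are assumptions made for illustration, not the tool's actual CLI:

```go
package ledgerexport

import (
	"fmt"
	"os"
	"os/exec"

	"golang.org/x/sync/errgroup"
)

// runWorkers launches one exporter process per subrange and waits
// for all of them, returning the first error encountered.
func runWorkers(ranges []ledgerRange) error {
	var g errgroup.Group
	for _, r := range ranges {
		r := r // capture the loop variable for the goroutine
		g.Go(func() error {
			cmd := exec.Command("ledgerexporter",
				"--start-ledger", fmt.Sprint(r.From),
				"--end-ledger", fmt.Sprint(r.To),
				"--skip-latest-update", // hypothetical flag from the first bullet above
			)
			cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
			return cmd.Run()
		})
	}
	return g.Wait()
}
```

On AWS Batch the equivalent would be one job per subrange rather than one local process, but the flag-based division of labor is the same.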

Shaptic (Contributor, Author) commented Jun 29, 2022:

This would require quite a bit of extra work, but we could have a /latest.lock and read+update the /latest file "atomically" (see the sketch at the end of this comment)...

We could also have the workers wait on each other in sequence so that /latest is updated in order, though that poses complications with separate runtimes (e.g. on Batch).

We could also abandon the idea of /latest getting written by the exporter jobs and either have a "job manager" which writes it out at the end, or a post-processing step in the job itself (or the whole batch), e.g. `echo $(ls ./ledgers | sort -n | tail -n1) > ./latest`
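The /latest.lock idea is straightforward to sketch for a filesystem-backed archive, where an exclusive create gives atomicity. The function name and layout here are assumptions, and the approach doesn't translate directly to S3, which is what makes the distributed-lock question hard:

```go
package ledgerexport

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// updateLatest acquires /latest.lock via an atomic exclusive
// create, then bumps /latest if seq is ahead of it.
func updateLatest(dir string, seq uint32) error {
	lock := filepath.Join(dir, "latest.lock")
	f, err := os.OpenFile(lock, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
	if err != nil {
		return err // lock held by another worker; caller retries
	}
	f.Close()
	defer os.Remove(lock)

	path := filepath.Join(dir, "latest")
	cur, _ := os.ReadFile(path) // a missing file just parses as 0
	prev, _ := strconv.ParseUint(strings.TrimSpace(string(cur)), 10, 32)
	if uint64(seq) <= prev {
		return nil // never move /latest backwards
	}
	return os.WriteFile(path, []byte(fmt.Sprint(seq)), 0o644)
}
```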

Shaptic (Contributor, Author) commented Jun 29, 2022:

> Additionally, I am wondering whether it's OK to update the /latest path before all the prior ledgers are successfully exported (e.g. update /latest when ledger 35 is exported but ledger 34 hasn't finished yet). Depending on the answer, we may want to choose different parallelization strategies.

Currently, the indexer assumes that /latest is accurate. Specifically, it polls the file, and if it grows, the assumption is that those newly added ledgers are available. If we want that behavior to keep working (it was unclear at the time whether we'd do real-time or batched index updates), then /latest needs to in fact be the latest.
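For context, the consumer side of that contract looks roughly like the following sketch (names assumed, not the indexer's actual code):

```go
package ledgerexport

import (
	"os"
	"strconv"
	"strings"
	"time"
)

// pollLatest watches the /latest file and reports each newly
// visible ledger range. It trusts that every ledger up to /latest
// exists, which is exactly why /latest must never run ahead of a gap.
func pollLatest(path string, interval time.Duration, process func(from, to uint32)) {
	var seen uint32
	for range time.Tick(interval) {
		b, err := os.ReadFile(path)
		if err != nil {
			continue // file missing or unreadable; try again later
		}
		v, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 32)
		if err != nil || uint32(v) <= seen {
			continue // no growth since the last poll
		}
		process(seen+1, uint32(v))
		seen = uint32(v)
	}
}
```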

Shaptic (Contributor, Author) commented Jun 30, 2022:

Another suggestion: we could use a two-step model for parallelization, where each process exports ledgers to its own (sub-)directory with its own /latest file, and then another process merges them all into a single directory.
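A sketch of that merge step, assuming a layout of a ledgers/ subdirectory plus a latest file per worker directory (in practice this would go through the archive backend rather than the local filesystem, and it presumes every worker completed its subrange):

```go
package ledgerexport

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// merge moves every worker's exported ledgers into one final
// directory, then writes the overall /latest as the max of the
// per-worker /latest files.
func merge(workerDirs []string, dst string) error {
	var latest uint64
	for _, d := range workerDirs {
		entries, err := os.ReadDir(filepath.Join(d, "ledgers"))
		if err != nil {
			return err
		}
		for _, e := range entries {
			src := filepath.Join(d, "ledgers", e.Name())
			if err := os.Rename(src, filepath.Join(dst, "ledgers", e.Name())); err != nil {
				return err
			}
		}
		b, err := os.ReadFile(filepath.Join(d, "latest"))
		if err != nil {
			return err
		}
		v, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
		if err != nil {
			return err
		}
		if v > latest {
			latest = v
		}
	}
	return os.WriteFile(filepath.Join(dst, "latest"),
		[]byte(strconv.FormatUint(latest, 10)), 0o644)
}
```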

Shaptic (Contributor, Author) commented Jul 14, 2022:

Done in #4443
