[Remote Store] Add file details to recoveryState while downloading segments from remote store #6825

Merged
merged 8 commits into from
Apr 4, 2023

Member

@sachinpkale sachinpkale commented Mar 24, 2023

Description

  • The _cat/recovery API reports recovery progress, which is especially useful when shards are large.
  • When restoring a remote store backed index, a recovery entry is created, but _cat/recovery shows no data for it.
  • This change adds support for tracking segment download progress.

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@sachinpkale
Member Author

Sample response:

[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     3s   remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    3               15.0%         20          1528423717 69092481        4.5%          1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     4.7s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    3               15.0%         20          1528423717 194610305       12.7%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     6.2s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    4               20.0%         20          1528423717 334396161       21.9%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     7.9s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    4               20.0%         20          1528423717 440269569       28.8%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     9.7s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    4               20.0%         20          1528423717 509246209       33.3%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     11.5s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    4               20.0%         20          1528423717 603486977       39.5%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     12.8s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    9               45.0%         20          1528423717 672821109       44.0%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     13.8s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    9               45.0%         20          1528423717 748990325       49.0%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     14.7s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    10              50.0%         20          1528423717 801528526       52.4%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     15.8s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    10              50.0%         20          1528423717 916560590       60.0%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     16.7s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    10              50.0%         20          1528423717 988551886       64.7%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     17.7s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    10              50.0%         20          1528423717 1060985550      69.4%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     18.9s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    16              80.0%         20          1528423717 1112420414      72.8%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     20.4s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    16              80.0%         20          1528423717 1213411390      79.4%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     21.7s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    16              80.0%         20          1528423717 1294299198      84.7%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     23.1s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    16              80.0%         20          1528423717 1379758142      90.3%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     24.3s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    16              80.0%         20          1528423717 1418670142      92.8%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     25.2s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    16              80.0%         20          1528423717 1462546494      95.7%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time  type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     26.1s remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    16              80.0%         20          1528423717 1515941950      99.2%         1528423717  -1           0                      -1.0%
[ec2-user@ip-172-32-32-249 ~]$ curl "$node:9200/_cat/recovery?active_only&v"
index      shard time type         stage source_host source_node target_host   target_node repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
my-index-1 0     27s  remote_store index n/a         n/a         172.32.32.112 node-3      n/a        n/a      20    20              100.0%        20          1528423717 1528423717      100.0%        1528423717  -1           0                      -1.0%
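
The sample output above can also be checked programmatically. The following is a minimal sketch, assuming the whitespace-separated layout shown in the sample and that no column value contains spaces; `CatRecoveryRow` and its methods are hypothetical helper names, not part of OpenSearch. It maps one header/row pair to named columns and recomputes the byte percentage from the raw counters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not an OpenSearch class): parses one header/row pair of `_cat/recovery?v` output. */
final class CatRecoveryRow {

    /** Maps each column name in the header line to the value at the same position in the row. */
    static Map<String, String> parse(String header, String row) {
        String[] names = header.trim().split("\\s+");
        String[] values = row.trim().split("\\s+");
        if (names.length != values.length) {
            throw new IllegalArgumentException(
                "header has " + names.length + " columns, row has " + values.length);
        }
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            columns.put(names[i], values[i]);
        }
        return columns;
    }

    /** Recomputes the byte percentage from bytes_recovered and bytes_total. */
    static double bytesPercent(Map<String, String> columns) {
        long recovered = Long.parseLong(columns.get("bytes_recovered"));
        long total = Long.parseLong(columns.get("bytes_total"));
        return total == 0 ? 0.0 : 100.0 * recovered / total;
    }
}
```

For the first row of the sample (69092481 of 1528423717 bytes), the recomputed value is roughly 4.52, which agrees with the reported bytes_percent of 4.5%.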

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@codecov-commenter

Codecov Report

Merging #6825 (4306497) into main (e4d9fb5) will increase coverage by 0.08%.
The diff coverage is 85.00%.

@@             Coverage Diff              @@
##               main    #6825      +/-   ##
============================================
+ Coverage     70.60%   70.68%   +0.08%     
- Complexity    59151    59228      +77     
============================================
  Files          4810     4810              
  Lines        283502   283515      +13     
  Branches      40884    40888       +4     
============================================
+ Hits         200153   200414     +261     
+ Misses        66869    66635     -234     
+ Partials      16480    16466      -14     
Impacted Files Coverage Δ
...java/org/opensearch/index/shard/StoreRecovery.java 65.75% <66.66%> (-0.20%) ⬇️
...in/java/org/opensearch/index/shard/IndexShard.java 69.39% <77.77%> (-0.68%) ⬇️
...earch/index/store/RemoteSegmentStoreDirectory.java 97.97% <100.00%> (-1.34%) ⬇️

... and 505 files with indirect coverage changes

@sachinpkale sachinpkale marked this pull request as ready for review March 25, 2023 03:06
@@ -4377,6 +4377,9 @@ public void close() throws IOException {
     * @throws IOException if exception occurs while reading segments from remote store
     */
    public void syncSegmentsFromRemoteSegmentStore(boolean overrideLocal) throws IOException {
        if (recoveryState.getStage() != RecoveryState.Stage.INDEX) {
Member

nit: should we add an assert here and control this from the caller instead? Calling this method when the recovery state stage is not INDEX is probably not the correct thing to do in the first place. For the caller, it could lead to unexpected behaviour, since this call is supposed to sync segments from the remote store.

Member Author

Agree, I am not fully convinced about adding this block here (nor about quietly returning). In the SegRep integration with remote store, this can create issues (cc: @ankitkala). Let me think more on this.
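
The trade-off discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `StageGuardSketch`, its reduced `Stage` enum, and both method names are hypothetical, and an `IllegalStateException` stands in for a Java `assert` so that the behaviour does not depend on the `-ea` JVM flag.

```java
/** Hypothetical sketch contrasting a quiet return with a fail-fast stage check. */
final class StageGuardSketch {

    /** Reduced stand-in for the recovery stages; the real enum may have more values. */
    enum Stage { INIT, INDEX, TRANSLOG, DONE }

    /** Quiet variant: does nothing unless the stage is INDEX. Returns whether the sync ran. */
    static boolean syncQuietly(Stage stage, Runnable sync) {
        if (stage != Stage.INDEX) {
            return false; // the caller cannot tell a skipped sync from a completed one unless it checks
        }
        sync.run();
        return true;
    }

    /** Strict variant: the caller must guarantee the stage, and a violation surfaces immediately. */
    static void syncStrict(Stage stage, Runnable sync) {
        if (stage != Stage.INDEX) {
            throw new IllegalStateException("sync requested in stage " + stage + ", expected INDEX");
        }
        sync.run();
    }
}
```

With the quiet variant, a caller in the wrong stage silently gets no segments synced; with the strict variant, the same mistake fails at the call site.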

@@ -268,7 +268,9 @@ public void copyFrom(Directory from, String src, String dest, IOContext context)
    in.copyFrom(new FilterDirectory(from) {
        @Override
        public IndexInput openInput(String name, IOContext context) throws IOException {
            index.addFileDetail(dest, l, false);
            if (index.getFileDetails(dest) == null) {
Member

Just a question: was this check added to dedup file downloads, or is it for stats?

Collaborator

Why do we need to add this? Is it for retries of the same file?

Member Author

This check is for stats. If I report the same file twice, the assertion in the addFileDetails method fails.

We report all the files to be downloaded (a new code block is added in the IndexShard class) before the actual transfer starts.

Collaborator

Should we then remove the assert? It looks harmless.

Member Author

Those asserts make sure that we don't add the same file twice to the recovery tracker. IMO, that is the correct behavior. Currently, in our restore flow, we call IndexShard.syncSegmentsFromRemoteSegmentStore twice. Even though we skip downloading the files the second time, without the above check they would get added to the tracker again.
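
The behaviour described in this thread can be illustrated with a minimal sketch. `FileDetailTracker` is a hypothetical stand-in for the recovery tracker, not the OpenSearch class, and `addIfAbsent` and `trackedFiles` are names invented for this example. It throws `IllegalStateException` where the discussion mentions an assert, so the example behaves the same with or without the `-ea` JVM flag.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a recovery file tracker that rejects duplicate registrations. */
final class FileDetailTracker {
    private final Map<String, Long> fileLengths = new HashMap<>();

    /** Registers a file once; registering the same name again is treated as a bug. */
    void addFileDetail(String name, long length) {
        if (fileLengths.containsKey(name)) {
            throw new IllegalStateException("file [" + name + "] was already added");
        }
        fileLengths.put(name, length);
    }

    /** Returns the recorded length, or null if the file has not been registered. */
    Long getFileDetails(String name) {
        return fileLengths.get(name);
    }

    /** Caller-side guard in the spirit of the null check above: safe to call on a second sync pass. */
    void addIfAbsent(String name, long length) {
        if (getFileDetails(name) == null) {
            addFileDetail(name, length);
        }
    }

    int trackedFiles() {
        return fileLengths.size();
    }
}
```

Calling `addIfAbsent` for the same file on a second pass leaves the tracker unchanged, while a direct second `addFileDetail` call still fails, so the duplicate protection is kept.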

Member

@ashking94 ashking94 left a comment

LGTM at a high level. I have left some minor comments and questions.

@sachinpkale sachinpkale force-pushed the recovery-file-details branch from 4306497 to f15b02d Compare March 30, 2023 07:28
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Sachin Kale <[email protected]>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Sachin Kale <[email protected]>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

Collaborator

@gbbafna gbbafna left a comment

LGTM. 1 minor comment.

@gbbafna gbbafna merged commit e12a5b9 into opensearch-project:main Apr 4, 2023
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Apr 4, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 4, 2023
…gments from remote store (#6825)

* Use existing StatsDirectoryWrapper to record recovery stats

Signed-off-by: Sachin Kale <[email protected]>
(cherry picked from commit e12a5b9)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Apr 4, 2023
…gments from remote store (#6825) (#6976)

* Use existing StatsDirectoryWrapper to record recovery stats

(cherry picked from commit e12a5b9)

Signed-off-by: Sachin Kale <[email protected]>
mitrofmep pushed a commit to mitrofmep/OpenSearch that referenced this pull request Apr 5, 2023
…gments from remote store (opensearch-project#6825)

* Use existing StatsDirectoryWrapper to record recovery stats

Signed-off-by: Sachin Kale <[email protected]>
Signed-off-by: Valentin Mitrofanov <[email protected]>
Labels
backport 2.x Backport to 2.x branch skip-changelog
4 participants