
Performance issue when primary replica is outdated #515

Closed
windkit opened this issue Nov 14, 2016 · 3 comments

Comments

@windkit
Contributor

windkit commented Nov 14, 2016

Description

When the primary replica is outdated (primary inconsistency), the client experiences a long latency.
This is because leo_storage_read_repairer waits for ?DEF_REQ_TIMEOUT (30 s).

Details

  1. Put Object
    $ s3cmd put test1 s3://test/1
    WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
    upload: 'test1' -> 's3://test/1'  [1 of 1]
     139073 of 139073   100% in    0s     4.97 MB/s  done
    
  2. List Replicas
    $ leofs-adm whereis test/1
    -------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
     del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
    -------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
           | [email protected]      | b0111fb7e79e29994e595b820e1ce691     |       136K |   451f713c99 | false          |              0 | 5413f0919bc71  | 2016-11-14 17:57:56 +0900
           | [email protected]      | b0111fb7e79e29994e595b820e1ce691     |       136K |   451f713c99 | false          |              0 | 5413f0919bc71  | 2016-11-14 17:57:56 +0900
           | [email protected]      | b0111fb7e79e29994e595b820e1ce691     |       136K |   451f713c99 | false          |              0 | 5413f0919bc71  | 2016-11-14 17:57:56 +0900
    
  3. Stop the primary node ([email protected])
    192.168.100.40:leo_storage/bin$ ./leo_storage stop
    
  4. Overwrite Object
    $ s3cmd put testfile_5m s3://test/1
    WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
    upload: 'testfile_5m' -> 's3://test/1'  [1 of 1]
     5242880 of 5242880   100% in    0s    42.64 MB/s  done
    
  5. Restart Node
    192.168.100.40:leo_storage/bin$ ./leo_storage start
    
  6. List Replicas
    $ leofs-adm whereis test/1
    -------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
     del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
    -------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
           | [email protected]      | b0111fb7e79e29994e595b820e1ce691     |       136K |   451f713c99 | false          |              0 | 5413f0919bc71  | 2016-11-14 17:57:56 +0900
           | [email protected]      | b0111fb7e79e29994e595b820e1ce691     |      5120K |   a4cf6430bd | false          |              0 | 5413f18a25370  | 2016-11-14 18:02:15 +0900
           | [email protected]      | b0111fb7e79e29994e595b820e1ce691     |      5120K |   a4cf6430bd | false          |              0 | 5413f18a25370  | 2016-11-14 18:02:15 +0900
    
  7. Get Object
    $ s3cmd get s3://test/1 dl
    download: 's3://test/1' -> 'dl'  [1 of 1]
     5242880 of 5242880   100% in   30s   169.86 kB/s  done
    

Related Logs

At [email protected]

[W]     [email protected]       2016-11-14 18:04:16.790098 +0900        1479114256      leo_storage_read_repairer:compare/4     167     [{node,'[email protected]'},{addr_id,234033039629873983006165090470339602065},{key,<<"test/1">>},{clock,1479113874979953},{cause,primary_inconsistency}]
[W]     [email protected]       2016-11-14 18:04:16.791279 +0900        1479114256      leo_storage_read_repairer:compare/4     167     [{node,'[email protected]'},{addr_id,234033039629873983006165090470339602065},{key,<<"test/1">>},{clock,1479113874979953},{cause,primary_inconsistency}]
[W]     [email protected]       2016-11-14 18:04:46.792108 +0900        1479114286      leo_storage_read_repairer:loop/6        124     [{key,<<"test/1">>},{cause,timeout}]

At Gateway

[W]     [email protected]       2016-11-14 18:04:48.258088 +0900        1479114288      leo_gateway_rpc_handler:handle_error/5  300     [{node,'[email protected]'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
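The primary_inconsistency warnings above are emitted when read repair finds that the copy held by the primary has an older clock than another replica. A minimal sketch of that comparison (Python rather than the project's Erlang; the function and the secondary_inconsistency branch are illustrative assumptions, only the primary_inconsistency outcome and the clock values mirror the output above):

```python
# Illustrative read-repair clock comparison. Names are hypothetical;
# only the primary_inconsistency outcome matches the logs above.

def compare(primary_clock, replica_clock):
    if replica_clock == primary_clock:
        return "ok"
    if replica_clock > primary_clock:
        # A replica holds a newer version than the primary: the copy
        # the primary serves is stale.
        return "primary_inconsistency"
    return "secondary_inconsistency"  # the replica itself is stale

# Clock values from the whereis output above (hex 'clock' column):
stale_primary = 0x5413f0919bc71   # [email protected] after restart
overwritten   = 0x5413f18a25370   # the re-uploaded copies

print(compare(stale_primary, overwritten))  # -> primary_inconsistency
```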
@windkit
Contributor Author

windkit commented Nov 14, 2016

Currently there is a bug in leo_storage_read_repairer: the loop cannot exit normally and will wait for the timeout.
https://github.com/leo-project/leo_storage/blob/develop/src/leo_storage_read_repairer.erl#L108

For example, with 3 replicas, a read quorum of 2, and an inconsistent primary replica:

    Event                              Args
    Initial                            NumOfNodes = 3, R = 2, length(E) = 0
    primary                            NumOfNodes = 3, R = 1, length(E) = 0
    secondary (primary_inconsistent)   NumOfNodes = 3, R = 1, length(E) = 1
    secondary (primary_inconsistent)   NumOfNodes = 3, R = 1, length(E) = 2

The loop then waits for the timeout (30 s).
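The stuck state can be replayed with a minimal quorum-loop sketch (Python rather than the actual Erlang; the state names follow the table above, and the exit conditions are my assumption about the pre-fix loop):

```python
# Sketch of a read-repair quorum loop that can never exit normally.
# NumOfNodes, R, E mirror the state table above; the exit conditions
# are an assumed simplification of the pre-fix loop.

def loop_can_exit(num_of_nodes, r, errors):
    """Assumed pre-fix exit test: succeed when R reaches 0, fail only
    when the remaining non-error nodes can no longer satisfy R."""
    if r == 0:
        return "ok"
    if num_of_nodes - errors < r:
        return "error"
    return None  # keep waiting for more replies

# Replay the event sequence above: N = 3, R = 2, the primary replies
# first, then two secondaries report primary_inconsistent.
state = dict(num_of_nodes=3, r=2, errors=0)
for event in ["primary", "secondary_inconsistent", "secondary_inconsistent"]:
    if event == "primary":
        state["r"] -= 1        # primary counts toward the read quorum
    else:
        state["errors"] += 1   # inconsistent replies count as errors
    print(event, state, loop_can_exit(**state))

# Final state: r = 1, errors = 2, so num_of_nodes - errors = 1 >= r.
# Neither exit condition fires, no replies remain, and the loop blocks
# until the 30 s ?DEF_REQ_TIMEOUT expires.
```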

@windkit
Contributor Author

windkit commented Nov 15, 2016

I have created a PR for this issue at
leo-project/leo_storage#18
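One way to close the gap is to also track how many replicas have replied and give up as soon as all replies are in, instead of blocking on the timeout. This is a hedged sketch of that idea only; the actual change in leo-project/leo_storage#18 may differ:

```python
# Sketch of a fixed exit test: exit once every replica has replied,
# even if neither quorum nor the error limit was reached. This is an
# assumed shape of the fix, not the code from the PR.

def loop_can_exit_fixed(num_of_nodes, r, errors, replies):
    if r == 0:
        return "ok"
    if num_of_nodes - errors < r:
        return "error"
    if replies == num_of_nodes:
        return "error"  # all replies in, quorum unmet: fail fast
    return None

# The stuck state from the previous comment (r = 1, errors = 2) now
# exits immediately once the third reply has been counted:
print(loop_can_exit_fixed(3, 1, 2, 3))
```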

@windkit
Contributor Author

windkit commented Dec 2, 2016

DONE

@windkit windkit closed this as completed Dec 2, 2016