Add PR=quorum in HEADs at GC and PW=1 at writing blocks [RCS-282] #1242

Merged
merged 1 commit into from
Sep 11, 2015

kuenishi
Contributor

@kuenishi kuenishi commented Sep 8, 2015

This commit addresses the following potential issues:

  • False notfound: in a faulty cluster with several nodes down, the
    block server can mistake an unreachable block for one that was
    already deleted, which may leak the block (the PR default is 0)
  • Make sure at least one replica of each block is written to a
    primary node (the PW default is 0)
  • Prevent the block server from crashing when a HEAD on a block
    returns an error other than notfound - this lets GC skip unreachable
    blocks and move on to collect as many of the remaining blocks and
    manifests as possible

The price of this commit: with PR and PW raised for blocks, the
availability of PUTs and the throughput of garbage collection may be
lower than in previous releases of Riak CS. A few unreachable Riak
nodes may prevent PUT requests from returning successfully, or prevent
garbage collection from collecting all blocks until those nodes
come back.

We could have used PW=2 for block writes, but chose not to because it
would reduce availability further, and we expect such blocks to be
repaired during the leeway period.
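
To make the trade-off above concrete, here is a minimal sketch of the quorum arithmetic and of the HEAD handling this description implies. It is not code from riak_cs_block_server.erl: the module name and every function in it are hypothetical, and the `delete` / `already_deleted` / `{skip, Reason}` outcomes are only one reading of the bullets above.

```erlang
%% Illustrative sketch only. Module and function names are hypothetical
%% and do not exist in Riak CS.
-module(block_quorum_sketch).
-export([quorum/1, can_put/2, can_head/2, head_outcome/1]).

%% Majority quorum for N replicas: N div 2 + 1 (2 when n_val is 3).
quorum(N) when is_integer(N), N > 0 ->
    N div 2 + 1.

%% A block PUT with the given PW can succeed only if at least PW
%% primary vnodes are reachable.
can_put(UpPrimaries, PW) ->
    UpPrimaries >= PW.

%% A HEAD with PR=quorum can be answered only if a majority of the
%% primary vnodes are reachable.
can_head(UpPrimaries, NVal) ->
    UpPrimaries >= quorum(NVal).

%% Assumed mapping from a HEAD result to what GC does with the block:
%% a found block is deleted, notfound means it is already gone, and any
%% other error is reported so GC can skip the block instead of crashing.
head_outcome({ok, _Object}) ->
    delete;
head_outcome({error, notfound}) ->
    already_deleted;
head_outcome({error, Reason}) ->
    {skip, Reason}.
```

Assuming n_val = 3 and a single reachable primary, `can_put(1, 1)` is true while `can_head(1, 3)` is false: the write still goes through, but GC gets an error for that block and skips it rather than trusting a notfound answered without a quorum of primaries.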

@kuenishi kuenishi added this to the 2.1.0 milestone Sep 8, 2015
@kuenishi kuenishi changed the title Add PR=quorum in HEADs at GC and PW=1 at writing blocks Add PR=quorum in HEADs at GC and PW=1 at writing blocks [RCS-282] Sep 8, 2015
@shino
Contributor

shino commented Sep 10, 2015

Not much time right now, so sorry for not digging into the details. GC seems to hang when executing a batch on a 4-node cluster with two stopped nodes.

@shino
Contributor

shino commented Sep 11, 2015

Quick diff:

--- a/src/riak_cs_block_server.erl
+++ b/src/riak_cs_block_server.erl
@@ -219,7 +219,8 @@ handle_cast({delete_block, ReplyPid, Bucket, Key, UUID, BlockNumber}, State=#sta
                 %% Report errors in HEADs to prevent crashing block
                 %% servers, as crash logs forces lager to sync log
                 %% files at each line.
-                Result = format_delete_result(Error, BlockNumber),
+                Result = format_delete_result(Error, {UUID, BlockNumber}),

@kuenishi kuenishi force-pushed the feature/pr-quorum-block-server branch from b040e56 to 1bf2890 Compare September 11, 2015 05:30
@kuenishi kuenishi force-pushed the feature/pr-quorum-block-server branch from 1bf2890 to 2def697 Compare September 11, 2015 05:33
@shino
Contributor

shino commented Sep 11, 2015

Much safer (= less chance of a false negative) than before 🔒

borshop added a commit that referenced this pull request Sep 11, 2015
Add PR=quorum in HEADs at GC and PW=1 at writing blocks [RCS-282]

Reviewed-by: shino
@kuenishi
Contributor Author

@borshop merge

@borshop borshop merged commit 2def697 into develop Sep 11, 2015
@shino shino deleted the feature/pr-quorum-block-server branch September 11, 2015 07:38
@kuenishi
Contributor Author

For the release notes: the GitHub issue has the full description of this fix.

@Basho-JIRA

For the release notes: the GitHub issue has the full description of this fix.

_[posted via JIRA by Kota Uenishi]_
