Unusual failure handling for pull & push execution with replication > 1 #2048

onderkalaci · 2018-03-09T13:33:32Z

Although this behavior is expected, I'd like to open an issue to keep track of it.

With replication factor 2, Citus handles the failure and still returns the result even if a worker is down. It looks like the decision we've taken for pull & push execution breaks that. See references [1], [2].

See the steps:

-- create a distributed table with replication factor = 2
SET citus.shard_replication_factor TO 2;
CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 int, value_3 float, value_4 bigint);
SELECT create_distributed_table('users_table', 'user_id');

-- generate some random data
INSERT INTO users_table SELECT (i * random())::int % 10000, timestamp '2014-01-10 20:00:00' +
       random() * (timestamp '2014-01-20 20:00:00' -
                   timestamp '2014-01-10 10:00:00'),(i * random())::int % 10000, (i * random())::int % 10000, (i * random())::int % 10000 FROM generate_series(0, 10000) i;


-- stop one of the workers
pg-latest/bin/pg_ctl -D citus-installation/data/ -m i stop

-- run a real-time query, it'll get the results
SELECT count(*) FROM users_table ;
WARNING:  connection error: 10.192.0.174:5432
DETAIL:  could not connect to server: Connection refused
	Is the server running on host "10.192.0.174" and accepting
	TCP/IP connections on port 5432?
WARNING:  connection error: 10.192.0.174:5432
-- some more warnings
 count 
-------
 10001
(1 row)

-- now, run the same query via pull & push execution
SELECT * FROM (SELECT count(*) FROM users_table OFFSET 0) as foo;
DEBUG:  generating subplan 51_1 for subquery SELECT count(*) AS count FROM public.users_table OFFSET 0
DEBUG:  Plan 51 query after replacing subqueries and CTEs: SELECT count FROM (SELECT intermediate_result.count FROM read_intermediate_result('51_1'::text, 'binary'::citus_copy_format) intermediate_result(count bigint)) foo
WARNING:  connection error: 10.192.0.174:5432
DETAIL:  could not send data to server: Connection refused
could not send SSL negotiation packet: Connection refused
ERROR:  failure on connection marked as essential: 10.192.0.174:5432

The reason is that we've marked the connections for pushing results back as critical, which leads to this issue.
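
To make the difference concrete, here is a minimal conceptual sketch in Python. It is not Citus code: the function names, the worker names, and the assumption that the intermediate result has to reach every worker are all hypothetical, chosen only to model the two behaviors shown above. A replicated read can fall back to another placement, while a push step that treats each connection as essential fails as soon as one worker is unreachable.

```python
# Conceptual model only; NOT Citus source code. All names are hypothetical.

def read_shard(placements, alive):
    """Replicated read: try each placement in order, fail only if all are down."""
    for node in placements:
        if node in alive:
            return node  # the first reachable replica serves the read
    raise RuntimeError("all placements of the shard are unreachable")


def broadcast_result(workers, alive):
    """Push step (assumed to target every worker): each connection is essential."""
    for node in workers:
        if node not in alive:
            # No fallback here: one unreachable worker aborts the whole step.
            raise RuntimeError(
                f"failure on connection marked as essential: {node}"
            )
    return list(workers)


if __name__ == "__main__":
    workers = ["worker-1", "worker-2"]
    alive = {"worker-1"}  # worker-2 has been stopped
    # Replication factor 2: every shard has a placement on both workers.
    shards = {"shard_a": ["worker-1", "worker-2"],
              "shard_b": ["worker-2", "worker-1"]}

    for name, placements in shards.items():
        print(name, "read from", read_shard(placements, alive))

    try:
        broadcast_result(workers, alive)
    except RuntimeError as exc:
        print("push failed:", exc)
```

In this model the reads succeed because each shard still has one live placement, whereas the push step has no alternative target for the stopped worker and therefore errors out.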

lithp mentioned this issue May 16, 2018

Mitmproxy-based automated failure testing #2119

Merged

16 tasks

metdos mentioned this issue Jul 8, 2018

Automate Remaining Failure/Cancellation Tests #2262

Open

5 tasks

olegpelipenko mentioned this issue Jul 18, 2018

Unusual failure handling on write operations when remote node is unavailable and replication_shard == 2 #2290

Closed
