Unusual failure handling for pull & push execution with replication > 1 #2048

Open
onderkalaci opened this issue Mar 9, 2018 · 0 comments

Although this behavior is expected, I'd like to open an issue to keep track of it.

With replication factor 2, Citus normally tolerates a worker being down and still returns the result. It looks like the decision we've taken for pull & push execution breaks that. See references [1], [2].

Steps to reproduce:

-- create a distributed table with replication factor = 2
SET citus.shard_replication_factor TO 2;
CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 int, value_3 float, value_4 bigint);
SELECT create_distributed_table('users_table', 'user_id');

-- generate some random data
INSERT INTO users_table
SELECT (i * random())::int % 10000,
       timestamp '2014-01-10 20:00:00' +
           random() * (timestamp '2014-01-20 20:00:00' -
                       timestamp '2014-01-10 10:00:00'),
       (i * random())::int % 10000,
       (i * random())::int % 10000,
       (i * random())::int % 10000
FROM generate_series(0, 10000) i;


-- stop one of the workers
pg-latest/bin/pg_ctl -D citus-installation/data/ -m i stop

-- run a real-time query, it'll get the results
SELECT count(*) FROM users_table ;
WARNING:  connection error: 10.192.0.174:5432
DETAIL:  could not connect to server: Connection refused
	Is the server running on host "10.192.0.174" and accepting
	TCP/IP connections on port 5432?
WARNING:  connection error: 10.192.0.174:5432
-- some more warnings
 count 
-------
 10001
(1 row)

-- now, run the same query via pull & push execution
SELECT * FROM (SELECT count(*) FROM users_table OFFSET 0) as foo;
DEBUG:  generating subplan 51_1 for subquery SELECT count(*) AS count FROM public.users_table OFFSET 0
DEBUG:  Plan 51 query after replacing subqueries and CTEs: SELECT count FROM (SELECT intermediate_result.count FROM read_intermediate_result('51_1'::text, 'binary'::citus_copy_format) intermediate_result(count bigint)) foo
WARNING:  connection error: 10.192.0.174:5432
DETAIL:  could not send data to server: Connection refused
could not send SSL negotiation packet: Connection refused
ERROR:  failure on connection marked as essential: 10.192.0.174:5432

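The reason is that we've marked the connections used for pushing results back to the workers as critical ("essential" in the error message). A failure on any one of them aborts the whole query, even though the other replica of each shard is still reachable.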
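To make the difference between the two paths concrete, here is a toy model in Python. It is not Citus code: `Worker`, `read_shard`, and `push_intermediate_result` are hypothetical names, and it assumes the subplan result is sent to every worker, which is my reading of the behavior above rather than something taken from the source.

```python
from dataclasses import dataclass


class QueryFailed(Exception):
    """Raised when the toy coordinator has to abort the query."""


@dataclass(frozen=True)
class Worker:
    name: str
    up: bool


def read_shard(placements):
    """Read one shard, failing over to the next replica on a connection error."""
    for worker in placements:
        if worker.up:
            return worker.name
    raise QueryFailed("no reachable placement for shard")


def push_intermediate_result(workers):
    """Send a subplan result to every worker (assumption); each connection is essential."""
    for worker in workers:
        if not worker.up:
            raise QueryFailed(
                f"failure on connection marked as essential: {worker.name}"
            )
    return [worker.name for worker in workers]


worker_a = Worker("worker-a", up=True)
worker_b = Worker("worker-b", up=False)  # the stopped worker

# Replication factor 2: every shard has a placement on both workers.
shards = {1: [worker_b, worker_a], 2: [worker_a, worker_b]}

# Plain reads succeed because each shard falls back to the live replica.
served_by = {shard_id: read_shard(p) for shard_id, p in shards.items()}

# Pushing the intermediate result has no fallback, so one dead worker aborts it.
try:
    push_intermediate_result([worker_a, worker_b])
    push_outcome = "ok"
except QueryFailed as exc:
    push_outcome = str(exc)

print(served_by)
print(push_outcome)
```

In this sketch the read path has a fallback per shard while the push path has none, which is the asymmetry the repro shows.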
