
Mitmproxy-based automated failure testing #2119

Merged: 1 commit merged into master on Jul 6, 2018

Conversation

@lithp (Contributor) commented Apr 19, 2018

Some slides explaining how this works: https://docs.google.com/presentation/d/1zxF32GvcFJp0s6UClL39hla9WqhbGvYgo30e2WYnADQ/edit?usp=sharing

A modernized #2044. This one works at the network level instead of the process level. One drawback: we don't have a good way of exposing timing problems. For example, if a packet comes in before the process is ready for it, the process might crash. That class of bugs seems rare to me, so I think this approach is safe to move forward with.

It's also much easier! This doesn't require sudo, just an extra daemon which packets flow through. It also has the advantage of not caring at all how the process runs, so the same tests should keep working regardless of any refactoring we choose to do.
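
For intuition, here's a rough sketch of the kind of pass-through daemon the packets flow through. This is not the PR's actual implementation (that's a mitmproxy script); it only illustrates the network-level approach, using the same ports as the README example later in this PR (the coordinator is pointed at 9702, the worker listens on 9700):

import asyncio

WORKER_HOST, WORKER_PORT = "localhost", 9700   # where the worker really listens
LISTEN_PORT = 9702                             # where the coordinator is pointed

async def pump(reader, writer):
    # Copy bytes in one direction; a failure-testing proxy would inspect,
    # drop, or delay packets here instead of blindly forwarding them.
    while True:
        data = await reader.read(4096)
        if not data:
            break
        writer.write(data)
        await writer.drain()
    writer.close()

async def handle_client(client_reader, client_writer):
    worker_reader, worker_writer = await asyncio.open_connection(WORKER_HOST, WORKER_PORT)
    await asyncio.gather(
        pump(client_reader, worker_writer),   # coordinator -> worker
        pump(worker_reader, client_writer),   # worker -> coordinator
    )

async def main():
    server = await asyncio.start_server(handle_client, "localhost", LISTEN_PORT)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())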

Some work left to be done:

  • There's a race condition which causes mitmproxy to sometimes return 'NoneType' object has no attribute 'tell'. What is going wrong?
  • Should we really be using a fifo? Maybe a socket makes more sense? A hostname:port is much more portable (though we would still need some way of blocking to make sure changes have been made, and would have to rely on a tool such as netcat in order to keep using COPY). A rough sketch of this idea follows after this list.
  • Refactor fluent.py into something more maintainable, the double-queue solution probably isn't optimal.
  • The Handler and Mixin hierarchy is cute but there's definitely a better way. Separating out the Building and the Handling seems like it'd improve things.
  • Convert all of our manual tests into automated tests under this framework!
  • Either pg_regress_multi.pl or fluent.py should redirect the proxy output to a log file
  • How hard would it be to extend this to cancellation testing?
  • Instead of bumping the Citus version, add citus.mitmproxy() to the test functions?
  • Currently it's only possible to fail one of the workers, should we add a second mitmproxy so we can play with that worker as well?
  • DROP TABLE crashes if you drop the connection after DROP TABLE IF EXISTS is sent. Open a ticket!
  • What do we do about test output which includes PIDs?
  • Profile this and figure out how to improve performance? 6 tests take 10 seconds to run.
  • recorder.dump() returns an empty line, fix that!
  • Some of these tests could be made faster. Instead of recreating tables with different settings of shard_replication_factor, we can start by using the higher replication factor, then manually set some of the shards inactive.
  • Make failure_drop_table consistent: prepared transactions have random ids; introduce something like enable_unique_job_ids -> enable_unique_transaction_ids
  • Make the failure_task_tracker_executor test reproducible (don't hard-code filepaths)
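
On the fifo-vs-socket question above: the "blocking until changes have been made" requirement could be met by having the proxy only acknowledge a control command after the new rules are installed. Here's a rough sketch of that idea, purely illustrative (the merged code keeps the fifo; the port and apply_command() are made up):

import socketserver

def apply_command(command):
    # Placeholder: in the real proxy this would install new failure-injection rules.
    print("applying:", command)

class ControlHandler(socketserver.StreamRequestHandler):
    def handle(self):
        command = self.rfile.readline().decode().strip()
        apply_command(command)        # block until the rules are actually in place...
        self.wfile.write(b"done\n")   # ...and only then acknowledge to the caller

if __name__ == "__main__":
    with socketserver.TCPServer(("localhost", 9703), ControlHandler) as server:
        server.serve_forever()

The SQL side could then still drive it through COPY by piping the command to netcat (something like COPY ... TO PROGRAM 'nc localhost 9703'), as the item above suggests.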

@lithp (Contributor Author) commented Apr 25, 2018

COPY with a mkfifo works great, but it relies on #2127 to somehow be fixed.

@lithp force-pushed the mitmproxy-failure-testing branch 4 times, most recently from 8a24953 to cade93b on April 26, 2018 19:54
codecov bot commented Apr 27, 2018

Codecov Report

Merging #2119 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master   #2119      +/-   ##
=========================================
+ Coverage    93.7%   93.7%   +<.01%     
=========================================
  Files         110     110              
  Lines       27446   27453       +7     
=========================================
+ Hits        25719   25726       +7     
  Misses       1727    1727

@lithp force-pushed the mitmproxy-failure-testing branch 3 times, most recently from cbfd90f to 05108bb on May 1, 2018 23:54
@lithp (Contributor Author) commented May 2, 2018

It's finally running (albeit in a rather hacky way) and passing on Travis!

@lithp force-pushed the mitmproxy-failure-testing branch from 05108bb to fedadb2 on May 2, 2018 20:40
@lithp (Contributor Author) commented May 2, 2018

Rebased onto master

@lithp (Contributor Author) commented May 2, 2018

Clearing out some outdated TODOs:

  • rebase onto master
  • Jason brought up a good point: We could remove the dependency on plperlu by using COPY to write to our fifo.
  • The fifo location is hard-coded, use some kind of path which always works. Maybe make it a relative one?
  • Does the travis postgres have plperlu? Is that really something we want to be adding a dependency on?
  • Break out the tests into a new make target: make check-failure
  • Somehow install mitmproxy on travis
  • Have pg_regress_multi.pl start the mitmproxy daemon.
  • Maybe we can work around "COPY inside plpgsql functions fails after first invocation" (#2127) by using EXECUTE so it plans the query afresh every time?

@lithp (Contributor Author) commented May 2, 2018

Cancellation testing

The proxy sits at the network layer instead of acting directly on the binary. This means it's more stable under refactoring, but also means we have much less control over what the binary is doing when we take some action. If the action is sending a signal (to cancel the binary), it's likely that precise timing will be more important. We'll want to send a signal right at some critical section. So, maybe cancellation testing is better handled a different way.

For instance, we could embed macro calls like

SIGNAL_INJECTION_POINT("after_begin")

which raise a signal if the test script has previously called

citus.interrupt_at('after_begin')

With mitmproxy

If mitmproxy is told the pid of the backend, it could send a signal when asked to. We could embed a script like flow.contains(b"SELECT").interrupt(19234), which you would call with the result of pg_backend_pid().
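
A very rough sketch of what that could look like as a mitmproxy addon. The interrupt() command doesn't exist yet; the class name is made up, and the pid (19234, the number from the example above) would really come from pg_backend_pid():

import os
import signal

from mitmproxy import tcp

class InterruptOnMatch:
    # Send SIGINT (query cancel, like Ctrl-C in psql) to the given backend the
    # first time a client-to-server packet containing `needle` passes through.
    def __init__(self, needle: bytes, pid: int):
        self.needle = needle
        self.pid = pid
        self.fired = False

    def tcp_message(self, flow: tcp.TCPFlow):
        # mitmproxy calls this for every raw TCP message on an intercepted flow.
        message = flow.messages[-1]
        if not self.fired and message.from_client and self.needle in message.content:
            os.kill(self.pid, signal.SIGINT)
            self.fired = True

# Loaded via mitmdump -s <this file>, running in the same reverse-TCP mode as the failure proxy.
addons = [InterruptOnMatch(b"SELECT", 19234)]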

@lithp (Contributor Author) commented May 2, 2018

Here's the traceback when the race condition happens:

Traceback (most recent call last):
  File "/home/brian/.local/share/virtualenvs/regress-MQrw9tUE/bin/mitmdump", line 11, in <module>
    sys.exit(mitmdump())
  File "/home/brian/.local/share/virtualenvs/regress-MQrw9tUE/lib/python3.5/site-packages/mitmproxy/tools/main.py", line 155, in mitmdump
    m = run(dump.DumpMaster, cmdline.mitmdump, args, extra)
  File "/home/brian/.local/share/virtualenvs/regress-MQrw9tUE/lib/python3.5/site-packages/mitmproxy/tools/main.py", line 122, in run
    master.run()
  File "/home/brian/.local/share/virtualenvs/regress-MQrw9tUE/lib/python3.5/site-packages/mitmproxy/master.py", line 93, in run
    self.tick(0.1)
  File "/home/brian/.local/share/virtualenvs/regress-MQrw9tUE/lib/python3.5/site-packages/mitmproxy/master.py", line 109, in tick
    self.addons.handle_lifecycle(mtype, obj)
  File "/home/brian/.local/share/virtualenvs/regress-MQrw9tUE/lib/python3.5/site-packages/mitmproxy/addonmanager.py", line 238, in handle_lifecycle
    self.trigger(name, message)
  File "/home/brian/.local/share/virtualenvs/regress-MQrw9tUE/lib/python3.5/site-packages/mitmproxy/addonmanager.py", line 282, in trigger
    with safecall():
  File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__
    return next(self.gen)
  File "/home/brian/.local/share/virtualenvs/regress-MQrw9tUE/lib/python3.5/site-packages/mitmproxy/addonmanager.py", line 60, in safecall
    tell = ctx.master.tell
AttributeError: 'NoneType' object has no attribute 'tell'

@lithp force-pushed the mitmproxy-failure-testing branch from c1dd90f to aa71c3d on May 16, 2018 00:00
@lithp (Contributor Author) commented May 16, 2018

All the tests from our test document:

  • Fail some nodes while running a query, but all shards have at least one active placement
    • Do this with real time executor
      • aggregates
      • simple joins
      • table fetching queries
      • functions
    • Do this with task tracker executor queries (Look at the task tracker design document for identifying/testing various failure scenarios)
      • aggregates
      • simple joins
      • re-partition joins
    • Do this with router executor queries
      • Aggregates
      • window functions
      • simple-key value lookups
      • recursively planned queries
  • Fail some nodes while running a query, one shard doesn’t have any active placements (N/A, I'm rolling this into the above test)
  • COPY on distributed tables (1PC)
    • Test with
      • hash
      • append
      • range
    • Stop a worker before COPY starts;
      • If there is one replica, COPY should error out, placement should stay healthy.
      • If there is more than one replica, COPY should not error out, placements should be marked as INACTIVE. (This test is only applicable for hash distributed tables, otherwise the query errors out)
    • Stop a worker during COPY;
      • If there is one replica, COPY should error out, placement should stay healthy.
      • If there is more than one replica, COPY should error out, placement should stay healthy.
    • Stop a worker during COMMIT;
      • If there is one replica, COPY should give a warning, some of the data is already committed, placements should stay healthy.
      • If there is more than one replica, COPY should not error out, placement should be marked as INACTIVE.
      • If data is corrupted (i.e., 1 row is erroneous), COPY errors out and no data is ingested to the workers
  • COPY on hash partitioned tables with default values (2PC):
    • Stop a worker before COPY starts;
      • If there is one replica, COPY should error out, placement should stay healthy.
      • If there is more than one replica, COPY should not error out, placements should be marked as INACTIVE.
    • Stop a worker during COPY;
      • If there is one replica, COPY should error out, placement should stay healthy.
      • If there is more than one replica, COPY should error out, placement should stay healthy.
    • Stop a worker during COMMIT;
      • If there is one replica, COPY should give a warning,
        • Wait for citus.recover_2pc_interval and see that prepared transactions are rolled back on the workers via: SELECT run_command_on_workers('SELECT count(*) FROM pg_prepared_xacts');
        • You can mostly do that with gdb, stopping at StartRemoteTransactionCommit() on the calls after the first one
      • If there is more than one replica, COPY should not error out, placement should be marked as INACTIVE
  • Copy on reference tables
    • Always use 2PC, never mark any placement as INACTIVE
      • Example: If a table has a unique constraint and we COPY the same row twice, COPY errors out, no data is ingested to the workers, and no placement is marked as inactive.
  • COPY on append partitioned failures
    • Shard creation fails: Create a table with a composite type and don’t create the composite type on one node. Verify COPY handles the failure as expected.
    • Min/Max fetching fails: In the above, don’t create the min/max function for the composite type on the workers. Verify COPY handles the failure as expected.
    • Test COPY in transaction & ROLLBACK
  • Fail nodes during master_create_empty_shard
    • Fail one node and verify that it is able to create a shard. Verify metadata.
    • Fail several nodes such that active nodes < shard_replication_factor and verify that no metadata is inserted.
    • Test master_create_empty_shard in transaction & ROLLBACK
  • Fail nodes during master_append_table_to_shard
    • Fail all nodes having the shard and verify that we error out and no metadata is added
    • Fail some nodes having the shard and verify it is marked as invalid and metadata is updated accordingly
    • Ctrl-C during master_append_table_to_shard
    • Test master_append_table_to_shard in transaction & ROLLBACK
  • Fail nodes during master_apply_delete_command
    • Verify that the transaction fails, no metadata changes happen, and none of the shard placements are dropped at all
    • Verify that pg_dist_shard_placement row is not deleted.
    • Verify that shards on nodes which are up are not deleted and their metadata is not removed.
    • Test master_apply_delete_command & ROLLBACK
  • Fail some nodes while running a subselect
  • Fail some nodes while inserting/updating/deleting rows
    • Test BEGIN, inserting/updating/deleting rows, ROLLBACK
    • Also test multiple shard inserts
  • Fail on create_distributed_table
    • The whole transaction should be rolled back in any case: with replication factor 1 or greater, with 2PC or 1PC enabled, inside a transaction or not
    • Test BEGIN; create_worker_shards(); ROLLBACK
  • Fail a node during create_distributed_table(..., colocate_with => '...')
    • Test with all colocate_with options
  • Fail nodes during create_reference_table()
  • Fail nodes during DDL command propagation
    • Verify that if a node fails before executing the DDL command, then the command can’t be executed.
    • Verify that if a node fails during execution, or if you get a postgres error on a particular shard and not the others, we error out and ROLLBACK.
  • Fail nodes during truncate command
  • Fail nodes during insert into/select command
  • Fail nodes during insert into/select command (when data is pulled to master)
  • Fail nodes during add/remove/disable node with and without reference tables
  • Fail nodes during modification to reference tables. Whole transaction should fail, no data should be inserted and no metadata update should happen.
  • Fail a node during MX metadata sync
    • Test during create_distributed_table, ddl changes, mark_tables_colocated, DROP TABLE
  • Fail a node during creating distributed table from local NON-EMPTY table
    • Check metadata, check if workers have corrupted data etc.
  • Fail a node during creating distributed table from partitioned table
    • Check metadata, check if workers have corrupted data etc.
  • Fail a node when creating a partition of a distributed-partitioned table
    • Check metadata, check if workers have corrupted data etc.
  • Fail a MX node during DROP SEQUENCE
  • Fail a node while running table size functions
    • This is probably okay. Size functions do not run when replication factor is greater than 1. They also fail when a single node is down.
  • Fail a node during ALTER EXTENSION citus UPDATE
  • Fail a node during master_add_node, master_activate_node, master_disable_node, master_add_inactive_node
    • master_add_inactive_node should not error out
    • check metadata of reference tables, corruption of reference table data, replication count in pg_dist_colocation
  • Fail a node during CREATE INDEX … CONCURRENTLY
    • Issue the CREATE INDEX ... CONCURRENTLY
    • Kill a worker process
  • Fail a node during ALTER TABLE ADD/DROP CONSTRAINT
  • Fail a node during ALTER TABLE RENAME COLUMN
  • Fail a node when creating a savepoint (or rollbacking to a savepoint)
    • When a node fails the whole transaction fails anyway, but keep the test regardless.
  • Fail node when DISTINCT/COUNT DISTINCT running
    • Fail a node when processing is on the worker
    • Fail the coordinator when the processing is on the coordinator
  • Fail/kill maintenance daemon
    • When the daemon is sleeping
    • When the daemon is running call home
    • When the daemon is running deadlock detection
    • When the daemon is running recovery
  • Fail a node while recover_prepared_transactions() is in progress
    • Fail the coordinator
    • Fail the workers
  • Fail node while running multi-shard UPDATE/DELETEs
    • Test with both 1PC and 2PC
      • Fail coordinator
      • Fail worker

-- these are not in the release testing document, but let's try to get them as well

  • Shard rebalancer
  • Tenant isolation

@lithp (Contributor Author) commented May 16, 2018

Things to play with: citus.shard_count, citus.shard_replication_factor

@lithp (Contributor Author) commented May 16, 2018

Useful: get_shard_id_for_distribution_column(table-name, value)

@lithp (Contributor Author) commented May 16, 2018

Q: How do you tell which executor is being used, to know which one a specific join uses?
A: EXPLAIN tells you

@lithp (Contributor Author) commented May 16, 2018

Can you add any test cases for this: #2031

WARNING: connection not open
CONTEXT: while executing command on localhost:57640
COPY copy_test, line 1: "0, 0"
ERROR: failure on connection marked as essential: localhost:57640
Contributor Author:
I think this is a bug. If the other worker is accessible we shouldn't error out, we should just mark the placement inactive

This probably means the server terminated abnormally
before or while processing the request.
CONTEXT: while executing command on localhost:57640
COPY copy_test, line 1: "0, 0"
Contributor Author:
Same here, the other worker is doing fine, this should mark the placement inactive and continue.

(1 row)

COPY copy_test FROM PROGRAM 'echo 0, 0 && echo 1, 1 && echo 2, 4 && echo 3, 9' WITH CSV;
ERROR: failed to COPY to shard 100400 on localhost:57640
Contributor Author:
And again here, if the COPY fails while some rows are being sent we error out, instead of marking the placement inactive.

@lithp requested a review from onderkalaci on May 18, 2018 01:43
.travis.yml Outdated
@@ -39,7 +49,7 @@ install:
sudo dpkg --force-confold --force-confdef --force-all -i *hll*.deb
fi
before_script: citus_indent --quiet --check
- script: CFLAGS=-Werror pg_travis_multi_test check
+ script: CFLAGS=-Werror pg_travis_multi_test check-failure
Member:
Let's not forget this here :)

@onderkalaci changed the title from "[WIP] Mitmproxy-based automated failure testing" to "Mitmproxy-based automated failure testing" on Jun 8, 2018
@lithp force-pushed the mitmproxy-failure-testing branch from da1181a to 49ab26a on June 11, 2018 10:34
@onderkalaci (Member) left a comment:
Also noting the things we've discussed here:

  1. Try to avoid the code changes in Citus [Brian]
  2. Update the older tests (old APIs are used), try to avoid .source files and remove them if not a small change for now
  3. Add tests for DDLs and real-time SELECTs [Onder]

import structs

'''
Use with a command line like this:
Member:
move to a readme under the folder

Contributor Author:
done

There are also some special commands. This proxy also records every packet and lets you
inspect them:

recorder.dump() - emits a list of captured packets in COPY text format
Member:
also add dump_network_traffic / clear_network_traffic

Contributor Author:
also done

@@ -275,6 +287,18 @@ sub revert_replace_postgres
push(@pgOptions, '-c', "citus.remote_task_check_interval=1ms");
push(@pgOptions, '-c', "citus.shard_replication_factor=2");
push(@pgOptions, '-c', "citus.node_connection_timeout=${connectionTimeout}");
push(@pgOptions, '-c', "citus.sslmode=disable");
Member:
Only do it when failure testing

Contributor Author:
done

@@ -178,6 +185,11 @@ ()
MESSAGE
}

if ($useMitmproxy)
Member:
error out on check-full and check-failure

Contributor Author:
I've added check-failure to check-full.

Contributor Author:
I've also added a line item to the travis build matrix, on PG10 we test both the regular tests and also the failure tests

with open(fifoname, mode='w') as fifo:
    fifo.write('{}\n'.format(result))

def replace_thread(fifoname):
Member:
maybe create_thread

Contributor Author:
done

        self.root = self
        self.command = None

    def dump(self, normalize_shards=True, dump_unknown_messages=False):
Member:
dump_unknown_messages: maybe remove it?

Contributor Author:
removed

CREATE FUNCTION citus.dump_network_traffic(
normalize_shards bool default true,
dump_unknown_messages bool default false
) RETURNS TABLE(conn int, from_client bool, message text) AS $$
Member:
from_client bool: use a text instead

Contributor Author:
It's now source text, and a lot easier to read!

@@ -79,6 +79,10 @@ check-follower-cluster: all
$(pg_regress_multi_check) --load-extension=citus --follower-cluster \
-- $(MULTI_REGRESS_OPTS) --schedule=$(citus_abs_srcdir)/multi_follower_schedule $(EXTRA_TESTS)

check-failure: all
Member:
If simple: stderr is directed to a file instead of stdout

Member:
already done

@lithp force-pushed the mitmproxy-failure-testing branch 2 times, most recently from 09b40b5 to 8eea61c on June 21, 2018 17:14
@lithp (Contributor Author) commented Jun 21, 2018

I've rewritten history so that this branch never messed with the citus version or added a .sql file for it, and instead always used test_helpers.sql. This made rebasing easy, and I've now rebased it onto master. @onderkalaci you should now be able to test sequential modifications!

@lithp force-pushed the mitmproxy-failure-testing branch from 8eea61c to 9133912 on June 21, 2018 20:48
@lithp (Contributor Author) commented Jun 21, 2018

I've removed the changes to prepared transaction id generation and reworked the test, failure_drop_table, to no longer rely on it. I've pushed the changes to a branch for posterity in case we want them back; I think the test is a lot easier to read with them.

@onderkalaci (Member) commented:
@onderkalaci you should now be able to test sequential modifications!

All looks good, I've added the tests in #2212. Once you address the feedback we've added together here, feel free to ping me. I'd like to have a final look at the changes.

Add tests for DDLs and real-time SELECTs [Onder]

I think I won't have time to add proper regression tests for real-time select failure / cancellation right now. I'm planning to tackle those during the release testing. For the other tests that are in the failure_schedule, feel free to pick one and add tests covering various edge cases. I'm OK with removing the other tests and letting the whole team get involved in writing all the tests mentioned here.

We should probably get #2212 and #2210 to master (plus one other extensive test written by you).

@lithp force-pushed the mitmproxy-failure-testing branch from 9133912 to 466fed0 on June 26, 2018 20:17
@lithp (Contributor Author) commented Jun 26, 2018

I've merged #2182 and rebased onto master so that this PR now includes 0 changes to citus code

@lithp (Contributor Author) commented Jun 27, 2018

@onderkalaci ready for you to review again :)

I think this is my only remaining work:

@onderkalaci (Member) left a comment:

I have a slight preference for not adding all the tests in this PR. I think it'd be better to concentrate on each test separately during the release testing. That said, feel free not to remove them; I can look a bit closer at each.

I think we're almost done. I'll have a final look & final test once you address the minor notes you've commented on.

@@ -0,0 +1,63 @@
CREATE TABLE test (a int, b int);
Member:
do we need this file at all?

Contributor Author:
removed

@@ -0,0 +1,165 @@
Automated Failure testing
Member:
Great readme, thanks a lot!

Contributor Author:
😀

def _handle(self, flow, message):
    flow.kill() # tell mitmproxy this connection should be closed

    client_conn = flow.client_conn # connections.ClientConnection(tcp.BaseHandler)
Member:
is connections.ClientConnection(tcp.BaseHandler) forgotten?

Contributor Author:
I can remove this; I think I added it while I was figuring out how to get the actual socket. This is just a note so I could remember what the type of client_conn is.

Contributor Author:
Changed the comment to something more clear

@@ -0,0 +1,328 @@
{
Member:
Do we need to include Pipfile.lock? Isn't it something that is generated on the fly?

Contributor Author:
I don't fully understand the reasoning but it sounds like adding Pipfile.lock is recommended: pypa/pipenv#598 (comment)

Contributor:
I don't fully understand the reasoning

Having just read waaaay too much about the craptastic Python dependency-management universe (I do not understand why it is so Balkanized), we should definitely check this in. Pipfile has the general version constraints and Pipfile.lock has the specific versions that we're using which meet those constraints. We'd only want to not provide Pipfile.lock if we were developing a library, I think.

@lithp force-pushed the mitmproxy-failure-testing branch 3 times, most recently from 3121431 to 6c01440 on June 29, 2018 02:13
@lithp (Contributor Author) commented Jun 29, 2018

@onderkalaci addressed your feedback, removed all the tests (I'll open them again in new PRs, already opened #2244), sent all output to a file, and squashed; this is ready for you again!

@onderkalaci (Member) left a comment:
All looks good, after considering pretty minor comments :shipit:

@@ -1,6 +1,8 @@
sudo: required
dist: trusty
language: c
python:
Member:
One minor note that I don't have a good answer for right now:

For example, the test-automation repo comes with Python 2.7, and it's a bit painful to install 3.5 to execute the tests here. Also, any other 3.X Python gives warnings etc. when starting the tests.

Is there anything we can do about relaxing the version checks? It seems not, for now?

Contributor Author:
Hmm, interesting. This script unfortunately hooks deeply into mitmproxy, which is why the version requirement is so specific.

One solution might be to move test-automation from 2.7 to 3.5.

Another would be to replace mitmproxy with our own proxy, we're not using very much of mitmproxy so this wouldn't be a huge change, probably only a week.

Contributor:
I've been using pyenv to install other Python versions and using pipenv on top of that to isolate all this. This works reliably well now.


# II. Running mitmproxy manually

$ mkfifo /tmp/mitm.fifo # first, you need a fifo
Member:
Maybe add cd src/test/regress/mitmscripts ?

means mitmdump will accept connections on port 9702 and forward them to the worker
listening on port 9700.

Now, open psql and run:
Member:
maybe add cd src/test/regress

You also want to tell the UDFs how to talk to mitmproxy (careful, this must be an absolute
path):

# SET citus.mitmfifo = '/tmp/mitm.fifo';
Member:
Shall we note that this is not actually a GUC, but Postgres still allows us to read it? Though I'm now familiar with this, I was confused when I first saw/realized this fact.

Contributor Author:
Sure, I can add a quick note


@metdos (Contributor) commented Jul 6, 2018

@lithp, when are we expecting to merge this?

- Lots of detail is in src/test/regress/mitmscripts/README
- Create a new target, make check-failure, which runs tests
- Tells travis how to install everything and run the tests
@lithp force-pushed the mitmproxy-failure-testing branch from 6c01440 to 3e309e3 on July 6, 2018 18:51
@lithp merged commit a54f9a6 into master on Jul 6, 2018
@jasonmp85 (Contributor) commented:
@metdos @lithp so we merged this but that checklist looks pretty un-checked. Are we going to track further outstanding work elsewhere?

@metdos (Contributor) commented Jul 8, 2018

@metdos @lithp so we merged this but that checklist looks pretty un-checked. Are we going to track further outstanding work elsewhere?

I added the test conversion parts to this issue (#2262); do we have any plans for the others, @lithp?
