Mitmproxy-based automated failure testing #2119
Conversation
Force-pushed from 8a24953 to cade93b
Codecov Report

@@           Coverage Diff            @@
##           master    #2119    +/-   ##
=========================================
+ Coverage     93.7%    93.7%   +<.01%
=========================================
  Files          110      110
  Lines        27446    27453       +7
=========================================
+ Hits         25719    25726       +7
  Misses        1727     1727
Force-pushed from cbfd90f to 05108bb
It's finally running (albeit in a rather hacky way) and passing on Travis!
Force-pushed from 05108bb to fedadb2
Rebased onto master
Clearing out some outdated TODOs:
Cancellation testing: The proxy sits at the network layer instead of acting directly on the binary. This means it's more stable under refactoring, but also means we have much less control over what the binary is doing when we take some action. If the action is sending a signal (to cancel the binary), it's likely that precise timing will be more important: we'll want to send a signal right at some critical section. So, maybe cancellation testing is better handled a different way. For instance, we could embed macro calls like … which raise a signal if the test script has previously called …
With mitmproxy: If mitmproxy is told the name of the process, it could send a signal when asked to. We could embed a script like …
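Purely as an illustration of that idea (this is not the script the comment refers to; the pgrep-based process lookup, the SIGINT default, and the example pattern are all assumptions):

```python
import os
import signal
import subprocess

def send_signal_to_process(process_name, signum=signal.SIGINT):
    # Look up pids whose command line matches the given name (assumes pgrep is available).
    pids = subprocess.check_output(['pgrep', '-f', process_name]).split()
    for pid in pids:
        # Deliver the signal, e.g. to interrupt a backend mid-operation.
        os.kill(int(pid), signum)

# A proxy command handler could call this when the test script asks for a cancellation,
# e.g. send_signal_to_process('postgres: .* COPY')  -- the pattern is hypothetical.
```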
Here's the traceback when the race condition happens:
Force-pushed from c1dd90f to aa71c3d
All the tests from our test document:
-- these don't exist in the release testing but let's try to get them as well
Things to play with:
Useful:
Q: How do you tell which executor is being used, to know which one a specific join uses?
There are additional tests here: https://docs.google.com/document/d/10mEwsYftun6ONoRCmtwlNcKvgdHEuGzqTuBnAZJuUag/edit
Can you add any test cases for this: #2031
WARNING: connection not open
CONTEXT: while executing command on localhost:57640
COPY copy_test, line 1: "0, 0"
ERROR: failure on connection marked as essential: localhost:57640
I think this is a bug. If the other worker is accessible we shouldn't error out, we should just mark the placement inactive
This probably means the server terminated abnormally before or while processing the request.
CONTEXT: while executing command on localhost:57640
COPY copy_test, line 1: "0, 0"
Same here, the other worker is doing fine, this should mark the placement inactive and continue.
(1 row)

COPY copy_test FROM PROGRAM 'echo 0, 0 && echo 1, 1 && echo 2, 4 && echo 3, 9' WITH CSV;
ERROR: failed to COPY to shard 100400 on localhost:57640
And again here, if the COPY fails while some rows are being sent we error out, instead of marking the placement inactive.
.travis.yml
Outdated
@@ -39,7 +49,7 @@ install:
sudo dpkg --force-confold --force-confdef --force-all -i *hll*.deb
fi
before_script: citus_indent --quiet --check
- script: CFLAGS=-Werror pg_travis_multi_test check
+ script: CFLAGS=-Werror pg_travis_multi_test check-failure
Let's not forget this here :)
Force-pushed from da1181a to 49ab26a
Also noting the things we've discussed here:
- Try to avoid the code changes in Citus [Brain]
- Update the older tests (old APIs are used); try to avoid .source files and remove them if not a small change for now
- Add tests for DDLs and real-time SELECTs [Onder]
import structs

'''
Use with a command line like this:
move to a readme under the folder
done
There are also some special commands. This proxy also records every packet and lets you inspect them:

recorder.dump() - emits a list of captured packets in COPY text format
also add dump_network_traffic / clear_network_traffic
also done
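As a side note, driving these recorder commands by hand could look roughly like the sketch below. It assumes the one-command-per-line fifo protocol suggested by the diffs in this PR; the fifo path and the command are just examples, not taken verbatim from the branch:

```python
def send_proxy_command(fifoname, command):
    # The proxy is assumed to be blocked reading the fifo; writing a line unblocks it.
    with open(fifoname, mode='w') as fifo:
        fifo.write('{}\n'.format(command))
    # The proxy then reopens the fifo for writing and sends its result back.
    with open(fifoname, mode='r') as fifo:
        return fifo.read()

if __name__ == '__main__':
    # e.g. ask the proxy to dump every packet it has captured so far
    print(send_proxy_command('/tmp/mitm.fifo', 'recorder.dump()'))
```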
src/test/regress/pg_regress_multi.pl
Outdated
@@ -275,6 +287,18 @@ sub revert_replace_postgres
push(@pgOptions, '-c', "citus.remote_task_check_interval=1ms");
push(@pgOptions, '-c', "citus.shard_replication_factor=2");
push(@pgOptions, '-c', "citus.node_connection_timeout=${connectionTimeout}");
+ push(@pgOptions, '-c', "citus.sslmode=disable");
Only do it when failure testing
done
@@ -178,6 +185,11 @@ ()
MESSAGE
}

if ($useMitmproxy)
error out on check-full and check-failure
I've added check-failure to check-full.
I've also added a line item to the Travis build matrix: on PG10 we run both the regular tests and the failure tests.
with open(fifoname, mode='w') as fifo:
    fifo.write('{}\n'.format(result))

def replace_thread(fifoname):
maybe create_thread
done
self.root = self
self.command = None

def dump(self, normalize_shards=True, dump_unknown_messages=False):
dump_unknown_messages: maybe remove it?
removed
CREATE FUNCTION citus.dump_network_traffic(
    normalize_shards bool default true,
    dump_unknown_messages bool default false
) RETURNS TABLE(conn int, from_client bool, message text) AS $$
from_client bool: use a text instead
It's now source text, and a lot easier to read!
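For anyone reading along, here is a quick sketch of pulling the captured traffic out of a session from Python. The connection parameters are assumptions and the column names are inferred from the exchange above, so treat it as illustrative only:

```python
import psycopg2

# Connect to the node used by the failure tests; host/port/dbname are assumptions.
conn = psycopg2.connect(host='localhost', port=9700, dbname='postgres')
cur = conn.cursor()
cur.execute('SELECT conn, source, message FROM citus.dump_network_traffic()')
for conn_id, source, message in cur.fetchall():
    # source is now text rather than a from_client boolean, which reads much better
    print(conn_id, source, message)
conn.close()
```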
@@ -79,6 +79,10 @@ check-follower-cluster: all
	$(pg_regress_multi_check) --load-extension=citus --follower-cluster \
	-- $(MULTI_REGRESS_OPTS) --schedule=$(citus_abs_srcdir)/multi_follower_schedule $(EXTRA_TESTS)

check-failure: all
If it's simple: direct stderr to a file instead of stdout
already done
Force-pushed from 09b40b5 to 8eea61c
I've rewritten history so that this branch never messed with the citus version or added a .sql file for it, and instead always used …
Force-pushed from 8eea61c to 9133912
I've removed the changes to prepared transaction id generation and reworked the test, …
All looks good, I've added the tests in #2212. Once you address the feedback we've added together here, feel free to ping me. I'd like to have a final look at the changes.
I think I won't have time to add proper regression tests for … We should probably get #2212 and #2210 to master (plus one other extensive test written by you).
Force-pushed from 9133912 to 466fed0
I've merged #2182 and rebased onto master so that this PR now includes 0 changes to citus code.
@onderkalaci ready for you to review again :) I think this is my only remaining work:
I have a slight preference not to add all the tests in this PR. I think it'd be better to concentrate on each test separately during the release testing. That said, feel free to not remove them; I can look a bit closer at each.
I think we're almost done. I'll have a final look & final test once you address the minor notes you've commented on.
@@ -0,0 +1,63 @@
CREATE TABLE test (a int, b int); |
do we need this file at all?
removed
@@ -0,0 +1,165 @@
Automated Failure testing |
Great readme, thanks a lot!
😀
def _handle(self, flow, message):
    flow.kill() # tell mitmproxy this connection should be closed

    client_conn = flow.client_conn # connections.ClientConnection(tcp.BaseHandler)
is connections.ClientConnection(tcp.BaseHandler) forgotten?
I can remove this; I think I added it while I was figuring out how to get the actual socket. This is just a note so I could remember what the type of client_conn is.
Changed the comment to something clearer
@@ -0,0 +1,328 @@
{ |
Do we need to include Pipfile.lock? Isn't it something that is generated on the fly?
I don't fully understand the reasoning but it sounds like adding Pipfile.lock is recommended: pypa/pipenv#598 (comment)
> I don't fully understand the reasoning

Having just read waaaay too much about the craptastic Python dependency-management universe (I do not understand why it is so Balkanized), we should definitely check this in. Pipfile has the general version constraints and Pipfile.lock has the specific versions that we're using which meet those constraints. We'd only want to not provide Pipfile.lock if we were developing a library, I think.
Force-pushed from 3121431 to 6c01440
@onderkalaci addressed your feedback, removed all the tests (I'll open them again in new PRs, already opened #2244), sent all output to a file, and squashed; this is ready for you again!
All looks good, after considering pretty minor comments
@@ -1,6 +1,8 @@
sudo: required
dist: trusty
language: c
python:
One minor note that I don't have a good answer for right now:
For example, the test-automation repo comes with Python 2.7, and it's a bit painful to install 3.5 to execute the tests here. Also, any other 3.X Python gives warnings etc. when starting the tests.
Is there anything we can do about relaxing the version checks? It seems not for now?
Hmm, interesting. This script unfortunately hooks quite deeply into mitmproxy, which is why the version requirement is so specific.
One solution might be to move test-automation from 2.7 to 3.5.
Another would be to replace mitmproxy with our own proxy; we're not using very much of mitmproxy, so this wouldn't be a huge change, probably only a week.
I've been using pyenv to install other Python versions and using pipenv on top of that to isolate all this. This works reliably well now.
# II. Running mitmproxy manually

$ mkfifo /tmp/mitm.fifo # first, you need a fifo
Maybe add cd src/test/regress/mitmscripts?
means mitmdump will accept connections on port 9702 and forward them to the worker listening on port 9700.

Now, open psql and run:
maybe add cd src/test/regress
You also want to tell the UDFs how to talk to mitmproxy (careful, this must be an absolute path):

# SET citus.mitmfifo = '/tmp/mitm.fifo';
Shall we note that this is not actually a GUC, but Postgres still allows us to read it? Though I'm now familiar with it, I was confused when I first realized this.
Sure, I can add a quick note
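For context on why this works at all: Postgres accepts dot-qualified settings it has never seen declared (they become placeholder settings), which is what the fifo path rides on. A tiny illustration, with connection parameters as assumptions:

```python
import psycopg2

conn = psycopg2.connect(host='localhost', port=9700, dbname='postgres')
cur = conn.cursor()
# citus.mitmfifo is not declared as a GUC anywhere, but the dotted name is accepted
# as a placeholder setting, and the test UDFs can read it back from the session.
cur.execute("SET citus.mitmfifo = '/tmp/mitm.fifo'")
cur.execute("SHOW citus.mitmfifo")
print(cur.fetchone()[0])  # -> /tmp/mitm.fifo
conn.close()
```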
@lithp, when are we expecting to merge this?
- Lots of detail is in src/test/regress/mitmscripts/README
- Create a new target, make check-failure, which runs tests
- Tells travis how to install everything and run the tests
Force-pushed from 6c01440 to 3e309e3
Some slides explaining how this works: https://docs.google.com/presentation/d/1zxF32GvcFJp0s6UClL39hla9WqhbGvYgo30e2WYnADQ/edit?usp=sharing
A modernized #2044. This one works at the network level instead of the process level. One drawback: we don't have a good way of exposing timing problems. For example, maybe the process crashes if a packet comes in before it is ready for it. That class of bugs seems rare to me, so I think this approach is safe to move forward with.
It's also much easier! This doesn't require sudo, just an extra daemon which packets flow through. It also has the advantage of not caring at all how the process runs, so the same tests should work modulo any refactoring which we choose to do.
Some work left to be done:
- … 'NoneType' object has no attribute 'tell'. What is going wrong?
- … netcat in order to keep using COPY)
- The Handler and Mixin hierarchy is cute but there's definitely a better way. Separating out the Building and the Handling seems like it'd improve things.
- pg_regress_multi.pl or fluent.py should redirect the proxy output to a log file
- … citus.mitmproxy() to the test functions?
- DROP TABLE crashes if you drop the connection after DROP TABLE IF EXISTS is sent. Open a ticket!
- recorder.dump() returns an empty line, fix that!
- enable_unique_job_ids -> enable_unique_transaction_ids
- Make the failure_task_tracker_executor test reproducible (don't hard-code filepaths)