Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(tests): Fix false negative cram tests #1666

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open
3 changes: 3 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,13 @@

* index: Previously specifying a directory that does not exist in the path to `--output` would result in an incorrect error stating that the input file does not exist. It now shows the correct path responsible for the error. [#1644][] (@victorlin)
* curate format-dates: Update help docs and improve failure messages to show use of `--expected-date-formats`. [#1653][] (@joverlee521)
* tests: Fix false negative tests due to incorrect use of DeepDiff's `--exclude-regex-path` option. [#1665][] [#1666][] (@corneliusroemer)

[#1644]: https://github.com/nextstrain/augur/issues/1644
[#1653]: https://github.com/nextstrain/augur/pull/1653
[#1656]: https://github.com/nextstrain/augur/pull/1656
[#1665]: https://github.com/nextstrain/augur/issues/1665
[#1666]: https://github.com/nextstrain/augur/pull/1666

## 26.0.0 (17 September 2024)

Expand Down
2 changes: 2 additions & 0 deletions docs/contribute/DEV_DOCS.md
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,8 @@ To compare JSON outputs with stochastic numerical values, use `scripts/diff_json

Both tree and JSON comparison scripts rely on [deepdiff](https://deepdiff.readthedocs.io/en/latest/) for underlying comparisons.

When using `diff_jsons.py`'s `--exclude-regex-paths` argument, make sure to escape any special regex characters, in particular square brackets: do this `\['seqid'\]`, not this `['seqid']`. See [#1655](https://github.com/nextstrain/augur/issues/1665) for what happens if you don't.

#### When to use which type of test

1. Unit tests should be used for the [public API](https://docs.nextstrain.org/projects/augur/en/stable/api/public/index.html).
Expand Down
13 changes: 12 additions & 1 deletion scripts/diff_jsons.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
import argparse
import deepdiff
import json
import re

from augur.argparse_ import ExtendOverwriteDefault

Expand All @@ -15,12 +16,22 @@
parser.add_argument("first_json", help="first JSON to compare")
parser.add_argument("second_json", help="second JSON to compare")
parser.add_argument("--significant-digits", type=int, default=5, help="number of significant digits to use when comparing numeric values")
parser.add_argument("--exclude-paths", nargs="+", action=ExtendOverwriteDefault, help="list of paths to exclude from consideration when performing a diff", default=["root['generated_by']['version']"])
parser.add_argument("--exclude-paths", nargs="+", action=ExtendOverwriteDefault, help="list of paths to exclude from consideration when performing a diff", default=["root['generated_by']", "root['meta']['updated']"])
parser.add_argument("--exclude-regex-paths", nargs="+", action="extend", help="list of path regular expressions to exclude from consideration when performing a diff")
parser.add_argument("--ignore-numeric-type-changes", action="store_true", help="ignore numeric type changes in the diff (e.g., int of 1 to float of 1.0)")

args = parser.parse_args()

# Test for most fatal errors in regex path usage
# Exclude regexes should never match `'`, otherwise the diff is always going to pass
Comment on lines +25 to +26
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this include " as well? I'm assuming root['generated_by'] and root["generated_by"] are both valid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the path that deep diff uses to match against uses a single quote not double quote. Hence excluding only single quote. Does that make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested with both ['seqid'] and ["seqid"]:

   $ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/clades.json" clades_no-labels.json --exclude-regex-paths "['seqid']"
-  {'dictionary_item_removed': [root['branches']]}
+  Traceback (most recent call last):
+    File "/Users/vlin/Dropbox/repos/nextstrain/augur/tests/functional/clades/cram/../../../../scripts/diff_jsons.py", line 30, in <module>
+      raise Exception(
+  Exception: Exclude regex ['seqid'] matches `'` which means this diff will always pass which is probably not what you want.
+  You probably forgot to escape something in your regex. See for example: https://stackoverflow.com/a/79173188/7483211
+  [1]
   $ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/clades.json" clades_no-labels.json --exclude-regex-paths '["seqid"]'
-  {'dictionary_item_removed': [root['branches']]}
+  {}

It looks like " should get the same treatment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, there's a bunch more things that should not not be allowed to be matched, things like [ and ] and any single letter. But in a way it's not super clearcut as someone might want to exclude everything but a single path (they should use an include path instead then) - so I don't know how far we want to go here. Including " sounds good though, no harm I can see.

for regex in args.exclude_regex_paths or []:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Adding or [] (e596b50) doesn't seem useful:

  • We don't run mypy/pyright on this file
  • A better solution would be to set default=[] on add_argument()

Copy link
Member Author

@corneliusroemer corneliusroemer Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because we don't run mypy doesn't mean that we shouldn't prevent type errors from happening? The error actually happens irrespective of running mypy. It's not mypy erroring it's a real runtime error.

Why is changing the defaults better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is changing the defaults better?

Setting the default up top allows for a valid assumption that args.exclude_regex_paths is always iterable. Otherwise we would have to do another args.exclude_regex_paths or [] if using it as an iterable somewhere else in the code for whatever reason.

result = re.compile(regex).search("'")
if result is not None:
raise Exception(
f"Exclude regex {regex} matches `'` which means this diff will always pass which is probably not what you want.\n"
"You probably forgot to escape something in your regex. See for example: https://stackoverflow.com/a/79173188/7483211"
)

with open(args.first_json, "r") as fh:
first_json = json.load(fh)

Expand Down
2 changes: 1 addition & 1 deletion tests/functional/ancestral/cram/ambiguous-positions.t
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ Setup

$ python3 "$SCRIPTS/diff_jsons.py" \
> expected.json "nt_muts.json" \
> --exclude-regex-paths "root\['annotations'\]" "root\['generated_by'\]" "root\['reference'\]"
> --exclude-paths "annotations" "generated_by" "reference"
{}

$ python3 "$SCRIPTS/compare-json-vcf.py" \
Expand Down
6 changes: 2 additions & 4 deletions tests/functional/ancestral/cram/case-sensitive.t
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,7 @@ Change the _reference_ to lowercase

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" \
> "$TESTDIR/../data/simple-genome/nt_muts.ref-seq.json" \
> "nt_muts.ref-seq.json" \
> --exclude-paths "root['generated_by']"
> "nt_muts.ref-seq.json"
{}


Expand All @@ -37,6 +36,5 @@ be lowecase which will be compared against the uppercase reference

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" \
> "$TESTDIR/../data/simple-genome/nt_muts.ref-seq.json" \
> "nt_muts.ref-seq.json" \
> --exclude-paths "root['generated_by']"
> "nt_muts.ref-seq.json"
{}
6 changes: 2 additions & 4 deletions tests/functional/ancestral/cram/general.t
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,7 @@ node-data JSON we diff against.

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" \
> "$TESTDIR/../data/simple-genome/nt_muts.ref-seq.json" \
> "nt_muts.ref-seq.json" \
> --exclude-paths "root['generated_by']"
> "nt_muts.ref-seq.json"
{}

Same as above but without providing a `--root-sequence`. The effect of this on behaviour is:
Expand All @@ -39,6 +38,5 @@ mutations (as there's nothing to compare the root node to)

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" \
> "$TESTDIR/../data/simple-genome/nt_muts.no-ref-seq.json" \
> "nt_muts.no-ref-seq.json" \
> --exclude-paths "root['generated_by']"
> "nt_muts.no-ref-seq.json"
{}
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ ancestor).
Check that the reference length was correctly exported as the nuc annotation

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" \
> --exclude-regex-paths "['seqid']" -- \
> --exclude-regex-paths "\['seqid'\]" -- \
> "$TESTDIR/../data/ancestral_mutations_with_root_sequence.json" \
> "$CRAMTMP/$TESTFILE/ancestral_mutations.json"
{}
2 changes: 1 addition & 1 deletion tests/functional/ancestral/cram/vcf-multi-allele.t
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ See <https://github.com/nextstrain/augur/issues/1380> for the bug this is testin
$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" \
> "$DATA/nt_muts.ref-seq.json" \
> nt_muts.json \
> --exclude-regex-paths "root\['nodes'\]\['.+'\]\['sequence'\]" "root\['generated_by'\]"
> --exclude-regex-paths "root\['nodes'\]\['.+'\]\['sequence'\]"
{'iterable_item_added': {"root['nodes']['sample_B']['muts'][0]": 'A30G'}}

$ cat > expected.vcf <<EOF
Expand Down
2 changes: 1 addition & 1 deletion tests/functional/ancestral/cram/vcf.t
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ but it will have the reference sequence attached.
$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" \
> "$DATA/nt_muts.ref-seq.json" \
> "nt_muts.vcf-input.ref-seq.json" \
> --exclude-regex-paths "root\['nodes'\]\['.+'\]\['sequence'\]" "root\['generated_by'\]"
> --exclude-regex-paths "root\['nodes'\]\['.+'\]\['sequence'\]"
{}

Here's the same mutations as in $DATA/nt_muts.ref-seq.json,
Expand Down

Large diffs are not rendered by default.

3 changes: 1 addition & 2 deletions tests/functional/clades/cram/membership-and-label.t
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,5 @@ Test custom membership key + label key. The only change should be the key names

$ cat clades_custom.json | sed "s/lineage/clade_membership/" | sed "s/origin/clade/" > clades_sed.json

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/clades.json" clades_sed.json \
> --exclude-paths "root['generated_by']"
$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/clades.json" clades_sed.json
{}
3 changes: 1 addition & 2 deletions tests/functional/clades/cram/no-label.t
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,5 @@ Test the ability to _not_ export a branch label (same logic as not exporting the
> --label-name none \
> --output-node-data clades_no-labels.json &>/dev/null

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/clades.json" clades_no-labels.json \
> --exclude-paths "root['generated_by']"
$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/clades.json" clades_no-labels.json
{'dictionary_item_removed': [root['branches']]}
9 changes: 3 additions & 6 deletions tests/functional/clades/cram/root-clade-identification.t
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,7 @@ This is an oversight and ideally would be fixed
> --clades "$TESTDIR/../data/toy_clades_nuc.tsv" \
> --output-node-data toy_clades_1.json &>/dev/null

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/toy_clades_1.json" toy_clades_1.json \
> --exclude-paths "root['generated_by']"
$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/toy_clades_1.json" toy_clades_1.json
{}

A clade which exists at the root is identified (and correctly propogated) if the root sequence
Expand All @@ -25,8 +24,7 @@ is explicitly set.
> --clades "$TESTDIR/../data/toy_clades_nuc.tsv" \
> --output-node-data toy_clades_2a.json &>/dev/null

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/toy_clades_2.json" toy_clades_2a.json \
> --exclude-paths "root['generated_by']"
$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/toy_clades_2.json" toy_clades_2a.json
{}

A clade which exists at the root is identified (and correctly propogated) without a root sequence
Expand All @@ -38,6 +36,5 @@ if the (branch leading to the) root has the clade-defining mutation.
> --clades "$TESTDIR/../data/toy_clades_nuc.tsv" \
> --output-node-data toy_clades_2b.json &>/dev/null

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/toy_clades_2.json" toy_clades_2b.json \
> --exclude-paths "root['generated_by']"
$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/toy_clades_2.json" toy_clades_2b.json
{}
Original file line number Diff line number Diff line change
Expand Up @@ -28,5 +28,5 @@ Confirm that trees match expected topology and branch lengths, given that the ou
> "$TESTDIR/../data/mutations_per_site_branch_lengths.json" \
> branch_lengths.json \
> --significant-digits 0 \
> --exclude-paths "root['generated_by']['version']" "root['input_tree']"
> --exclude-paths "generated_by" "input_tree"
{}
2 changes: 1 addition & 1 deletion tests/functional/refine/cram/not-timetree-mutations.t
Original file line number Diff line number Diff line change
Expand Up @@ -29,5 +29,5 @@ Confirm that trees match expected topology and branch lengths, given that the ou
> "$TESTDIR/../data/integer_branch_lengths.json" \
> branch_lengths.json \
> --significant-digits 0 \
> --exclude-paths "root['generated_by']['version']" "root['input_tree']" "root['alignment']"
> --exclude-paths "generated_by" "input_tree" "alignment"
{}
4 changes: 2 additions & 2 deletions tests/functional/translate/cram/general.t
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ which validate the output will fail as it's missing a 'nuc' annotation.
$ python3 "$SCRIPTS/diff_jsons.py" \
> "$DATA/aa_muts.json" \
> "aa_muts.json" \
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]" "root['meta']['updated']"
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]"
{}

Same as above but using a GenBank file. This changes the 'type' of the annotations,
Expand All @@ -33,5 +33,5 @@ but this is irrelevant for Auspice's use and simply reflects the reference sourc
$ python3 "$SCRIPTS/diff_jsons.py" \
> "$DATA/aa_muts.json" \
> "aa_muts.genbank.json" \
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]" "root\['annotations'\]\['.+'\]\['type'\]" "root['meta']['updated']"
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['(seqid|type)'\]"
{}
11 changes: 5 additions & 6 deletions tests/functional/translate/cram/genes.t
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,9 @@ as a feature ('nuc' in this case)
amino acid mutations written to .+ (re)

$ python3 "$SCRIPTS/diff_jsons.py" \
> "$DATA/aa_muts.json" \
> "aa_muts.genes-args.json" \
> --exclude-regex-paths "seqid" "gene1" "root['meta']['updated']"
> "$DATA/aa_muts.json" \
> "aa_muts.genes-args.json" \
> --exclude-regex-paths "seqid" "gene1"
{}
Using a text file rather than command line arguments

Expand All @@ -43,7 +43,6 @@ Using a text file rather than command line arguments
amino acid mutations written to .+ (re)

$ python3 "$SCRIPTS/diff_jsons.py" \
> "aa_muts.genes-args.json" \
> "aa_muts.genes-txt.json" \
> --exclude-paths "root['meta']['updated']"
> "aa_muts.genes-args.json" \
> "aa_muts.genes-txt.json"
{}
2 changes: 1 addition & 1 deletion tests/functional/translate/cram/gff.t
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,7 @@ GFF file with 'region' removed, so the only genome information is the ##sequence
$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" \
> "$DATA/aa_muts.json" \
> "aa_muts.pragma-only.json" \
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]" "root['meta']['updated']"
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]"
{}

GFF file with no genome coordinate information
Expand Down
2 changes: 1 addition & 1 deletion tests/functional/translate/cram/root-mutations.t
Original file line number Diff line number Diff line change
Expand Up @@ -27,5 +27,5 @@ is unchanged (MPCG*). There is also a mutation E4G at the root node to compensat
$ python3 "$SCRIPTS/diff_jsons.py" \
> "$DATA/aa_muts.json" \
> "aa_muts.json" \
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]" "root['meta']['updated']"
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]"
{'values_changed': {"root['reference']['gene1']": {'new_value': 'MPCE*', 'old_value': 'MPCG*'}}, 'iterable_item_added': {"root['nodes']['node_root']['aa_muts']['gene1'][0]": 'E4G'}}
2 changes: 1 addition & 1 deletion tests/functional/translate/cram/translate-with-genbank.t
Original file line number Diff line number Diff line change
Expand Up @@ -18,5 +18,5 @@ Translate amino acids for genes using a GenBank file.
amino acid mutations written to .* (re)

$ python3 "$SCRIPTS/diff_jsons.py" $DATA/zika/aa_muts_genbank.json aa_muts.json \
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]" "root['meta']['updated']"
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]"
{}
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ Other than the sequence ids which will include a temporary path, the JSONs
should be identical.

$ python3 "${SCRIPTS}/diff_jsons.py" \
> --exclude-regex-paths "['seqid']" -- \
> --exclude-regex-paths "\['seqid'\]" -- \
> "${DATA}/zika/aa_muts_gff.json" \
> aa_muts.json
{}
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ Translate amino acids for genes using a GFF3 file where the gene names are store
amino acid mutations written to .* (re)

$ python3 "${SCRIPTS}/diff_jsons.py" \
> --exclude-regex-paths "['seqid']" -- \
> --exclude-regex-paths "\['seqid'\]" -- \
> "${DATA}/zika/aa_muts_gff.json" \
> aa_muts.json
{}
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ This is an identical test setup as `translate-with-gff-and-gene.t` but using loc
amino acid mutations written to .* (re)

$ python3 "${SCRIPTS}/diff_jsons.py" \
> --exclude-regex-paths "['seqid']" -- \
> --exclude-regex-paths "\['seqid'\]" -- \
> "${DATA}/zika/aa_muts_gff.json" \
> aa_muts.json
{}
2 changes: 1 addition & 1 deletion tests/functional/translate/cram/vcf-with-root-mutation.t
Original file line number Diff line number Diff line change
Expand Up @@ -43,5 +43,5 @@ node is E (and so are all the other nodes))
$ python3 "$SCRIPTS/diff_jsons.py" \
> aa_muts.truth.json \
> aa_muts.json \
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]" "root['meta']['updated']"
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]"
{}
2 changes: 1 addition & 1 deletion tests/functional/translate/cram/vcf.t
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ Setup
$ python3 "$SCRIPTS/diff_jsons.py" \
> "$DATA/aa_muts.json" \
> aa_muts.json \
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]" "root['meta']['updated']"
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]"
{}

------------------------------ MISSING TEST ----------------------------------
Expand Down
14 changes: 12 additions & 2 deletions tests/functional/translate/data/zika/aa_muts_gff.json
Original file line number Diff line number Diff line change
Expand Up @@ -13,11 +13,18 @@
"start": 457,
"strand": "+",
"type": "gene"
},
"nuc": {
"end": 10769,
"seqid": "genemap.gff",
"start": 1,
"strand": "+",
"type": "##sequence-region pragma"
}
},
"generated_by": {
"program": "augur",
"version": "16.0.3"
"version": "26.0.0"
},
"nodes": {
"BRA/2016/FC_6706": {
Expand Down Expand Up @@ -91,7 +98,10 @@
}
},
"NODE_0000006": {
"aa_muts": {},
"aa_muts": {
"CA": [],
"PRO": []
},
"aa_sequences": {
"CA": "MKNPKKKSGGFRIVNMLKRGVARVSPFGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKEKKRRGADTSVGIVGLLLTTAMA",
"PRO": "AEVTRRGSAYYMYLDRNDAGEAISFPTTLGMNKCYIQIMDLGHMCDATMSYECPMLDEGVEPDDVDCWCNTTSTWVVYGTCHHKKGEARRSRR"
Expand Down
Loading