Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt verify_taxa() #21

Merged
merged 44 commits into from
Dec 20, 2018
Merged

Adapt verify_taxa() #21

merged 44 commits into from
Dec 20, 2018

Conversation

damianooldoni
Copy link
Collaborator

  1. Function verify_taxa() has been adapted in order to match new mandatory column names:
old names new names
checklist_scientificName scientificName
backbone_scientificName bb_scientificName
backbone_taxonomicStatus bb_taxonomicStatus
backbone_acceptedName bb_acceptedName
backbone_Key bb_key
backbone_acceptedKey bb_acceptedKey
backbone_kingdom bb_kingdom
backbone_issues issues
  1. Changes also applied to unit-tests:
  • ./tests/testthat/test-input_verify_taxa.R
  • ./tests/testthat/test-input_verify_taxa.R
  • all tsv files to test the output of verify_taxa() in ./tests/testthat/data_test_output_verify_taxa/
  1. Documentation compiled via devtools::document()
  2. Check correctness:
> devtools::test(filter = "output_verify_taxa")
Loading trias
Testing trias| OK F W S | Context| 17       | output_verify_taxa [0.3 s]

== Results =====================================================================
Duration: 0.3 s

OK:       17
Failed:   0
Warnings: 0
Skipped:  0
> devtools::test(filter = "input_verify_taxa")
Loading trias
Testing trias| OK F W S | Context|  8       | input_verify_taxa

== Results =====================================================================
Duration: 0.1 s

OK:       8
Failed:   0
Warnings: 0
Skipped:  0

@damianooldoni
Copy link
Collaborator Author

damianooldoni commented Oct 7, 2018

I have to change something more than column names unfortunately.
I see a small bug due to change in issues perspective: I wrote this function taking into account only backbone issues. In other words, I filtered on !is.na(bb_key) to detect changes in issues. But now the situation is different. Running verify_taxa.R in verify_taxa.Rmd of repo unified-checklist@review returns a problem for two taxa with scientificName Spartina x townsendii H. & J. Groves and Galanthus x warei J. Allen and bb_key 5290034 and 7556596. I will work on it tomorrow morning asap.

@damianooldoni
Copy link
Collaborator Author

Bug related to issues: solved.
Bug related to presence of taxa in multiple checklists (>= 3): solved.
Documentation and unit-tests upgraded so taking into account these bugs as encountered in the pipeline for unified checklist.

@damianooldoni
Copy link
Collaborator Author

A little bug emerged while looking in more detail. See issue #23.
Two lines expected. But a third row with checklist value NA,9ff7d317-609b-4c08-bd86-3bc404b77c42 detected which is an error. So, there is still a little bug while trying to collapse multiple rows for taxa present in more checklists.

@damianooldoni
Copy link
Collaborator Author

Based on trias-project/unified-checklist#14 (comment) i should simplify this function: in particular, no multiple comma separated checklists anymore for taxa present in multiple checklists.
@peterdesmet : based on the improved unification workflow could you please help me to assess which input and output parameters should be changed/added? See begin of this PR and function documentation: https://github.com/trias-project/trias/blob/adapt-verify_taxa/R/verify_taxa.R.

Copy link
Member

@peterdesmet peterdesmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New structure output (#24) and many other changes...
Better indentation of itemized elements
Among others: some name changes as discussed in #26  and #27 and apply specific order to taxa in output dfs as discussed in #27.
Changing name and solve bugs about outdated taxa
Unit-tests added as well in this commit.
devtools::document()
@damianooldoni
Copy link
Collaborator Author

@peterdesmet : time to give a look back to the PR. I think I did all changes we discussed. Here a synthesis about the functionality of verify_taxa().

The scope of this PR is the improvement of function verify_taxa(). The main change has been the introduction of taxon keys and its use as primary key jointly with bb_key and bb_acceptedKey instead of relying on scientific names. The very first consequence of this change is the fact that datasetKey column doesn't accept anymore multiple, comma separated, dataset keys.

The function reads two dataframes, taxa and taxa_to_verify and returns:

  1. Updated version of taxa_to_verify.
  2. Updated version of taxa, i.e. same number of rows (number of taxa), column verification_key added with information extracted by taxa_to_verify.
  3. A list of dataframes called info containing ancillary information. Such list is formed by the following dataframes: new_synonyms, new_unmatched_taxa, outdated_taxa, updated_bb_scientificName, updated_bb_acceptedName, duplicates, check_verificationKey.

Main features of the function:

  1. All taxa in taxa with a bb_key (matched taxa) and bb_taxonomicStatus equal to ACCEPTED or DOUBTFUL (not synonyms) don't need any manual verification, so verification_key set equal to bb_key.
  2. Synonyms and unmatched taxa in taxa not present in taxa_to_verify are added to taxa_to_verify.
  3. Information about the taxa the synonyms point to is retrieved and added in following columns: bb_acceptedKingdom, bb_acceptedRank, bb_acceptedTaxonomicStatus.
  4. Backbone scientific names (bb_scientificName) and backbone accepted names (bb_acceptedName) are checked for possible changes and eventually updated.
  5. Synonyms and unmatched taxa in taxa_to_verify not present in taxa are labelled as outdated: outdated set to TRUE, "Outdated taxa." added to remarks.
  6. Taxa in taxa_to_verify labelled as outdated which are present in taxa get outdated = FALSE and "Outdated taxa." is removed from remarks.
  7. All verification keys in taxa_to_verify are checked by TrIAS function gbif_verify_keys(). Results are returned in check_verificationKey df.
  8. duplicates, i.e. taxa with same bb_key - bb_acceptedKey (but different taxonKey) are detected to be shown apart in ancillary df info$duplicates.
  9. The column verificationKey allows comma seperated values (e.g. in case of hybrids).

Unit-tests has been written in two files: test-input_verify_taxa.R and test-output_verify_taxa.R. Standard location: ./tests/testthat/. Example in documentation is also used in some tests and has been saved in file ``.

@peterdesmet : please pull and test! Thanks.

@peterdesmet
Copy link
Member

Getting an error in which additional columns of input df taxa are not recognized:

Error: Unknown columns `taxonID`, `nameType`, `issues`, `validDistribution`, `bb_species` and ...

These columns should just be ignored (and remain in output df).

Copy link
Member

@peterdesmet peterdesmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@peterdesmet peterdesmet merged commit 4449cf7 into master Dec 20, 2018
@peterdesmet peterdesmet deleted the adapt-verify_taxa branch December 20, 2018 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants