docs fixes
jaanisoe committed Jan 22, 2020
1 parent e690645 commit 1387a28
Showing 3 changed files with 3 additions and 3 deletions.
2 changes: 1 addition & 1 deletion docs/fetcher.rst
@@ -107,7 +107,7 @@ The endpoint of the API is https://www.ebi.ac.uk/europepmc/webservices/rest/search

We can possibly get all `publication parts`_ from the Europe PMC API, except for fulltext_, efo_ and go_ for which we get a ``Y`` or ``N`` indicating if the corresponding part is available at the `Europe PMC fulltext`_ or `Europe PMC mined`_ resource. In addition, we can possibly get values for the publication fields :ref:`oa <oa>`, :ref:`journalTitle <journaltitle>`, :ref:`pubDate <pubdate>` and :ref:`citationsCount <citationscount>`. Europe PMC is currently the only resource we can get the :ref:`citationsCount <citationscount>` value from.

-Europe PMC itself has content from multiple sources (see https://europepmc.org/Help#whatserachingEPMC) and in some cases multiple results are returned for a query (each from a different source). In that case the MED (MEDLINE) source is preferred, then PMC (PubMed Central), then PPR (preprints) and then whichever source is first in the list of results.
+Europe PMC itself has content from multiple sources (see https://europepmc.org/Help#contentsources) and in some cases multiple results are returned for a query (each from a different source). In that case the MED (MEDLINE) source is preferred, then PMC (PubMed Central), then PPR (preprints) and then whichever source is first in the list of results.
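The source-preference rule described above is simple enough to sketch directly. The following is a hypothetical illustration only (the dictionary key ``source`` and the shape of the result objects are assumptions, not PubFetcher's actual code):

```python
def preferred_result(results):
    """Pick one result when Europe PMC returns several for a query.

    MED (MEDLINE) is preferred, then PMC (PubMed Central), then
    PPR (preprints), and otherwise whichever result is first.
    """
    for source in ("MED", "PMC", "PPR"):
        for result in results:
            if result.get("source") == source:
                return result
    # No preferred source present: fall back to the first result
    return results[0] if results else None
```

For example, given results from the PPR and MED sources (in that order), the MED result would be selected.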

.. _europe_pmc_fulltext:

@@ -19,7 +19,7 @@ Publications can be identified by 3 separate IDs: :ref:`a PMID <id_pmid>`, :ref:

The structure of the values in the publications, webpages and docs stores, i.e. the actual contents_ stored in the database, is best described by the next section `JSON output`_, as the entire content of the database can be exported to an equivalently structured JSON file. Note that the "empty", "usable", "final", "totallyFinal" and "broken" fields present in the JSON output are not stored in the database; these values are inferred from actual database values and depend on some :ref:`fetching <fetching>` parameters. Additionally, the fields "version" and "argv" are specific to JSON only.

-With a new release of PubFetcher, the structure of the database content might change (this involves code in the package `org.edamontology.pubfetcher.core.db <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/db/>`_). Currently, there is no database migration support, which means that the content of existing database files will become unreadable in case of structure updates. If that content is still required, it would need to be refetched to a new database file (created with the new version of PubFetcher).
+With a new release of PubFetcher, the structure of the database content might change (this involves code in the package `org.edamontology.pubfetcher.core.db <https://github.com/edamontology/pubfetcher/tree/master/core/src/main/java/org/edamontology/pubfetcher/core/db>`_). Currently, there is no database migration support, which means that the content of existing database files will become unreadable in case of structure updates. If that content is still required, it would need to be refetched to a new database file (created with the new version of PubFetcher).

.. _json_output:

@@ -154,7 +154,7 @@ Testing of rules

Currently, PubFetcher has no tests or any framework for testing its functionality, except for the scraping rule testing described here. Scraping rules should definitely be tested from time to time, because they depend on external factors, like publishers changing the coding of their web pages.

-Tests for `journals.yaml <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/resources/scrape/journals.yaml>`_ are at `journals.csv <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/resources/scrape/journals.csv>`_ and tests for `webpages.yaml <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/resources/scrape/webpages.yaml>`_ are at `webpages.csv <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/resources/scrape/webpages.csv>`_. If new rules are added to a YAML, then tests covering them should be added to the corresponding CSV. In addition, tests for hardcoded rules of some other resources can be found in the `resources/test <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/resources/test/>`_ directory. All :ref:`Resources <resources>` except :ref:`Meta <meta>` are covered.
+Tests for `journals.yaml <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/resources/scrape/journals.yaml>`_ are at `journals.csv <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/resources/scrape/journals.csv>`_ and tests for `webpages.yaml <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/resources/scrape/webpages.yaml>`_ are at `webpages.csv <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/resources/scrape/webpages.csv>`_. If new rules are added to a YAML, then tests covering them should be added to the corresponding CSV. In addition, tests for hardcoded rules of some other resources can be found in the `resources/test <https://github.com/edamontology/pubfetcher/tree/master/core/src/main/resources/test>`_ directory. All :ref:`Resources <resources>` except :ref:`Meta <meta>` are covered.

The test files are in a simplified CSV format. The very first line is always skipped and should contain a header explaining the columns. Empty lines, lines containing only whitespace and lines starting with ``#`` are also ignored. Otherwise, each line describes a test, with columns separated by ",". Quoting of fields is neither possible nor necessary, as fields are assumed not to contain the "," symbol, with one exception: because the number of columns of a given CSV file is fixed in advance, the last field may contain "," symbols, as its value is taken to be everything from the last separating "," to the end of the line.
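The parsing rules above can be sketched in a few lines. This is a hypothetical illustration, not PubFetcher's actual parser; the function name and the column-count parameter ``n`` are assumptions:

```python
def parse_tests(text, n):
    """Parse the simplified CSV test format into lists of n fields."""
    tests = []
    for line in text.splitlines()[1:]:  # the first line (header) is always skipped
        if not line.strip() or line.startswith("#"):
            continue  # ignore empty, whitespace-only and comment lines
        # split on "," at most n - 1 times, so that the last field
        # keeps any "," symbols it happens to contain
        tests.append(line.split(",", n - 1))
    return tests
```

For a two-column file, a line such as ``123,A title, with commas`` would thus yield the fields ``123`` and ``A title, with commas``.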

