This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

Run ETL pipeline and collect stats on downloads #572

Open
5 tasks
FrancescoCasalegno opened this issue Feb 8, 2022 · 1 comment

@FrancescoCasalegno
Contributor

Context

Once the ETL pipeline that downloads, filters, and parses papers from the various sources is complete (see #562), we need to run it for the first time to check that everything works and to collect statistics about the results.

Actions

  • Our ETL pipeline is not designed to download papers published before a certain date:
    # Data conventions and formats are different prior to these dates. We
    # download only if the starting date is more recent or equal to the
    # respective threshold.
    MIN_DATE = {

    So we need to download those older files manually. Hopefully we only have to do this once.
  • Make sure to download all files to the same location on GPFS, in a well-structured layout. This issue therefore now also covers the scope of Restructure articles storage #509.
  • Define the filter_config file by talking to scientists.
    parser.add_argument(
    "filter_config",
    type=Path,
    help="""
    Path to a .JSONL file that defines all the rules for filtering.
    """,
  • Test the ETL pipeline by running it on the last month (i.e. with --from_date set to last month).
  • Collect statistics about downloaded data. In particular, for each source (arxiv, biorxiv, pmc, ...) we want to know:
    • total number of papers (any topic / with relevant topic)
    • number of full-text papers (any topic / with relevant topic)
    • number of papers by format type, e.g. pdf, xml, ... (any topic / with relevant topic)
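
The statistics listed above could be tallied along the following lines. This is only a sketch under stated assumptions: the record fields (`source`, `format`, `is_fulltext`, `is_relevant`) and the function name `collect_stats` are hypothetical and are not taken from the pipeline's actual output.

```python
from collections import Counter, defaultdict


def collect_stats(records):
    """Tally per-source counts for all papers and for relevant papers only.

    Each record is assumed (hypothetically) to be a dict with the keys
    "source", "format", "is_fulltext" and "is_relevant".
    """
    stats = defaultdict(Counter)
    for rec in records:
        # Every paper counts towards "any"; relevant ones also count towards "relevant".
        scopes = ["any", "relevant"] if rec["is_relevant"] else ["any"]
        for scope in scopes:
            counter = stats[rec["source"]]
            counter[("total", scope)] += 1
            if rec["is_fulltext"]:
                counter[("fulltext", scope)] += 1
            counter[("format:" + rec["format"], scope)] += 1
    return stats
```

For example, `collect_stats(records)["arxiv"][("total", "relevant")]` would give the number of arxiv papers with a relevant topic.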
@EmilieDel
Contributor

EmilieDel commented Mar 10, 2022

Pubmed Analysis

Baseline files

For (half of) the baseline, 562 files:

  • 16860000 articles
  • 16859975 unique UIDs
  • 25 duplicates
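
The three figures above are linked by a simple count: duplicates = articles − unique UIDs. A minimal sketch of that count, assuming the UIDs have already been extracted from the files into a list (the function name `uid_summary` is illustrative, not part of the pipeline):

```python
from collections import Counter


def uid_summary(uids):
    """Return the article count, the unique-UID count and their difference."""
    counts = Counter(uids)
    n_articles = sum(counts.values())
    n_unique = len(counts)
    return {
        "articles": n_articles,
        "unique": n_unique,
        # Extra copies beyond the first occurrence of each UID.
        "duplicates": n_articles - n_unique,
    }
```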

Update files downloaded

Global numbers

For updates_files: all files between pubmed22n1115.xml.gz and pubmed22n1204.xml.gz (2021-12-13 - 2022-02-22 = 71 days):

  • 90 files
  • 1762178 articles
  • 1012859 unique IDs (57.4776 % of the articles)
  • Out of 1012859 unique IDs, 298536 are already present in the (half) baseline (29.47% - might decrease with time as the baseline was created recently (?))
  • Completely new articles: 714323 (40.54 % of the articles)
  • Articles are sometimes present in several files (one of them has 46 copies).
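
The comparison against the baseline can be expressed with set operations. The sketch below is illustrative only: it assumes the UIDs of the update files and of the baseline are available as plain iterables, and the name `compare_with_baseline` is hypothetical.

```python
from collections import Counter


def compare_with_baseline(update_uids, baseline_uids):
    """Summarise how the update files overlap with the baseline.

    Assumes update_uids is a non-empty list with one entry per article.
    """
    copies = Counter(update_uids)
    unique = set(copies)
    known = unique & set(baseline_uids)  # unique IDs already in the baseline
    new = unique - known                 # completely new articles
    total = len(update_uids)
    return {
        "articles": total,
        "unique": len(unique),
        "unique_pct_of_articles": 100 * len(unique) / total,
        "in_baseline": len(known),
        "in_baseline_pct_of_unique": 100 * len(known) / len(unique),
        "new": len(new),
        "new_pct_of_articles": 100 * len(new) / total,
        # Largest number of copies of a single article across the files.
        "max_copies": max(copies.values()),
    }
```

With the numbers reported above, the "new" count is the difference between the unique IDs and those already in the baseline: 1012859 − 298536 = 714323.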

What are the changes?

Analysis between pubmed22n1124.xml (published on 2021-12-19) and pubmed22n1147.xml (published on 2022-01-11)

  • 34102 articles
  • 33692 unique UIDs (98.79 % of the total)
  • 410 duplicates; among those duplicated UIDs:
    • After parsing, 383 show no difference in the title, the author list, or the abstract paragraphs
    • 14 have different titles (only one still differs after lowercasing, and that remaining difference is a change of punctuation)
    • 4 have different author lists (one corrects a typo, one adds an author, one removes a "Prof" title, and one swaps the order of first name and surname)
    • 22 have different abstracts (for 21 of them the number of paragraphs did not change; for the last one, the sentences are split into several paragraphs)
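
A field-by-field comparison of this kind might look like the sketch below. It is not the code used for the analysis above: the parsed record layout (a dict with `title` as a string, `authors` as a list of names, and `abstract` as a list of paragraphs) and both function names are assumptions made for illustration.

```python
def differing_fields(old, new, fields=("title", "authors", "abstract")):
    """Return the names of the fields whose values differ between two versions."""
    return {name for name in fields if old[name] != new[name]}


def title_differs_ignoring_case(old, new):
    """True if the titles still differ after lowercasing both."""
    return old["title"].lower() != new["title"].lower()
```

Two versions of an article whose titles differ only in capitalisation would be reported by `differing_fields` but not by `title_differs_ignoring_case`.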
