This repository has been archived by the owner on Jan 29, 2024. It is now read-only.
Context
Once we are done completing the creation of the ETL pipeline to download, filter, and parse papers from the various sources (see #562), we need to run this pipeline for the first time to ensure that everything works fine and collect statistics about the results.
Actions
Our ETL pipeline is not designed to download papers published before a certain date:
# Data conventions and formats are different prior to these dates. We
# download only if the starting date is more recent or equal to the
# respective threshold.
MIN_DATE= {
So we need to download those old files manually. Hopefully, we have to do this only once.
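The threshold rule described in the comment above can be sketched as follows. This is only an illustration: the source names and dates in `EXAMPLE_MIN_DATE` and the helper `is_downloadable` are hypothetical, and the real values live in `MIN_DATE` in `download.py`.

```python
from datetime import datetime

# Hypothetical thresholds for illustration only. The real per-source values
# are defined in MIN_DATE in download.py and are not reproduced here.
EXAMPLE_MIN_DATE = {
    "source_a": datetime(2020, 1, 1),
    "source_b": datetime(2021, 6, 1),
}


def is_downloadable(source: str, from_date: datetime) -> bool:
    """Return True if from_date is more recent than or equal to the threshold."""
    return from_date >= EXAMPLE_MIN_DATE[source]
```

Anything for which this check fails is what has to be fetched manually, once.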
Make sure to download all files in the same location on GPFS in a well-structured way. In this sense, this issue now includes the scope of Restructure articles storage #509.
Define the filter_config file by talking to scientists.
For updates_files: all files between pubmed22n1115.xml.gz and pubmed22n1204.xml.gz (2021-12-13 - 2022-02-22 = 71 days):
90 files
1762178 articles
1012859 unique IDs (57.4776 % of the articles)
Out of 1012859 unique IDs, 298536 are already present in the (half) baseline (29.47% - might decrease with time as the baseline was created recently (?))
Completely new articles: 714323 (40.54 % of the articles)
Articles are sometimes present in several files (one of them has 46 copies).
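For reference, statistics of this kind could be reproduced with a sketch like the one below. It assumes the standard PubMed XML layout (`PubmedArticle/MedlineCitation/PMID`); the helper names `iter_pmids` and `summarize` are hypothetical and not part of the pipeline.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def iter_pmids(path):
    """Yield the PMID of every PubmedArticle in a gzipped PubMed XML file."""
    with gzip.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == "PubmedArticle":
                pmid = elem.findtext("MedlineCitation/PMID")
                if pmid is not None:
                    yield pmid
                elem.clear()  # release the parsed article to limit memory use


def summarize(update_ids, baseline_ids):
    """Count articles, unique IDs, overlap with the baseline, and max copies."""
    counts = Counter(update_ids)
    unique = set(counts)
    in_baseline = unique & set(baseline_ids)
    return {
        "articles": sum(counts.values()),
        "unique_ids": len(unique),
        "already_in_baseline": len(in_baseline),
        "new": len(unique) - len(in_baseline),
        "max_copies": max(counts.values(), default=0),
    }
```

In practice, `update_ids` would be the concatenation of `iter_pmids(...)` over all update files, and `baseline_ids` the same over the baseline files.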
What are the changes?
Analysis between pubmed22n1124.xml (published on 2021-12-19) and pubmed22n1147.xml (published on 2022-01-11)
34102 articles
33692 unique UIDs (98.80 % of the total)
410 duplicates; among those duplicated UIDs:
After parsing, 383 do not contain any difference in the title, the author list, or the abstract paragraphs
14 have different titles (but only one remains after lowercasing them, and that last one differs only in punctuation)
4 have different author lists (one corrects a typo, one adds an author, one removes a Prof title, and one switches the order of name and surname)
22 have different abstracts (for 21 of them, the number of paragraphs did not change; for the last one, the sentences are split into several paragraphs)
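A comparison like the one above could be sketched as follows. The record layout (a dict with `title`, `authors`, and `abstract` keys) and the helper `diff_fields` are assumptions made for illustration; the actual parsed article objects may look different.

```python
def diff_fields(a, b, lowercase_titles=False):
    """Return the names of the fields that differ between two parsed records.

    Each record is assumed to be a dict with a "title" string, an "authors"
    list, and an "abstract" list of paragraphs (hypothetical layout).
    """
    title_a, title_b = a["title"], b["title"]
    if lowercase_titles:
        # Ignore pure capitalization changes in the title.
        title_a, title_b = title_a.lower(), title_b.lower()

    changed = []
    if title_a != title_b:
        changed.append("title")
    if list(a["authors"]) != list(b["authors"]):
        changed.append("authors")
    if list(a["abstract"]) != list(b["abstract"]):
        changed.append("abstract")
    return changed
```

Running it over every pair of records sharing a UID, with and without `lowercase_titles`, would give counts comparable to the ones listed above.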
Code referenced in the actions above:
Search/src/bluesearch/entrypoint/database/download.py, lines 30 to 33 in 5ed9701 (the date thresholds)
Search/src/bluesearch/entrypoint/database/topic_filter.py, lines 58 to 63 in e2704e2 (for the filter_config file)
--from_date equal to the last month).
arxiv, biorxiv, pmc, ...) we want to know: