This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

bbs_database download arxiv fails with invalid literal for int() #585

Closed
FrancescoCasalegno opened this issue Feb 21, 2022 · 2 comments · Fixed by #586
Labels: CLI (command line functionality), 📥 db-download (download articles from various sources)

Comments

@FrancescoCasalegno
Contributor

🐛 Bug description

The CLI command fails with an uncaught exception when trying to download all arXiv papers published after MIN_DATE.

To reproduce

bbs_database download -v arxiv 2007-04 .

Observed behavior

The command raises ValueError: invalid literal for int() with base 10: ''.
@FrancescoCasalegno FrancescoCasalegno added CLI Command line functionality 📥 db-download download articles from various sources labels Feb 21, 2022
@FrancescoCasalegno
Contributor Author

Full error stack:

~/.local/lib/python3.7/site-packages/bluesearch/entrypoint/database/parent.py in main(argv)
    132
    133     # Run logic
--> 134     return cmds[command].run(**kwargs)

~/.local/lib/python3.7/site-packages/bluesearch/entrypoint/database/download.py in run(source, from_month, output_dir, dry_run)
    206
    207         logger.info("Collecting download URLs")
--> 208         blobs_by_month = get_gcs_urls(bucket, from_month)
    209
    210         if dry_run:

~/.local/lib/python3.7/site-packages/bluesearch/database/download.py in get_gcs_urls(bucket, start_date, end_date)
    262                 for el in iterator
    263             ),
--> 264             columns=["blob", "fullname", "article", "version"],
    265         )
    266

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    561         elif isinstance(data, abc.Iterable) and not isinstance(data, (str, bytes)):
    562             if not isinstance(data, (abc.Sequence, ExtensionArray)):
--> 563                 data = list(data)
    564             if len(data) > 0:
    565                 if is_dataclass(data[0]):

~/.local/lib/python3.7/site-packages/bluesearch/database/download.py in <genexpr>(.0)
    260                     int(el.name.rsplit("v", 1)[1].split(".")[0]),
    261                 )
--> 262                 for el in iterator
    263             ),
    264             columns=["blob", "fullname", "article", "version"],

ValueError: invalid literal for int() with base 10: ''
> /home/casalegn/.local/lib/python3.7/site-packages/bluesearch/database/download.py(262)<genexpr>()
    260                     int(el.name.rsplit("v", 1)[1].split(".")[0]),
    261                 )
--> 262                 for el in iterator
    263             ),
    264             columns=["blob", "fullname", "article", "version"],

ipdb>
ipdb> l
    257                     el,
    258                     el.name,
    259                     el.name.rsplit("v", 1)[0],
    260                     int(el.name.rsplit("v", 1)[1].split(".")[0]),
    261                 )
--> 262                 for el in iterator
    263             ),
    264             columns=["blob", "fullname", "article", "version"],
    265         )
    266
    267         df_latest = df[["article", "version"]].groupby("article", as_index=False).max()

@FrancescoCasalegno
Contributor Author

Diagnosis

The error is caused by the following line failing:

int(el.name.rsplit("v", 1)[1].split(".")[0]),

when el is any of the following files:

1808.02949v1.1.pdf
1808.02949v1.2.pdf
1808.02949v1.3.pdf
1808.02949v1.3a.pdf
1808.02949v1.4.pdf
1808.02949v1.4h.pdf
1808.02949v1.5.pdf
1808.02949v1.6h.pdf
1808.02949v1.6v.pdf
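
To see where the empty string comes from, the parsing expression can be run on bare file names. This is a minimal sketch: `parse_version` is a hypothetical wrapper around the expression quoted above, `1808.02949v2.pdf` is an illustrative well-formed name, and the real `el.name` is a blob name that may carry a path prefix, which is not modelled here.

```python
def parse_version(name: str) -> int:
    """Hypothetical wrapper replicating the version-parsing expression."""
    return int(name.rsplit("v", 1)[1].split(".")[0])


# Illustrative well-formed name: the text after the last "v" is "2.pdf",
# so the first dot-separated field is "2".
regular = parse_version("1808.02949v2.pdf")

# Here the last "v" is the one in "6v", so the remainder is ".pdf",
# its first dot-separated field is "", and int("") raises ValueError.
try:
    parse_version("1808.02949v1.6v.pdf")
    error = None
except ValueError as exc:
    error = str(exc)

# A name such as "1808.02949v1.1.pdf" does not raise in this sketch:
# the remainder is "1.1.pdf" and the parsed version is 1.
sub_versioned = parse_version("1808.02949v1.1.pdf")
```

On bare names, only the `6v` entry triggers the exception in this sketch; the other listed names parse as version 1, so excluding them needs a stricter check on the name than the current expression provides.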

Note that all of these files belong to a single article version: the faulty 1808.02949v1. I call it "faulty" because even arXiv cannot serve the PDF for this version; see https://arxiv.org/pdf/1808.02949v1.pdf.

Proposed Solution

Note that get_arxiv_id() also fails on the PDF files listed above. For consistency, I think we can simply skip these files when downloading.
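
One way to skip such files could look like the sketch below. It is not the fix that was merged: the regular expression, `split_article_version`, and the sample name `1808.02949v2.pdf` are assumptions made for illustration, and it deliberately does not call `get_arxiv_id()`, whose signature is not shown in this issue.

```python
import re
from typing import List, Optional, Tuple

# Assumed naming scheme: "<article>v<digits>.pdf", nothing after the version.
_NAME_RE = re.compile(r"(?P<article>.+)v(?P<version>\d+)\.pdf")


def split_article_version(name: str) -> Optional[Tuple[str, int]]:
    """Return (article, version), or None if the name is malformed."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        return None
    return match.group("article"), int(match.group("version"))


names = [
    "1808.02949v2.pdf",  # hypothetical well-formed name
    "1808.02949v1.1.pdf",
    "1808.02949v1.2.pdf",
    "1808.02949v1.3.pdf",
    "1808.02949v1.3a.pdf",
    "1808.02949v1.4.pdf",
    "1808.02949v1.4h.pdf",
    "1808.02949v1.5.pdf",
    "1808.02949v1.6h.pdf",
    "1808.02949v1.6v.pdf",
]

kept: List[Tuple[str, str, int]] = []
skipped: List[str] = []
for name in names:
    parsed = split_article_version(name)
    if parsed is None:
        skipped.append(name)  # malformed name: do not download
    else:
        kept.append((name, parsed[0], parsed[1]))
```

With this check, all nine names from the list above land in `skipped`, and only names ending in a plain `v<digits>.pdf` reach the DataFrame construction.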
