`bbs_database download arxiv` fails with `invalid literal for int()` #585

FrancescoCasalegno · 2022-02-21T08:59:28Z

🐛 Bug description

The CLI command fails with an uncaught exception when trying to download all arXiv papers published after MIN_DATE.

To reproduce

bbs_database download -v arxiv 2007-04 .

Expected behavior

The download should complete without errors. Instead, the command raises `ValueError: invalid literal for int() with base 10: ''`.

FrancescoCasalegno · 2022-02-21T09:38:39Z

Full error stack:

~/.local/lib/python3.7/site-packages/bluesearch/entrypoint/database/parent.py in main(argv)
    132
    133     # Run logic
--> 134     return cmds[command].run(**kwargs)

~/.local/lib/python3.7/site-packages/bluesearch/entrypoint/database/download.py in run(source, from_month, output_dir, dry_run)
    206
    207         logger.info("Collecting download URLs")
--> 208         blobs_by_month = get_gcs_urls(bucket, from_month)
    209
    210         if dry_run:

~/.local/lib/python3.7/site-packages/bluesearch/database/download.py in get_gcs_urls(bucket, start_date, end_date)
    262                 for el in iterator
    263             ),
--> 264             columns=["blob", "fullname", "article", "version"],
    265         )
    266

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    561         elif isinstance(data, abc.Iterable) and not isinstance(data, (str, bytes)):
    562             if not isinstance(data, (abc.Sequence, ExtensionArray)):
--> 563                 data = list(data)
    564             if len(data) > 0:
    565                 if is_dataclass(data[0]):

~/.local/lib/python3.7/site-packages/bluesearch/database/download.py in <genexpr>(.0)
    260                     int(el.name.rsplit("v", 1)[1].split(".")[0]),
    261                 )
--> 262                 for el in iterator
    263             ),
    264             columns=["blob", "fullname", "article", "version"],

ValueError: invalid literal for int() with base 10: ''
> /home/casalegn/.local/lib/python3.7/site-packages/bluesearch/database/download.py(262)<genexpr>()
    260                     int(el.name.rsplit("v", 1)[1].split(".")[0]),
    261                 )
--> 262                 for el in iterator
    263             ),
    264             columns=["blob", "fullname", "article", "version"],

ipdb>
ipdb> l
    257                     el,
    258                     el.name,
    259                     el.name.rsplit("v", 1)[0],
    260                     int(el.name.rsplit("v", 1)[1].split(".")[0]),
    261                 )
--> 262                 for el in iterator
    263             ),
    264             columns=["blob", "fullname", "article", "version"],
    265         )
    266
    267         df_latest = df[["article", "version"]].groupby("article", as_index=False).max()

FrancescoCasalegno · 2022-02-21T10:00:21Z

Diagnosis

The error is raised by the following line:

Search/src/bluesearch/database/download.py

Line 260 in 5ed9701

int(el.name.rsplit("v", 1)[1].split(".")[0]),

when el is any of the following files:

1808.02949v1.1.pdf
1808.02949v1.2.pdf
1808.02949v1.3.pdf
1808.02949v1.3a.pdf
1808.02949v1.4.pdf
1808.02949v1.4h.pdf
1808.02949v1.5.pdf
1808.02949v1.6h.pdf
1808.02949v1.6v.pdf

Note that all of these files belong to a single article version: the faulty 1808.02949v1. I say "faulty" because even arXiv cannot retrieve the PDF for this article version; see https://arxiv.org/pdf/1808.02949v1.pdf.
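To reproduce the failure in isolation, here is a minimal sketch that applies the same expression as line 260 to plain strings. It assumes that `el.name` ends with the bare file names listed above; `parse_version` is a hypothetical helper introduced only for this illustration, not a function from `bluesearch`.

```python
def parse_version(name: str) -> int:
    """Apply the expression from line 260 of download.py to a blob name."""
    return int(name.rsplit("v", 1)[1].split(".")[0])


names = [
    "1808.02949v1.pdf",  # well-formed name, for comparison
    "1808.02949v1.1.pdf",
    "1808.02949v1.2.pdf",
    "1808.02949v1.3.pdf",
    "1808.02949v1.3a.pdf",
    "1808.02949v1.4.pdf",
    "1808.02949v1.4h.pdf",
    "1808.02949v1.5.pdf",
    "1808.02949v1.6h.pdf",
    "1808.02949v1.6v.pdf",
]

# Record either the parsed version or the error message for each name.
results = {}
for name in names:
    try:
        results[name] = parse_version(name)
    except ValueError as exc:
        results[name] = str(exc)

for name, outcome in results.items():
    print(f"{name:25} -> {outcome!r}")
```

Under these assumptions, the empty string in the error message comes from a name whose last `v` is immediately followed by a dot, such as `1808.02949v1.6v.pdf`: `rsplit("v", 1)` then leaves `.pdf`, whose first dot-separated piece is `''`. The other malformed names parse without raising, so a fix probably needs to validate the whole file name rather than only catch the `ValueError`.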

Proposed Solution

Note that get_arxiv_id() also fails on the PDF files listed above. So, for consistency, I think we can simply skip downloading these files.
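One possible way to skip such files is to validate the file name before parsing it. The sketch below is only an illustration and is not necessarily what #586 implements: the regular expression, the helper name `split_article_version`, and the restriction to new-style arXiv IDs (`YYMM.NNNNN`) are all assumptions.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: new-style arXiv ID, "v", an integer version, then ".pdf".
_ARXIV_PDF = re.compile(r"(?P<article>\d{4}\.\d{4,5})v(?P<version>\d+)\.pdf")


def split_article_version(name: str) -> Optional[Tuple[str, int]]:
    """Return (article ID, version) for a well-formed name, else None.

    Hypothetical helper: any path prefix before the last "/" is ignored.
    """
    basename = name.rsplit("/", 1)[-1]
    match = _ARXIV_PDF.fullmatch(basename)
    if match is None:
        return None  # malformed ID or version: the caller can skip the file
    return match["article"], int(match["version"])


candidates = ["1808.02949v1.pdf", "1808.02949v1.3a.pdf", "1808.02949v1.6v.pdf"]
kept = [name for name in candidates if split_article_version(name) is not None]
print(kept)
```

Returning `None` instead of raising lets the caller filter out malformed blobs before building the DataFrame, so one bad file name no longer aborts the whole download.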

FrancescoCasalegno added the "CLI" (Command line functionality) and "📥 db-download" (download articles from various sources) labels Feb 21, 2022

FrancescoCasalegno self-assigned this Feb 21, 2022

FrancescoCasalegno mentioned this issue Feb 21, 2022 in "Skip download for arXiv articles with broken ID or version" #586 (merged)

FrancescoCasalegno closed this as completed in #586 Feb 21, 2022
