This repository has been archived by the owner on Jan 29, 2024. It is now read-only.
bbs_database download arxiv fails with invalid literal for int() #585
FrancescoCasalegno added the CLI (Command line functionality) and 📥 db-download (download articles from various sources) labels on Feb 21, 2022
Full error stack:
~/.local/lib/python3.7/site-packages/bluesearch/entrypoint/database/parent.py in main(argv)
132
133 # Run logic
--> 134 return cmds[command].run(**kwargs)
~/.local/lib/python3.7/site-packages/bluesearch/entrypoint/database/download.py in run(source, from_month, output_dir, dry_run)
206
207 logger.info("Collecting download URLs")
--> 208 blobs_by_month = get_gcs_urls(bucket, from_month)
209
210 if dry_run:
~/.local/lib/python3.7/site-packages/bluesearch/database/download.py in get_gcs_urls(bucket, start_date, end_date)
262 for el in iterator
263 ),
--> 264 columns=["blob", "fullname", "article", "version"],
265 )
266
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
561 elif isinstance(data, abc.Iterable) and not isinstance(data, (str, bytes)):
562 if not isinstance(data, (abc.Sequence, ExtensionArray)):
--> 563 data = list(data)
564 if len(data) > 0:
565 if is_dataclass(data[0]):
~/.local/lib/python3.7/site-packages/bluesearch/database/download.py in <genexpr>(.0)
260 int(el.name.rsplit("v", 1)[1].split(".")[0]),
261 )
--> 262 for el in iterator
263 ),
264 columns=["blob", "fullname", "article", "version"],
ValueError: invalid literal for int() with base 10: ''
> /home/casalegn/.local/lib/python3.7/site-packages/bluesearch/database/download.py(262)<genexpr>()
260 int(el.name.rsplit("v", 1)[1].split(".")[0]),
261 )
--> 262 for el in iterator
263 ),
264 columns=["blob", "fullname", "article", "version"],
ipdb>
ipdb> l
257 el,
258 el.name,
259 el.name.rsplit("v", 1)[0],
260 int(el.name.rsplit("v", 1)[1].split(".")[0]),
261 )
--> 262 for el in iterator
263 ),
264 columns=["blob", "fullname", "article", "version"],
265 )
266
267 df_latest = df[["article", "version"]].groupby("article", as_index=False).max()
Diagnosis
The error is due to the following line failing:
Search/src/bluesearch/database/download.py Line 260 in 5ed9701
when el is any of the following files:
1808.02949v1.1.pdf
1808.02949v1.2.pdf
1808.02949v1.3.pdf
1808.02949v1.3a.pdf
1808.02949v1.4.pdf
1808.02949v1.4h.pdf
1808.02949v1.5.pdf
1808.02949v1.6h.pdf
1808.02949v1.6v.pdf
So notice that all these issues are related to only one article: the faulty
Proposed Solution
Notice that
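As an illustration of the diagnosis, the sketch below reproduces the `ValueError` with the expression from line 260 of the traceback and shows one possible stricter parser. The helper names `original_version` and `parse_article_version` are hypothetical and not part of bluesearch. Treating file names that do not match `<article>v<version>.pdf` as skippable is an assumption, not necessarily the fix adopted in the project.

```python
import re

# Hypothetical helper pattern (not part of bluesearch): accept only names of
# the form "<article>v<version>.pdf", where <version> is a plain integer.
_VERSIONED_PDF = re.compile(r"^(?P<article>.+)v(?P<version>\d+)\.pdf$")


def original_version(name):
    """Version parsing exactly as written on line 260 in the traceback."""
    return int(name.rsplit("v", 1)[1].split(".")[0])


def parse_article_version(name):
    """Return (article, version), or None if the name is not well formed."""
    match = _VERSIONED_PDF.match(name)
    if match is None:
        return None
    return match.group("article"), int(match.group("version"))


if __name__ == "__main__":
    # The last "v" in this name is followed by ".pdf", so the original
    # expression ends up calling int("").
    try:
        original_version("1808.02949v1.6v.pdf")
    except ValueError as exc:
        print(f"original expression fails: {exc}")

    # The stricter parser rejects the odd name and accepts a regular one.
    print(parse_article_version("1808.02949v1.6v.pdf"))
    print(parse_article_version("1808.02949v1.pdf"))
```

With a helper of this kind, the generator that feeds the `DataFrame` could skip any blob for which the parser returns `None`, so that a single irregular article no longer aborts the whole download.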
4 tasks
🐛 Bug description
The CLI command fails with an uncaught exception when trying to download all arXiv papers published after MIN_DATE.
To reproduce
bbs_database download -v arxiv 2007-04 .
Expected behavior