
Issue with date_from and date_until #8

Open
vgoel38 opened this issue May 12, 2020 · 10 comments

@vgoel38

vgoel38 commented May 12, 2020

I copied the following URL from the output of the program. The URL requests records between 2019-01-01 and 2019-05-10.

URL: http://export.arxiv.org/oai2?verb=ListRecords&from=2019-01-01&until=2019-05-10&metadataPrefix=arXiv&set=cs

But many of the records I got lie outside this date range (e.g., the first record, which is from 2007).

Am I missing something? I am not sure if the issue is with the code or with the arXiv API.

@thecheeseontoast

Having the same issue; only 10% of the returned papers were within the requested date range.

@Mahdisadjadi
Owner

Mahdisadjadi commented Jun 22, 2020

@vgoel38 and @thecheeseontoast: Thank you for raising the issue. The scraper returns two date columns for each record:

  • created
  • updated

If the updated date is within the specified range, the record is still returned even when the created date is out of the range. ArXiv specifically mentions this here:

Every OAI-PMH metadata record has a datestamp associated with it, which is the last modification time of that record. Because arXiv has updated metadata records in bulk on several occasions, the OAI-PMH datestamp values do not correspond with the original submission or replacement times for older articles, and may not for newer articles because of administrative and bibliographic updates. The earliest datestamp is given in the earliestDatestamp element of the Identify response.

If it would be useful, I can slightly modify the behavior to use earliestDatestamp in addition to the last datestamp.
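
In the meantime, here is a rough sketch of how to spot the affected records in the scraper's output. The column names follow the README, so adjust them if your install differs; pandas is only used here for convenience.

import pandas as pd
import arxivscraper

scraper = arxivscraper.Scraper(category='cs', date_from='2019-01-01', date_until='2019-05-10')
output = scraper.scrape()

# Column names as listed in the README; adjust if your version differs.
cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')
df = pd.DataFrame(output, columns=cols)

created = pd.to_datetime(df['created'], errors='coerce')
updated = pd.to_datetime(df['updated'], errors='coerce')

def in_window(dates):
    # True where a date falls inside the requested from/until window.
    return (dates >= '2019-01-01') & (dates <= '2019-05-10')

# Records that were only *updated* in the window -- the case described above.
only_updated = df[~in_window(created) & in_window(updated)]
print(len(only_updated), 'of', len(df), 'records were merely updated in the window')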

@ChakreshIITGN

I notice that even some dates in the "updated" column are out of the range.

@Mahdisadjadi
Owner

Mahdisadjadi commented Jun 24, 2020

@ChakreshIITGN That's right. The edit doesn't have to be done by the authors. When ArXiv runs a bulk job, it modifies the datestamps.

The OAI-PMH interface does not support selective harvesting based on submission date. The datestamps are designed to support incremental harvesting of updates on an ongoing basis. It is not possible to selectively harvest only, say, articles submitted in February 2001 (identifiers 0102.xxxx). Except for selective harvesting based on subject areas (see description of Sets below) the interface is designed to support copying and synchronization of a complete set of arXiv metadata. In order to harvest metadata for all articles, either make requests without a datestamp range (recommended), or make requests from the earliestDatestamp through to the present (but beware that because of bulk updates there are some dates on which there were large numbers of updates). [source]

I am not sure what the best way to proceed is, but I'm considering various options.
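
To see the mismatch directly, here is a small sketch, independent of the scraper, that sends the ListRecords request from the first comment and prints each record's OAI datestamp, which is what from/until filter on, next to its created and updated dates. The namespaces are the standard OAI-PMH and arXiv metadata ones.

import urllib.request
import xml.etree.ElementTree as ET

# Same request as in the first comment above.
url = ('http://export.arxiv.org/oai2?verb=ListRecords'
       '&from=2019-01-01&until=2019-05-10&metadataPrefix=arXiv&set=cs')
OAI = '{http://www.openarchives.org/OAI/2.0/}'
ARXIV = '{http://arxiv.org/OAI/arXiv/}'

with urllib.request.urlopen(url) as response:
    root = ET.fromstring(response.read())

# from/until apply to the header datestamp (last modification), not to created.
for record in root.findall(OAI + 'ListRecords/' + OAI + 'record')[:10]:
    datestamp = record.findtext(OAI + 'header/' + OAI + 'datestamp')
    created = record.findtext(OAI + 'metadata/' + ARXIV + 'arXiv/' + ARXIV + 'created')
    updated = record.findtext(OAI + 'metadata/' + ARXIV + 'arXiv/' + ARXIV + 'updated')
    print('datestamp:', datestamp, ' created:', created, ' updated:', updated)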

@valayDave

Hey, great tool! I found a bug in the Record._get_authors method: sometimes the author tag doesn't have forenames.

Bug reproduction:

import arxivscraper

scraper = arxivscraper.Scraper(category='cs', date_from='2020-06-25', date_until='2020-06-27')
output = scraper.scrape()
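
In case it helps, here is a rough sketch of the kind of guard that avoids the crash. It is written against the arXiv OAI metadata layout (authors/author elements with a keyname and an optional forenames child) rather than the library's exact internals, so treat the names as illustrative.

import xml.etree.ElementTree as ET

ARXIV = '{http://arxiv.org/OAI/arXiv/}'

def get_authors(arxiv_elem):
    # Build 'forenames keyname' strings, falling back to the keyname alone when
    # the forenames child is missing (the case that triggers the crash).
    authors = []
    for author in arxiv_elem.findall(ARXIV + 'authors/' + ARXIV + 'author'):
        keyname = author.findtext(ARXIV + 'keyname', default='')
        forenames = author.findtext(ARXIV + 'forenames', default='')
        full_name = (forenames + ' ' + keyname).strip()
        if full_name:
            authors.append(full_name.lower())
    return authors

# Example: an author element without forenames no longer raises an error.
sample = ET.fromstring(
    '<arXiv xmlns="http://arxiv.org/OAI/arXiv/">'
    '<authors><author><keyname>Bourbaki</keyname></author></authors></arXiv>'
)
print(get_authors(sample))  # ['bourbaki']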

@Mahdisadjadi
Owner

@valayDave: Did you install with pip or from the repo?

@valayDave

I installed with pip, not from source.

@Mahdisadjadi
Owner

@valayDave Sorry, the pip version is lagging, but this issue should be fixed in the source.

@Mahdisadjadi
Owner

Mahdisadjadi commented Sep 19, 2020

@valayDave The pip version is updated to the latest, so this bug should be fixed.

@csrajath

csrajath commented Sep 20, 2020

@Mahdisadjadi One way to get around this that I thought of: the get_metadata() method has a time key in its dictionary output for every record, and this time is the original time of submission. We could therefore use the value of this key (time) as a conditional check against from and until.
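
Roughly like the sketch below. I could not verify the exact key name in the current release, so the 'created' field from the output records stands in for the submission time here, and it assumes each record comes back as a dictionary keyed by the column names shown in the README.

import arxivscraper

def scrape_submitted_between(category, date_from, date_until):
    # Scrape as usual, then keep only records whose submission ('created') date
    # falls inside the requested window; ISO dates compare correctly as strings.
    scraper = arxivscraper.Scraper(category=category, date_from=date_from, date_until=date_until)
    records = scraper.scrape()
    return [r for r in records if date_from <= r['created'] <= date_until]

output = scrape_submitted_between('cs', '2019-01-01', '2019-05-10')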
