
Issue with date_from and date_until #8

Open
vgoel38 opened this issue May 12, 2020 · 10 comments

@vgoel38

vgoel38 commented May 12, 2020

I copied the following URL from the output of the program. The URL requests records between 2019-01-01 and 2019-05-10.

URL: http://export.arxiv.org/oai2?verb=ListRecords&from=2019-01-01&until=2019-05-10&metadataPrefix=arXiv&set=cs

But many of the records I got lie outside this date range (e.g., the first record, which is from 2007).

Am I missing something? I am not sure if the issue is with the code or with the arXiv API.

@thecheeseontoast

Having the same issue; only 10% of the returned papers were within the requested date range.

@Mahdisadjadi
Owner

Mahdisadjadi commented Jun 22, 2020

@vgoel38 and @thecheeseontoast: Thank you for raising the issue. The scraper returns two date columns for each record:

  • created
  • updated

If the updated date is within the specified range, the record is still returned even when the created date is out of the range. ArXiv specifically mentions this here:

Every OAI-PMH metadata record has a datestamp associated with it, which is the last modification time of that record. Because arXiv has updated metadata records in bulk on several occasions, the OAI-PMH datestamp values do not correspond with the original submission or replacement times for older articles, and may not for newer articles because of administrative and bibliographic updates. The earliest datestamp is given in the earliestDatestamp element of the Identify response.

If it would be useful, I can slightly modify the behavior to use earliestDatestamp in addition to the last datestamp.
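
In the meantime, here is a rough sketch of how to spot the affected records in the scraper's output. The column names follow the README, so adjust them if your install differs; pandas is only used here for convenience.

import pandas as pd
import arxivscraper

scraper = arxivscraper.Scraper(category='cs', date_from='2019-01-01', date_until='2019-05-10')
output = scraper.scrape()

# Column names as listed in the README; adjust if your version differs.
cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')
df = pd.DataFrame(output, columns=cols)

created = pd.to_datetime(df['created'], errors='coerce')
updated = pd.to_datetime(df['updated'], errors='coerce')

def in_window(dates):
    # True where a date falls inside the requested from/until window.
    return (dates >= '2019-01-01') & (dates <= '2019-05-10')

# Records that were only *updated* in the window -- the case described above.
only_updated = df[~in_window(created) & in_window(updated)]
print(len(only_updated), 'of', len(df), 'records were merely updated in the window')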

@ChakreshIITGN

I notice that even some dates in the "updated" column are out of the range.

@Mahdisadjadi
Owner

Mahdisadjadi commented Jun 24, 2020

@ChakreshIITGN That's right. The edit doesn't have to be done by the authors. When ArXiv runs a bulk job, it modifies the datestamps.

The OAI-PMH interface does not support selective harvesting based on submission date. The datestamps are designed to support incremental harvesting of updates on an ongoing basis. It is not possible to selectively harvest only, say, articles submitted in February 2001 (identifiers 0102.xxxx). Except for selective harvesting based on subject areas (see description of Sets below) the interface is designed to support copying and synchronization of a complete set of arXiv metadata. In order to harvest metadata for all articles, either make requests without a datestamp range (recommended), or make requests from the earliestDatestamp through to the present (but beware that because of bulk updates there are some dates on which there were large numbers of updates). [source]

I am not sure what the best way to proceed is, but I'm considering various options.
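
To see the mismatch directly, here is a small sketch, independent of the scraper, that sends the ListRecords request from the first comment and prints each record's OAI datestamp, which is what from/until filter on, next to its created and updated dates. The namespaces are the standard OAI-PMH and arXiv metadata ones.

import urllib.request
import xml.etree.ElementTree as ET

# Same request as in the first comment above.
url = ('http://export.arxiv.org/oai2?verb=ListRecords'
       '&from=2019-01-01&until=2019-05-10&metadataPrefix=arXiv&set=cs')
OAI = '{http://www.openarchives.org/OAI/2.0/}'
ARXIV = '{http://arxiv.org/OAI/arXiv/}'

with urllib.request.urlopen(url) as response:
    root = ET.fromstring(response.read())

# from/until apply to the header datestamp (last modification), not to created.
for record in root.findall(OAI + 'ListRecords/' + OAI + 'record')[:10]:
    datestamp = record.findtext(OAI + 'header/' + OAI + 'datestamp')
    created = record.findtext(OAI + 'metadata/' + ARXIV + 'arXiv/' + ARXIV + 'created')
    updated = record.findtext(OAI + 'metadata/' + ARXIV + 'arXiv/' + ARXIV + 'updated')
    print('datestamp:', datestamp, ' created:', created, ' updated:', updated)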

@valayDave

Hey, great tool! I found a bug in the Record._get_authors method: sometimes the author tag doesn't have forenames.

Bug reproduction:

import arxivscraper

scraper = arxivscraper.Scraper(category='cs', date_from='2020-06-25', date_until='2020-06-27')
output = scraper.scrape()
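
In case it helps, here is a rough sketch of the kind of guard that avoids the crash. It is written against the arXiv OAI metadata layout (authors/author elements with a keyname and an optional forenames child) rather than the library's exact internals, so treat the names as illustrative.

import xml.etree.ElementTree as ET

ARXIV = '{http://arxiv.org/OAI/arXiv/}'

def get_authors(arxiv_elem):
    # Build 'forenames keyname' strings, falling back to the keyname alone when
    # the forenames child is missing (the case that triggers the crash).
    authors = []
    for author in arxiv_elem.findall(ARXIV + 'authors/' + ARXIV + 'author'):
        keyname = author.findtext(ARXIV + 'keyname', default='')
        forenames = author.findtext(ARXIV + 'forenames', default='')
        full_name = (forenames + ' ' + keyname).strip()
        if full_name:
            authors.append(full_name.lower())
    return authors

# Example: an author element without forenames no longer raises an error.
sample = ET.fromstring(
    '<arXiv xmlns="http://arxiv.org/OAI/arXiv/">'
    '<authors><author><keyname>Bourbaki</keyname></author></authors></arXiv>'
)
print(get_authors(sample))  # ['bourbaki']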

@Mahdisadjadi
Owner

@valayDave: Did you install with pip or from the repo?

@valayDave

I installed with pip, not from source.

@Mahdisadjadi
Owner

@valayDave Sorry, the pip version is lagging, but this issue should be fixed in the source.

@Mahdisadjadi
Owner

Mahdisadjadi commented Sep 19, 2020

@valayDave The pip version is updated to the latest, so this bug should be fixed.

@csrajath

csrajath commented Sep 20, 2020

@Mahdisadjadi One way to get around this that I thought of: the get_metadata() method has a time key in its dictionary output for every record, and this time is the original time of submission. We could therefore use the value of this key (time) as a conditional check against from and until.
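
Roughly like the sketch below. I could not verify the exact key name in the current release, so the 'created' field from the output records stands in for the submission time here, and it assumes each record comes back as a dictionary keyed by the column names shown in the README.

import arxivscraper

def scrape_submitted_between(category, date_from, date_until):
    # Scrape as usual, then keep only records whose submission ('created') date
    # falls inside the requested window; ISO dates compare correctly as strings.
    scraper = arxivscraper.Scraper(category=category, date_from=date_from, date_until=date_until)
    records = scraper.scrape()
    return [r for r in records if date_from <= r['created'] <= date_until]

output = scrape_submitted_between('cs', '2019-01-01', '2019-05-10')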
