Queries that specify before and after can return a different number of results than reported as available by Pushshift #13

Open
mattpodolak opened this issue Apr 21, 2021 · 1 comment
Labels: bug (Something isn't working)

Comments

Test Query:

comments = api.search_comments(
    after=1606262347,
    before=1618581599,
    subreddit="CovidVaccinated",
    fields=[
        "id", "subreddit", "link_id", "parent_id", "is_submitter", "author",
        "author_fullname", "body", "score", "created_utc", "permalink",
    ],
    limit=None,
)

Results:

40730 result(s) available in Pushshift
Checkpoint:: Success Rate: 71.00% - Requests: 100 - Batches: 10 - Items Remaining: 33898
Checkpoint:: Success Rate: 79.00% - Requests: 200 - Batches: 20 - Items Remaining: 25661
Checkpoint:: Success Rate: 81.67% - Requests: 300 - Batches: 30 - Items Remaining: 18163
Checkpoint:: Success Rate: 81.75% - Requests: 400 - Batches: 40 - Items Remaining: 11467
Checkpoint:: Success Rate: 82.80% - Requests: 500 - Batches: 50 - Items Remaining: 4262
Checkpoint:: Success Rate: 83.02% - Requests: 583 - Batches: 60 - Items Remaining: 1
Total:: Success Rate: 83.02% - Requests: 583 - Batches: 60 - Items Remaining: 1
1 result(s) not found in Pushshift

Discovered in #12

mattpodolak added the bug label Apr 21, 2021
mattpodolak self-assigned this Apr 21, 2021

mattpodolak commented Apr 21, 2021

A potential cause is how the database is queried during time slicing. The created_utc timestamp of the oldest item returned is used as the before value when generating subsequent time slices, and Pushshift queries the database using gt and lt (strict comparisons) for the after and before timestamps.

If multiple items share the exact same created_utc timestamp but are not all returned in a single query (due to the 100-item limit), the ones left out may never be returned in subsequent time slices, because the next slice's before bound excludes that timestamp.
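
To make the suspected failure mode concrete, here is a minimal, self-contained simulation. It does not call Pushshift. The `query` and `fetch_all` helpers, the page limit of 3 (standing in for the 100-item limit), and the newest-first ordering are all assumptions made for illustration, not the library's actual code.

```python
from typing import Dict, List

PAGE_LIMIT = 3  # hypothetical stand-in for the 100-item limit


def query(items: List[Dict], after: int, before: int, limit: int = PAGE_LIMIT) -> List[Dict]:
    """Return up to `limit` newest items with after < created_utc < before.

    Both bounds are strict, mirroring gt/lt comparisons.
    """
    matches = [it for it in items if after < it["created_utc"] < before]
    matches.sort(key=lambda it: it["created_utc"], reverse=True)
    return matches[:limit]


def fetch_all(items: List[Dict], after: int, before: int) -> List[Dict]:
    """Page backwards in time, reusing the oldest timestamp as the next `before`."""
    collected = []
    while True:
        batch = query(items, after=after, before=before)
        if not batch:
            break
        collected.extend(batch)
        # Oldest timestamp in this batch becomes the next (strict) upper bound.
        before = min(it["created_utc"] for it in batch)
    return collected


items = [
    {"id": "a", "created_utc": 105},
    {"id": "b", "created_utc": 104},
    {"id": "c", "created_utc": 103},
    {"id": "d", "created_utc": 103},  # same timestamp as "c"
    {"id": "e", "created_utc": 102},
]

found = fetch_all(items, after=100, before=110)
missing = sorted({it["id"] for it in items} - {it["id"] for it in found})
print(f"{len(items)} available; {len(found)} retrieved; missing: {missing}")
```

In this toy run the first page ends on one of the two items stamped 103, so the next slice (before=103) skips the other one, and one of the five items is never retrieved. That is the same shape as the "1 result(s) not found" outcome above, though it does not prove this is the cause in the real query.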

mattpodolak changed the title from "Sometimes items can be missing in results" to "Queries that specify before and after can return a different number of results than reported as available by Pushshift" Apr 21, 2021