Issue with limit? #57

Open
ranbix666 opened this issue Jan 17, 2023 · 7 comments

Comments

@ranbix666

ranbix666 commented Jan 17, 2023

Hi Matthew, thank you so much for your great work on PMAW!

I tried your example with limit = 100000, but it seems that 0 comments are retrieved whenever the limit is greater than 1000.

    import datetime as dt
    import pmaw

    api = pmaw.PushshiftAPI()  # client setup was not shown in the original snippet

    before = int(dt.datetime(2021, 2, 1, 0, 0).timestamp())
    after = int(dt.datetime(2020, 12, 1, 0, 0).timestamp())

    subreddit = "wallstreetbets"
    limit = 100000
    comments = api.search_comments(subreddit=subreddit, limit=limit, before=before, after=after)
    print(f'Retrieved {len(comments)} comments from Pushshift')

The log:

WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
Retrieved 0 comments from Pushshift

I also tried limit = 100, 1000, and 1001; only values greater than 1000 retrieve 0 comments.

Can you please let me know if I missed anything? Thanks!

@eddvrs
Contributor

eddvrs commented Jan 25, 2023

Hi @ranbix666

The parameter names for before and after have changed to "until" and "since", so try this line instead:

    comments = api.search_comments(subreddit=subreddit, limit=limit, until=before, since=after)

Additionally, the Pushshift API itself is undergoing a major migration. As a result, there is not (yet) any data from before November 2022, so along with the above change, try changing the date range as well.

The following code returns the expected count for me:

    import datetime as dt
    import pmaw

    api = pmaw.PushshiftAPI()

    # Date range inside the period that currently has data (after November 2022)
    before = int(dt.datetime(2023, 1, 25, 0, 0).timestamp())
    after = int(dt.datetime(2023, 1, 1, 0, 0).timestamp())

    subreddit = "wallstreetbets"
    limit = 301
    comments = api.search_comments(subreddit=subreddit, limit=limit, until=before, since=after)
    print(f'Retrieved {len(comments)} comments from Pushshift')
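
For older scripts that still pass before and after, the rename described above can be applied in one place instead of editing every call. The following is only a minimal sketch: `translate_time_params` is a hypothetical helper, not part of PMAW, and the epoch values are placeholder integers.

```python
def translate_time_params(params):
    """Return a copy of params with before/after renamed to until/since.

    Hypothetical helper for illustration; it is not part of PMAW.
    """
    renames = {"before": "until", "after": "since"}
    translated = {}
    for key, value in params.items():
        new_key = renames.get(key, key)
        if new_key in translated:
            # Both the old and the new name were supplied; refuse to guess.
            raise ValueError(f"conflicting values for {new_key!r}")
        translated[new_key] = value
    return translated


# Placeholder epoch seconds; real scripts would compute these from datetimes.
old_style = {
    "subreddit": "wallstreetbets",
    "limit": 301,
    "before": 1674604800,
    "after": 1672531200,
}
new_style = translate_time_params(old_style)
print(new_style)
```

The translated dictionary could then be unpacked into the call, as in `api.search_comments(**new_style)`, assuming the keyword names shown in this thread are still the ones the library accepts.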

@hug3874

hug3874 commented Feb 1, 2023

Hello,

I have the same issue: the request works only if limit <= 1000. @eddvrs, your example works because your limit is under 1000.
This:

    import pmaw
    import datetime as dt

    api = pmaw.PushshiftAPI()

    before = int(dt.datetime(2023, 1, 25, 0, 0).timestamp())
    after = int(dt.datetime(2023, 1, 1, 0, 0).timestamp())

    subreddit = "wallstreetbets"
    limit = 1000
    comments = api.search_comments(subreddit=subreddit, limit=limit, until=before, since=after)
    print(f'Retrieved {len(comments)} comments from Pushshift')

returns
Retrieved 1000 comments from Pushshift

While this (which is the exact same code but with a limit at 1001 instead of 1000):

    import pmaw
    import datetime as dt

    api = pmaw.PushshiftAPI()

    before = int(dt.datetime(2023, 1, 25, 0, 0).timestamp())
    after = int(dt.datetime(2023, 1, 1, 0, 0).timestamp())

    subreddit = "wallstreetbets"
    limit = 1001
    comments = api.search_comments(subreddit=subreddit, limit=limit, until=before, since=after)
    print(f'Retrieved {len(comments)} comments from Pushshift')

returns
Not all PushShift shards are active. Query results may be incomplete.
Retrieved 0 comments from Pushshift
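
The two runs above differ only in the value of limit, so the comparison can be scripted to find where results drop to zero. This is a rough sketch: `probe_limits` and `simulated_search` are hypothetical names, and the stand-in search only mimics the behaviour reported in this thread rather than calling Pushshift.

```python
def probe_limits(search, limits):
    """Call search once per limit and return {limit: number_of_results}."""
    counts = {}
    for limit in limits:
        counts[limit] = len(search(limit=limit))
    return counts


def simulated_search(limit):
    # Stand-in that imitates the reported symptom (nothing above 1000).
    # It is NOT the real API; swap in a real search callable to test live.
    return [] if limit > 1000 else list(range(limit))


counts = probe_limits(simulated_search, [100, 1000, 1001])
for limit, count in counts.items():
    print(f"limit={limit}: retrieved {count}")
```

To run it against the real service, `search` could be something like `functools.partial(api.search_comments, subreddit=subreddit, until=before, since=after)`, assuming the returned object supports `len()` as it does in the snippets above.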

@hug3874

hug3874 commented Feb 4, 2023

Using the parameter "size" instead of "limit" fixed the issue for me. It is probably due to the Pushshift migration.

@manu6287

manu6287 commented Mar 11, 2023

> Using the parameter "size" instead of "limit" fixed the issue for me. It is probably due to the pushshift migration.

I set "size = 2000" and, after about 15 minutes of runtime, I interrupted the process to find myself with over 86,000 results. Could someone please help?

@FamiliarBreakfast

The size parameter doesn't work right now.

@Adam-R26

Was there ever any resolution to this problem? If both size and limit parameters aren't working as expected, how can we retrieve a desired number of records?
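
One possible workaround, offered only as an untested sketch and not as a confirmed fix: since requests with limit <= 1000 were reported to work, the date range can be split into smaller windows, with each request kept at or under 1000 and the total capped on the client side. `fetch_in_windows` and `fake_search` are hypothetical names, and the fake search stands in for the API so the logic can be run offline.

```python
def fetch_in_windows(search, since, until, target,
                     window_seconds=86400, per_request_limit=1000):
    """Collect up to `target` items by walking backwards through time windows.

    `search` is any callable that accepts since=, until= and limit= keyword
    arguments and returns a list-like batch of results.
    """
    results = []
    window_end = until
    while window_end > since and len(results) < target:
        window_start = max(since, window_end - window_seconds)
        remaining = target - len(results)
        # Never ask for more than the per-request ceiling.
        batch = search(since=window_start, until=window_end,
                       limit=min(per_request_limit, remaining))
        results.extend(batch)
        window_end = window_start
    # Cap on the client side in case a request returns more than asked for.
    return results[:target]


# Offline stand-in for the API: one fake item per second of "time".
FAKE_TIMESTAMPS = list(range(5000))
requested_limits = []


def fake_search(since, until, limit):
    requested_limits.append(limit)
    matches = [t for t in FAKE_TIMESTAMPS if since <= t < until]
    return matches[:limit]


collected = fetch_in_windows(fake_search, since=0, until=5000,
                             target=2500, window_seconds=500)
print(len(collected), max(requested_limits))
```

With PMAW, `search` might be a wrapper such as `lambda **kw: api.search_comments(subreddit="wallstreetbets", **kw)`, assuming the keyword names used earlier in this thread. Two caveats apply: a window containing more items than the per-request limit will be truncated, so busy subreddits need smaller windows, and whether boundary timestamps are inclusive is an assumption here, so deduplicating by comment id may be needed.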

@ranbix666
Author

> Using the parameter "size" instead of "limit" fixed the issue for me. It is probably due to the pushshift migration.
>
> I set "size = 2000" and after about 15 minutes of runtime, I interrupted the process to find myself with over 86,000 results. Could someone please help?

You get more than you asked for. Isn't it great? LOL, just joking.

6 participants