Strategy for missed posts? #12
Comments
Hi @ChrisPalmerNZ, the success rate metric that is printed reflects how many requests are rejected due to rate-limiting from Pushshift; any failed request is retried automatically. It's interesting that there was a different number of comments. Can you share the query that you ran? Were there any shards down while you ran the query? The |
Hi @mattpodolak Thanks for replying, and sorry it's taken me a while to get back to you. I used the same subreddit, start, end, and other parameters for both libraries: subreddit='CovidVaccinated', before=1618581599, and after=1606262347. And the query (
I didn't check shards, should I execute an api.metadata_.get('shards') to check them? I got this from psaw - but eventually I got 40,762 comments:
I got this from pmaw, and got 40,630 comments:
|
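On the shards question above, here is a minimal sketch of how shard counts could be inspected once they have been fetched. It assumes the value returned by `api.metadata_.get('shards')` is a dict with `total` and `successful` counts. That shape and the helper name `shards_down` are assumptions for illustration, not something confirmed in this thread.

```python
def shards_down(shards):
    """Return how many shards did not respond, or None if unknown.

    Assumes a dict shaped like {'total': 4, 'successful': 3}.
    This shape is an assumption, not a documented guarantee.
    """
    if not shards:
        return None
    total = shards.get("total")
    successful = shards.get("successful")
    if total is None or successful is None:
        return None
    return total - successful


# Hypothetical metadata values, for illustration only.
all_up = shards_down({"total": 4, "successful": 4})
one_down = shards_down({"total": 4, "successful": 3})
print(all_up, one_down)
```

A non-zero result would suggest that the query ran against incomplete data and may be worth repeating later.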
I'm currently working on troubleshooting what happened to those 100 missing comments. The number of comments returned by
|
Additional update, I ran your query with both
I am currently investigating why there was 1 comment missed, I'll release an update sometime this week once I discover the root cause. |
Thanks for doing this Matthew. I ran the pmaw query a day after the psaw one, and I noticed at that time that it said that fewer (40,730) posts were available than what psaw returned. I measured the number of posts from both libraries by the length of the data, rather than any reporting by the library. I am not currently in front of my PC, but I have saved the data so when I get home tonight I will look at it to see if there were any duplicates returned that might explain the higher psaw number. |
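The duplicate check described above could be sketched roughly as follows. It assumes each saved comment is a dict with an `id` key; the helper names and the sample records are hypothetical and are not data from this thread.

```python
from collections import Counter


def duplicate_ids(items):
    """Return {id: count} for every id that appears more than once."""
    counts = Counter(item["id"] for item in items)
    return {cid: n for cid, n in counts.items() if n > 1}


def compare_by_id(first, second):
    """Return (ids only in first, ids only in second)."""
    ids_first = {item["id"] for item in first}
    ids_second = {item["id"] for item in second}
    return ids_first - ids_second, ids_second - ids_first


# Made-up records standing in for the saved psaw and pmaw results.
psaw_items = [{"id": "a1"}, {"id": "a2"}, {"id": "a2"}, {"id": "a3"}]
pmaw_items = [{"id": "a1"}, {"id": "a2"}]

print(duplicate_ids(psaw_items))
print(compare_by_id(psaw_items, pmaw_items))
```

Comparing the sets of unique ids, rather than the raw lengths, separates duplicates in one result from items genuinely missing in the other.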
Hi Matthew |
Usually, if shards are down a warning should be printed in both
I'm not too sure why there were 100 missing results as I was unable to re-create this, so it could be data inconsistency with Pushshift. I have in the past partially lost
I would refer to the number of items available reported by
Based on the logs you provided, it appears that
Total:: Success Rate: 83.02% - Requests: 583 - Batches: 59 - Items Remaining: 1 # finished the query
Checkpoint:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: -99 # re-tries the query to get the missing item
Total:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: 0 |
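For anyone who wants to track these progress lines programmatically, a small sketch of a parser follows. The regular expression is written against the three log lines quoted above only; the function name `parse_progress` is hypothetical, and other versions of the library may print a different format.

```python
import re

# Pattern derived from the log lines quoted in this thread.
PROGRESS = re.compile(
    r"(?P<kind>\w+):: Success Rate: (?P<rate>[\d.]+)% - "
    r"Requests: (?P<requests>\d+) - Batches: (?P<batches>\d+) - "
    r"Items Remaining: (?P<remaining>-?\d+)"
)


def parse_progress(line):
    """Parse one progress line into a dict, or return None if it does not match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return {
        "kind": match.group("kind"),
        "success_rate": float(match.group("rate")),
        "requests": int(match.group("requests")),
        "batches": int(match.group("batches")),
        "remaining": int(match.group("remaining")),
    }


line = ("Checkpoint:: Success Rate: 82.48% - Requests: 588 - "
        "Batches: 60 - Items Remaining: -99")
print(parse_progress(line))
```

A negative or non-zero `remaining` value on the final line is the kind of signal that could be flagged for a follow-up query.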
Two problems discovered thanks to this issue:
|
Thanks for all of that Matt - I'm glad, and very impressed, that my issue received your devoted attention, and that it led to an improvement - it's a great product! BTW, last night I re-ran the query and got all 40,730 results. Also, I am familiar with how generators work; I unpacked it straight to CSV, so that wasn't the issue here... |
No problem, thanks for reporting the issue. It's worth noting that the 40,730 results that I'm still working on figuring out the root cause, but |
Hi Matt,
Thanks for pmaw - it's a very nice library you've created!
This is not a problem with pmaw, but with my understanding of how to use it.
I have executed a search_comments with the same before and after parameters in both psaw and pmaw. It was so much faster in pmaw, I was amazed! But the success rate varied from 93% to 83%, so at the end I had 40,630 comments using pmaw compared to 40,762 using psaw.
What is the best strategy for retrieving the comments that were missed? Should I assemble a list of submission ids from a search_submissions with the same parameters, based on their having num_comments greater than in the retrieved comments (or not even in the comments), then use search_submission_comment_ids with them? Or, can I utilize safe_exit, and re-run the process to see if I can get more? Or, something else?
Perhaps it would be best to use search_submission_comment_ids from the get-go? I have found that searching by id with psaw is much slower than just using a date range, and as I have a range I didn't bother with it in this case. Is it slower to use than search_comments?
Cheers
Chris
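The first strategy proposed in this post, finding submissions whose `num_comments` exceeds the number of comments actually retrieved, could be sketched as below. It assumes submissions are dicts with `id` and `num_comments`, and comments are dicts with a `link_id` of the form `t3_<submission id>`. The helper name and the sample records are hypothetical.

```python
from collections import Counter


def submissions_missing_comments(submissions, comments):
    """Return ids of submissions with fewer retrieved comments than num_comments."""
    # Strip the 't3_' prefix from link_id to get the bare submission id.
    retrieved = Counter(c["link_id"].split("_", 1)[-1] for c in comments)
    return [
        s["id"]
        for s in submissions
        if s.get("num_comments", 0) > retrieved.get(s["id"], 0)
    ]


# Made-up records, for illustration only.
submissions = [
    {"id": "abc", "num_comments": 2},
    {"id": "def", "num_comments": 1},
    {"id": "ghi", "num_comments": 0},
]
comments = [
    {"id": "c1", "link_id": "t3_abc"},
    {"id": "c2", "link_id": "t3_def"},
]

print(submissions_missing_comments(submissions, comments))
```

The resulting ids are candidates to pass to search_submission_comment_ids. Note that `num_comments` may be stale or may count deleted comments, so a mismatch does not prove that a comment was missed.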