Instagram is blocking our scraping #665
changed our user agent to a normal browser string, which seemed to fix this....but i doubt it'll stay fixed for long. app engine still appends our app id to the user agent, so instagram will still be able to identify us. we'll see. |
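For illustration, here is a minimal sketch of sending a browser-like user agent with Python's standard library. The user agent string and the `build_request` helper are assumptions made up for this example, not bridgy's actual code, and as noted above App Engine may still append the app id on its own.

```python
from urllib.request import Request

# Hypothetical browser-like user agent string, for illustration only.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"


def build_request(url: str, user_agent: str = BROWSER_UA) -> Request:
    """Build (but do not send) a request that carries a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.instagram.com/aaronpk/")
# urllib normalizes header names to "User-agent" internally.
print(req.get_header("User-agent"))
```

Passing the result to `urllib.request.urlopen(req)` would perform the actual fetch.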
didn't work :( |
i also dropped instagram max poll freq down to 2h. with that and the new user agent, we're back in business. >75% of active instagram accounts have polled successfully in the last few hrs. eg https://brid.gy/instagram/aaronpk not sure which of the two changes did the trick. we'll see. |
we're still mostly blocked after all. :/ a few fetches went through ok, but they were the exception. i'm going to disable instagram entirely for a day or two to see if that resets anything on their end. |
i wonder if this is all of app engine's (slash google's) IP block, not just bridgy. eg granary-demo sees the same problem: https://granary-demo.appspot.com/?site=instagram |
evidence for that: scraping instagram with bridgy's user agent works fine on my local machine. |
tried switching to sockets instead of urlfetch in the hopes that it used a different IP block, but no luck. one request made it through out of five, but the other four were 429ed. :/ |
pinged the app engine mailing list: https://groups.google.com/d/msg/google-appengine/rpendSIxJMo/_u4G6uXiBQAJ |
i set up a reverse proxy to get around the IP block. |
lets us use a reverse proxy. #665, snarfed/granary@40e4bd1
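A rough sketch of what routing fetches through a reverse proxy can look like on the client side: keep the path and query, swap the scheme and host. The proxy host name and the `via_proxy` helper are hypothetical, and this is not necessarily how the granary commit above does it.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical reverse proxy base URL, for illustration only.
PROXY_BASE = "https://ig-proxy.example.com"


def via_proxy(url: str, proxy_base: str = PROXY_BASE) -> str:
    """Rewrite url so it points at the proxy, keeping path, query and fragment."""
    target = urlsplit(url)
    proxy = urlsplit(proxy_base)
    return urlunsplit(
        (proxy.scheme, proxy.netloc, target.path, target.query, target.fragment)
    )


print(via_proxy("https://www.instagram.com/aaronpk/"))
```

The proxy itself then forwards each request to instagram.com from its own IP.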
this has been working ok for a couple days now, yay. we'll see how long it lasts. :P closing. |
I noticed you're scraping the profile page - you should check out /username/media/. No auth needed. Discovering this blew my mind. Are you using a single IP? How often are you polling? Has it been working for the 21+ days since your fix? |
@gerbz sadly that only works if you're logged in. http://stackoverflow.com/questions/17373886/33783840#comment61481772_33783840 the proxy was a single IP, yes, but instagram actually stopped blocking app engine recently, so i switched back to fetching directly instead. we're polling ~1k users between once a day and once an hr, depending on how active they are. each poll may also fetch up to N individual media pages too though. in practice it looks like we average <1qpm right now, slightly bursty. |
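As a back-of-the-envelope check on those numbers, the sketch below computes the average poll rate if all ~1k users were polled at one fixed interval. The `polls_per_minute` helper is hypothetical and ignores the extra per-media fetches mentioned above.

```python
from datetime import timedelta


def polls_per_minute(num_users: int, interval: timedelta) -> float:
    """Average polls per minute if every user is polled once per interval."""
    return num_users / (interval.total_seconds() / 60)


# Lower bound: everyone polled once a day. Upper bound: everyone once an hour.
slow = polls_per_minute(1000, timedelta(days=1))
fast = polls_per_minute(1000, timedelta(hours=1))
print(f"{slow:.2f} to {fast:.2f} polls per minute")
```

An observed average under 1 qpm sits close to the once-a-day end, which would suggest that most accounts are polled at the slow rate.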
@snarfed that comment is incorrect - try for yourself. I've even hit it unauthed using Tor. Works fine. Haven't polled it excessively but should work. Thanks for the info. |
good point! you're right. thanks! i just realized i was testing on a private account. public accounts work fine. |
@snarfed what's the current status of your scraping? Is the project still up? |
@shafikhaan yup! https://brid.gy/ , https://granary.io , and https://instagram-atom.appspot.com are still happily scraping Instagram. |
@snarfed Which one would you pick from the above? |
@shafikhaan sorry? i don't follow the question. they all share this scraping code, if that helps: https://github.com/snarfed/granary/blob/master/granary/instagram.py#L758-L975 |
happening again. started 8/21, probably due to an ongoing flood of https://granary.io/ instagram fetches for individual profiles via subscriptions in Aperture-based news readers. ugh. i've disabled instagram in granary entirely for now. for the record, and since i might need to use it again, when i proxied requests last time, i used Apache 2.4's
|
interestingly, the symptom this time is different. when it happened originally, back in 2016, we got 429s with a nice Sorry, too many requests. HTML body. now, it's 401s with an empty body. example log. |
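A minimal sketch of telling the two block symptoms described above apart from ordinary responses. The `looks_blocked` helper is hypothetical and only encodes the two cases seen in this thread: a 429 with any body, and a 401 with an empty body.

```python
def looks_blocked(status: int, body: str) -> bool:
    """Heuristic: True if a response matches either block symptom seen so far."""
    if status == 429:
        # 2016 symptom: 429, usually with a "Sorry, too many requests." body.
        return True
    if status == 401 and not body.strip():
        # Later symptom: 401 with an empty body.
        return True
    return False


print(looks_blocked(429, "Sorry, too many requests."))
print(looks_blocked(401, ""))
print(looks_blocked(200, "<html>...</html>"))
```

A 401 that comes with a body is deliberately not treated as a block here, since it may just mean the account is private or requires login.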
back to proxying. working for now. i've re-enabled all affected IG accounts. |
instagram blocked my proxy's IP. whee. |
trying to discourage people from using granary for social feeds, esp due to eg IG's recent blocking, snarfed/bridgy#665 (comment)
...since i had to block instagram in granary due to their rate limiting/blocking. snarfed/bridgy#665 (comment)
inspired by snarfed/instagram-atom@856575b, since i had to block instagram in granary due to their rate limiting/blocking. snarfed/bridgy#665 (comment) UI next!
we've been scraping with a logged in session cookie for a while now. not ideal, maybe not sustainable, but it's been working. tentatively closing. |
The website still reports that Bridgy is blocked. Is this message still up to date? |
@staabm which web site? where? |
I am getting this error when trying to add Instagram to my Bridgy account |
ah, got it, thanks. I'll look soon! |
@staabm thanks again for the report. i think i've fixed this. mind trying again? |
Thx for the fast fix. The error is gone. Seems to work now ✌️ |
... by returning empty 429s to our profile page HTTP requests. seems like it started under 36h ago. may have happened before too though. eg https://brid.gy/log?start_time=1461974780&key=aglzfmJyaWQtZ3lyFgsSCUluc3RhZ3JhbSIHc25hcmZlZAw
instagram-atom isn't having this problem, and it's using the same IPs, so maybe changing user agent might fix it.