Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redirect not been followed #2502

Open
delijati opened this issue Nov 10, 2017 · 15 comments
Open

Redirect not been followed #2502

delijati opened this issue Nov 10, 2017 · 15 comments

Comments

@delijati
Copy link

Long story short

Redirect not been followed. I always get a 302 back.

Expected behaviour

Get a 200 back with text filled.

Steps to reproduce

import aiohttp
import asyncio


async def fetch(session, url):
    async with session.get(url, allow_redirects=True) as response:
        return await response.text(), response.status


async def fetch_all(session, urls, loop):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
    )

    for ret in results:
        print("status: %s len: %s" % (ret[1], len(ret[0])))
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    urls = ['https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python']
    with aiohttp.ClientSession(loop=loop) as session:
        loop.run_until_complete(fetch_all(session, urls, loop))
@kxepal
Copy link
Member

kxepal commented Nov 10, 2017

Try curl -v -L 'https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python' - it produce redirection loop. It looks like not a aiohttp, but server bug.

@delijati
Copy link
Author

with requests

In [1]: import requests
In [2]: res = requests.get("https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python")
In [3]: res.status_code
Out[3]: 200
In [4]: len(res.text)
Out[4]: 171410

@kxepal
Copy link
Member

kxepal commented Nov 10, 2017

Not sure what requests does, but even browsers reports about improper redirection caused by this URL. I'm sure we can investigate what the request does to handle it - it's very interesting how -, but I guess it's not our problem if reference HTTP client and browsers tells that there is a redirection issue.

@delijati
Copy link
Author

Narrowed it down to setting cookies. http.cookies.__parse_string aka http.cookies.SimpleCookie.load can't parse this cookie [1] str. requests is using http.client.parse_headers.

Possible solutions:

  • fix http.cookies.__parse_string tested to be broken in 3.5, 3.6
  • replace SimpleCookie with a patched version
  • None of above as the cookie is broken and http.client.parse_headers

[1] JSESSION_ID_PROFILE_APP=A3C47AF767CCC88FA36B8292A6FB94A3; Path=/; HttpOnly, gp_intern=(null);path=/;Secure, gp_intern=(null);path=/;Secure

@kxepal
Copy link
Member

kxepal commented Nov 10, 2017

Nice found! So the problem looks like that aiohttp doesn't set / preserve cookies during redirects and that site sets them though thme (horrible solution). But at least now it's clear why happens what we see with curl. Would you like to submit PR?

@asvetlov
Copy link
Member

Not sure if we should fix it.
gp_intern is not standard property for cookies, http.cookies raises an exception for it.
Actually the property name is HttpOnly, gp_intern and value is (null). Another property is Secure, gp_intern with (null) value.
For me it looks like a garbage. Multiple Path properties are not supported by standard too.

@kxepal
Copy link
Member

kxepal commented Nov 10, 2017

@asvetlov For me I see it as a buggy server which forces to set cookies via redirect to pass you through. Honestly, it's horrible solution, but technically it may be correct.

@asvetlov
Copy link
Member

The site is protected from usage by correctly implemented client libraries.
Well, @delijati could pass allow_redirects=False to session.get() and process redirections on his own.

Honestly I have no idea where and when we will stop If we try process every buggy quirk out of the box.

@delijati
Copy link
Author

@asvetlov I'm kind of unsure what would be right solution. Fixing it feels like a ie6 css quirks but any browser I tested sets the the cookie.

Tested with Node:

var request = require('request');

const url = 'https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python';

request(url, function(error, response, body) {
    console.log('error:', error); // Print the error if one occurred
    console.log('statusCode:', response && response.statusCode); // Print the response status code if a response was received
    console.log('body:', body); // Print the HTML for the Google homepage.
});

Error:

(node:22299) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
error: Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python
    at Redirect.onResponse (/opt/develop/privat/mobile/death/node_modules/request/lib/redirect.js:98:27)
    at Request.onRequestResponse (/opt/develop/privat/mobile/death/node_modules/request/request.js:996:22)
    at emitOne (events.js:96:13)
    at ClientRequest.emit (events.js:188:7)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:473:21)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:99:23)
    at TLSSocket.socketOnData (_http_client.js:362:20)
    at emitOne (events.js:96:13)
    at TLSSocket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
statusCode: undefined
body: undefined

@delijati
Copy link
Author

I went a bit deeper the rabbit hole. aiohttp uses SimpleCookie which implements the spec rfc2109. requests uses Cookie the newer rfc2965 witch obsolets spec rfc2109.

Firefox on the other side implements something custom RFC2109/2616

So long story short aiohttp should use the newer Cookie implementation.

@asvetlov
Copy link
Member

Thanks for shading a light on the problem.

aiohttp cannot just use http.cookiejar.Cookie -- it has incompatible interface with http.cookies.SimpleCookie.
rfc2965 is obsoleted by RFC6265 BTW.
The RFC explicitly enumerates all allowed cookie properties (and header name is Set-Cookie, Set-Cookie2 is obsolete).
The list of properties is fixed. For SameSite where is a new RFC draft but it is not accepted yet.

@asvetlov
Copy link
Member

Sorry, the RFC specifies extension-av rule as extension-av = <any CHAR except CTLs or ";">.

Ideas?

@blueyed
Copy link
Contributor

blueyed commented May 2, 2019

SimpleCookie.__parsestring is just too strict: it also fails to handle expiration dates incorrectly (parses them based on the sane-cookie-date only), see https://bugs.python.org/issue35824#msg334392.
(this means that aiohttp is affected by this (

expires = cookie["expires"]
), i.e. it cannot handle the following cookie from GitHub: Set-Cookie: has_recent_activity=1; path=/; expires=Mon, 22 Apr 2019 23:27:18 -0000 (due to "-0000" instead of "GMT" at the end).

The interface for

aiohttp cannot just use http.cookiejar.Cookie -- it has incompatible interface with http.cookies.SimpleCookie.

But couldn't it be changed to be based on http.cookies.CookieJar (similar to how requests does it)?

Another/simpler solution might be to pass an adjusted patt to __parsestring (using it directly).

Currently aiohttp fails to handle 3 out of 4 cookies sent by github.com:

   5     async def main():
   6         cjar = aiohttp.CookieJar()
   7         async with aiohttp.ClientSession(cookie_jar=cjar) as session:
   8             async with session.get('https://github.com') as response:
   9                 pass
  10  ->     __import__('pdb').set_trace()
 return None
(Pdb++) pp response.headers.getall("set-cookie")
['has_recent_activity=1; path=/; expires=Fri, 03 May 2019 00:10:35 -0000',
 '_octo=GH1.1.393119700.1556838635; domain=.github.com; path=/; expires=Sun, 02 May 2021 23:10:35 -0000',
 'logged_in=no; domain=.github.com; path=/; expires=Mon, 02 May 2039 23:10:35 -0000; secure; HttpOnly',
 '_gh_sess=NXXXf; path=/; secure; HttpOnly']
(Pdb++) pp len(response.headers.getall("set-cookie"))
4
(Pdb++) pp list(cjar)
[{'comment': '',
  'domain': 'github.com',
  'expires': '',
  'httponly': True,
  'max-age': '',
  'path': '/',
  'samesite': '',
  'secure': True,
  'version': ''}]
(Pdb++) pp len(list(cjar))
1

@danilobellini
Copy link

@asvetlov For me I see it as a buggy server which forces to set cookies via redirect to pass you through. Honestly, it's horrible solution, but technically it may be correct.

From Section 3 in RFC6265:

[...] User agents MAY ignore Set-Cookie headers contained in responses with 100-level status codes but MUST process Set-Cookie headers contained in other responses [...]

I'm not sure if that server is/was buggy, but the client implementation is broken w.r.t. that RFC when it ignores cookies set in HTTP 3xx responses. On the other hand, the aiohttp documentation states:

Response cookies contain only values, that were in Set-Cookie headers of the last request in redirection chain.

I found it surprising that cookies set on redirection (HTTP 302 where I've tried) are discarded by aiohttp, though I'm always using a ClientSession instance. I still don't know the reason but now the allow_redirects need to be always False for a scraper I'm working in. For now I'm applying the consecutive requests one by one as a workaround (I'm using aiohttp 3.5.4, Python 3.7.3).

@asvetlov
Copy link
Member

If somebody wants to propose a pull request with the fix -- I'll be happy to review it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants