feat: support stream api #479

Merged: 12 commits into jupyterhub:main on Aug 29, 2024

Conversation

@ganisback (Contributor) commented Jun 18, 2024

Currently jupyter-server-proxy does not support the text/event-stream API. When we install https://github.com/hiyouga/LLaMA-Factory in a notebook instance and access it via http://xxx.xxx/proxy/7860/, normal requests work fine, but the stream API never responds.
Reference code: https://github.com/ideonate/jhsingle-native-proxy/blob/3e26a4ee0e7318970b7cf6abbd7d88455a9ac621/jhsingle_native_proxy/proxyhandlers.py#L217
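
For context (an illustration, not part of the original report): a text/event-stream response holds the connection open and emits events incrementally, so a proxy that buffers the entire upstream body before forwarding it will appear to hang:

    HTTP/1.1 200 OK
    Content-Type: text/event-stream
    Cache-Control: no-cache

    data: first chunk

    data: second chunk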

@ganisback changed the title from "support stream" to "support stream api" on Jun 18, 2024
@ganisback (Contributor, Author):

@yuvipanda please help review this feature

@ganisback changed the title from "support stream api" to "feat: support stream api" on Jun 18, 2024
@manics (Member) commented Jun 18, 2024

@ganisback Thanks for your contribution! To help with review please could you expand the PR description to explain what this does, and why it's needed? We'll need some tests if this is going to be merged, but those can be added later if we agree this feature should be added.

@ganisback (Contributor, Author):

@manics description updated

@yuvipanda (Contributor) commented Jun 18, 2024

Ah interesting! I think @jmunroe also ran into this recently.

If my understanding is right, the existing code doesn't stream but waits for the entire upstream request to be completed before passing it on. And with this PR, we're special-casing text/event-stream responses to be streamed.

Instead, what do you think about modifying the existing code (what you have as _proxy_normal) to be streaming? I don't think there's any reason to not stream all requests - I think we initially waited for requests to complete simply because that was the easier thing to do. That would also help with tests - the current tests should catch that and if they pass, we can probably merge that.

So I propose that instead of only streaming eventstreams, we stream everything.

Edit: can't do this, see #479 (comment).


    def dump_headers(headers_raw):
        for line in headers_raw:
            r = re.match('^([a-zA-Z0-9\-_]+)\s*\:\s*([^\r\n]+)[\r\n]*$', line)
@yuvipanda (Contributor) commented on the diff:
Can you use https://www.tornadoweb.org/en/stable/httputil.html#tornado.httputil.HTTPHeaders.parse_line here, and if that fails assume it's a status code? regexes in places like this always worry me :D

Reply from @ganisback (Contributor, Author):

parse_line is an internal method of HTTPHeaders; we cannot use it directly.

Reply from @yuvipanda (Contributor):
Aaah, I see it doesn't return it but updates it.

The regex here still worries me though. Is it coming from somewhere specific (like an RFC, or an existing implementation)? If so, I'd appreciate a link to the source as a comment. If not, I'd like us to find some way to farm this out. I would avoid writing code that parses HTTP in any form if possible.
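
A minimal sketch of the direction suggested here (illustrative only, not the code that landed; parse_raw_headers is a made-up name): let Tornado's httputil parse header lines, and retry any line that fails as a status line via parse_response_start_line.

    from tornado import httputil

    def parse_raw_headers(lines):
        """Hypothetical helper: split raw header lines into (status, headers)."""
        headers = httputil.HTTPHeaders()
        status_code = None
        for line in lines:
            if not line.strip():
                continue  # skip the blank line terminating the header block
            try:
                # parse_line() mutates `headers` in place instead of returning
                headers.parse_line(line)
            except httputil.HTTPInputError:
                # No "Name: value" colon, so treat it as e.g. "HTTP/1.1 200 OK"
                status_code = httputil.parse_response_start_line(line).code
        return status_code, headers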

@yuvipanda (Contributor):

Ah damn, I just realized that won't work, because of our RewriteableResponse implementation :( That implementation's signature depends on being able to buffer requests, rather than being able to stream them. Oof :(

@yuvipanda (Contributor):

As for testing, I suggest:

  1. Create a simple eventstream emitting server in https://github.com/jupyterhub/jupyter-server-proxy/tree/main/tests/resources (similar to the websocket one perhaps)
  2. Create tests in test_proxy similar to the websocket tests (e.g. async def test_server_proxy_websocket_messages).

This can be simpler than websockets, because eventstreams are simpler.
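
A rough sketch of what such a fixture could look like (handler name, event count, and timing are illustrative, not necessarily what the PR ended up with):

    import asyncio
    from tornado import web

    class EventStreamHandler(web.RequestHandler):
        """Emit a few server-sent events, one per second."""

        async def get(self):
            self.set_header("Content-Type", "text/event-stream")
            self.set_header("Cache-Control", "no-cache")
            for i in range(5):
                self.write(f"data: event {i}\n\n")
                await self.flush()  # force each event onto the wire immediately
                await asyncio.sleep(1)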


    headers_raw = []

    def dump_headers(headers_raw):
@yuvipanda (Contributor) commented on the diff:

Let's move this function outside _proxy_progressive. headers_raw as used inside this function is an arg, while there's also a local variable with the same name. Moving it outside cleans this up a little.
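
A sketch of the suggested refactor (FILTERED_HEADERS is a hypothetical name, see the next thread): the helper moves to module level and takes the handler explicitly, so its argument no longer shadows the enclosing local.

    import re

    def dump_headers(handler, headers_raw):
        # Module-level now; nothing is captured from _proxy_progressive's scope.
        for line in headers_raw:
            r = re.match(r'^([a-zA-Z0-9\-_]+)\s*:\s*([^\r\n]+)[\r\n]*$', line)
            if r:
                k, v = r.group(1), r.group(2)
                if k not in FILTERED_HEADERS:  # hypothetical shared constant
                    handler.set_header(k, v)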

    r = re.match('^([a-zA-Z0-9\-_]+)\s*\:\s*([^\r\n]+)[\r\n]*$', line)
    if r:
        k, v = r.groups([1, 2])
        if k not in ('Content-Length', 'Transfer-Encoding',
@yuvipanda (Contributor) commented on the diff:

This is a different list than the one in line 531. Is there a specific reason? If not, I think we should move this to a constant at the top of the file and reference it from both places so it doesn't get out of date.
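
A sketch of that suggestion (the constant's name is hypothetical, and only 'Content-Length' and 'Transfer-Encoding' appear in the diff above; the other entries are assumed):

    # Top of the file: one list shared by both proxying code paths.
    FILTERED_HEADERS = ('Content-Length', 'Transfer-Encoding', 'Content-Encoding', 'Connection')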

        if response.body:  # Likewise, should already be chunked out and flushed
            self.write(response.body)

    async def _proxy_normal(self, host, port, proxied_path, body):
        if self.unix_socket is not None:
@yuvipanda (Contributor) commented on the diff:

This unix socket check should probably be common to both the proxying methods, and should be moved up.

@yuvipanda (Contributor):

ok, I've done one round of review @ganisback! Hope this helps.

@manics I think we should definitely add this feature! Ideally, I'd have liked us to make everything streaming, but that's incompatible with RewriteableResponse :( So making at least eventstreams (which by definition need to be streaming) streaming (and hence incompatible with RewriteableResponse) seems ok to me.

@ganisback (Contributor, Author):

OK, will address your comments over the weekend.

@yuvipanda (Contributor):

Hey @ganisback! Just wanted to check to see if there's anything else I can do to help you move this forward :) I just ran into this again in a different project!

@ganisback (Contributor, Author):

Hi @yuvipanda, I am blocked on the test case; can you help add it?

@yuvipanda (Contributor):

@ganisback it was slightly tricky, but I think I have a useful test for you at ganisback#1! It fails on main but passes on this branch!

LMK what you think of it. I haven't had time to look at your changes yet unfortunately but will do this coming week!

@ganisback (Contributor, Author):

@yuvipanda I increased the response time and added response-data checking; the tests now pass.

@yuvipanda (Contributor) left a review:

Almost there! Can you address https://github.com/jupyterhub/jupyter-server-proxy/pull/479/files#r1645273654 and https://github.com/jupyterhub/jupyter-server-proxy/pull/479/files#r1645272938, and the couple of other style things? Then it's ready to land, I think.

Thank you for your contribution and patience, @ganisback!



            # some header appear multiple times, eg 'Set-Cookie'
            self.set_header(k, v)
        else:
            r = re.match('^HTTP[^\s]* ([0-9]+)', line)
Reply from @ganisback (Contributor, Author):

Here is an example of a stream response header:

    HTTP/1.1 200 OK
    Content-Type: text/html
    Cache-Control: no-cache
    Connection: keep-alive

httputil.HTTPHeaders.parse_line cannot parse the first line to get the HTTP status, so we use a regex to solve this issue.
The source code comes from https://github.com/ideonate/jhsingle-native-proxy/blob/3e26a4ee0e7318970b7cf6abbd7d88455a9ac621/jhsingle_native_proxy/proxyhandlers.py#L265
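
To make the status-line branch concrete, here is that regex applied to the first line above (an illustration, not code from the PR):

    import re

    line = "HTTP/1.1 200 OK"
    r = re.match(r'^HTTP[^\s]* ([0-9]+)', line)
    if r:
        status = int(r.group(1))  # -> 200
        # the linked implementation then feeds this to self.set_status(...)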

    await client.fetch(
        url,
        headers={"Accept": "text/event-stream"},
        request_timeout=22,
@yuvipanda (Contributor) commented on the diff:

I think explicitly setting the timeout can be removed now, right?

Reply from @ganisback (Contributor, Author):

fixed

    assert times_called == limit
    print(stream_read_intervals)
    assert all([0.45 < t < 3.0 for t in stream_read_intervals])
    print(stream_data)
@yuvipanda (Contributor) commented on the diff:

Let's get rid of these extra print statements

Reply from @ganisback (Contributor, Author):

fixed

@ganisback (Contributor, Author):

@yuvipanda any other comments?

@krassowski (Collaborator):

Excited for this! Spent too much of my Saturday debugging why SSE was connected but not sending anything.

@krassowski (Collaborator):

@yuvipanda @manics is there anything else that needs to be done here to get it merged?

@krassowski (Collaborator):

Hi folks, just circling back here, is there any way I could help with this PR?

@ganisback (Contributor, Author):

> Hi folks, just circling back here, is there any way I could help with this PR?

I'm just waiting for them to merge this PR.

@yuvipanda (Contributor):

Sorry for dropping this, travel / vacation time :( I'll try to spend an hour on this later this week.

@krassowski (Collaborator):

It looks like pre-commit is failing because this PR is behind the main branch as of now; for example, it does not include the rawsocket.py file which was added last month (https://github.com/jupyterhub/jupyter-server-proxy/blob/main/jupyter_server_proxy/rawsocket.py).

As per https://results.pre-commit.ci/run/github/71295164/1723882819.1ODn2EWuTr2D6HI5NFt0pQ


@krassowski (Collaborator):

I opened a PR against this branch (ganisback#2) to fix pre-commit checks

@yuvipanda (Contributor):

I still don't like using regexes for parsing headers, but I think this is a clear improvement over the status quo. So I've rebased this and will merge. Apologies for the long time this one took, @ganisback!

I've opened #498 to handle removing the regexes.

@yuvipanda merged commit 7f78e1b into jupyterhub:main on Aug 29, 2024 (15 checks passed)
@krassowski (Collaborator):

Amazing, thanks so much! Do you plan to cut a new release with it now or only after #498 gets addressed?

@yuvipanda (Contributor):

#498 doesn't need to block a release!

@yuvipanda (Contributor):

If you make a PR following https://github.com/jupyterhub/jupyter-server-proxy/blob/main/RELEASE.md I'm happy to merge that :)

@yuvipanda (Contributor):

Although I'm off from Monday!
