Proper encoding of load_url #658

Closed
maeb opened this issue Jul 12, 2021 · 4 comments · Fixed by #659

Comments

@maeb
Contributor

maeb commented Jul 12, 2021

Formatting of load_url does not encode the url parameter properly if it ends up in the query string of the configured url_field (replay_url):

        cdx[self.url_field] = self.replay_url.format(url=cdx['url'],
                                                     timestamp=cdx['timestamp'],
                                                     src_coll=source_coll)

Some URLs do not survive query parameter parsing unscathed when the url parameter is part of the query string of the load_url.

This seems to fix the issue:

        cdx[self.url_field] = res_template(self.replay_url, dict(url=cdx['url'],
                                                                 timestamp=cdx['timestamp'],
                                                                 src_coll=source_coll))

I believe this is a proper fix without breaking changes, but I am not sure. Shall I post a PR?
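To illustrate the failure mode, here is a minimal standalone sketch using only the Python standard library. It does not reproduce pywb's `res_template`. The helpers `naive_fill`, `encoded_fill` and `parsed_url_param` are hypothetical names made up for this example, and the captured URL is invented. The sketch assumes the receiving server parses the query string in the standard way.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit

# Hypothetical template, modeled on a replay_url whose {url} sits in the query string.
TEMPLATE = "http://gowarcserver:9999/warcserver/all/resource?url={url}&closest={timestamp}&output=content"


def naive_fill(template, url, timestamp):
    # Plain str.format: the url value is inserted verbatim, unencoded.
    return template.format(url=url, timestamp=timestamp)


def encoded_fill(template, url, timestamp):
    # Percent-encode the url only when {url} appears after the '?' of the template
    # (an assumption of this sketch, not a statement about pywb's behavior).
    qi = template.find("?")
    if qi >= 0 and template.find("{url}") > qi:
        url = quote_plus(url)
    return template.format(url=url, timestamp=timestamp)


def parsed_url_param(full_url):
    # What a server would see as the 'url' parameter after standard query parsing.
    return parse_qs(urlsplit(full_url).query)["url"][0]


captured = "http://example.com/search?q=a&page=2"  # made-up example URL

# Unencoded: the '&' inside the captured URL splits it into separate parameters.
print(parsed_url_param(naive_fill(TEMPLATE, captured, "20210712000000")))

# Encoded: the captured URL round-trips intact.
print(parsed_url_param(encoded_fill(TEMPLATE, captured, "20210712000000")))
```

With the unencoded variant the receiving side sees a truncated `url` value plus a stray `page` parameter. With the encoded variant it recovers the original URL.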

@maeb
Contributor Author

maeb commented Jul 12, 2021

Referencing my previous issue #656 here. That issue concerned encoding of the url parameter in the query string of the request between the frontend and the backend. This issue concerns encoding of the same parameter in the request between the backend and the warcserver (as configured via the replay_url).

@maeb maeb mentioned this issue Jul 12, 2021
8 tasks
@ikreymer
Member

Just to confirm, this was for use with OutbackCDX, right? Or some other configuration?

@maeb
Contributor Author

maeb commented Jul 19, 2021

We use our own indexer and loader backend https://github.com/nlnwa/gowarcserver.

@maeb
Contributor Author

maeb commented Jul 19, 2021

Our config looks something like:

collections:
  veidemann:
    index:
      type: cdx
      api_url: http://gowarcserver:9999/warcserver/all/index?url={url}&closest={closest}
      replay_url: http://gowarcserver:9999/warcserver/all/resource?url={url}&closest={timestamp}&output=content
