OutbackCDX does not get parameters of POST request #585

Closed
kaij opened this issue Oct 11, 2020 · 5 comments

Comments

@kaij
Contributor

kaij commented Oct 11, 2020

Describe the bug

When using OutbackCDX as an index server, the __wb_post_data parameter is not sent with the URL to the OutbackCDX server. On web pages with multiple XHR POSTs to the same URL, this returns the wrong data. Using a local CDXJ file index works as expected.

Steps to reproduce the bug

  1. Create a WARC of the public page at http://www.corona-data.ch/.
  2. Index it with OutbackCDX using a command similar to cdx-indexer -p -s corona-data.warc.gz | curl -X POST --data-binary @- http://127.0.0.1:8078/collection
  3. Open the page in pywb replay -> most of the diagrams will stay white.

Expected behavior

The replayed POST requests should return the correct responses (so the diagrams can be drawn).

Screenshots

Replayed page with invalid (white) diagrams. The reason for this is that the CDX information for the POST requests to _dash-update-components is not passed with the query.

Environment

  • pywb 2.4.2

Additional context

I tried to track this down to the _get_api_url function in warcserver/indexsource.py. The url used does not contain the __wb_post_data. FileIndexSource uses the key parameter. So I see the following options:

  • Passing the key using the urlkey parameter of outbackcdx (and updating documentation)
  • Adding __wb_post_data to the url parameter

There might also be other options to consider. Also, __wb_post_data changed to __warc_post_data with cdxj-indexer, so maybe there is more development going on. I'd be interested in contributing a fix, but need some guidance as to the best way.

Update. Quote from the OutbackCDX page: "The canonicalized URL (first field) is ignored, OutbackCDX performs its own canonicalization." - indexing in OutbackCDX seems to ignore the __wb_post_data parameter, so this might need further evaluation/coordination.

@ato
Contributor

ato commented Oct 11, 2020

OutbackCDX does not currently have any support for indexing POST requests (pull requests welcome though).

@kaij
Contributor Author

kaij commented Oct 29, 2020

@ato I added a first pull request at nla/outbackcdx#91. It would be great if you could have a look at it - feedback and discussion welcome. 😀

@ikreymer
Member

Sorry for not responding earlier!

Most of the POST matching is done on form data (application/x-www-form-urlencoded), whose params can then be matched similarly to GET query params.

The base64-encoded __wb_post_data was added as a last-resort option, in case it turned out to be useful for other kinds of data, and it looks like it actually is useful here! Often, the values in the POST do not match exactly, and then it falls back to fuzzy matching.
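
For readers following along, the two cases described above can be pictured roughly as follows. This is a minimal sketch, not pywb's actual code: the helper name `post_body_to_query` is hypothetical, and only the `__wb_post_data` parameter name and the form-data content type come from this thread.

```python
import base64
from urllib.parse import parse_qsl, urlencode


def post_body_to_query(content_type: str, body: bytes) -> str:
    """Turn a POST body into a query string for index lookup.

    Hypothetical helper for illustration only; it is not part of pywb.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()

    if mime == "application/x-www-form-urlencoded":
        # Form data: re-encode the fields so they look like GET query params.
        pairs = parse_qsl(body.decode("utf-8"), keep_blank_values=True)
        return urlencode(pairs)

    # Any other body (JSON, for example): last-resort opaque base64 blob.
    encoded = base64.b64encode(body).decode("ascii")
    return urlencode({"__wb_post_data": encoded})
```

In the form-data case each field stays individually visible to the matcher, while in the fallback case the whole body becomes a single opaque value, which is why two bodies that differ in even one byte no longer match exactly.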

Given this use case, I wonder if JSON data should be treated differently as well, perhaps just added as __wb_json_data=, which could be more helpful for doing inexact matches?

Just an idea, of course the current PR will probably be a quicker way to get this supported, but may be interesting to consider this change.

@kaij
Contributor Author

kaij commented Oct 30, 2020

@ikreymer Thanks for the input! I agree that it would make sense to implement a solution for JSON POST data. But wouldn't this mean breaking existing solutions and requiring reindexing? (They currently have __wb_post_data in the SURT.)

Independent of the JSON issue, in order for the replay to work with OutbackCDX, a small change is needed in the pywb RemoteIndexSource to pass the __wb_post_data with the url field (since OutbackCDX does its own canonicalization and ignores the urlkey parameter). I opened a pull request for this in #587. This currently only works with base64-encoded data, but I could of course change it to allow any format.
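
The change described above amounts to folding the POST data into the `url` value sent to the remote index. The sketch below illustrates the idea only and is not the code from #587: the function name `build_remote_query` and its arguments are hypothetical, and it assumes the index server accepts the lookup target in a `url` query parameter.

```python
from urllib.parse import urlencode


def build_remote_query(api_base: str, url: str, post_query: str = "") -> str:
    """Build an index lookup URL with POST data folded into the `url` field.

    Hypothetical helper for illustration only; it is not part of pywb.
    """
    if post_query:
        # Append the POST-derived params to the target URL itself, so the
        # index server sees them when it canonicalizes the `url` value.
        url += ("&" if "?" in url else "?") + post_query

    # urlencode percent-escapes the whole target URL, so its own "?" and "&"
    # cannot be confused with the index server's query parameters.
    return api_base + "?" + urlencode({"url": url})
```

For example, `build_remote_query("http://127.0.0.1:8078/collection", "http://example.com/api", "__wb_post_data=abc")` produces a lookup whose `url` value ends in `?__wb_post_data=abc`, so two POSTs to the same endpoint with different bodies are queried under different keys.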

@ikreymer
Member

ikreymer commented Oct 7, 2021

Closing this, as I believe all issues related to POST here should now be resolved (and it appears this dashboard has been updated to not use POST anymore).
