POST requests indexing and replay fails with OutbackCDX #768

Open
kaij opened this issue Oct 17, 2022 · 0 comments

Describe the bug

Replay of web pages that include POST requests fails if the server uses OutbackCDX as its CDX index.

The issue was first described in #585 for pywb 2.4 and can be fixed for 2.4 with pull requests #587 and nla/outbackcdx#91. Although nla/outbackcdx#91 was merged into the main branch, work on handling arbitrary HTTP methods and on more flexible encoding of POST data continued for pywb 2.6. The solution currently implemented in pywb 2.6 no longer works with that OutbackCDX fix.

Some ideas on solving the issue were noted here: webrecorder/replayweb.page#69 (comment).

Steps to reproduce the bug

An example test case is provided with the corona-data.ch dashboard (kindly supplied with the permission of the Swiss National Library and the author of the original archived web page, @daenuprobst).

covid-20200528143537.warc.gz
wr-421012-corona-datach-20210126135423.warc.gz

Steps:

  1. Run cdx-indexer or cdxj-indexer on the supplied test case and POST the output to OutbackCDX. Example: ./bin/cdx-indexer -p -s /sample/covid-20200528143537.warc.gz | curl -X POST --data-binary @- http://outbackcdx-nginx:8078/nb-test (we currently use the cdx-indexer from pywb 2.4). As part of solving this issue, it should be clarified and documented which tool and which version is best to use.
  2. Replay WARC with pywb (see also environment configuration)
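
Step 1 can also be scripted. The following is only a minimal sketch, assuming the indexer path, flags, and endpoint from the example command above; the function names are hypothetical and not part of pywb or OutbackCDX.

```python
import subprocess
import urllib.request

# Endpoint taken from the example in step 1; adjust for your setup.
DEFAULT_ENDPOINT = "http://outbackcdx-nginx:8078/nb-test"


def build_index_command(warc_path, indexer="./bin/cdx-indexer"):
    """Return the indexer command line used in step 1 (flags -p and -s)."""
    return [indexer, "-p", "-s", warc_path]


def build_post_request(cdx_bytes, endpoint=DEFAULT_ENDPOINT):
    """Build (but do not send) a POST request carrying the raw CDX lines."""
    return urllib.request.Request(endpoint, data=cdx_bytes, method="POST")


def index_and_post(warc_path, endpoint=DEFAULT_ENDPOINT):
    """Run the indexer and send its output to the index server."""
    result = subprocess.run(
        build_index_command(warc_path), check=True, capture_output=True
    )
    with urllib.request.urlopen(build_post_request(result.stdout, endpoint)) as resp:
        return resp.status
```

Calling `index_and_post("/sample/covid-20200528143537.warc.gz")` would then mirror the shell pipeline, provided the indexer and the OutbackCDX endpoint are reachable.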

Results: see screenshots

Expected behavior

POST requests should return the correct data (compare screenshots).

Screenshots

How it looks if POST requests are not resolved correctly (current combination of pywb 2.6 / outbackcdx):

Close-up on POST request (which is returning the wrong data):

How it looks when replayed correctly (current replay with pywb 2.4):
image

Environment

  • pywb 2.4.6
  • outbackcdx 0.11.0
  • cdx-indexer

Sample configuration in use for pywb config.yaml - outbackcdx requests

This is what we currently use for pywb 2.4 with pull request #587 applied (replay works correctly):

collections:
  all: $all
  nb-test:
    archive_paths:
      - /path-index.txt
    index:
      type: cdx
      api_url: http://outbackcdx-nginx:8080/nb-webarchive?closest={closest}&sort=closest&url={url_post}
      replay_url: ""
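
To illustrate what a lookup against this api_url might look like, here is a rough sketch that fills the two placeholders from the config above. It is not pywb's actual code: the function name is hypothetical, percent-encoding of the values is an assumption, and the exact content of `{url_post}` is defined by pull request #587 and not reproduced here.

```python
from urllib.parse import quote

# Template copied from the config above.
API_URL = (
    "http://outbackcdx-nginx:8080/nb-webarchive"
    "?closest={closest}&sort=closest&url={url_post}"
)


def expand_api_url(template, closest, url_post):
    """Fill the placeholders, percent-encoding both values (assumption)."""
    return template.format(
        closest=quote(closest, safe=""),
        url_post=quote(url_post, safe=""),
    )


# Hypothetical lookup: timestamp from the sample WARC name, placeholder URL.
query_url = expand_api_url(API_URL, "20200528143537", "https://example.org/api?x=1")
```

Comparing such a query URL with the keys stored in the index is one way to check whether the lookup key and the indexed key for a POST request are built the same way.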

Additional context

This bug report is a follow-up to our meeting (@ikreymer @edsu @tw4l @kaij) on Oct 6th, 2022. @edsu also provided an issue description for OutbackCDX when using CDXJ at nla/outbackcdx#106.
