Add POST data records, for PyWB playback #244

Open
anjackson opened this issue Dec 11, 2020 · 5 comments
Labels
cdx CDX Generator

@anjackson
anjackson commented Dec 11, 2020

To get playback working, we need to make HEAD/OPTIONS/POST records like PyWB does. See webrecorder/pywb#585 and related tickets.

It's fairly involved! https://github.com/webrecorder/pywb/blob/54d8bccf4a4eebf305012d49cb7330eaddea9eba/pywb/warcserver/inputrequest.py#L183

This will replace/supersede the current workaround:

// Drop lines that appear to be raw HTTP header 200 responses
// (OPTIONS requests, see
// https://github.com/ukwa/webarchive-discovery/issues/215 -- this
// is likely rather specific to Twitter API calls, but in general
// we would expect HTTP 200 to have a real content type and not just
// be HTTP headers):
if (t.find(" application/http 200 ") > 0) {
    this.num_dropped++;
    continue;
}

Note that to be useful, we need to upgrade to nlagovau/outbackcdx:0.8.0.

@anjackson anjackson self-assigned this Dec 11, 2020
@thomasegense
More information from Ilya:

Hi, I am working on standardizing POST request indexing across all the different Webrecorder tools, and on supporting additional improvements. This probably calls for a write-up, but I just wanted to share what the idea is so far:

- The POST request data is, if possible, converted to query (form-encoded) form and treated as part of the URL, in a sense converting the POST request to a GET.
- This can also apply to PUT or any other request.

The CDXJ entry would look like this:

org,httpbin)/post?__wb_method=post&another=more^data&test=some+data 20200809195334 {"url": "https://httpbin.org/post", "mime": "application/json", "status": "200", "digest": "7AWVEIPQMCA4KTCNDXWSZ465FITB7LSK", "length": "688", "offset": "0", "filename": "post-test-more.warc", "requestBody": "?__wb_method=POST&test=some+data&another=more%5Edata", "method": "POST"}

- The canonicalized key has this extra query appended to it, along with __wb_method.
- The url field is not modified.
- The URL-encoded query form is stored in the requestBody field, and an extra method field is added.
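As a rough illustration of the key construction described above, here is a minimal Java sketch. The class and method names (`PostUrlSketch`, `requestBodyField`, `urlForIndexKey`) are hypothetical and not taken from pywb or this codebase, and the sketch stops short of SURT canonicalization: it only shows how the `__wb_method` parameter and the form-encoded body might be appended to the URL before it is handed to whatever canonicalizer builds the CDXJ key.

```java
/** Hypothetical helper, for illustration only; not part of pywb or webarchive-discovery. */
final class PostUrlSketch {
    private PostUrlSketch() {}

    /** Builds the value stored in the requestBody field, e.g. "?__wb_method=POST&a=b". */
    static String requestBodyField(String method, String encodedBody) {
        StringBuilder sb = new StringBuilder("?__wb_method=").append(method);
        if (encodedBody != null && !encodedBody.isEmpty()) {
            sb.append('&').append(encodedBody);
        }
        return sb.toString();
    }

    /**
     * Returns the URL that would be passed to the canonicalizer to produce the index key.
     * The "url" field of the CDXJ record itself is left unmodified by the caller.
     */
    static String urlForIndexKey(String url, String method, String encodedBody) {
        String suffix = requestBodyField(method, encodedBody);
        if (url.indexOf('?') >= 0) {
            // The URL already has a query string: join with '&' instead of a second '?'.
            return url + '&' + suffix.substring(1);
        }
        return url + suffix;
    }
}
```

With the httpbin example, `urlForIndexKey("https://httpbin.org/post", "POST", "test=some+data&another=more%5Edata")` yields the URL plus the same string shown in the requestBody field; lowercasing, parameter sorting and SURT ordering would still be left to the canonicalizer.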

The requestBody is derived from the request's content type:
- application/x-www-form-urlencoded - already in this form, use as is
- multipart/form-data - convert to URL-encoded query
- application/json - parse the JSON and add each primitive to the query, e.g. {"a": "b", "foo": {"c": "d"}} becomes a=b&c=d (is a better approach possible?)
- text/plain - assume it may be JSON and try to parse it as application/json, otherwise treat as binary/other
- binary/all other - base64-encode and add as __wb_post_data=<base64 data>
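The content-type rules above could be sketched in Java roughly as follows. This is only an illustration under several assumptions: the class `RequestBodySketch` and its methods are hypothetical, the JSON is assumed to have been parsed into nested `Map` objects by some JSON library, JSON arrays and multipart/form-data are left out, and the binary fallback uses the double-underscore `__wb_post_data` name to match the `__wb_method` spelling in the CDXJ example. It does not claim to reproduce pywb's exact output.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, for illustration only. */
final class RequestBodySketch {
    private RequestBodySketch() {}

    /** Handles the two simplest cases: form-encoded bodies and the binary fallback. */
    static String encode(String contentType, byte[] body) {
        String type = contentType == null ? ""
                : contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (type.equals("application/x-www-form-urlencoded")) {
            // Already in query form: use as is.
            return new String(body, StandardCharsets.UTF_8);
        }
        // multipart/form-data and JSON handling are omitted from this sketch.
        return "__wb_post_data=" + Base64.getEncoder().encodeToString(body);
    }

    /** Flattens already-parsed JSON so that each primitive becomes a key=value pair. */
    static String flattenJson(Map<String, ?> parsed) {
        List<String> pairs = new ArrayList<>();
        collect(parsed, pairs);
        return String.join("&", pairs);
    }

    private static void collect(Map<String, ?> node, List<String> pairs) {
        for (Map.Entry<String, ?> e : node.entrySet()) {
            Object v = e.getValue();
            if (v instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, ?> child = (Map<String, ?>) v;
                collect(child, pairs); // nested keys are dropped, only leaf names are kept
            } else if (v instanceof String || v instanceof Number || v instanceof Boolean) {
                pairs.add(enc(e.getKey()) + "=" + enc(String.valueOf(v)));
            }
            // nulls and arrays are skipped in this sketch
        }
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }
}
```

Flattening by leaf name alone means that two nested objects sharing a key name both contribute a pair under that name, which may be part of why the comment asks whether a better approach is possible.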

@anjackson
anjackson commented Mar 24, 2021

Thanks @thomasegense - I'm afraid I'm probably going to switch to using the PyWB indexer for now, as modifying this codebase to pull together the request and response records would mean significant changes to the way it works. I don't currently have time to make those changes.

@thomasegense
I completely agree with you. This is a hard, time-consuming task with only minor benefits to playback in SolrWayback.
(The URL max length of 2048 would also have to be changed.)

@ato
ato commented Aug 9, 2023

I have a Java implementation of pywb-compatible POST/PUT request body encoding here: https://github.com/iipc/jwarc/blob/master/src/org/netpreserve/jwarc/cdx/CdxRequestEncoder.java
