Update CDX Indexing to handle PyWB-style indexes, OPTIONS/HEAD/POST with parameters. #34

anjackson · 2021-05-19T09:58:58Z

To resolve some complex playback issues (Twitter, HuffPo) we need to be able to play back POST requests.

This requires some coordination with Ilya as he's been changing how he does it.

Once the indexing scheme is stable, we need to use a version of OutbackCDX that supports it, and re-index the CDX data (at least the last couple of years).
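As a rough illustration of what a POST-aware indexing scheme involves: one convention seen in PyWB-style indexes is to fold the request method and the form fields into the URL that gets indexed and looked up, using a `__wb_method` parameter. The sketch below assumes that convention and a simple form-encoded body. The function name `post_lookup_url` is hypothetical, and the exact scheme may differ once it has been finalised.

```python
from urllib.parse import urlencode, urlsplit


def post_lookup_url(url, method, form_fields):
    """Fold the request method and form fields into the URL used as the index key.

    Assumption: the index stores non-GET requests under the original URL with
    `__wb_method=<METHOD>` and the form fields appended as query parameters.
    """
    extra = urlencode([("__wb_method", method.upper())] + list(form_fields.items()))
    # Append to an existing query string if there is one, otherwise start one.
    separator = "&" if urlsplit(url).query else "?"
    return url + separator + extra


print(post_lookup_url("https://example.com/api", "post", {"q": "cats", "page": "2"}))
```

The same function has to be applied both when indexing the request record and when looking up a replayed POST, otherwise the two keys will not match.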

Updating the Java stack is quite involved: ukwa/webarchive-discovery#244

It might be time to switch to Python for this MapReduce job: run the PyWB indexer over the WARCs and POST the resulting index records to OutbackCDX.

Also need OutbackCDX 0.8.0 to handle the lookups properly.
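A minimal sketch of the "POST them to OutbackCDX" step, using only the standard library. It assumes an OutbackCDX instance that accepts newline-separated index lines in the body of a POST to a collection URL. The endpoint `http://localhost:8080/example-collection` and the function names are placeholders, not taken from the actual job.

```python
import urllib.request


def build_index_request(cdx_lines, endpoint="http://localhost:8080/example-collection"):
    """Build a POST request whose body is one index record per line."""
    body = "".join(line.rstrip("\n") + "\n" for line in cdx_lines).encode("utf-8")
    return urllib.request.Request(endpoint, data=body, method="POST")


def submit(cdx_lines, endpoint="http://localhost:8080/example-collection"):
    """Send the records and return the HTTP status (raises on HTTP errors)."""
    request = build_index_request(cdx_lines, endpoint)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status
```

Building the request separately from sending it makes it easy to batch records per WARC and to check the payload without a running server.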

Some other examples of similar code:

MrJob

Using mapper_raw means MrJob arranges for a copy of each WARC to be placed where the task can read it as a local file.
(This breaks data locality, but streaming large files through the standard mapper performs poorly because they get read into memory.)
(A FileInputFormat that could reliably split block-GZip files would be the only workable fix.)
(But to be honest, this is pretty fast as it is.)

Experiment with a WARC processor using https://pypi.org/project/boilerpy3/ and e.g. spaCy.

Verify with @ikreymer when the approach has been finalised, and which version of OutbackCDX it works with.
The CDXJ Indexer indexes metadata records, which is what we want for video metadata etc. Are the application/warc-fields metadata records written by Heritrix3 okay in the CDX?
Index recent material into a fresh OutbackCDX (>= 0.8.0) index and check playback.
Convert metadata URIs to embed URNs.
Drop records with 451/429 status codes?
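If the decision is to drop 451/429 responses, the filter could sit between the indexer and the upload. This is a sketch only, assuming CDXJ lines of the form `<urlkey> <timestamp> <json>` with the HTTP status stored under a `status` key. The names `DROPPED_STATUSES`, `keep_record` and `filter_cdxj` are made up for illustration.

```python
import json

# Status codes to leave out of the index (assumption: this is the agreed list).
DROPPED_STATUSES = {"429", "451"}


def keep_record(cdxj_line):
    """Return False for CDXJ records whose status is in DROPPED_STATUSES."""
    parts = cdxj_line.rstrip("\n").split(" ", 2)
    if len(parts) != 3:
        return True  # not a "<urlkey> <timestamp> <json>" line, pass it through
    fields = json.loads(parts[2])
    return str(fields.get("status", "")) not in DROPPED_STATUSES


def filter_cdxj(lines):
    """Keep only the records that should be sent on to the index."""
    return [line for line in lines if keep_record(line)]


sample = [
    'com,example)/ 20210519095858 {"url": "https://example.com/", "status": "200"}',
    'com,example)/busy 20210519095859 {"url": "https://example.com/busy", "status": "429"}',
]
print(filter_cdxj(sample))
```

Records without a `status` field (for example metadata records) are kept, since only an explicit 429 or 451 is treated as a reason to drop.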

See https://github.com/ukwa/ukwa-hadoop-tasks/tree/master/warc_indexing


anjackson · 2021-10-07T20:55:34Z

On POST request handling, see webrecorder/replayweb.page#69

anjackson · 2021-12-10T13:20:51Z

Going to have to defer this as it's still unclear what to do. PyWB-based indexing does work, but these specific issues remain unresolved.

anjackson self-assigned this Jan 4, 2022
