Update CDX Indexing to handle PyWB-style indexes, OPTIONS/HEAD/POST with parameters. #34

Open
5 tasks
anjackson opened this issue May 19, 2021 · 2 comments
anjackson commented May 19, 2021

To resolve some complex playback issues (Twitter, HuffPo) we need to be able to play back POST requests.

This requires some coordination with Ilya as he's been changing how he does it.

Once the indexing scheme is stable, we need to use a version of OutbackCDX that supports it, and re-index the CDX data (at least the last couple of years).


Updating the Java stack is quite involved: ukwa/webarchive-discovery#244

It might be time to switch to Python for this MapReduce job: use the PyWB indexer to generate the index records and POST them to OutbackCDX.

Also need OutbackCDX 0.8.0 to handle the lookups properly.
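
As a rough sketch of the "POST them to OutbackCDX" step, the snippet below builds an HTTP POST carrying one index record per line, using only the standard library. The endpoint URL, collection name, and function names are placeholders made up for illustration, and whether a given OutbackCDX version accepts the exact line format the PyWB indexer emits still needs to be verified.

```python
import urllib.request

# Hypothetical endpoint: host, port and collection name are placeholders.
OUTBACKCDX_URL = "http://localhost:8080/example-collection"


def build_index_request(cdx_lines, endpoint=OUTBACKCDX_URL):
    """Build (but do not send) a POST request with one index record per line."""
    # Normalise line endings so each record ends with exactly one newline.
    body = "".join(line.rstrip("\n") + "\n" for line in cdx_lines).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def post_records(cdx_lines, endpoint=OUTBACKCDX_URL, timeout=30):
    """Send the records to the index server and return the HTTP status code."""
    request = build_index_request(cdx_lines, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes it possible to inspect the payload in a test without a running index server.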

Some other examples of similar code:

MrJob

Using mapper_raw means MrJob arranges for a copy of each WARC to be placed where we can get to it.
(This breaks data locality, but streaming through large files performs poorly because they get read into memory.)
(A FileInputFormat that could reliably split block-GZip files would be the only workable fix.)
(But to be honest, this is pretty fast as it is.)

Experiment with a WARC processor using https://pypi.org/project/boilerpy3/ and, e.g., spaCy.

  • Verify with @ikreymer when the approach has been finalised, and what version of OutbackCDX it works with.
  • CDXJ Indexer indexes metadata records, which is what we want for video metadata etc. Is it okay for the application/warc-fields metadata records written by Heritrix3 to appear in the CDX?
  • Index recent material into a fresh Outback (>= 0.8.0) index and check playback.
  • Convert metadata URIs to embed URNs.
  • Drop records with 451/429 status codes?
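
If we do decide to drop 451/429 responses, one option is to filter the CDXJ lines before they are posted. This is only a sketch: it assumes each line has the form `urlkey timestamp {json}` with a `status` field in the JSON part, and the function names are made up for illustration.

```python
import json

# Status codes to exclude from the index (from the open question above).
DROP_STATUSES = {"429", "451"}


def parse_cdxj_line(line):
    """Split a CDXJ line into (urlkey, timestamp, fields dict)."""
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def keep_record(line, drop_statuses=DROP_STATUSES):
    """Return True unless the record's status is one we want to drop."""
    _, _, fields = parse_cdxj_line(line)
    # Compare as strings so both "451" and 451 are handled.
    return str(fields.get("status", "")) not in drop_statuses


def filter_cdxj(lines):
    """Keep only the lines whose status is not in DROP_STATUSES."""
    return [line for line in lines if keep_record(line)]
```

Records without a `status` field are kept, which seems like the safer default until we know how such records should be handled.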

See https://github.com/ukwa/ukwa-hadoop-tasks/tree/master/warc_indexing

@anjackson
Contributor Author

On POST request handling, see webrecorder/replayweb.page#69

@anjackson
Contributor Author

Going to have to defer this as it's still unclear what to do. PyWB-based indexing does work, but these specific issues remain unresolved.

@anjackson anjackson self-assigned this Jan 4, 2022