To resolve some complex playback issues (Twitter, HuffPo), we need to be able to play back POST requests.
This requires some coordination with Ilya, as he's been changing how he does it.
Once the indexing scheme is stable, we need to use a version of OutbackCDX that supports it, and re-index the CDX data (at least the last couple of years).
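For reference, the approach being developed in pywb folds a POST request's body into the URL query to make a deterministic index key, so the request and its captured response can be matched again at playback time. A minimal sketch of the idea (illustrative only, not the final canonicalisation; the function name is made up, though the `__wb_method` marker mirrors pywb's):

```python
from urllib.parse import urlencode, urlparse, parse_qsl

# Illustrative sketch: fold a POST form body into the URL query so the
# capture gets a deterministic CDX key. The real canonicalisation lives
# in pywb/cdxj-indexer; "__wb_method" mirrors the marker pywb uses.
def post_query_url(url: str, form_body: str) -> str:
    params = [("__wb_method", "POST")] + parse_qsl(form_body)
    sep = "&" if urlparse(url).query else "?"
    return url + sep + urlencode(params)

print(post_query_url("https://twitter.com/i/api/graphql", "q=archive&lang=en"))
# -> https://twitter.com/i/api/graphql?__wb_method=POST&q=archive&lang=en
```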
Verify with @ikreymer once the approach has been finalised, and confirm which version of OutbackCDX it works with.
The CDXJ Indexer indexes metadata records, which is what we want for video metadata etc. But are the application/warc-fields metadata records that Heritrix3 writes okay in the CDX? See https://github.com/ukwa/ukwa-hadoop-tasks/tree/master/warc_indexing
Index recent material into a fresh Outback (>= 0.8.0) index and check playback.
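For the playback check, it's easy to confirm that records actually landed by querying the new index directly; OutbackCDX answers plain GET queries with matching CDX lines. A quick sketch (host, port, and collection name are assumptions):

```python
import requests

# Query a (hypothetical) local OutbackCDX collection for captures of a URL.
# OutbackCDX returns matching CDX lines as plain text, one per capture.
OUTBACK = "http://localhost:8080/test-index"  # assumed host and collection

resp = requests.get(OUTBACK, params={"url": "https://www.huffpost.com/"})
resp.raise_for_status()
for line in resp.text.splitlines():
    print(line)
```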
Updating the Java stack is quite involved: ukwa/webarchive-discovery#244
Might be time to switch to Python for this MrJob task: use the pywb indexer to generate the CDX records, and POST them to OutbackCDX.
We also need OutbackCDX 0.8.0 (or later) to handle the lookups properly.
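A rough sketch of how that could look, assuming the cdxj-indexer CLI (its --post-append flag folds POST request bodies into the URL key) writes CDXJ to stdout, and using OutbackCDX's plain-text POST ingest; paths and the collection name are placeholders, and the CDXJ may need converting to the 11-field CDX format if the target OutbackCDX version doesn't accept CDXJ directly:

```python
import subprocess
import requests

OUTBACK = "http://localhost:8080/test-index"  # assumed OutbackCDX collection

def index_warc(warc_path: str) -> None:
    # Generate CDXJ for the WARC, folding POST bodies into the URL key.
    cdxj = subprocess.run(
        ["cdxj-indexer", "--post-append", warc_path],
        check=True, capture_output=True, text=True,
    ).stdout
    # OutbackCDX ingests CDX lines POSTed to the collection as plain text.
    resp = requests.post(OUTBACK, data=cdxj.encode("utf-8"))
    resp.raise_for_status()

index_warc("/path/to/example.warc.gz")
```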
Some other examples of similar code:
MrJob
Using mapper_raw means MrJob arranges for a copy of each WARC to be placed where we can get to it.
(This breaks data locality, but streaming through large files performs poorly because they get read into memory.)
(A FileInputFormat that could reliably split block-GZip files would be the only workable fix.)
(But TBH this is pretty fast as it is.)
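A minimal sketch of that pattern, combining mrjob's mapper_raw hook (which hands each task the path to a local copy of its input file) with warcio to walk the records; the record-type counting is just placeholder logic:

```python
from mrjob.job import MRJob
from warcio.archiveiterator import ArchiveIterator

class WARCRecordCounter(MRJob):
    # mapper_raw gives us a local copy of the whole WARC, so we can parse
    # block-gzipped records properly without Hadoop trying to split the file.
    def mapper_raw(self, input_path, input_uri):
        with open(input_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                yield record.rec_type, 1

    def reducer(self, rec_type, counts):
        yield rec_type, sum(counts)

if __name__ == "__main__":
    WARCRecordCounter.run()
```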
Play with a WARC processor using https://pypi.org/project/boilerpy3/ and e.g. spaCy.
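Something along these lines could be a starting point (a sketch; the spaCy model and WARC path are placeholders):

```python
import spacy
from boilerpy3 import extractors
from warcio.archiveiterator import ArchiveIterator

nlp = spacy.load("en_core_web_sm")         # assumes the small English model
extractor = extractors.ArticleExtractor()  # boilerpy3 main-content extractor

with open("/path/to/example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        ctype = record.http_headers.get_header("Content-Type") or ""
        if "text/html" not in ctype:
            continue
        html = record.content_stream().read().decode("utf-8", errors="replace")
        text = extractor.get_content(html)  # strip navigation/boilerplate
        doc = nlp(text)
        print(record.rec_headers.get_header("WARC-Target-URI"),
              [(ent.text, ent.label_) for ent in doc.ents])
```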