Try to remove headers with illegal characters. #536

Merged (1 commit) on Apr 30, 2020

Conversation

danielbicho
Contributor

Description

Try to remove headers with illegal characters. Fixes arquivo/pwa-technologies#774.

Motivation and Context

Pywb is not able to replay records that have illegal characters (outside latin1) in the replay headers.
In the example from the issue above, there is a trademark character in the Server header that breaks the streaming of the response.

This change tries to fix those badly formed headers by detecting and removing them. It's a best-effort approach; in the worst case the record will fail to load, as it did before.
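To illustrate the idea of detecting and removing such headers, here is a minimal sketch. It is not the code from this PR: the helper names `is_latin1` and `drop_non_latin1_headers` are hypothetical, and it assumes headers are available as a list of `(name, value)` string pairs.

```python
def is_latin1(text):
    """Return True if every character in text can be encoded as latin-1."""
    try:
        text.encode('latin-1')
        return True
    except UnicodeEncodeError:
        return False


def drop_non_latin1_headers(headers):
    """Keep only the (name, value) pairs that encode cleanly as latin-1.

    Hypothetical helper for illustration; headers that fail the check
    are dropped rather than repaired.
    """
    return [(name, value) for name, value in headers
            if is_latin1(name) and is_latin1(value)]


# Example input: the Server value contains U+2122 (trademark sign),
# which has no latin-1 representation.
headers = [
    ('Content-Type', 'text/html'),
    ('Server', 'Casper Cache\u2122 V2 (CAST TIME)'),
]
cleaned = drop_non_latin1_headers(headers)
```

Dropping the header loses the original Server value on replay, which seems an acceptable trade-off when the alternative is a response that cannot be streamed at all.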

arquivo/pwa-technologies#774

Screenshots (if appropriate):

image

After this fix:
image

Types of changes

  • Replay fix (fixes a replay specific issue)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

I am not sure this is the best approach; I am open to making any changes needed to get it upstream.

@danielbicho marked this pull request as ready for review on February 10, 2020 11:27
@ikreymer
Member

Thanks! I think this makes sense, though do you know what's causing the invalid headers? Is it UTF-8 encoding, just wondering?

@danielbicho
Contributor Author

Yes, it is UTF-8 encoding.

Opening the WARC Record with VIM we can see the problematic line (My vim is assuming latin1 encoding):
Server: Casper Cacheâ<84>¢ V2 (CAST TIME)^M

Everything goes well, from reading the WARC record to the building of the StatusAndHeaders, but then the WSGI start_response call breaks.
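The failure can be reproduced outside pywb with a short sketch. It assumes the header bytes are the UTF-8 encoding of the trademark sign (`E2 84 A2`), which is consistent with what vim displays above, and that the WSGI layer encodes header values as latin-1, as PEP 3333 requires. The exact exception raised inside a given WSGI server may differ.

```python
# Raw header value as stored in the WARC record (assumed UTF-8 bytes).
raw = b'Casper Cache\xe2\x84\xa2 V2 (CAST TIME)'

# Decoded as UTF-8, the three bytes become one character, U+2122.
as_utf8 = raw.decode('utf-8')

# Decoded as latin-1 (what vim assumed), each byte becomes its own
# character, which is why the line looks garbled in the editor.
as_latin1 = raw.decode('latin-1')

# Encoding the UTF-8-decoded string as latin-1 fails, because U+2122
# is outside the latin-1 range.
try:
    as_utf8.encode('latin-1')
    encodable = True
except UnicodeEncodeError:
    encodable = False
```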

@ikreymer merged commit 6b014d0 into webrecorder:develop on Apr 30, 2020