Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(chunker): Fix chunk length calculation for unicode characters #726

Merged
merged 3 commits into from
Jan 30, 2024

Conversation

SyntheticGoop
Copy link
Contributor

What kind of change does this PR introduce?

Bug fix

What is the current behavior?

The length of cookies with URI escaped characters can exceed the valid cookie length of 4096, causing parts of the split cookies to be rejected by the browser.

What is the new behavior?

Chunked cookies after URI encoding now fit within the chunk size specified.

Additional context

The reason for this issue occurring is that we are chunking base on lengths before escaping. While we have some buffer in the MAX_CHUNK_SIZE for longer unicode characters, this does not always work.

For example has a length in javascript of 1 as well as match /(.{1})/g.exec("﷽") to be $1 = ﷽, but it will escape to %EF%B7%BD which is 9 characters long.

The current approach of chunking with a buffer for some unicode characters will never be "correct" unless we implement a buffer that accounts for every character being a maximum size unicode.

I've reimplemented the createChunks function to do the following:

  1. Split the chunk after encoding it.
  2. Ensure that chunks are split correctly between unicode characters instead of inside of it.
  3. Not use regex as it is hard to determine what a "correct" unicode character boundary is in regex when it is escaped.

There are currently other related solutions and issues:

PRs

Issues

@kangmingtay kangmingtay requested review from dijonmusters, hf, kangmingtay and a team and removed request for hf and kangmingtay January 26, 2024 09:34
Copy link

@0xfacad3 0xfacad3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ssr/src/utils/chunker.ts createChunks() in the diffs of this Pull Request should be updated as follows, due to TypeScript constraints:

export function createChunks(key: string, value: string, chunkSize?: number): Chunk[] {
	const resolvedChunkSize = chunkSize ?? MAX_CHUNK_SIZE;

	let encodedValue = encodeURIComponent(value);

	if (encodedValue.length <= resolvedChunkSize) {
		return [{ name: key, value }];
	}

	const chunks: Chunk[] = [];

	while (encodedValue.length > 0) {
		let encodedChunkHead = encodedValue.slice(0, resolvedChunkSize);

		const lastEscapePos = encodedChunkHead.lastIndexOf('%');

		// Check if the last escaped character is truncated.
		if (lastEscapePos > resolvedChunkSize - 3) {
			// If so, reslice the string to exclude the whole escape sequence.
			// We only reduce the size of the string as the chunk must
			// be smaller than the chunk size.
			encodedChunkHead = encodedChunkHead.slice(0, lastEscapePos);
		}

		let valueHead;

		// Check if the chunk was split along a valid unicode boundary.
		while (encodedChunkHead.length > 0) {
			try {
				// Try to decode the chunk back and see if it is valid.
				// Stop when the chunk is valid.
				valueHead = decodeURIComponent(encodedChunkHead);
				break;
			} catch (error) {
				if (
					error instanceof URIError &&
					encodedChunkHead.at(-3) === '%' &&
					encodedChunkHead.length > 3
				) {
					encodedChunkHead = encodedChunkHead.slice(0, encodedChunkHead.length - 3);
				} else {
					throw error;
				}
			}
		}

		chunks.push({
			name: `${key}.${chunks.length}`,
			value: valueHead as string,
		  });
		encodedValue = encodedValue.slice(encodedChunkHead.length);
	}

	return chunks.map((value, i) => ({
		name: `${key}.${i}`,
		value: value.value as string,
	  }));
}

@0xfacad3
Copy link

@SyntheticGoop
Hey! Your pull request was exactly what I wanted. However, when I forked and used it, a problem seemed to occur when deploying Vercel. The functions to make it work are presented above. The problem was a simple Typescript constraint.

@SyntheticGoop
Copy link
Contributor Author

@SyntheticGoop Hey! Your pull request was exactly what I wanted. However, when I forked and used it, a problem seemed to occur when deploying Vercel. The functions to make it work are presented above. The problem was a simple Typescript constraint.

I've updated the types so that the package build on my system with pnpm build.

const lastEscapePos = encodedChunkHead.lastIndexOf('%');

// Check if the last escaped character is truncated.
if (lastEscapePos > resolvedChunkSize - 3) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to clarify, we subtract 3 here because an escaped character would take up 3 characters in the string right (a % followed by 2 hexadecimal characters)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's exactly it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants