
Transferring data through the RPC stream skips the second message #70

Open
aryanjassal opened this issue Nov 8, 2024 · 16 comments · May be fixed by #74
Labels
development Standard development

Comments

@aryanjassal
Contributor

aryanjassal commented Nov 8, 2024

Description

The second value is always skipped when messages are processed by RPCServer. This appears to be a race condition that occurs when we disconnect the stream from one pipe and connect it to another: by the time we disconnect, the stream has already partially loaded the next chunk of the message, so that data is lost.

This is also reflected in the tests. Strangely, all the tests on a feature branch pass across multiple commits, but whenever we merge, the failure comes back. More interestingly, the macOS machines are the only ones that fail this way. However, once I pulled down the code from staging and ran the tests, I consistently got the same error.

Interestingly, this should have made basically all RPC calls non-functional, but that is not the case. For Polykey and Polykey CLI, the RPC system has been working reliably, which makes it even stranger that this shows up only in the tests and is not an issue anywhere else.

I went back to the first commit in this repo, ran npm install, and ran the tests again, but they still failed in the same way. This issue may have been in the repo from the beginning without anyone ever running into it.
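The suspected failure mode described above can be modelled with a small sketch. This is a hedged, hypothetical reconstruction (not the actual RPCServer internals): a "header pipe" reads one chunk ahead, so when the pipe is abandoned after delivering the header, the chunk sitting in its internal queue (the second message) is silently dropped.

```typescript
// Hypothetical model of the race: the pipe's internal queue eagerly pulls
// one chunk ahead of what the consumer has requested.
function demo(): [string | undefined, string[]] {
  const source = ['header', 'msg-2', 'msg-3'][Symbol.iterator]();
  // The pipe has already pulled two chunks off the source into its queue.
  const pipeQueue: string[] = [source.next().value, source.next().value];
  // Downstream consumes only the header, then the pipe is torn down;
  // pipeQueue (still holding 'msg-2') is discarded with it.
  const header = pipeQueue.shift();
  // A fresh consumer resumes reading the source directly.
  const remaining: string[] = [];
  for (const chunk of { [Symbol.iterator]: () => source }) {
    remaining.push(chunk);
  }
  return [header, remaining];
}

console.log(demo()); // [ 'header', [ 'msg-3' ] ] -- 'msg-2' is lost
```

The second message never reaches either consumer, matching the observed symptom of exactly one skipped value.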

To Reproduce

  1. Run the CI on a separate branch
  2. The CI should pass
  3. Merge the branch into staging
  4. The CI should fail on the macOS machine
  5. Run the CI on another branch
  6. The CI fails again

Additional context

Details of the commented-out test
// Temporarily commenting out to allow the CI to make a release
test.prop({
  messages: specificMessageArb,
})('forward middlewares', async ({ messages }) => {
  const stream = rpcTestUtils.messagesToReadableStream(messages);
  class TestMethod extends DuplexHandler {
    public handle = async function* (
      input: AsyncGenerator<JSONRPCRequestParams>,
      _cancel: (reason?: any) => void,
      _meta: Record<string, JSONValue> | undefined,
      _ctx: ContextTimed,
    ): AsyncGenerator<JSONRPCResponseResult> {
      yield* input;
    };
  }
  const middlewareFactory = rpcUtilsMiddleware.defaultServerMiddlewareWrapper(
    () => {
      return {
        forward: new TransformStream({
          transform: (chunk, controller) => {
            chunk.params = { value: 1 };
            controller.enqueue(chunk);
          },
        }),
        reverse: new TransformStream(),
      };
    },
  );
  const rpcServer = new RPCServer({
    middlewareFactory: middlewareFactory,
    logger,
    idGen,
  });
  await rpcServer.start({
    manifest: {
      testMethod: new TestMethod({}),
    },
  });
  const [outputResult, outputStream] = rpcTestUtils.streamToArray();
  const readWriteStream: RPCStream<Uint8Array, Uint8Array> = {
    cancel: () => {},
    readable: stream,
    writable: outputStream,
  };
  rpcServer.handleStream(readWriteStream);
  const out = await outputResult;
  expect(out.map((v) => v!.toString())).toStrictEqual(
    messages.map(() =>
      JSON.stringify({
        jsonrpc: '2.0',
        result: { value: 1 },
        id: null,
      }),
    ),
  );
  await rpcServer.stop({ force: true });
});
test.prop(
  {
    messages: specificMessageArb,
  },
  { numRuns: 1 },
)('reverse middlewares', async ({ messages }) => {
  const stream = rpcTestUtils.messagesToReadableStream(messages);
  class TestMethod extends DuplexHandler {
    public handle = async function* (
      input: AsyncGenerator<JSONRPCRequestParams<{ value: number }>>,
      _cancel: (reason?: any) => void,
      _meta: Record<string, JSONValue> | undefined,
      _ctx: ContextTimed,
    ): AsyncGenerator<JSONRPCResponseResult<{ value: number }>> {
      yield* input;
    };
  }
  const middleware = rpcUtilsMiddleware.defaultServerMiddlewareWrapper(() => {
    return {
      forward: new TransformStream(),
      reverse: new TransformStream({
        transform: (chunk, controller) => {
          if ('result' in chunk) chunk.result = { value: 1 };
          controller.enqueue(chunk);
        },
      }),
    };
  });
  const rpcServer = new RPCServer({
    middlewareFactory: middleware,
    logger,
    idGen,
  });
  await rpcServer.start({
    manifest: {
      testMethod: new TestMethod({}),
    },
  });
  const [outputResult, outputStream] = rpcTestUtils.streamToArray();
  const readWriteStream: RPCStream<Uint8Array, Uint8Array> = {
    cancel: () => {},
    readable: stream,
    writable: outputStream,
  };
  rpcServer.handleStream(readWriteStream);
  const out = await outputResult;
  expect(out.map((v) => v!.toString())).toStrictEqual(
    messages.map(() =>
      JSON.stringify({
        jsonrpc: '2.0',
        result: { value: 1 },
        id: null,
      }),
    ),
  );
  await rpcServer.stop({ force: true });
});

Platform

  • Device: CI / Dell Precision 3480
  • OS: macOS GitHub Runner / NixOS
  • Version: 6.0

Methods of Resolution

  1. Instead of relying on switching pipes, use a more reliable mechanism
  2. Update the tests to make failures easier to identify. For example, give each input value a distinct number, which can help pinpoint exactly which items are being dropped.
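Resolution 2 could be sketched as follows. This is a minimal illustration, not code from the repo; `findSkipped` is a hypothetical helper comparing sent sequence numbers against received ones.

```typescript
// Given sequence-numbered inputs, report exactly which ones were dropped.
function findSkipped(sent: number[], received: number[]): number[] {
  const got = new Set(received);
  return sent.filter((seq) => !got.has(seq));
}

const sent = [0, 1, 2, 3];
const received = [0, 2, 3]; // the second message never arrived
console.log(findSkipped(sent, received)); // [ 1 ]
```

With distinct values per message, a failing assertion immediately identifies the skipped item instead of just reporting a length mismatch.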

Notify Maintainers

@aryanjassal @tegefaulkes

@aryanjassal aryanjassal added the development Standard development label Nov 8, 2024
@aryanjassal aryanjassal self-assigned this Nov 8, 2024

linear bot commented Nov 8, 2024

@aryanjassal
Contributor Author

Temporarily, to allow a release so work can continue on Polykey#838, I have commented out the tests. From what can be observed, nothing is failing outside the tests, so this seems safe to do.

@CMCDragonkai
Member

Can we fastcheck the sequential testing?

Contributor Author

Should be able to. Currently, the test takes a randomly generated value for params and consistently returns 1. By using the default middleware and setting params to the value we want to test, I was able to simplify the test. This could be one way to use fastcheck.

However, this is the result of only a brief investigation. I will go through it properly when I actually work on this issue.

@CMCDragonkai
Member

This deserves plenty of fuzz testing, use chatgpt to help.

Contributor Author

It already had fastcheck tests previously, so fuzz testing wasn't an issue before. The problem is that, for some reason, the second input is being skipped. Brian and I think this is caused by disconnecting and reconnecting the pipe: we extract only the header message, then cancel the stream. To fix this, I will have to update that implementation to be more robust.

The sequential testing is mostly for convenience, to make it easy to locate where things went wrong.

@CMCDragonkai
Member

This is a regression?

@CMCDragonkai
Member

This is a bug, the fastcheck has discovered a bug here.

@CMCDragonkai
Member

Why is this not a bug issue?

@CMCDragonkai
Member

This should be added into linear todos.

@CMCDragonkai
Member

The 2 test names should be more specific:

    ✕ forward middlewares (with seed=1292472631) (9 ms)
    ✕ reverse middlewares (with seed=1292472631) (8 ms)

@aryanjassal
Contributor Author

The most likely reason this happens is how the header messages are handled. Currently, a new transform stream is created to fetch the header, then that stream is cancelled so the consumer can be switched from the header consumer to another consumer that parses the actual content.

Creating a stream and cancelling it right after reading the first message is an odd design, and it needs to be replaced with an approach that handles this more elegantly. Instead of cancelling the stream, a promise can be returned for the header, and something like an async iterable can be returned for the content data.

This would replace the janky solution with an elegant one, and should fix the CI issues.
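The promise-plus-iterable approach could look something like the following minimal sketch. `splitHeader`, `Msg`, and `messages` are hypothetical names for illustration, not the actual API; the key point is that the same iterator serves both the header and the body, so no chunk is ever buffered in a pipe that gets torn down.

```typescript
type Msg = { seq: number };

// Pull the first value as the header, then hand back the same iterator
// for the body, so consumption resumes exactly where the header read
// left off and nothing is skipped.
async function splitHeader<T>(
  input: AsyncIterable<T>,
): Promise<{ header: T; rest: AsyncIterable<T> }> {
  const iterator = input[Symbol.asyncIterator]();
  const first = await iterator.next();
  if (first.done) throw new Error('empty stream');
  return {
    header: first.value,
    rest: { [Symbol.asyncIterator]: () => iterator },
  };
}

async function* messages(): AsyncGenerator<Msg> {
  for (const seq of [0, 1, 2]) yield { seq };
}

async function main(): Promise<void> {
  const { header, rest } = await splitHeader(messages());
  const body: number[] = [];
  for await (const m of rest) body.push(m.seq);
  console.log(header.seq, body); // 0 [ 1, 2 ]
}

main();
```

Because nothing is cancelled, there is no window in which a pre-fetched chunk can be discarded.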

This is a bug, the fastcheck has discovered a bug here.

This is probably not the case, as fastcheck would not keep discovering the same bug across multiple runs. Moreover, it is the second message that is skipped, which is not something fastcheck can induce, so there is most likely some other underlying reason for this.

This deserves plenty of fuzz testing, use chatgpt to help.

The current fastcheck tests are already pretty robust and extensive, so I don't believe fuzz coverage is the issue here.

@aryanjassal
Contributor Author

I have done the following in Polykey#847 to extract the header message from an async iterable.

// Extracts the header message from the iterator
const headerMessage = await (async () => {
  const iterator = input[Symbol.asyncIterator]();
  const header = (await iterator.next()).value;
  if (header.type === 'VaultNamesHeaderMessage') {
    if (header == null) throw new clientErrors.ErrorClientInvalidHeader();
    return header;
  }
})();

// Do stuff on rest of the messages
for await (const message of input) {
  // The header has been consumed, so all other messages will be
  // returned from the loop.
}

@CMCDragonkai
Member

Why do const headerMessage = await (async () => { ... })();? It seems redundant; are you sure this is the right structure for the code?

@aryanjassal
Contributor Author

Why do const headerMessage = await (async () => { ... })();? It seems redundant; are you sure this is the right structure for the code?

There are easier ways to do this, but most of them involve something like let headerMessage: Type | undefined;, which I wanted to avoid, as that would lead to a lot of extra checks down the line, one per message during iteration, I believe. This can add up, and random null checks sprawled everywhere just look weird, so I decided on this approach.

The code I've provided creates and calls an async function inline. It does a bunch of work inside a block to return a single value, which lets us avoid type-checking everywhere. The syntax looks weird because we await an async function we have just created. It is functionally similar to the following, but more concise.

const extractHeader = async (): Promise<HeaderType> => {
  const iterator = input[Symbol.asyncIterator]();
  const header = (await iterator.next()).value;
  if (header.type === 'VaultNamesHeaderMessage') {
    if (header == null) throw new clientErrors.ErrorClientInvalidHeader();
    return header;
  }
}

const header = await extractHeader();

This makes it easier to see how the logic looks spread out, and what it would look like if extractHeader were substituted with the inline async block instead.

@aryanjassal aryanjassal linked a pull request Dec 2, 2024 that will close this issue
@tegefaulkes
Contributor

You shouldn't have to check its type everywhere even if you unwrapped the logic out of that arrow function.

Also, does that code even work? Why check header == null after checking header.type === '...'? If type is defined, then header can't be null.

Isn't this simpler?

    const head = await input.next();
    if (head.done) utils.never();
    if (head.value.type !== 'VaultNamesHeaderMessage') throw new clientErrors.ErrorClientInvalidHeader();
    const headerMessage = head.value;
