Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[event-hubs] investigate TypeError thrown from rhea-promise awaitable sender #8584

Closed
chradek opened this issue Apr 28, 2020 · 4 comments · Fixed by #8884
Closed

[event-hubs] investigate TypeError thrown from rhea-promise awaitable sender #8584

chradek opened this issue Apr 28, 2020 · 4 comments · Fixed by #8884
Labels
Client This issue points to a problem in the data-plane of the library. Event Hubs
Milestone

Comments

@chradek
Copy link
Contributor

chradek commented Apr 28, 2020

Recently I was testing the impact of creating batches and sending them faster than the service could accept them using the EventHubProducerClient in @azure/event-hubs v5.

As expected, memory usage increased and eventually I could see that some of the send calls would time out.

However if I let the script continue long enough, I would eventually see an uncatchable error thrown by rhea-promise:

TypeError: Cannot read property 'address' of undefined
    at AwaitableSender.get address [as address] (C:\workplace\perf-testing\eh-send-oom\node_modules\rhea-promise\dist\lib\link.js:62:28)
    at Timeout.setTimeout [as _onTimeout] (C:\workplace\perf-testing\eh-send-oom\node_modules\rhea-promise\dist\lib\awaitableSender.js:129:69)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)

The error is being thrown from a function that is triggered by a setTimeout. The function is meant to be treated as an operation timeout, and throws an error that includes the sender's address in the message. The getter for the address property fails because the link's source property is undefined.

This issue is meant to track the investigation into how the rhea-promise awaitable sender can get into a state where source is undefined, and if needed either log an issue or create a PR to fix this in rhea-promise.

Script used to reproduce issue:

const {EventHubProducerClient} = require("@azure/event-hubs");

const connectionString = process.env["connection_string"];

let batchesInFlight = 0;
let batchSizeInFlight = 0;
let totalSuccessfulSends = 0;

const clients = [];
for (let i = 0; i < 40; i++) {
  const client = new EventHubProducerClient(connectionString);
  clients.push(client);
}

/**
 * 
 * @param {EventHubProducerClient} client 
 */
async function fillBatch(client) {
  const eventsToSend = [];
  for (let i = 0; i < 50; i++) {
    eventsToSend.push({
      "name": "Standalone test ${randstr.repeat(1000)}",
      "timestamp": Date.now(),
      "event": "MS_test ${randstr.repeat(1000)}",
      "message": "Metric No. ${randstr.repeat(1000)}",
      "message1": "Metric No. ${randstr.repeat(1000)}",
      "message2": "Metric No. ${randstr.repeat(1000)}"
    });
  }
  const batch = await client.createBatch()
  for (let i = 0; i < eventsToSend.length; i++) {
    if (!batch.tryAdd({body: eventsToSend[i]})) {
      console.error(`Failed to add event ${i + 1}`);
      break;
    }
  }
  return batch;
}

async function sendEvents(client) {
  const batch = await fillBatch(client);
  batchesInFlight++;
  batchSizeInFlight += batch.sizeInBytes;
  try {
    await client.sendBatch(batch);
    totalSuccessfulSends++;
  } catch (err) {
    console.error(`Received an error while sending ${batch.count} events: ${err.message}`);
  } finally {
    batchesInFlight--;
    batchSizeInFlight -= batch.sizeInBytes;
  }
}

async function run() {
  const promises = [];
  for (const client of clients) {
    promises.push(fullThrottle(client));
  }

  return Promise.all(promises);
}

async function fullThrottle(client) {
  for (let i = 0; i < 100000; i++) {
    sendEvents(client);
    await new Promise(r => setTimeout(r, 25));
  }

  await client.close();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});

setInterval(() => {
  console.log(`Total successful sends: ${totalSuccessfulSends}, Batches in flight: ${batchesInFlight}, Size in flight: ${batchSizeInFlight} bytes`);
}, 1000);

@chradek chradek added Client This issue points to a problem in the data-plane of the library. Event Hubs labels Apr 28, 2020
@chradek chradek added this to the [2020] June milestone Apr 28, 2020
@aangelisc
Copy link

I am also currently facing this error.

I have multiple clients instantiated attempting to push thousands of events (batched) to an Event Hub. We have been trying to achieve an Event Hub throughput of a few hundred thousand events/s, however this is proving to be quite difficult using the node sdk. I have attempted to use batching to better control the sending but with 5 clients and a batch size of ~500 events/batch I end up running into this error after around 3 minutes of running my script.

Any help pushing significant throughput to Event Hub via the node sdk will be greatly appreciated.

@chradek
Copy link
Contributor Author

chradek commented May 8, 2020

I started looking into this today, and found that at some point the underlying connection has its remote state reset. This can happen when connect is called on a connection.

Sometimes this isn't a problem because the remote state is repopulated once the links are reattached by rhea. However if the send timeout is hit before this happens, there's a TypeError thrown because that remote state doesn't have the source field defined which is where the address comes from.

So now the question is, why is connect being called multiple times like this? I can tell that it is being triggered inside the cbs init() method, but I'm not sure yet why that's being called more than once.

One easy 'fix' would be to submit a PR to rhea-promise so that it wraps the this.address getter in a try/catch before accessing it in the error message constructed when the send operation times out. However I think the root cause is related to us calling connection.open() multiple times, and we should figure out why that's happening.

@chradek
Copy link
Contributor Author

chradek commented May 11, 2020

In my test case I'm seeing the following take place:

  1. User calls createBatch, connection is established with the service, cbs session is created.
  2. Call sendBatch in a tight loop. In a tight enough loop, the main thread can stay busy enough where the connection idle timeout is reached before a heartbeat is received from the service.
  3. idle_timeout is triggered. This closes the local side of the rhea connection (but the remote doesn't get closed and disconnect doesn't fire for another 1 second due to a hard-coded setTimeout in rhea.)
  4. sendBatch operations hit the 60 second timeout (this matches the idle_timeout) and start returning an OperationTimeoutError.
  5. OperationTimeoutError is retried. Connection is no longer perceived as open due to the idle_timeout, so during the cbs session negotiating, the rhea-promise connection is opened again.
  6. rhea connection object has it's remote state reset due to the SDK calling open on the rhea-promise connection again.
  7. The next already in-flight sendBatch operation that times out throws the TypeError because the remote state was reset, so the address is trying to be accessed from an object that is undefined.

I think we can work around this if we use a new rhea-promise connection when we detect that we are going to call open on it, and it's already been opened before. This is similar to the change we made for service-bus where we create a new rhea-promise connection anytime a disconnect is encountered. However, this will be different in that in the above scenario, disconnect isn't emitted until after the SDK call open again on the same connection, so we'll need another way to determine if the connection should be replaced.

@chradek
Copy link
Contributor Author

chradek commented Jun 8, 2020

The fix for this issue is now available in @azure/event-hubs version 5.2.1:
https://www.npmjs.com/package/@azure/event-hubs/v/5.2.1

Here's some more info on all the changes that went into this version:
https://github.com/Azure/azure-sdk-for-js/blob/master/sdk/eventhub/event-hubs/CHANGELOG.md#521-2020-06-08

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. Event Hubs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants