Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix video freezing with sync required when transport is connected #787

Closed
wants to merge 1 commit into from

Conversation

ggarber
Copy link
Contributor

@ggarber ggarber commented Mar 6, 2022

There are cases where the consumers video is frozen for some seconds (2-4s) after starting to consume and having had proper video for 1s already. This happens specially when users round trip time is not low.

After some debugging it looks like the root cause is that Consumer::UserOnTransportConnected is being called multiple times (I think once when the DTLS connection is established and another one when the client sends the 'connect' message) and every time mediasoup resets the video stream sync (syncRequired = true).

The first time that happens it is not a big deal because during the initial setup mediasoup will switch spatial layers from -1 to 1 and request a new keyFrame so video will be unfrozen soon. The problem is the second that that happens (when the client sends the 'connect' message) because in that case nobody is requesting a keyframe so the video will be stuck until the client timeout for no video is triggered (2-3s) and it requests a new PLI.

There could be different solutions for this:

  • Request a key frame in UserOnTransportConnected => More keyframes
  • Avoid UserOnTransportConnected being called multiple times => Not sure if it is doable but looks something interesting to review
  • Don't break the sync (don't mark syncRequired = true) when the transport is connected. There is no need to couple transport state with stream state as far as I can tell.

This PR implements the last one of those approaches.

@nazar-pc
Copy link
Collaborator

nazar-pc commented Mar 6, 2022

Documentation recommends creation of consumer with paused: true and resuming it later to avoid random timeouts and fundamentally described issue. It doesn't seem that you're using it, am I right?

@ibc
Copy link
Member

ibc commented Mar 6, 2022

I completely agree with @nazar-pc here. It's documented. The change you have done in this PR probably break video SingleConsumer since indeed the Consumer maybe already created at the time the transport is connected and hence a keyframe is required.

Waiting for Gustavo'a response but this is not an issue if you follow those recommendations in the docs.

@ggarber
Copy link
Contributor Author

ggarber commented Mar 7, 2022

@nazar-pc @ibc Thx for the quick response.

We totally follow that recommendation, there is no other way to make it reliable with WebRTC. Consumers are always created paused and then resumed from our application to avoid receiving video from unsignaled ssrcs in the browser side.

But the problem is that in some cases consumers are resumed before the transport is connected. Connecting a transport takes 4 roundtrips (1 ICE, 2 DTLS, 1 signaling) and resuming a transport takes 1 roundtrip so it is easy to have it resumed sooner than the transport is connected, right?

In our testing scenario, where we were able to reproduce the issue reported with customers (simulating a rtt of ~300ms), the consumer is resumed in 0.5 secs and the consumer's transport is connected after 1 secs more or less.

Does it makes sense?

Why does a call to consumer.connect() needs to reset the RTP stream forcing a re-sync that implies some video freezing (specially long if no PLI is sent like in this case)? Could transport and stream be more independent to avoid these issues?

@ibc
Copy link
Member

ibc commented Mar 7, 2022

But the problem is that in some cases consumers are resumed before the transport is connected. Connecting a transport takes 4 roundtrips (1 ICE, 2 DTLS, 1 signaling) and resuming a transport takes 1 roundtrip so it is easy to have it resumed sooner than the transport is connected, right?

Here I assume you meant "and resuming a consumer takes 1 roundtrip".
And yes, that can happen.

Why does a call to consumer.connect() needs to reset the RTP stream forcing a re-sync that implies some video freezing

Here I assume you mean Consumer::TransportConnected(), right? That's called just in two scenarios from Transport.cpp (look for consumer->TransportConnected() in Transport.cpp):

  1. When a Consumer is created via Channel::ChannelRequest::MethodId::TRANSPORT_CONSUME. In that case, if the transport is already connected (WebRtcTransport is connected when ICE is connected and DTLS is established), we need to tell the new Consumer that its transport is connected via Consumer::TransportConnected().
  2. When the transport is connected. In this case we iterate all its Consumers and call Consumer::TransportConnected() on them.

NOTE: Perhaps there is an issue here (spoiler: I don't see it, keep reading) since we may be calling consumer->TransportConnected() con Consumers that were created before the transport was connected, so we are calling TransportConnected() twice. Let's see what Consumer::TransportConnected() does. It basically sets this->transportConnected = true; in Consumer.cpp and calls SimulcastConsumer::UserOnTransportConnected(), which looks like this:

void SimulcastConsumer::UserOnTransportConnected()
{
	MS_TRACE();

	this->syncRequired                 = true;
	this->keyFrameForTsOffsetRequested = false;

	if (IsActive())
		MayChangeLayers();
}

Here we are setting this->syncRequired = true;. We assume that the Consumer is paused at this point since we are creating them paused in the app as recommended, so IsActive() returns false.

When the Consumer is created, your app resumes it (NOTE: no matter transport connection state), so it sends Channel::ChannelRequest::MethodId::CONSUMER_RESUME to Consumer.cpp which does this:

case Channel::ChannelRequest::MethodId::CONSUMER_RESUME:
{
	if (!this->paused)
	{
		request->Accept();

		return;
	}

	this->paused = false;

	MS_DEBUG_DEV("Consumer resumed [consumerId:%s]", this->id.c_str());

	if (IsActive())
		UserOnResumed();

	request->Accept();

	break;
}

Note that SimulcastConsumer::IsActive() is:

bool IsActive() const override
{
	// clang-format off
	return (
		RTC::Consumer::IsActive() &&
		std::any_of(
			this->producerRtpStreams.begin(),
			this->producerRtpStreams.end(),
			[](const RTC::RtpStream* rtpStream)
			{
				return (rtpStream != nullptr && (rtpStream->GetScore() > 0u || rtpStream->HasDtx()));
			}
		)
	);
	// clang-format on
}

and it also calls parent's Consumer::IsActive() which is:

virtual bool IsActive() const
{
	// The parent Consumer just checks whether Consumer and Producer are
	// not paused and the transport connected.
	// clang-format off
	return (
		this->transportConnected &&
		!this->paused &&
		!this->producerPaused &&
		!this->producerClosed
	);
	// clang-format on
}

This is: a SimulcastConsumer is active if it's not paused and if its transport is connected (among other things).

So here two things can happen:

  1. We have resumed the Consumer and Transport was already connected so we call UserOnResume() which will end requesting a keyframe because the consumer is active:
    void SimulcastConsumer::UserOnResumed()
     {
     	MS_TRACE();
    
     	this->syncRequired                 = true;
     	this->keyFrameForTsOffsetRequested = false;
    
     	if (IsActive())
     		MayChangeLayers();
     }
  2. We have resumed the Consumer but the transport was not connected so the consumer is not active so we don't request any keyframe.

In the first case this is obviously fine since this is the perfect time for requesting a keyframe. In the second case, we are not requesting since the transport is connected, and once the transport is connected, mediasoup will call SimulcastConsumer::UserOnTransportConnected() which will request the keyframe. So I don't see the issue here.

@ggarber
Copy link
Contributor Author

ggarber commented Mar 7, 2022

Thx for the detailed review.

NOTE: Perhaps there is an issue here (spoiler: I don't see it, keep reading) since we may be calling consumer->TransportConnected() con Consumers that were created before the transport was connected, so we are calling TransportConnected() twice.

Yes, that's what is happening in this case and why I suggest that an alternative solution (or workaround) could be to avoid the second SimulcastConsumer::UserOnTransportConnected() invocation.

  1. We have resumed the Consumer but the transport was not connected so the consumer is not active so we don't request any keyframe.

The issue we have is only in this case, we can ignore the rest scenarios.

In the first case this is obviously fine since this is the perfect time for requesting a keyframe. In the second case, we are not requesting since the transport is connected, and once the transport is connected, mediasoup will call SimulcastConsumer::UserOnTransportConnected() which will request the keyframe. So I don't see the issue here.

Can you clarify where exactly SimulcastConsumer::UserOnTransportConnected() will request the keyframe if the consumer was already resumed?

@ibc
Copy link
Member

ibc commented Mar 7, 2022

Yes, that's what is happening in this case and why I suggest that an alternative solution (or workaround) could be to avoid the second SimulcastConsumer::UserOnTransportConnected() invocation.

To be clear, what I meant is that I still don't see how this can be a problem.

Can you clarify where exactly SimulcastConsumer::UserOnTransportConnected() will request the keyframe if the consumer was already resumed?

The thing is that both scenarios (when a consumer is resumed and when its transport is connected) trigger exactly the same code:

	void SimulcastConsumer::UserOnResumed()
	{
		MS_TRACE();

		this->syncRequired                 = true;
		this->keyFrameForTsOffsetRequested = false;

		if (IsActive())
			MayChangeLayers();
	}
	void SimulcastConsumer::UserOnTransportConnected()
	{
		MS_TRACE();

		this->syncRequired                 = true;
		this->keyFrameForTsOffsetRequested = false;

		if (IsActive())
			MayChangeLayers();
	}

In your case, the consumer is created and resumed before the transport is connected, so IsActive() returns false. Later, when the transport is connected, IsActive() returns true and MayChangeLayers() is invoked:

	void SimulcastConsumer::MayChangeLayers(bool force)
	{
		MS_TRACE();

		int16_t newTargetSpatialLayer;
		int16_t newTargetTemporalLayer;

		if (RecalculateTargetLayers(newTargetSpatialLayer, newTargetTemporalLayer))
		{
			// If bitrate externally managed, don't bother the transport unless
			// the newTargetSpatialLayer has changed (or force is true).
			// This is because, if bitrate is externally managed, the target temporal
			// layer is managed by the available given bitrate so the transport
			// will let us change it when it considers.
			if (this->externallyManagedBitrate)
			{
				if (newTargetSpatialLayer != this->targetSpatialLayer || force)
					this->listener->OnConsumerNeedBitrateChange(this);
			}
			else
			{
				UpdateTargetLayers(newTargetSpatialLayer, newTargetTemporalLayer);
			}
		}
	}

Here RecalculateTargetLayers() will compute newTargetSpatialLayer and newTargetTemporalLayer and those will be different than -1 (their initial value since the consumer was created in paused state, see SimulcastConsumer.hpp and look for targetSpatialLayer and targetTemporalLayer).

So if we follow the flow above, this->listener->OnConsumerNeedBitrateChange(this); will be called. It won't directly called UpdateTargetLayers(newTargetSpatialLayer, newTargetTemporalLayer) since bandwidth is BWE managed externally, so Transport::OnConsumerNeedBitrateChange() will end calling Transport::DistributeAvailableOutgoingBitrate() which will call SimulcastConsumer::ApplyLayers() which will end calling SimulcastConsumer::UpdateTargetLayers().

Yes, this flow won't trigger any PLI. It never does it here. This doesn't happen here. That happens in SimulcastConsumer::SendRtpPacket() (note that it ignores any RTP packet if IsActive() is false) in which RequestKeyFrameForTargetSpatialLayer() will be called when needed. It's never been in the "transport connected" or "consumer resumed" logic.

@ggarber
Copy link
Contributor Author

ggarber commented Mar 7, 2022

Yes, that's what is happening in this case and why I suggest that an alternative solution (or workaround) could be to avoid the second SimulcastConsumer::UserOnTransportConnected() invocation.

To be clear, what I meant is that I still don't see how this can be a problem.

Can you clarify where exactly SimulcastConsumer::UserOnTransportConnected() will request the keyframe if the consumer was already resumed?

The thing is that both scenarios (when a consumer is resumed and when its transport is connected) trigger exactly the same code:

	void SimulcastConsumer::UserOnResumed()
	{
		MS_TRACE();

		this->syncRequired                 = true;
		this->keyFrameForTsOffsetRequested = false;

		if (IsActive())
			MayChangeLayers();
	}
	void SimulcastConsumer::UserOnTransportConnected()
	{
		MS_TRACE();

		this->syncRequired                 = true;
		this->keyFrameForTsOffsetRequested = false;

		if (IsActive())
			MayChangeLayers();
	}

In your case, the consumer is created and resumed before the transport is connected, so IsActive() returns false. Later, when the transport is connected, IsActive() returns true and MayChangeLayers() is invoked:

	void SimulcastConsumer::MayChangeLayers(bool force)
	{
		MS_TRACE();

		int16_t newTargetSpatialLayer;
		int16_t newTargetTemporalLayer;

		if (RecalculateTargetLayers(newTargetSpatialLayer, newTargetTemporalLayer))
		{
			// If bitrate externally managed, don't bother the transport unless
			// the newTargetSpatialLayer has changed (or force is true).
			// This is because, if bitrate is externally managed, the target temporal
			// layer is managed by the available given bitrate so the transport
			// will let us change it when it considers.
			if (this->externallyManagedBitrate)
			{
				if (newTargetSpatialLayer != this->targetSpatialLayer || force)
					this->listener->OnConsumerNeedBitrateChange(this);
			}
			else
			{
				UpdateTargetLayers(newTargetSpatialLayer, newTargetTemporalLayer);
			}
		}
	}

Here RecalculateTargetLayers() will compute newTargetSpatialLayer and newTargetTemporalLayer and those will be different than -1 (their initial value since the consumer was created in paused state, see SimulcastConsumer.hpp and look for targetSpatialLayer and targetTemporalLayer).

So if we follow the flow above, this->listener->OnConsumerNeedBitrateChange(this); will be called. It won't directly called UpdateTargetLayers(newTargetSpatialLayer, newTargetTemporalLayer) since bandwidth is BWE managed externally, so Transport::OnConsumerNeedBitrateChange() will end calling Transport::DistributeAvailableOutgoingBitrate() which will call SimulcastConsumer::ApplyLayers() which will end calling SimulcastConsumer::UpdateTargetLayers().

Yes, this flow won't trigger any PLI. It never does it here. This doesn't happen here. That happens in SimulcastConsumer::SendRtpPacket() (note that it ignores any RTP packet if IsActive() is false) in which RequestKeyFrameForTargetSpatialLayer() will be called when needed. It's never been in the "transport connected" or "consumer resumed" logic.

All the above is correct and I agree with it.

The problem is what happens with the next call received to SimulcastConsumer::UserOnTransportConnected() 1 second later.

The first call happens when the DTLS connection is established (WebRtcTransport::OnDtlsTransportConnected) and the second one happens when the ICE connection is completed (WebRtcTransport::OnIceServerCompleted)

That second call to SimulcastConsumer::UserOnTransportConnected() resets the stream as you showed in the code above (this->syncRequired = true) but in this case there is no change in the layers selected (it is still 1:1) so nothing will trigger the request of a new KeyFrame and that's what creates the 2s freeze.

@ibc
Copy link
Member

ibc commented Mar 7, 2022

I'm reviewing your comment but for now I'm lost here:

The first call happens when the DTLS connection is established (WebRtcTransport::OnDtlsTransportConnected) and the second one happens when the ICE connection is completed (WebRtcTransport::OnIceServerCompleted)

Definitely we don't call consumer->UserOnTransportConnected() twice first when DTLS is connected then when ICE is completed. Do we?

@ibc
Copy link
Member

ibc commented Mar 7, 2022

Definitely we don't call consumer->UserOnTransportConnected() twice first when DTLS is connected then when ICE is completed. Do we?

Wait, you are right. In WebRtcTransport.cpp we call RTC::Transport::Connected(); many times. Let me thing about this...

@ibc
Copy link
Member

ibc commented Mar 7, 2022

Wait, you are right. In WebRtcTransport.cpp we call RTC::Transport::Connected(); many times. Let me thing about this...

So the thing is that we call this method twice due to ICE and DTLS events (in Consumer.cpp):

void Consumer::TransportConnected()
{
	MS_TRACE();

	this->transportConnected = true;

	MS_DEBUG_DEV("Transport connected [consumerId:%s]", this->id.c_str());

	UserOnTransportConnected();
}

And indeed SimulcastConsumer::UserOnTransportConnected() resets this->syncRequired to true and hence we discard next RTP packets in SimulcastConsumer::SendRtpPacket() due to this code:

// If we need to sync and this is not a key frame, ignore the packet.
if (this->syncRequired && !packet->IsKeyFrame())
	return;

Ok, @ggarber please try this. Unless I miss something this should be the proper way to go:

diff --git a/worker/src/RTC/Consumer.cpp b/worker/src/RTC/Consumer.cpp
index 52b3f2257..c12299cd1 100644
--- a/worker/src/RTC/Consumer.cpp
+++ b/worker/src/RTC/Consumer.cpp
@@ -373,6 +373,9 @@ namespace RTC
 	{
 		MS_TRACE();
 
+		if (this->transportConnected)
+			return;
+
 		this->transportConnected = true;
 
 		MS_DEBUG_DEV("Transport connected [consumerId:%s]", this->id.c_str());
@@ -384,6 +387,9 @@ namespace RTC
 	{
 		MS_TRACE();
 
+		if (!this->transportConnected)
+			return;
+
 		this->transportConnected = false;
 
 		MS_DEBUG_DEV("Transport disconnected [consumerId:%s]", this->id.c_str());

ibc added a commit that referenced this pull request Mar 7, 2022
…nnected()

Rationale given here: #787 (comment)

We always check flags and return fast if the flag was already set (or unset, it depends), then we set or unset the flag and run code. Here we failed and didn't honor this pattern.
@ibc
Copy link
Member

ibc commented Mar 7, 2022

Just in case here a PR with the above diff: #788

@vpalmisano
Copy link
Contributor

I tested the previous patch from @ibc with large RTT conditions, it works for me.

@ibc
Copy link
Member

ibc commented Mar 7, 2022

Thanks, I'll merge and release.

@ibc
Copy link
Member

ibc commented Mar 7, 2022

Closing in favour of PR #788.

@ibc ibc closed this Mar 7, 2022
ibc added a commit that referenced this pull request Mar 7, 2022
…nnected() (#788)

Rationale given here: #787 (comment)

We always check flags and return fast if the flag was already set (or unset, it depends), then we set or unset the flag and run code. Here we failed and didn't honor this pattern.
@ibc
Copy link
Member

ibc commented Mar 7, 2022

Published in 3.9.9. Thanks guys!

@ggarber
Copy link
Contributor Author

ggarber commented Mar 7, 2022

Thx @ibc . Not calling transport connected twice was one of the alternatives proposed so I'm good with it.

But for your consideration changing the "stream state" (syncRequired = true) based on transport state maybe is more complicated than needed and those staes can be independent.

Thx

@ibc
Copy link
Member

ibc commented Mar 7, 2022

But for your consideration changing the "stream state" (syncRequired = true) based on transport state maybe is more complicated than needed and those staes can be independent.

Actually we should not set a flag that was already set, meaning that we should not run code twice. We just missed to check flag value before doing things as we do everywhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants