Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DepUsrSCTP::Checker::timer not being freed #576

Merged
merged 15 commits into from
Jun 2, 2021
Merged

Fix DepUsrSCTP::Checker::timer not being freed #576

merged 15 commits into from
Jun 2, 2021

Conversation

ibc
Copy link
Member

@ibc ibc commented Jun 1, 2021

UPDATE: THIS PR IS NOW READY TO BE MERGED

Rationale and solution below in this comment.


Related to PR #575.

THIS PR IS NOT INTENDED TO BE MERGED AT ALL

I've added this in DepLibUV.cpp:

inline static void onWalk(uv_handle_t* handle, void* arg)
{
	MS_ERROR_STD(
		"---- handle [type:%d, active:%d, closing:%d, has_ref:%d]",
		handle->type,
		uv_is_active(handle),
		uv_is_closing(handle),
		uv_has_ref(handle)
	);

	if (!uv_is_closing(handle))
		uv_close(handle, onClose);
}

What I see:

  • In mediasoup Node tests, onWalk() is never called (this is good and expected/desired behavior).

  • When running mediasoup in a real application (example, mediasoup-demo), if I call worker.close() in Node, neither I see any call to onWalk() (this is good and expected/desired behavior).

  • However, when running the mediasoup-worker tests (this is: make test -C worker) I see this output:

Done.
./out/Release/mediasoup-worker-test --invisibles --use-colour=yes
===============================================================================
All tests passed (1606 assertions in 39 test cases)

DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | ---- handle [type:13, active:0, closing:1, has_ref:1]

Note that handle->type 13 means UV_TIMER: http://docs.libuv.org/en/v1.x/handle.html?highlight=uv_handle_type#c.uv_handle_type

And this seems to be because, in worker tests, we initiate (maybe among others) some KeyFrameRequestManager instances (although the problem may be somewhere else) and it may happen that we are not closing some of them properly so they remain active.

So what I mean is that, AFAIS, the only issue I see in in mediasoup/worker/tests/.

@ibc
Copy link
Member Author

ibc commented Jun 1, 2021

OHHHHH the problem in tests is that TestKeyFrameRequestManager.cpp is calling DepLibUV::ClassInit(); which was already called by the tests main file so it's replacing the loop with another one!

Fixing here and in v3.

@jmillan cough cough

@ibc
Copy link
Member Author

ibc commented Jun 1, 2021

Sorry, I was testing in v3... I should go to bed

@nazar-pc
Copy link
Collaborator

nazar-pc commented Jun 1, 2021

It actually happens with Node tests as well, you just don't see it due to the way Node API works.
If you write a piece of code that, let's say, creates a file instead of printing, you'll see that file is indeed created. It just all happens when Node no longer cares about the process.

@nazar-pc
Copy link
Collaborator

nazar-pc commented Jun 1, 2021

I used this for testing:

diff --git a/worker/src/DepLibUV.cpp b/worker/src/DepLibUV.cpp
--- a/worker/src/DepLibUV.cpp	(revision 45729a4f3a1cfcbca2ead52362115d7f6467ad92)
+++ b/worker/src/DepLibUV.cpp	(date 1622587923886)
@@ -4,11 +4,28 @@
 #include "DepLibUV.hpp"
 #include "Logger.hpp"
 #include <cstdlib> // std::abort()
+#include <iostream>
+#include <fstream>
+
+using std::cout; using std::ofstream;
+using std::endl; using std::string;
+using std::cerr;
+using std::fstream;
 
 /* Static variables. */
 
 thread_local uv_loop_t* DepLibUV::loop{ nullptr };
 
+void on_uv_walk(uv_handle_t* handle, void* arg)
+{
+	string filename("output.txt");
+	fstream output_fstream;
+
+	output_fstream.open(filename, std::ios_base::out);
+
+	output_fstream << "Maecenas accumsan purus id \norci gravida pellentesque." << endl;
+}
+
 /* Static methods. */
 
 void DepLibUV::ClassInit()
@@ -27,10 +44,28 @@
 {
 	MS_TRACE();
 
-	// This should never happen.
 	if (DepLibUV::loop != nullptr)
 	{
-		uv_loop_close(DepLibUV::loop);
+		int err;
+
+		uv_stop(DepLibUV::loop);
+		uv_walk(DepLibUV::loop, on_uv_walk, NULL);
+
+		while (true)
+		{
+			err = uv_loop_close(DepLibUV::loop);
+
+			if (err != UV_EBUSY)
+			{
+				break;
+			}
+
+			uv_run(DepLibUV::loop, UV_RUN_NOWAIT);
+		}
+
+		if (err != 0)
+			MS_ABORT("failed to close libuv loop: %s", uv_err_name(err));
+
 		delete DepLibUV::loop;
 	}
 }

After running node tests you'll see output.txt in the working directory.

@ibc
Copy link
Member Author

ibc commented Jun 1, 2021

I'll do more tomorrow. Could you please print (into file) handle->type and paste it here?

@ibc
Copy link
Member Author

ibc commented Jun 1, 2021

Ok, I've done. BTW use output_fstream.open(filename, std::ios_base::app); so lines are appended to existing file content.

  • Running worker tests I see 15 handle type:13 lines.
  • Running the demo I just see a single handle type:13 line.

There is some culprit here.

BTW: uv handles have a data custom field in which we can add custom data to identify them.

@ibc
Copy link
Member Author

ibc commented Jun 1, 2021

Culprit is somewhere in DepUsrSCTP.cpp. I've made its methods do nothing (and do not create its internal Checker singleton, and the output file is empty.

@ibc
Copy link
Member Author

ibc commented Jun 1, 2021

Also, if I comment DepUsrSCTP::ClassInit(); and DepUsrSCTP::ClassDestroy(); in tests.cpp, the file has 15 handle type:13 lines.

If I do not comment those lines, the file has 16 lines. Which proves that this is the only bug that really affects mediasoup (the other 15 are just in worker tests).

@ibc
Copy link
Member Author

ibc commented Jun 1, 2021

I see what the problem is:

  1. The Timer instance in the Checker singleton in DepUsrSCTP is deleted (so its uv handle closed) in DepUsrSCTP::ClassDestroy().
  2. However, DepUsrSCTP::ClassDestroy() is called after uv_run() in DepLibUV::RunLoop() has returned. How is this possible? because a non running uv_timer is not active so uv_run() can terminate. However the uv_timer handle MUST be closed later (to not leak).
  3. And we close it.... where? in DepUsrSCTP::ClassDestroy(), but at this point, our UV loop is no longer running so the uv_close of the uv_timer is never called.

Solutions:

  1. Your PR, but I don't like it much since I don't want to "hide" future wrong libuv usages.
  2. Closing the Timer of the DepUsrSCTP::Checker singleton in the Worker::Closer() method. Of course this requires splitting the body of DepUsrSCTP::ClassInit() and DepUsrSCTP::ClassDestroy() in 4 static methods instead, this is, 2 new functions that just create and close the Checker instance. Those are the ones that should be called by Worker.cpp in its constructor and Close() method.

@ibc
Copy link
Member Author

ibc commented Jun 2, 2021

ok, I've done solution 2.

THIS PR IS READY TO BE MERGED IN V3 BRANCH

@nazar-pc @jmillan please review (although I've already tested it).

@ibc ibc added this to the v3 updates milestone Jun 2, 2021
@ibc ibc requested a review from jmillan June 2, 2021 00:29
@ibc ibc changed the title uv_run() and inactive UV handles investigation Fix DepUsrSCTP::Checker::timer not being freed Jun 2, 2021
Copy link
Collaborator

@nazar-pc nazar-pc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on my end with valgrind, looks good!

worker/src/DepLibUV.cpp Outdated Show resolved Hide resolved
worker/src/DepUsrSCTP.cpp Show resolved Hide resolved
@ibc
Copy link
Member Author

ibc commented Jun 2, 2021

Ok, I get this in the worker tests: (ABORT) DepLibUV::ClassDestroy() | failed to close libuv loop: EBUSY

And it makes sense because we have to fix the tests. I'll change the abort with a error log.

@jmillan
Copy link
Member

jmillan commented Jun 2, 2021

The remaining open handle in the tests happens for the Timer of NackGenerator, in the following test:

	SECTION("require key frame")
	{
		RtpStreamRecvListener listener;
		RtpStreamRecv rtpStream(&listener, params);

		packet->SetSequenceNumber(1);
		rtpStream.ReceivePacket(packet);

		// Seq different is bigger than MaxNackPackets in NackGenerator, so it
		// triggers a key frame.
		packet->SetSequenceNumber(1003);
		listener.shouldTriggerPLI = true;
		listener.shouldTriggerFIR = false;
		rtpStream.ReceivePacket(packet);
	}

It reproduces by simly creating a RtpStreamRecv instance:

        SECTION("require key frame")
	{
		RtpStreamRecvListener listener;
		RtpStreamRecv rtpStream(&listener, params);
         }

The following logs gives a glimpse of what's going on:

NackGenerator::NackGenerator
NackGenerator::~NackGenerator
Timer::Close()
type: 13, active: 0, closing: 1, has_ref :1
Timer onClose()

The Timer static onClose() method is executed in the next iteration after closing the timer handle, which is an expeted behavior.

@ibc
Copy link
Member Author

ibc commented Jun 2, 2021

The Timer static onClose() method is executed in the next iteration after closing the timer handle, which is an expeted behavior.

Of course, this is how libuv works, but we are doing it wrong here by not runnning DepLibUV::RunLoop() after closing/freeing the NackGenerator instance.

@ibc
Copy link
Member Author

ibc commented Jun 2, 2021

@jmillan running DepLibUV::RunLoop() at the end of TestNackGenerator.cpp fixes 9 issues (expected since this is the number of NackGenerator instances created in that file).

Do you know where else the remaining 6 may be? I mean: they must be classes that create a Timer inside.

@ibc
Copy link
Member Author

ibc commented Jun 2, 2021

@jmillan @nazar-pc I've pushed a commit to always check uv_walk (this is, whether there is open handles) and print errors if so.

As told in my previous comment, there are 6 remaining issues in worker tests. It can be shown (in this branch) by running test as follows (in mediasoup/worker/ folder):

MS_TEST_LOG_LEVEL=error make test

Output:

[...]
===============================================================================
All tests passed (1606 assertions in 39 test cases)

DepLibUV::onWalk() | alive UV handle found (this shouldn't happen) [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | alive UV handle found (this shouldn't happen) [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | alive UV handle found (this shouldn't happen) [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | alive UV handle found (this shouldn't happen) [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | alive UV handle found (this shouldn't happen) [type:13, active:0, closing:1, has_ref:1]
DepLibUV::onWalk() | alive UV handle found (this shouldn't happen) [type:13, active:0, closing:1, has_ref:1]

@ibc
Copy link
Member Author

ibc commented Jun 2, 2021

EVERYTHING FIXED

@ibc ibc merged commit 78763f4 into v3 Jun 2, 2021
@ibc ibc deleted the pr575_research branch June 2, 2021 11:34
@ibc
Copy link
Member Author

ibc commented Jun 2, 2021

Released in mediasoup 3.7.11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants