
Add Diagnostics component #727

Merged
merged 55 commits from thread-diagnostics-cpu into hyphanet:next on Oct 17, 2021

Conversation

desyncr
Contributor

@desyncr desyncr commented Apr 5, 2021

This PR updates the /diagnostics section (DiagnosticToadlet) to display thread information, including CPU usage, to the user. The intention is to show information to determine the cause of CPU usage (ex. plugins, Log, UPnP etc).

Threads (155):
 Thread ID          Job ID Name                                                                                       Prio.      Group Status               % CPU
       482      -600434685 NodeDiagnostics: thread monitor(61)                                                            5       main RUNNABLE             16.20
       337     -1605458390 RequestSender for UID -4819448343849908691 on 29966(41)                                        7       main TIMED_WAITING        4.17
       293       820065101 RequestSender for UID -3007138528378398579 on 29966(14)                                        7       main TIMED_WAITING        3.20
       510       283081595 RequestSender for UID 611930334131416017 on 29966(122)                                         7       main TIMED_WAITING        1.42
       613     -1700317102 RequestSender for UID -3867706999542896026 on 29966(175)                                       7       main TIMED_WAITING        1.29
       608      -101505222 RequestSender for UID 3554502223033864374 on 29966(173)                                        7       main TIMED_WAITING        1.27
        28              28 PacketSender thread for 29966                                                                 10       main TIMED_WAITING        1.27
       343      -486051315 RequestSender for UID 7371023460396097846 on 29966(48)                                         7       main TIMED_WAITING        1.15
        80      -971461974 UdpSocketHandler for port 30081(1)                                                            10       main RUNNABLE             0.84
       647     -1318572914 RequestSender for UID -7623304779548324689 on 29966(208)                                       7       main TIMED_WAITING        0.73
        78     -1374273574 <noname>(1)                                                                                    6       main WAITING              0.28
        29              29 Ticker thread for 29966                                                                       10       main TIMED_WAITING        0.10
       550      2024818125 Slow callback for freenet.io.xfer.BlockReceiver$3@49a11870(83)                                 5       main WAITING              0.04
        15              15 Wrapper-Control-Event-Monitor                                                                  5       main TIMED_WAITING        0.03
        24      1587637561 Network Interface Acceptor for ServerSocket[addr=/127.0.0.1,localport=8888](0)                 5       main RUNNABLE             0.02
        40      -960067964 SSK Request starter (29966) (bulk)(1)                                                          5       main WAITING              0.02
       770      1988340316 CHKInsertSender for UID -899544705615949250 on 29966 at 1628945874470(290)                     7       main TIMED_WAITING        0.01
        25              25 SimpleToadletServer                                                                            5       main WAITING              0.01
         3               3 Finalizer                                                                                      8     system WAITING              0.01
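
For reference, per-thread CPU percentages like the ones above can be derived from ThreadMXBean deltas. The following is a minimal standalone sketch of that mechanism, not the PR's actual code; class and variable names are assumptions:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch: sample per-thread CPU time twice and print the delta as a rough percentage. */
public class ThreadCpuSample {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        long intervalMillis = 1000;

        // First snapshot of CPU time (nanoseconds), keyed by thread ID.
        Map<Long, Long> before = new HashMap<>();
        for (long id : mxBean.getAllThreadIds()) {
            before.put(id, mxBean.getThreadCpuTime(id));
        }

        Thread.sleep(intervalMillis);

        // Second snapshot: the delta over the interval approximates the thread's CPU share.
        for (long id : mxBean.getAllThreadIds()) {
            Long start = before.get(id);
            long end = mxBean.getThreadCpuTime(id);
            if (start == null || start < 0 || end < 0) {
                continue; // thread appeared mid-interval, died, or CPU timing is unsupported
            }
            double percent = 100.0 * (end - start) / (intervalMillis * 1_000_000.0);
            System.out.printf("thread %d: %.2f%% CPU%n", id, percent);
        }
    }
}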

@ArneBab
Contributor

ArneBab commented Apr 5, 2021

The code looks good. Can you check whether there is something sensitive in the thread name when you insert a file? (Any info about the target key.)

Can you check that you don't change the indentation? The current format isn't ideal, but strange-yet-consistent indentation is better than inconsistent indentation.

@desyncr
Contributor Author

desyncr commented Apr 6, 2021

Some more comments over IRC:

We usually attach instrumentation to the JVM's socket: JConsole, YourKit, Eclipse. That being said, #727 looks cool. You may want to make the sleep() in the middle configurable (profiling over 1s may not always make sense) using a GET parameter or something...
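
As a rough illustration of the configurable interval, something along these lines could work. The parameter name and the clamping bounds are assumptions, not part of this PR, and the actual Toadlet request API may differ:

/**
 * Hypothetical helper: read the profiling period from a GET parameter and clamp it,
 * so a single request cannot trigger an arbitrarily long sleep().
 */
static long profilingPeriodMillis(String rawPeriodParam) {
    long defaultMillis = 1000; // current behaviour: profile over one second
    long minMillis = 30;
    long maxMillis = 10_000;
    if (rawPeriodParam == null || rawPeriodParam.isEmpty()) {
        return defaultMillis;
    }
    try {
        long requested = Long.parseLong(rawPeriodParam);
        return Math.max(minMillis, Math.min(maxMillis, requested));
    } catch (NumberFormatException e) {
        return defaultMillis; // ignore malformed input rather than failing the request
    }
}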

7 review comments on src/freenet/clients/http/DiagnosticToadlet.java (outdated, resolved)
@desyncr
Contributor Author

desyncr commented Apr 10, 2021

The code looks good. Can you check whether there is something sensitive in the thread name when you insert a file? (Any info about the target key.)

From what I can see in the thread info there's no sensitive information, but I'll look more closely at the different places this is being used.

Can you check that you don't change the indentation? The current format isn't ideal, but strange-yet-consistent indentation is better than inconsistent indentation.

I corrected the code style and imports; it should follow the convention now 👍

Edit: There's this one, maybe: Network Interface Acceptor for ServerSocket[addr=/127.0.0.1; not sure if it's too sensitive.

@desyncr desyncr changed the title Add thread cpu to diagnostics Add Diagnostics component Apr 11, 2021
Contributor

@Bombe Bombe left a comment


I’m sorry that my reviews always turn into these please-change-everything requests… 😕

Review comment on src/freenet/node/Node.java (outdated, resolved)
@desyncr desyncr marked this pull request as draft April 23, 2021 08:57
@desyncr desyncr marked this pull request as ready for review April 24, 2021 02:08
@desyncr desyncr requested a review from Bombe April 24, 2021 16:59
@Thynix
Contributor

Thynix commented Apr 28, 2021

This is a useful feature!

I'm wondering if the CPU time metrics could be misleading when combined with the pooled executor. If I understand correctly, it changes jobs within a given thread, so could it report high CPU utilization for a thread that has just finished doing something intensive, and misleadingly attribute that to an idle thread or to a job that actually doesn't use much CPU? Would it be worth it to modify the pooled executor to set a zero point when switching jobs to avoid this?

@desyncr
Contributor Author

desyncr commented May 3, 2021

A few things are still missing:

@Bombe
Contributor

Bombe commented May 4, 2021

Would it be worth it to modify the pooled executor to set a zero point when switching jobs to avoid this?

I don't think the effort would be worth it. If you split up a thread's runtime into A (the previous job, large runtime) and B (the current job, small runtime), you still wouldn't see where the larger amount of time was spent; it would just be a little less misleading, and only until the next measurement is complete, because by then we either get the real runtime for the job or the thread has already been replaced by something else. Plus, if a thread can have two jobs running in a measurement interval, it can also have 1000 jobs running in a measurement interval, and I don't think anybody wants to keep track of that. :)

We would have to modify both our own executor (which I think we should get rid of in favour of one of the JDK implementations) and the diagnostics for something that has questionable gain to begin with…

@Thynix
Contributor

Thynix commented May 4, 2021 via email

@Thynix
Contributor

Thynix commented May 6, 2021

Is it worth adding a notice to the UI about this limitation? It isn't evident from the UI, and the data only seems especially meaningful if a given job regularly shows up with high CPU time. Maybe something like:

This CPU utilization information mostly consists of threads that work on many tasks. High utilization only becomes attributable to a specific job, as opposed to a specific worker thread, if it regularly has high utilization attributed to it.

@desyncr desyncr requested a review from Bombe May 9, 2021 20:10
Review comments (outdated, resolved) on:
src/freenet/l10n/freenet.l10n.en.properties
src/freenet/node/Node.java
src/freenet/node/diagnostics/threads/NodeThreadInfo.java
src/freenet/clients/http/DiagnosticToadlet.java
@Bombe
Contributor

Bombe commented Jun 8, 2021

Edit: There's this one, maybe: Network Interface Acceptor for ServerSocket[addr=/127.0.0.1; not sure if it's too sensitive.

This could actually be sensitive if people use the most secure proxy method, where the bindTo address is 127.x.y.z rather than 127.0.0.1: getting this address would allow remote websites to search for the locally running service much more efficiently. But this could be solvable with simple special-casing for the specific thread.

Shouldn't that more sensibly be handled by making the diagnostics toadlet available to connections with full access only?

Yes, you're right. Also, if the website manages to connect to the toadlet, it must already have the correct IP.

Good news! According to FProxyToadlet#1245 this already happens so we're good here. :)

@desyncr
Contributor Author

desyncr commented Jun 24, 2021

I think in terms of functionality everything is done. I'll do some clean-up on syntax, though.

@desyncr desyncr requested review from Bombe and ArneBab June 24, 2021 20:13
Contributor

@Bombe Bombe left a comment


You know what? I think this is it. 🙂

Thank you for your efforts!

@ArneBab
Contributor

ArneBab commented Jul 2, 2021

Looking at the frequency of name changes shows that the majority of threads do not report the name of the actual runnable. How viable would it be to reduce the collection timeframe? Would there be enough info in 100ms or in 30ms?

@desyncr
Contributor Author

desyncr commented Jul 2, 2021

Looking at the frequency of name changes shows that the majority of threads do not report the name of the actual runnable. How viable would it be to reduce the collection timeframe? Would there be enough info in 100ms or in 30ms?

Such slow intervals seem, to me, like a lot of work. Mostly because I imagine the user only looking at this for a couple of seconds to figure something out and then leaving it alone, so a lot of CPU cycles would be wasted.

Maybe this value could be made configurable, like turning this functionality on/off is now.

@ArneBab
Contributor

ArneBab commented Jul 2, 2021 via email

@Bombe
Contributor

Bombe commented Jul 3, 2021

desyncr @.***> writes:

Looking at the frequency of name changes shows that the majority of threads do not report the name of the actual runnable. How viable would it be to reduce the collection timeframe? Would there be enough info in 100ms or in 30ms?

Such slow intervals seem, to me, like a lot of work.

I meant milliseconds. Can we make it so short that only a fraction of the threads switch runnables?

I think you’re still trying to fix a problem that is an edge case. We can of course take another look at it when we see how it performs in the wild but for now I think both the current code and the default interval of 1s is fine.

@ArneBab
Contributor

ArneBab commented Jul 5, 2021

Could you check this patch and adjust the measurement period to minimize the number of threads that switch names during measurement?

diff --git a/src/freenet/node/diagnostics/threads/DefaultThreadDiagnostics.java b/src/freenet/node/diagnostics/threads/DefaultThreadDiagnostics.java
index c58795291..b65019807 100644
--- a/src/freenet/node/diagnostics/threads/DefaultThreadDiagnostics.java
+++ b/src/freenet/node/diagnostics/threads/DefaultThreadDiagnostics.java
@@ -37,6 +37,8 @@ public class DefaultThreadDiagnostics implements Runnable, ThreadDiagnostics {
 
     /** Map to track thread's CPU differences between intervals of time */
     private final Map<Long, Long> threadCpu = new HashMap<>();
+    /** Key = Thread ID; Value = System.currentTimeMillis() at which threadCpu was updated. */
+    private final Map<Long, Long> lastThreadCpuMeasurementTime = new HashMap<>();
 
     /**
      * @param nodeStats Used to retrieve data points
@@ -108,10 +110,12 @@ public class DefaultThreadDiagnostics implements Runnable, ThreadDiagnostics {
      * @return Delta CPU time (nanoseconds)
      */
     private long getCpuTimeDelta(long threadId) {
+        long time = System.currentTimeMillis();
         long current = threadMxBean.getThreadCpuTime(threadId);
 
         long cpuUsage = current - threadCpu.getOrDefault(threadId, current);
         threadCpu.put(threadId, current);
+        lastThreadCpuMeasurementTime.put(threadId, time);
 
         return cpuUsage;
     }
@@ -128,6 +132,9 @@ public class DefaultThreadDiagnostics implements Runnable, ThreadDiagnostics {
 
         threadCpu.keySet()
                 .removeIf(key -> !activeThreads.contains(key));
+        
+        lastThreadCpuMeasurementTime.keySet()
+                .removeIf(key -> !activeThreads.contains(key));
     }
 
     @Override
@@ -137,10 +144,11 @@ public class DefaultThreadDiagnostics implements Runnable, ThreadDiagnostics {
             .filter(thread -> thread.getThreadGroup() != null)
             .map(thread -> new NodeThreadInfo(
                     thread.getId(),
-                    thread.getName(),
+                    getThreadNameIfAccurate(thread),
                     thread.getPriority(),
                     thread.getThreadGroup().getName(),
                     thread.getState().toString(),
+                    // Must be called after getThreadNameIfAccurate() to not destroy its input!
                     getCpuTimeDelta(thread.getId())
                 )
             )
@@ -153,4 +161,30 @@ public class DefaultThreadDiagnostics implements Runnable, ThreadDiagnostics {
         purgeInactiveThreads(threads);
         scheduleNext();
     }
+
+    /**
+     * Must be called before {@link #getCpuTimeDelta(long)} to ensure
+     * {@link #lastThreadCpuMeasurementTime} has not been changed yet for the given thread.
+     */
+    private String getThreadNameIfAccurate(Thread thread) {
+        if(!(thread instanceof PooledExecutor.MyThread))
+            return thread.getName();
+        
+        PooledExecutor.MyThread pooledThread = (PooledExecutor.MyThread)thread;
+        String allegedName = pooledThread.getName();
+        long timeOfLastNameChange = pooledThread.lastDisruptiveNameChangeMillis.get();
+        long timeOfBeginningOfMeasurement
+            = lastThreadCpuMeasurementTime.getOrDefault(thread.getId(), Long.MIN_VALUE);
+        
+        if(timeOfBeginningOfMeasurement > timeOfLastNameChange)
+            return allegedName;
+        else {
+            return "Name cannot be determined: It changed during measurement!";
+        }
+    }
 }
diff --git a/src/freenet/support/PooledExecutor.java b/src/freenet/support/PooledExecutor.java
index 4891136b2..e099cdc96 100644
--- a/src/freenet/support/PooledExecutor.java
+++ b/src/freenet/support/PooledExecutor.java
@@ -9,6 +9,7 @@ import java.util.ArrayList;
 import java.util.concurrent.atomic.AtomicLong;
 
 import freenet.node.PrioRunnable;
+import freenet.node.diagnostics.threads.DefaultThreadDiagnostics;
 import freenet.support.Logger.LogLevel;
 import freenet.support.io.NativeThread;
 
@@ -168,12 +169,18 @@ public class PooledExecutor implements Executor {
 		}
 	}
 
-	private class MyThread extends NativeThread {
+	public class MyThread extends NativeThread {
 		final String defaultName;
 		volatile boolean alive = true;
 		Job nextJob;
 		final long threadNo;
 		private boolean removed = false;
+		
+		/**
+		 * Allows {@link DefaultThreadDiagnostics} to determine if the Runnable this thread executes
+		 * changed while trying to measure CPU usage percentage of the thread.
+		 */
+		public final AtomicLong lastDisruptiveNameChangeMillis = new AtomicLong(0);
 
 		public MyThread(String defaultName, Job firstJob, long threadCounter, int prio, boolean dontCheckRenice) {
 			super(defaultName, prio, dontCheckRenice);
@@ -213,7 +220,14 @@ public class PooledExecutor implements Executor {
 					}
 					synchronized(this) {
 						if(nextJob == null) {
-							this.setName(defaultName);
+							String oldName = this.getName();
+							String newName = defaultName;
+							this.setName(newName);
+							long changeTimeMillis = System.currentTimeMillis();
+							if(!newName.equals(oldName)) {
+								lastDisruptiveNameChangeMillis.set(changeTimeMillis);
+							}
+							
 							try {
 								wait(TIMEOUT);
 							} catch(InterruptedException e) {
@@ -245,7 +259,24 @@ public class PooledExecutor implements Executor {
 
 				// Run the job
 				try {
-					setName(job.name + "(" + threadNo + ")");
+					String oldName = getName();
+					String newName = job.name + "(" + threadNo + ")";
+					setName(newName);
+					long changeTimeMillis = System.currentTimeMillis();
+					if(!newName.equals(oldName)) {
+						if(!defaultName.equals(oldName)) {
+							lastDisruptiveNameChangeMillis.set(changeTimeMillis);
+						}
+					}
+					
 					job.runnable.run();
 				} catch(Throwable t) {
 					Logger.error(this, "Caught " + t + " running job " + job, t);

I got a report with 90% of threads changing names when checked like this.

@desyncr
Contributor Author

desyncr commented Aug 1, 2021

@Bombe @ArneBab I think that by synchronizing with the PooledExecutor's thread we can avoid it switching jobs during measurement. This way we'll have the correct job name for the measurement. This should only be a performance penalty if the Diagnostics module is on.

@ArneBab
Contributor

ArneBab commented Aug 3, 2021

@Bombe @ArneBab I think that by synchronizing with the PooledExecutor's thread we can avoid it switching jobs during measurement. This way we'll have the correct job name for the measurement. This should only be a performance penalty if the Diagnostics module is on.

Does this prevent a name change between two measurements? If I understand it correctly, this gets the thread's CPU time and subtracts the previous CPU time of the thread — whichever name it had then.

If I get this right, then the only way to make this exact would be to take a snapshot of the CPU time for each name in a secondary structure (i.e. a hashmap) and change the executor to record a delta every time a name changes. The overhead of that could be significant, though, which is why I'd rather opt to keep it simple and reduce the measuring time enough that name changes during measurement are rare. Maybe avoid wrong output by aggregating all the names the thread had during measurement (on every name change, also add the new name to a list of names per thread).
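
A minimal illustration of that per-name-delta idea (not code from this PR; the class and all names here are assumptions):

import java.util.HashMap;
import java.util.Map;

/**
 * Illustration only: accumulate CPU time per job name by recording a delta
 * whenever the executor switches a thread to a new job.
 */
class PerNameCpuAccumulator {
    /** Total CPU nanoseconds attributed to each job name. */
    private final Map<String, Long> cpuNanosByName = new HashMap<>();
    /** CPU nanoseconds seen for each thread at its last job switch. */
    private final Map<Long, Long> lastCpuNanosByThread = new HashMap<>();

    /** The executor would call this just before renaming a thread for its next job. */
    synchronized void onJobSwitch(long threadId, String oldName, long currentCpuNanos) {
        long previous = lastCpuNanosByThread.getOrDefault(threadId, currentCpuNanos);
        cpuNanosByName.merge(oldName, currentCpuNanos - previous, Long::sum);
        lastCpuNanosByThread.put(threadId, currentCpuNanos);
    }

    /** Diagnostics would read a copy of the per-name totals. */
    synchronized Map<String, Long> snapshot() {
        return new HashMap<>(cpuNanosByName);
    }
}

The overhead concern above is exactly the extra map update (and locking) on every job switch.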

@desyncr
Contributor Author

desyncr commented Aug 3, 2021

@Bombe @ArneBab I think that by synchronizing with the PooledExecutor's thread we can avoid it switching jobs during measurement. This way we'll have the correct job name for the measurement. This should only be a performance penalty if the Diagnostics module is on.

Does this prevent a name change between two measurements? If I understand it correctly, this gets the thread's CPU time and subtracts the previous CPU time of the thread — whichever name it had then.

If I get this right, then the only way to make this exact would be to take a snapshot of the CPU time for each name in a secondary structure (i.e. a hashmap) and change the executor to record a delta every time a name changes. The overhead of that could be significant, though, which is why I'd rather opt to keep it simple and reduce the measuring time enough that name changes during measurement are rare. Maybe avoid wrong output by aggregating all the names the thread had during measurement (on every name change, also add the new name to a list of names per thread).

I see your point. I think if PooledExecutor implemented some sort of job ID (some pseudo-random int), the ThreadDiagnostics could track the CPU per job ID rather than per thread ID.

@desyncr
Contributor Author

desyncr commented Aug 14, 2021

Applied the logic to keep track of CPU usage per job ID (rather than per thread ID), synchronizing with the executor's thread and keeping a copy of the job's name at the time of measurement.
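
Roughly, the idea looks like this. This is a sketch under assumptions, not the merged code; field and method names are made up, and the tracker's own lock stands in for the synchronization with the executor thread:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: CPU deltas are keyed by job ID instead of thread ID, and the job's name
 * is copied while holding the lock, so the name cannot change mid-measurement.
 */
class JobCpuTracker {
    private final ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
    /** CPU time (ns) seen at the previous measurement, keyed by job ID. */
    private final Map<Long, Long> lastCpuNanosByJob = new HashMap<>();

    /** One diagnostics row: the job name frozen at measurement time plus its CPU delta. */
    static final class Sample {
        final long jobId;
        final String jobName;
        final long cpuDeltaNanos;

        Sample(long jobId, String jobName, long cpuDeltaNanos) {
            this.jobId = jobId;
            this.jobName = jobName;
            this.cpuDeltaNanos = cpuDeltaNanos;
        }
    }

    /** Handling of getThreadCpuTime() returning -1 (unsupported/dead thread) is omitted for brevity. */
    synchronized Sample measure(long jobId, String currentJobName, long threadId) {
        long current = mxBean.getThreadCpuTime(threadId);
        long previous = lastCpuNanosByJob.getOrDefault(jobId, current);
        lastCpuNanosByJob.put(jobId, current);
        return new Sample(jobId, currentJobName, current - previous);
    }
}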

@ArneBab ArneBab merged commit bb0abef into hyphanet:next Oct 17, 2021
@desyncr desyncr deleted the thread-diagnostics-cpu branch October 17, 2021 22:12
@desyncr
Contributor Author

desyncr commented Oct 17, 2021

Yeah! Glad to see this merged!
