Threads get stuck in confluence #951

Marty · 2020-11-23T15:29:11Z

We have a confluence plugin that runs a job that calls a REST endpoint of confluence once for every user sequentially.

When ocelot is attached, only the first ~100 requests succeed; the remaining few hundred run into a timeout and get a 502 error.

2020-11-23 15:58:53,789 ERROR [Caesium-1-1] [xyz.profile.Display] generate user: {}
java.io.IOException:
        at tld.company.xyz.profile.ConfluenceContent.updatePageContent(ConfluenceContent.java:69)
        at tld.company.xyz.profile.Display.generate(Display.java:51)
        at tld.company.xyz.data.RebuildMasterProfiles.rebuild(RebuildMasterProfiles.java:19)
        at tld.company.xyz.job.RebuildMasterProfilesRunner.runJob(RebuildMasterProfilesRunner.java:46)
        at com.atlassian.confluence.impl.schedule.caesium.JobRunnerWrapper.doRunJob(JobRunnerWrapper.java:117)
        at com.atlassian.confluence.impl.schedule.caesium.JobRunnerWrapper.lambda$runJob$0(JobRunnerWrapper.java:87)
        at com.atlassian.confluence.impl.vcache.VCacheRequestContextManager.doInRequestContextInternal(VCacheRequestContextManager.java:84)
        at com.atlassian.confluence.impl.vcache.VCacheRequestContextManager.doInRequestContext(VCacheRequestContextManager.java:68)
        at com.atlassian.confluence.impl.schedule.caesium.JobRunnerWrapper.runJob(JobRunnerWrapper.java:87)
        at com.atlassian.scheduler.core.JobLauncher.runJob(JobLauncher.java:134)
        at com.atlassian.scheduler.core.JobLauncher.launchAndBuildResponse(JobLauncher.java:106)
        at com.atlassian.scheduler.core.JobLauncher.launch(JobLauncher.java:90)
        at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.launchJob(CaesiumSchedulerService.java:435)
        at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeLocalJob(CaesiumSchedulerService.java:402)
        at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:380)
        at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:66)
        at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
        at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Server returned HTTP response code: 502 for URL: https://confluence-mirror.tld/rest/api/content/41225737
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
        at tld.company.xyz.profile.ContentConnection.getResponse(ContentConnection.java:57)
        at tld.company.xyz.profile.ContentConnection.put(ContentConnection.java:39)
        at tld.company.xyz.profile.ConfluenceContent.updatePageContent(ConfluenceContent.java:66)
        ... 18 more

At the same time I get the following error in catalina.out.

23-Nov-2020 15:57:58.321 WARNING [Catalina-utility-3] org.apache.catalina.valves.StuckThreadDetectionValve.notifyStuckThreadDetected Thread [http-nio-127.0.0.1-8090-exec-61] (id=[720]) has been active for [65,362] milliseconds (since [11/23/20 3:56 PM]) to serve the same request for [https://confluence-mirror.tld/rest/api/content/41225737] and may be stuck (configured threshold for this StuckThreadDetectionValve is [60] seconds). There is/are [58] thread(s) in total that are monitored by this Valve and may be stuck.
        java.lang.Throwable
                at sun.misc.Unsafe.park(Native Method)
                at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
                at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:393)
                at org.apache.http.pool.AbstractConnPool.access$300(AbstractConnPool.java:70)

After some time all http threads are stuck and the instance is unresponsive.
If I disable ocelot and restart confluence, the job runs successfully for all users repeatedly.

Do you have any ideas how I can debug this or what's causing this?
I already disabled a few scopes (see #948).

MariusBrill · 2020-11-24T08:04:25Z

Hey @Marty,
I suspect it always gets stuck at around 100 requests because the agent's instrumentation takes some time to take effect.
So the first ~100 requests may be handled while inspectIT is not yet instrumenting your server.

As for the problem itself:
We also had issues with the special sensors in Jira and Confluence. You can try disabling them as well. To do so, add

special:
    thread-start-context-propagation: false
    executor-context-propagation: false
    scheduled-executor-context-propagation: false
    class-loader-delegation: false

to the path inspectit.instrumentation in the YAML file you created to solve the last issue.
Be aware that this may further restrict the agent's functionality. If that does the trick, I advise you to re-enable the special sensors one at a time to find out which one causes the issue.
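
As a sketch of the resulting nesting, assuming the file has the usual `inspectit` root key, the snippet above would sit as follows. The keys are the ones listed above; the comments are only illustrative.

```yaml
inspectit:
    instrumentation:
        special:
            # Set all four to false first to check whether the stuck threads disappear.
            # Afterwards, switch them back to true one at a time to find the culprit.
            thread-start-context-propagation: false
            executor-context-propagation: false
            scheduled-executor-context-propagation: false
            class-loader-delegation: false
```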

Cheers
Marius

Marty · 2020-11-24T09:19:09Z

This config results in a bunch of errors and the instance not being reachable due to a content encoding problem.

catalina.out.log
ocelot-config.yml.txt

Marty · 2020-11-24T09:27:51Z

For reference, here is a successful startup without ocelot attached:
catalina.out.ok.log

The parameters I use in setenv.sh to attach ocelot:

CATALINA_OPTS="-javaagent:/export/home/confluence/monitoring/inspectit-ocelot-agent-1.6.1.jar ${CATALINA_OPTS}"
CATALINA_OPTS="-Dinspectit.config.file-based.path=/export/home/confluence/monitoring/config ${CATALINA_OPTS}"

MariusBrill · 2020-11-24T15:09:19Z

Sorry for the late reply. I have looked into it and found a few rules that may cause problems when the special sensors are turned off.

If you add this config, Confluence should start normally again:

r_jdbc_prepared_sql_stop_propagation:
    scopes:
        's_jdbc_preparedstatement_execute': false
        's_jdbc_preparedstatement_executeBatch': false

r_jdbc_servicegraph_record:
    scopes:
        's_jdbc_preparedstatement_execute': false
        's_jdbc_preparedstatement_executeBatch': false
        's_jdbc_statement_execute': false
        's_jdbc_statement_executeBatch': false

r_jdbc_tracing_preparedstatement:
    scopes:
        's_jdbc_preparedstatement_execute': false
        's_jdbc_preparedstatement_executeBatch': false
You can then check if the Threads still get stuck.
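
For orientation, here is a minimal sketch of where one of these rule overrides might live in the config file, assuming they belong under `inspectit.instrumentation.rules` (the thread only says "under rules"). The other two rules would follow the same pattern.

```yaml
inspectit:
    instrumentation:
        rules:
            # Quoted here only to match the style of other rule names; the quoting is optional in YAML.
            'r_jdbc_tracing_preparedstatement':
                scopes:
                    # false detaches the scope from this rule.
                    's_jdbc_preparedstatement_execute': false
                    's_jdbc_preparedstatement_executeBatch': false
```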

Marty · 2020-11-30T13:53:48Z

Hi @MariusBrill,
I put them under rules. Is that correct?
I also added quotes around the names to match the format of the other rules in that section.

But now I receive all sorts of errors like this:

Error creating generic action 'a_entrypoint_check' in context of class net.java.ao.ParameterMetadataCachingPreparedStatement! Using a No-Operation action instead!

On the upside, the instance is reachable again :)

Marty · 2020-11-30T15:18:43Z

... and I don't get timeouts and stuck threads anymore, thanks!

MariusBrill · 2020-12-01T10:06:25Z

Hey @Marty,
Great to hear that it fixed your issue! =)
Yes, it was correct to put them under rules.

The errors you now see come from actions in the default configuration that may no longer work now that these special sensors are turned off. However, the agent disables them on its own. That is what "Using a No-Operation action instead!" means.

If you don't want these messages to be displayed, you need to find the rules that use these actions (e.g. here, search for a rule containing "a_entrypoint_check") and turn them off or override them so they no longer use the action.
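
Turning off such a rule could look roughly like the sketch below. The rule name `r_rule_using_entrypoint_check` is a hypothetical placeholder, not a rule from the default configuration, so replace it with the rule you actually find. The sketch also assumes that your agent version supports the rule-level `enabled` flag.

```yaml
inspectit:
    instrumentation:
        rules:
            # Hypothetical name: substitute the rule that references a_entrypoint_check.
            'r_rule_using_entrypoint_check':
                enabled: false
```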

Cheers
Marius

mariusoe · 2021-01-18T08:03:46Z

Problem seems to be solved. Feel free to open a new ticket if there are further problems.

MariusBrill self-assigned this Nov 24, 2020

mariusoe closed this as completed Jan 18, 2021
