Threads get stuck in confluence #951

Closed
Marty opened this issue Nov 23, 2020 · 8 comments

Marty commented Nov 23, 2020

We have a Confluence plugin that runs a job which calls a Confluence REST endpoint once per user, sequentially.

When the inspectIT Ocelot agent is attached, only the first ~100 requests succeed; the remaining few hundred run into a timeout and fail with a 502 error.

2020-11-23 15:58:53,789 ERROR [Caesium-1-1] [xyz.profile.Display] generate user: {}
java.io.IOException:
        at tld.company.xyz.profile.ConfluenceContent.updatePageContent(ConfluenceContent.java:69)
        at tld.company.xyz.profile.Display.generate(Display.java:51)
        at tld.company.xyz.data.RebuildMasterProfiles.rebuild(RebuildMasterProfiles.java:19)
        at tld.company.xyz.job.RebuildMasterProfilesRunner.runJob(RebuildMasterProfilesRunner.java:46)
        at com.atlassian.confluence.impl.schedule.caesium.JobRunnerWrapper.doRunJob(JobRunnerWrapper.java:117)
        at com.atlassian.confluence.impl.schedule.caesium.JobRunnerWrapper.lambda$runJob$0(JobRunnerWrapper.java:87)
        at com.atlassian.confluence.impl.vcache.VCacheRequestContextManager.doInRequestContextInternal(VCacheRequestContextManager.java:84)
        at com.atlassian.confluence.impl.vcache.VCacheRequestContextManager.doInRequestContext(VCacheRequestContextManager.java:68)
        at com.atlassian.confluence.impl.schedule.caesium.JobRunnerWrapper.runJob(JobRunnerWrapper.java:87)
        at com.atlassian.scheduler.core.JobLauncher.runJob(JobLauncher.java:134)
        at com.atlassian.scheduler.core.JobLauncher.launchAndBuildResponse(JobLauncher.java:106)
        at com.atlassian.scheduler.core.JobLauncher.launch(JobLauncher.java:90)
        at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.launchJob(CaesiumSchedulerService.java:435)
        at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeLocalJob(CaesiumSchedulerService.java:402)
        at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:380)
        at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:66)
        at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
        at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Server returned HTTP response code: 502 for URL: https://confluence-mirror.tld/rest/api/content/41225737
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
        at tld.company.xyz.profile.ContentConnection.getResponse(ContentConnection.java:57)
        at tld.company.xyz.profile.ContentConnection.put(ContentConnection.java:39)
        at tld.company.xyz.profile.ConfluenceContent.updatePageContent(ConfluenceContent.java:66)
        ... 18 more

At the same time, the following warning appears in catalina.out:

23-Nov-2020 15:57:58.321 WARNING [Catalina-utility-3] org.apache.catalina.valves.StuckThreadDetectionValve.notifyStuckThreadDetected Thread [http-nio-127.0.0.1-8090-exec-61] (id=[720]) has been active for [65,362] milliseconds (since [11/23/20 3:56 PM]) to serve the same request for [https://confluence-mirror.tld/rest/api/content/41225737] and may be stuck (configured threshold for this StuckThreadDetectionValve is [60] seconds). There is/are [58] thread(s) in total that are monitored by this Valve and may be stuck.
        java.lang.Throwable
                at sun.misc.Unsafe.park(Native Method)
                at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
                at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:393)
                at org.apache.http.pool.AbstractConnPool.access$300(AbstractConnPool.java:70)

After some time, all HTTP threads are stuck and the instance becomes unresponsive.
If I disable Ocelot and restart Confluence, the job runs successfully for all users, repeatedly.

Do you have any idea what is causing this, or how I can debug it?
I have already disabled a few scopes (see #948).

MariusBrill (Member) commented Nov 24, 2020

Hey @Marty,
I suspect it always gets stuck at around 100 requests because the agent's instrumentation takes some time to become active.
So the first ~100 requests may complete while inspectIT is not yet instrumenting your server.

As for the problem itself:
We have also had issues with the special sensors in Jira and Confluence. You can try disabling them as well. To do so, add

special:
    thread-start-context-propagation: false
    executor-context-propagation: false
    scheduled-executor-context-propagation: false
    class-loader-delegation: false

under the path inspectit.instrumentation in the YAML file you created to solve the last issue.
Be aware that this may further restrict the functionality of the agent. If it does the trick, I advise you to re-enable the special sensors one at a time to find out which one causes the problem.
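
As a minimal sketch of the nesting, assuming the snippet is merged into the existing file-based configuration (the top-level keys follow from the inspectit.instrumentation path named above; the rest of the file is not shown and is left unchanged):

```yaml
# Sketch only: merge these keys into the existing YAML file, do not replace it.
inspectit:
  instrumentation:
    special:
      # Each flag switches off one special sensor. To narrow down the culprit,
      # set one flag back to true at a time and re-run the job.
      thread-start-context-propagation: false
      executor-context-propagation: false
      scheduled-executor-context-propagation: false
      class-loader-delegation: false
```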

Cheers
Marius

MariusBrill self-assigned this Nov 24, 2020

Marty (Author) commented Nov 24, 2020

This config results in a bunch of errors and the instance not being reachable due to a content encoding problem.

catalina.out.log
ocelot-config.yml.txt

Marty (Author) commented Nov 24, 2020

For reference, here is the log of a successful startup without Ocelot attached:
catalina.out.ok.log

The parameters I use in setenv.sh to attach ocelot:

CATALINA_OPTS="-javaagent:/export/home/confluence/monitoring/inspectit-ocelot-agent-1.6.1.jar ${CATALINA_OPTS}"
CATALINA_OPTS="-Dinspectit.config.file-based.path=/export/home/confluence/monitoring/config ${CATALINA_OPTS}"

MariusBrill (Member) commented Nov 24, 2020

Sorry for the late reply. I looked into this and found a few rules that may cause problems when the special sensors are turned off.

If you add this config, Confluence should start normally again:

r_jdbc_prepared_sql_stop_propagation:
    scopes:
        's_jdbc_preparedstatement_execute': false
        's_jdbc_preparedstatement_executeBatch': false

r_jdbc_servicegraph_record:
    scopes:
        's_jdbc_preparedstatement_execute': false
        's_jdbc_preparedstatement_executeBatch': false
        's_jdbc_statement_execute': false
        's_jdbc_statement_executeBatch': false

r_jdbc_tracing_preparedstatement:
    scopes:
        's_jdbc_preparedstatement_execute': false
        's_jdbc_preparedstatement_executeBatch': false

You can then check whether the threads still get stuck.

Marty (Author) commented Nov 30, 2020

Hi @MariusBrill,
I put them under rules. Is that correct?
I also added quotes around the rule names to match the format of the other rules in that section.

But now I receive all sorts of errors like this:

Error creating generic action 'a_entrypoint_check' in context of class net.java.ao.ParameterMetadataCachingPreparedStatement! Using a No-Operation action instead!

On the upside, the instance is reachable again :)

Marty (Author) commented Nov 30, 2020

... and I don't get timeouts and stuck threads anymore, thanks!

MariusBrill (Member) commented

Hey @Marty,
great to hear that it fixed your issue! =)
Yes, it was correct to put them under rules.
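
For reference, a sketch of the resulting nesting, assuming rules sits under inspectit.instrumentation in the same file (only the first of the three rules is shown; the other two follow the same pattern):

```yaml
# Sketch only: shows where a rule override lives in the configuration tree.
inspectit:
  instrumentation:
    rules:
      # Setting a scope to false detaches the rule from that scope.
      'r_jdbc_prepared_sql_stop_propagation':
        scopes:
          's_jdbc_preparedstatement_execute': false
          's_jdbc_preparedstatement_executeBatch': false
```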

The errors you see now come from actions in the default configuration that may no longer work with these special sensors turned off. The agent disables such actions on its own, which is what the message "Using a No-Operation action instead!" means.

If you don't want these messages to appear, find the rules that use these actions (here, for example, search for a rule containing "a_entrypoint_check") and either turn them off or override them so that they no longer use the action.
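
A rough sketch of turning off such a rule, assuming the rule-level enabled flag is used for this. The rule name r_example_rule is a hypothetical placeholder; the real name has to be looked up in the default configuration by searching for the action name:

```yaml
# Sketch only: 'r_example_rule' is a placeholder, not a rule from the default config.
inspectit:
  instrumentation:
    rules:
      'r_example_rule':
        # A disabled rule is no longer applied by the agent.
        enabled: false
```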

Cheers
Marius

mariusoe (Member) commented

Problem seems to be solved. Feel free to open a new ticket if there are further problems.
