Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] query the dead-letter queue information in Pulsar Manager, the broker's GC grows rapidly #23805

Open
2 of 3 tasks
g0715158 opened this issue Jan 3, 2025 · 9 comments
Open
2 of 3 tasks
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@g0715158
Copy link

g0715158 commented Jan 3, 2025

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

3.0.1

Minimal reproduce step

When I query the dead-letter queue information in Pulsar Manager, the broker's GC grows rapidly. The same happens when I use admin commands.

What did you expect to see?

The broker GC pause does not show significant fluctuations.

What did you see instead?

When I query the dead-letter queue information in Pulsar Manager, the broker's GC grows rapidly. The same happens when I use admin commands.
570a3c88ab71c9424aacb35f112aea9

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@g0715158 g0715158 added the type/bug The PR fixed a bug or issue reported a bug label Jan 3, 2025
@g0715158 g0715158 changed the title [Bug] [Bug] query the dead-letter queue information in Pulsar Manager, the broker's GC grows rapidly Jan 3, 2025
@lhotari
Copy link
Member

lhotari commented Jan 6, 2025

Minimal reproduce step

When I query the dead-letter queue information in Pulsar Manager, the broker's GC grows rapidly. The same happens when I use admin commands.

@g0715158 Please share the admin commands to reproduce the issue.

@lhotari
Copy link
Member

lhotari commented Jan 6, 2025

3.0.1

@g0715158 Please attempt to reproduce on latest 3.0.x release, 3.0.8 . Can you reproduce on 3.0.8 ?

@g0715158
Copy link
Author

g0715158 commented Jan 7, 2025

Minimal reproduce step

When I query the dead-letter queue information in Pulsar Manager, the broker's GC grows rapidly. The same happens when I use admin commands.

@g0715158 Please share the admin commands to reproduce the issue.

@lhotari
The following API is causing a rapid increase in GC (Garbage Collection) time.
/admin/v2/persistent/pulsar/iagw/housekeeping_session-DLQ-partition-16/stats?getPreciseBacklog=false&authoritative=false&subscriptionBacklogSize=true&getEarliestTimeInBacklog=false

This phenomenon of GC time increase is particularly noticeable:
1、Not all DLQ topics lead to a significant increase in GC time. Currently, out of all the topics in the cluster, only 2 DLQ topics are experiencing this issue.
2、Through the logs, it was found that when there is a significant increase in GC time, the API being called is handled by the ForkJoinPool.commonPool-worker thread.
Partial log as follows:
2025-01-03T09:54:12,851+0800 [BookKeeperClientWorker-OrderedExecutor-43-0] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/ims/persistent/17_rate_out-partition-13] Created new ledger 10720342
2025-01-03T09:54:13,911+0800 [bookkeeper-ml-scheduler-OrderedScheduler-104-0] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/ims/persistent/17_rate_out-partition-13] End TrimConsumedLedgers. ledgers=1 totalSize=194152
2025-01-03T09:54:13,911+0800 [bookkeeper-ml-scheduler-OrderedScheduler-104-0] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/ims/persistent/17_rate_out-partition-13] Removing ledger 10718720 - size: 58916963
2025-01-03T09:54:14,715+0800 [pulsar-io-4-4] WARN org.apache.pulsar.broker.service.Consumer - unackedMsgs is : -9, ackedMessages : -3, consumer : Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/gprs/11_rate_out_b-partition-5, name=housekeeping_session}, consumerId=1025, consumerName=6bfaa, address=/10.33.234.38:27173}
2025-01-03T09:54:14,799+0800 [pulsar-web-37-85] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:14 +0800] "GET /admin/v2/non-persistent/pulsar/autoftp/0x2eeeeeec_0x33333330?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 2
2025-01-03T09:54:14,829+0800 [pulsar-web-37-5] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:14 +0800] "GET /admin/v2/non-persistent/pulsar/iagw/0x91111108_0x9555554c?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 3
2025-01-03T09:54:14,830+0800 [pulsar-web-37-70] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:14 +0800] "GET /admin/v2/non-persistent/pulsar/iagw/0xd999998c_0xddddddd0?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 2
2025-01-03T09:54:14,830+0800 [pulsar-web-37-140] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:14 +0800] "GET /admin/v2/non-persistent/pulsar/iagw/0x9dddddd4_0xa2222218?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1
2025-01-03T09:54:14,839+0800 [pulsar-web-37-9] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:14 +0800] "GET /admin/v2/non-persistent/pulsar/fx/0x7ffffff8_0x8444443c?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 2
2025-01-03T09:54:14,839+0800 [pulsar-web-37-19] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:14 +0800] "GET /admin/v2/non-persistent/pulsar/fx/0xe2222214_0xe6666658?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 2
2025-01-03T09:54:14,840+0800 [pulsar-web-37-140] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:14 +0800] "GET /admin/v2/non-persistent/pulsar/fx/0xb777776c_0xbbbbbbb0?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1
2025-01-03T09:54:14,895+0800 [ForkJoinPool.commonPool-worker-2267] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:14 +0800] "GET /admin/v2/persistent/pulsar/iagw/housekeeping_session-DLQ-partition-16/stats?getPreciseBacklog=false&authoritative=false&subscriptionBacklogSize=true&getEarliestTimeInBacklog=false HTTP/1.1" 200 8071538 "-" "Pulsar-Java-v3.0.1" 100
2025-01-03T09:54:14,895+0800 [ForkJoinPool.commonPool-worker-2250] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:14 +0800] "GET /admin/v2/persistent/pulsar/iagw/housekeeping_session-DLQ-partition-12/stats?getPreciseBacklog=false&authoritative=false&subscriptionBacklogSize=true&getEarliestTimeInBacklog=false HTTP/1.1" 200 8071520 "-" "Pulsar-Java-v3.0.1" 100
2025-01-03T09:54:15,188+0800 [pulsar-web-37-141] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:15 +0800] "GET /admin/v2/non-persistent/pulsar/gprs/0x88888880_0x8cccccc4?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 2
2025-01-03T09:54:15,191+0800 [pulsar-web-37-140] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:15 +0800] "GET /admin/v2/non-persistent/pulsar/gprs/0xaaaaaaa0_0xaeeeeee4?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1
2025-01-03T09:54:15,191+0800 [pulsar-web-37-236] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:15 +0800] "GET /admin/v2/non-persistent/pulsar/gprs/0xeaaaaa9c_0xeeeeeee0?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1
2025-01-03T09:54:17,132+0800 [pulsar-io-4-33] WARN org.apache.pulsar.broker.service.Consumer - unackedMsgs is : -1, ackedMessages : -1, consumer : Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/gprs/21_rate_out_b-partition-4, name=housekeeping_session}, consumerId=664, consumerName=9b30a, address=/10.33.234.138:16505}
2025-01-03T09:54:18,798+0800 [pulsar-io-4-95] WARN org.apache.pulsar.broker.service.Consumer - unackedMsgs is : -1, ackedMessages : -8, consumer : Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/ims/20_rate_in-partition-4, name=controlNode-20}, consumerId=444, consumerName=ab1f8, address=/10.33.235.42:22679}
2025-01-03T09:54:19,295+0800 [pulsar-io-4-100] WARN org.apache.pulsar.broker.service.Consumer - unackedMsgs is : -1, ackedMessages : -6, consumer : Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/ims/20_rate_in-partition-4, name=controlNode-20}, consumerId=524, consumerName=68e10, address=/10.33.235.55:26249}
2025-01-03T09:54:19,900+0800 [pulsar-io-4-100] WARN org.apache.pulsar.broker.service.Consumer - unackedMsgs is : -1, ackedMessages : -73, consumer : Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/ims/20_rate_in-partition-4, name=controlNode-20}, consumerId=504, consumerName=a6fc2, address=/10.33.235.55:26249}
2025-01-03T09:54:20,436+0800 [pulsar-io-4-18] WARN org.apache.pulsar.broker.service.Consumer - unackedMsgs is : -1, ackedMessages : -1, consumer : Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/gprs/21_rate_out_b-partition-4, name=housekeeping_session}, consumerId=1004, consumerName=b5cae, address=/10.33.234.51:62346}
2025-01-03T09:54:20,909+0800 [pulsar-io-4-100] WARN org.apache.pulsar.broker.service.Consumer - unackedMsgs is : -1, ackedMessages : -100, consumer : Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/ims/20_rate_in-partition-4, name=controlNode-20}, consumerId=584, consumerName=2b966, address=/10.33.235.55:26249}
2025-01-03T09:54:21,022+0800 [pulsar-io-4-95] WARN org.apache.pulsar.broker.service.Consumer - unackedMsgs is : -1, ackedMessages : -100, consumer : Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/ims/20_rate_in-partition-4, name=controlNode-20}, consumerId=484, consumerName=c0491, address=/10.33.235.42:22679}
2025-01-03T09:54:21,901+0800 [pulsar-io-4-42] WARN org.apache.pulsar.broker.service.Consumer - unackedMsgs is : -4, ackedMessages : -1, consumer : Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/gprs/21_rate_out_b-partition-4, name=housekeeping_session}, consumerId=784, consumerName=36628, address=/10.33.234.158:24326}
2025-01-03T09:54:23,749+0800 [pulsar-web-37-19] INFO org.apache.pulsar.broker.admin.impl.PersistentTopicsBase - [null][persistent://pulsar/iagw/housekeeping_session-DLQ-partition-12] Creating subscription test at message id 9223372036854775807:9223372036854775807:-1 with properties null
2025-01-03T09:54:23,749+0800 [pulsar-web-37-141] INFO org.apache.pulsar.broker.admin.impl.PersistentTopicsBase - [null][persistent://pulsar/iagw/housekeeping_session-DLQ-partition-16] Creating subscription test at message id 9223372036854775807:9223372036854775807:-1 with properties null
2025-01-03T09:54:23,752+0800 [pulsar-web-37-141] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:23 +0800] "PUT /admin/v2/persistent/pulsar/iagw/housekeeping_session-DLQ-partition-16/subscription/test?authoritative=false&replicated=false HTTP/1.1" 409 50 "-" "Pulsar-Java-v3.0.1" 3
2025-01-03T09:54:23,752+0800 [pulsar-web-37-19] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:23 +0800] "PUT /admin/v2/persistent/pulsar/iagw/housekeeping_session-DLQ-partition-12/subscription/test?authoritative=false&replicated=false HTTP/1.1" 409 50 "-" "Pulsar-Java-v3.0.1" 4
2025-01-03T09:54:27,991+0800 [pulsar-web-37-140] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:27 +0800] "GET /admin/v2/non-persistent/pulsar/autoftp/0x2eeeeeec_0x33333330?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 2
2025-01-03T09:54:27,996+0800 [pulsar-web-37-19] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:27 +0800] "GET /admin/v2/non-persistent/pulsar/iagw/0x91111108_0x9555554c?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1
2025-01-03T09:54:27,999+0800 [pulsar-web-37-140] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:27 +0800] "GET /admin/v2/non-persistent/pulsar/iagw/0xd999998c_0xddddddd0?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1
2025-01-03T09:54:28,004+0800 [pulsar-web-37-19] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:28 +0800] "GET /admin/v2/non-persistent/pulsar/iagw/0x9dddddd4_0xa2222218?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 2
2025-01-03T09:54:28,008+0800 [pulsar-web-37-9] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:28 +0800] "GET /admin/v2/non-persistent/pulsar/fx/0xe2222214_0xe6666658?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1
2025-01-03T09:54:28,010+0800 [pulsar-web-37-70] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:28 +0800] "GET /admin/v2/non-persistent/pulsar/fx/0x7ffffff8_0x8444443c?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 2
2025-01-03T09:54:28,010+0800 [pulsar-web-37-140] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:28 +0800] "GET /admin/v2/non-persistent/pulsar/fx/0xb777776c_0xbbbbbbb0?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1
2025-01-03T09:54:28,103+0800 [ForkJoinPool.commonPool-worker-2267] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:28 +0800] "GET /admin/v2/persistent/pulsar/iagw/housekeeping_session-DLQ-partition-12/stats?getPreciseBacklog=false&authoritative=false&subscriptionBacklogSize=true&getEarliestTimeInBacklog=false HTTP/1.1" 200 8071520 "-" "Pulsar-Java-v3.0.1" 99
2025-01-03T09:54:28,110+0800 [ForkJoinPool.commonPool-worker-2250] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:28 +0800] "GET /admin/v2/persistent/pulsar/iagw/housekeeping_session-DLQ-partition-16/stats?getPreciseBacklog=false&authoritative=false&subscriptionBacklogSize=true&getEarliestTimeInBacklog=false HTTP/1.1" 200 8071538 "-" "Pulsar-Java-v3.0.1" 105
2025-01-03T09:54:28,297+0800 [pulsar-web-37-236] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:28 +0800] "GET /admin/v2/non-persistent/pulsar/gprs/0x88888880_0x8cccccc4?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 2
2025-01-03T09:54:28,300+0800 [pulsar-web-37-140] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:28 +0800] "GET /admin/v2/non-persistent/pulsar/gprs/0xaaaaaaa0_0xaeeeeee4?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1
2025-01-03T09:54:28,301+0800 [pulsar-web-37-19] INFO org.eclipse.jetty.server.RequestLog - xx.xx.xx.xx - - [03/Jan/2025:09:54:28 +0800] "GET /admin/v2/non-persistent/pulsar/gprs/0xeaaaaa9c_0xeeeeeee0?authoritative=false HTTP/1.1" 200 2 "-" "Pulsar-Java-v3.0.1" 1

@g0715158
Copy link
Author

g0715158 commented Jan 7, 2025

3.0.1

@g0715158 Please attempt to reproduce on latest 3.0.x release, 3.0.8 . Can you reproduce on 3.0.8 ?

@lhotari
This phenomenon did not recur in my local environment on version 3.0.1, nor did it occur in my local environment on version 3.0.7. I believe this is because the data volume in the local environment is relatively small, and the production environment handles 200TB of data per day, which cannot be simulated with the smaller data volume in the local environment.

@lhotari
Copy link
Member

lhotari commented Jan 7, 2025

3.0.1

@g0715158 Please attempt to reproduce on latest 3.0.x release, 3.0.8 . Can you reproduce on 3.0.8 ?

@lhotari This phenomenon did not recur in my local environment on version 3.0.1, nor did it occur in my local environment on version 3.0.7. I believe this is because the data volume in the local environment is relatively small, and the production environment handles 200TB of data per day, which cannot be simulated with the smaller data volume in the local environment.

@g0715158 There are hundreds of improvements between 3.0.1 and 3.0.8, including critical security fixes. You might run into many of the bugs that have already been fixed if you keep running on old versions.

@g0715158
Copy link
Author

g0715158 commented Jan 7, 2025

There are hundreds of improvements between 3.0.1 and 3.0.8, including critical security fixes. You might run into many of the bugs that have already been fixed if you keep running on old versions.

@lhotari I checked the release logs between versions 3.0.1 and 3.0.8 and couldn't find any repair instructions for this issue

@lhotari
Copy link
Member

lhotari commented Jan 7, 2025

@lhotari I checked the release logs between versions 3.0.1 and 3.0.8 and couldn't find any repair instructions for this issue

@g0715158 Due to the nature of an issue such as GC pressure, it's very unlikely that an improvement would be explicitly listed as a "repair instruction" in the changes. When there are hundreds of improvements, it's impossible to state what the behavior is with the latest releases such as Pulsar 3.0.8 or Pulsar 4.0.1 unless you really do the work and test it.
There are also critical security issues that have been fixed, so ignoring upgrades is not a very useful strategy for running a Pulsar cluster. In the Apache Pulsar opensource project, the expectation is that bug reporters would do they part of the work and at least test with the latest releases. I hope you could consider that as the highest priority if you'd like to see improvements in the issue that you have reported.

@lhotari
Copy link
Member

lhotari commented Jan 7, 2025

@g0715158 If you'd like to find out the reason for the memory pressure, profiling tools such as AsyncProfiler and Java Flight Recorder are useful since they can be used to find the source of memory allocations. I have some personal tools that I have shared in https://github.com/lhotari/k8s-diagnostics-toolbox to do profiling in k8s clusters. There are many ways to use AsyncProfiler and Java Flight Recorder, that provides one way.

@g0715158
Copy link
Author

g0715158 commented Jan 7, 2025

3.0.1

@g0715158 Please attempt to reproduce on latest 3.0.x release, 3.0.8 . Can you reproduce on 3.0.8 ?

@lhotari This phenomenon did not recur in my local environment on version 3.0.1, nor did it occur in my local environment on version 3.0.7. I believe this is because the data volume in the local environment is relatively small, and the production environment handles 200TB of data per day, which cannot be simulated with the smaller data volume in the local environment.

@g0715158 There are hundreds of improvements between 3.0.1 and 3.0.8, including critical security fixes. You might run into many of the bugs that have already been fixed if you keep running on old versions.

@lhotari I understand. Thank you for your help

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

2 participants