Icestorm: Max Subscriptions at 2048 for single topic? #1261

Closed
wilstoff opened this issue Feb 17, 2021 · 2 comments

@wilstoff
Contributor

wilstoff commented Feb 17, 2021

We recently had a segfault in IceStorm. Luckily we had trace logs on, but we lost the core dump (the machine didn't have enough disk space to store it).

segfault at 4030 ip 00007ffff7184f48 sp 00007fff9a7fac20 error 4 in libstdc++.so.6.0.19[7ffff70c6000+e9000]
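
Even without a core dump, the kernel line above carries some information. The following is a minimal sketch, assuming the standard x86 Linux `segfault at ... ip ... error ... in module[base+size]` layout and the usual x86 page-fault error-code bits (bit 0: protection violation, bit 1: write access, bit 2: user mode). The function name `decode_segfault` and the dictionary keys are illustrative, not part of any tool.

```python
import re

# Pattern for the x86 Linux kernel "segfault at ..." log line (all numbers are hex).
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>[0-9a-f]+) in (?P<module>[^\[\s]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Extract the faulting module, the offset of ip inside it, and the error bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    error = int(m["error"], 16)
    return {
        "module": m["module"],
        "fault_address": int(m["addr"], 16),
        # Offset of the instruction pointer relative to the mapping base.
        "offset_in_module": ip - base,
        "protection_violation": bool(error & 1),  # False: page was not mapped
        "write_access": bool(error & 2),          # False: the access was a read
        "user_mode": bool(error & 4),
    }


line = ("segfault at 4030 ip 00007ffff7184f48 sp 00007fff9a7fac20 "
        "error 4 in libstdc++.so.6.0.19[7ffff70c6000+e9000]")
info = decode_segfault(line)
print(info["module"], hex(info["offset_in_module"]))  # libstdc++.so.6.0.19 0xbef48
```

Here error 4 decodes to a user-mode read of an unmapped page at address 0x4030. With matching debug symbols installed, the offset can usually be passed to a symbolizer such as `addr2line -f -C -e <path to the library> 0xbef48` to narrow down the faulting function; whether the offset maps directly depends on how the library is laid out in memory.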

After analyzing the trace logs, I noticed that the last valid log entry was adding a subscription to one topic, which had exactly 2048 current subscriptions. This number seemed highly suspect, so I started looking through the configs and code for any setting that might correlate, but was not able to find anything.

We're running version 3.6.2 of the libraries in a cluster of 3 nodes on CentOS 7.3. Our IceStorm configs are as follows:

# Server configuration
Ice.Admin.ServerId=$NodeName[1/2/3]
Ice.ProgramName=$NodeName[1/2/3]
IceBox.Service.IceStorm=IceStormService,36:createIceStorm --Ice.Config='/etc/zeroc/db3/node/servers/$NodeName[1/2/3]/config/config_IceStorm'
IceBox.LoadOrder=IceStorm
# Server descriptor properties
IceMX.Metrics.Debug.GroupBy=id
IceMX.Metrics.Debug.Disabled=1
IceMX.Metrics.ByParent.GroupBy=parent
IceMX.Metrics.ByParent.Disabled=1
Ice.MessageSizeMax=2048000
Ice.Admin.Endpoints=tcp -p 10003
Ice.Admin.InstanceName=server
IceMX.Metrics.icebox.GroupBy=id
IceMX.Metrics.icebox.Disabled=1
Ice.StdErr=/var/log/IceGrid/$NodeName[1/2/3].err
Ice.StdOut=/var/log/IceGrid/$NodeName[1/2/3].out
Ice.Default.Locator=$IceGridName/Locator:default -h $Node1IP -p $Node1NormPort:ws -h $Node1IP -p $Node1WSPort:default -h $Node2IP -p $Node2NormPort:ws -h $Node2IP -p $Node2WSPort:default -h $Node3IP -p $Node3NormPort:ws -h $Node3IP -p $Node3WSPort

We're going to try updating Ice.MessageSizeMax in case some internal messaging between the nodes also uses this setting, but we're not sure this is actually the cause. We've also had other, more recent segfaults where the topic did not have 2048 subscribers at the time. These might be related to the following pull requests, but that is very much a shot in the dark: #1259, #1260
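
For scale, the Ice property reference documents Ice.MessageSizeMax in kilobytes (default 1024), so the value in the configuration above is already close to 2 GiB. A quick conversion, as a sketch (the helper name `message_size_max_bytes` is illustrative):

```python
def message_size_max_bytes(kilobytes: int) -> int:
    """Convert an Ice.MessageSizeMax value (given in kilobytes) to bytes."""
    return kilobytes * 1024


configured = 2048000  # Ice.MessageSizeMax from the configuration above
size = message_size_max_bytes(configured)
print(f"{size} bytes, {size / 2**30:.2f} GiB")  # 2097152000 bytes, 1.95 GiB
```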

Any help pointing towards where this issue might lie would be greatly appreciated. We're having multiple segfaults, roughly one every 3 weeks or so. Some auto-recover (we try to restart the service if it goes down), but most require a full restart.

@bentoi
Member

bentoi commented Feb 18, 2021

The segfault could indeed be caused by the lack of synchronization for the tracing, thanks for finding this one! The best way to confirm this would be to get a core dump of the crash and see where it occurs.

It would be good to figure out why you have such a high number of subscribers. Are you making sure to unsubscribe subscribers when they terminate gracefully or no longer need the subscription?

Subscribers that don't unsubscribe (possibly because the subscriber crashed) should eventually be reaped automatically once they are no longer reachable. The subscriber retryCount QoS controls when they are reaped, see https://doc.zeroc.com/ice/3.7/ice-services/icestorm/icestorm-quality-of-service#id-.IceStormQualityofServicev3.7-RetryCountQoSforIceStorm

Are you perhaps setting retryCount to -1?
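
As a rough illustration of the retryCount semantics described in the IceStorm QoS documentation (-1: retry forever, 0: remove the subscriber on the first delivery failure, positive: allow that many failures first), here is a minimal sketch that builds the QoS dictionary passed at subscription time. The helper names `retry_count_qos` and `describe_retry_count` are hypothetical, and the commented-out subscribe call assumes an existing IceStorm topic proxy and subscriber proxy from Ice for Python.

```python
from typing import Dict


def retry_count_qos(retry_count: int) -> Dict[str, str]:
    """Build an IceStorm QoS dictionary carrying only the retryCount entry."""
    if retry_count < -1:
        raise ValueError("retryCount must be -1, 0, or a positive integer")
    # QoS values are strings, so the count is converted explicitly.
    return {"retryCount": str(retry_count)}


def describe_retry_count(retry_count: int) -> str:
    """Summarize how IceStorm is documented to treat a failing subscriber."""
    if retry_count == -1:
        return "retry forever; not reaped on ordinary delivery failures"
    if retry_count == 0:
        return "reaped on the first delivery failure"
    return f"reaped after more than {retry_count} delivery failures"


qos = retry_count_qos(3)
print(qos, "->", describe_retry_count(3))

# With Ice for Python and a running IceStorm, the dictionary would be used as:
# topic.subscribeAndGetPublisher(qos, subscriber_proxy)
```

A subscriber registered with retryCount set to -1 stays subscribed across ordinary delivery failures, which is one way a topic can accumulate stale subscriptions.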

You could enable subscriber tracing to see when subscribers are added or removed and when retries occur (with the service.Trace.Subscriber property, where service is the name of your IceStorm service, see https://doc.zeroc.com/ice/3.7/property-reference/icestorm-properties#id-.IceStormPropertiesv3.7-service.Trace.Subscriber)

Another option would be to enable metrics and check the number of subscribers with the IceGrid GUI using the metrics functionality (see https://doc.zeroc.com/ice/3.7/ice-services/icegrid/icegrid-gui-tool)

@wilstoff
Contributor Author

We do see regular purges of dropped subscribers, so I believe our system is working correctly. The topic in question is our most-subscribed topic: every app and service that runs in our system holds one subscription to it (singleton-based), so it isn't surprising that this topic would have 2048 or more subscriptions at any one time in the regular course of a day. We will try to set up our servers so we can retrieve these core dumps and then hopefully provide more information.

@externl externl closed this as completed Mar 3, 2023