ModSecurity: collections_remove_stale: Failed deleting collection #576

rcbarnett-zz · 2013-10-17T20:24:21Z

MODSEC-428: When server is under load of 1500 concurrent users Modsecurity starts to show the below error in apache's error log and the server CPU usage goes up to dangerous levels.

[Wed Oct 02 21:21:52 2013] [error] [client 109.127.81.10] ModSecurity: collections_remove_stale: Failed deleting collection (name "ip", key "109.127.81.138_9e4f93d096e2ca3744251c41dde47a1a7b26fa75"): Internal error [hostname "www.regayzanko.com"] [uri "/php5-fcgi/setting/student_collage_list.php"] [unique_id "UkxkMMaIMEoAADg9zw0AAAgr"]

I can see that this issue was patched see: https://www.modsecurity.org/tracker/browse/MODSEC-97 but i am using the latest modsecurity version "2.7.5" so i do not know why it is happening.

rcbarnett-zz · 2013-10-17T20:24:21Z

Original reporter: zertux

rcbarnett-zz · 2013-10-17T20:24:21Z

zertux: Can this process be disabled ? or is it a necessity for mod_Security's operation ?

kurktchiev · 2013-10-29T14:15:39Z

I am having the exact same issue here on 2.7.5

wneessen · 2013-11-18T11:06:02Z

Seeing similar behaviour on FreeBSD 8.4, Apache/2.2.25, mod_sec 2.7.5.
Machine is running fine for a day or two (even just in DetectionOnly mode). Then all of a sudden, the CPU load increases extremely and Apache-Server status shows lots of "L"s. At some point all slots are exhaused and Apache will be unresponsive. Can be fixed by restarting apache, but it will happen again after some time.

mod_sec debug.log shows a couple of these:
[15/Nov/2013:21:10:30 +0000] [www.somehost.com/sid#805894478][rid#809e3a0a0][/some/uri/excel_2010_advanced_popup.jpg][1] collection_retrieve_ex: Failed deleting collection (name "ip", key "123.123.123.123_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error
[15/Nov/2013:21:13:10 +0000] [www.somehost.com/sid#80178fdb0][rid#80924e0a0][/some/uri/cookie.htm][1] collection_retrieve_ex: Failed deleting collection (name "ip", key "122.122.122.122_81548c7486a53bd388be273db20199283dfac871"): Internal error

wneessen · 2014-01-06T13:48:28Z

Hi, is there any update on this issue? Let me know if I can provide any further information.

marcstern · 2014-01-16T15:28:52Z

Same problem with 2.7.6 :-(
Can we expect a fix

klydel · 2014-01-29T21:50:56Z

same issue. is there a particular way to disable this behavior altogether?

ekenberg · 2014-02-10T12:47:30Z

Same issue 2.7.7, any news?

oparoz · 2014-03-14T13:40:35Z

Soon after that error message, we get a few "Failed reading DBM file" and Apache is unreachable until the process is killed.

oparoz · 2014-03-21T15:38:14Z

We're using ZFS on FreeBSD. How about the other people affected by this bug?

marcstern · 2014-03-21T15:39:15Z

Same problem with Redhat, CentOS & Windows

oparoz · 2014-03-21T15:41:48Z

Thanks!

lifeforms · 2014-07-01T15:56:00Z

I think we are having the same issue on ModSecurity 2.7.7 on FreeBSD 9.2 UFS with a reasonably busy server (but not SUPER busy, e.g. 100 httpd clients but lots of heavy database work).

A few times a week, "something" happens that makes all Apache processes seem to spin on disk activity. Disk I/O by Apache shoots up to crazy levels (100MB/s) with CPU time spent mostly in system. In Apache status, children get stuck in "Logging" (L) state. It stays that way for 10-20 minutes, then calms down.

I narrowed it down to the user of collections. The problem completely goes away when I disable an anti-bruteforce rule which stores a block flag/timer in a collection. When I disable this rule, the problem is immediately gone. When I kill Apache and re-enable the rule, the problem is back.

When looking in modsec_audit.log, I see all kinds of "Internal errors" such as the following:

Message: collections_remove_stale: Failed deleting collection (name "ip", key "198.167.x.x_x"): Internal error

I now think that this condition probably happens when some clients hit the collection concurrently. The issue has become more problematic since switching to a twin-site SAN (I imagine that it takes longer before finishing a sync write because of the speed of light), so it might depend on certain I/O latency to happen.

The fact that the issue always disappears by itself after a number of minutes (but NOT by restarting Apache!), leads me to think that probably it ends by some collection item being deprecated. I have no reliable way to reproduce the problem unfortunately, besides using collections in production and waiting until the site is crippled.

antoxxa · 2015-01-14T15:59:35Z

stil no answer ?

lifeforms · 2015-01-14T16:07:57Z

I don't have this problem myself anymore since we're now not using persistent collections on that server anymore.

However, I found that the server was using /tmp to store the data files. This directory happened to be huge (75000 other entries). My suspicion is that it might take too long to do a database lookup or store, and this sends the code into sad country. It was probably made worse due to our SAN with (relatively) high latency storage.

Could other people experiencing this issue try the following:

Check where your data files are stored (SecDataDir) and see if this might be slow due to an overfull directory? Maybe cleaning up the tmp directory helps already.
Try setting SecDataDir to its own dedicated directory.
Try setting SecDataDir to a tmpfs or memory filesystem.

I wonder if this makes the issue go away for you.

antoxxa · 2015-02-02T08:20:42Z

finding in SecDataDir ip.pag and clearing it out helps ... but not sure why apache cannot do it itslef ... despite the fact that it is creating that file

pgib · 2015-04-16T00:49:08Z

@antoxxa Thanks. Had the exact same problem with an ip.pag file that had grown to 115MB. (FreeBSD 9.3, Apache 2.2)

nobleclem · 2015-05-06T13:12:10Z

Do we know if this is being worked on? Seems like a but like this thats been open for about 2 years would have been investigated by now.

onewaypub · 2015-05-17T09:10:58Z

Same issue here with version 2.8.0. But error message is diffrent

ModSecurity: collection_retrieve_ex: Failed deleting collection (name "ip", key "217.254.155.4_6246985e080008849486b7f63d4773c0b4b3f2dd") ...

Also tried to delete the ip.pag file in the SecDataDir. But it do not help

lebagvondouche · 2015-05-21T13:21:02Z

Has anyone seen this issue in 2.9.0? I don't have the time to test it.

celesteking · 2015-05-23T14:30:23Z

"ModSecurity: collections_remove_stale: Failed deleting collection XXX"

2.8.0, apache 2.2 mpm worker fixed thread count, suphp.
SecTmpDir writable by apache user, separate dir, (always empty?).
SecDataDir writable by apache user, separate dir.

Suggestions on solving this bug are welcome. Modsec devs, you might want to run a special separate cleaning thread/process or create another solution maybe?

zimmerle · 2015-05-25T12:15:36Z

Hi @celesteking,

You can cleanup your sdbm using a separated process: https://github.com/SpiderLabs/modsec-sdbm-util

Still, you may want to reduce the timeout and the amount of data to save in the collections. If you try to save a lot of data for a extended period of time, it eventually fails. There are some studies to use memcache to centralize the storage here:

https://github.com/SpiderLabs/ModSecurity/tree/memcache_collections

celesteking · 2015-05-25T14:34:53Z

That might work, thanks, we're gonna incorporate that util to run periodically.
SecCollectionTimeout has been set to 600 from day 1 here. Still, ip.pag file was growing up to 1gb.

celesteking · 2015-05-25T16:29:43Z

Speaking of, when modsec is in detectiononly mode, will the stale/expired entries be removed?

zcwang3 · 2016-09-09T01:47:02Z

We are also having this issue with mod_security 2.9.0, any update/plan on this bug?
What's the real root cause for it?

pgib · 2016-09-09T02:06:05Z

I've had to resort to deleting the file in a cronjob nightly. O_o

zcwang3 · 2016-09-09T07:14:44Z

if a IP address is detected to exceed the maximum number of IP request per Sec, request from this IP address will be blocked by Mod_security.
How long does this IP address request continues to be blocked?
Delete the file by cron job will impact the blocking behavior and then impact the preventing Dos attack effect?

refer to Mod_security Notes, collection's default expired time is one hour.
The documentation is very unclear about expiration of persistant collections. The documentation for expirevar seems to imply that collections by default do not expire. However, this is not the case. The default expiration time is given later in the documentation as 3600 seconds (1 hour), but my first attempts at increasing the expiration time (useful for spam blacklists) failed mysteriously.

dhaupin · 2016-09-09T17:11:20Z

@zcwang3 Did you ever figure out what causes this issue? Just noticed that its happening on one of our srvs. Modsec 2.9.0. The ip.pag file has hit 3.1G

zcwang3 · 2016-09-12T01:46:30Z

@dhaupin one of my colleague find that "collections_remove_stale: Failed deleting collection" happens when mod_security use apr_sdbm_delete() method to delete expired keys. Mod_security try to delete an sdbm record by key with Apache Portable Runtime Utility Library, but we do not know what causing this failure.

Issue existing in the storage of mod_security, alternative solution is change to use other storage like Redis or Memcached. Unfortunately, refer to Implement Redis support as Collection backend on libmodsecurity, it is not implemented.

Some one suggest using cron job to delete pag file every day and then restart apache, but it does not make sense to me.

Since this issue exists 3 years without solution, and we find other similar issues, like: IP persistence storage seems to not clean up . We are struggling if it is reasonable to keep using mod_security in our production.

Any other suggestion to fix this issue is appreciated!

related source code is as following:
if (expiry_time <= now) {
rc = apr_sdbm_delete(dbm, key);
if (rc != APR_SUCCESS) {
msr_log(msr, 1, "collections_remove_stale: Failed deleting collection (name "%s", "
"key "%s"): %s", log_escape(msr->mp, col_name),
log_escape_ex(msr->mp, key.dptr, key.dsize - 1), get_apr_error(msr->mp, rc));
msr->msc_sdbm_delete_error = 1;
goto error;
}

zimmerle · 2016-09-21T03:00:52Z

Hi @zcwang3,

We have already implemented support for LMDB [#1141] and InMemory back-end inside ModSecurity version 3. The support for Redis [#1139] and Memcache [#1140] are on our backlog, to be implemented soon.

I think the root cause of this problem is that we are abusing the usage of the apr_sdbm implementation. IHMO was not designed to be used so intensively. That is why we advice to reduce the expiration time to a lower value, therefore reducing the amount of register in a given collection. Anything that you can change in the rules in that sense will reduce the size of the collection.

The persistent collection keep a in-memory cache and from time to time it sync the memory content with the sdbm storage. Notice that each process from your webserver [worker, if you prefer] has a different in-memory collection that is synced with the sdbm from time to time. The error message that you saw, most likely is not really related to the size (in disk) of your collection. It is just saying that it is trying to delete a collection that was already deleted, probably by other process [that is one possibility].

The part of the code that calls the memory cleanup is available here:
https://github.com/SpiderLabs/ModSecurity/blob/master/apache2/modsecurity.c#L240-L253

There was a initial/experimental support for memcache built on top of 2.x, which is available here:
https://github.com/SpiderLabs/ModSecurity/tree/v2/dev/memcache_collections

marcstern · 2016-09-21T06:18:57Z

Felipe,
From your explanation, it looks obvious that this warning should be disabled by default in 2.x.
The added value of this warning is nul.

zimmerle · 2016-09-26T17:55:11Z

Hi @marcstern,

I don't think so, the warning is just a warning. It does not means that there is an error. In order circumstances the warning may be more meaningful.

mturk · 2016-10-06T11:39:00Z

Not sure what kind of mpm is used in case of corruption, but in case of threaded mpm's the file locking does not work for threads belonging to the same process. In such case threads can be doing different operations on the sdbm although those operations presume they are protected by lock. I have a patch (#1224) that solves persistent storage data corruption issues, so it could probably fix this one as well in case the issue is observed with threaded (worker, event) mpm's.

nrob81 · 2016-11-27T20:54:38Z

Hello,

I had the same problem (debian 8.5, modsecurity 2.8.0).
I am disabled SecResponseBodyAccess and the warnings disappeared.

bostrt · 2017-02-15T03:44:06Z

This issue is resolved by @mturk's #1224 pull request. The issue is that multiple threads are trying to delete the same collection. I dropped in some debug logging to prove this.

I added the following (line offsets may be a bit off from master):

@@ -676,6 +699,11 @@ int collections_remove_stale(modsec_rec *msr, const char *col_name) {
                 }
 
                 if (expiry_time <= now) {
+                    if (msr->txcfg->debuglog_level >= 9) {
+                        msr_log(msr, 9, "collections_remove_stale: Attempting to delete collection (name \"%s\", "
+                            "key \"%s\")", log_escape(msr->mp, col_name),
+                            log_escape_ex(msr->mp, key.dptr, key.dsize - 1));
+                    }
                     rc = apr_sdbm_delete(dbm, key);
                     if (rc != APR_SUCCESS) {
                         msr_log(msr, 1, "collections_remove_stale: Failed deleting collection (name \"%s\", "
@@ -698,13 +726,14 @@ int collections_remove_stale(modsec_rec *msr, const char *col_name) {

I ran a simple test using serial requests and saw this logging output (scroll horizontally to see session ID):

[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee144150][/index.html][9] collections_remove_stale: Attempting to delete collection (name "SESSION", key "session9271")
[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee142140][/index.html][9] collections_remove_stale: Attempting to delete collection (name "SESSION", key "session9271")
[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee142140][/index.html][9] collections_remove_stale: Attempting to delete collection (name "SESSION", key "session9271")
[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee144150][/index.html][1] collections_remove_stale: Failed deleting collection (name "SESSION", key "session9271"): Internal error (specific information not available)
[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee142140][/index.html][1] collections_remove_stale: Failed deleting collection (name "SESSION", key "session9271"): Internal error (specific information not available)

As you can see, the same session9271 is attempting to be delete by 3 different threads. The first thread actually deletes it from storage and the subsequent two threads receive the "Failed deleting collection" message.

…ging of collection delete problem in audit log when log level < 9 in audit log [Issue #576 - Marc Stern]

alexgoldsilver · 2017-05-04T14:52:53Z

The fix in PR 1224 has been outstanding for a while with no comments. These entries make parsing our audit logs for real threats very very difficult without heavily modifying our tools. Is there at least a way to suppress the warnings for this in the audit log using a configuration directive until some fix is applied to the threading issues?

zimmerle · 2017-05-05T01:25:00Z

@alexgoldsilver recently #1380 was merged into our mainline.

zimmerle · 2017-05-21T21:01:38Z

You may want to have a look at: #1224 by enabling the global mutex for collections, this problem should be resolved.

ghost assigned zimmerle Oct 17, 2013

micha137 mentioned this issue Apr 16, 2015

collection_store,collections_remove_stale and collection_retrieve_ex bugs #867

Closed

This was referenced Feb 15, 2017

IP persistence storage seems to not clean up #574

Closed

GC frequency option #1298

Closed

marcstern mentioned this issue Apr 10, 2017

Option to disable logging of collection delete problem #1380

Closed

zimmerle pushed a commit that referenced this issue Apr 12, 2017

{dis|en}able-collection-delete-problem-logging: Option to disable log…

22b46a1

…ging of collection delete problem in audit log when log level < 9 in audit log [Issue #576 - Marc Stern]

zimmerle pushed a commit that referenced this issue Apr 26, 2017

{dis|en}able-collection-delete-problem-logging: Option to disable log…

b6a7061

…ging of collection delete problem in audit log when log level < 9 in audit log [Issue #576 - Marc Stern]

zimmerle pushed a commit that referenced this issue Apr 26, 2017

{dis|en}able-collection-delete-problem-logging: Option to disable log…

d243818

…ging of collection delete problem in audit log when log level < 9 in audit log [Issue #576 - Marc Stern]

zimmerle closed this as completed May 21, 2017

itbuiductai mentioned this issue Sep 22, 2017

Apache + Modsecurity + Mlogc make crash in logging phase #1567

Closed

zimmerle mentioned this issue Dec 20, 2017

collections_remove_stale: Failed deleting collection #1635

Closed

victorhora mentioned this issue Aug 22, 2018

IIS 10 - collections_remove_stale #1878

Closed

victorhora mentioned this issue Sep 25, 2018

collection_retrieve_ex: Failed to lock proc mutex #1913

Closed

ModSecurity: collections_remove_stale: Failed deleting collection #576

ModSecurity: collections_remove_stale: Failed deleting collection #576

Comments

rcbarnett-zz commented Oct 17, 2013

rcbarnett-zz commented Oct 17, 2013

rcbarnett-zz commented Oct 17, 2013

kurktchiev commented Oct 29, 2013

wneessen commented Nov 18, 2013

wneessen commented Jan 6, 2014

marcstern commented Jan 16, 2014

klydel commented Jan 29, 2014

ekenberg commented Feb 10, 2014

oparoz commented Mar 14, 2014

oparoz commented Mar 21, 2014

marcstern commented Mar 21, 2014

oparoz commented Mar 21, 2014

lifeforms commented Jul 1, 2014

antoxxa commented Jan 14, 2015

lifeforms commented Jan 14, 2015

antoxxa commented Feb 2, 2015

pgib commented Apr 16, 2015

nobleclem commented May 6, 2015

onewaypub commented May 17, 2015

lebagvondouche commented May 21, 2015

celesteking commented May 23, 2015

zimmerle commented May 25, 2015

celesteking commented May 25, 2015

celesteking commented May 25, 2015

zcwang3 commented Sep 9, 2016

pgib commented Sep 9, 2016

zcwang3 commented Sep 9, 2016

dhaupin commented Sep 9, 2016 • edited Loading

zcwang3 commented Sep 12, 2016

zimmerle commented Sep 21, 2016

marcstern commented Sep 21, 2016

zimmerle commented Sep 26, 2016

mturk commented Oct 6, 2016

nrob81 commented Nov 27, 2016

bostrt commented Feb 15, 2017

alexgoldsilver commented May 4, 2017

zimmerle commented May 5, 2017

zimmerle commented May 21, 2017

dhaupin commented Sep 9, 2016 •

edited

Loading