Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ModSecurity: collections_remove_stale: Failed deleting collection #576

Closed
rcbarnett-zz opened this issue Oct 17, 2013 · 38 comments
Closed
Assignees
Labels
bug It is a confirmed bug

Comments

@rcbarnett-zz
Copy link
Contributor

MODSEC-428: When server is under load of 1500 concurrent users Modsecurity starts to show the below error in apache's error log and the server CPU usage goes up to dangerous levels.

[Wed Oct 02 21:21:52 2013] [error] [client 109.127.81.10] ModSecurity: collections_remove_stale: Failed deleting collection (name "ip", key "109.127.81.138_9e4f93d096e2ca3744251c41dde47a1a7b26fa75"): Internal error [hostname "www.regayzanko.com"] [uri "/php5-fcgi/setting/student_collage_list.php"] [unique_id "UkxkMMaIMEoAADg9zw0AAAgr"]

I can see that this issue was patched see: https://www.modsecurity.org/tracker/browse/MODSEC-97 but i am using the latest modsecurity version "2.7.5" so i do not know why it is happening.

@rcbarnett-zz
Copy link
Contributor Author

Original reporter: zertux

@rcbarnett-zz
Copy link
Contributor Author

zertux: Can this process be disabled ? or is it a necessity for mod_Security's operation ?

@ghost ghost assigned zimmerle Oct 17, 2013
@kurktchiev
Copy link

I am having the exact same issue here on 2.7.5

@wneessen
Copy link

Seeing similar behaviour on FreeBSD 8.4, Apache/2.2.25, mod_sec 2.7.5.
Machine is running fine for a day or two (even just in DetectionOnly mode). Then all of a sudden, the CPU load increases extremely and Apache-Server status shows lots of "L"s. At some point all slots are exhaused and Apache will be unresponsive. Can be fixed by restarting apache, but it will happen again after some time.

mod_sec debug.log shows a couple of these:
[15/Nov/2013:21:10:30 +0000] [www.somehost.com/sid#805894478][rid#809e3a0a0][/some/uri/excel_2010_advanced_popup.jpg][1] collection_retrieve_ex: Failed deleting collection (name "ip", key "123.123.123.123_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error
[15/Nov/2013:21:13:10 +0000] [www.somehost.com/sid#80178fdb0][rid#80924e0a0][/some/uri/cookie.htm][1] collection_retrieve_ex: Failed deleting collection (name "ip", key "122.122.122.122_81548c7486a53bd388be273db20199283dfac871"): Internal error

@wneessen
Copy link

wneessen commented Jan 6, 2014

Hi, is there any update on this issue? Let me know if I can provide any further information.

@marcstern
Copy link

Same problem with 2.7.6 :-(
Can we expect a fix

@klydel
Copy link

klydel commented Jan 29, 2014

same issue. is there a particular way to disable this behavior altogether?

@ekenberg
Copy link

Same issue 2.7.7, any news?

@oparoz
Copy link

oparoz commented Mar 14, 2014

Soon after that error message, we get a few "Failed reading DBM file" and Apache is unreachable until the process is killed.

@oparoz
Copy link

oparoz commented Mar 21, 2014

We're using ZFS on FreeBSD. How about the other people affected by this bug?

@marcstern
Copy link

Same problem with Redhat, CentOS & Windows

@oparoz
Copy link

oparoz commented Mar 21, 2014

Thanks!

@lifeforms
Copy link

I think we are having the same issue on ModSecurity 2.7.7 on FreeBSD 9.2 UFS with a reasonably busy server (but not SUPER busy, e.g. 100 httpd clients but lots of heavy database work).

A few times a week, "something" happens that makes all Apache processes seem to spin on disk activity. Disk I/O by Apache shoots up to crazy levels (100MB/s) with CPU time spent mostly in system. In Apache status, children get stuck in "Logging" (L) state. It stays that way for 10-20 minutes, then calms down.

I narrowed it down to the user of collections. The problem completely goes away when I disable an anti-bruteforce rule which stores a block flag/timer in a collection. When I disable this rule, the problem is immediately gone. When I kill Apache and re-enable the rule, the problem is back.

When looking in modsec_audit.log, I see all kinds of "Internal errors" such as the following:

Message: collections_remove_stale: Failed deleting collection (name "ip", key "198.167.x.x_x"): Internal error

I now think that this condition probably happens when some clients hit the collection concurrently. The issue has become more problematic since switching to a twin-site SAN (I imagine that it takes longer before finishing a sync write because of the speed of light), so it might depend on certain I/O latency to happen.

The fact that the issue always disappears by itself after a number of minutes (but NOT by restarting Apache!), leads me to think that probably it ends by some collection item being deprecated. I have no reliable way to reproduce the problem unfortunately, besides using collections in production and waiting until the site is crippled.

@antoxxa
Copy link

antoxxa commented Jan 14, 2015

stil no answer ?

@lifeforms
Copy link

I don't have this problem myself anymore since we're now not using persistent collections on that server anymore.

However, I found that the server was using /tmp to store the data files. This directory happened to be huge (75000 other entries). My suspicion is that it might take too long to do a database lookup or store, and this sends the code into sad country. It was probably made worse due to our SAN with (relatively) high latency storage.

Could other people experiencing this issue try the following:

  • Check where your data files are stored (SecDataDir) and see if this might be slow due to an overfull directory? Maybe cleaning up the tmp directory helps already.
  • Try setting SecDataDir to its own dedicated directory.
  • Try setting SecDataDir to a tmpfs or memory filesystem.

I wonder if this makes the issue go away for you.

@antoxxa
Copy link

antoxxa commented Feb 2, 2015

finding in SecDataDir ip.pag and clearing it out helps ... but not sure why apache cannot do it itslef ... despite the fact that it is creating that file

@pgib
Copy link

pgib commented Apr 16, 2015

@antoxxa Thanks. Had the exact same problem with an ip.pag file that had grown to 115MB. (FreeBSD 9.3, Apache 2.2)

@nobleclem
Copy link

Do we know if this is being worked on? Seems like a but like this thats been open for about 2 years would have been investigated by now.

@onewaypub
Copy link

Same issue here with version 2.8.0. But error message is diffrent

ModSecurity: collection_retrieve_ex: Failed deleting collection (name "ip", key "217.254.155.4_6246985e080008849486b7f63d4773c0b4b3f2dd") ...

Also tried to delete the ip.pag file in the SecDataDir. But it do not help

@lebagvondouche
Copy link

Has anyone seen this issue in 2.9.0? I don't have the time to test it.

@celesteking
Copy link

"ModSecurity: collections_remove_stale: Failed deleting collection XXX"

2.8.0, apache 2.2 mpm worker fixed thread count, suphp.
SecTmpDir writable by apache user, separate dir, (always empty?).
SecDataDir writable by apache user, separate dir.

Suggestions on solving this bug are welcome. Modsec devs, you might want to run a special separate cleaning thread/process or create another solution maybe?

@zimmerle
Copy link
Contributor

Hi @celesteking,

You can cleanup your sdbm using a separated process: https://github.com/SpiderLabs/modsec-sdbm-util

Still, you may want to reduce the timeout and the amount of data to save in the collections. If you try to save a lot of data for a extended period of time, it eventually fails. There are some studies to use memcache to centralize the storage here:

https://github.com/SpiderLabs/ModSecurity/tree/memcache_collections

@celesteking
Copy link

That might work, thanks, we're gonna incorporate that util to run periodically.
SecCollectionTimeout has been set to 600 from day 1 here. Still, ip.pag file was growing up to 1gb.

@celesteking
Copy link

Speaking of, when modsec is in detectiononly mode, will the stale/expired entries be removed?

@zcwang3
Copy link

zcwang3 commented Sep 9, 2016

We are also having this issue with mod_security 2.9.0, any update/plan on this bug?
What's the real root cause for it?

@pgib
Copy link

pgib commented Sep 9, 2016

I've had to resort to deleting the file in a cronjob nightly. O_o

@zcwang3
Copy link

zcwang3 commented Sep 9, 2016

if a IP address is detected to exceed the maximum number of IP request per Sec, request from this IP address will be blocked by Mod_security.
How long does this IP address request continues to be blocked?
Delete the file by cron job will impact the blocking behavior and then impact the preventing Dos attack effect?

refer to Mod_security Notes, collection's default expired time is one hour.
The documentation is very unclear about expiration of persistant collections. The documentation for expirevar seems to imply that collections by default do not expire. However, this is not the case. The default expiration time is given later in the documentation as 3600 seconds (1 hour), but my first attempts at increasing the expiration time (useful for spam blacklists) failed mysteriously.

@dhaupin
Copy link

dhaupin commented Sep 9, 2016

@zcwang3 Did you ever figure out what causes this issue? Just noticed that its happening on one of our srvs. Modsec 2.9.0. The ip.pag file has hit 3.1G

@zcwang3
Copy link

zcwang3 commented Sep 12, 2016

@dhaupin one of my colleague find that "collections_remove_stale: Failed deleting collection" happens when mod_security use apr_sdbm_delete() method to delete expired keys. Mod_security try to delete an sdbm record by key with Apache Portable Runtime Utility Library, but we do not know what causing this failure.

Issue existing in the storage of mod_security, alternative solution is change to use other storage like Redis or Memcached. Unfortunately, refer to Implement Redis support as Collection backend on libmodsecurity, it is not implemented.

Some one suggest using cron job to delete pag file every day and then restart apache, but it does not make sense to me.

Since this issue exists 3 years without solution, and we find other similar issues, like: IP persistence storage seems to not clean up . We are struggling if it is reasonable to keep using mod_security in our production.

Any other suggestion to fix this issue is appreciated!

related source code is as following:
if (expiry_time <= now) {
rc = apr_sdbm_delete(dbm, key);
if (rc != APR_SUCCESS) {
msr_log(msr, 1, "collections_remove_stale: Failed deleting collection (name "%s", "
"key "%s"): %s", log_escape(msr->mp, col_name),
log_escape_ex(msr->mp, key.dptr, key.dsize - 1), get_apr_error(msr->mp, rc));
msr->msc_sdbm_delete_error = 1;
goto error;
}

@zimmerle
Copy link
Contributor

Hi @zcwang3,

We have already implemented support for LMDB [#1141] and InMemory back-end inside ModSecurity version 3. The support for Redis [#1139] and Memcache [#1140] are on our backlog, to be implemented soon.

I think the root cause of this problem is that we are abusing the usage of the apr_sdbm implementation. IHMO was not designed to be used so intensively. That is why we advice to reduce the expiration time to a lower value, therefore reducing the amount of register in a given collection. Anything that you can change in the rules in that sense will reduce the size of the collection.

The persistent collection keep a in-memory cache and from time to time it sync the memory content with the sdbm storage. Notice that each process from your webserver [worker, if you prefer] has a different in-memory collection that is synced with the sdbm from time to time. The error message that you saw, most likely is not really related to the size (in disk) of your collection. It is just saying that it is trying to delete a collection that was already deleted, probably by other process [that is one possibility].

The part of the code that calls the memory cleanup is available here:
https://github.com/SpiderLabs/ModSecurity/blob/master/apache2/modsecurity.c#L240-L253

There was a initial/experimental support for memcache built on top of 2.x, which is available here:
https://github.com/SpiderLabs/ModSecurity/tree/v2/dev/memcache_collections

@marcstern
Copy link

Felipe,
From your explanation, it looks obvious that this warning should be disabled by default in 2.x.
The added value of this warning is nul.

@zimmerle
Copy link
Contributor

Hi @marcstern,

I don't think so, the warning is just a warning. It does not means that there is an error. In order circumstances the warning may be more meaningful.

@mturk
Copy link

mturk commented Oct 6, 2016

Not sure what kind of mpm is used in case of corruption, but in case of threaded mpm's the file locking does not work for threads belonging to the same process. In such case threads can be doing different operations on the sdbm although those operations presume they are protected by lock. I have a patch (#1224) that solves persistent storage data corruption issues, so it could probably fix this one as well in case the issue is observed with threaded (worker, event) mpm's.

@nrob81
Copy link

nrob81 commented Nov 27, 2016

Hello,

I had the same problem (debian 8.5, modsecurity 2.8.0).
I am disabled SecResponseBodyAccess and the warnings disappeared.

@bostrt
Copy link

bostrt commented Feb 15, 2017

This issue is resolved by @mturk's #1224 pull request. The issue is that multiple threads are trying to delete the same collection. I dropped in some debug logging to prove this.

I added the following (line offsets may be a bit off from master):

@@ -676,6 +699,11 @@ int collections_remove_stale(modsec_rec *msr, const char *col_name) {
                 }
 
                 if (expiry_time <= now) {
+                    if (msr->txcfg->debuglog_level >= 9) {
+                        msr_log(msr, 9, "collections_remove_stale: Attempting to delete collection (name \"%s\", "
+                            "key \"%s\")", log_escape(msr->mp, col_name),
+                            log_escape_ex(msr->mp, key.dptr, key.dsize - 1));
+                    }
                     rc = apr_sdbm_delete(dbm, key);
                     if (rc != APR_SUCCESS) {
                         msr_log(msr, 1, "collections_remove_stale: Failed deleting collection (name \"%s\", "
@@ -698,13 +726,14 @@ int collections_remove_stale(modsec_rec *msr, const char *col_name) {

I ran a simple test using serial requests and saw this logging output (scroll horizontally to see session ID):

[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee144150][/index.html][9] collections_remove_stale: Attempting to delete collection (name "SESSION", key "session9271")
[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee142140][/index.html][9] collections_remove_stale: Attempting to delete collection (name "SESSION", key "session9271")
[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee142140][/index.html][9] collections_remove_stale: Attempting to delete collection (name "SESSION", key "session9271")
[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee144150][/index.html][1] collections_remove_stale: Failed deleting collection (name "SESSION", key "session9271"): Internal error (specific information not available)
[14/Feb/2017:22:25:28 --0500] [localhost/sid#5633ee000420][rid#5633ee142140][/index.html][1] collections_remove_stale: Failed deleting collection (name "SESSION", key "session9271"): Internal error (specific information not available)

As you can see, the same session9271 is attempting to be delete by 3 different threads. The first thread actually deletes it from storage and the subsequent two threads receive the "Failed deleting collection" message.

zimmerle pushed a commit that referenced this issue Apr 12, 2017
…ging of collection delete problem in audit log when log level < 9 in audit log [Issue #576 - Marc Stern]
zimmerle pushed a commit that referenced this issue Apr 26, 2017
…ging of collection delete problem in audit log when log level < 9 in audit log [Issue #576 - Marc Stern]
zimmerle pushed a commit that referenced this issue Apr 26, 2017
…ging of collection delete problem in audit log when log level < 9 in audit log [Issue #576 - Marc Stern]
@alexgoldsilver
Copy link

The fix in PR 1224 has been outstanding for a while with no comments. These entries make parsing our audit logs for real threats very very difficult without heavily modifying our tools. Is there at least a way to suppress the warnings for this in the audit log using a configuration directive until some fix is applied to the threading issues?

@zimmerle
Copy link
Contributor

zimmerle commented May 5, 2017

@alexgoldsilver recently #1380 was merged into our mainline.

@zimmerle
Copy link
Contributor

You may want to have a look at: #1224 by enabling the global mutex for collections, this problem should be resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug It is a confirmed bug
Projects
None yet
Development

No branches or pull requests