Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes: #15194 - Check the whole queue for matching queued webhooks #15318

Conversation

peteeckel
Copy link
Contributor

Fixes: #15194

This PR changes the logic how changes to M2M relationships are matched to changes in the base object to which the M2M relationship is related. The current code assumes that the post_save is immediately followed by the m2m_changed signal for the former, but that is not generally true as any save() operation for a different object occurring in between will separate the two entries in the queue. This results in two webhooks being triggered, the first one of which has incorrect data in its payload.

The new code finds the position of the webhook data for the object change in the queue and updates it with the M2M change data and ensures only one webhook is triggered, and with the correct data.

@peteeckel peteeckel marked this pull request as draft February 29, 2024 19:09
@peteeckel peteeckel marked this pull request as ready for review March 1, 2024 12:57
@peteeckel peteeckel marked this pull request as draft March 1, 2024 17:37
@peteeckel peteeckel force-pushed the 15194-fix-duplicate-webhook-invocations branch from a2e6304 to 1728997 Compare March 1, 2024 18:10
@peteeckel peteeckel marked this pull request as ready for review March 1, 2024 18:28
@jeremystretch
Copy link
Member

@peteeckel thanks for all your work on this, both in capturing the root issue and in devising a solution! Admittedly, I haven't attempted to reproduce this firsthand, but thanks to your very detailed notes I believe I understand the root issue.

Given that we cannot rely on merely checking the most recent queued item (queue[-1]), perhaps converting the queue to a hash map would be preferable. For each item, we could compute a simple hash from its content type, primary key, and request ID, which would be stored as the entry's key in an ordered dictionary. This would allow us to easily look up a previously queued record for an object by its unique hash. Does that make sense?

If you're open to it, I'd be happy to continue your work in this direction. (Of course, you're also welcome to continue tackling it yourself.) Let me know what you think, and thanks again!

@peteeckel
Copy link
Contributor Author

Hi Jeremy, you're very welcome - I know that some of the bugs I find are slightly exotic and that you're probably working much more than usual to get NetBox 4 out, so never mind about the delay.

As for converting the quere to a dictionary: That makes a lot of sense to me, provided that nobody actually relies on it being a queue, which probably isn't the case. The only thing that might be of concern is that some users might make assumptions about the order in which webhooks are triggered, so my first thought was to use an OrderedDict.

In CPython, dicts are in fact always ordered, but the question is whether we can rely on NetBox always being run on CPython? Trying to err on the safe side I'd assume the answer is 'no', so the queue should be replaced by an OrderedDict.

I'm currently on a business trip until Thursday, so I can start working on it again on Friday - if that's not an issue I'll give it a try. On the other hand, the sooner the problem is fixed the better, so don't hesitate to go for it.

@peteeckel
Copy link
Contributor Author

Hi @jeremystretch, I just had a look at the code and it seems that converting the queue to a hash map is not a trivial change.

To me, there seems to be a showstopper in netbox/extras/events.py function flush_events, which not only expects a queue but also hands it off to the the methods listed in EVENTS_PIPELINE, which might contain other code (plugin code, maybe?) that needs to be prepared to handle the hash map instead of the queue.

So while changing it is rather simple, it will definitely break documented or at least established interfaces, which would probably be possible with the switch to NetBox 4 but not earlier, correct?

@peteeckel
Copy link
Contributor Author

Hi @jeremystretch, as this seems to be stalled on the question of the interface to methods in the EVENTS_PIPELINE - do you have a preference how to proceed with this?

I think moving from a queue to a hash map would require changing the (documented?) interface, so the question is whether or not this is an option (or whether I am missing something). If it gets changed, it would probably be best with NetBox 4.0 as a major version.

Or were you suggesting converting the queue to a hash map just for the sake of the check whether an item is already in the queue, without changing the queue structure as such? I doubt this would be an efficient use of resources, as repeatedly copying the data into a hash map is more effort than the linear search I implemented.

If not, do you require any changes to my PR?

@jeffgdotorg
Copy link
Contributor

@peteeckel Please add a test that illustrates the behavior you're describing, so that the developer who picks this up can be confident in their understanding of the problem. Additionally, given the sensitivity of the code path that this PR targets, we need to include tests which confirm the continued correct operation of the rest of the code path's behavior. For example, there's some concern about the possibility of race conditions when multiple actors try to modify the same object.

@peteeckel
Copy link
Contributor Author

Hi @jeffgdotorg, that's an interesting request :-)

  1. I can't add a test within NetBox because the behaviour can only be reproduced with a specifically crafted save() operation that calls a save() operation for a different model. AFAIK such a constellation does not exist within NetBox (but within NetBox DNS).
  2. I agree that the code path is critical, which is why I am assuming that the existing NetBox test suite already covers it - and the existing test suite passes without a flaw with the patch in place.
  3. The code (the existing as well as the patched version) runs within the context of a single request, so I don't see how concurrency could come into play. Is there an existing test in the NetBox test suite that I can use as an example for how that can be done?

@peteeckel
Copy link
Contributor Author

By the way, I can add a test to the NetBox DNS test suite. Not sure if that would help, though.

@jeremystretch
Copy link
Member

@peteeckel my biggest concern, given the nature of the bug, is that some future change to the logging logic could reintroduce the error. If you can walk me through exactly what's happening in the plugin, I might be able to rig up a test to replicate it in NetBox.

@peteeckel
Copy link
Contributor Author

HI @jeremystretch, thanks for the quick reaction.

I constructed some code to reproduce the behviour in the underlying issue. It can be triggered in multiple ways, all requiring a specifically crafted save() function that may or may not affect a model with an associated many-to-many foreign key.

The easiest way to trigger it is to call super().save() twice within that save() function - that will cause a double invocation of the event hook, one with a correct pre- and one with a correct post-snapshot. The code in events.py does not handle this correctly because it only associates an m2m_changed event with an immediately preceding post_save event, not two consecutive post_save events.

When there is an m2m relationship, the issue is also triggered when the save() method of the model calls the save() method of another model, because that also breaks the order of post_save immediately followed by m2m_changed because post_save for the second model is called before the m2m_changed event.

The only thing my patch does is to check the whole previous queue instead of only the last event in the queue for events for the same object, and not only m2m_changed after post_save, which covers both cases.

With your earlier suggestion of changing the data structure of the queue to a hash map, that would be much simpler and more elegant, however at the cost of breaking the interface for the EVENTS_PIPELINE. In general, however, nothing would change much.

@peteeckel
Copy link
Contributor Author

Just in case you are asking yourself "Why should anyone want to do that kind of stuff?" ...

Take a DNS zone, now update the list of name servers for the zone. The latter is an m2m relationship.

  • If the zone needs to be created first, that causes a post_save event for Zone.
  • Now the m2m_handler kicks in and creates the NS records in the zone (one post_save event for Record per NS record) ...
  • ... which requires the zone's serial to be updated (another post_save for Zone, which can under some circumstances be optimised away, but not always)
  • And finally, the m2m_changed event for the zone itself is queued (m2m_changed event for Zone)

And there you have everything described above in a single request.

I documented the result in peteeckel/netbox-plugin-dns#89 (comment), including a verification that the PR actually solves this.

@peteeckel peteeckel force-pushed the 15194-fix-duplicate-webhook-invocations branch from 1728997 to e535c14 Compare May 28, 2024 14:43
@peteeckel peteeckel force-pushed the 15194-fix-duplicate-webhook-invocations branch from e535c14 to b5cb67b Compare May 30, 2024 16:00
@peteeckel
Copy link
Contributor Author

peteeckel commented May 30, 2024

Sorry, I accidentally deleted the last comment instead of editing it, but never mind - I have some more data now, and a better approach to creating a test for the problem. Essentially there is a little change to the model of the dummy_plugin in netbox/tests, and some little tests to check for the correct invocation if event rules with this change. The change to the model is actually trivial:

(netbox) [root@dns netbox]# git diff develop -- netbox/netbox/tests/dummy_plugin/models.py 
diff --git a/netbox/netbox/tests/dummy_plugin/models.py b/netbox/netbox/tests/dummy_plugin/models.py
index 9bd39a46b..2677727db 100644
--- a/netbox/netbox/tests/dummy_plugin/models.py
+++ b/netbox/netbox/tests/dummy_plugin/models.py
@@ -11,3 +11,7 @@ class DummyModel(models.Model):
 
     class Meta:
         ordering = ['name']
+
+    def save(self, *args, **kwargs):
+        super().save()
+        super().save()

This is enough to trigger the erroneous invocations of the event rules for create and change events. Now for the tests:

(netbox) [root@dns netbox]# git diff develop -- netbox/extras/tests/test_event_rules.py 
diff --git a/netbox/extras/tests/test_event_rules.py b/netbox/extras/tests/test_event_rules.py
index 8cea2078a..80375ba82 100644
--- a/netbox/extras/tests/test_event_rules.py
+++ b/netbox/extras/tests/test_event_rules.py
@@ -16,6 +16,8 @@ from extras.events import enqueue_object, flush_events, serialize_for_event
 from extras.models import EventRule, Tag, Webhook
 from extras.webhooks import generate_signature, send_webhook
 from utilities.testing import APITestCase
+from netbox.tests.dummy_plugin.models import DummyModel
+from netbox.context import events_queue
 
 
 class EventRuleTest(APITestCase):
@@ -31,6 +33,7 @@ class EventRuleTest(APITestCase):
     def setUpTestData(cls):
 
         site_type = ObjectType.objects.get_for_model(Site)
+        dummy_type = ObjectType.objects.get_for_model(DummyModel)
         DUMMY_URL = 'http://localhost:9000/'
         DUMMY_SECRET = 'LOOKATMEIMASECRETSTRING'
 
@@ -65,7 +68,7 @@ class EventRuleTest(APITestCase):
             ),
         ))
         for event_rule in event_rules:
-            event_rule.object_types.set([site_type])
+            event_rule.object_types.set([site_type, dummy_type])
 
         Tag.objects.bulk_create((
             Tag(name='Foo', slug='foo'),
@@ -377,3 +380,45 @@ class EventRuleTest(APITestCase):
         # Patch the Session object with our dummy_send() method, then process the webhook for sending
         with patch.object(Session, 'send', dummy_send) as mock_send:
             send_webhook(**job.kwargs)
+
+    def test_webhook_double_save_create(self):
+        data = {
+            'name': 'Dummy',
+        }
+        url = reverse('plugins-api:dummy_plugin-api:dummymodel-list')
+        self.add_permissions('dummy_plugin.add_dummymodel')
+        response = self.client.post(url, data, format='json', **self.header)
+        self.assertHttpStatus(response, status.HTTP_201_CREATED)
+        self.assertEqual(DummyModel.objects.count(), 1)
+
+        dummy = DummyModel.objects.first()
+
+        self.assertEqual(self.queue.count, 1)
+        job = self.queue.jobs[0]
+        self.assertEqual(job.kwargs['event_rule'], EventRule.objects.get(type_create=True))
+        self.assertEqual(job.kwargs['event'], ObjectChangeActionChoices.ACTION_CREATE)
+        self.assertEqual(job.kwargs['model_name'], 'dummymodel')
+        self.assertEqual(job.kwargs['data']['id'], dummy.pk)
+        self.assertEqual(job.kwargs['snapshots']['postchange']['name'], 'Dummy')
+
+    def test_webhook_double_save_update(self):
+        dummy = DummyModel.objects.create(name='Dummy')
+
+        data = {
+            'name': 'New Dummy',
+            'number': 42,
+        }
+        url = reverse('plugins-api:dummy_plugin-api:dummymodel-detail', kwargs={'pk': dummy.pk})
+        self.add_permissions('dummy_plugin.change_dummymodel')
+        response = self.client.patch(url, data, format='json', **self.header)
+        self.assertHttpStatus(response, status.HTTP_200_OK)
+        self.assertEqual(DummyModel.objects.count(), 1)
+
+        self.assertEqual(self.queue.count, 1)
+        job = self.queue.jobs[0]
+        self.assertEqual(job.kwargs['event_rule'], EventRule.objects.get(type_update=True))
+        self.assertEqual(job.kwargs['event'], ObjectChangeActionChoices.ACTION_UPDATE)
+        self.assertEqual(job.kwargs['model_name'], 'dummymodel')
+        self.assertEqual(job.kwargs['data']['id'], dummy.pk)
+        self.assertEqual(job.kwargs['snapshots']['postchange']['name'], 'New Dummy')
+        self.assertEqual(job.kwargs['snapshots']['postchange']['number'], 42)

I appended the tests appended to extras/tests/test_event_rules.py, which seems to be the most fitting place to me.

The problem is: This does not work at all.

@peteeckel
Copy link
Contributor Author

peteeckel commented May 30, 2024

The first problem is that the updates to DummyModel objects do not trigger any EventRule. The reason is that in order to do that, the object needs to have the to_objectchange() method, and DummyModel does not have it. So I changed the model to inherit from EventRuleMixin und ChangeLoggedMixin (not sure if the former is required, but I threw it in for good measure).

diff --git a/netbox/netbox/tests/dummy_plugin/models.py b/netbox/netbox/tests/dummy_plugin/models.py
index 9bd39a46b..0e818d8ec 100644
--- a/netbox/netbox/tests/dummy_plugin/models.py
+++ b/netbox/netbox/tests/dummy_plugin/models.py
@@ -1,7 +1,8 @@
 from django.db import models
 
+from netbox.models import EventRulesMixin, ChangeLoggingMixin
 
-class DummyModel(models.Model):
+class DummyModel(EventRulesMixin, ChangeLoggingMixin, models.Model):
     name = models.CharField(
         max_length=20
     )
@@ -11,3 +12,7 @@ class DummyModel(models.Model):
 
     class Meta:
         ordering = ['name']
+
+    def save(self, *args, **kwargs):
+        super().save()
+        super().save()

The migration needs to be changed as well:

(netbox) [root@dns netbox]# git diff develop  -- netbox/netbox/tests/dummy_plugin/migrations/0001_initial.py 
diff --git a/netbox/netbox/tests/dummy_plugin/migrations/0001_initial.py b/netbox/netbox/tests/dummy_plugin/migrations/0001_initial.py
index b7241b51d..9ba1547d3 100644
--- a/netbox/netbox/tests/dummy_plugin/migrations/0001_initial.py
+++ b/netbox/netbox/tests/dummy_plugin/migrations/0001_initial.py
@@ -15,6 +15,8 @@ class Migration(migrations.Migration):
                 ('id', models.BigAutoField(auto_created=True, primary_key=True, serialize=False)),
                 ('name', models.CharField(max_length=20)),
                 ('number', models.IntegerField(default=100)),
+                ('created', models.DateTimeField(auto_now_add=True, null=True)),
+                ('last_updated', models.DateTimeField(auto_now=True, null=True)),
             ],
             options={
                 'ordering': ['name'],

This is where the trouble really started.

@peteeckel
Copy link
Contributor Author

peteeckel commented May 30, 2024

The change logging functionality relies on serialising objects, and that requires finding the serializer for the model in the first place. This is done via utilities.api.get_serializer_for_model(), and that just fails.

### NetBox interactive shell (dns.dev.hindenburgring.com)
### Python 3.11.5 | Django 5.0.6 | NetBox 4.0.4-dev
### lsmodels() will show available models. Use help(<model>) for more info.
>>> from netbox.tests.dummy_plugin.models import DummyModel
>>> from utilities.api import get_serializer_for_model
>>> get_serializer_for_model(DummyModel)
Traceback (most recent call last):
  File "/opt/netbox/netbox/utilities/api.py", line 36, in get_serializer_for_model
    return import_string(serializer_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/netbox/lib64/python3.11/site-packages/django/utils/module_loading.py", line 30, in import_string
    return cached_import(module_path, class_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/netbox/lib64/python3.11/site-packages/django/utils/module_loading.py", line 15, in cached_import
    module = import_module(module_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1126, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1126, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1140, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'dummy_plugin'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "/opt/netbox/netbox/utilities/api.py", line 38, in get_serializer_for_model
    raise SerializerNotFound(
netbox.api.exceptions.SerializerNotFound: Could not determine serializer for dummy_plugin.DummyModel with prefix ''

After a while I found out that the problem is with the way get_serializer_for_model() constructs the name of the serializer. The model._meta.label is split at the dot, then api.serializers and an optional prefix is inserted between them, and then Serializer is added to the end. This gives dummy_plugin.DummyModelSerializer. Unfortunately, due to the model path of the dummy plugin, that is not the path where the serializer can be found, hence the exception.

To be honest, I did not find a real solution for this. I have a very dirty hack in place that just works and doesn't break anything else, but it's ugly as hell and definitely can't be used like this.

(netbox) [root@dns netbox]# git diff 0bfb9777bed4fcf22f5b684504b1e62d03cdfb9b 3c38f51250710c6637667ef626f0bc484760ead9 
diff --git a/netbox/netbox/tests/dummy_plugin/models.py b/netbox/netbox/tests/dummy_plugin/models.py
index 9bd39a46b..ade46f768 100644
--- a/netbox/netbox/tests/dummy_plugin/models.py
+++ b/netbox/netbox/tests/dummy_plugin/models.py
@@ -8,6 +8,7 @@ class DummyModel(models.Model):
     number = models.IntegerField(
         default=100
     )
+    serializer_label = 'netbox.tests.dummy_plugin'
 
     class Meta:
         ordering = ['name']
diff --git a/netbox/utilities/api.py b/netbox/utilities/api.py
index 11b914811..3cedfc4a3 100644
--- a/netbox/utilities/api.py
+++ b/netbox/utilities/api.py
@@ -30,7 +30,12 @@ def get_serializer_for_model(model, prefix=''):
     """
     Return the appropriate REST API serializer for the given model.
     """
-    app_label, model_name = model._meta.label.split('.')
+    if hasattr(model, 'serializer_label'):
+        app_label = model.serializer_label
+        model_name = model._meta.label.split('.')[1]
+    else:
+        app_label, model_name = model._meta.label.split('.')
+
     serializer_name = f'{app_label}.api.serializers.{prefix}{model_name}Serializer'
     try:
         return import_string(serializer_name)

For the time being I left that hack in place. One way to do it properly would probably be to add the serializer_path as an official Meta variable, but given the specialty of this use case it's probably not the best way to handle it either. @jeremystretch hopefully has a better idea.

@peteeckel
Copy link
Contributor Author

The next problem, a trivial one, was that the serializer for DummyModel was named DummySerializer, which doesn't work either (same reason as above). But unlike the last issue this is easy to fix.

diff --git a/netbox/netbox/tests/dummy_plugin/api/serializers.py b/netbox/netbox/tests/dummy_plugin/api/serializers.py
index 239d7d998..0a6412095 100644
--- a/netbox/netbox/tests/dummy_plugin/api/serializers.py
+++ b/netbox/netbox/tests/dummy_plugin/api/serializers.py
@@ -2,7 +2,7 @@ from rest_framework.serializers import ModelSerializer
 from netbox.tests.dummy_plugin.models import DummyModel
 
 
-class DummySerializer(ModelSerializer):
+class DummyModelSerializer(ModelSerializer):
 
     class Meta:
         model = DummyModel
diff --git a/netbox/netbox/tests/dummy_plugin/api/views.py b/netbox/netbox/tests/dummy_plugin/api/views.py
index 58f221285..3fd51bcf2 100644
--- a/netbox/netbox/tests/dummy_plugin/api/views.py
+++ b/netbox/netbox/tests/dummy_plugin/api/views.py
@@ -1,8 +1,8 @@
 from rest_framework.viewsets import ModelViewSet
 from netbox.tests.dummy_plugin.models import DummyModel
-from .serializers import DummySerializer
+from .serializers import DummyModelSerializer
 
 
 class DummyViewSet(ModelViewSet):
     queryset = DummyModel.objects.all()
-    serializer_class = DummySerializer
+    serializer_class = DummyModelSerializer

And that's pretty much all ... with these changes in place there is a test that fails if the proposed patch is not in place and succeeds if it is.

The only major issue is the problem with get_serializer_for_model(). If there is a way to have the method find the serializer even if it is not in the top level directory, we have a test.

By the way, I experimented very briefly with setting app_label to netbox.tests.dummy_plugin .... failed miserably, app_label is, among other puroposes, also used to construct the table names for SQL queries.

@peteeckel
Copy link
Contributor Author

peteeckel commented May 31, 2024

By the way, I experimented very briefly with setting app_label to netbox.tests.dummy_plugin .... failed miserably, app_label is, among other puroposes, also used to construct the table names for SQL queries.

And, as a side note, it also requires changing get_serializer_for_model() ever so slightly - model._meta.label.split('.') needs to become model._meta.label.rsplit('.', 1). But anyway, changing app_label gets you nowhere.

@jeremystretch
Copy link
Member

@peteeckel I think you've been on the right path with this approach, but employing unique IDs to map each object in the queue offers better protection against duplicated events. We can also simplify it a good deal by invoking the event_tracking() context manager manually in the test to avoid any modifications to existing models.

Please have a look at the PR I just put in (#16366) and let me know what you think!

@jeremystretch jeremystretch removed their assignment May 31, 2024
@peteeckel peteeckel closed this Jun 2, 2024
@peteeckel peteeckel deleted the 15194-fix-duplicate-webhook-invocations branch June 2, 2024 09:25
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Erroneous invocations of Event Rules/Webhooks with M2M relationships
3 participants