Add Course Published event listener and plugin plumbing #1
Conversation
response.raise_for_status()

# Just overwriting the previous query
params["query"] = f"INSERT INTO {self.ch_database}.course_relationships FORMAT CSV"
Can course_relationships be a setting?
The name of the table? I considered it, but as the "owner" of the table I figured it was OK to hard code here. We already have a variable for it in the OARS plugin, though, so it probably makes sense to configure it all the same way. That said, it's going to get really complicated with the Superset queries and charts.
I think it will simplify a lot of things to keep these hard coded and owned by this plugin. Since the database can be specified, there shouldn't be any namespacing issues with other tables.
@shared_task
@set_code_owner_attribute
def dump_course_to_clickhouse(course_key_string, connection_overrides=None):
If it's not too expensive to run a full courses dump, can the course_key be optional and dump all courses?
The next task for this repo is to add a management command like the coursegraph one, which will handle this use case. It checks when each course was last dumped and kicks off a Celery task per course, so this shouldn't need to change.
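As a rough illustration of that flow, the selection step could look like the sketch below. This is a hypothetical model of the coursegraph-style freshness check, not code from this PR; the function name and signature are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: skip courses dumped recently and return the rest,
# each of which would get its own dump_course_to_clickhouse Celery task.
def courses_needing_dump(course_keys, last_dump_times, max_age=timedelta(days=1)):
    """Return course keys whose last dump is missing or older than max_age."""
    now = datetime.now(timezone.utc)
    return [
        key for key in course_keys
        if last_dump_times.get(key) is None or now - last_dump_times[key] > max_age
    ]
```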
self.log.info(response.headers)
self.log.info(response)
self.log.info(response.text)
Are those necessary?
They're vital for debugging issues; I'll put them in a try block.
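One way to do that is the sketch below: keep the noisy response details out of the happy path and only log them when the request fails. `check_response` is a hypothetical helper name; the response argument is any object with the requests.Response interface.

```python
import logging

log = logging.getLogger(__name__)

def check_response(response, log=log):
    """Raise for HTTP errors, logging the response details first (sketch)."""
    try:
        response.raise_for_status()
    except Exception:
        # Only surface the debug output when something actually went wrong.
        log.error(response.headers)
        log.error(response.text)
        raise
    return response
```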
self.log.info(response.headers)
self.log.info(response)
self.log.info(response.text)
Same comment as before
I haven't tested this (I'll do it today 😅), but I read through the code and left some style comments! Let me know what you think.
event_sink_clickhouse/apps.py
Outdated
'cms.djangoapp': {
    'production': {PluginSettings.RELATIVE_PATH: 'settings.production'},
    'common': {PluginSettings.RELATIVE_PATH: 'settings.common'},
    'devstack': {PluginSettings.RELATIVE_PATH: 'settings.devstack'},
Can we use development instead of devstack? Since we're using tutor now.
Those files don't actually exist, and as far as I know we don't need them, so I'm going to remove these.
"""
super().ready()

from . import tasks  # pylint: disable=import-outside-toplevel, unused-import
Can we use absolute imports across the project?
self.ch_auth = (connection_overrides.get("username", self.ch_auth[0]),
                connection_overrides.get("password", self.ch_auth[1]))
Can we use named tuples? I think it'll read better than accessing indexes 0 and 1.
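For example, the credentials pair could be a named tuple roughly like this. `ClickHouseAuth` and `apply_overrides` are hypothetical names, not from the PR; since a named tuple is still a tuple, it would remain compatible with `requests`' `auth=` parameter.

```python
from collections import namedtuple

# Hypothetical replacement for the bare (username, password) tuple.
ClickHouseAuth = namedtuple("ClickHouseAuth", ["username", "password"])

def apply_overrides(auth, overrides):
    """Return a new auth with any overridden credentials applied."""
    return ClickHouseAuth(
        username=overrides.get("username", auth.username),
        password=overrides.get("password", auth.password),
    )
```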
course_key = item.scope_ids.usage_id.course_key
block_type = item.scope_ids.block_type

rtn_fields = {
What is rtn? Can we be more precise?
items = modulestore.get_items(course_id)

# create nodes
i = 0
Can we be more precise with the variable names?
location_to_node = {}
items = modulestore.get_items(course_id)

# create nodes
Can we be more verbose in these inline comments? If the implementation is complex and needs some explanation, then let's be a bit more precise
fields = self.serialize_item(item, i, detached_xblock_types, dump_id, dump_timestamp)
location_to_node[self.strip_branch_and_version(item.location)] = fields

# create relationships
Same comment about the inline comments here
Base class for ClickHouse event sink, allows overwriting of default settings
"""
def __init__(self, connection_overrides, log):
    self.log = log
Why do we need the log to be part of the class?
The next PR will add a management command that calls into here, so I'm using the pattern established in Coursegraph: the log is passed in so output can go to the Celery log or the normal IDA log depending on how it's run.
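That injection pattern looks roughly like the sketch below. The class body and method names here are illustrative, not the PR's exact code; the point is that each caller supplies its own logger.

```python
import logging

# Illustrative sketch: the caller supplies the logger, so a Celery task
# and a management command can share one sink while writing to their own logs.
class BaseSink:
    def __init__(self, connection_overrides, log):
        self.log = log
        self.connection_overrides = connection_overrides or {}

    def dump(self, course_key):
        self.log.info("Dumping %s to ClickHouse", course_key)

# A task passes its Celery logger; a command passes the IDA logger.
celery_sink = BaseSink(None, logging.getLogger("celery.task"))
command_sink = BaseSink(None, logging.getLogger("cms.management"))
```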
response = requests.post(self.ch_url, data=output.getvalue(), params=params, auth=self.ch_auth,
                         timeout=self.ch_timeout_secs)
I think something like this would look better:

response = requests.post(
    self.ch_url,
    data=output.getvalue(),
    params=params,
    auth=self.ch_auth,
    timeout=self.ch_timeout_secs,
)
params = {
    # Fail early on bulk inserts
    "input_format_allow_errors_num": 1,
    "input_format_allow_errors_ratio": 0.1,
}

# "query" is a special param for the query, it's the best way to get the FORMAT CSV in there.
params["query"] = f"INSERT INTO {self.ch_database}.course_blocks FORMAT CSV"

output = io.StringIO()
writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)

for node in nodes:
    writer.writerow(node.values())
Same comment here about variables
from xmodule.modulestore.store_utilities import DETACHED_XBLOCK_TYPES
return DETACHED_XBLOCK_TYPES
This looks good! But I'm worried about compatibility issues with older releases. Do these imports work across earlier releases?
DETACHED_XBLOCK_TYPES has been in that location for 8 years, so it should be good 👍
Same for modulestore and courseoverview I guess.
import requests
from django.utils import timezone

from .base_sink import BaseSink
Same comment on absolute imports; I believe they increase readability.
Creates the edx-platform plugin plumbing, adds some new requirements, maps the appropriate Django Signal to push course structure to ClickHouse.
In order to connect the nodes in a dump, where there may be many dumps per course, these columns are necessary to find the dump that corresponds most closely to an event or set of events.
I think it's important to merge #2 first for quality.
The requests themselves have been moved into the base class to consolidate error handling, and the CSV / Request generation moved into their own methods.
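The CSV-generation step split out of the sinks could look roughly like this. `serialize_rows` is a hypothetical name; the quoting mirrors the `csv.QUOTE_NONNUMERIC` usage visible in the diff above.

```python
import csv
import io

# Rough sketch of the CSV-generation step factored into its own method.
def serialize_rows(rows):
    """Serialize dict rows into the CSV body sent to ClickHouse."""
    output = io.StringIO()
    writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)
    for row in rows:
        writer.writerow(row.values())
    return output.getvalue()
```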
@mariajgrimaldi @Ian2012 I think I've addressed all of the PR feedback so far. Please re-review when you get a chance!
@bmtcril Nope, I was thinking that PR had more useful tests, but it doesn't.
This is how I tested after setting up my environment:
where I got all courses' blocks. Thank you! This looks great :)
OARS consumes the data sent to ClickHouse by this plugin as part of data
enrichment for reporting, or capturing data that otherwise does not fit in
xAPI.

Currently the only sink is in the CMS. It listens for the ``COURSE_PUBLISHED``
What does sink mean in this context? [curious]
It's just a message receiver that, in the context of this code, is just saving the data elsewhere. It's not performing any meaningful work or operating in the transactional environment of the service.
'common': {PluginSettings.RELATIVE_PATH: 'settings.common'},
    }
},
# Configuration setting for Plugin Signals for this app.
I think we can remove these inline comments since the configurations are pretty self-explanatory
    auth=self.ch_auth
)

self._send_clickhouse_request(request)
Is there a reason why we create the request outside _send_clickhouse_request?
I was thinking it would give us more flexibility in the future, for instance if we needed PUT requests or wanted to use params to send data instead of putting it in the body, but I honestly didn't give it a ton of thought. I'd like to see what happens with the next couple of PRs here and decide on this and the testing question once there are additional use cases.
@responses.activate(registry=OrderedRegistry)  # pylint: disable=unexpected-keyword-arg,no-value-for-parameter
@patch("event_sink_clickhouse.sinks.course_published.CoursePublishedSink._get_detached_xblock_types")
@patch("event_sink_clickhouse.sinks.course_published.CoursePublishedSink._get_modulestore")
def test_course_publish_success(mock_modulestore, mock_detached, caplog):
Can we use the same pattern as in the test_django_settings file? i.e., create a test suite class?
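A minimal sketch of that refactor, with shared patches moved into setUp. `FakeSink` stands in for `CoursePublishedSink` so the example is self-contained; the real tests would patch the dotted paths shown in the decorators above.

```python
from unittest import TestCase
from unittest.mock import MagicMock, patch

class FakeSink:
    """Stand-in for CoursePublishedSink so this sketch runs on its own."""
    @staticmethod
    def _get_modulestore():
        raise RuntimeError("real modulestore is unavailable in tests")

class TestCoursePublished(TestCase):
    """Test suite class: shared patches live in setUp instead of decorators."""
    def setUp(self):
        patcher = patch.object(FakeSink, "_get_modulestore", MagicMock(return_value="store"))
        self.mock_modulestore = patcher.start()
        self.addCleanup(patcher.stop)

    def test_course_publish_success(self):
        self.assertEqual(FakeSink._get_modulestore(), "store")
```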
I left a few more comments but either way, I'm good with what you decide for this version :)
I'm going to merge as-is, since another PR will be following soon. We can refactor as we get further into the project. I'm also ignoring the coverage error since it's almost all imports that we can't actually test without doing a bunch of useless mocking.
PR is ready for review. This has worked for me in a Tutor nightly build using the accompanying branch of the OARS plugin: openedx/tutor-contrib-aspects#35
Testing:
- tutor config save (to get the ClickHouse config)
- tutor images build openedx --no-cache (to get the plugin installed)
- tutor local do init -l oars (to get the new database and tables created)
- tutor local start
- Check event_sink.course_blocks and event_sink.course_relationships
Known issue: The data is not versioned, I will need to update both tables to have a unique id of some variety so we can tell which versions go with which events.
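One way to address the versioning gap, consistent with the dump_id and dump_timestamp arguments that later appear in serialize_item, would be to stamp every row of a dump with a shared id and timestamp. The helper name below is hypothetical.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical sketch: a unique id and timestamp shared by all rows of one
# dump, so course_blocks and course_relationships rows can be matched later.
def make_dump_metadata():
    """Return a (dump_id, dump_timestamp) pair for a single course dump."""
    return str(uuid.uuid4()), datetime.now(timezone.utc).isoformat()
```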
Merge checklist:
Check off if complete or not applicable: