Skip to content
This repository has been archived by the owner on Apr 3, 2024. It is now read-only.

Add Course Published event listener and plugin plumbing #1

Merged
merged 7 commits into from
May 5, 2023

Conversation

bmtcril
Copy link
Contributor

@bmtcril bmtcril commented Apr 24, 2023

PR is ready for review. This has worked for me in a Tutor nightly build using the accompanying branch of the OARS plugin: openedx/tutor-contrib-aspects#35

Testing:

  • (in an otherwise working OARS env)
  • Install the OARS plugin branch above
  • tutor config save (to get the ClickHouse config)
  • tutor images build openedx --no-cache (to get the plugin installed)
  • tutor local do init -l oars (to get the new database and tables created)
  • tutor local start
  • Make sure you have the demo course loaded
  • Edit a text block in the course
  • You should see course data in ClickHouse's event_sink.course_blocks and event_sink.course_relationships
  • If there are errors they will likely show up in the cms-worker container

Known issue: The data is not versioned, I will need to update both tables to have a unique id of some variety so we can tell which versions go with which events.

Merge checklist:
Check off if complete or not applicable:

  • Version bumped
  • Changelog record added
  • Documentation updated (not only docstrings)
  • Fixup commits are squashed away
  • Unit tests added/updated
  • Manual testing instructions provided
  • Noted any: Concerns, dependencies, migration issues, deadlines, tickets

@bmtcril bmtcril changed the title Bmtcril/add event listener Add Course Published event listener and plugin plumbing Apr 24, 2023
@bmtcril bmtcril force-pushed the bmtcril/add_event_listener branch from f898e07 to 0d07506 Compare April 24, 2023 20:28
@bmtcril bmtcril force-pushed the bmtcril/add_event_listener branch from 986729b to 089b2d0 Compare May 2, 2023 22:59
@bmtcril bmtcril marked this pull request as ready for review May 2, 2023 22:59
@bmtcril bmtcril requested a review from pomegranited May 2, 2023 23:03
response.raise_for_status()

# Just overwriting the previous query
params["query"] = f"INSERT INTO {self.ch_database}.course_relationships FORMAT CSV"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can course_relationships be a setting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the table? I thought about it, but thought that as the "owner" of the table it would be ok to hard code here. We already have a variable for it in the OARS plugin, though, so it probably makes sense to make it all configured the same. It's going to get really complicated with the superset queries and charts, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will simplify a lot of things to keep these hard coded and owned by this plugin. Since the database can be specified there shouldn't be any namespacing issues with other tables.


@shared_task
@set_code_owner_attribute
def dump_course_to_clickhouse(course_key_string, connection_overrides=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not too expensive to run a full courses dump, can the course_key be optional and dump all courses?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next task for this repo is to add in a management command like the coursegraph one which will handle this use case. It does some additional checking of when the last time a course was dumped and kicks off a celery task for each course, so this shouldn't need to change.

@bmtcril bmtcril force-pushed the bmtcril/add_event_listener branch from 90f5a3d to d936dd3 Compare May 3, 2023 15:18
Comment on lines 206 to 208
self.log.info(response.headers)
self.log.info(response)
self.log.info(response.text)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are those necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're vital for debugging issues, I'll put them in a try.

Comment on lines 221 to 223
self.log.info(response.headers)
self.log.info(response)
self.log.info(response.text)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before

Copy link

@mariajgrimaldi mariajgrimaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested this -I'll do it today 😅- but I read through the code and left some style comments! Let me know what you think

'cms.djangoapp': {
'production': {PluginSettings.RELATIVE_PATH: 'settings.production'},
'common': {PluginSettings.RELATIVE_PATH: 'settings.common'},
'devstack': {PluginSettings.RELATIVE_PATH: 'settings.devstack'},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use development instead of devstack? Since we're using tutor now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those files don't actually exist, and as far as I know we don't need them so I'm going to remove these

"""
super().ready()

from . import tasks # pylint: disable=import-outside-toplevel, unused-import

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use absolute imports across the project?

Comment on lines 23 to 24
self.ch_auth = (connection_overrides.get("username", self.ch_auth[0]),
connection_overrides.get("password", self.ch_auth[1]))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use named tuples? I think it'll read better than accessing 0 and 1 indexes

course_key = item.scope_ids.usage_id.course_key
block_type = item.scope_ids.block_type

rtn_fields = {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is rtn? can we be more precise?

items = modulestore.get_items(course_id)

# create nodes
i = 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be more precise with the variable names?

location_to_node = {}
items = modulestore.get_items(course_id)

# create nodes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more verbose in these inline comments? If the implementation is complex and needs some explanation, then let's be a bit more precise

fields = self.serialize_item(item, i, detached_xblock_types, dump_id, dump_timestamp)
location_to_node[self.strip_branch_and_version(item.location)] = fields

# create relationships

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the inline comments here

Base class for ClickHouse event sink, allows overwriting of default settings
"""
def __init__(self, connection_overrides, log):
self.log = log

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the log to be part of the class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next PR will be to add a management command that will call into here, so I'm using the pattern established in Coursegraph that passes in the log so that it can go to the celery log or normal IDA log based on how it's being run.

Comment on lines 204 to 205
response = requests.post(self.ch_url, data=output.getvalue(), params=params, auth=self.ch_auth,
timeout=self.ch_timeout_secs)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something like this would look better:

Suggested change
response = requests.post(self.ch_url, data=output.getvalue(), params=params, auth=self.ch_auth,
timeout=self.ch_timeout_secs)
response = requests.post(
self.ch_url,
data=output.getvalue(),
params=params,
auth=self.ch_auth,
timeout=self.ch_timeout_secs,
)

Comment on lines 189 to 202
params = {
# Fail early on bulk inserts
"input_format_allow_errors_num": 1,
"input_format_allow_errors_ratio": 0.1,
}

# "query" is a special param for the query, it's the best way to get the FORMAT CSV in there.
params["query"] = f"INSERT INTO {self.ch_database}.course_blocks FORMAT CSV"

output = io.StringIO()
writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)

for node in nodes:
writer.writerow(node.values())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about variables

Comment on lines +34 to +41
from xmodule.modulestore.store_utilities import DETACHED_XBLOCK_TYPES
return DETACHED_XBLOCK_TYPES

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! But I'm worried about compatibility issues with older releases. Do these imports work across the later releases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DETACHED_XBLOCK_TYPES has been in that location for 8 yrs so it should be good 👍

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for modulestore and courseoverview I guess.

import requests
from django.utils import timezone

from .base_sink import BaseSink

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment on absolute imports, I believe they increase readability

bmtcril and others added 6 commits May 4, 2023 10:35
Creates the edx-platform plugin plumbing, adds some new requirements, maps the appropriate Django Signal to push course structure to ClickHouse.
In order to connect the nodes in a dump, where there may be many dumps per course, these columns are necessary to find the dump that corresponds most closely to an event or set of events.
@Ian2012
Copy link
Contributor

Ian2012 commented May 4, 2023

I think it's important to merge first #2 for quality

The requests themselves have been moved into the base class to consolidate error handling, and the CSV / Request generation moved into their own methods.
@bmtcril bmtcril force-pushed the bmtcril/add_event_listener branch from d936dd3 to 61a1c6e Compare May 4, 2023 18:20
@bmtcril
Copy link
Contributor Author

bmtcril commented May 4, 2023

@mariajgrimaldi @Ian2012 I think I've addressed all of the PR feedback so far. Please re-review when you get a chance!

@bmtcril
Copy link
Contributor Author

bmtcril commented May 4, 2023

@Ian2012 I don't see anything in #2 that's relevant to this work, but let me know if I'm missing something

@Ian2012
Copy link
Contributor

Ian2012 commented May 4, 2023

@bmtcril nop, I was thinking that in that PR were more useful tests, but nop

@mariajgrimaldi
Copy link

mariajgrimaldi commented May 4, 2023

This is how I tested after setting up my environment:

  1. I imported the demo course
  2. I changed the text of a course, then checked clickhouse:
tutor dev exec clickhouse bash
root@3ecc1a77f108:/# clickhouse-client
use event_sink;
select * from course_blocks;

where I got all courses' blocks.
4. Then, I changed the name of a section, then published the course again. I checked the blocks Display name, and it was up to date.

Thank you! This looks great :)


OARS consumes the data sent to ClickHouse by this plugin as part of data
enrichment for reporting, or capturing data that otherwise does not fit in
xAPI.

Currently the only sink is in the CMS. It listens for the ``COURSE_PUBLISHED``

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does sink mean in this context? [curious]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a message receiver that, in the context of this code, is just saving the data elsewhere. It's not performing any meaningful work or operating in the transactional environment of the service.

'common': {PluginSettings.RELATIVE_PATH: 'settings.common'},
}
},
# Configuration setting for Plugin Signals for this app.
Copy link

@mariajgrimaldi mariajgrimaldi May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove these inline comments since the configurations are pretty self-explanatory

auth=self.ch_auth
)

self._send_clickhouse_request(request)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason why do we create the request outside _send_clickhouse_request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it would give us more flexibility in the future, for instance if we needed PUT requests or wanted to use params to send data instead of putting it in the body, but I honestly didn't give it a ton of thought. I'd like to see what happens with the next couple of PRs here and decide on this and the testing question once there are additional use cases.

@responses.activate(registry=OrderedRegistry) # pylint: disable=unexpected-keyword-arg,no-value-for-parameter
@patch("event_sink_clickhouse.sinks.course_published.CoursePublishedSink._get_detached_xblock_types")
@patch("event_sink_clickhouse.sinks.course_published.CoursePublishedSink._get_modulestore")
def test_course_publish_success(mock_modulestore, mock_detached, caplog):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the same pattern as in test_django_settings file? ie, create a test suite class

Copy link

@mariajgrimaldi mariajgrimaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more comments but either way, I'm good with what you decide for this version :)

@bmtcril
Copy link
Contributor Author

bmtcril commented May 5, 2023

I'm going to merge as-is, since another PR will be following soon. We can refactor as we get further into the project. I'm also ignoring the coverage error since it's almost all imports that we can't actually test without doing a bunch of useless mocking.

@bmtcril bmtcril merged commit 8a9d84e into main May 5, 2023
@bmtcril bmtcril deleted the bmtcril/add_event_listener branch May 5, 2023 14:04
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants