Add Course Published event listener and plugin plumbing #1
Conversation
response.raise_for_status()

# Just overwriting the previous query
params["query"] = f"INSERT INTO {self.ch_database}.course_relationships FORMAT CSV"
Can course_relationships be a setting?
The name of the table? I considered it, but as the "owner" of the table I figured it was OK to hard code here. We already have a variable for it in the OARS plugin, though, so it probably makes sense to configure it all the same way. That said, it's going to get really complicated with the Superset queries and charts.
I think it will simplify a lot of things to keep these hard coded and owned by this plugin. Since the database can be specified, there shouldn't be any namespacing issues with other tables.
@shared_task
@set_code_owner_attribute
def dump_course_to_clickhouse(course_key_string, connection_overrides=None):
If it's not too expensive to run a full courses dump, can the course_key be optional and dump all courses?
The next task for this repo is to add a management command like the coursegraph one, which will handle this use case. It checks when each course was last dumped and kicks off a Celery task per course, so this shouldn't need to change.
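As a rough illustration of that flow, the selection step could look like the sketch below. This is a hypothetical model of the coursegraph-style freshness check, not code from this PR; the function name and signature are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: skip courses dumped recently and return the rest,
# each of which would get its own dump_course_to_clickhouse Celery task.
def courses_needing_dump(course_keys, last_dump_times, max_age=timedelta(days=1)):
    """Return course keys whose last dump is missing or older than max_age."""
    now = datetime.now(timezone.utc)
    return [
        key for key in course_keys
        if last_dump_times.get(key) is None or now - last_dump_times[key] > max_age
    ]
```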
self.log.info(response.headers)
self.log.info(response)
self.log.info(response.text)
Are those necessary?
They're vital for debugging issues; I'll put them in a try block.
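One way to do that is the sketch below: keep the noisy response details out of the happy path and only log them when the request fails. `check_response` is a hypothetical helper name; the response argument is any object with the requests.Response interface.

```python
import logging

log = logging.getLogger(__name__)

def check_response(response, log=log):
    """Raise for HTTP errors, logging the response details first (sketch)."""
    try:
        response.raise_for_status()
    except Exception:
        # Only surface the debug output when something actually went wrong.
        log.error(response.headers)
        log.error(response.text)
        raise
    return response
```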
self.log.info(response.headers)
self.log.info(response)
self.log.info(response.text)
Same comment as before
I haven't tested this (I'll do it today 😅), but I read through the code and left some style comments! Let me know what you think.
event_sink_clickhouse/apps.py
Outdated
'cms.djangoapp': {
    'production': {PluginSettings.RELATIVE_PATH: 'settings.production'},
    'common': {PluginSettings.RELATIVE_PATH: 'settings.common'},
    'devstack': {PluginSettings.RELATIVE_PATH: 'settings.devstack'},
Can we use development instead of devstack? Since we're using tutor now.
Those files don't actually exist, and as far as I know we don't need them, so I'm going to remove these.
"""
super().ready()

from . import tasks  # pylint: disable=import-outside-toplevel, unused-import
Can we use absolute imports across the project?
self.ch_auth = (connection_overrides.get("username", self.ch_auth[0]),
                connection_overrides.get("password", self.ch_auth[1]))
Can we use named tuples? I think it'll read better than accessing indexes 0 and 1.
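For example, the credentials pair could be a named tuple roughly like this. `ClickHouseAuth` and `apply_overrides` are hypothetical names, not from the PR; since a named tuple is still a tuple, it would remain compatible with `requests`' `auth=` parameter.

```python
from collections import namedtuple

# Hypothetical replacement for the bare (username, password) tuple.
ClickHouseAuth = namedtuple("ClickHouseAuth", ["username", "password"])

def apply_overrides(auth, overrides):
    """Return a new auth with any overridden credentials applied."""
    return ClickHouseAuth(
        username=overrides.get("username", auth.username),
        password=overrides.get("password", auth.password),
    )
```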
course_key = item.scope_ids.usage_id.course_key
block_type = item.scope_ids.block_type

rtn_fields = {
What is rtn? Can we be more precise?
items = modulestore.get_items(course_id)

# create nodes
i = 0
Can we be more precise with the variable names?
location_to_node = {}
items = modulestore.get_items(course_id)

# create nodes
Can we be more verbose in these inline comments? If the implementation is complex and needs some explanation, then let's be a bit more precise
fields = self.serialize_item(item, i, detached_xblock_types, dump_id, dump_timestamp)
location_to_node[self.strip_branch_and_version(item.location)] = fields

# create relationships
Same comment about the inline comments here
Base class for ClickHouse event sink, allows overwriting of default settings
"""
def __init__(self, connection_overrides, log):
    self.log = log
Why do we need the log to be part of the class?
The next PR will add a management command that calls into here, so I'm using the pattern established in Coursegraph: the log is passed in so output can go to the Celery log or the normal IDA log depending on how it's run.
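That injection pattern looks roughly like the sketch below. The class body and method names here are illustrative, not the PR's exact code; the point is that each caller supplies its own logger.

```python
import logging

# Illustrative sketch: the caller supplies the logger, so a Celery task
# and a management command can share one sink while writing to their own logs.
class BaseSink:
    def __init__(self, connection_overrides, log):
        self.log = log
        self.connection_overrides = connection_overrides or {}

    def dump(self, course_key):
        self.log.info("Dumping %s to ClickHouse", course_key)

# A task passes its Celery logger; a command passes the IDA logger.
celery_sink = BaseSink(None, logging.getLogger("celery.task"))
command_sink = BaseSink(None, logging.getLogger("cms.management"))
```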
response = requests.post(self.ch_url, data=output.getvalue(), params=params, auth=self.ch_auth,
                         timeout=self.ch_timeout_secs)
I think something like this would look better:

response = requests.post(
    self.ch_url,
    data=output.getvalue(),
    params=params,
    auth=self.ch_auth,
    timeout=self.ch_timeout_secs,
)
params = {
    # Fail early on bulk inserts
    "input_format_allow_errors_num": 1,
    "input_format_allow_errors_ratio": 0.1,
}

# "query" is a special param for the query, it's the best way to get the FORMAT CSV in there.
params["query"] = f"INSERT INTO {self.ch_database}.course_blocks FORMAT CSV"

output = io.StringIO()
writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)

for node in nodes:
    writer.writerow(node.values())
Same comment here about variables
from xmodule.modulestore.store_utilities import DETACHED_XBLOCK_TYPES
return DETACHED_XBLOCK_TYPES
This looks good! But I'm worried about compatibility issues with older releases. Do these imports work across earlier releases?
DETACHED_XBLOCK_TYPES has been in that location for 8 years, so it should be good 👍
Same for modulestore and courseoverview I guess.
import requests
from django.utils import timezone

from .base_sink import BaseSink
Same comment on absolute imports; I believe they increase readability.
Creates the edx-platform plugin plumbing, adds some new requirements, maps the appropriate Django Signal to push course structure to ClickHouse.
In order to connect the nodes in a dump, where there may be many dumps per course, these columns are necessary to find the dump that corresponds most closely to an event or set of events.
I think it's important to merge #2 first for quality.
The requests themselves have been moved into the base class to consolidate error handling, and the CSV / Request generation moved into their own methods.
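The CSV-generation step split out of the sinks could look roughly like this. `serialize_rows` is a hypothetical name; the quoting mirrors the `csv.QUOTE_NONNUMERIC` usage visible in the diff above.

```python
import csv
import io

# Rough sketch of the CSV-generation step factored into its own method.
def serialize_rows(rows):
    """Serialize dict rows into the CSV body sent to ClickHouse."""
    output = io.StringIO()
    writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)
    for row in rows:
        writer.writerow(row.values())
    return output.getvalue()
```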
@mariajgrimaldi @Ian2012 I think I've addressed all of the PR feedback so far. Please re-review when you get a chance!
@bmtcril Nope, I was thinking that PR had more useful tests, but it doesn't.
This is how I tested after setting up my environment:
where I got all courses' blocks. Thank you! This looks great :)
OARS consumes the data sent to ClickHouse by this plugin as part of data
enrichment for reporting, or capturing data that otherwise does not fit in
xAPI.

Currently the only sink is in the CMS. It listens for the ``COURSE_PUBLISHED``
What does sink mean in this context? [curious]
It's just a message receiver that, in the context of this code, is just saving the data elsewhere. It's not performing any meaningful work or operating in the transactional environment of the service.
'common': {PluginSettings.RELATIVE_PATH: 'settings.common'},
    }
},
# Configuration setting for Plugin Signals for this app.
I think we can remove these inline comments since the configurations are pretty self-explanatory
    auth=self.ch_auth
)

self._send_clickhouse_request(request)
Is there a reason why we create the request outside _send_clickhouse_request?
I was thinking it would give us more flexibility in the future, for instance if we needed PUT requests or wanted to use params to send data instead of putting it in the body, but I honestly didn't give it a ton of thought. I'd like to see what happens with the next couple of PRs here and decide on this and the testing question once there are additional use cases.
@responses.activate(registry=OrderedRegistry)  # pylint: disable=unexpected-keyword-arg,no-value-for-parameter
@patch("event_sink_clickhouse.sinks.course_published.CoursePublishedSink._get_detached_xblock_types")
@patch("event_sink_clickhouse.sinks.course_published.CoursePublishedSink._get_modulestore")
def test_course_publish_success(mock_modulestore, mock_detached, caplog):
Can we use the same pattern as in the test_django_settings file? i.e., create a test suite class?
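A minimal sketch of that refactor, with shared patches moved into setUp. `FakeSink` stands in for `CoursePublishedSink` so the example is self-contained; the real tests would patch the dotted paths shown in the decorators above.

```python
from unittest import TestCase
from unittest.mock import MagicMock, patch

class FakeSink:
    """Stand-in for CoursePublishedSink so this sketch runs on its own."""
    @staticmethod
    def _get_modulestore():
        raise RuntimeError("real modulestore is unavailable in tests")

class TestCoursePublished(TestCase):
    """Test suite class: shared patches live in setUp instead of decorators."""
    def setUp(self):
        patcher = patch.object(FakeSink, "_get_modulestore", MagicMock(return_value="store"))
        self.mock_modulestore = patcher.start()
        self.addCleanup(patcher.stop)

    def test_course_publish_success(self):
        self.assertEqual(FakeSink._get_modulestore(), "store")
```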
I left a few more comments but either way, I'm good with what you decide for this version :)
I'm going to merge as-is, since another PR will be following soon. We can refactor as we get further into the project. I'm also ignoring the coverage error since it's almost all imports that we can't actually test without doing a bunch of useless mocking.
PR is ready for review. This has worked for me in a Tutor nightly build using the accompanying branch of the OARS plugin: openedx/tutor-contrib-aspects#35
Testing:
- tutor config save (to get the ClickHouse config)
- tutor images build openedx --no-cache (to get the plugin installed)
- tutor local do init -l oars (to get the new database and tables created)
- tutor local start
- Check event_sink.course_blocks and event_sink.course_relationships
Known issue: The data is not versioned, I will need to update both tables to have a unique id of some variety so we can tell which versions go with which events.
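One way to address the versioning gap, consistent with the dump_id and dump_timestamp arguments that later appear in serialize_item, would be to stamp every row of a dump with a shared id and timestamp. The helper name below is hypothetical.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical sketch: a unique id and timestamp shared by all rows of one
# dump, so course_blocks and course_relationships rows can be matched later.
def make_dump_metadata():
    """Return a (dump_id, dump_timestamp) pair for a single course dump."""
    return str(uuid.uuid4()), datetime.now(timezone.utc).isoformat()
```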
Merge checklist:
Check off if complete or not applicable: