Optimize block date retrieval for course outlines. #99

Merged
5 changes: 5 additions & 0 deletions CHANGELOG.rst
@@ -14,6 +14,11 @@ Change Log
Unreleased
~~~~~~~~~~

[2.2.0] - 2021-08-27
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Added an optimization when requesting course block dates for an outline, where dates for blocks below the subsection level are unneeded.
* Use the course's current published version to improve the cache key, along with using TieredCache to cache date data.

[2.1.0] - 2021-07-23
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Added Django 3.2 Support
4 changes: 2 additions & 2 deletions edx_when/__init__.py
@@ -1,7 +1,7 @@
"""
- Your project description goes here.
+ Central source of course block dates for the LMS.
> **Contributor:** :chefkiss:
"""

- __version__ = '2.1.0'
+ __version__ = '2.2.0'

default_app_config = 'edx_when.apps.EdxWhenConfig' # pylint: disable=invalid-name
179 changes: 119 additions & 60 deletions edx_when/api.py
@@ -5,11 +5,10 @@
import logging
from datetime import timedelta

- from django.core.cache import cache
from django.core.exceptions import ValidationError
from django.db import transaction
from django.db.models import DateTimeField, ExpressionWrapper, F, ObjectDoesNotExist, Q
- from edx_django_utils.cache.utils import DEFAULT_REQUEST_CACHE
+ from edx_django_utils.cache.utils import TieredCache
from opaque_keys import InvalidKeyError
from opaque_keys.edx.keys import CourseKey, UsageKey

@@ -26,15 +25,28 @@
FIELDS_TO_EXTRACT = ('due', 'start', 'end')


- def _content_dates_cache_key(course_key, query_dict):
-     """Memcached key for ContentDates given course_key and filter args."""
+ def _content_dates_cache_key(course_key, query_dict, subsection_and_higher_only, published_version):
> **Contributor:** Nit: I don't love the name subsection_and_higher_only. It's so specific to a use case, but I don't have any better ideas. I do appreciate the change from outline_only, as I know we discussed that before and that it bled business logic.
>
> I feel like ideally, this feature would be able to just specify different block types and have those returned, and then the cache key would reflect those. I'm not sure how feasible/realistic that is, so this is all just a nit and me thinking about different ideas. But being able to say "get me only units" or "sections and subsections" and pass in ['vertical'] or ['chapter', 'sequential'] would reduce any specificity of this feature. And enable the outline to still say, give me ['course', 'chapter', 'sequential']. Or maybe even ['course', 'sequential'] (I'm not sure we even use the section dates on the outline).

> **Contributor Author:** I'm not in 😍 with the name subsection_and_higher_only either - but it's the best I could think of and it's at least explicit without mentioning outlines. I agree with your proposed ideal solution as well. But I pulled back from that solution (passing in a list of block types) to avoid the extra scope & testing - and KISS/YAGNI.

> **Contributor:** Yeah, that's fair. Thank you!

"""
Memcached key for ContentDates given course_key, filter args, subsection and higher blocks, and published version.

Adding the course's published version makes cache invalidation unnecessary,
as setting new course block dates will always be a new course version.
"""
query_dict_str = ".".join(
sorted(
"{},{}".format(key, value or "")
for key, value in query_dict.items()
)
)
return f"edx-when.content_dates:{course_key}:{query_dict_str}"
subsection_and_higher_only_str = ''
if subsection_and_higher_only:
subsection_and_higher_only_str = 'subsection_and_higher_only'
published_version_str = ''
if published_version:
published_version_str = published_version
> **Contributor:** Is published_version supposed to be a str at this point, or still the BSON obj?

> **Contributor Author:** In theory, published_version would be a str at this point. But since I still haven't gotten my hands on an actual published_version, I'm not yet sure what type it will be.
>
> I'm trying to get hold of a published_version here:
> https://github.com/edx/edx-platform/blob/7e146e04879944c956ee89fb5ae126ae0e3371c7/openedx/core/djangoapps/course_date_signals/handlers.py#L130
> And haven't had any luck yet...

> **Contributor Author:** It's a str at this point - verified it.


return f'edx-when.content_dates:{course_key}:{query_dict_str}:'\
f'{subsection_and_higher_only_str}:{published_version_str}'
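The invalidation-by-versioning idea in the docstring can be illustrated in miniature. This is a standalone sketch with a plain dict standing in for the shared cache backend; the course/version strings and the key format here are invented for the example:

```python
# Minimal sketch of version-keyed caching: because the published version is part
# of the key, a re-publish makes readers ask for a brand-new key, so stale
# entries are never served and never need explicit deletion.
cache = {}  # stands in for memcached / a shared cache tier

def make_key(course_key, published_version):
    # Empty string keeps the key stable when no version is available.
    return f"content_dates:{course_key}:{published_version or ''}"

def get_dates(course_key, published_version, compute):
    key = make_key(course_key, published_version)
    if key not in cache:
        cache[key] = compute()  # only computed on a cache miss
    return cache[key]

course = "course-v1:edX+Demo+2021"
v1 = get_dates(course, "abc123", lambda: {"due": "2021-09-01"})
# Same version: served from cache, so the (wrong) recompute never runs.
assert get_dates(course, "abc123", lambda: {"due": "WRONG"}) == v1
# New published version: new key, fresh computation, no invalidation needed.
v2 = get_dates(course, "def456", lambda: {"due": "2021-10-01"})
assert v2 == {"due": "2021-10-01"}
```

The old entry simply ages out of the cache backend instead of being deleted.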


def _ensure_key(key_class, key_obj):
@@ -68,7 +80,7 @@ def set_dates_for_course(course_key, items):
"""
Set dates for blocks.

- items is an iterator of (location, field metadata dictionary)
+ items: iterator of (location, field metadata dictionary)
"""
with transaction.atomic():
active_date_ids = []
@@ -79,16 +91,22 @@ def set_dates_for_course(course_key, items):
val = fields[field]
if val:
log.info('Setting date for %r, %s, %r', location, field, val)
- active_date_ids.append(set_date_for_block(course_key, location, field, val))
+ active_date_ids.append(
+     set_date_for_block(course_key, location, field, val)
+ )

# Now clear out old dates that we didn't touch
- clear_dates_for_course(course_key, keep=active_date_ids)
+ _clear_dates_for_course(course_key, active_date_ids)


- def clear_dates_for_course(course_key, keep=None):
+ def _clear_dates_for_course(course_key, keep=None):
"""
Set all dates to inactive.

This method changes currently-cached rows - but the cache key contains the course's published version.
And the calling method is *only* invoked upon course publish, which guarantees that a new course published
version will be used to get course dates after this call is made, invalidating existing cache entries.
> **Contributor** (on lines +106 to +108)**:** Thank you for this comment. I'm sure it will be a life-saver in the future.


Arguments:
course_key: either a CourseKey or string representation of same
keep: an iterable of ContentDate ids to keep active
@@ -99,10 +117,6 @@ def clear_dates_for_course(course_key, keep=None):
dates = dates.exclude(id__in=keep)
dates.update(active=False)

- for query_dict in [{}, {'policy__rel_date': None}]:
-     cache_key = _content_dates_cache_key(course_key, query_dict)
-     cache.delete(cache_key)


def _get_end_dates_from_content_dates(qset):
"""
@@ -122,9 +136,33 @@ def _get_end_dates_from_content_dates(qset):
return end_datetime, cutoff_datetime


def _processed_results_cache_key(
course_id, user_id, schedule, allow_relative_dates,
subsection_and_higher_only, published_version
):
"""
Construct the cache key, incorporating all parameters which would cause a different query set to be returned.
"""
cache_key = 'course_dates.%s' % course_id
if user_id:
cache_key += '.%s' % user_id
if schedule:
cache_key += '.schedule-%s' % schedule.start_date
if allow_relative_dates:
cache_key += '.with-rel-dates'
if subsection_and_higher_only:
cache_key += '.subsection_and_higher_only'
cache_key += '.%s' % published_version if published_version else ''
return cache_key
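As a worked example, the key construction above can be exercised standalone. The Schedule namedtuple below is a minimal stub assumed to expose only a start_date attribute, and all input values are invented:

```python
from collections import namedtuple

# Hypothetical stand-in for the real enrollment Schedule model.
Schedule = namedtuple("Schedule", ["start_date"])

def processed_results_cache_key(course_id, user_id, schedule, allow_relative_dates,
                                subsection_and_higher_only, published_version):
    # Mirrors _processed_results_cache_key: every parameter that changes the
    # returned dates must appear in the key, or callers would share results.
    cache_key = 'course_dates.%s' % course_id
    if user_id:
        cache_key += '.%s' % user_id
    if schedule:
        cache_key += '.schedule-%s' % schedule.start_date
    if allow_relative_dates:
        cache_key += '.with-rel-dates'
    if subsection_and_higher_only:
        cache_key += '.subsection_and_higher_only'
    cache_key += '.%s' % published_version if published_version else ''
    return cache_key

key = processed_results_cache_key(
    "course-v1:edX+Demo+2021", 42, Schedule("2021-08-27"), True, True, "abc123",
)
assert key == ("course_dates.course-v1:edX+Demo+2021.42.schedule-2021-08-27"
               ".with-rel-dates.subsection_and_higher_only.abc123")
# With every optional part absent, only the course id remains.
assert processed_results_cache_key("some-course", None, None, False, False, None) \
    == "course_dates.some-course"
```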


# TODO: Record dates for every block in the course, not just the ones where the block
# has an explicitly set date.
- def get_dates_for_course(course_id, user=None, use_cached=True, schedule=None):
+ def get_dates_for_course(
+     course_id,
+     user=None, use_cached=True, schedule=None,
+     subsection_and_higher_only=False, published_version=None
+ ):
"""
Return dictionary of dates for the given course_id and optional user.

@@ -133,29 +171,35 @@ def get_dates_for_course(course_id, user=None, use_cached=True, schedule=None):

Arguments:
course_id: either a CourseKey or string representation of same
- user: None, an int, or a User object
- use_cached: will skip cache lookups (but not saves) if False
- schedule: optional override for a user's enrollment Schedule, used for relative date calculations
+ user: None, an int (user_id), or a User object
+ use_cached: bool (optional) - skips cache lookups (but not saves) if False
+ schedule: Schedule obj (optional) - override for a user's enrollment Schedule, used
+     for relative date calculations
+ subsection_and_higher_only: bool (optional) - only returns dates for blocks at the subsection
+     level and higher (i.e. course, section (chapter), subsection (sequential)).
+ published_version: (optional) string representing the ID of the course's published version
"""
course_id = _ensure_key(CourseKey, course_id)
log.debug("Getting dates for %s as %s", course_id, user)
allow_relative_dates = _are_relative_dates_enabled(course_id)

- cache_key = 'course_dates.%s' % course_id
+ user_id = None
  if user:
      if isinstance(user, int):
          user_id = user
      else:
          user_id = user.id if not user.is_anonymous else ''
-         cache_key += '.%s' % user_id
- else:
-     user_id = None
- if schedule:
-     cache_key += '.schedule-%s' % schedule.start_date
- if allow_relative_dates:
-     cache_key += '.with-rel-dates'

- dates = DEFAULT_REQUEST_CACHE.data.get(cache_key, None)
# Construct the cache key, incorporating all parameters which would cause a different
# query set to be returned.
processed_results_cache_key = _processed_results_cache_key(
course_id, user_id, schedule, allow_relative_dates, subsection_and_higher_only, published_version
)
> **Contributor Author:** This is the cache key for the queried and processed results.


dates = None
cached_response = TieredCache.get_cached_response(processed_results_cache_key)
if cached_response.is_found:
dates = cached_response.value

if use_cached and dates is not None:
return dates
@@ -166,19 +210,28 @@ def get_dates_for_course(course_id, user=None, use_cached=True, schedule=None):
# to cache invalidation in clear_dates_for_course. This is only safe to do
# because a) we serialize to cache with pickle; b) we don't write to
# ContentDate in this function; This is not a great long-term solution.
- external_cache_key = _content_dates_cache_key(course_id, rel_lookup)
- qset = cache.get(external_cache_key) if use_cached else None
+ raw_results_cache_key = _content_dates_cache_key(
+     course_id, rel_lookup, subsection_and_higher_only, published_version
+ )
> **Contributor Author:** This is the cache key for the raw, unprocessed query results before the additional processing below.

qset = None
if use_cached:
cached_response = TieredCache.get_cached_response(raw_results_cache_key)
if cached_response.is_found:
qset = cached_response.value
if qset is None:
qset = models.ContentDate.objects.filter(course_id=course_id, active=True, **rel_lookup)
if subsection_and_higher_only:
# Include NULL block_type values as well because of lazy rollout.
qset = qset.filter(Q(block_type__in=('course', 'chapter', 'sequential', None)))
> **Contributor Author:** This filter is the new one that limits the queryset to the block types shown. None is also included because of the lazy filling of the block_type field.

> **Contributor:** Do we have a condition on which we can remove the None? Or just always keep because we won't ever be able to guarantee everything has it filled?

> **Contributor Author:** This PR adds the new column with null=True to avoid DB locking on deployment. And I've chosen to do a lazy load of the column upon course publish. With this strategy, it's possible that some courses will never be published and won't have values for block_type ever - unless we populate the value for all courses using some asynchronous process. Once a course is published, no None values would be expected for block_type.
>
> Summary: With the current plan, just always keep.

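One subtlety worth flagging about the None entry in the __in list: in raw SQL, an IN list never matches NULL rows (NULL IN (...) evaluates to NULL, not true), so whether the lazily-unfilled rows come back depends on how the ORM renders the clause. A quick sqlite3 check of the SQL-level behavior; the table and values here are invented for illustration:

```python
import sqlite3

# Demonstrates why NULL rows usually need an explicit IS NULL clause:
# `x IN (...)` is NULL (treated as false) when x is NULL, even with NULL in the list.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content_date (block_type TEXT)")
conn.executemany(
    "INSERT INTO content_date VALUES (?)",
    [("course",), ("sequential",), (None,), ("vertical",)],
)

in_only = conn.execute(
    "SELECT COUNT(*) FROM content_date "
    "WHERE block_type IN ('course', 'chapter', 'sequential', NULL)"
).fetchone()[0]
with_isnull = conn.execute(
    "SELECT COUNT(*) FROM content_date "
    "WHERE block_type IN ('course', 'chapter', 'sequential') "
    "OR block_type IS NULL"
).fetchone()[0]

assert in_only == 2       # 'course' and 'sequential'; the NULL row is excluded
assert with_isnull == 3   # the explicit IS NULL picks up the NULL row
```

If the ORM emits a literal IN (..., NULL) here, the not-yet-backfilled rows could be silently excluded; Q(block_type__in=types) | Q(block_type__isnull=True) is the form that avoids relying on backend-specific rendering.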

qset = list(
-     models.ContentDate.objects
-     .filter(course_id=course_id, active=True, **rel_lookup)
-     .select_related('policy')
-     .only(
-         "course_id", "policy__rel_date",
-         "policy__abs_date", "location", "field"
-     )
+     qset.select_related('policy')
+     .only(
+         "course_id", "policy__rel_date",
+         "policy__abs_date", "location", "field"
+     )
)
- cache.set(external_cache_key, qset)
+ TieredCache.set_all_tiers(raw_results_cache_key, qset)

dates = {}
policies = {}
@@ -212,12 +265,12 @@
except (ValueError, ObjectDoesNotExist, KeyError):
log.warning("Unable to read date for %s", userdate.content_date, exc_info=True)

- DEFAULT_REQUEST_CACHE.data[cache_key] = dates
+ TieredCache.set_all_tiers(processed_results_cache_key, dates)

return dates
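The overall caching flow of get_dates_for_course — check a request-level tier, then a shared tier, then fall through to the database, with use_cached skipping lookups but not saves — can be sketched with plain dicts. This is a toy model, not the edx_django_utils TieredCache implementation:

```python
# Toy two-tier cache: a per-request dict backed by a shared (memcached-like) dict.
request_tier, shared_tier = {}, {}
db_hits = 0  # counts how often we fall through to the "database"

def query_database(course_id):
    global db_hits
    db_hits += 1
    return {course_id: "2021-09-01"}  # pretend result set

def get_dates(course_id, use_cached=True):
    key = f"course_dates.{course_id}"
    if use_cached:
        # Cheapest tier first; a miss cascades to the next tier.
        for tier in (request_tier, shared_tier):
            if key in tier:
                return tier[key]
    dates = query_database(course_id)
    # Like TieredCache.set_all_tiers: populate every tier on the way out,
    # even when use_cached=False skipped the lookups.
    request_tier[key] = shared_tier[key] = dates
    return dates

first = get_dates("demo")
assert get_dates("demo") == first and db_hits == 1   # second call served from cache
get_dates("demo", use_cached=False)
assert db_hits == 2                                  # lookups skipped, saves still happen
```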


- def get_date_for_block(course_id, block_id, name='due', user=None):
+ def get_date_for_block(course_id, block_id, name='due', user=None, published_version=None):
"""
Return the date for block in the course for the (optional) user.

@@ -226,9 +279,15 @@ def get_date_for_block(course_id, block_id, name='due', user=None):
block_id: either a UsageKey or string representation of same
name (optional): the name of the date field to read
user: None, an int, or a User object
published_version: (optional) string representing the ID of the course's published version
"""
try:
- return get_dates_for_course(course_id, user).get((_ensure_key(UsageKey, block_id), name), None)
+ return get_dates_for_course(
+     course_id, user=user, published_version=published_version
+ ).get(
+     (_ensure_key(UsageKey, block_id), name),
+     None
+ )
except InvalidKeyError:
return None

@@ -296,7 +355,10 @@ def get_overrides_for_user(course_id, user):
yield {'location': udate.content_date.location, 'actual_date': udate.actual_date}


- def set_date_for_block(course_id, block_id, field, date_or_timedelta, user=None, reason='', actor=None):
+ def set_date_for_block(
+     course_id, block_id, field, date_or_timedelta,
+     user=None, reason='', actor=None
+ ):

> **Contributor:** Should this method also be clearing the cache when a new date is set? I know we do it when we clear dates, but wouldn't this be a similar process?

> **Contributor Author:** The assumption I'd made here is that a set_date_for_block would be preceded by a course publish - and therefore a new published_version would be passed in. This new published_version would leave the old cached value behind. Clearing all course dates on blocks might not necessarily be preceded by a course publish.
"""
Save the date for a particular field in a block.

@@ -318,6 +380,14 @@
else:
date_kwargs = {'abs_date': date_or_timedelta}

def _set_content_date_policy(date_kwargs, existing_content_date):
> **Contributor:** Thanks for DRYing it up in here

# Race conditions were creating multiple DatePolicies w/ the same values. Handle that case.
existing_policies = list(models.DatePolicy.objects.filter(**date_kwargs).order_by('id'))
> **Contributor:** Just me wondering, and no obligation at all to test or try or change anything, but I wonder why do the list and [0] vs just using .first() on the query set. Not sure the error it throws if none (maybe none) but maybe an option. Not at all for you to do, just commenting while in my head.

> **Contributor Author:** According to https://docs.djangoproject.com/en/2.2/ref/models/querysets/#first, using first() would work just as well. I simply copied the existing code, removing the never-thrown exception.

if existing_policies:
existing_content_date.policy = existing_policies[0]
else:
existing_content_date.policy = models.DatePolicy.objects.create(**date_kwargs)

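The race-condition handling that _set_content_date_policy encapsulates — when racing writers left several identical DatePolicy rows, deterministically reuse the lowest-id one, otherwise create a new row — can be sketched without the ORM. Policy below is a hypothetical stand-in for DatePolicy:

```python
import itertools
from dataclasses import dataclass

@dataclass
class Policy:
    id: int
    abs_date: str

_new_ids = itertools.count(100)  # pretend auto-increment primary key

def resolve_policy(existing, date_kwargs):
    # Mirrors _set_content_date_policy: filter(**date_kwargs).order_by('id')
    # then take the first match, so every racer converges on the same row.
    matches = sorted(
        (p for p in existing if p.abs_date == date_kwargs["abs_date"]),
        key=lambda p: p.id,
    )
    if matches:
        return matches[0]
    return Policy(id=next(_new_ids), abs_date=date_kwargs["abs_date"])

# Two racing writers already created identical policies (ids 7 and 3).
policies = [Policy(id=7, abs_date="2021-09-01"), Policy(id=3, abs_date="2021-09-01")]

reused = resolve_policy(policies, {"abs_date": "2021-09-01"})
created = resolve_policy(policies, {"abs_date": "2021-12-25"})
assert reused.id == 3      # oldest duplicate wins, deterministically
assert created.id == 100   # no match -> a fresh policy row
```

Picking the lowest id makes the choice stable across processes, so duplicates stop multiplying even before any cleanup runs.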
with transaction.atomic(savepoint=False): # this is frequently called in a loop, let's avoid the savepoints
try:
existing_date = models.ContentDate.objects.select_related('policy').get(
@@ -327,20 +397,13 @@ def set_date_for_block(course_id, block_id, field, date_or_timedelta, user=None,
existing_date.active = True
except models.ContentDate.DoesNotExist as error:
if user:
# A UserDate creation below requires an existing ContentDate.
> **Contributor:** Thank you for these awesome comments! Always good to have some fresh eyes

> **Contributor Author:** It took awhile to figure it out! So thought I'd capture it.

raise MissingDateError(block_id) from error
existing_date = models.ContentDate(course_id=course_id, location=block_id, field=field)

- # We had race-conditions create multiple DatePolicies w/ the same values. Handle that case.
- try:
-     existing_policies = list(models.DatePolicy.objects.filter(**date_kwargs).order_by('id'))
- except models.DatePolicy.DoesNotExist:
-     existing_policies = []
- if existing_policies:
-     existing_date.policy = existing_policies[0]
- else:
-     existing_date.policy = models.DatePolicy.objects.create(**date_kwargs)
+ _set_content_date_policy(date_kwargs, existing_date)
needs_save = True

# Determine if course block date is for a particular user -or- for the course in general.
if user and not user.is_anonymous:
userd = models.UserDate(
user=user,
@@ -363,18 +426,14 @@ def set_date_for_block(course_id, block_id, field, date_or_timedelta, user=None,
existing_date.policy.abs_date or existing_date.policy.rel_date,
date_or_timedelta
)
- # We had race-conditions create multiple DatePolicies w/ the same values. Handle that case.
- try:
-     existing_policies = list(models.DatePolicy.objects.filter(**date_kwargs).order_by('id'))
- except models.DatePolicy.DoesNotExist:
-     existing_policies = []
- if existing_policies:
-     existing_date.policy = existing_policies[0]
- else:
-     existing_date.policy = models.DatePolicy.objects.create(**date_kwargs)

+ _set_content_date_policy(date_kwargs, existing_date)
needs_save = True

# Sync the block_type for the ContentDate, if needed.
if existing_date.block_type != block_id.block_type:
existing_date.block_type = block_id.block_type
needs_save = True
> **Contributor Author:** As the new block_type field is added as NULL, this code will lazily fill the field with block types.


if needs_save:
existing_date.save()
return existing_date.id
22 changes: 22 additions & 0 deletions edx_when/migrations/0008_courseversion_block_type.py
@@ -0,0 +1,22 @@
# Generated by Django 3.2.6 on 2021-09-01 10:45

from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('edx_when', '0007_meta_tweaks'),
]

operations = [
migrations.AddField(
model_name='contentdate',
name='block_type',
field=models.CharField(max_length=255, null=True),
),
migrations.AddIndex(
model_name='contentdate',
index=models.Index(fields=['course_id', 'block_type'], name='edx_when_course_block_type_idx'),
),
]
> **Contributor Author:** Here's the SQL which this migration will run - generated with `./manage.py cms sqlmigrate edx_when 0008`:

```sql
BEGIN;
--
-- Add field block_type to contentdate
--
ALTER TABLE `edx_when_contentdate` ADD COLUMN `block_type` varchar(255) NULL;
--
-- Alter field course_id on contentdate
--
DROP INDEX `edx_when_contentdate_course_id_e6c39fdc` ON `edx_when_contentdate`;
--
-- Alter field location on contentdate
--
DROP INDEX `edx_when_contentdate_location_485206ea` ON `edx_when_contentdate`;
--
-- Create index edx_when_course_block_type_idx on field(s) course_id, block_type of model contentdate
--
CREATE INDEX `edx_when_course_block_type_idx` ON `edx_when_contentdate` (`course_id`, `block_type`);
--
-- Create index edx_when_course_id_idx on field(s) course_id of model contentdate
--
CREATE INDEX `edx_when_course_id_idx` ON `edx_when_contentdate` (`course_id`);
--
-- Create index edx_when_location_idx on field(s) location of model contentdate
--
CREATE INDEX `edx_when_location_idx` ON `edx_when_contentdate` (`location`);
COMMIT;
```

> **Contributor:** Ah whoops. I didn't mean for additional SQL to be executed via the switch to models.Index vs the db_index parameter. Happy to have you keep it, but if you'd rather not keep in these additional changes, totally understand.

> **Contributor Author:** I'm actually going to remove it - just to avoid risk by doing less index creation and dropping for this PR. Thanks!

> **Contributor Author** (@doctoryes, Sep 1, 2021)**:** I've removed the index changes as mentioned - the SQL that will now be executed is below:

```sql
BEGIN;
--
-- Add field block_type to contentdate
--
ALTER TABLE `edx_when_contentdate` ADD COLUMN `block_type` varchar(255) NULL;
--
-- Create index edx_when_course_block_type_idx on field(s) course_id, block_type of model contentdate
--
CREATE INDEX `edx_when_course_block_type_idx` ON `edx_when_contentdate` (`course_id`, `block_type`);
COMMIT;
```

6 changes: 5 additions & 1 deletion edx_when/models.py
@@ -96,18 +96,22 @@ class ContentDate(models.Model):
location = UsageKeyField(null=True, default=None, db_index=True, max_length=255)
field = models.CharField(max_length=255, default='')
active = models.BooleanField(default=True)
block_type = models.CharField(max_length=255, null=True)
> **Contributor:** We probably also want a composite index on (course_id, block_type)?

> **Contributor Author:** Added below.


class Meta:
"""Django Metadata."""

unique_together = ('policy', 'location', 'field')
indexes = [
> **Contributor:** Nit: Add course_id and location here using the Index format instead of db_index on the field definition.

> **Contributor Author:** Interesting - I'll try this and see how it changes the migration file.

models.Index(fields=('course_id', 'block_type'), name='edx_when_course_block_type_idx'),
]

def __str__(self):
"""
Get a string representation of this model instance.
"""
- # Location already holds course id
- return f'{self.location}, {self.field}'
+ return f'ContentDate({self.policy}, {self.location}, {self.field}, {self.block_type})'

def schedule_for_user(self, user):
"""
Empty file added tests/__init__.py
Empty file.