Current Behavior:
At the moment, we store all the raw revision data in our Revisions table and query historical revision data in different parts of the app, particularly for calculating course stats caches. These are among the heaviest queries we currently run.
Desired Behavior:
We want to store stats for each course by time period (for example, week-by-week stats), and only use revision data during the update for a given time period, so that we can remove the Revisions table altogether. This will dramatically reduce the storage requirements of the system and remove one of the major database performance bottlenecks.
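A minimal sketch of the idea, in plain Ruby without ActiveRecord: revisions held in memory are reduced to per-period aggregates, and only the aggregates are kept. The `Revision` struct, the `stats_by_period` helper, and the stat names are hypothetical and only illustrate the shape of the data, not the actual models.

```ruby
# Hypothetical stand-in for a revision fetched during an update and held in RAM.
Revision = Struct.new(:user_id, :timestamp, :characters)

WEEK = 7 * 24 * 60 * 60 # seconds; assumed period length for this example

# Groups revisions into fixed-length periods counted from the course start
# and keeps only the aggregated stats for each period.
def stats_by_period(revisions, course_start, period: WEEK)
  stats = Hash.new { |hash, key| hash[key] = { revision_count: 0, character_sum: 0 } }
  revisions.each do |revision|
    index = ((revision.timestamp - course_start) / period).floor
    stats[index][:revision_count] += 1
    stats[index][:character_sum] += revision.characters
  end
  stats
end

course_start = Time.utc(2024, 1, 1)
revisions = [
  Revision.new(1, Time.utc(2024, 1, 2), 120),
  Revision.new(2, Time.utc(2024, 1, 3), 80),
  Revision.new(1, Time.utc(2024, 1, 10), 50)
]
weekly_stats = stats_by_period(revisions, course_start)
```

Once `weekly_stats` is persisted, the revisions themselves can be discarded, which is what would make the Revisions table unnecessary.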
Tasks:
Non-exhaustive list of things to do:
High priority:
We need this done so that the timeslice world works.
Prepare deployment
Create timeslices models
Add indexes for models
Implement cache updates for ArticleCourseWikiTimeslices and ArticlesCourses.
Implement cache updates for CourseUserWikiTimeslices and CoursesUsers.
Implement cache updates for CourseWikiTimeslices and Courses.
Check that we handle time zones correctly.
Revisit how to calculate ArticlesCourses.new_article because it's not an invariant: if the first revision is removed, then the following revision becomes the first revision (with new_article = true). We should add a new_article field to ArticleCourseTimeslices and calculate ArticlesCourses.new_article based on that.
Refactor the timeslice models to avoid using join fields. For example, use two different fields, course and wiki, instead of a single courses_wikis field.
Replace the logic in ArticlesCourses.update_from_course. Currently, this takes care of removing all the ArticlesCourses that do not correspond to course revisions (due to updates to the course dates or tracked wikis). It also adds new ArticlesCourses. This should be modified to calculate new ArticlesCourses based on revisions in RAM. Changes to course dates or tracked wikis should trigger recalculations or calculations of new timeslices. Note: this should be done when implementing CourseUserUpdater ✔️, CourseWikiUpdater ✔️ and CourseDatesUpdater ✔️ .
Add support for untracking articles.
THIS WON'T BE NECESSARY: Replace the logic in DuplicateArticleDeleter.resolve_duplicates. Instead of removing "limbo revisions", we should try to identify if any timeslice needs to be recalculated.
Replace the logic derived from ArticleStatusManager.update_article_status_for_course(@course). This ends up removing "limbo revisions" through the ModifiedRevisionsManager class. Related to the point mentioned above.
Revisit view counts calculation when TimesliceManager is implemented.
Make UpdateCourseStatsTimeslice work by wiki (so that it is not necessary to have all the revisions in memory).
Revisit which update processes should be atomic and when we can update caches even with gaps in between.
Create a TimesliceManager class that takes care of the logic around creating new timeslice records and updating existing ones. Supposing the Revision data was imported and we have it in RAM, the criteria are the following:
For every entity (course_wiki, course_user, article_course) that was updated, find the last existing timeslice record for it. If no timeslice is found or the timeslice is finished (end - start range is more than 1 day), create a new record for it with the new data. If the timeslice is still current, then update it.
For every entity not updated (retrieve all the entities for the course and calculate the difference against the updated ones), find the last existing timeslice for it. If no timeslice exists (for example, a new course user was added recently, but the user has not made any edits yet), then create a new empty timeslice (from the beginning of the course to now). If a timeslice is found, then extend it (increase the end field).
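The two criteria above could be sketched roughly as follows, in plain Ruby with an in-memory array instead of database records. `TimesliceManagerSketch`, the `Timeslice` struct, and the `process` signature are hypothetical. Two details are assumptions the criteria do not spell out: a new record created after a finished timeslice starts where the previous one ended, and updating a current timeslice also moves its end forward.

```ruby
# Hypothetical in-memory stand-in for a timeslice row.
# `finish` plays the role of the `end` field.
Timeslice = Struct.new(:entity, :start, :finish, :data)

ONE_DAY = 24 * 60 * 60 # seconds

class TimesliceManagerSketch
  attr_reader :timeslices

  def initialize(course_start)
    @course_start = course_start
    @timeslices = []
  end

  # updates: hash of entity => new data computed from the revisions in RAM.
  # all_entities: every entity that belongs to the course.
  def process(updates, all_entities, now)
    updates.each { |entity, data| record_update(entity, data, now) }
    (all_entities - updates.keys).each { |entity| extend_or_create_empty(entity, now) }
  end

  # Last existing timeslice for an entity, or nil if there is none.
  def last_for(entity)
    @timeslices.select { |timeslice| timeslice.entity == entity }.max_by(&:finish)
  end

  private

  # A timeslice is finished when its end - start range is more than 1 day.
  def finished?(timeslice)
    timeslice.finish - timeslice.start > ONE_DAY
  end

  # Updated entity: create a new record, or update the current one.
  def record_update(entity, data, now)
    last = last_for(entity)
    if last.nil? || finished?(last)
      start = last ? last.finish : @course_start # assumption, see lead-in
      @timeslices << Timeslice.new(entity, start, now, data)
    else
      last.data = data
      last.finish = now # assumption, see lead-in
    end
  end

  # Entity without updates: create an empty timeslice, or extend the last one.
  def extend_or_create_empty(entity, now)
    last = last_for(entity)
    if last.nil?
      @timeslices << Timeslice.new(entity, @course_start, now, {})
    else
      last.finish = now
    end
  end
end

course_start = Time.utc(2024, 1, 1)
entities = %w[user_a user_b]
manager = TimesliceManagerSketch.new(course_start)

# user_a edits; user_b has no edits yet and gets an empty timeslice.
manager.process({ 'user_a' => { revision_count: 3 } }, entities, Time.utc(2024, 1, 1, 12))
# Nobody edits; both timeslices are extended.
manager.process({}, entities, Time.utc(2024, 1, 3))
# user_b edits; its last timeslice is finished, so a new record is created.
manager.process({ 'user_b' => { revision_count: 2 } }, entities, Time.utc(2024, 1, 3, 6))
```

After the three runs there are three records: one for user_a, extended on every run, and two for user_b, the empty one and the new one holding its first edits.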
Fix update_last_mw_rev_datetime method. Think about the case updating from 23:55 to 00:02.
Improve Sentry logs
Add a strategy to track errors when getting scores. One option would be to have an error field as part of the timeslice and mark timeslices that received any error while getting scores (for example, because some API was down at that time). This way we could try to reprocess the timeslice in the future.
Support changes in categories and assignments for ArticleScopedProgram and VisitingScholarship course types.
Support changes in tracked namespaces? Right now, if a namespace is added or removed, it won't modify historical data.
Support having different timeslice durations for different courses. Right now you can set the TIMESLICE_DURATION constant to the value you want and it will work, but all the courses will use the same timeslice duration.
Find some way to use UpdateWikidataStats without revisions?
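For the update_last_mw_rev_datetime item in the list above, one possible way to handle an update window such as 23:55 to 00:02 is to split the window at timeslice boundaries before processing it, so that each piece belongs to exactly one timeslice. The sketch below does not reproduce the existing method. `split_by_timeslice` is a hypothetical helper, and the value given to TIMESLICE_DURATION here (one day, in seconds) and the alignment of boundaries to UTC midnight are assumptions.

```ruby
TIMESLICE_DURATION = 24 * 60 * 60 # seconds; assumed daily timeslices

# Splits the window [from, to) into segments that never cross a timeslice
# boundary. Boundaries are assumed to fall on multiples of TIMESLICE_DURATION
# since the Unix epoch, which is UTC midnight for daily timeslices.
def split_by_timeslice(from, to)
  segments = []
  cursor = from
  while cursor < to
    next_boundary = Time.at((cursor.to_i / TIMESLICE_DURATION + 1) * TIMESLICE_DURATION).utc
    segment_end = [next_boundary, to].min
    segments << [cursor, segment_end]
    cursor = segment_end
  end
  segments
end

segments = split_by_timeslice(Time.utc(2024, 1, 1, 23, 55), Time.utc(2024, 1, 2, 0, 2))
```

Here the window is split into two segments, 23:55 to 00:00 and 00:00 to 00:02, so the last-revision datetime can be advanced separately for each day's timeslice.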
Medium priority:
We need this done so that the timeslice world does not break existing things.
Design how to deal with after-the-end revisions, currently included for calculating retention when a past course gets updated.
Make sure things like copying courses still work as expected
Translate update_wiki_namespace_stats to timeslice version (if necessary).
Replace the logic in RevisionStat class to calculate recent revisions based on revision_count field in timeslices for the last week.
Replace all the places where revision data is used, by hitting the API instead. Read this.
Low priority:
Do we need this?
Make caches based on raw uploads data (update_upload_count, update_uploads_in_use_count, update_upload_usages_count) restricted to the range of time defined by the timeslice. Right now those values are absolute values (not per-timeslice values).
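For the RevisionStat item in the medium-priority list above, the recent-revisions number could be derived by summing revision_count over the timeslices that start within the last week. This is a sketch only: `CourseWikiTimesliceRow` and `recent_revision_count` are hypothetical names, and the result is only as precise as the timeslice duration allows.

```ruby
# Hypothetical in-memory stand-in for a timeslice row with a revision_count.
CourseWikiTimesliceRow = Struct.new(:start, :finish, :revision_count)

ONE_DAY_SECONDS = 24 * 60 * 60
ONE_WEEK_SECONDS = 7 * ONE_DAY_SECONDS

# Sums revision_count over the timeslices that start within the last week.
def recent_revision_count(timeslices, now)
  cutoff = now - ONE_WEEK_SECONDS
  recent = timeslices.select { |timeslice| timeslice.start >= cutoff && timeslice.start < now }
  recent.sum(&:revision_count)
end

# Ten daily timeslices starting on 2024-01-01, with 1, 2, ..., 10 revisions.
first_day = Time.utc(2024, 1, 1)
daily_timeslices = (0...10).map do |i|
  CourseWikiTimesliceRow.new(first_day + i * ONE_DAY_SECONDS,
                             first_day + (i + 1) * ONE_DAY_SECONDS,
                             i + 1)
end

recent = recent_revision_count(daily_timeslices, Time.utc(2024, 1, 11))
```

With these inputs only the timeslices from 2024-01-04 through 2024-01-10 are counted.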
Issues
[] If the course dates are updated during a course update, then the update can fail at some point due to missing timeslices.