Data rearchitecture for dashboard #5808

Open
21 of 31 tasks
gabina opened this issue May 23, 2024 · 0 comments
gabina commented May 23, 2024

Current Behavior:

At the moment, we store all the raw revision data in our Revisions table, and query historical revision data in different parts of the app, particularly to calculate course stats caches. These are among the heaviest queries we currently run.

Desired Behavior:

We want to store stats for each course by time period (for example, week-by-week stats), and only use revision data during the update for a given time period, so that we can remove the Revisions table altogether. This will dramatically reduce the storage requirements of the system and remove one of the major database performance bottlenecks.
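
As a rough illustration of the idea, course-level cache values can be rebuilt by summing per-period rows instead of querying raw revisions. This is a minimal plain-Ruby sketch, not the dashboard's code: the `Timeslice` struct, its fields, and `course_cache_from` are hypothetical names chosen for the example.

```ruby
require 'date'

# Hypothetical, simplified stand-in for one per-period stats row.
Timeslice = Struct.new(:start, :finish, :character_sum, :revision_count,
                       keyword_init: true)

# Rebuild course-level cache values by summing the per-period rows.
# No raw revision data is needed at this point.
def course_cache_from(timeslices)
  {
    character_sum: timeslices.sum(&:character_sum),
    revision_count: timeslices.sum(&:revision_count)
  }
end

# Two made-up weekly timeslices for one course.
WEEKLY = [
  Timeslice.new(start: Date.new(2024, 5, 6), finish: Date.new(2024, 5, 13),
                character_sum: 1200, revision_count: 8),
  Timeslice.new(start: Date.new(2024, 5, 13), finish: Date.new(2024, 5, 20),
                character_sum: 300, revision_count: 2)
].freeze

p course_cache_from(WEEKLY)
```

Only the revisions for the period being updated need to be fetched; earlier periods contribute through their stored sums.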

Tasks:

Non-exhaustive list of things to do:

High priority:

We need this done so that the timeslice world works.

  • Prepare deployment
  • Create timeslices models
  • Add indexes for models
  • Implement cache updates for ArticleCourseWikiTimeslices and ArticlesCourses.
  • Implement cache updates for CourseUserWikiTimeslices and CoursesUsers.
  • Implement cache updates for CourseWikiTimeslices and Courses.
  • Check that we handle time zones correctly
  • Revisit how to calculate ArticlesCourses.new_article because it's not an invariant: if the first revision is removed, then the following revision starts to be the first revision (with new_article = true). We should add a new_article field to ArticleCourseTimeslices and calculate ArticlesCourses.new_article based on that.
  • Refactor the timeslice models to avoid using join fields. For example, use two different fields course and wiki instead of using a single courses_wikis field.
  • Replace the logic in ArticlesCourses.update_from_course. Currently, it removes all the ArticlesCourses that do not correspond to course revisions (due to updates to the course dates or tracked wikis) and adds new ArticlesCourses. It should instead calculate new ArticlesCourses based on revisions in RAM. Changes to course dates or tracked wikis should trigger recalculation of existing timeslices or calculation of new ones. Note: this should be done when implementing CourseUserUpdater ✔️, CourseWikiUpdater ✔️ and CourseDatesUpdater ✔️.
  • Add support for untracking articles.
  • THIS WON'T BE NECESSARY: Replace the logic in DuplicateArticleDeleter.resolve_duplicates. Instead of removing "limbo revisions", we should try to identify whether any timeslice needs to be recalculated.
  • Replace the logic derived from ArticleStatusManager.update_article_status_for_course(@course). This ends up removing "limbo revisions" through the ModifiedRevisionsManager class. Related to the point mentioned above.
  • Revisit view counts calculation when TimesliceManager is implemented.
  • Make UpdateCourseStatsTimeslice work wiki by wiki (so that it is not necessary to have all the revisions in memory).
  • Revisit which update processes should be atomic and when we can update caches even with gaps in between.
  • Create a TimesliceManager class that takes care of the logic around creating new timeslice records and updating existing ones. Supposing the Revision data was imported and we have it in RAM, the criteria are the following:
    • For every entity (course_wiki, course_user, article_course) that was updated, find the last existing timeslice record for it. If no timeslice is found or the timeslice is finished (end - start range is more than 1 day), create a new record for it with the new data. If the timeslice is still current, then update it.
    • For every entity not updated (retrieve all the entities for the course and calculate the difference against the updated ones), find the last existing timeslice for it. If no timeslice exists (for example, a new course user was added recently, but the user didn’t make any edit yet), then create a new empty timeslice (from beginning of the course to now). If a timeslice is found, then extend it (increase the end field).
  • Fix update_last_mw_rev_datetime method. Think about the case updating from 23:55 to 00:02.
  • Improve Sentry logs
  • Add a strategy to track errors when getting scores. One option would be to have an error field as part of the timeslice and mark timeslices that received any error while getting scores (for example, because some API was down at that time). This way we could try to reprocess the timeslice in the future.
  • Support changes in categories and assignments for ArticleScopedProgram and VisitingScholarship course types.
  • Support changes in tracked namespaces? Right now, if a namespace is added or removed, it won't modify historical data.
  • Support having different timeslice durations for different courses. Right now you can set the TIMESLICE_DURATION constant to the value you want and it will work, but all the courses will use the same timeslice duration.
  • Find some way to use the UpdateWikidataStats without revisions?
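
The TimesliceManager criteria listed above could be sketched roughly as follows. This is an in-memory, plain-Ruby illustration under several assumptions, not the real class: `TimesliceManagerSketch`, `Slice`, and the `apply` method are hypothetical names, "update" is taken to mean adding the new count and extending the end, and a new timeslice is assumed to start where the previous one ended (the task list does not specify either point).

```ruby
ONE_DAY = 24 * 60 * 60 # seconds

# Hypothetical timeslice row for one entity (course_wiki, course_user, ...).
Slice = Struct.new(:entity_id, :start, :finish, :revision_count,
                   keyword_init: true)

class TimesliceManagerSketch
  attr_reader :slices

  def initialize(course_start:, entity_ids:)
    @course_start = course_start
    @entity_ids = entity_ids
    @slices = []
  end

  # updates: { entity_id => revision count computed from revisions in RAM }
  def apply(updates, now:)
    @entity_ids.each do |id|
      last = last_slice_for(id)
      if updates.key?(id)
        record_update(id, last, updates[id], now)
      else
        record_no_update(id, last, now)
      end
    end
  end

  private

  def last_slice_for(id)
    @slices.select { |s| s.entity_id == id }.max_by(&:finish)
  end

  # "Finished" means the end - start range is more than 1 day.
  def finished?(slice)
    slice.finish - slice.start > ONE_DAY
  end

  # Updated entity: create a new record if there is none or the last one
  # is finished; otherwise update the current one.
  def record_update(id, last, count, now)
    if last.nil? || finished?(last)
      start = last ? last.finish : @course_start # assumption, see lead-in
      @slices << Slice.new(entity_id: id, start: start, finish: now,
                           revision_count: count)
    else
      last.revision_count += count
      last.finish = now
    end
  end

  # Entity without updates: create an empty timeslice from the course
  # start to now, or extend the existing one.
  def record_no_update(id, last, now)
    if last.nil?
      @slices << Slice.new(entity_id: id, start: @course_start, finish: now,
                           revision_count: 0)
    else
      last.finish = now
    end
  end
end
```

In the real app the lookups would be database queries per entity type; the sketch only shows the branching.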

Medium priority:

We need this done so that the timeslice world does not break existing things.

  • Design how to deal with after-the-end revisions, currently included for calculating retention when a past course gets updated.
  • Make sure things like copying courses still work as expected
  • Translate update_wiki_namespace_stats to timeslice version (if necessary).
  • Replace the logic in RevisionStat class to calculate recent revisions based on revision_count field in timeslices for the last week.
  • Replace every use of stored revision data with calls to the API instead. Read this.
  • Issue DiffViewer relies on a Revisions table query #5806
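
For the RevisionStat item above, a recent-revisions count could in principle be derived from the revision_count field of the most recent timeslices. The following is a plain-Ruby sketch only: `DailySlice` and `recent_revision_count` are hypothetical names, and it assumes day-long timeslices whose `start` is a date.

```ruby
require 'date'

# Hypothetical day-long timeslice row; only the fields used here.
DailySlice = Struct.new(:start, :revision_count, keyword_init: true)

# Sum revision_count over the timeslices that started within the last
# `days` days before `today`. With longer timeslices this would only be
# an approximation, since whole timeslices are counted.
def recent_revision_count(timeslices, today:, days: 7)
  cutoff = today - days
  timeslices
    .select { |t| t.start >= cutoff && t.start < today }
    .sum(&:revision_count)
end

# Ten made-up daily timeslices, May 1 to May 10, one revision each.
SLICES = (1..10).map do |day|
  DailySlice.new(start: Date.new(2024, 5, day), revision_count: 1)
end.freeze

puts recent_revision_count(SLICES, today: Date.new(2024, 5, 11))
```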

Low priority:

Do we need this?

  • Restrict the caches based on raw uploads data (update_upload_count, update_uploads_in_use_count, update_upload_usages_count) to the range of time defined by the timeslice. Right now those values are absolute values (not per timeslice).

Issues

  • [] If the course dates are updated during a course update, then the update can fail at some point due to missing timeslices.