[WIP] [Data rearchitecture] Implement article status manager for timeslices #6083

Draft
wants to merge 10 commits into
base: data-rearchitecture-for-dashboard

Conversation

gabina
Member

@gabina gabina commented Jan 2, 2025

What this PR does

This PR implements a new ArticleStatusManagerTimeslice class that "imitates" the process previously done by ArticleStatusManager class.

Analysis of the ArticleStatusManager class (revisions version)

The article status manager has several steps. Conceptually, ArticleStatusManager modifies articles and revisions. Both entities are course-agnostic, i.e., they are not associated with a particular course. This is a fundamental difference from the new system: although the article entity is unchanged, the revisions entity no longer exists. Instead, we have article course timeslices, which are entities associated with a specific course. This requires special attention: if several courses share the same article, the article course timeslices must be updated for every course that uses that article.
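To illustrate the shared-article concern, here is a minimal plain-Ruby sketch. `TimesliceRow`, its `needs_update` flag, and `flag_timeslices_for_article` are hypothetical names made up for this example; they are not the models or methods used in this PR.

```ruby
# Hypothetical stand-in for an article course timeslice record.
# The real model is an ActiveRecord class with more fields.
TimesliceRow = Struct.new(:course_id, :article_id, :needs_update)

# Flag every timeslice that references the given article, regardless of
# course, and return the ids of the courses that were affected.
def flag_timeslices_for_article(timeslices, article_id)
  affected = timeslices.select { |ts| ts.article_id == article_id }
  affected.each { |ts| ts.needs_update = true }
  affected.map(&:course_id).uniq
end

timeslices = [
  TimesliceRow.new(1, 100, false), # course 1 uses article 100
  TimesliceRow.new(2, 100, false), # course 2 shares article 100
  TimesliceRow.new(2, 200, false)  # unrelated article
]
affected_courses = flag_timeslices_for_article(timeslices, 100)
```

The point is only that a change to one article fans out to every course that has timeslices for it, not just the course currently being updated.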

New implementation

Due to the fundamental differences between the systems, a direct translation from one implementation to the other is not possible. The behavior of the class has to be redesigned around the properties of the timeslice system.

As part of this PR, we modified the logic for article course timeslice creation. Before this PR, timeslices were created only for articles courses (articles edited by the course users in the tracked namespaces). Now, timeslices are created for all articles with at least one revision, but article course records are created only for the articles that are pertinent to the course. This way we keep data for articles that, for example, are created as drafts and later moved to the main namespace.
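The new creation rule can be sketched roughly as follows. This is an illustration only: `ArticleStub`, `plan_records`, and `EXAMPLE_TRACKED_NAMESPACES` are hypothetical, and "pertinent to the course" is assumed here to mean "not deleted and in a tracked namespace".

```ruby
# Hypothetical stand-in for an article that has at least one revision.
ArticleStub = Struct.new(:id, :namespace, :deleted)

# Assumption for the example: the course tracks only the main namespace (0).
EXAMPLE_TRACKED_NAMESPACES = [0].freeze

# Every article gets timeslices; only pertinent articles get an
# article course record.
def plan_records(articles, tracked_namespaces)
  pertinent = articles.select do |article|
    !article.deleted && tracked_namespaces.include?(article.namespace)
  end
  { timeslices: articles.map(&:id), articles_courses: pertinent.map(&:id) }
end

articles = [
  ArticleStub.new(1, 0, false),  # mainspace article
  ArticleStub.new(2, 118, false) # draft (namespace 118 on English Wikipedia)
]
plan = plan_records(articles, EXAMPLE_TRACKED_NAMESPACES)
```

In this sketch the draft already has timeslice data, so if it later moves to the main namespace only the article course record needs to be added.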

The new ArticleStatusManagerTimeslice class does the following:

  1. Update articles that were moved, deleted or undeleted. It iterates over the article course timeslices to sync the corresponding articles, updating the title, namespace, deleted and mw_page_id fields.
  2. Reset articles according to their new status:
    • Articles that were deleted or untracked: These are articles that were either deleted or moved to a namespace not tracked by the course. Such articles should be excluded from course statistics.
    • Articles that were restored or re-tracked: These are articles that were either undeleted or moved to a namespace relevant to the course. Such articles should be included in course statistics.
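The classification in step 2 can be sketched by comparing each article's state before and after the sync in step 1. This is a simplified illustration, not the code in this PR: `ArticleState`, `tracked?`, and `classify_status_changes` are hypothetical names, and the tracked namespaces are passed in as a plain array.

```ruby
# Hypothetical snapshot of the fields that step 1 keeps in sync.
ArticleState = Struct.new(:id, :namespace, :deleted)

# An article counts towards course statistics when it is not deleted
# and lives in a namespace tracked by the course.
def tracked?(state, tracked_namespaces)
  !state.deleted && tracked_namespaces.include?(state.namespace)
end

# Compare states before and after the sync and return the ids of articles
# that must be excluded from (untracked) or included in (retracked) stats.
def classify_status_changes(before, after, tracked_namespaces)
  result = { untracked: [], retracked: [] }
  after.each do |id, new_state|
    old_state = before[id]
    next if old_state.nil? # no previous state to compare against
    was_tracked = tracked?(old_state, tracked_namespaces)
    is_tracked = tracked?(new_state, tracked_namespaces)
    result[:untracked] << id if was_tracked && !is_tracked
    result[:retracked] << id if !was_tracked && is_tracked
  end
  result
end

before = {
  1 => ArticleState.new(1, 0, false),   # mainspace article
  2 => ArticleState.new(2, 118, false), # draft
  3 => ArticleState.new(3, 0, false)    # mainspace article
}
after = {
  1 => ArticleState.new(1, 0, true),  # deleted
  2 => ArticleState.new(2, 0, false), # moved to mainspace
  3 => ArticleState.new(3, 0, false)  # unchanged
}
changes = classify_status_changes(before, after, [0])
```

Here article 1 ends up in `:untracked`, article 2 in `:retracked`, and article 3 is left alone because its status did not change.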

Open questions and concerns

It looks like there are two main use cases for the ArticleStatusManager class. The usual one is through the periodic course update. The other one is through the cleanup scripts: docs/cleanup_scripts/duplicate_articles.rb and docs/cleanup_scripts/duplicate_mw_page_id_handling.rb. In the latter case, update_status is invoked with a single article, which is something of a special case. I'm not entirely sure whether those cleanup scripts are still used, as it seems that we added a db restriction in the Articles table (see #4381). If the scripts are no longer used, we can simplify the new implementation.

@gabina gabina changed the title [WIP] [Data rearchitecture] implement article status manager [WIP] [Data rearchitecture] Implement article status manager for timeslices Jan 5, 2025