Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discovery] Ownership of edx-platform #181

Open
1 task
robrap opened this issue Feb 3, 2023 · 6 comments
Open
1 task

[Discovery] Ownership of edx-platform #181

robrap opened this issue Feb 3, 2023 · 6 comments
Labels
maintenance Tasks for keeping code in good repair

Comments

@robrap
Copy link
Contributor

robrap commented Feb 3, 2023

Note: This ticket may go on the backlog and/or have sub-tickets broken out.

This is a list of a variety of issues related to owning edx-platform:

  • Better handling of the addition of new apps.
  • There may be another ticket for this, but the ownership sync job that looks for new edx-platform directories appearing or disappearing is broken. It is unclear how out of sync the spreadsheet currently is with the filesystem.
  • Do we have minimum acceptability for Performance and Errors in edx-platform that we want to communicate?
    • Note that too many errors, even if "owned" by a "code_owner_squad", make it difficult to see anything else at the overview level.
  • Long running web requests (>1 minute) are erroring in Splunk with 499 from the ELB, but Django doesn't know about this timeout (not configured in nginx or gunicorn), so it is sometimes timing out at 5 minutes, and sometimes returning a 200 that isn't a real 200.
    • Here is data from New Relic: https://onenr.io/0BQr185r3jZ
    • Any web request that was taking >1 minute, but <5 minutes, and was succeeding and caching results in cache or DB, etc., might now fail repeatedly, where it used to possibly fail just on the initial request.
    • We should consider introducing config changes to align all timeouts to 1 minute, or slightly shorter, but we may want to resolve or investigate issues first, so we know what consequences we will have.
    • Do we want to force code owners to adjust error alerts to include these calls?
  • The grades API doesn't have rate limiting and was having issues because it is unprotected. Is there a way for us to introduce default rate limiting that supports current traffic on all requests, but prevents major changes in throughput? There is risk that we will temporarily break things, and need to disable, fix endpoint, and re-enable, etc.
  • Auto transactions for requests are no longer the default for Django, presumably for Performance reasons. We could use linting to ensure that all new endpoints explicitly disable (or enable) and amnesty old endpoints. This would allow us to stop the bleeding, and we could continue to fix poor performers or db locking issues this way.
  • On what cadence to we check that New Relic's ignored errors matches our settings and docs? See https://2u-internal.atlassian.net/wiki/spaces/AT/pages/16386732/Expected+and+Ignored+Error+Runbook
  • I was thinking of introducing a requirement for removing outdated toggles on maintainers, so they remove what they add. Feanil pointed out that edx-platform is where the most toggles live. Is there any kind of linting or notifications around this to better ensure people follow-up on removing their toggles?
  • I think the main docs (including ADRs) of edx-platform are not on readthedocs, but maybe I'm wrong. Also, there are separate "guides" in edx-platform that are on readthedocs, but I don't think they should really be separate. Maybe they were kept separate simply to avoid a larger set of docs that maybe doesn't build?
@robrap robrap added this to Arch-BOM Feb 3, 2023
@robrap robrap converted this from a draft issue Feb 3, 2023
@robrap robrap changed the title [Discovery] Maintaining edx-platform [Discovery] Ownership of edx-platform Feb 3, 2023
@robrap robrap added the maintenance Tasks for keeping code in good repair label Feb 3, 2023
@robrap robrap removed the status in Arch-BOM Mar 16, 2023
@rgraber rgraber added the backlog To be put on a team's backlog or wishlist label Mar 20, 2023
@robrap robrap removed the backlog To be put on a team's backlog or wishlist label Apr 20, 2023
@robrap
Copy link
Contributor Author

robrap commented Apr 21, 2023

FYI: @kdmccormick: I heard you were also thinking about edx-platform maintainer work.

@kdmccormick
Copy link

I am, thanks!

@kdmccormick
Copy link

kdmccormick commented May 12, 2023

From #maintainers-pilot in the openedx Slack (https://openedx.slack.com/archives/C03R320AFJP/p1683901316195499?thread_ts=1682001481.307689&cid=C03R320AFJP):

Kyle:

The current shape of edx-platform responsibilities is that the root directory is owned by one lead team (Jeremy & his BOM squad at 2U) and various directory subtrees are delegated to different 2U teams, with any un-delegated subtrees remaining the responsibility of the lead team. That system seems to have worked well for 2U, and it would make sense to bring it forward into the maintenance world, with the difference that teams could be from any organization. This system will start coming online in the next phase of the maintainers pilot.

In terms who who the "lead team" would be, Feanil, Jeremy, and I discussed a couple ideas. One idea was working together as joint lead-maintainers of edx-platform. Our thinking was that working together across orgs would make it easier to coordinate big edx-platform restructuring and refactoring projects that Feanil and I see on the horizon.

We actually came to the conclusion that it'd be more efficient just to have Jeremy and his BOM squad continue as lead-maintainer of edx-platform. Our hope is that coordinating with BOM via ADRs should be enough to unblock Feanil and I (or anyone else, for that matter) from driving big changes to edx-platform. We can also make use of the sub-directory delegation to align contributors/teams with parts of the repository that they plan to make big changes to.

Comments/concerns welcome

@robrap
Copy link
Contributor Author

robrap commented May 12, 2023

Thanks @kdmccormick. If a non-2U team ends up owning any edx-platform Django app, we'd need to ensure 2U's tooling can handle Production on-call appropriately for that area. There probably is not a one-size-fits-all solution here. Some options include:

  • finding a 2U team that will handle monitoring, or
  • leaving monitoring to the lead-maintainer (BOM, in the case of edx-platform), and/or
  • having an escalation path to the outside team.

@kdmccormick
Copy link

@robrap Agreed. What happens now when there are alerts from non-2U-maintained packages which edx-platform uses, such as blockstore (OpenCraft) and openedx-filters (eduNEXT)? Could that same solution be used for non-2U-maintained edx-platform directories?

@robrap
Copy link
Contributor Author

robrap commented May 12, 2023

What happens now when there are alerts from non-2U-maintained packages which edx-platform uses, such as blockstore (OpenCraft) and openedx-filters (eduNEXT)? Could that same solution be used for non-2U-maintained edx-platform directories?

It's really the Django apps that are a problem because monitoring ownership is attributed to the owner of the view used for the request. I'm guessing those don't have associated views?

@jristau1984 jristau1984 moved this to Backlog in Arch-BOM Jul 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Tasks for keeping code in good repair
Projects
Status: Backlog
Development

No branches or pull requests

3 participants