
Infrastructure: Fix failing link checker on GitHub README link #2931

Merged
merged 4 commits into main from fix-lint-checker on Feb 26, 2024

Conversation

@evmiguel (Contributor) commented Feb 8, 2024

This PR addresses #2907, where the link checker is failing due to a reference to a GitHub README link. The link checker fails because the page uses a react-partial element to wrap the README content.
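For illustration only (not code from this PR), here is a rough sketch of the extraction the fix relies on, assuming GitHub's current page structure: the rendered README now lives as JSON inside a script tag under the react-partial element, so the anchor ids have to be pulled from that payload's richText rather than from [id] attributes in the page. The helper name below is made up; the JSON path mirrors the diff in this PR.

const HTMLParser = require('node-html-parser');

// Sketch only: GitHub embeds the rendered README as JSON inside a
// <react-partial> element rather than exposing it as plain HTML with ids.
const extractReadmeRichText = (html) => {
  const root = HTMLParser.parse(html);
  const scripts = root
    .querySelectorAll('react-partial')
    .flatMap((element) => element.getElementsByTagName('script'));
  for (const script of scripts) {
    try {
      const payload = JSON.parse(script.innerHTML);
      const files = payload?.props?.initialPayload?.overview?.overviewFiles ?? [];
      const readme = files.find((file) => file.richText);
      if (readme) return readme.richText;
    } catch {
      // Not a JSON payload; skip it.
    }
  }
  return undefined;
};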

Questions for reviewers:

  • This is a first-pass solution, and I am looking for feedback on how to make it less brittle in case GitHub decides to change how they handle READMEs in the future. Thoughts?

WAI Preview Link (Last built on Thu, 22 Feb 2024 21:30:45 GMT).

@howard-e howard-e changed the title Fix failing link checker on GitHub README link Infrastructure: Fix failing link checker on GitHub README link Feb 12, 2024
@howard-e howard-e added the Infrastructure Related to maintaining task force and repo operations, processes, systems, documentation label Feb 12, 2024
@howard-e (Contributor) left a comment:

@evmiguel thanks for doing the research into figuring out what the issue was! I've also confirmed this solution works as expected!

I also really appreciate the consideration you've given to the brittleness of this approach.

My initial thought was to ask whether any alternative approaches were explored, so as to not risk GH pulling the rug from under link-checker.js once more 🙂. If not, that doesn't have to be something explored in this PR, but I was wondering if there was anything else to consider other than how the page ids are currently being extracted.

I've left some additional comments in the thread: design-related questions and some thoughts on mitigating the brittleness.

.link-checker.js Outdated
@@ -18,13 +20,27 @@ module.exports = {
{
name: 'github',
pattern: /^https:\/\/github\.com\/.*/,
- matchHash: (ids, hash) =>
-   ids.includes(hash) || ids.includes(`user-content-${hash}`),
+ matchHash: (ids, hash, ssr) => {
Contributor:

Suggested change
matchHash: (ids, hash, ssr) => {
matchHash: (ids, hash, { ssr }) => {

To me, passing in something like ssr here feels like an additional 'option', beyond just a 'parameter required for matching against known hashes'.

I'm wondering what you think of representing it this way (and updating other references in this PR where relevant). Also, if future edge cases like this pop up that require new 'options', this may make it easier to extend if needed.
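For example (a sketch only; responseHeaders is a hypothetical future option, not something in this PR), the call shape stays stable as more context is needed:

// Sketch: an options object keeps the matchHash signature stable.
matchHash: (ids, hash, { ssr } = {}) =>
  ids.includes(hash) || ids.includes(`user-content-${hash}`),

// Adding context later does not disturb existing callers:
matchHash: (ids, hash, { ssr, responseHeaders } = {}) =>
  ids.includes(hash) || ids.includes(`user-content-${hash}`),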

Contributor:

I agree with Howard's point about adding an object here.

I would say though that I don't think this is properly called ssr. After all, if GitHub was doing SSR properly then we would be dealing with correctly formatted HTML already and none of these workarounds would be necessary.

As you can see in my other comment, I think we need to add another option to the configuration file to contain the complexity that GitHub is forcing us to handle. When you do that maybe you can try out another name and/or way of passing data to this function that makes sense based on how you tackle it.

@@ -33,7 +33,7 @@ async function checkLinks() {
return getLineNumber;
};

- const checkPathForHash = (hrefOrSrc, ids = [], hash) => {
+ const checkPathForHash = (hrefOrSrc, ids = [], hash, ssr) => {
Contributor:

Suggested change
const checkPathForHash = (hrefOrSrc, ids = [], hash, ssr) => {
const checkPathForHash = (hrefOrSrc, ids = [], hash, { ssr } = {}) => {

To align with my previous thought

@@ -142,7 +142,18 @@ async function checkLinks() {
.querySelectorAll('[id]')
.map((idElement) => idElement.getAttribute('id'));

return { ok: response.ok, status: response.status, ids };
// Handle GitHub README links.
Contributor:

Since this is solely related to GitHub urls with hashes, I think it would be okay to check hrefOrSrc to verify the link follows those rules before running this.
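A minimal sketch of that guard (the isGithubLinkWithHash name is made up for the example; hrefOrSrc, hash, and the pattern come from this PR):

// Only run the README-payload fallback for GitHub urls that carry a hash.
const isGithubLinkWithHash =
  /^https:\/\/github\.com\/.*/.test(hrefOrSrc) && Boolean(hash);

if (isGithubLinkWithHash) {
  // Handle GitHub README links.
}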

.link-checker.js Outdated
Comment on lines 26 to 29
const overviewFiles =
ssr['props']['initialPayload']['overview']['overviewFiles'];
for (let file of overviewFiles) {
if (file['richText']) {
Contributor:

You could recursively check the sub-objects and arrays found in ssr until you find the richText attribute. Doing this could lessen the brittleness by putting the main point of failure on richText existing, which is probably more manageable than if any of the intermediate objects disappear or the path gets shifted around.

Purely for demonstration, I was thinking something along the lines of:

const getAttributeValue = (obj, attribute) => {
  if (typeof obj !== 'object' || obj === null) return undefined;
  if (obj.hasOwnProperty(attribute)) return obj[attribute];

  if (Array.isArray(obj)) {
    for (const element of obj) {
      const attributeValue = getAttributeValue(element, attribute);
      if (attributeValue !== undefined) return attributeValue;
    }
  } else {
    for (const key in obj) {
      const attributeValue = getAttributeValue(obj[key], attribute);
      if (attributeValue !== undefined) return attributeValue;
    }
  }

  return undefined;
}

// ...

const richText = getAttributeValue(ssr, 'richText');
if (richText !== undefined) {
  // ...

[My thoughts on its maintainability vs performance are another discussion, but for now we only have that one offending link, which shouldn't cause much impact]

@alflennik (Contributor) left a comment:

I left a couple of comments, but my first bit of feedback is a biggie: I verified that this does indeed fix the issue with the page. Nice!

There is one link failing, though. If you merge in the latest changes from the main branch, there is another PR, merged yesterday, which fixes it.

)
.flatMap((element) => element.getElementsByTagName('script'))
.map((element) => JSON.parse(element.innerHTML))[0];
return { ok: response.ok, status: response.status, ids, ssr };
Contributor:

Would it be possible to add an option for this work in the .link-checker file? If we can avoid dumping this complexity in the main script that would be a big win.
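As one possible shape (the extractPageData hook and its signature are assumptions, not part of this PR), the github entry in .link-checker.js could own the payload parsing so the main script only calls the hook:

{
  name: 'github',
  pattern: /^https:\/\/github\.com\/.*/,
  // Hypothetical hook: given the parsed document, return whatever extra data
  // matchHash needs, keeping GitHub-specific parsing out of the main script.
  extractPageData: (root) => {
    const scripts = root
      .querySelectorAll('react-partial')
      .flatMap((element) => element.getElementsByTagName('script'));
    return scripts.length ? JSON.parse(scripts[0].innerHTML) : undefined;
  },
  matchHash: (ids, hash, pageData) =>
    // pageData would be searched here for README anchors when ids fall short.
    ids.includes(hash) || ids.includes(`user-content-${hash}`),
},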


@evmiguel (Contributor, Author) commented:

@howard-e @alflennik thank you for your feedback! I've made some changes based on it. Looking forward to your thoughts.

@css-meeting-bot (Member) commented:

The ARIA Authoring Practices (APG) Task Force just discussed Link checker.

The full IRC log of that discussion:
<jugglinmike> Topic: Link checker
<jugglinmike> github: https://github.com//pull/2931
<jugglinmike> howard-e: The link checker fails to find a GitHub link with an ID matching the anchor
<jugglinmike> howard-e: It worked once. The reason it no longer works is that there is a server-rendered element now rendered on that page which is not easily identified using the link checker implementation
<jugglinmike> howard-e: The server-side-rendered element seems to be hiding the one we want
<jugglinmike> howard-e: I have some thoughts about the stability of this solution, but I haven't reviewed the latest changes (Erika has made more changes since I last shared feedback)
<jugglinmike> howard-e: I will be taking a look again this week
<jugglinmike> Matt_King: This is really about testing the fragment part of the link. We're not worried about the link checker failing because the URL of the page is wrong. It's just that if we specify a fragment and the fragment is invalid--that's a problem
<jugglinmike> Matt_King: We want to catch those kinds of errors when we can
<jugglinmike> Matt_King: Okay howard-e, you merge this when you think it's ready
<jugglinmike> howard-e: Will do
<jugglinmike> Zakim, end the meeting
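For context on the fragment check Matt_King describes, only the hash portion is at issue; the values below are illustrative, and the user-content- prefix is the one the existing github matcher already handles:

// Illustrative values: for a link ending in #usage the hash is "usage", and
// GitHub renders the README heading's id as "user-content-usage".
const hash = 'usage';
const ids = ['readme', 'user-content-usage'];
const matches = ids.includes(hash) || ids.includes(`user-content-${hash}`); // true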

@howard-e (Contributor) left a comment:

Looks good to me! Thanks for addressing my feedback and thanks for the research put into this work!

We'll want to pay close attention to the link-checker checks to determine how this holds up with any future GitHub changes -- but having this a bit more structured is great and makes it easier to think about adding future support, not only for GitHub but also for other domains that change unexpectedly.

@@ -1,3 +1,24 @@
const HTMLParser = require('node-html-parser');

const getAttributeValue = (obj, attribute) => {
Contributor:

Could you add a brief comment here on what this does for future context? Something like:

// Checks object for attribute and returns value. If not found on first pass, recursively checks nested objects and arrays of nested object(s) until attribute is found. If not found, returns undefined.

Contributor (Author):

Added

@alflennik (Contributor) left a comment:

Thanks for addressing my feedback, I'm happy with how this turned out!

@howard-e howard-e merged commit 829499f into main Feb 26, 2024
7 checks passed
@howard-e howard-e deleted the fix-lint-checker branch February 26, 2024 19:03