Expand Slur redaction feature to more websites #254

Closed
Tracked by #252
tarunima opened this issue Jun 15, 2023 · 5 comments · Fixed by #299
Assignees
Labels
level:feature An issue that describes a feature (initiative>feature>ticket) level:ticket An issue that describes a ticket (initiative>feature>ticket) priority:high role:backend role:frontend size:1pt track: P1 Product track: user-facing tool for ogbv response
Milestone

Comments

@tarunima
Collaborator

No description provided.

@tarunima tarunima mentioned this issue Jun 15, 2023
3 tasks
@tarunima tarunima moved this from Backlog to In Dev in Uli Roadmap 2021 - 2022 Jun 15, 2023
@dennyabrain dennyabrain added this to the 0.2.0 milestone Jun 22, 2023
@dennyabrain dennyabrain changed the title Expand redaction feature to more websites Expand Slur redaction feature to more websites Jun 22, 2023
@dennyabrain dennyabrain moved this from In Progress to Todo in Uli 2022 - 2023 Delivery Plan Jun 22, 2023
@tarunima tarunima added the track: P1 Product track: user-facing tool for ogbv response label Jul 3, 2023
@dennyabrain dennyabrain assigned duggalsu and unassigned dennyabrain Jul 10, 2023
@duggalsu
Contributor

For expanding to other websites, a single code base won't work across all websites. We will require site-specific DOM handling.

This is because we have been querying the <p>, <span> and <li> tags and assuming them to be leaf nodes for direct slur replacement. However, on many sites these are not leaf nodes, so entire DOM snippets get replaced with text, which modifies the page structure and breaks functionality.

For example, the <p> tag can contain elements such as forms, bold/italic formatting, link anchors and superscripts, as is the case with Adminer (the database tool) and Wikipedia.

Adminer breaks here: the button and checkbox disappear on slur replacement.

<p><input type='submit' value='Login'>
<label><input type='checkbox' name='auth[permanent]' value='1'>Permanent login</label>

On Wikipedia, Uli works on the main content in the <p> tag even though it does not consider leaf nodes explicitly, but it breaks the side menu bar built from <li> tags (see below). It may also break inconsistently within text, especially text that is surrounded by other tags.
https://en.wikipedia.org/wiki/%22Hello,_World!%22_program

<p>A <b>"Hello, World!" program</b> is generally a <a href="/wiki/Computer_program" title="Computer program">computer program</a> that ignores any input, and outputs or displays a message similar to "Hello, World!". A small piece of code in most <a href="/wiki/General-purpose_programming_language" title="General-purpose programming language">general-purpose programming languages</a>, this program is used to illustrate a language's basic <a href="/wiki/Syntax_(programming_languages)" title="Syntax (programming languages)">syntax</a>. "Hello, World!" programs are often the first a student learns to write in a given language,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">&#91;1&#93;</a></sup> and they can also be used as a <a href="/wiki/Sanity_check" title="Sanity check">sanity check</a> to ensure computer software intended to compile or run <a href="/wiki/Source_code" title="Source code">source code</a> is correctly installed, and that its operator understands how to use it.
</p>

The <li> tag can contain lists, link anchors, <div> and <span> elements, as in the Wikipedia side menu bar and the GitHub top bar. Both the Wikipedia and GitHub menu bars currently break with Uli.

https://en.wikipedia.org/wiki/%22Hello,_World!%22_program

<li id="toc-History"
		class="vector-toc-list-item vector-toc-level-1 vector-toc-list-item-expanded">
		<a class="vector-toc-link" href="#History">
			<div class="vector-toc-text">
			<span class="vector-toc-numb">1</span>History</div>
		</a>
		
		<ul id="toc-History-sublist" class="vector-toc-list">
		</ul>
</li>
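
The breakage described above can be reproduced without a browser. The sketch below models a simplified version of the Wikipedia <li> as plain objects that mimic only the `nodeType`, `nodeName`, `childNodes` and `textContent` fields of DOM nodes. The helper names (`el`, `text`, `overwriteElement`, `replaceInTextLeaves`, `countElements`) are made up for illustration and are not taken from the Uli code. It contrasts overwriting an element's content with rewriting only its text leaves.

```javascript
// Minimal stand-ins for DOM nodes; only the fields used here are modelled.
const text = (value) => ({ nodeType: 3, nodeName: "#text", childNodes: [], textContent: value });
const el = (name, ...children) => ({ nodeType: 1, nodeName: name, childNodes: children });

// Simplified version of the table-of-contents item quoted above.
const buildItem = () =>
  el("LI",
    el("A", el("DIV", el("SPAN", text("1")), text("History"))),
    el("UL"));

// Concatenate the text of all text leaves in a subtree.
function collectText(node) {
  if (node.nodeType === 3) return node.textContent;
  return node.childNodes.map(collectText).join("");
}

// Count element nodes in a subtree, including the root.
function countElements(node) {
  if (node.nodeType !== 1) return 0;
  return 1 + node.childNodes.reduce((sum, child) => sum + countElements(child), 0);
}

// Approach that breaks pages: treat the element as a leaf and overwrite its content.
function overwriteElement(node, from, to) {
  const flattened = collectText(node).replaceAll(from, to);
  node.childNodes = [text(flattened)]; // every child element is lost
}

// Approach that preserves structure: only touch text leaves.
function replaceInTextLeaves(node, from, to) {
  if (node.nodeType === 3) {
    node.textContent = node.textContent.replaceAll(from, to);
  } else {
    node.childNodes.forEach((child) => replaceInTextLeaves(child, from, to));
  }
}

const broken = buildItem();
overwriteElement(broken, "History", "▓▓▓");

const intact = buildItem();
replaceInTextLeaves(intact, "History", "▓▓▓");

// The first count shows the link, div, span and sublist are gone;
// the second shows they all survive.
console.log(countElements(broken), countElements(intact));
```

Both variants end up with the same visible text, but only the second keeps the anchor and the nested list that the menu depends on.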

@dennyabrain
Contributor

dennyabrain commented Jul 12, 2023

Hey @duggalsu, I had a breakthrough. Pasting working code that I was able to use to find and replace text content on wikipedia.

// Traverse DOM nodes to find leaf nodes that are text nodes and process them
function dft(node){
    // nodeType 3 is a text node (Node.TEXT_NODE)
    if(node.childNodes.length===0 && node.nodeType === 3){
        console.log("found leaf text node", node)
        // replaceAll: replace() with a string pattern only changes the first match
        node.textContent = node.textContent.replaceAll("g", "🔥")
    }
    else{
        node.childNodes.forEach((child)=>dft(child))
    }
}

// filtering logic to get the right set of dom elements
let body = document.getElementsByTagName("body")
let first_body = body[0]
let elements = first_body.children
let elements_array = Array.from(elements)
let relevant_elements = elements_array.filter((element)=>["P","DIV"].includes(element.nodeName))

for(const element of relevant_elements){
    dft(element)
}

Try pasting this in the console with Wikipedia open. It will replace all instances of the character "g" with the fire emoji. There might be edge cases in my logic, but see if you can find any reason why this is not an acceptable solution.
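
The single-character replacement is only a stand-in for the real word list. A minimal sketch of the per-text-node replacement step might look like the following. The word list, the helper names (`escapeRegExp`, `buildPattern`, `redactText`) and the choice of "▓" as the replacement are all assumptions for illustration and are not taken from the Uli code base. The pattern matches a listed word only when it is not directly attached to another letter, combining mark or digit, using Unicode property escapes rather than `\b` so that the boundary check is not limited to ASCII.

```javascript
// Hypothetical placeholder list; the real extension would supply its own.
const WORDS = ["badword", "worse"];

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one case-insensitive pattern that matches any listed word when it is
// not directly preceded or followed by a letter, combining mark or digit.
function buildPattern(words) {
  const alternatives = words.map(escapeRegExp).join("|");
  const edge = "[\\p{L}\\p{M}\\p{N}]";
  return new RegExp(`(?<!${edge})(?:${alternatives})(?!${edge})`, "giu");
}

// Replace every match with a block of the same length.
function redactText(input, pattern) {
  return input.replace(pattern, (match) => "▓".repeat(match.length));
}

const pattern = buildPattern(WORDS);
console.log(redactText("A Badword here, but badwording is left alone.", pattern));
```

Inside the traversal, the line that rewrites `textContent` would call `redactText` instead of replacing a single character.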

References

To Investigate

  • I didn't know that an HTML document could have multiple <body> elements, and I am not sure it is standard practice to have more than one. For now, I am only working with the first body element (let first_body = body[0]). Maybe we'll need to iterate over every body element and run our logic.
  • When extracting relevant elements, I am only looking at <p> and <div>. This is only because I saw these two elements at the root level on Wikipedia. Maybe we should include elements like <ul>, <ol>, <a>, <span>, etc.
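
On the two open questions: the HTML standard allows only one <body> element per document, and `document.body` returns it directly, so iterating over several bodies should not be needed. One possible way to sidestep the tag filter is to start the traversal at the body itself and skip only the subtrees whose text should never be rewritten. The sketch below assumes a hypothetical skip list (`SCRIPT`, `STYLE`, `TEXTAREA`) and a hypothetical helper `forEachTextLeaf`. It uses plain objects in place of real DOM nodes so that it can run outside a browser; in a page, the root would be `document.body`.

```javascript
// Hypothetical skip list: element types whose text should be left untouched.
const SKIPPED = new Set(["SCRIPT", "STYLE", "TEXTAREA"]);

// Visit every text leaf under `root`, skipping the subtrees listed above.
function forEachTextLeaf(root, visit) {
  if (root.nodeType === 3) { // text node
    visit(root);
    return;
  }
  // Ignore comments and other non-element nodes, and skipped elements.
  if (root.nodeType !== 1 || SKIPPED.has(root.nodeName)) return;
  root.childNodes.forEach((child) => forEachTextLeaf(child, visit));
}

// Stand-ins for DOM nodes; only the fields used above are modelled.
const text = (value) => ({ nodeType: 3, nodeName: "#text", childNodes: [], textContent: value });
const el = (name, ...children) => ({ nodeType: 1, nodeName: name, childNodes: children });

const body = el("BODY",
  el("UL", el("LI", el("A", text("go home")))),
  el("SCRIPT", text("var g = 1")),
  el("P", text("big "), el("B", text("dog"))));

forEachTextLeaf(body, (node) => {
  node.textContent = node.textContent.replaceAll("g", "🔥");
});

// Text inside <ul>, <li>, <a>, <p> and <b> is rewritten; the script is not.
console.log(body.childNodes[0].childNodes[0].childNodes[0].childNodes[0].textContent);
console.log(body.childNodes[1].childNodes[0].textContent);
```

With this shape, <ul>, <ol>, <a> and <span> need no special handling, because every element that is not skipped is walked down to its text leaves.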

@dennyabrain
Contributor

I called my function dft because I thought I was doing a depth-first traversal. I just realized it might be better named bft, for breadth-first traversal.
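
For reference, the recursive function finishes each child's whole subtree before moving on to the next sibling, which is the depth-first order, so the original name fits. A breadth-first traversal would instead use a queue and visit nodes level by level. Both reach every text leaf, so the choice does not affect the replacement. The sketch below uses plain objects as stand-ins for DOM nodes, and the function names are illustrative only.

```javascript
// Stand-in for a DOM node: a name plus its children.
const node = (name, ...children) => ({ name, childNodes: children });

// Tree:      body
//           /    \
//          p      div
//         / \      |
//        b   a    span
const tree = node("body",
  node("p", node("b"), node("a")),
  node("div", node("span")));

// Depth-first (same shape as the recursive dft function): recurse into each child.
function depthFirst(root, order = []) {
  order.push(root.name);
  root.childNodes.forEach((child) => depthFirst(child, order));
  return order;
}

// Breadth-first: keep a queue and visit level by level.
function breadthFirst(root) {
  const order = [];
  const queue = [root];
  while (queue.length > 0) {
    const current = queue.shift();
    order.push(current.name);
    queue.push(...current.childNodes);
  }
  return order;
}

console.log(depthFirst(tree).join(" "));   // body p b a div span
console.log(breadthFirst(tree).join(" ")); // body p div b a span
```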

@dennyabrain dennyabrain self-assigned this Jul 12, 2023
@duggalsu
Contributor

Hey @dennyabrain, I was also toying with nodeType earlier, but this code is working successfully with minor edge-case handling and optimization. It also handles <li> and <a> elements without any modification. I haven't seen any issues with <span> yet. No site seems to be breaking now.

@dennyabrain dennyabrain moved this from Todo to In Progress in Uli 2022 - 2023 Delivery Plan Jul 12, 2023
@dennyabrain dennyabrain moved this to In Progress in 2023 Q3 Planner Jul 12, 2023
@dennyabrain
Contributor

@duggalsu With respect to listening for continuous changes, I got this unoptimized code working on Reddit:

// Replace text in every leaf text node under `node`
function dft(node){
    if(node.childNodes.length===0 && node.nodeType === 3){
        console.log("found leaf text node", node)
        // replaceAll: replace() with a string pattern only changes the first match
        node.textContent = node.textContent.replaceAll("g", "🔥")
    }
    else{
        node.childNodes.forEach((child)=>dft(child))
    }
}

let body = document.getElementsByTagName("body")
let first_body = body[0]

// Re-run the replacement over the whole body whenever the page changes
const config = { attributes: true, childList: true, subtree: true };
const callback = (mutationList, observer) => {
    let elements = first_body.children
    let elements_array = Array.from(elements)
    let relevant_elements = elements_array.filter((element)=>["P","DIV"].includes(element.nodeName))

    for(const element of relevant_elements){
        dft(element)
    }
}
const observer = new MutationObserver(callback);
observer.observe(first_body, config);

There is a lot of documentation on MutationObserver here that I haven't read fully. It's quite possible that, in my code above, our on-change callback gets called a lot more often than it should be. Can you first evaluate whether this is a good solution and, if so, see if we can reduce the amount of processing on our side?
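
One possible way to reduce the work, offered as a sketch rather than a tested fix: instead of re-walking the whole body on every callback, walk only the nodes reported in each record's `addedNodes`. `type` and `addedNodes` are standard `MutationRecord` fields; `processAddedNodes` and the fake records below are illustrative assumptions. The browser wiring is shown only in a comment, because `MutationObserver` is not available outside a browser.

```javascript
// Same leaf-text replacement as in the comment above.
function dft(node) {
  if (node.childNodes.length === 0 && node.nodeType === 3) {
    node.textContent = node.textContent.replaceAll("g", "🔥");
  } else {
    node.childNodes.forEach((child) => dft(child));
  }
}

// Walk only the subtrees that were just inserted; ignore other record types.
// Returns how many added nodes were handed to `process`.
function processAddedNodes(mutationList, process) {
  let processed = 0;
  for (const mutation of mutationList) {
    if (mutation.type !== "childList") continue;
    mutation.addedNodes.forEach((added) => {
      process(added);
      processed += 1;
    });
  }
  return processed;
}

// In a browser (not executed here):
//   dft(document.body); // one initial pass over the existing page
//   new MutationObserver((list) => processAddedNodes(list, dft))
//     .observe(document.body, { childList: true, subtree: true });

// Fake nodes and records standing in for what the observer would deliver.
const text = (value) => ({ nodeType: 3, childNodes: [], textContent: value });
const el = (name, ...children) => ({ nodeType: 1, nodeName: name, childNodes: children });

const newComment = el("DIV", el("P", text("a new gag")));
const records = [
  { type: "attributes", addedNodes: [] },
  { type: "childList", addedNodes: [newComment] },
];

const count = processAddedNodes(records, dft);
console.log(count, newComment.childNodes[0].childNodes[0].textContent);
```

Dropping `attributes: true` from the config also avoids callbacks for class or style changes, which never introduce new text. Text that a site edits in place would additionally need `characterData: true`, at the cost of the observer also seeing its own replacements.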

@github-project-automation github-project-automation bot moved this from In Progress to Done in Uli 2022 - 2023 Delivery Plan Jul 12, 2023
@dennyabrain dennyabrain added level:feature An issue that describes a feature (initiative>feature>ticket) level:ticket An issue that describes a ticket (initiative>feature>ticket) labels Jul 14, 2023