Re-define `lychee` as both a link checker and a link linter #269

lebensterben · 2021-06-24T19:45:09Z

Currently, lychee is just a link checker, which conceptually has the following components:

1. It parses a config file or CLI arguments and turns them into an internal configuration.
2. An overseer is created from that configuration. It serves several purposes:
   - Detection: It detects/scrapes HTML links and email addresses from the input file(s).
   - Filtering: It decides whether each link should be checked or skipped.
   - Dispatching: It sends URLs to workers, which verify that a given URL is valid by checking whether it is accessible.
   - Logging: It keeps a record of every URL it encounters and the status of each one (Valid, Ignored, Timeout, etc.).
3. The result is returned either as CLI output or to a log file.
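
The components listed above can be sketched roughly as follows. This is only an illustration of the conceptual flow (detect, filter, dispatch, log), not lychee's actual code: the names `Overseer`, `detect`, and `Status` are hypothetical, the regexes are deliberately naive, and the "worker" is a plain function injected by the caller so that no network access is needed.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    VALID = "valid"
    INVALID = "invalid"
    IGNORED = "ignored"


# Deliberately simple patterns (assumption); real extraction is format-aware.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def detect(text: str) -> List[str]:
    """Detection: scrape URLs and email addresses from the input text."""
    urls = URL_RE.findall(text)
    emails = ["mailto:" + m for m in EMAIL_RE.findall(text)]
    return urls + emails


@dataclass
class Overseer:
    check: Callable[[str], bool]                        # stand-in worker: True if reachable
    excludes: List[str] = field(default_factory=list)   # regexes for links to skip
    log: Dict[str, Status] = field(default_factory=dict)

    def run(self, text: str) -> Dict[str, Status]:
        for link in detect(text):
            # Filtering: excluded links are recorded but never dispatched.
            if any(re.search(p, link) for p in self.excludes):
                self.log[link] = Status.IGNORED
                continue
            # Dispatching + logging: ask the worker, then record the outcome.
            self.log[link] = Status.VALID if self.check(link) else Status.INVALID
        return self.log


# Usage with a fake worker that treats any URL containing "broken" as unreachable.
overseer = Overseer(check=lambda url: "broken" not in url, excludes=[r"^mailto:"])
report = overseer.run(
    "See https://example.com and https://example.com/broken or mail me@example.org"
)
for link, status in report.items():
    print(f"{status.value:8} {link}")
```

Keeping the worker as an injected callable mirrors the split described above: the overseer owns detection, filtering, and logging, while the worker only answers whether a URL is reachable.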

We can extend lychee so that it's also a link linter.

Note that the overseer hands the job to its workers, and the workers are only responsible for validating the URLs. The linting should therefore also be done by the overseer.

Conceptually, linting could happen before and/or after the overseer dispatches the job. For example:

- Before: If the user wants to deny a certain pattern in URLs, the overseer doesn't need to send URLs that match the pattern to the workers; instead, it logs them directly.
- After: If the user wants to avoid the use of absolute/relative links, the overseer first verifies whether a link is valid, and if it is, checks whether it's in the desired style before logging.
- Around (before and after): If the user wants to use HTTPS links whenever possible, the overseer first sends an HTTP link to the workers, and if it's valid, the overseer sends its HTTPS counterpart to the workers as well and logs the result accordingly.
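
To make the three lint positions concrete, here is a minimal sketch under stated assumptions. It is not an existing lychee API: `process`, `lint_before`, and the `Checker` type are hypothetical names, the style rule chosen for the "after" case is "relative links are not allowed", and the worker is again a fake function rather than a real HTTP request.

```python
import re
from typing import Callable, List, Optional
from urllib.parse import urljoin, urlparse

Checker = Callable[[str], bool]  # stand-in for a worker: True if the URL is reachable


def lint_before(url: str, deny: List[str]) -> Optional[str]:
    """Before: a denied pattern is reported without contacting any worker."""
    for pattern in deny:
        if re.search(pattern, url):
            return f"denied by pattern {pattern!r}"
    return None


def process(raw: str, base: str, check: Checker, deny: List[str]) -> str:
    url = urljoin(base, raw)  # resolve relative links against the document's base

    finding = lint_before(url, deny)
    if finding:
        return f"lint: {finding}"  # never dispatched

    if not check(url):
        return "invalid"

    # After: the link is valid, so now apply the style rule to its written form.
    if not urlparse(raw).scheme:
        return "lint: relative link"

    # Around: the HTTP link was valid; re-dispatch its HTTPS counterpart.
    if url.startswith("http://"):
        https_url = "https://" + url[len("http://"):]
        if check(https_url):
            return f"lint: HTTPS available, use {https_url}"

    return "valid"


# Fake worker that records every URL it was asked to check.
reachable = {
    "https://example.com/docs",
    "http://example.com/a",
    "https://example.com/a",
    "http://legacy.example.com/",
}
dispatched: List[str] = []


def fake_check(url: str) -> bool:
    dispatched.append(url)
    return url in reachable


links = [
    "https://tracker.example/pixel",
    "docs",
    "http://example.com/a",
    "http://legacy.example.com/",
    "https://example.com/missing",
]
results = {
    raw: process(raw, "https://example.com/", fake_check, [r"tracker\.example"])
    for raw in links
}
for raw, outcome in results.items():
    print(f"{raw} -> {outcome}")
```

Note that the denied URL never reaches the worker, while the HTTPS rule costs one extra dispatch per valid HTTP link.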

untitaker · 2021-11-15T23:00:43Z

This is labelled as design-feedback because it already goes into the nitty-gritty of which code unit does what, but from what I can tell, what is actually proposed here is to expand lychee's feature scope so that it can be used to enforce policies on HTML, such as the already mentioned "need to use HTTPS links everywhere" and "links can't match this pattern".

You can already enforce some basic pattern policies today: use --dump and a wrapper shell script. What can't be done is the HTTPS enforcement in the way you imagine it.
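
As a rough illustration of the wrapper idea, the sketch below filters a list of links against denied patterns. It assumes that the `--dump` output is fed to it with one link per line, which should be verified against the lychee version in use; `find_violations` and `main` are hypothetical names, and the comment above suggests a shell script, for which this Python version is only a stand-in.

```python
import re
import sys
from typing import Iterable, List, Tuple


def find_violations(lines: Iterable[str], deny: List[str]) -> List[Tuple[str, str]]:
    """Return (url, pattern) pairs for every link that matches a denied pattern."""
    compiled = [re.compile(p) for p in deny]
    hits = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines in the dump
        for pattern in compiled:
            if pattern.search(url):
                hits.append((url, pattern.pattern))
                break  # one report per link is enough
    return hits


def main(argv: List[str]) -> int:
    """Read dumped links from stdin; patterns come from the command line."""
    violations = find_violations(sys.stdin, argv[1:])
    for url, pattern in violations:
        print(f"{url}: matches denied pattern {pattern}", file=sys.stderr)
    return 1 if violations else 0  # non-zero exit fails a CI job


# Example on an in-memory dump (assumed format: one link per line).
sample = ["https://example.com/\n", "http://insecure.example.org/page\n", "\n"]
violations = find_violations(sample, [r"^http://"])
print(violations)
```

Ending a script with `sys.exit(main(sys.argv))` would let it sit at the end of a pipe that starts with a `--dump` run.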

Is the list of policies in your OP exhaustive?

mre · 2022-02-04T11:23:03Z

@lebensterben any thoughts?

untitaker · 2022-02-04T11:25:07Z

prior art: https://github.com/wjdp/htmltest

mre · 2022-02-04T11:29:07Z

Link validation is a whole other use-case with a lot of design decisions to consider along the way. We have to be careful to keep the scope manageable. I guess we can commit to the following:

- Enforce HTTPS links: this was added in "Feature request: Check usage of HTTP when HTTPS is available" (#192).
- Allow/block list: see "[Feature] allowlist of URL patterns" (#486). There is a workaround using --dump.

Outside of that, I'd probably defer to other tools (e.g. htmltest that @untitaker mentioned) or workarounds using --dump for now.

untitaker · 2022-02-04T11:35:10Z

I checked the OP again. What you currently cannot do is hook into before/after link traversal for linting, or define your own link extraction logic. --dump is insufficient there.

E.g., you may want to lint a link and, based on the result, decide whether to follow it.

I wonder if the OP meant to build a scripting platform on top of lychee where the user could hook custom logic into any of those stages, and that's why the internals are discussed in so much detail.
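
A scripting surface of that kind might look roughly like the sketch below. Everything here is hypothetical (lychee exposes no such hooks per this thread): `HookedChecker`, the hook signatures, and the fake `check` function are invented purely to show how a "before" hook could decide whether a link is followed at all, and how an "after" hook could attach lint messages to a checked link.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A before-hook returns a reason to skip the link, or None to follow it.
BeforeHook = Callable[[str], Optional[str]]
# An after-hook sees the check result and returns a lint message, or None.
AfterHook = Callable[[str, bool], Optional[str]]


@dataclass
class HookedChecker:
    check: Callable[[str], bool]  # stand-in for the real link check
    before: List[BeforeHook] = field(default_factory=list)
    after: List[AfterHook] = field(default_factory=list)

    def run(self, url: str) -> List[str]:
        for hook in self.before:
            reason = hook(url)
            if reason is not None:
                return [f"skipped: {reason}"]  # the link is never followed
        ok = self.check(url)
        messages = ["ok" if ok else "broken"]
        for hook in self.after:
            message = hook(url, ok)
            if message is not None:
                messages.append(message)
        return messages


# User-supplied logic plugged into both stages.
checker = HookedChecker(check=lambda url: not url.endswith("/404"))
checker.before.append(lambda url: "internal host" if "intranet" in url else None)
checker.after.append(
    lambda url, ok: "prefer https" if ok and url.startswith("http://") else None
)

skipped = checker.run("http://intranet.example/wiki")
linted = checker.run("http://example.com/")
broken = checker.run("https://example.com/404")
print(skipped, linted, broken)
```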

mre added the request-for-comments label Sep 4, 2021

lycheeverse locked and limited conversation to collaborators Dec 19, 2022

mre converted this issue into discussion #880 Dec 19, 2022

This issue was moved to a discussion.
