Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduces external depth (#74) & a few fixes (incl. #69) #146

Merged
merged 2 commits into from
Sep 23, 2021
Merged

Introduces external depth (#74) & a few fixes (incl. #69) #146

merged 2 commits into from
Sep 23, 2021

Conversation

marchellodev
Copy link
Contributor

@marchellodev marchellodev commented Aug 22, 2021

Motivation

A lot of modern websites rely on external domains (usually referred to as cnd domains) for their css, js, images, and other resources. Since SuckIT does not yet support downloading data from external domains (except for the bug when //en.wikipedia.org is treated as a relative path, (which I fixed)), it is impossible to properly download big and complex websites (#74).

Also, this patch fixes panic when trying to parse urls like ///tools.wmflabs.org/, which returns Empty host error. I encountered this trying to download wikipedia. So, I think this PR should also close #69

Notes

I almost have no experience with Rust, and I haven't yet implemented tests for the changes (I'm not really sure what is the best way to do this). So, please look at the code with extra scrutiny :). However, I have tested it on a few websites, and everything seems to work properly.

Also, --edepth (external depth) does not have a shortcut, since -e is used for excluding pattern. I'm not sure how this parameter should be renamed in order for shortcut to exist

@CohenArthur CohenArthur requested a review from Skallwar August 24, 2021 18:23
src/args.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
src/args.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
Copy link
Owner

@Skallwar Skallwar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Some nitpicks here and there. Thanks a lot

README.md Outdated Show resolved Hide resolved
src/args.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
src/scraper.rs Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
@Skallwar Skallwar requested a review from CohenArthur September 6, 2021 21:04
@Skallwar
Copy link
Owner

Skallwar commented Sep 7, 2021

You need to fix the coding style (use rustfmt)

src/args.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
src/scraper.rs Outdated Show resolved Hide resolved
@Skallwar
Copy link
Owner

Skallwar commented Sep 9, 2021

Please rebase on top of master and squash into one or two commit

@Skallwar Skallwar force-pushed the master branch 2 times, most recently from 526a5c5 to 1be1f85 Compare September 10, 2021 12:40
Copy link
Owner

@Skallwar Skallwar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase your branch on top of master and squash your commit into one or 2 commits

@Skallwar
Copy link
Owner

@marchellodev Can you still work on this? If not, I can rebase and add tests for you

@marchellodev
Copy link
Contributor Author

@Skallwar I would really appreciate it! :) I tried to do that a few days ago, but I always stumbled upon some errors. I'm kinda new to git, especially to those sophisticated operations

Thanks again!

Introduces --edepth flag for external domain depth
Transforms urls that start with `//` into full https links for more accurate detection of external links
Fixes panic when url starts with `///`
Readme: Added explanation of the `--edepth` flag
--edepth -> --ext-depth
Refactors code to avoid repetition (normalizing urls)
Improved code documentation (url normalization)
README.md upd: --ext-depth
Minor docs changes (--depth)
rustfmt coding style
Updated `--ext_depth` command docs
Small code documentation improvements

Co-authored-by: CohenArthur <[email protected]>
Co-authored-by: Esteban Blanc <[email protected]>
@codecov
Copy link

codecov bot commented Sep 22, 2021

Codecov Report

Merging #146 (64af90d) into master (1be1f85) will increase coverage by 0.07%.
The diff coverage is 52.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #146      +/-   ##
==========================================
+ Coverage   62.54%   62.62%   +0.07%     
==========================================
  Files          16       17       +1     
  Lines         558      610      +52     
==========================================
+ Hits          349      382      +33     
- Misses        209      228      +19     
Impacted Files Coverage Δ
src/args.rs 0.00% <ø> (ø)
src/disk.rs 0.00% <ø> (ø)
src/scraper.rs 12.28% <0.00%> (-1.54%) ⬇️
src/downloader.rs 72.89% <100.00%> (ø)
tests/external.rs 100.00% <100.00%> (ø)

@Skallwar
Copy link
Owner

@CohenArthur are you ok with this?

Copy link
Collaborator

@CohenArthur CohenArthur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good :D Thanks a lot @marchellodev

@Skallwar Skallwar merged commit 04edd27 into Skallwar:master Sep 23, 2021
@Skallwar
Copy link
Owner

@marchellodev Thanks again, excellent work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic with EmptyHost
3 participants