Bold suggestion: save full content of the page. #318

Open
urza opened this issue Aug 10, 2015 · 14 comments

Comments

@urza

urza commented Aug 10, 2015

I know this is a bold one, but hear me out.

About half of my links that are older than 3 years are broken; the content has disappeared from the web. I discovered this the hard way, of course. Sometimes it doesn't matter, but sometimes the webpage in my bookmarks was really useful. The lifetime of a URL on the internet is, on average, just a few years.

I know there is a discussion about archiving links to archive.org (#307), and I vouch for it; that enhancement would be very useful.

But I have one other possible suggestion: how about saving the full content locally to a filesystem structure? Would it be hard? If we save the text, it would allow fulltext search (not just of links, but also of their content), which would bring Shaarli to a completely new level of usability. I currently do this with Evernote: links that I consider important I save both to my Shaarli bookmarks (because I already have years of records here, so I want to keep it consistent) and to Evernote, which retrieves the whole content of the page, including pictures and media, allows me to organize by notebooks and tags, and even lets me decide how I want to clip the page (full, article, just text). Then I can fulltext search it. If Shaarli could match this somehow, it would be just fantastic.

Maybe a combination of saving the full text of the link to allow fulltext search, and taking a screenshot of the page with something like PhantomJS (http://phantomjs.org/screen-capture.html) to preserve the visual aspect of the page?
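
Something like this rough Python sketch is what I have in mind for the text side, just to illustrate; the function and file names are made up and it only uses the standard library:

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping scripts and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def save_page_text(url, out_path):
    # Fetch the page and store its plain text next to the bookmark,
    # so it can later be searched with grep or any fulltext indexer.
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")
    parser = TextExtractor()
    parser.feed(html)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(parser.parts))

save_page_text("https://example.com/some-article", "some-article.txt")
```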

What do you guys think?

@nicolasdanelon

Very interesting.. +1

@ArthurHoaro
Member

Well... this is a difficult subject, but an interesting feature.

What can be done:

  1. Save the page HTML: unreadable most of the time, so mostly useless.
  2. Save the page + media: we do this in Projet-Autoblog and it could be done with a per-link option, although there we rely on the RSS feed, which is easier to handle than a full page.
  3. Save the readable content: projects like wallabag try to do this. We definitely won't do it here.
  4. Screenshots: also an option, maybe the easiest, but it won't allow fulltext search (see the sketch below).

Any opinion would be welcome here.
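
For option 4, a rough sketch of driving PhantomJS from Python. This assumes phantomjs is installed and that the rasterize.js example script from the PhantomJS screen-capture documentation is available locally; the paths and names are placeholders, not anything Shaarli ships:

```python
import subprocess

def screenshot(url, out_png, rasterize_js="rasterize.js"):
    # Render the page with PhantomJS and write a PNG next to the bookmark.
    # rasterize.js is the example script from the PhantomJS docs.
    subprocess.run(["phantomjs", rasterize_js, url, out_png],
                   check=True, timeout=120)

screenshot("https://example.com/some-article", "some-article.png")
```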

@virtualtam
Member

Some thoughts and related tools:

  • some pages may just be nearly unreadable (programmatically speaking): frames, AJAX, embedded JSON data loaded in JS, CSS3 rendering, etc.
  • archiving the whole page (HTML content) may be way too verbose compared to the actual, relevant information
  • the Firefox Reader View is developing the ability to isolate worthwhile content; there may be libraries available to trim pages a bit
  • the Firefox Resurrect Pages addon allows browsing the major Internet caches
  • Shaarchiver walks a link list and downloads media content (audio, video) with youtube-dl - which supports far more than only YouTube, see the available content extractors (a small usage sketch follows below)
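
For reference, a minimal sketch of driving youtube-dl from Python, assuming the youtube_dl package is installed; the output template and URL are just examples:

```python
import youtube_dl  # pip install youtube_dl

def archive_media(url, out_dir):
    # Download the media behind a bookmarked link; youtube-dl picks the
    # right extractor for hundreds of sites, not only YouTube.
    opts = {"outtmpl": out_dir + "/%(title)s-%(id)s.%(ext)s"}
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download([url])

archive_media("https://vimeo.com/76979871", "archive/media")
```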

@urza
Author

urza commented Aug 11, 2015

Shaarchiver looks nice... does it save the full HTML page content in addition to the media you mention? If so, it solves the problem for me - I could set it up as a cron job or something...

@virtualtam
Member

AFAIK, downloading page content is on the TODO list

@mro

mro commented Aug 11, 2015

-1

@urza, @virtualtam what you describe doesn't fit with "The personal, minimalist, super-fast, no-database delicious clone", so IMHO it should not go into Shaarli, because it is not Shaarli. I'd even go as far as saying it must not.

What you describe is "The personal, ... http://archive.org clone" (not respecting robots.txt) - which is another beast.

BTW, have you seen the support for archive.org in shaarli?

@dimtion

dimtion commented Aug 11, 2015

Are there any downsides to using archive.org? Otherwise I agree with @mro: why bother doing such hard work for something that already exists?

@nodiscc
Member

nodiscc commented Aug 12, 2015

Again, archiving features should remain in a separate tool. Shaarli should provide export formats that make it easy for such tools to parse and work with the data.

shaarchiver is easy enough to use and just relies on Shaarli's HTML export (it could be improved to parse RSS) - though it doesn't support archiving webpages yet, mainly because I'd like to do it right on the first try (I haven't decided whether this will use python-scrapy or an external tool like httrack/wget, it needs better error handling, etc.). Help is welcome.
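
For anyone who wants to experiment in the meantime, a small sketch of pulling the URLs out of Shaarli's HTML export with just the Python standard library (the export file name is an example):

```python
from html.parser import HTMLParser

class BookmarkLinks(HTMLParser):
    """Collects HREF attributes from <A> entries in a bookmark export."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

with open("bookmarks_all.html", encoding="utf-8") as f:
    parser = BookmarkLinks()
    parser.feed(f.read())

for url in parser.urls:
    print(url)  # feed each URL to wget, youtube-dl, etc.
```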

If you want a PHP based tool, there is https://github.com/broncowdd/respawn (didn't try it yet).

Shaarli is not a web scraping system, and I think it's fine this way.

@urza
Author

urza commented Aug 12, 2015

Yes, it makes sense. Keep Shaarli simple and provide a good interface for other tools to do the downloading/archiving... It is a project in itself. I totally agree, let's keep the jobs separate.

@nodiscc .. any time estimate for when Shaarchiver could do this? Personally I would go the wget route... wget has, IMO, solved all these problems; it is a battle-tested solution that can download "full page content", so it fits the philosophy. But I don't know about the other options you mention, maybe they are just as good...
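
Roughly what I have in mind for the wget route, as a sketch; the flags are standard wget mirroring options and the paths are placeholders:

```python
import subprocess

def wget_archive(url, dest_dir):
    # --page-requisites grabs the images/CSS/JS needed to render the page,
    # --convert-links rewrites them for offline browsing,
    # --adjust-extension adds .html where needed,
    # --span-hosts follows requisites hosted on other domains (CDNs).
    subprocess.run([
        "wget",
        "--page-requisites",
        "--convert-links",
        "--adjust-extension",
        "--span-hosts",
        "--timeout=30",
        "--tries=2",
        "--directory-prefix", dest_dir,
        url,
    ], check=False)  # wget returns non-zero on partial failures; don't abort the run

wget_archive("https://example.com/some-article", "archive/some-article")
```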

@nodiscc
Member

nodiscc commented Aug 12, 2015

@urza no clear ETA for shaarchiver, but page archiving should definitely be working before 2016 :) I'm low on free time.

For the record there was an interesting discussion regarding link rot at https://news.ycombinator.com/item?id=8217133 (link rot is a real problem, it is even more obvious on some content types such as multimedia/audio/video works; ~10% of my music/video bookmarks are now broken)

I think it will use wget; I need to figure out a sensible directory organization, and it will take some work to filter/organize the downloaded files (I don't want to download 15GB of ads...).
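
One possible naming scheme, purely as an illustrative sketch: one directory per bookmark, derived from the link date and a slug of the URL, so the layout stays stable even if the page title changes later:

```python
import hashlib
import re

def bookmark_dir(url, date_str):
    # e.g. 20150812_https-example-com-some-article_<8-char-hash>
    slug = re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")[:60]
    short_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return "{}_{}_{}".format(date_str, slug, short_hash)

print(bookmark_dir("https://example.com/some-article", "20150812"))
```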

@mro

mro commented Aug 12, 2015

Maybe blacklists like https://github.com/sononum/abloprox/ or http://wpad.mro.name/ can help? The latter uses a blacklist during proxy configuration. I don't know whether wget can handle these, or if/how the operating system could help here.

@nodiscc
Member

nodiscc commented Aug 13, 2015

@mro the ad-blocking system will likely use dnsmasq and hosts files. I have already collected some relevant lists and conversion tools: https://github.com/nodiscc/shaarchiver/blob/master/ad-hosts.txt https://github.com/Andrwe/privoxy-blocklist/blob/master/privoxy-blocklist.sh. I'm adding abloprox to these. I will need to investigate a bit more - other suggestions are welcome.
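
A small sketch of the hosts-file route, converting one of those lists into dnsmasq address= lines so the listed domains resolve to nowhere (file paths are examples):

```python
def hosts_to_dnsmasq(hosts_path, dnsmasq_conf_path):
    # Turn lines like "0.0.0.0 ads.example.com" into
    # "address=/ads.example.com/0.0.0.0" so dnsmasq blackholes the domain.
    with open(hosts_path, encoding="utf-8") as src, \
         open(dnsmasq_conf_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            parts = line.split()
            if len(parts) >= 2 and parts[0] in ("0.0.0.0", "127.0.0.1"):
                for domain in parts[1:]:
                    if domain != "localhost":
                        dst.write("address=/{}/0.0.0.0\n".format(domain))

hosts_to_dnsmasq("ad-hosts.txt", "dnsmasq-blocklist.conf")
```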

@github-account1111

github-account1111 commented May 6, 2021

It might be worth using a web scraper (akin to the one Evernote uses) rather than a separate client. A faithful webpage copy is always preferable to blindly passing a URL to a server, because:

  • content filtering and ad blocking can mess things up
  • not being logged in can mess things up
  • custom CSS can mess things up

So you might (I would argue will) end up in a situation where what you find in your archive is nothing like what you saw in your browser, especially if you use custom CSS for things like dark mode, or tools like uBlock Origin.

@virtadpt

virtadpt commented May 6, 2021

There is already a plugin which sends the bookmarked link to an arbitrary Wallabag instance.
