Bold suggestion: save full content of the page. #318

Open
urza opened this issue Aug 10, 2015 · 14 comments

Comments

@urza

urza commented Aug 10, 2015

I know this is a bold one, but hear me out.

About half of my links that are older than 3 years are broken; the content has disappeared from the web. I discovered this the hard way, of course. Sometimes it doesn't matter, but sometimes the webpage in my bookmarks was really useful. The lifetime of a URL on the internet is, on average, just a few years.

I know there is a discussion about archiving links to archive.org (#307), and I vouch for it; that enhancement would be very useful.

But I have one other possible suggestion: how about saving the full content locally to a filesystem structure? Would it be hard? If we save the text, it would allow fulltext search (not just of links, but also of their content), which would bring Shaarli to a completely new level of usability. I currently do this with Evernote: links that I consider important I save both to my Shaarli bookmarks (because I already have years of records here, so I want to keep it consistent) and to Evernote, which retrieves the whole content of the page, including pictures and media, allows me to organize by notebooks and tags, and even lets me decide how I want to clip the page (full, article, just text). Then I can fulltext search it. If Shaarli could match this somehow, it would be just fantastic.

Maybe a combination of saving the full text of the link to allow fulltext search, and taking a screenshot of the page with something like PhantomJS (http://phantomjs.org/screen-capture.html) to preserve the visual aspect of the page?
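
Something like this rough Python sketch is what I have in mind for the text side, just to illustrate; the function and file names are made up and it only uses the standard library:

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping scripts and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def save_page_text(url, out_path):
    # Fetch the page and store its plain text next to the bookmark,
    # so it can later be searched with grep or any fulltext indexer.
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")
    parser = TextExtractor()
    parser.feed(html)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(parser.parts))

save_page_text("https://example.com/some-article", "some-article.txt")
```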

What do you guys think?

@nicolasdanelon

Very interesting.. +1

@ArthurHoaro
Member

Well... this is a difficult subject, but an interesting feature.

What can be done:

  1. Save the page HTML: unreadable most of the time, so mostly useless.
  2. Save the page + media: we do this in Projet-Autoblog and it could be done with a per-link option, although there we rely on the RSS feed, which is easier to handle than a full page.
  3. Save the readable content: projects like wallabag try to do this. We definitely won't do it here.
  4. Screenshots: also an option, maybe the easiest, but it won't allow fulltext search (see the sketch below).

Any opinion would be welcome here.
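
For option 4, a rough sketch of driving PhantomJS from Python. This assumes phantomjs is installed and that the rasterize.js example script from the PhantomJS screen-capture documentation is available locally; the paths and names are placeholders, not anything Shaarli ships:

```python
import subprocess

def screenshot(url, out_png, rasterize_js="rasterize.js"):
    # Render the page with PhantomJS and write a PNG next to the bookmark.
    # rasterize.js is the example script from the PhantomJS docs.
    subprocess.run(["phantomjs", rasterize_js, url, out_png],
                   check=True, timeout=120)

screenshot("https://example.com/some-article", "some-article.png")
```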

@virtualtam
Member

Some thoughts and related tools:

  • some pages may just be nearly unreadable (programmatically speaking): frames, AJAX, embedded JSON data loaded in JS, CSS3 rendering, etc.
  • archiving the whole page (HTML content) may be way too verbose compared to the actual, relevant information
  • the Firefox Reader View is developing the ability to isolate worthwhile content; there may be libraries available to trim pages a bit
  • the Firefox Resurrect Pages addon allows browsing the major Internet caches
  • Shaarchiver walks a link list and downloads media content (audio, video) with youtube-dl - which supports far more than only YouTube, see the available content extractors (a small usage sketch follows below)
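
For reference, a minimal sketch of driving youtube-dl from Python, assuming the youtube_dl package is installed; the output template and URL are just examples:

```python
import youtube_dl  # pip install youtube_dl

def archive_media(url, out_dir):
    # Download the media behind a bookmarked link; youtube-dl picks the
    # right extractor for hundreds of sites, not only YouTube.
    opts = {"outtmpl": out_dir + "/%(title)s-%(id)s.%(ext)s"}
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download([url])

archive_media("https://vimeo.com/76979871", "archive/media")
```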

@urza
Author

urza commented Aug 11, 2015

Shaarchiver looks nice... does it save the full HTML page content in addition to the media you mention? If so, it solves the problem for me - I could set it up as a cron job or something...

@virtualtam
Member

AFAIK, downloading page content is on the TODO list

@mro

mro commented Aug 11, 2015

-1

@urza, @virtualtam what you describe doesn't fit with "The personal, minimalist, super-fast, no-database delicious clone", so IMHO it should not go into Shaarli, because it is not Shaarli. I'd even go as far as saying it must not.

What you describe is "The personal, ... http://archive.org clone" (not respecting robots.txt) - which is another beast.

BTW, have you seen the support for archive.org in shaarli?

@dimtion

dimtion commented Aug 11, 2015

Are there any downsides to using archive.org? Otherwise I agree with @mro: why bother doing such hard work for something that already exists?

@nodiscc
Member

nodiscc commented Aug 12, 2015

Again, archiving features should remain in a separate tool. Shaarli should provide export formats that make it easy for such tools to parse and work with the data.

shaarchiver is easy enough to use and just relies on Shaarli's HTML export (it could be improved to parse RSS) - though it doesn't support archiving webpages yet, mainly because I'd like to do it right on the first try (I haven't decided whether this will use python-scrapy or an external tool like httrack/wget, it needs better error handling, etc.). Help is welcome.
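
For anyone who wants to experiment in the meantime, a small sketch of pulling the URLs out of Shaarli's HTML export with just the Python standard library (the export file name is an example):

```python
from html.parser import HTMLParser

class BookmarkLinks(HTMLParser):
    """Collects HREF attributes from <A> entries in a bookmark export."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

with open("bookmarks_all.html", encoding="utf-8") as f:
    parser = BookmarkLinks()
    parser.feed(f.read())

for url in parser.urls:
    print(url)  # feed each URL to wget, youtube-dl, etc.
```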

If you want a PHP based tool, there is https://github.com/broncowdd/respawn (didn't try it yet).

Shaarli is not a web scraping system, and I think it's fine this way.

@urza
Author

urza commented Aug 12, 2015

Yes, it makes sense. Keep Shaarli simple and provide a good interface for other tools to do the downloading/archiving... It is a project in itself. I totally agree, let's keep the jobs separate.

@nodiscc .. any time estimate for when Shaarchiver could do this? Personally I would go the wget route... wget has, IMO, solved all these problems; it is a battle-tested solution that can download "full page content", so it fits the philosophy. But I don't know about the other options you mention, maybe they are just as good...
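
Roughly what I have in mind for the wget route, as a sketch; the flags are standard wget mirroring options and the paths are placeholders:

```python
import subprocess

def wget_archive(url, dest_dir):
    # --page-requisites grabs the images/CSS/JS needed to render the page,
    # --convert-links rewrites them for offline browsing,
    # --adjust-extension adds .html where needed,
    # --span-hosts follows requisites hosted on other domains (CDNs).
    subprocess.run([
        "wget",
        "--page-requisites",
        "--convert-links",
        "--adjust-extension",
        "--span-hosts",
        "--timeout=30",
        "--tries=2",
        "--directory-prefix", dest_dir,
        url,
    ], check=False)  # wget returns non-zero on partial failures; don't abort the run

wget_archive("https://example.com/some-article", "archive/some-article")
```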

@nodiscc
Member

nodiscc commented Aug 12, 2015

@urza no clear ETA for shaarchiver, but page archiving should definitely be working before 2016 :) I'm low on free time.

For the record there was an interesting discussion regarding link rot at https://news.ycombinator.com/item?id=8217133 (link rot is a real problem, it is even more obvious on some content types such as multimedia/audio/video works; ~10% of my music/video bookmarks are now broken)

I think it will use wget; I need to figure out a sensible directory organization, and it will take some work to filter/organize the downloaded files (I don't want to download 15GB of ads...).
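
One possible naming scheme, purely as an illustrative sketch: one directory per bookmark, derived from the link date and a slug of the URL, so the layout stays stable even if the page title changes later:

```python
import hashlib
import re

def bookmark_dir(url, date_str):
    # e.g. 20150812_https-example-com-some-article_<8-char-hash>
    slug = re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")[:60]
    short_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return "{}_{}_{}".format(date_str, slug, short_hash)

print(bookmark_dir("https://example.com/some-article", "20150812"))
```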

@mro

mro commented Aug 12, 2015

Maybe blacklists like https://github.com/sononum/abloprox/ or http://wpad.mro.name/ can help? The latter uses a blacklist during proxy configuration. I don't know whether wget can handle these, or if/how the operating system could help here.

@nodiscc
Member

nodiscc commented Aug 13, 2015

@mro the ad-blocking system will likely use dnsmasq and hosts files. I have already collected some relevant lists and conversion tools: https://github.com/nodiscc/shaarchiver/blob/master/ad-hosts.txt https://github.com/Andrwe/privoxy-blocklist/blob/master/privoxy-blocklist.sh. I'm adding abloprox to these. I will need to investigate a bit more - other suggestions are welcome.
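
A small sketch of the hosts-file route, converting one of those lists into dnsmasq address= lines so the listed domains resolve to nowhere (file paths are examples):

```python
def hosts_to_dnsmasq(hosts_path, dnsmasq_conf_path):
    # Turn lines like "0.0.0.0 ads.example.com" into
    # "address=/ads.example.com/0.0.0.0" so dnsmasq blackholes the domain.
    with open(hosts_path, encoding="utf-8") as src, \
         open(dnsmasq_conf_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            parts = line.split()
            if len(parts) >= 2 and parts[0] in ("0.0.0.0", "127.0.0.1"):
                for domain in parts[1:]:
                    if domain != "localhost":
                        dst.write("address=/{}/0.0.0.0\n".format(domain))

hosts_to_dnsmasq("ad-hosts.txt", "dnsmasq-blocklist.conf")
```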

@github-account1111

github-account1111 commented May 6, 2021

It might be worth using a web scraper (akin to the one Evernote uses) rather than a separate client. A faithful webpage copy is always preferable to blindly passing a URL to a server, because:

  • content filtering and ad blocking can mess things up
  • not being logged in can mess things up
  • custom CSS can mess things up

So you might (I would argue will) end up in a situation where what you find in your archive is nothing like what you saw in your browser, especially if you use custom CSS for things like dark mode, or tools like uBlock Origin.

@virtadpt

virtadpt commented May 6, 2021

There is already a plugin which sends the bookmarked link to an arbitrary Wallabag instance.
