Bold suggestion: save full content of the page. #318
Comments
Very interesting... +1
Well... this is a difficult subject, but an interesting feature. What can be done:
Any opinion would be welcome here.
Some thoughts and related tools:
Shaarchiver looks nice... does it save the full HTML page content in addition to the media you mention? If so, it solves the problem for me... I could set it up in a cron job or something...
AFAIK, downloading page content is on the TODO list.
-1 @urza, @virtualtam: what you describe doesn't fit with "The personal, minimalist, super-fast, no-database delicious clone", so IMHO it should not go into Shaarli, because it is not Shaarli. I'd even go as far as saying it must not. What you describe is "The personal, ... http://archive.org clone" (not respecting robots.txt), which is another beast. BTW, have you seen the support for archive.org in Shaarli?
Are there any downsides to using archive.org? Otherwise I agree with @mro: why bother doing such hard work for something that already exists?
Again, archiving features should remain in a separate tool. Shaarli should provide export formats that make it easy for such tools to parse and work with the data. shaarchiver is easy enough to use and just relies on Shaarli's HTML export (it could be improved to parse RSS), though it doesn't support archiving webpages yet, mainly because I'd like to do it right on the first try (I haven't decided whether this will use python-scrapy or an external tool like httrack/wget, it needs better error handling, etc.). Help is welcome. If you want a PHP-based tool, there is https://github.com/broncowdd/respawn (I haven't tried it yet). Shaarli is not a web scraping system, and I think it's fine this way.
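For illustration, here is a minimal Python sketch of what parsing Shaarli's HTML export could look like for an external archiving tool. It assumes the export follows the classic Netscape bookmark format (`<DT><A HREF="...">title</A>`); the file name `bookmarks.html` is made up:

```python
# Minimal sketch: extract URLs and titles from a Netscape-style bookmark
# export (the format Shaarli's HTML export is assumed to follow here).
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []           # list of (url, title) tuples
        self._current_url = None  # URL of the <a> tag currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_url = dict(attrs).get("href")

    def handle_data(self, data):
        # The first text chunk after an <a> start tag is taken as the title.
        if self._current_url:
            self.links.append((self._current_url, data.strip()))
            self._current_url = None


if __name__ == "__main__":
    parser = BookmarkParser()
    with open("bookmarks.html", encoding="utf-8") as f:  # assumed export file name
        parser.feed(f.read())
    for url, title in parser.links:
        print(url, "-", title)
```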
Yes, it makes sense. Keep Shaarli simple and provide a good interface for other tools to do the downloading/archiving... It is a project in itself. I totally agree, let's keep the jobs separated. @nodiscc, any time estimate for when Shaarchiver could be able to do it? Personally I would go with the wget route... wget has IMO solved all the problems, it is a battle-tested solution that can download "full page content", so it fits the philosophy... but I don't know about the other options you mention, maybe they are just as good...
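As a rough sketch of what the wget route could look like from a small script: the flags below are standard wget options, while the directory layout and the helper name `archive_page` are assumptions for illustration only.

```python
# Rough sketch: download a full page copy (HTML + images, CSS, JS) with wget,
# one directory per bookmarked host. Assumes wget is installed and on the PATH.
import subprocess
from urllib.parse import urlparse


def archive_page(url, dest_root="archive"):
    """Mirror a single page with its requisites into dest_root/<hostname>/."""
    dest = f"{dest_root}/{urlparse(url).hostname}"
    cmd = [
        "wget",
        "--page-requisites",   # fetch images, CSS, JS needed to render the page
        "--convert-links",     # rewrite links so the local copy is browsable
        "--adjust-extension",  # save text/html documents with an .html suffix
        "--span-hosts",        # page requisites are often served from CDNs
        "--timeout=30",
        "--tries=2",
        "--directory-prefix", dest,
        url,
    ]
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    archive_page("https://example.com/some-article")  # hypothetical bookmark
```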
@urza, no clear ETA for shaarchiver, but page archiving should definitely be working before 2016 :) I'm low on free time. For the record, there was an interesting discussion regarding link rot at https://news.ycombinator.com/item?id=8217133 (link rot is a real problem; it is even more obvious for some content types such as multimedia/audio/video works; ~10% of my music/video bookmarks are now broken). I think it will use
Maybe blacklists like https://github.com/sononum/abloprox/ or http://wpad.mro.name/ can help? The latter uses a blacklist during proxy configuration. I don't know whether wget can handle such lists, or how the operating system helps here.
@mro the ad blocking system will likely use
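One possible way to combine such a blacklist with wget is its `--exclude-domains` option. A minimal sketch, assuming a plain "one domain per line" blacklist file (the file name and format are assumptions):

```python
# Minimal sketch: turn a "one domain per line" blacklist into a wget
# --exclude-domains argument so ad/tracker hosts are skipped while mirroring.
import subprocess


def load_blacklist(path="blacklist.txt"):  # assumed file: one domain per line
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f
                if line.strip() and not line.startswith("#")]


def archive_without_ads(url, blacklist_path="blacklist.txt"):
    domains = load_blacklist(blacklist_path)
    cmd = ["wget", "--page-requisites", "--span-hosts"]
    if domains:
        # comma-separated list of domains wget must not fetch from
        cmd += ["--exclude-domains", ",".join(domains)]
    cmd.append(url)
    return subprocess.run(cmd, check=False).returncode
```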
Might be worth using a web scraper (akin to the one Evernote uses) over a separate client. A faithful webpage copy is always preferable to blindly passing a URL to a server because:
So you might (I would argue will) end up in a situation where what you find in your archive is nothing like what you saw in your browser, especially if you tend to use custom CSS for things like dark mode or tools like uBlock Origin.
There is already a plugin which sends the bookmarked link to an arbitrary Wallabag instance.
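For reference, pushing a URL to a Wallabag instance programmatically would look roughly like the sketch below. This is an unverified sketch: the OAuth token endpoint, the `/api/entries.json` path and the parameter names are assumptions to be checked against the Wallabag v2 API documentation, and the `requests` library is used only for brevity.

```python
# Unverified sketch: push a bookmarked URL to a Wallabag v2 instance.
# Endpoint paths and parameter names are assumptions -- check the Wallabag
# API documentation before relying on this.
import requests

WALLABAG = "https://wallabag.example.org"  # hypothetical instance URL


def get_token(client_id, client_secret, username, password):
    # Assumed OAuth2 "password" grant, as described for Wallabag v2.
    resp = requests.post(f"{WALLABAG}/oauth/v2/token", data={
        "grant_type": "password",
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        "password": password,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["access_token"]


def save_link(token, url):
    # Assumed entry-creation endpoint.
    resp = requests.post(f"{WALLABAG}/api/entries.json",
                         headers={"Authorization": f"Bearer {token}"},
                         data={"url": url}, timeout=30)
    resp.raise_for_status()
    return resp.json()
```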
I know this is a bold one, but hear me out.
About half of my links that are older than 3 years are broken; the content has disappeared from the web. I discovered this the hard way, of course. Sometimes it doesn't matter, but sometimes the webpage in my bookmarks was really useful. The lifetime of a URL on the internet is on average just a few years.
I know there is a discussion about archiving links in archive.org (#307) and I vouch for this; that enhancement would be very useful.
But I have one other possible suggestion: how about saving the full content locally to a filesystem structure? Would it be hard? If we save the text, it would allow fulltext search (not just of the links, but also of their content), which would bring Shaarli to a completely new level of usability. I currently do this with Evernote: links that I consider important I save both to my Shaarli bookmarks (because I already have years of records there, so I want to keep it consistent) and to Evernote, which retrieves the whole content of the page, including pictures and media, allows me to organize by notebooks and tags, and even lets me decide how to clip the page (full, article, just text)... then I can fulltext search it... If Shaarli could match this somehow, it would be just fantastic...
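To make the "save the text locally so it can be searched" idea concrete, here is a minimal Python sketch; the `pages/` directory layout and the naive substring search are assumptions for illustration, not a proposal for Shaarli's internals.

```python
# Minimal sketch: fetch a page, strip the HTML tags, store the plain text on
# disk, then search all stored pages for a word (a poor man's fulltext search).
import os
import re
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> blocks to ignore

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def save_page_text(url, dest_dir="pages"):  # assumed storage layout
    os.makedirs(dest_dir, exist_ok=True)
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")
    extractor = TextExtractor()
    extractor.feed(html)
    text = re.sub(r"\s+", " ", " ".join(extractor.chunks)).strip()
    filename = re.sub(r"\W+", "_", url) + ".txt"  # crude URL-to-filename mapping
    with open(os.path.join(dest_dir, filename), "w", encoding="utf-8") as f:
        f.write(text)


def search(term, dest_dir="pages"):
    """Return the stored pages whose text contains the search term."""
    hits = []
    for name in os.listdir(dest_dir):
        with open(os.path.join(dest_dir, name), encoding="utf-8") as f:
            if term.lower() in f.read().lower():
                hits.append(name)
    return hits
```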
Maybe a combination of saving the full text of the link to allow fulltext search, and taking a screenshot of the page with something like PhantomJS (http://phantomjs.org/screen-capture.html) to save the visual aspect of the page?
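For the screenshot part, a rough sketch of driving PhantomJS from Python through Selenium's (since deprecated) PhantomJS driver; it assumes an older Selenium release and a `phantomjs` binary on the PATH.

```python
# Rough sketch: capture a visual snapshot of a page with PhantomJS, driven
# through Selenium's PhantomJS driver (removed in recent Selenium releases).
# Assumes an old selenium package and phantomjs available on the PATH.
from selenium import webdriver


def screenshot(url, dest="screenshot.png"):
    driver = webdriver.PhantomJS()          # headless WebKit browser
    try:
        driver.set_window_size(1280, 1024)  # viewport used for rendering
        driver.get(url)
        driver.save_screenshot(dest)        # writes a PNG of the rendered page
    finally:
        driver.quit()


if __name__ == "__main__":
    screenshot("https://example.com")  # hypothetical bookmark
```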
What do you guys think?