Request for comments: add media/page archiving capabilities to the Python Shaarli client #22

Closed
nodiscc wants to merge 3 commits

Conversation

@nodiscc (Member) commented Jul 12, 2017

Hi, this is not intended to be merged.

I attached my current quick & dirty script to archive music from an export of my Shaarli instance. It's just a bash script, as I needed it quickly, and it currently downloads music, which is what I needed. I'd like to rewrite it in Python, with well thought-out integration with the official client. Consider it a proof of concept for a rewrite of https://github.com/nodiscc/shaarchiver

I'd like some input on how this would be best achieved:

  • How much code separation from the main client? How to properly implement it?
    • Add a separate entry_point to setuptools?
    • Add a --archive-media flag to shaarli?
    • Add an actions = option in config file? Add extractor configuration there?
    • Write a totally separate client and import shaarli-client as a library?
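
For the separate entry_point option, here is a minimal packaging sketch (module paths like shaarli_client.archive are hypothetical, not the client's actual layout):

```python
# setup.py sketch: register a second console script next to the existing
# client. This only illustrates the packaging side of the option above.
from setuptools import setup, find_packages

setup(
    name="shaarli-client",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            "shaarli = shaarli_client.main:main",             # API client
            "shaarli-archive = shaarli_client.archive:main",  # archiving tool
        ],
    },
)
```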

Some notes:

  • The original Shaarli feature request for archiving shaare contents is "Bold suggestion: save full content of the page" (Shaarli#318)
  • There's a brief discussion about content extraction for the Python client in "Packaged REST API client" (Shaarli#745)
  • In "Original Ideas/Fixme page" (Shaarli#106 (comment)) it was suggested that multimedia/page content archiving/mirroring could be added directly as a Shaarli plugin. I think both a CLI archiving tool and a Shaarli plugin have their place (e.g. I want to run the archiving on my laptop; I don't want my webserver/PHP stack to call youtube-dl via exec(); I have a shared host without youtube-dl/wget/... support...)
  • There will inevitably be some feature creep, as there are many use cases for web scraping and web content download in general - which is why I'm dubious about direct integration in the official API client. At first I intend to focus on 1) downloading multimedia content, as it frequently disappears without notice, and 2) generating a friendly offline export of my shaares.
  • --format text is broken for me (invalid option --format). I'll investigate that.

To get a clearer picture, I added a list of current shaarchiver features, as well as features that might reasonably be requested, to the script header. Have a look.

With that in mind, what is the best way to start implementing an archiving tool around the API? (@virtualtam this is for you :) I'd rather not add bloat to the shiny new API client - I think it should stay a clean reference client. On the other hand, well-integrated actions/modules would be interesting.)

Once I have a clearer picture I will start working on a basic implementation, and might as well ping people who were interested in a Shaarli archiving tool.

Again, there is no rush :) ETA: sometime in 2018. I'd like to work on polishing the API client first, add some tests, etc.

Edits:

… a shaarli proof of concept for an API-based rewrite of https://github.com/nodiscc/shaarchiver
TODO: define desired features, and how to integrate it with python-shaarli-client
@virtualtam (Member)

Hi!

Here are some first thoughts :)

How much code separation from the main client? How to properly implement it?

Let's start simple:

  • keep a single codebase
  • leverage setuptools dependency management to specify optional features tied to 3rd-party dependencies
  • add a subcommand parser dedicated to data archival

IMO these operations should be performed separately:

  • query a Shaarli instance to get a list of links
  • parse a list of links and retrieve/archive corresponding media

In the long run, we'll see whether more granularity is needed to keep the sources and CLI usage consistent.
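
As a hedged illustration of the setuptools idea above, archival support could be declared as an optional extra (package and dependency names are assumptions):

```python
# setup.py sketch: archival support as an optional extra, so the core client
# keeps no hard dependency on download tools.
from setuptools import setup, find_packages

setup(
    name="shaarli-client",
    packages=find_packages(),
    install_requires=["requests"],
    extras_require={
        # installed with: pip install shaarli-client[archive]
        "archive": ["youtube_dl"],
    },
)
```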

Add extractor configuration there [in a config file]?

Archival preferences could be specified in a config file:

  • local archive directories
  • multimedia preferences, e.g. audio & video formats
  • ...
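
For example, a purely hypothetical snippet (no such section exists yet; the option names are made up for illustration):

```ini
; hypothetical [archive] section in the client configuration file
[archive]
outdir = ~/archive/shaarli
audio_format = ogg
video_format = mp4
download_videos = true
download_pages = false
```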

There will inevitably be some feature creep, as there are many use cases for web scraping and web content download in general

As with the current REST client, 3rd-party integrations should be implemented as a library, with a console entrypoint that can serve as a Minimal Working Example in case someone wants to customize data retrieval and/or processing.
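
A minimal sketch of that shape, with hypothetical module and function names: the archival logic lives in a library function, and the console entrypoint is just a thin wrapper that doubles as the example:

```python
# shaarli_client/archive.py (hypothetical): library first, thin CLI second.
import argparse

def archive_media(url, outdir):
    """Library function: retrieve one URL into outdir (placeholder body)."""
    print(f"would archive {url} into {outdir}")

def main():
    """Console entrypoint doubling as a Minimal Working Example."""
    parser = argparse.ArgumentParser(description="Archive a single URL")
    parser.add_argument("url")
    parser.add_argument("--outdir", default="archive")
    args = parser.parse_args()
    archive_media(args.url, args.outdir)

if __name__ == "__main__":
    main()
```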

multimedia/page content archiving/mirroring could be added directly as a Shaarli plugin
[...]
I don't want my webserver/PHP stack to exec() call youtube-dl, I have a shared host without youtube-dl/wget/... support...)

The archival tool could be wrapped in a web (micro)service providing a REST API, that would be called by the corresponding Shaarli plugin.
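
A rough sketch of that idea, assuming Flask and youtube-dl purely for illustration: the Shaarli plugin would POST a URL to the service, which runs the download outside the PHP stack.

```python
# Hypothetical archival microservice; endpoint and payload are illustrative.
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/archive", methods=["POST"])
def archive():
    payload = request.get_json(silent=True) or {}
    url = payload.get("url")
    if not url:
        return jsonify({"error": "missing 'url'"}), 400
    # Delegate the actual retrieval to youtube-dl running on this host.
    result = subprocess.run(["youtube-dl", url], capture_output=True, text=True)
    return jsonify({"url": url, "returncode": result.returncode})

if __name__ == "__main__":
    app.run(port=8080)
```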

@nodiscc (Member, Author) commented Oct 23, 2017

I've been thinking about this lately. I can't figure out how to add a subcommand parser that would run a function doing the following: 1. call get-links with the specified parameters, 2. write the output to a JSON file, 3. parse the file and run archival methods on the link list. The command line would be something like

shaarli archive-links --limit=200 --tags=something --outdir=archive/.
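
For illustration, a rough sketch of those three steps (this assumes direct calls to the Shaarli REST API GET /links endpoint with a pre-signed JWT, and youtube-dl as the archival method; it is not the actual python-shaarli-client code):

```python
# Hypothetical helpers for: 1. fetch links, 2. dump them to JSON, 3. archive them.
import json
import subprocess
import requests

def fetch_links(base_url, jwt_token, limit=200, tags=None):
    # 1. query the Shaarli REST API for a list of links
    params = {"limit": limit}
    if tags:
        params["searchtags"] = tags
    resp = requests.get(
        f"{base_url}/api/v1/links",
        headers={"Authorization": f"Bearer {jwt_token}"},
        params=params,
    )
    resp.raise_for_status()
    return resp.json()

def archive_links(links, outdir):
    # 2. write the raw link list to a JSON file
    with open(f"{outdir}/links.json", "w") as fp:
        json.dump(links, fp, indent=2)
    # 3. run an archival method (here: youtube-dl) on each link
    for link in links:
        subprocess.run(
            ["youtube-dl", "--output", f"{outdir}/%(title)s.%(ext)s", link["url"]]
        )
```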

I can't simply add archive-links to endpoints, since those specifically correspond to Shaarli API endpoints.

All in all I'm thinking about starting a separate project that would depend on python-shaarli-client, but maybe you could point me to the right way of adding that subcommand parser?

@virtualtam (Member)

Suggestions:

  1. rename the current script to shaarli-api and add new scripts, e.g. shaarli-archive
  2. move API commands to an api subparser, and declare other subparsers for specific actions:
    • $ shaarli api <params>
    • $ shaarli archive <params>
    • $ shaarli <action> <params>

Option 2 seems more consistent: it provides a single entrypoint with action-specific subparsers, while keeping a single project/package to gather Shaarli archival tools.
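
A hedged argparse sketch of option 2 (subcommand and flag names are illustrative, not a final CLI):

```python
# Single "shaarli" entrypoint with action-specific subparsers.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="shaarli")
    actions = parser.add_subparsers(dest="action", required=True)

    # existing REST API commands would move under `shaarli api ...`
    api = actions.add_parser("api", help="query a Shaarli instance")
    api.add_argument("endpoint", help="e.g. get-links, get-info")

    # archival commands would live under `shaarli archive ...`
    archive = actions.add_parser("archive", help="archive linked content")
    archive.add_argument("--tags")
    archive.add_argument("--limit", type=int, default=100)
    archive.add_argument("--outdir", default="archive")

    return parser

if __name__ == "__main__":
    print(build_parser().parse_args())
```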

@virtualtam (Member)

@nodiscc there's also the possibility of providing an interactive CLI entrypoint using the click library (possibly overkill but potentially quite fun to write :) )
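
For what it's worth, a tiny illustrative sketch of what a click-based entrypoint could look like (not part of any existing code):

```python
import click

@click.group()
def cli():
    """Shaarli command-line tools."""

@cli.command()
@click.option("--tags", default=None, help="only archive links carrying these tags")
@click.option("--outdir", default="archive", show_default=True)
def archive(tags, outdir):
    """Archive media from shaared links (placeholder)."""
    click.echo(f"would archive tags={tags} into {outdir}")

if __name__ == "__main__":
    cli()
```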

@nodiscc (Member, Author) commented Nov 6, 2017

Hi, I wrote a small patch to implement an --outfile command-line parameter; it got me up to speed, and I now have a clearer picture of how to implement basic shaarli api / shaarli archive ... command-line logic (thanks for your comment, it put me on the right track).

I'll run the final tests (Python SSL warnings also led me to finally ditch my server's self-signed certs and set up Let's Encrypt) and send a PR soon. It took me a while to pass the CI tests :)
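
(For context, a minimal sketch of the --outfile idea, not the actual patch: write the JSON response to a file when the flag is given, otherwise print it.)

```python
import json

def output_response(response_data, outfile=None):
    """Dump the JSON API response to outfile if given, else to stdout."""
    text = json.dumps(response_data, indent=2)
    if outfile:
        with open(outfile, "w") as fp:
            fp.write(text)
    else:
        print(text)
```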

Edit: regarding the interactive interface: I'm more interested in the scripted/automated aspect of this tool right now, but I've always wanted to look into python-click. Maybe someday :)

@nodiscc (Member, Author) commented Nov 16, 2017

Moved to #24

@nodiscc closed this Nov 16, 2017