Request for comments: add media/page archiving capabilities to the Python Shaarli client #22

Closed
nodiscc wants to merge 3 commits

Conversation

@nodiscc (Member) commented Jul 12, 2017

Hi, this is not intended to be merged.

I attached my current quick & dirty script to archive music from an export of my Shaarli instance. It's just a bash script, as I needed it quickly, and it currently downloads music, which is what I needed. I'd like to rewrite it in Python, with well thought-out integration with the official client. Consider it a proof of concept for a rewrite of https://github.com/nodiscc/shaarchiver

I'd like some input on how this would be best achieved:

  • How much code separation from the main client? How to properly implement it?
    • Add a separate entry_point to setuptools?
    • Add a --archive-media flag to shaarli?
    • Add an actions = option in config file? Add extractor configuration there?
    • Write a totally separate client and import shaarli-client as a library?
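
For the separate entry_point option, here is a minimal packaging sketch (module paths like shaarli_client.archive are hypothetical, not the client's actual layout):

```python
# setup.py sketch: register a second console script next to the existing
# client. This only illustrates the packaging side of the option above.
from setuptools import setup, find_packages

setup(
    name="shaarli-client",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            "shaarli = shaarli_client.main:main",             # API client
            "shaarli-archive = shaarli_client.archive:main",  # archiving tool
        ],
    },
)
```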

Some notes:

  • The original Shaarli feature request for archiving shaare contents is "Bold suggestion: save full content of the page" (Shaarli#318)
  • There's a brief discussion about content extraction for the Python client in "Packaged REST API client" (Shaarli#745)
  • In "Original Ideas/Fixme page" (Shaarli#106 (comment)) it was suggested that multimedia/page content archiving/mirroring could be added directly as a Shaarli plugin. I think both a CLI archiving tool and a Shaarli plugin have their place (e.g. I want to run the archiving on my laptop; I don't want my webserver/PHP stack to call youtube-dl via exec(); I have a shared host without youtube-dl/wget/... support...)
  • There will inevitably be some feature creep, as there are many use cases for web scraping and web content download in general - which is why I'm dubious about direct integration in the official API client. At first I intend to focus on 1) downloading multimedia content, as it frequently disappears without notice, and 2) generating a friendly offline export of my shaares.
  • --format text is broken for me (invalid option --format). I'll investigate that.

To get a clearer picture, I added a list of current shaarchiver features, as well as features that might reasonably be requested, to the script header. Have a look.

With that in mind, what is the best way to start implementing an archiving tool around the API? (@virtualtam this is for you :) I'd rather not add bloat to the shiny new API client - I think it should stay a clean reference client. On the other hand, well-integrated actions/modules would be interesting.)

Once I have a clearer picture I will start working on a basic implementation, and might as well ping people who were interested in a Shaarli archiving tool.

Again, there is no rush :) ETA: sometime in 2018. I'd like to work on polishing the API client first, add some tests, etc.

Edits:

… a shaarli proof of concept for an API-based rewrite of https://github.com/nodiscc/shaarchiver
TODO: define desired features, and how to integrate it with python-shaarli-client
@virtualtam (Member)

Hi!

Here are some first thoughts :)

How much code separation from the main client? How to properly implement it?

Let's start simple:

  • keep a single codebase
  • leverage setuptools dependency management to specify optional features tied to 3rd-party dependencies
  • add a subcommand parser dedicated to data archival

IMO these operations should be performed separately:

  • query a Shaarli instance to get a list of links
  • parse a list of links and retrieve/archive corresponding media

In the long run, we'll see whether more granularity is needed to keep the sources and CLI usage consistent.
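
As a hedged illustration of the setuptools idea above, archival support could be declared as an optional extra (package and dependency names are assumptions):

```python
# setup.py sketch: archival support as an optional extra, so the core client
# keeps no hard dependency on download tools.
from setuptools import setup, find_packages

setup(
    name="shaarli-client",
    packages=find_packages(),
    install_requires=["requests"],
    extras_require={
        # installed with: pip install shaarli-client[archive]
        "archive": ["youtube_dl"],
    },
)
```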

Add extractor configuration there [in a config file]?

Archival preferences could be specified in a config file:

  • local archive directories
  • multimedia preferences, e.g. audio & video formats
  • ...
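
For example, a purely hypothetical snippet (no such section exists yet; the option names are made up for illustration):

```ini
; hypothetical [archive] section in the client configuration file
[archive]
outdir = ~/archive/shaarli
audio_format = ogg
video_format = mp4
download_videos = true
download_pages = false
```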

There will inevitably be some feature creep, as there are many use cases for web scraping and web content download in general

As with the current REST client, 3rd-party integrations should be implemented as a library, with a console entrypoint that can serve as a Minimal Working Example in case someone wants to customize data retrieval and/or processing.
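
A minimal sketch of that shape, with hypothetical module and function names: the archival logic lives in a library function, and the console entrypoint is just a thin wrapper that doubles as the example:

```python
# shaarli_client/archive.py (hypothetical): library first, thin CLI second.
import argparse

def archive_media(url, outdir):
    """Library function: retrieve one URL into outdir (placeholder body)."""
    print(f"would archive {url} into {outdir}")

def main():
    """Console entrypoint doubling as a Minimal Working Example."""
    parser = argparse.ArgumentParser(description="Archive a single URL")
    parser.add_argument("url")
    parser.add_argument("--outdir", default="archive")
    args = parser.parse_args()
    archive_media(args.url, args.outdir)

if __name__ == "__main__":
    main()
```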

multimedia/page content archiving/mirroring could be added directly as a Shaarli plugin
[...]
I don't want my webserver/PHP stack to exec() call youtube-dl, I have a shared host without youtube-dl/wget/... support...)

The archival tool could be wrapped in a web (micro)service providing a REST API, that would be called by the corresponding Shaarli plugin.
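
A rough sketch of that idea, assuming Flask and youtube-dl purely for illustration: the Shaarli plugin would POST a URL to the service, which runs the download outside the PHP stack.

```python
# Hypothetical archival microservice; endpoint and payload are illustrative.
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/archive", methods=["POST"])
def archive():
    payload = request.get_json(silent=True) or {}
    url = payload.get("url")
    if not url:
        return jsonify({"error": "missing 'url'"}), 400
    # Delegate the actual retrieval to youtube-dl running on this host.
    result = subprocess.run(["youtube-dl", url], capture_output=True, text=True)
    return jsonify({"url": url, "returncode": result.returncode})

if __name__ == "__main__":
    app.run(port=8080)
```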

@nodiscc (Member, Author) commented Oct 23, 2017

I've been thinking about this lately. I can't figure out how to add a subcommand parser that would run a function doing the following: 1. call get-links with the specified parameters, 2. write the output to a JSON file, 3. parse the file and run archival methods on the link list. The command line would be something like

shaarli archive-links --limit=200 --tags=something --outdir=archive/.
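
For illustration, a rough sketch of those three steps (this assumes direct calls to the Shaarli REST API GET /links endpoint with a pre-signed JWT, and youtube-dl as the archival method; it is not the actual python-shaarli-client code):

```python
# Hypothetical helpers for: 1. fetch links, 2. dump them to JSON, 3. archive them.
import json
import subprocess
import requests

def fetch_links(base_url, jwt_token, limit=200, tags=None):
    # 1. query the Shaarli REST API for a list of links
    params = {"limit": limit}
    if tags:
        params["searchtags"] = tags
    resp = requests.get(
        f"{base_url}/api/v1/links",
        headers={"Authorization": f"Bearer {jwt_token}"},
        params=params,
    )
    resp.raise_for_status()
    return resp.json()

def archive_links(links, outdir):
    # 2. write the raw link list to a JSON file
    with open(f"{outdir}/links.json", "w") as fp:
        json.dump(links, fp, indent=2)
    # 3. run an archival method (here: youtube-dl) on each link
    for link in links:
        subprocess.run(
            ["youtube-dl", "--output", f"{outdir}/%(title)s.%(ext)s", link["url"]]
        )
```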

I can't simply add archive-links to endpoints, since those specifically correspond to Shaarli API endpoints.

All in all I'm thinking about starting a separate project that would depend on python-shaarli-client, but maybe you could point me to the right way of adding that subcommand parser?

@virtualtam (Member)

Suggestions:

  1. rename the current script to shaarli-api and add new scripts, e.g. shaarli-archive
  2. move API commands to an api subparser, and declare other subparsers for specific actions:
    • $ shaarli api <params>
    • $ shaarli archive <params>
    • $ shaarli <action> <params>

Option 2 seems more consistent: it provides a single entrypoint with action-specific subparsers, while keeping a single project/package to gather Shaarli archival tools.
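
A hedged argparse sketch of option 2 (subcommand and flag names are illustrative, not a final CLI):

```python
# Single "shaarli" entrypoint with action-specific subparsers.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="shaarli")
    actions = parser.add_subparsers(dest="action", required=True)

    # existing REST API commands would move under `shaarli api ...`
    api = actions.add_parser("api", help="query a Shaarli instance")
    api.add_argument("endpoint", help="e.g. get-links, get-info")

    # archival commands would live under `shaarli archive ...`
    archive = actions.add_parser("archive", help="archive linked content")
    archive.add_argument("--tags")
    archive.add_argument("--limit", type=int, default=100)
    archive.add_argument("--outdir", default="archive")

    return parser

if __name__ == "__main__":
    print(build_parser().parse_args())
```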

@virtualtam (Member)

@nodiscc there's also the possibility of providing an interactive CLI entrypoint using the click library (possibly overkill but potentially quite fun to write :) )
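
For what it's worth, a tiny illustrative sketch of what a click-based entrypoint could look like (not part of any existing code):

```python
import click

@click.group()
def cli():
    """Shaarli command-line tools."""

@cli.command()
@click.option("--tags", default=None, help="only archive links carrying these tags")
@click.option("--outdir", default="archive", show_default=True)
def archive(tags, outdir):
    """Archive media from shaared links (placeholder)."""
    click.echo(f"would archive tags={tags} into {outdir}")

if __name__ == "__main__":
    cli()
```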

@nodiscc (Member, Author) commented Nov 6, 2017

Hi, I wrote a small patch to implement an --outfile command-line parameter; it got me up to speed, and I now have a clearer picture of how to implement basic shaarli api / shaarli archive ... command-line logic (thanks for your comment, it put me on the right track).

I'll run the final tests (Python SSL warnings also led me to finally ditch my server's self-signed certs and set up Let's Encrypt) and send a PR soon. It took me a while to pass the CI tests :)
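
(For context, a minimal sketch of the --outfile idea, not the actual patch: write the JSON response to a file when the flag is given, otherwise print it.)

```python
import json

def output_response(response_data, outfile=None):
    """Dump the JSON API response to outfile if given, else to stdout."""
    text = json.dumps(response_data, indent=2)
    if outfile:
        with open(outfile, "w") as fp:
            fp.write(text)
    else:
        print(text)
```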

Edit: regarding the interactive interface: I'm more interested in the scripted/automated aspect of this tool right now, but I've always wanted to look into python-click. Maybe someday :)

@nodiscc (Member, Author) commented Nov 16, 2017

Moved to #24

@nodiscc closed this Nov 16, 2017