Add support for arbitrary adapters via a config file #60
Comments
Type: Reply 💬
1. Summary
It would be nice if the configuration file format were not JSON. In my opinion, YAML is the best config format for human editing; TOML is a good alternative.
2. Argumentation
2.1. Summary
2.2. Details
I have already written 9 issues about this problem for other tools. Example; other issues are referenced below.
2.3. Additional link
Thanks.
Counterpoints:
See these screenshots:
I'd love to use a different format, I'm looking at JSON5, but so far I've not really found a worthy replacement.
Maybe GeML + GeML schema could be the solution? :D
Sure, but only if you manage to get your Automatic settings UI editor to work. I'd love to ship an HTML file so I can add a
Type: Reply 💬
1. Replies
1.1. StrictYAML
Did you try StrictYAML? Rust implementation (I didn’t test it).
1.2. YAML problems
I don’t think that YAML (issue) or ruamel.yaml (issue) is ideal, but:
1.3. YAML vs JSON
Simple example:
Extra symbols (1
1.4. TOML popularity
May I ask what this opinion is based on? Currently, on 9 June 2020, GitHub has 156 thousand TOML code results; most matches of
I don’t think that TOML is “very unknown”.
2. Custom configuration format
Some Node.js projects have a cosmiconfig dependency. Users of these projects can have YAML, JSON or JavaScript configuration files (see the official repository for details). Confinode can also support other formats, including TOML. Is something like this possible in Rust? That way, ripgrep-all users could choose their preferred config format themselves. I found (but have not tested) the config-rs Rust repository.
Possibly, it may help. Thanks.
This will be added in 1.0.0 and is already present in 1.0.0-alpha.4.
Since there are now many feature requests for different file formats, many of which do not have corresponding nice and fast Rust libraries, I think the best solution is to allow specifying "custom" preprocessors via a config file.
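To make the idea concrete, here is a minimal sketch of a config-defined preprocessor, written in Python for brevity even though rga itself is Rust. Everything in it is an assumption for illustration: the `custom_adapters` key, the field names (`name`, `extensions`, `binary`, `args`), the choice of JSON, and the stdin-to-stdout calling convention are hypothetical and not a decided format.

```python
import json
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class CustomAdapter:
    """One user-defined preprocessor. All field names are hypothetical."""
    name: str
    extensions: List[str]  # file extensions without the leading dot
    binary: str            # program to spawn
    args: List[str]        # arguments passed to that program

    def matches(self, path: str) -> bool:
        # Select the adapter by (case-insensitive) file extension.
        return Path(path).suffix.lower().lstrip(".") in self.extensions

    def extract(self, data: bytes) -> bytes:
        # Assumed convention: file contents go in on stdin, text comes out on stdout.
        result = subprocess.run(
            [self.binary, *self.args], input=data, capture_output=True, check=True
        )
        return result.stdout


def load_adapters(config_text: str) -> List[CustomAdapter]:
    """Parse the (hypothetical) config document into adapter objects."""
    raw = json.loads(config_text)
    return [CustomAdapter(**entry) for entry in raw["custom_adapters"]]


def find_adapter(adapters: List[CustomAdapter], path: str) -> Optional[CustomAdapter]:
    """Return the first adapter whose extension list matches the path."""
    return next((a for a in adapters if a.matches(path)), None)


# Example config: a toy "extractor" that upper-cases its input, run through the
# current Python interpreter so the sketch needs no external tools.
EXAMPLE_CONFIG = json.dumps({
    "custom_adapters": [{
        "name": "shout",
        "extensions": ["txt"],
        "binary": sys.executable,
        "args": ["-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    }]
})

adapters = load_adapters(EXAMPLE_CONFIG)
adapter = find_adapter(adapters, "notes.TXT")
```

With a structure like this, supporting a new file type would only require a new config entry rather than a new Rust adapter.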
This raises the question of how this would differ from just using rg with the --pre directive directly:
I'm very happy with this feature of rga, since extractors are often very slow, and with the zstd-compressed cache most extractions are both very small and very fast to read, while barely adding any overhead on the initial run. This is hard or impossible to reproduce in a simple extract script (see my original pdfextract.sh).
rga can recurse into archives, and return contents at any depth as a binary stream. The same can be implemented for other things that aren't strictly archives, like a pdf file that contains images, where the images may be searched by a different extractor
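The caching benefit mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not rga's actual cache: it keeps entries in memory rather than on disk, uses `zlib` from the Python standard library as a stand-in for zstd, and the key scheme (adapter name plus content hash) and the `ExtractionCache` name are hypothetical.

```python
import hashlib
import zlib
from typing import Callable, Dict


class ExtractionCache:
    """In-memory stand-in for a compressed extraction cache (hypothetical design)."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def _key(adapter_name: str, data: bytes) -> str:
        # Key on adapter name plus content hash, so a changed file or a
        # different adapter never reuses a stale entry.
        digest = hashlib.sha256()
        digest.update(adapter_name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(data)
        return digest.hexdigest()

    def get_or_extract(
        self, adapter_name: str, data: bytes, extract: Callable[[bytes], bytes]
    ) -> bytes:
        key = self._key(adapter_name, data)
        if key in self._store:
            # Cache hit: decompressing is cheap compared to re-running the extractor.
            return zlib.decompress(self._store[key])
        text = extract(data)
        # zlib here only stands in for zstd, which is not in the standard library.
        self._store[key] = zlib.compress(text)
        return text


# Usage: count how often the (pretend) slow extractor really runs.
calls = []


def slow_extract(data: bytes) -> bytes:
    calls.append(1)
    return data.decode("utf-8").upper().encode("utf-8")


cache = ExtractionCache()
first = cache.get_or_extract("pdf", b"abc", slow_extract)
second = cache.get_or_extract("pdf", b"abc", slow_extract)  # served from cache
```

A plain `--pre` script would have to reimplement this bookkeeping itself, which is the point being made about the difficulty of reproducing it in a simple extract script.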
Future additions that might be possible here (no promises) that will probably not appear in rg core are:
Just as the pdf extractor already adds the page number to the pdftotext output by counting ASCII page-break symbols, there might be some postprocessing steps that could be defined in the config file, so they are implemented in fast Rust without effort on the part of the filetype handler.
From current usage, the extractor is always slow enough that the initialization time is kinda irrelevant, but this might not always be the case:
For example, stuff like tesseract loads neural networks into memory when started, which can be a significant overhead. I think those are evaluated on the CPU, but if there was stuff like GPU-based compute it would be even worse.
It might be useful to add adapters that are more like text-conversion tools (such as removing broken characters (Unicode normalization #26, feature_request(ebooks): kill gremlin characters #46) or changing encodings (UTF16 and possibly other UTF encodings support #5, feature_request(ebooks): non UTF-8 books support #47)) that could then be added as a step before or after the usual adapters
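As a rough sketch of what such configurable pre- and postprocessing steps could look like, the example below chains three small functions: decoding from a non-UTF-8 encoding, Unicode normalization, and page numbering based on form-feed characters (the page-break symbol that pdftotext emits between pages). It is written in Python for brevity; the function names, the step-list idea, and the `Page N: ` prefix format are assumptions for illustration, not rga's implementation.

```python
import unicodedata
from typing import Callable, List

Step = Callable[[str], str]


def decode_bytes(data: bytes, encoding: str = "utf-8") -> str:
    """Pre-step: turn raw bytes in a known encoding into text."""
    return data.decode(encoding, errors="replace")


def normalize_unicode(text: str) -> str:
    """Compose characters (NFC) so that equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def prefix_page_numbers(text: str) -> str:
    """Prefix each non-empty line with its page number; pages are split on form feed."""
    out = []
    for page_no, page in enumerate(text.split("\f"), start=1):
        for line in page.split("\n"):
            if line:
                out.append(f"Page {page_no}: {line}")
    return "\n".join(out)


def run_steps(text: str, steps: List[Step]) -> str:
    """Apply the configured steps in order."""
    for step in steps:
        text = step(text)
    return text


# Usage: a UTF-16 document with a decomposed accent and one page break.
raw = "cafe\u0301 menu\nsoups\n\fdesserts\n".encode("utf-16")
result = run_steps(decode_bytes(raw, "utf-16"), [normalize_unicode, prefix_page_numbers])
```

Because each step is just a text-to-text function, a config file would only need to name the steps and their order.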
The baseline implementation of this should be pretty easy; more features can be added later. The main decisions are the config file format, whether or not to change the existing SpawningFileAdapters to build on top of this, and how to document it.