ELN Developer wishlist #18
PeterKraus
started this conversation in
Ideas
Following on from the very fruitful ELN Roundtable, I have collated the items we identified for the wishlist for the remainder of this WG:
1. The extractor schema must not impose an output schema for the extraction operation (@markus1978)
This is mainly to allow adding custom extractors quickly, without restricting their return schema. Of course, extractors with a stable schema might wish to specify that schema, but this must remain an optional feature. This schema specification is currently out of scope of this WG.
2. The registry must allow selecting extractors that return only metadata (@SteffenBrinckmann)
To prevent overloading downstream consumers with huge amounts of extracted data, it must be possible to filter the matching extractors to those that return only metadata.
3. The API has to handle extraction failures (@kjappelbaum, @NicolasCARPi)
This means returning sensible error values when extraction fails, and allowing for cycling through multiple extractors for the same filetype, if present. Mechanisms for selecting a preferred extractor would be nice, but are currently out of scope of the WG.
4. The filetype schema should be extended to allow filetype matching (@markus1978, @NicolasCARPi)
The goal is to include a "reference implementation" of such a file-to-filetype matching function, illustrating how the filetype hints may be used.
5. The preferred returned object is one JSON per submitted file (@NicolasCARPi)
This implies setting the default output type of extractors to JSON, if possible. Automated chaining of extractors is currently out of scope of this WG.
6. The extractors should be bundled and made available as a Docker image (@markus1978, @SteffenBrinckmann, @NicolasCARPi)
The caveats are that providing a single Docker image might be impossible due to package incompatibilities, and that the framework packages themselves must still be provided to support power users. Ideally, the image could be deployed as a service, but that might be out of scope.
7. The extractors in the framework must be validated and tested (@kjappelbaum)
This implies finding a permanent home to host the infrastructure of the framework, at least for CI & deployment.
8. Development of new extractor and filetype specifications should be easy (@kjappelbaum)
Avoid boilerplate. Design a web form which lets users write extractor and/or filetype YAML files.
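To make items 2, 3 and 5 a bit more concrete, here is a rough sketch of how they could fit together: filter the registry to metadata-only extractors, cycle through the matches until one succeeds, and return one JSON document per file either way. Everything in it is an assumption for illustration only: the `Extractor` record, its `metadata_only` flag, the `select_extractors` and `extract` functions, the toy parsers, and the layout of the returned JSON are all hypothetical and not part of any existing schema or API.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Extractor:
    """Hypothetical registry entry; all field names are illustrative."""
    name: str
    filetypes: Sequence[str]
    metadata_only: bool
    run: Callable[[str], dict]  # takes a file path, returns extracted data


def select_extractors(registry: List[Extractor], filetype: str,
                      metadata_only: bool = False) -> List[Extractor]:
    """Item 2: return matching extractors, optionally only metadata-only ones."""
    return [
        e for e in registry
        if filetype in e.filetypes and (e.metadata_only or not metadata_only)
    ]


def extract(registry: List[Extractor], path: str, filetype: str,
            metadata_only: bool = False) -> str:
    """Items 3 and 5: try each candidate in turn, always return one JSON string."""
    errors = []
    for extractor in select_extractors(registry, filetype, metadata_only):
        try:
            data = extractor.run(path)
        except Exception as exc:  # record the failure, move on to the next one
            errors.append({"extractor": extractor.name,
                           "error": f"{type(exc).__name__}: {exc}"})
            continue
        return json.dumps({"file": path, "status": "ok",
                           "extractor": extractor.name,
                           "data": data, "errors": errors})
    # No candidate succeeded (or none matched): return a sensible error value.
    return json.dumps({"file": path, "status": "failed", "extractor": None,
                       "data": None, "errors": errors})


# Toy extractors standing in for real parsers.
def _broken(path: str) -> dict:
    raise ValueError("unsupported header")


def _meta(path: str) -> dict:
    return {"technique": "XRD"}


def _full(path: str) -> dict:
    return {"technique": "XRD", "points": [1, 2, 3]}


REGISTRY = [
    Extractor("broken-parser", ("xrd",), True, _broken),
    Extractor("meta-parser", ("xrd",), True, _meta),
    Extractor("full-parser", ("xrd",), False, _full),
]

if __name__ == "__main__":
    print(extract(REGISTRY, "scan.xy", "xrd", metadata_only=True))
```

In this sketch the extractors are simply tried in registry order, which sidesteps the question of a preferred extractor. Failures from earlier candidates are kept in the `errors` list even when a later one succeeds, so the caller can still see what went wrong.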
There are many other great ideas that came out of the ELN Roundtable and previous discussions. However, I think the above set is a good selection of low- and medium-hanging fruit that should be possible to deliver by Q4/2023.