New XML plugin #1280
Thanks for the comprehensive analysis!
Yes, it should be able to roundtrip every XML file as completely as possible (see the paper "The essence of XML" by Jerome Simeon and Philip Wadler for some limitations that are inherent in XML). Storing attributes as metadata sounds sensible. Fields of the same name would be stored in an Elektra array. Problematic parts include comments, CDATA, and space normalization. While missing spaces/CDATA are acceptable, it would be great not to lose comments (so we somehow need to escape the attribute "comment" to make it distinguishable from XML comments).
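To make the mapping described above concrete, here is a minimal sketch in Python (not the plugin's actual code). It assumes attributes become metadata of the element's key and repeated sibling elements become entries of one Elektra array. The parent key `user/xml`, the function names, and the returned dictionary layout are made up for illustration; the array index naming (`#0` ... `#9`, `#_10`, ...) follows my understanding of Elektra's array convention. Note that `xml.etree.ElementTree` drops comments by default, which is exactly the roundtrip problem mentioned above.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def array_index(i):
    """Elektra-style array index (assumed convention): '#0'..'#9', '#_10', ..."""
    digits = str(i)
    return "#" + "_" * (len(digits) - 1) + digits


def xml_to_keys(xml_text, parent="user/xml"):
    """Return {key name: (value, metadata dict)} for an XML document.

    'parent' is a hypothetical parent key, not something the plugin defines.
    """
    keys = {}

    def visit(element, name):
        value = (element.text or "").strip()
        # XML attributes are stored as metadata of the element's key.
        keys[name] = (value, dict(element.attrib))
        counts = Counter(child.tag for child in element)
        seen = Counter()
        for child in element:
            if counts[child.tag] > 1:
                # Repeated siblings of the same name become one array.
                index = array_index(seen[child.tag])
                seen[child.tag] += 1
                child_name = f"{name}/{child.tag}/{index}"
            else:
                child_name = f"{name}/{child.tag}"
            visit(child, child_name)

    root = ET.fromstring(xml_text)
    visit(root, f"{parent}/{root.tag}")
    return keys


doc = '<server name="main"><port>80</port><alias>a</alias><alias>b</alias></server>'
keys = xml_to_keys(doc)
for key_name, (value, meta) in keys.items():
    print(key_name, repr(value), meta)
```

The sketch ignores mixed content, CDATA, and namespaces, so it only illustrates the key naming, not a full roundtrip.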
xmltool is using libxml2 and it works quite well. We use quite old APIs; there are better possibilities within libxml2 now. The main advantage would be that we already have experience with it, and it is available in our build-system and on our build-servers.
I have not used it, cannot say much about it. From the website it says "stream-oriented" which usually means that it is much more difficult to work with. Transforming data structures is simpler than building up data structures within callbacks/SAX. The advantage of the streaming approaches ("parse documents that won't fit into memory") does not apply to Elektra.
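As a rough illustration of why callbacks are more work than transforming a tree, the following Python sketch computes the same list of element paths twice: once with a SAX handler that has to maintain its own stack, and once by walking an in-memory tree. The class and function names are hypothetical and only serve the comparison; nothing here is taken from the library under discussion.

```python
import xml.sax
import xml.etree.ElementTree as ET


class PathCollector(xml.sax.ContentHandler):
    """Streaming approach: the handler must track the current position itself."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()


def paths_streaming(xml_text):
    handler = PathCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.paths


def paths_tree(xml_text):
    """Tree approach: the structure is already there, we only walk it."""

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}" if prefix else element.tag
        yield path
        for child in element:
            yield from walk(child, path)

    return list(walk(ET.fromstring(xml_text), ""))


sample = "<a><b/><c><d/></c></a>"
print(paths_streaming(sample))
print(paths_tree(sample))
```

Both functions return the same paths; the difference is that the streaming version needs explicit state, which grows quickly once attributes, text, and arrays have to be handled.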
Is there a benefit from using an incomplete implementation? The dependency is only for the plugin, so if someone wants XML, they should get the full thing.
Sounds fine. You can start with basic functionality and add support for XSD validation, character encodings, and so on later.
Yes, it definitely fixes #1228 but it does not fix #1159. xmltool would still be required to import XML files generated by xmltool. Maybe we could write an XML transformation from our previous XML format to the new one (e.g. using XSLT, which is supported by libxml2); then we could get rid of the whole xmltool plugin (which would then solve #1159).
I would not use an incomplete parser. I would avoid a stream-oriented parser (in the case of Elektra). I do not think that libxml2 could be a wrong decision. You did not mention xerces: it seems to have cmake support, but it seems to lack XSLT (at least a cmake file for xalan is missing?) and has a C++ dependency which is unwanted by many people (but acceptable within a plugin). So both libxml2 and xerces are ok. If you are much faster with xerces, pick it. Otherwise use libxml2.
Yes! Let us start with a minimalistic implementation (but nicely documented, commented, and tested) and add more if needed later.
Just found out that libxml2 is not thread-safe for plugins: it requires one to "call xmlInitParser() in the "main" thread before using any of the libxml2 API (except possibly selecting a different memory allocator)" (see http://xmlsoft.org/threads.html), which cannot be fulfilled by Elektra (we cannot choose from which threads we are called in which order). It might not be a real issue, because any application using libxml2 will call it. Xerces (https://xerces.apache.org/xerces-c/faq-parse-3.html#faq-6), however, does not have this issue at all.
@markus2330 In the meeting we agreed xerces should be fine to use. Do you think it's better to call the plugin simply xerces, xercesc (it's called like that in homebrew and cmake, but I think the trailing c is a bit ugly) or xmlxerces (my preferred way)? With the latter, people who don't know xerces immediately see it has to do with XML processing.
For consistency to I see no point to call it From history perspective naming the json plugin Many people, however, connect Elektra with XML anyway, so maybe we won't have this problem with XML. (EDIT: added last sentence)
…tion, always use relative paths, add special chars to tests ElektraInitiative#1280
The new plugin was merged with #1380. Small fixes were done in 5f7887e and 9ebc79e. In 31a18b8 I tried to add the xerces plugin to the configuration file formats available on the website. Unfortunately, the unit test fails there: https://build.libelektra.org/job/elektra-homepage/169/console
Furthermore there is a warning:
Can you do a quick fix or should we remove xerces from the homepage build again? Maybe it is a simple localization issue? On my computer locally the tests work.
Currently trying to figure it out. Looks like some kind of encoding issue at first glance, though xerces should take care of that... that's why I use those XMLString::transcode functions constantly. I compiled it on Linux (Ubuntu in that case, since I had it lying around) with the cmake options exactly like in the build-homepage script, but then the tests succeed. ASAN shows me some strange errors though, like
Those don't seem to matter; at the end it still says 71 tests succeeded and 0 failed. It made me wonder if there are some kind of memory leaks, but according to valgrind it should be good. Any idea what this is and if it's bad? Back to the character problem: it has to be something else that makes the environment different then, I guess. I saw that it uses xerces in version 3.1.1; on Mac I used 3.1.4 and on Ubuntu I have 3.1.3. So I compiled the old 3.1.1 version of xerces and then tried again, but the tests still succeed with it. Any idea what else could be different or how I can reproduce the build environment better? Or can I somehow enable the logging output for this build/environment? That could maybe already help, though it would be best to have it reproduced locally.
The problem is rather to be found in the locales or similar, try playing around with If that does not work, see the tests in type and range plugin how to enforce some specific locales for the tests.
Yes, ASAN needs to be fixed for the homepage, too. We use ASAN in production mode there.
Strange, usually it aborts on ASAN problems. Maybe it is irrelevant because it is just a note? It does not look like that the fault is in your code, seems like
On the homepage's build server the environment is:
I can easily reproduce the problem with:
If you cannot reproduce it, I can give you access to the build server agent. But it is very slow... not much fun to work there.
Btw. from within a PR you can simply trigger build jobs on the build server. See Simply say
Ah, now I see it. My other environments all use a UTF-8 locale, and the test files are also in UTF-8 and use several special characters. So when it uses the POSIX locale, xerces tries to transcode the characters from UTF-8 to that locale, which then leads to these undefined characters because of the special characters. POSIX is similar to the C locale and basically just English ASCII characters, right? By default it always transcodes between some internal format based on UTF-16 and the platform's locale. It can decode XML files in various encodings to this internal encoding, depending on what the system supports. Most XML files are UTF-8 though, as far as I know. Summing up, to my understanding the correct locale in this case would be what Elektra expects and Elektra delivers, as the xerces plugin only communicates with Elektra and not other parts of the system. So the conversion between C++ strings and the xerces string type is what is affected by the locale. The Elektra API takes the C++ strings, so it dictates what code page we have to deliver here. Doesn't Elektra also just use the local locale, which would make it unable to store such special characters? If Elektra works fine with UTF-8, I could change xerces to always deliver UTF-8 (and in return expect UTF-8) as a possible quick solution.
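The effect described above can be imitated without xerces. The following Python sketch only demonstrates the general problem (UTF-8 bytes pushed through an ASCII-only codec, as a POSIX/C locale would imply) and is not how the plugin or xerces is implemented; the sample text and the `transcode` helper are made up for illustration.

```python
# Sample text with umlauts and Chinese characters (illustrative only).
text = "Grüße, 配置"
utf8_bytes = text.encode("utf-8")


def transcode(data, codec):
    """Decode bytes with the given codec, replacing undecodable bytes."""
    return data.decode(codec, errors="replace")


# An ASCII-only codec cannot represent the special characters, so every
# non-ASCII byte turns into the replacement character U+FFFD.
lossy = transcode(utf8_bytes, "ascii")

# Using UTF-8 on both sides roundtrips the text unchanged.
lossless = transcode(utf8_bytes, "utf-8")

print(lossy)
print(lossless)
```

This is the reason a fixed UTF-8 conversion, independent of the system locale, gives the same result on every machine.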
Yes, it is fine if you only use UTF-8 to read and write XML files. In a later version we might add that the locale that is written in the XML file header is preserved (similar to rootname). It does not make much sense to use the system's locales for configuration files because configuration files should be shareable between users and systems with different locales. We have the iconv plugin to convert between different encodings. But I think this is hardly used these days, UTF-8 is ubiquitous.
…n using a POSIX locale for the system ElektraInitiative#1280
https://www.libelektra.org/conversion now contains the xerces plugin. Seems like there are some issues regarding umlauts, and it also fails at the output format for quite a few KeySets. Did you try using it with XML configuration files? In particular, Jenkins config files would be interesting.
…me “real world” tests ElektraInitiative#1280
…ts with the same parent (map to elektra array) ElektraInitiative#1280
… in arrayFilter of array.c ElektraInitiative#1280
So after some fixes it seems to work quite well for the few things I tried (a Maven pom file, some Jenkins config file I found on the internet including Chinese characters, some random umlauts). I always converted from xerces to xerces again to see if the content stays the same apart from the format. Can I have one of our Jenkins files from the build server (a complex one if we have one) or get access to the build server so I can look it up myself? That would be another interesting test case to add. If you have some examples where it still fails, please show them so I can figure out what's going on. I checked a few snippets from the homepage and a lot of failures can be traced back to invalid characters, but it is mentioned in the README that there is no direct support for that, e.g. One idea is to ignore attributes/elements with invalid names, maybe with a plugin configuration option for whether this behavior is wanted. Also I changed some of my snippets to xerces instead of line, works well:
Thank you, looks like the xml plugin is getting great! I uploaded our jenkins.xml in fc4b583
Yes, we should drop everything starting with
Hmm, I just checked it and for me apiToken and passwordHash are equivalent after importing and exporting again. Basically the file looks quite the same judging from the content.
Can I add our Jenkins file as another unit test despite it including the API key and password hash, or should we rather keep those secret?
The issue I see with that is that plugins might depend on their metadata for serialization/deserialization, so they can't be stripped beforehand. But we could at least drop all metadata of other plugins (e.g. for xerces export drop everything except internal/xerces nodes) so that our plugins only have to care about their metadata, which they should be aware of anyway.
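A minimal sketch of that filtering idea, in Python rather than in the plugin's own code: before export, every metadata entry that does not belong to the exporting plugin is dropped. The prefix `internal/xerces/`, the other metadata names, and the function name are assumptions for illustration only; Elektra's real metadata layout may differ.

```python
from typing import Dict


def keep_own_metadata(metadata: Dict[str, str], own_prefix: str) -> Dict[str, str]:
    """Return only the metadata entries whose name starts with own_prefix."""
    return {
        name: value
        for name, value in metadata.items()
        if name.startswith(own_prefix)
    }


# Hypothetical metadata attached to one key by several plugins.
meta = {
    "internal/xerces/rootname": "project",
    "internal/ini/order": "3",
    "comment": "set by the user",
}

filtered = keep_own_metadata(meta, "internal/xerces/")
print(filtered)
```

With such a filter, a plugin would only ever see entries it wrote itself, at the cost of losing everything else (including user-visible metadata such as comments) on export.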
Sorry, with "realistic" I mean it is as found on the build server, not that there are any issues with your plugin. apiToken and passwordHash were modified to keep them secret.
It is already published, so yes: you can use it in a unit test.
It would not be internal then. If it contains relevant information, it should be stored somewhere else.
…dd root element array limitation note to readme ElektraInitiative#1451
As the current XML plugin is rather limited and fails to parse several valid XML files (e.g. Maven snippets when uploaded to the snippet sharing), it is a good idea to replace it with a more robust implementation. To avoid carrying over problems of the old xmltoolkit codebase, we'd recreate it from scratch.
I guess this plugin should aim to be as general as possible? The XML elements can probably be translated to an Elektra hierarchy like
is
A quick search for C XML libraries showed the following possibilities:
I estimate a few hours to develop a simple version, e.g. 5-10, given I have no plugin experience yet.
This should solve the following related issues:
So which XML library do you think we should use? And do you agree with the requirements?