Skip to content
This repository has been archived by the owner on Oct 15, 2024. It is now read-only.

New XML plugin #1280

Closed
e1528532 opened this issue Jan 15, 2017 · 16 comments
Closed

New XML plugin #1280

e1528532 opened this issue Jan 15, 2017 · 16 comments
Assignees

Comments

@e1528532
Copy link
Contributor

As the current xml plugin is rather limited and fails to parse several valid xml files (e.g. maven snippets when uploaded to the snippet sharing) it is a good idea to replace it with a more robust implementation. To avoid carrying over problems of the old xmltoolkit codebase, we'd recreate it from scratch.

i guess this plugin should aim to be as general as possible? The xml elements can probably be translated to an elektra hierarchy like

<test>
    <element1 attr1="test" />
    <element2>
        <element3>text</element3>
    </element2>
</test>

is

/test/element1/attr1 = "test" // store as metadata
/test/element2/element3 = "text"

A quick research on C xml libraries showed the following possibilities:

  • libxml2 http://www.xmlsoft.org/examples/index.html
    • seems to support a variety of systems
    • actively maintained and backed by the GNOME project
    • no dependencies, just ansi c
    • MIT licence
  • expat http://expat.sourceforge.net/
    • also seems to support a lot of systems
    • also seems to be actively maintained
    • no information wheter its ansi c found yet but apparently also used by some known OSS projects like apache webserver or mozilla
    • MIT licence
  • mini xml http://www.msweet.org/projects.php?Z3
    • also seems to be actively maintained
    • only ansi c needed, thus also works on a variety of systems
    • LGPL licence

I estimate a few hours to develop a simple version e.g. 5-10, given i have no plugin experience yet.

This should solve the following related issues:

So what do you think about which xml library to use? And do you agree with the requirements?

@markus2330
Copy link
Contributor

Thanks for the comprehensive analysis!

i guess this plugin should aim to be as general as possible?

Yes, it should be able to completely roundtrip every XML file as far as possible (see the paper of "The essence of XML" from Jerome Simeon and Philip Wadler for some limitations that are inherent in XML).

Storing attributes as meta data sounds sensible. Fields of the same name would be stored in an Elektra array. Problematic parts include comments, CDATA, and space normalization. While missing spaces/CDATA are acceptable, it would be great to not lose comments (so we somehow need to escape the attribute "comment" to be distinguishable from XML comments).

libxml2

xmltool is using libxml2 and it works quite well. We use quite old APIs; there are better possibilities within libxml2 now. The main advantage would be that we already have experience with it, and it is available in our build-system and on our build-servers.

expat

I have not used it, cannot say much about it. From the website it says "stream-oriented" which usually means that it is much more difficult to work with. Transforming data structures is simpler than building up data structures within callbacks/SAX. The advantage of the streaming approaches ("parse documents that won't fit into memory") does not apply to Elektra.

mini xml

Is there a benefit from using an incomplete implementation? The dependency is only for the plugin, so if someone wants XML, they should get the full thing.

I estimate a few hours to develop a simple version e.g. 5-10, given i have no plugin experience yet.

Sounds fine. You can start with basic functionality and add support for XSD validation, character encodings, and so on later.

This should solve the following related issues.

Yes, it definitely fixes #1228 but it does not fix #1159. xmltool would still be required to import xml files generated by xmltool. Maybe could write an XML transformation from our previous XML format to the new one (e.g. using XSLT, which is supported by libxml2), then we could get rid of the whole xmltool plugin (which then would solve #1159).

So what do you think about which xml library to use?

I would not use an incomplete parser. I would avoid a stream-oriented parser (in the case of Elektra). I do not think that libxml2 could be a wrong decision. You did not mention xerces: it seems to have cmake support, but it seems to lack XSLT (and least a cmake file for xalan is missing?) and has a C++ dependency which is unwanted by many people (but acceptable within a plugin).

So both libxml2 and xerces are ok. If you are much faster with xerces, pick it. Otherwise use libxml2.

And do you agree with the requirements?

Yes! Let us start with a minimalistic implementation (but nicely documented, commented, and tested) and add more if needed later.

markus2330 pushed a commit that referenced this issue Jan 15, 2017
@markus2330
Copy link
Contributor

markus2330 commented Jan 15, 2017

Just found out that libxml2 is not thread-safe for plugins: It requires to "call xmlInitParser() in the "main" thread before using any of the libxml2 API (except possibly selecting a different memory allocator)" (see http://xmlsoft.org/threads.html), which cannot be fulfilled by Elektra (we cannot choose from which threads we are called in which order).

It might be not a real issue, because any application using libxml2 will call xmlInitParser anyway quite early. And maybe the requirement is outdated anyway (I cannot think of a sane implementation that really requires that it must be the main thread. Maybe they wanted to say that concurrent calls of xmlInitParser call for trouble). So we should ask at the mailing list/issue tracker if they really do not want to support plugins. See https://bugzilla.gnome.org/show_bug.cgi?id=704439 and https://bugzilla.gnome.org/show_bug.cgi?id=704905

Xerces https://xerces.apache.org/xerces-c/faq-parse-3.html#faq-6 however, does not have this issue at all.

e1528532 pushed a commit to e1528532/libelektra that referenced this issue Jan 15, 2017
@markus2330 markus2330 mentioned this issue Jan 23, 2017
5 tasks
@e1528532 e1528532 mentioned this issue Feb 1, 2017
27 tasks
@e1528532
Copy link
Contributor Author

@markus2330 In the meeting we agreed xerces should be fine to use. Do you think its better to call the plugin simply xerces, xercesc (its called like that in homebrew and cmake, but i think the trailing c is a bit ugly) or xmlxerces (my preferred way), as with the latter you immediately see it has to do with xml processing, for people that don't know xerces.

@markus2330
Copy link
Contributor

markus2330 commented Feb 23, 2017

For consistency to yajl, simply xerces would be a good option. If you say that your plugin provides the format xml, simply mounting xml will pick xerces: so people do not need to know what xerces is.

I see no point to call it xercesc. xmlxerces would work, but its longer than necessary. But maybe you are right, for me its so obvious that xerces does XML (and is not about blue butterflies) that I am maybe not the right person to ask.

From history perspective naming the json plugin yajl maybe wasn't the best idea, from time to time people think Elektra does not have JSON support. But I do not know if the reason is the name yajl or that we basically have no advertising for it.

Many people, however, connect Elektra with XML anyway, so maybe we won't have this problem with XML.

(EDIT: added last sentence)

e1528532 added a commit to e1528532/libelektra that referenced this issue Feb 28, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Feb 28, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 3, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 3, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 3, 2017
…tion, always use relative paths, add special chars to tests ElektraInitiative#1280
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 4, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 27, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 27, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 27, 2017
…tion, always use relative paths, add special chars to tests ElektraInitiative#1280
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 27, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 29, 2017
…tion, always use relative paths, add special chars to tests ElektraInitiative#1280
e1528532 added a commit to e1528532/libelektra that referenced this issue Mar 29, 2017
@markus2330
Copy link
Contributor

The new plugin was merged with #1380.

Small fixes were done in 5f7887e and 9ebc79e

In 31a18b8 I tried to add the xerces plugin to be also part of the configuration file formats available on the website. Unfortunately, the unit test fails there:

https://build.libelektra.org/job/elektra-homepage/169/console

 83/157 Test  #83: testmod_xerces .....................***Failed    0.93 sec
XERCES    TESTS
==================

test basics
test basics finished
test simple read
/home/jenkins/workspace/workspace/elektra-homepage/src/plugins/xerces/testmod_xerces.c:103: error in test_simple_read: sp��ci��l_-ke��s1 key not found
/home/jenkins/workspace/workspace/elektra-homepage/src/plugins/xerces/testmod_xerces.c:129: error in test_simple_read: value not correct
/home/jenkins/workspace/workspace/elektra-homepage/src/plugins/xerces/testmod_xerces.c:132: error in test_simple_read: no metadata exists
test simple read finished
test simple write
/home/jenkins/workspace/workspace/elektra-homepage/src/plugins/xerces/testmod_xerces.c:174: error in test_simple_write: call to kdbSet was not successful
/home/jenkins/workspace/workspace/elektra-homepage/tests/cframework/tests.c:165: fatal in compare_line_files: could not open file, orig: xerces/escaping.xml gen: xerces/escaping-gen.xml

scripts/build-homepage describes how it is built there.

Furthermore there is a warning:

/home/jenkins/workspace/workspace/elektra-homepage/src/plugins/xerces/serializer.cpp:39:60: warning: unused parameter 'name' [-Wunused-parameter]
 void key2xml (DOMDocument & doc, DOMElement & elem, string name, Key const & key)
                                                            ^

Can you do a quick fix or should we again remove xerces from the homepage build.

Maybe it is a simple localization issue? On my computer locally the tests work.

@e1528532
Copy link
Contributor Author

e1528532 commented Apr 1, 2017

Currently trying to figure it out. looks like some kind of encoding issue on a first glance, though xerces should take care about that... thats why i use those XMLChar::transcode functions constantly. I compiled it on a linux (ubuntu in that case since i had it lying around) with the cmake options exactly like in the build-homepage script, but then the tests succeed. Asan shows me some strange errors though, like

/home/armin/git/libelektra/src/plugins/xerces/deserializer.cpp:152:57: runtime error: member call on address 0x613000007c90 which does not point to an object of type 'DOMDocument'
0x613000007c88: note: object is base class subobject at offset 8 within object of type 'xercesc_3_1::DOMDocumentImpl'
 20 60 00 00  08 39 8a b1 3f 7f 00 00  18 3c 8a b1 3f 7f 00 00  60 3d 8a b1 3f 7f 00 00  98 3d 8a b1
              ^                        ~~~~~~~~~~~~~~~~~~~~~~~
                                       vptr for 'xercesc_3_1::DOMDocument' base class of 'xercesc_3_1::DOMDocumentImpl'

Those don't seem to matter, at the end it still says 71 tests succeeded and 0 failed. It made me wonder if there are some kind of memory leaks, but according to valgrind it should be good. Any idea what this is and if its bad?

Back to the character problem, it has to be something else that makes the environment different then i guess. I saw that it uses xerces in version 3.1.1, on mac i used 3.1.4 and on ubuntu i have 3.1.3 . So i compiled the old 3.1.1 version of xerces and then tried again, tests are still succeeding with it. Any idea what else it could be different or how i can reproduce the build environment better? Or can i somehow enable the logging output for this build/environment? that could maybe already help, though it would be the best to have it reproduced locally.

@markus2330
Copy link
Contributor

cmake options

The problem is rather to be found in the locales or similar, try playing around with LC_ALL. Ideally, the implementation would be independent of the currently set locale.

If that does not work, see the tests in type and range plugin how to enforce some specific locales for the tests.

Asan

Yes, ASAN needs to be fixed for the homepage, too. We use ASAN in production mode there.

Those don't seem to matter,

Strange, usually it aborts on ASAN problems. Maybe it is irrelevant because it is just a note? It does not look like that the fault is in your code, seems like auto yields an type ASAN cannot cope with?

the build environment better?

On the homepage's build server the environment is:

> locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

I can easily reproduce the problem with:

LC_ALL="POSIX" ctest -V -R xerces

If you cannot reproduce it, I can give you access to the build server agent. But it is very slow,.. not much fun to work there.

@markus2330
Copy link
Contributor

markus2330 commented Apr 1, 2017

Btw. from within a PR you can simply trigger build jobs on the build server. See doc/GIT.md.

Simply say jenkins build homepage please.

markus2330 pushed a commit that referenced this issue Apr 1, 2017
@e1528532
Copy link
Contributor Author

e1528532 commented Apr 1, 2017

ah, now i see it. my other environments all use an utf8 locale, and the test files are also in utf8 and use several special characters. So when it uses POSIX locale, xerces tries to parse the characters to this from UTF-8, which then leads to these undefined characters because of the special characters. POSIX is similar to the C locale and basically just englisch ASCII characters right?

Per default it always transcodes between some internal format based on UTF-16 and the platform's locale. It can decode XML files of all various formats, depending on what the system supports to this internal encoding. Most XML files are UTF 8 though as far as i know. Summing up to my understanding the correct locale in this case would be what elektra expects and elektra delivers, as the xerces plugin only communicates with elektra and not other parts of the system. So the conversion between c++ strings and the xerces string type is what is affected by the locale. The Elektra API takes the c++ strings, so it dictates what code page we have to deliver here.

Doesn't elektra also just use the local locale, which would make it unable to store such special characters? If elektra works fine with UTF-8, it could change xerces to always deliver UTF-8 (and in return expect UTF-8) as a possible quick solution.

@markus2330
Copy link
Contributor

Yes, it is fine if you only use UTF-8 to read and write XML files. In a later version we might add that the locale that is written in the XML file header is preserved (similar to rootname). It does not make much sense to use the system's locales for configuration files because configuration files should be shareable between users and systems with different locales.

We have the iconv plugin to convert between different encodings. But I think this is hardly used these days, UTF-8 is ubiquitous.

e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 1, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 1, 2017
@markus2330
Copy link
Contributor

https://www.libelektra.org/conversion

now contains the xerces plugin.

Seems like there are some issues regarding umlauts and it also fails at output format for quite some KeySets.

Did you try using it with XML configuration files? In particular jenkins config files would be interesting.

e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 3, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 3, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 3, 2017
@e1528532
Copy link
Contributor Author

e1528532 commented Apr 4, 2017

So after some fixies it seems to work quite well for the few things i tried (maven pom file, some jenkins config file i found on the internet including chinese characters, some random umlauts). Always converted from xerces to xerces again to see if the content stays the same apart from the format.

Can i have one of our jenkins files from the build server (a complex one if we have one) or get access to the build server so i can look it up myself? That would be another interesting test case to add.

If you have some examples where it still fails please show them so i can figure out whats going on. I checked a few snippets from the homepage and a lot of failures can be traced back to invalid characters , but thats mentioned in the README that there is no direct support for that, e.g.
https://www.libelektra.org/entries/details/nanana.batman~2Fgit~2Fv1~2Fpro-test would have an attribute named internal/ini/key/last which is for internal purposes of the ini plugin i guess, but is there a way to filter those plugin-specific keys when exporting? This seems to be added by the ini plugin, so unfortunately it would mean xerces export fails for all of those as there is this additional metakey present which cant be mapped to an xml attribute directly.

One idea is to ignore attributes/elements with invalid names, maybe as a plugin configuration if this behavior is wanted or not.

Also i changed some of my snippets to xerces instead of line, works good:
https://www.libelektra.org/entries/details/org.apache~2Fmaven~2Fv1~2Fjava-8-maven-configuration
https://www.libelektra.org/entries/details/none~2Factivemq~2Fv1~2Factivemq-plugin-maven-configuration

@markus2330
Copy link
Contributor

Thank you, looks like the xml plugin is getting great!

I uploaded our jenkins.xml in fc4b583
It is realistic except of the length (and content) of apiToken and passwordHash.

is there a way to filter those plugin-specific keys when exporting?

Yes, we should drop everything starting with internal/. See doc/METADATA.ini. But it should not be done within every individual plugin but instead in the import/export functionality of libtool. Then we can fix all tools at once.

If you have some examples where it still fails

#1451

@e1528532
Copy link
Contributor Author

e1528532 commented Apr 4, 2017

It is realistic except of the length (and content) of apiToken and passwordHash.

Hmm i just checked it and for me apiToken and passwordHash are equivalent after im- and exporting again. basically the file looks quite the same judging from the content.

I uploaded our jenkins.xml in fc4b583

Can i add our jenkins file as another unit test despite including the api key and password hash or should we rather keep those a secret?

But it should not be done within every individual plugin but instead in the import/export functionality of libtool

The issue i see with that is that plugins might depend on their metadata for serialization/deserialization, so they can't be stripped beforehand. But we could at least drop all metadata of other plugins (e.g. for xerces export drop everything except internal/xerces nodes) so that our plugins only have to care about their metadata, which they should be aware of anyway.

@markus2330
Copy link
Contributor

Hmm i just checked it

Sorry, with "realistic" I mean it is as found on the build server, not that there are any issues with your plugin. apiToken and passwordHash were modified to keep them secret.

Can i add our jenkins file as another unit test despite including the api key and password hash or should we rather keep those a secret?

It is already published, so yes: you can use it in a unit test.

The issue i see with that is that plugins might depend on their metadata for serialization/deserialization.

It would not be internal then. If it contains relevant information, it should be stored somewhere else.

e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 4, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 4, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 4, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 4, 2017
e1528532 added a commit to e1528532/libelektra that referenced this issue Apr 4, 2017
@e1528532
Copy link
Contributor Author

I think the basic version of xerces is already done, so we can close this issue. Pending limitations like invalid key names are adressed already in different issues #1451 #1400

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants