Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detail on Array mode #43

Closed
amitguptagwl opened this issue Feb 4, 2018 · 19 comments
Closed

Detail on Array mode #43

amitguptagwl opened this issue Feb 4, 2018 · 19 comments
Labels

Comments

@amitguptagwl
Copy link
Member

@Delagen , I'm rewriting the XML parser to handle large files and for other useful features. Hence I was revisiting the test cases. As you did the changes to parse attributes as array in #23 , I have a query to clear the expectations;

In the test "should parse nodes as arrays" in file xmlParser_spec.js, somewhere attribute is parsed as object and somewhere as array.


"any_name": [{
        "@attr": ["https://example.com/somepath"],
        "person": [{
   :
          "age": [34],
          "married": [{
              "@firstTime": "Yes",
              "#_text": "Yes"
          }],

Can you please clear me the expectation so I'll implement the same in new code?

In my understanding / expectations

<types>textdetail
    <type>val</type>
    <type attr="value">val</type>
    <type attr="value">val
        <nestedtag>val</nestedtag>
    </type>
    <type attr="value"></type>
    <type>val</type>
    <config>someconfig</config>
</types>
"types": [ {
            "#text": [ "textdetail" ],
            "type": [
                "val",
                { "@attr": ["value"], "#text": ["val"] },
                { "@attr": ["value"], "#text": ["val"] , "nestedtag" : ["val"]},
                { "@attr": ["value"]},
                "val"
            ],
            "config": [ "someconfig" ]
    }]
@Delagen
Copy link
Contributor

Delagen commented Feb 4, 2018

My PR was for attrNodeName, and mean that all attributes present as child object with this name using it original names (or prefixed if it specified).

arrayMode. It does not my change. But I expect something like:

"types": [ {
            "#text": "textdetail",
            "type": [
                "val",
                { "@attr": "value", "#text": "val" },
                { "@attr":"value", "#text": "val" , "nestedtag" : ["val"]},
                { "@attr": "value"},
                "val"
            ],
            "config": [ "someconfig" ]
    }]

Mean that single tag value should be wrapped as array of object, instead of single object

@amitguptagwl
Copy link
Member Author

amitguptagwl commented Feb 5, 2018

Thanks for your quick response. As you initially mentioned for simple migration from xml2js or others for this change, let me check how does that library transforms above example. So the consistency can be mentioned.

Thanks

@Delagen
Copy link
Contributor

Delagen commented Feb 5, 2018

@amitguptagwl https://github.com/Leonidas-from-XIV/node-xml2js#options
attrkey: object key for collecting attributes. Your previous implementation cannot wrap attributes in subobject, all placed in single level with prefix.
Test: https://github.com/Leonidas-from-XIV/node-xml2js/blob/master/test/parser.test.coffee#L264
You can read other options for future implementation. As for me only this prevent me to migrate.

@amitguptagwl
Copy link
Member Author

attrkey: object key for collecting attributes. Your previous implementation cannot wrap attributes in subobject, all placed in single level with prefix.

There was an option attrNodeName for the same purpose.

It's fine if something is not implemented, my main concern is if something is half implemented.

Thanks a lot for your active discussion and contribution.

@Delagen
Copy link
Contributor

Delagen commented Feb 5, 2018

There was an option attrNodeName for the same purpose.

That was my first PR to this project ))

@amitguptagwl
Copy link
Member Author

amitguptagwl commented Feb 5, 2018

Thanks @Delagen .

I have just compared the performance of xml2js(0.4.19) with both version V2, V3 and I feel fast-xml-parser can give the advantage in case of small files only. When it comes to big files the performance of xml2js is stable and very fast than this library. And it is pure js based. Hence I feel that there is no advantage to create V3 or adding new features of this library as xml2js is already solving most of the concerns.

@Delagen
Copy link
Contributor

Delagen commented Feb 5, 2018

Strange... at about 20M
√ xml2js with large (9049ms)
√ fast-xml-parser with large (279ms)
√ xml-js with large (33326ms)

@amitguptagwl
Copy link
Member Author

where did you get this figure from? May be I'm testing in the wrong way.

@Delagen
Copy link
Contributor

Delagen commented Feb 5, 2018

I'l take sample XML SOAP:response with ~80К CDATA base64 response
Make source with xmlStringLargeElements=200

var xmlStringLarge = `<root>${xmlString.repeat(xmlStringLargeElements)}</root>`;

And parse it 100 times.

On 500
xml-js out of memory

with 1000 elements (~80-90MB)
√ xml2js with large (45113ms)
√ fast-xml-parser with large (1317ms)

@amitguptagwl
Copy link
Member Author

amitguptagwl commented Feb 5, 2018

I'm reading 13m XML file in a string and using benchmark js to test

suite.add('xml2js', {
  'defer': true,
  'fn' : function(deferred) {
      xml2js.parseString(largeXmlData,function(err,result){
        if (err) throw err;
        deferred.resolve();
      });
    }
})
.on('complete', function() {
    console.log(this.name + ":" +   this.hz + " requests/second");
})
.run({ 'async': true });

@amitguptagwl
Copy link
Member Author

You're making me to push my changes.... Can you plz share the gist ? I'm definitely doing something wrong. Because for me xml2js always takes 3s for both small (1.5kb) and big (98mb) files.

@Delagen
Copy link
Contributor

Delagen commented Feb 5, 2018

@amitguptagwl Can you share benchmark project?
Update: I use only 5 iterations on large files.

I am using mocha with simple loop

it("xml2js with large", function() {
		this.timeout(60000);
		var callback = function(err, result) {
			if (err) {
				console.error(err); //eslint-disable-line no-console
			}
			result.root["soap:Envelope"].should.have.lengthOf(xmlStringLargeElements);
			for (var j = 0; j < xmlStringLargeElements; j++) {
				result.root["soap:Envelope"][j]["soap:Body"].should.ok;
			}
		};
		for (var i = 0; i < benchmarkLargeIterations; i++) {
			xml2js.parseString(xmlStringLarge, xml2jsOptions, callback);
		}
	});

Try to remove async & defer, because callback is syncronous

@amitguptagwl
Copy link
Member Author

It is just a file in benchmark folder. Let me share the code;

var Benchmark = require('benchmark');
var suite = new Benchmark.Suite("XML Parser benchmark");

var parser = require("../src/parser");
var xml2js = require("xml2js");
//var parser2 = require("../src/parserV2");

var fs = require("fs");
var path = require("path");
var fileNamePath = path.join(__dirname, "../spec/assets/midsize.xml");
var xmlData = fs.readFileSync(fileNamePath).toString();

suite
.add('xml to json + validation V1', function() {
  parser.validate(xmlData); 
  parser.parse(xmlData); 
})
/*.add('xml to json + validation V2 ', function() {
  parser2.parse(xmlData);
})*/
.add('xml2js ', function() {
  xml2js.parseString(xmlData,function(err,result){
    if (err) throw err;
  });
})
/* .add('xml2js', {
  'defer': true,
  'fn' : function(deferred) {
      xml2js.parseString(xmlData,function(err,result){
        console.log("err", err);
        if (err) throw err;
        deferred.resolve();
      });
    }
}) */

.on('start',function(){
	console.log("Running Suite: " + this.name);
})
.on('error',function(e){
	console.log("Error in Suite: " + this.name);
})
.on('abort',function(e){
	console.log("Aborting Suite: " + this.name);
})
/*.on('cycle',function(event){
	console.log("Suite ID:" + event.target.id);
})*/
// add listeners 
.on('complete', function() {
  for (var j = 0; j < this.length; j++) {
    console.log(this[j].name + ":" +   this[j].hz + " requests/second");
  }
})
// run async 
.run({ 'async': true });

@Delagen
Copy link
Contributor

Delagen commented Feb 5, 2018

@amitguptagwl xml2js speed vary on data. For example: modified 3 lines

var fileNamePath = path.join(__dirname, "../spec/assets/test.xml");
var xmlData = fs.readFileSync(fileNamePath).toString();
xmlData=`<root>${xmlData.repeat(1000)}</root>`;

where test.xml something like

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <soap:Body>
        <rpt:loadReportFileResponseElem
                xmlns:s="http://bus.x.com/common/support/v1"
                xmlns:rpt="http://bus.x.com/service/statement/v1">
            <s:code>0</s:code>
            <s:responseTime>2588</s:responseTime>
            <s:responseDbTime>1893</s:responseDbTime>
            <s:requestId>6b408fd09eb211e7a0807e34820340ec</s:requestId>
            <s:route>172.16.x.x:9192</s:route>
            <rpt:result>
 <rpt:file><![CDATA[
...~80К base64
]]></rpt:file>
            </rpt:result>
        </rpt:loadReportFileResponseElem>
    </soap:Body>
</soap:Envelope>

And result
xml to json V1:5.084196828113811 requests/second
xml2js :0.11925605306783423 requests/second
50!!! times

even with validation (which unneeded if user trust source)
xml to json + validation V1:0.7115240765649846 requests/second
xml2js :0.11756098869218376 requests/second
7 times

xml2js does not have remove namespaces feature that needed by common case, because namespaces names can change and without much complex code parsing of response became pain.
I wrote own cleanXml function that clean up namespaces from result object when used xml2js

on 10000 repeat
xml to json V1:0.4878174819040743 requests/second
xml2js FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

That old error on SAX parser.
Xml2js full of ugly errors Leonidas-from-XIV/node-xml2js#385 and others)
Thats all ))

@amitguptagwl
Copy link
Member Author

Agree, with the XML having CDATA benchmarks are as follows;

xml to json V1: 36.2170364893656 requests/second
xml to json + validation V1: 16.293542685483445 requests/second
xml to json + validation V2 : 18.840299721018933 requests/second
xml2js : 8.493040709182775 requests/second

for small file

xml to json V1: 44887.09328602369 requests/second
xml to json + validation V1: 17996.547289493574 requests/second
xml to json + validation V2 : 17024.9652624249 requests/second
xml2js : 8680.17982076353 requests/second

for small file without CDATA

xml to json V1:22418.135460499547 requests/second
xml to json + validation V1:9835.206488018259 requests/second
xml to json + validation V2 :7393.2310311734145 requests/second
xml2js :6392.260927006273 requests/second

But for big file without CDATA

For data you shared (When I remove CADATA part)
xml to json V1:89.67479082829274 requests/second
xml to json + validation V1:29.936565553420834 requests/second
xml to json + validation V2 :23.292112894835128 requests/second
xml2js :15.190920817625521 requests/second

File I made (13m)

xml to json V1:1.0754129079969013 requests/second
xml to json + validation V1:0.6726000748140463 requests/second
xml to json + validation V2 :0.8346327369734697 requests/second
xml2js :5978.957878234166 requests/second

I can work more to speed it up as it gives better performance at least in case of CDATA or when file size is less than 3-4 mb.

@amitguptagwl
Copy link
Member Author

amitguptagwl commented Feb 6, 2018

There is some more interesting result. I have created parser V3 which is not doing validation and I've done some more changes.

Now when I tested against 98m file

V1: error out due to RE
V2: 8648ms (include validation as well)
V3: 7441ms (no validation, less RE)
xml2js: 4ms

But when I read the output of xml2js, it just read 2 starting tags and skipped rest of the XML. Hence it was fast.

Moreover, I modified your example and added XML prolog and DOCTYPE. In this case xml2js error out.

Now the only problem is parser V1 is up to 4 times faster than V2 or V3 but it works only for small file up to 15mb then it errors out.

I could have rewritten the JS RE engine if I get the time to complete my other project grapes.

@Delagen
Copy link
Contributor

Delagen commented Feb 6, 2018

4 times not too big degradation for support of large files, I think it still faster than xml2js.
My old options for xml2js, may be that change some?

var defaultOptions = {
	explicitCharkey:    false,
	trim:               true,
	normalize:          false,
	normalizeTags:      false,
	attrkey:            "$",
	charkey:            "_",
	explicitArray:      false,
	ignoreAttrs:        false,
	mergeAttrs:         false,
	explicitRoot:       true,
	validator:          null,
	xmlns:              false,
	explicitChildren:   false,
	childkey:           "$$",
	charsAsChildren:    false,
	async:              false,
	strict:             true,
	attrNameProcessors: null,
	tagNameProcessors:  null,
	rootName:           "root",
	xmldec:             {
		version:    "1.0",
		encoding:   "utf-8",
		standalone: true
	},
	doctype:            null,
	renderOpts:         {
		pretty:  true,
		indent:  "  ",
		newline: "\n"
	}
};

Can you share your example XML?

@amitguptagwl
Copy link
Member Author

please check 'spec/assets/" folder. I'm using sample.xml, ptest.xml, and midsize.xml

@amitguptagwl
Copy link
Member Author

Performance (should have been discussed in separate thread) is degraded to handle some cases of XML but no impact due to large files now.

I have removed arraymode from official documents but it is still supported. As soon as I'll verify it, I'll make it as a part of official document.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants