Breaking up large XML files
Because Karma uses a non-streaming XML parser, it doesn't work well with very large XML files. You can use a tool such as xml_split to break these files up into more manageable fragments.
xml_split is Perl-based and is part of the Perl XML::Twig package.
sudo cpan XML::Twig
xml_split will be installed by default somewhere in your perl directory. For me this is /opt/local/libexec/perl5.16/sitebin/
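Since the install location varies between Perl installations, a small lookup can save some hunting. The following is a minimal sketch, not part of XML::Twig: `locate_tool` is a hypothetical helper name, and it only checks the PATH followed by whatever candidate directories you pass in.

```shell
# Hypothetical helper: print the full path of a tool, checking PATH first
# and then any extra candidate directories given as arguments.
locate_tool() {
  tool="$1"
  shift
  # Already reachable through PATH?
  if command -v "$tool" >/dev/null 2>&1; then
    command -v "$tool"
    return 0
  fi
  # Otherwise look for an executable file in each candidate directory.
  for dir in "$@"; do
    if [ -x "$dir/$tool" ]; then
      printf '%s\n' "$dir/$tool"
      return 0
    fi
  done
  return 1
}
```

With the directory mentioned above, this could be called as `locate_tool xml_split /opt/local/libexec/perl5.16/sitebin`. It prints nothing and returns a non-zero status if the tool is not found.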
/opt/local/libexec/perl5.16/sitebin/xml_split -s <size> input.xml
will create files of about <size> each, named input-00.xml, input-01.xml, ....
I used 100M as <size>.
input-00.xml is a skeleton file instructing xml_merge (the companion tool to xml_split) how to put those files back together to recreate the exact original. We can ignore it. Each of the other input-NN.xml files will contain top-level objects from our input file, wrapped in a tag which looks like <xml_split:root xmlns:xml_split="http://xmltwig.com/xml_split"> ... </xml_split:root>
We can simply get rid of the first and last line of each remaining file:
bash> rm input-00.xml
bash> for file in input*.xml; do sed -i '' '1d;$d' "$file"; done
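The `sed -i ''` form above is the BSD/macOS spelling; GNU sed expects `-i` without the empty argument. As a sketch of a variant that should behave the same on both, the function below writes through a temporary file instead of editing in place. `strip_wrapper` is a hypothetical name, and it assumes, as above, that the opening and closing xml_split:root tags each sit on their own line (the first and the last).

```shell
# Hypothetical helper: delete the first and last line of each given file.
# Goes through a temporary file so it does not rely on sed's -i option,
# whose syntax differs between GNU and BSD sed.
strip_wrapper() {
  for file in "$@"; do
    tmp="${file}.tmp"
    # '1d;$d' deletes line 1 and the last line; replace the file only on success.
    sed '1d;$d' "$file" > "$tmp" && mv "$tmp" "$file"
  done
}
```

After removing input-00.xml, this could be run as `strip_wrapper input-*.xml`. It is worth checking one fragment by hand first to confirm that the wrapper tags really are on the first and last lines.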
For example, to split ducksouth.com.xml into fragments of about 100M each and time the run:
time /opt/local/libexec/perl5.16/sitebin/xml_split -s100M ducksouth.com.xml