
This work is licensed under a Creative Commons Attribution-Share Alike 2.0 France License.
The other day I needed a small XML parser to convert an XML document into a different format. First I tried xml-light, a simple parser written entirely in OCaml that stores the parsed XML document in an OCaml data structure. This data structure can be used to access the various fields of the XML document. It does not offer a DOM-like interface, but I actually consider this a feature. Unfortunately xml-light is terribly slow: parsing 30K-plus lines of XML takes far too long for it to be considered for my application.
The next logical choice was to try Expat, an event-based parser that is extremely fast. Since using an event-based parser can be a bit cumbersome (and I had already written a bit of code using xml-light), I decided to write a small wrapper around Expat that provides an xml-light interface to it.
The code is pretty simple and the main idea is taken from the CDuce XML loader.
First we provide a small data structure to hold the XML document as we examine it. Nothing deep here. Notice that we use Start and String as we descend the tree and Element as we unwind the stack.
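As a rough illustration of such a stack, here is a minimal sketch. The constructor names Start, String and Element come from the description above; the payload of each constructor, the local Xml module (a stand-in for xml-light's document type) and the helper open_tags are my assumptions, not necessarily what the wrapper uses.

```ocaml
(* Stand-in for xml-light's document type, so the sketch compiles on its own. *)
module Xml = struct
  type xml =
    | Element of (string * (string * string) list * xml list)
    | PCData of string
end

(* Stack of fragments seen so far. Payloads are assumptions for illustration. *)
type stack =
  | Empty
  | Start of string * (string * string) list * stack  (* open tag, pushed on the way down *)
  | String of string * stack                           (* character data, pushed on the way down *)
  | Element of Xml.xml * stack                         (* finished subtree, pushed when unwinding *)

(* Snapshot of the stack after reading "<a><b>hi": two open tags and one text chunk. *)
let example = String ("hi", Start ("b", [], Start ("a", [], Empty)))

(* Hypothetical helper: tags that are still open, innermost first. *)
let rec open_tags = function
  | Empty -> []
  | Start (tag, _, rest) -> tag :: open_tags rest
  | String (_, rest) | Element (_, rest) -> open_tags rest
```

Keeping the pending fragments in one linked structure means that closing a tag only has to pop entries until it meets the matching Start.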
Then we need to provide Expat handlers to store XML fragments on the stack as we go down. Note that we have a handler for CDATA, but not a handler for PCDATA, as it is the default.
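To make the push and unwind steps concrete, here is a sketch of what such handlers could look like. It is not the actual code of the wrapper: the handler names, the global stack reference and the root function are assumptions, CDATA handling is left out, and the events are fed by hand instead of coming from Expat so that the example needs nothing beyond the standard library.

```ocaml
module Xml = struct
  (* Stand-in for xml-light's document type. *)
  type xml =
    | Element of (string * (string * string) list * xml list)
    | PCData of string
end

type stack =
  | Empty
  | Start of string * (string * string) list * stack
  | String of string * stack
  | Element of Xml.xml * stack

let stack = ref Empty

(* Opening tag: remember the tag and its attributes. *)
let start_handler tag attrs = stack := Start (tag, attrs, !stack)

(* Character data: merge with the previous chunk if there is one on top. *)
let text_handler s =
  match !stack with
  | String (prev, rest) -> stack := String (prev ^ s, rest)
  | st -> stack := String (s, st)

(* Closing tag: pop fragments until the matching Start, collecting the
   children in document order, then push the finished element. *)
let rec unwind children = function
  | Element (x, st) -> unwind (x :: children) st
  | String (s, st) -> unwind (Xml.PCData s :: children) st
  | Start (tag, attrs, st) -> Element (Xml.Element (tag, attrs, children), st)
  | Empty -> failwith "closing tag without an opening tag"

let end_handler _tag = stack := unwind [] !stack

(* Once the whole document has been seen, exactly one element is left. *)
let root () =
  match !stack with
  | Element (x, Empty) -> x
  | _ -> failwith "document not fully parsed"

(* Events for <a x="1"><b>hi</b>there</a>, fed by hand. *)
let doc =
  start_handler "a" [("x", "1")];
  start_handler "b" [];
  text_handler "hi";
  end_handler "b";
  text_handler "there";
  end_handler "a";
  root ()
```

Because the stack is last-in first-out, the children come off in reverse order; consing each one onto the accumulator while popping restores document order without a separate List.rev.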
At the end we just register all the handlers with the Expat parser and return the root of the XML document.
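The registration step can be sketched as follows. This assumes the ocaml-expat bindings (ocamlfind package expat) and uses the calls I believe they expose: parser_create, the set_*_handler functions, parse and final. The function parse_with, its labelled arguments and the example tags are hypothetical names made up for this illustration.

```ocaml
(* Assumes the ocaml-expat bindings; build with: ocamlfind ocamlopt -package expat ... *)

(* Hypothetical helper: register three callbacks, run the parser over a
   whole string, then ask the caller for the result it accumulated. *)
let parse_with ~on_start ~on_end ~on_text ~result doc =
  let p = Expat.parser_create ~encoding:None in
  Expat.set_start_element_handler p on_start;
  Expat.set_end_element_handler p on_end;
  Expat.set_character_data_handler p on_text;
  Expat.parse p doc;
  Expat.final p;  (* tell Expat that the input is complete *)
  result ()

(* Tiny usage example: collect the tag names in document order. *)
let tags doc =
  let seen = ref [] in
  parse_with doc
    ~on_start:(fun tag _attrs -> seen := tag :: !seen)
    ~on_end:(fun _tag -> ())
    ~on_text:(fun _text -> ())
    ~result:(fun () -> List.rev !seen)
```

In the wrapper the callbacks would be the stack handlers and result would return the single element left on the stack, that is, the root of the document.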
I've copied the xml-light methods for accessing the document into a separate file. I've also made everything lazy to save a bit of computing time when only a part of a huge XML document needs to be accessed.
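The post does not say how the laziness is implemented, so the following is only one possible shape for it: accessors modelled on xml-light's tag, attrib and children, over a type whose child list is a lazy value. The type definition and the forced flag are assumptions made for this sketch.

```ocaml
(* Assumption: children are suspended and only built on first access. *)
type xml =
  | Element of string * (string * string) list * xml list Lazy.t
  | PCData of string

exception Not_element of xml
exception No_attribute of string

let tag = function
  | Element (t, _, _) -> t
  | x -> raise (Not_element x)

let attrib x name =
  match x with
  | Element (_, attrs, _) ->
    (try List.assoc name attrs with Not_found -> raise (No_attribute name))
  | x -> raise (Not_element x)

(* Forcing happens here, and only once per element. *)
let children = function
  | Element (_, _, c) -> Lazy.force c
  | x -> raise (Not_element x)

(* The flag shows when the child list is actually computed. *)
let forced = ref false
let doc = Element ("root", [("id", "1")], lazy (forced := true; [PCData "hello"]))
```

Reading tag doc or attrib doc "id" leaves forced at false; the child list is built only when children doc is first called.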
The complete code can be found here: git clone https://www.mancoosi.org/~abate/repos/xmlparser.git
The other day I was made aware that this parser has a serious bug when used on a 32-bit machine. The problem is that the maximal string size on a 32-bit machine is equal to Sys.max_string_length, which is roughly 16 MB. If we read and parse a big document all at once with IO.read_all, we immediately get an exception. The solution is to parse the document incrementally using the new function parser_ch, which takes a channel instead of a string and runs the Expat parser incrementally.
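The chunked reading loop behind this idea can be sketched with the standard library alone. This is not the code of parser_ch: the function feed_channel, its parameters and the default chunk size are assumptions for illustration, and the parser is abstracted as a pair of callbacks.

```ocaml
(* Hypothetical helper: read the channel in fixed-size chunks and hand each
   chunk to [feed], so no string ever exceeds [chunk_size] bytes. *)
let feed_channel ?(chunk_size = 4096) ~feed ~finish ic =
  let buf = Bytes.create chunk_size in
  let rec loop () =
    let n = input ic buf 0 chunk_size in  (* 0 means end of file *)
    if n > 0 then begin
      feed (Bytes.sub_string buf 0 n);
      loop ()
    end
  in
  loop ();
  finish ()

(* Demonstration without a parser: reassemble a small file from its chunks. *)
let roundtrip contents =
  let path = Filename.temp_file "feed" ".xml" in
  let oc = open_out_bin path in
  output_string oc contents;
  close_out oc;
  let ic = open_in_bin path in
  let acc = Buffer.create 16 in
  let result =
    feed_channel ~chunk_size:5 ic
      ~feed:(Buffer.add_string acc)
      ~finish:(fun () -> Buffer.contents acc)
  in
  close_in ic;
  Sys.remove path;
  result
```

With the ocaml-expat bindings, feed would presumably be Expat.parse applied to the parser and finish would call Expat.final before returning the root, since Expat accepts a document in successive pieces.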