Overview
The goal of the OCamlDuce project is to extend the OCaml language with features to make it easier to write safe and efficient complex applications that need to deal with XML documents. In particular, it relies on a notion of types and patterns to guarantee statically that all the possible input documents are correctly processed, and that only valid output documents are produced.
In a nutshell, OCamlDuce extends OCaml with a new kind of values (xvalues) to represent XML documents, fragments, tags, Unicode strings. In order to describe these values, it also extends the type algebra with socalled xtypes. The philosophy behind these types is that they represent set of xvalues. They can be very precise: indeed, each value can be seen as a singleton type (a set with a single value), and it is possible to form Boolean combinations of xtypes (intersection, union, difference).
OCamlDuce's type system can be understood as a refinement of OCaml. For each subexpression which is inferred to be of the xkind (using OCaml unification based typesystem), OCamlDuce will try to infer to best possible sound xtype. Here, best means smallest for the natural subtyping relation (set inclusion). The inference algorithm is actually a dataflow analysis: the xtype will collect all the values that can be produced by the expression, considering all the possible dataflow in the program. It it sometimes necessary to provide explicit type annotations to help the type checker infer this type, in particular when you define recursive functions or when you use iterators.
Subtyping is implicit for xtypes: if an expression is inferred to be of xtype t, which is a subtype of s, then it is possible to use this expression in any context which expects a value of type s.
Getting started
Most of the new language features are enclosed within double curly braces {{...}}. For instance, the following code sample defines a value x as an XML element (with tag a, an attribute href, and a simple string as content):
# let x = {{ <a href="http://www.cduce.org">['CDuce'] }};; val x : {{<a href=[ 'http://www.cduce.org' ]>[ 'CDuce' ]}} = {{<a href="http://www.cduce.org">[ 'CDuce' ]}}
What appears between the curly braces is called an xexpression. Similarly, there are xtypes (as seen above), and also xpatterns. The delimiters {{...}} are only used for syntactical reasons, to avoid clashed between OCaml and CDuce syntaxes and lexical conventions. As a matter of fact, an OCaml expression need not be a syntactical xexpression (delimited by double curly braces) to evaluate to an xvalue. For instance, once x has been declared as above, the expression x evaluates to an xvalue.
It is possible to use an arbitrary OCaml expression as part of an xexpression: it must simply be protected by a new pair of double curly braces. For instance, there is no ifthenelse construction for xexpressions, but you can write:
# {{ <a href={{if true then {{"a"}} else {{"z"}}}}>[] }};;  : {{<a href=[ 'a'  'z' ]>[ ]}} = {{<a href="a">[ ]}}
Only the highlighted parts are parsed as xexpressions. The ifthenelse subexpression is parsed as an OCaml expression, but its type is an xtype (namely {{[ 'a'  'z' ]}}).
Xvalues
Xvalues are intended to represent XML documents and fragments thereof: elements, tags, text, sequences. In this section, we present the xvalue algebra, the syntax of the corresponding xexpression constructors and the associated xtypes.
There are three kinds of atomic kind of xvalues:
 Unicode characters;
 qualified names;
 arbitrarily large integers.
Characters
Xcharacters are different from OCaml characters. They can represent the range of Unicode codepoints defined in the XML specification. Character literals are delimited by single quotes. The escape sequences \n, \r, \t, \b, \', \", \\ are recognized as usual. The numerical escape sequence are written \n; where n is an integer literal (note the extra semicolon). The source code is interpreted as being encoded in iso88591. As a consequence, Unicode characters which are not part of the Latin1 character set must be introduced with this numerical escape mechanism. The xtypes for xcharacters are:
 singletons;
 intervals, written c  d, where c and d are literals (example: type t = {{ 'a''z' }});
 the type of all xcharacters, written Char;
 the type of all Latin1 characters, written Latin1Char (defined as \0;  \255;).
Integers
Xintegers are arbitrarily large. Literals must be written in decimal. Negative literals must be in parenthesis. E.g.: (3). The xtypes for xintegers are:
 singletons;
 intervals, written i  j, where i and j are literals (example: type t = {{ 1020 }}); it is possible to replace i or j with ** to define openended intervals, e.g. type pos = {{ 1  ** }};
 the type of all xintegers, written Int;
 the type of all the integers which can be represented by a signed 32 (resp. 64) bit machine word, written Int32 (resp. Int64).
Qualified names
Qualified names are intended to represent XML tag names. Conceptually, they are made of a namespace URI and a local name. Since URIs tends to be long, literals are of the form `prefix:local where local is the local name and prefix is an namespace prefix bound to some URI (in the scope of the literal). The local name follows the definitions from the XML Namespaces specification; a dot character must be protected by a backslash and nonLatin1 characters are written as character literals \n;. See below for a explanation on how to bind prefixes to URIs. To refer to the default namespace (or the absence of namespace if not default has been defined), the syntax is simply `local. The xtypes for qualified names are:
 singletons;
 the type of all qualified names, written Atom;
 the type of all qualified names from a specified namespace, written `ns:*.
Records
Xrecords are mainly used to represent the set of attributes of an XML element. An xrecord is a binding from a finite set of labels to xvalues. Labels follows the same syntax as for qualified names without the leading backquote. However, if the namespace prefix is not given, the default namespace does not apply (the namespace URI is empty). The syntax for record xexpressions is { l1=e1 ... ln=en } where the li are labels and the ei are xexpressions. Fields can also be separated with a semicolon. It is legal to omit the expression for a field; the label is then taken as the content of the field (a value with this name must be defined in the current scope), e.g.: let x = ... and y = ... in {{ {x y z=3} }} is equivalent to let x = ... and y = ... in {{ {x=x y=y z=3} }}. The types for xrecords specify which labels are authorized/mandatory, and what the types of the corresponding fields are. There are two kind of record xtypes:
 Closed record types, which only allow a finite number of fields: { l1=t1 ... ln=tn };
 Open record types, which allow additional fields (with arbitrary type): { l1=t1 ... ln=tn .. } (the final two colons are in the syntax).
In both cases, it is possible to make one of the fields optional by changing = to =?.
The xtype of all xrecord is thus { .. }, and the xtype of xrecords with maybe a field l of type Int and maybe arbitrary other fields is { l=?Int .. }.
Sequences
Xsequences are finite and ordered collections of xvalues. The syntax for a sequence xexpression in [ e1 ... en ] (note that elements are not separated by semicolons as in OCaml list). Each item ei can either be:
 an xexpression;
 !e where e is an xexpression which evaluates to a sequence (whose content is inserted in the sequence which is currently defined); e.g. let x = [ 2 3 ] in [ 1 !x 4 ] is equivalent to [ 1 2 3 4 ];
 a string literal delimited by simple quotes; e.g. [ 'abc' ] is equivalent to [ 'a' 'b' 'c' ].
Xtypes for sequences are of the form [R] where R is a regular expression over xtypes which describe the possible contents of the sequences. The possible forms of regular expressions are:
 t (one single element of xtype t)
 R* (zero or more repetitions)
 R+ (one or more repetitions)
 R? (zero or one repetition)
 R1 R2 (sequence)
 R1R2 (alternation)
 (R)
 /t (guard: the tail of the sequence must comply with t).
 PCDATA (equivalent to Char*).
Strings
Strings are nothing but sequences of characters. There are two predefined types String and Latin1 (defined as [ Char* ] and [ Latin1Char* ]).
A string literal [ '...' ] can also be written "..." (without the square brackets). Note that simple (resp. double) quotes need to be escaped only when the string is delimited with double (resp. simple) quotes.
XML elements
An XML element is a triple of xvalues. The syntax for the corresponding xexpression constructor is <(e1) (e2)>e3. When e1 is a qualified name literal, it is possible to omit the leading backquote and the surrounding parentheses. Similarly, when e2 is an xrecord literal, it is possible to omit the curly braces and the parentheses. For instance, one can simply write <a href="abc">['def'] instead of <(`a) ({href="abc"})>['def'].
XML element xtype are written <(t1) (t2)>t3, and the same simplifications applies. For instance, if the namespace prefix ns has been defined, the following is a legal xtype <ns:* ..>[]; it describes XML elements whose tag is in the namespace bound to ns, with an empty content, and with an arbitrary set of attributes. An underscore in place of (t1) is equivalent to (Atom) (any tag).
Xexpressions
In the previous section, we have seen the syntax for xvalues constructors (constant literals, sequence, record, element constructors). In this section, we describe the other kinds of xexpressions.
Binary infix operators
The arithmetic operators on integers follow the usual precedence. They are written +,*,,div,mod (they are all infix).
Record concatenation: e1 ++ e2. The xexpressions e1 and e2 must evaluate to xrecords. The result is obtained by concatening them. If a field with the same label is present in both records, the rightmost one is selected.
Sequence concatenation: e1 @ e2, equivalent to [!e1 !e2].
Projections, filtering
If the xexpression e evaluates to a record or an XML element, the construction e.l will extract the value of field or attribute l. Similarly, the construction e.?l will extract the value of field or attribute l if present, and return the empty sequence [] otherwise.
If the xexpression e evaluates to a record, the construction e . l will produce a new record where the field l has been removed (if present).
If the xexpression e evaluates to an xsequence, the construction e/ will result in a new xsequence obtained by taking in order all the children of the XML elements from the sequence e. For instance, the xexpression [<a>[ 1 2 3 ] 4 5 <b>[ 6 7 8 ] ]/ evaluates to the xvalue [ 1 2 3 6 7 8 ].
If the xexpression e evaluates to an xsequence, the construction e.(t) (where t is an xtype) will result in a new xsequence obtained by filtering e to keep only the elements of type t. For instance, the xexpression [<a>[ 1 2 3 ] 4 5 <b>[ 6 7 8 ] ].(Int) evaluates to the xvalue [ 4 5 ].
Dynamic type checking
If e is an xexpression and t is an xtype, the construction (e :? t) returns the same result as e if it has type t, and otherwise raises a Failure exception whose argument explains why this is not the case.
# let f (x : {{ Any }}) = {{ (x :? <a>[ Int* ] ) }} in f {{ <a>[ 1 2 '3' ] }};; Exception: Failure "Value <a>[ 1 2 '3' ] does not match type <a>[ Int* ]\nValue '3' does not match type Int\n".
Pattern matching
OCamlDuce comes with a powerful pattern matching operation. Xpatterns are described below. The syntax for the pattern matching operation is: match e with p1 > e1  ...  pn > en. The typesystem ensures exhaustivivity for the pattern matching and infers precise types for the capture variables. It is also possile to use xpattern matching as a regular OCaml expression; xpatterns must be surrounded by {{..}}, e.g.: match e with {{p1}} > e1  ...  {{pn}} > en function {{p1}} > e1  ...  {{pn}} > en
Pattern matching follows is firstmatch policy. The first pattern that succeeds triggers the corresponding branch.
Local binding
The xexpression let p=e1 in e2 is equivalent to match e1 with p > e2. There is also an local binding with an xpattern in OCaml expressions: let p=e1 in e2.
Iterators
OCamlDuce comes with a sequence iterator map e with p1 > e1  ...  pn > en and a tree iterator map* e with p1 > e1  ...  pn > en.
For both constructions, the argument must evaluate to a sequence. The map iterator applies the patterns to each element of this sequence in turns and produces a new sequence by concatenating all the results (all the righthand sides must thus produce a sequence). The set of patterns must be exhaustive for all the possible elements of the input sequence.
The tree iterator is similar except that the patterns need not be exhaustive. If some element of the input sequence is not matched, it is simply copied into the result unless it is an XML element. In this case, the transformation is applied recursively to its content.
OCaml constructions
As a convenience, some of the OCaml expression constructors are allowed as xexpressions (without a need to go back to OCaml with double curly braces): (unqualified) value identifiers without apostrophes and function calls.
More on xtypes
We have seen how to write simple xtypes. We can then combine them with Boolean connectives:
 t1 & t2: intersection;
 t1  t2: union;
 t1  t2: difference.
The empty xtype is written Empty (it contains no value), and the universal xtype is written Any (it contains all the xvalues) or _.
When an xtype has been bound to some OCaml identifier (type t = {{...}}), it is possible to use this identifier in another xtype. Recursive definitions are allowed:
type t1 = {{ <a>[ t2* ] }} and t2 = {{ <b>[ t1* ] }}
Note that xvalues are always finite and acyclic. The type checker detects type definition which would yield empty types:
# type t = {{ <a>[ t+ ] }};; This definition yields an empty type
If t1 and t2 are record xtypes, we can combine them with the infix ++ operator, which mimics the corresponding operator on expressions (record concatenation). Similarly, we can use the infix @ concatenation operator on sequence xtypes.
Xpatterns
Xpatterns follow the same syntax as Xtypes. In particular, any Xtype is a valid Xpattern. In addition to Xtypes constructors, Xpatterns can have:
 capture variables (lowercase OCaml identifiers without apostrophes);
 constant bindings (x := c) where x is a capture variable and c is a literal xconstant (this pattern always succeeds and returns the binding x>c).
An identifier in an Xpattern can be either a reference to a named Xtype (if such a type declaration is in scope) or a capture variable (otherwise).
Here is a brief description of the semantics of patterns. Given an input value, a pattern can either succeed or fail. If it succeeds, it also produces a bindings from the capture variables in the pattern to xvalues.
 A pattern which is just a type (no capture variable) succeeds if and only if the value has the type.
 A pattern p1  p2 succeeds if either p1 or p2 succeed, and returns the corresponding binding; if both patterns succeeds, p1 wins. It is required that p1 and p2 have the same sets of capture variables.
 A pattern p1 & p2 succeeds if both p1 and p2 succeed, and returns the concatenation of the two bindings. It is required that p1 and p2 have disjoint sets of capture variables.
In record xpatterns, it is possible to omit the =p part of a field. The content is then replaced with the label name considered as a capture variable (or as a previously defined type). E.g. { x y=p } is equivalent to { x=x y=p }.
It is also possible to add an "else" clause: { x = (a,_)(a:=3) } will accept any record with atmost the field x. If the content is a pair, the capture variable a will be bound to its component; otherwise, it is set to 3.
In regular expressions, it is possible to extract whole subsequences with the notation x::R, e.g.: [ _* x::Int+ _* ]
If the same sequence capture variable appears several times (or below a repetition) in a regexp, it is bound to the concatenation of all matched subsequences. E.g.: [ (x::Int  _)* ] will collect in x all the elements of type Int from a sequence. It is not legal to have repeated simple capture variables.
The regexp operators +,*,? are greedy by default (they match as long as possible). They admit nongreedy variants +?,*?,??.
Namespace bindings
The binding of namespace prefixes to URIs can be done either by toplevel phrases (structure items) or by local declarations:
# {{ namespace ns = "http://..." }};; # let x = {{ `ns: x }};; val x : {{`ns:x}} = {{`ns:x}} # let x = {{ let namespace ns = "http://..." in `ns:x }};; val x : {{`ns:x}} = {{`ns:x}}
The toplevel definitions can also appear in module interfaces (signatures). A toplevel prefix binding is not exported by a module: its scope is limited to the current structure or signature. It is possible to specify a default namespace, and to reset it:
# {{ namespace "http://..." }};; # {{ `x }};;  : {{`ns1:x}} = {{`ns1:x}} # {{ namespace "" }};; # {{ `x }};;  : {{`x}} = {{`x}}
Note that the value prettyprinter invented some prefix for the namespace URI. The default prefix declaration also have a local form let namespace "..." in ... .
More on typechecking
Type inference
As we said above, the programmer is sometimes required to provide type annotations. To know where to put these annotation, it is necessary to get a basic understanding of how typechecking works.
The OCaml typechecker is run first to detect which subexpressions are of the xkind. A second ML typechecking pass is then done to introduce subsumption (implicit subtyping) steps where allowed. After these two passes, the OCamlDuce type checker obtains a dataflow summary of xvalues in the whole compilation unit. This is a directed graph, whose edges represent either simple dataflow or complex operation on xvalues. The nodes of the graph can be thought as xtype variables. A dataflow edge corresponds to a subtyping constraints, and an operation edge corresponds to a symbolic constraints which mimics the corresponding operation on values.
Some of the nodes are given an explicit type by the programmer, through type annotations (on expressions or function arguments) or the other usual mechanism in ML (data type declarations, signatures, ...).
Also, if there is a loop with only subtyping edges in the graph, all the nodes on the loop are merged together.
After this operation, the graph is required to be acyclic (assuming that the nodes with an explicit type are removed from the graph). It is the responsibility of the programmer to provide enough type annotation to achieve this property. Otherwise, a type error is issued.
# let rec f x = match x with 0 > {{ [] }}  n > {{ f {{n1}} @ ['.'] }};; Cycle detected: cannot typecheck # let rec f x : {{ String }} = match x with 0 > {{ [] }}  n > {{ f {{n1}} @ ['.'] }};; val f : int > {{String}} = <fun>
In the example above, there is a cycle between the result type for f and the type for the subexpression f {{n1}}. It is here broken with a type annotation on the result; it could have been broken by a type annotation on the expression f {{n1}}, or on the function f itself, or by a module signature.
Let us study another simple example:
# let f x = {{ x + 1 }} in f {{ 2 }}, f {{ 3 }};;  : {{34}} * {{34}} = ({{3}}, {{4}})
The typecheckers detects that the two xvalues 2 and 3 can flow to the argument of f. Its body is thus typechecked with the assumption that x has type 23. The computed result type is then 34.
The typeinference process described above is global by nature. The acyclicity condition is only imposed after a whole compilation unit has been typechecked by OCaml (and the information from the module interface as been integrated). When a type variable is inferred to be of the xkind, it is never generalized. As a consequence, there is no parametric polymorphism on xtypes.
In the toplevel, typechecking is done after each phrase. Consider the following session:
# let f x = {{ x + 1 }};; val f : {{Empty}} > {{Empty}} = <fun> # let a = f {{ 2 }};; Subtyping failed 2 <= Empty Sample: 2
The function f is inferred to have type {{Empty}} > {{Empty}} because when the first phrase is typechecked, the dataflow graph says that no value can flow to x, and thus the input type is empty (and similarly for the result type). If the two phrases were typechecked together (which would be the case it they had been compiled by the compiler, not in the toplevel), the type checker would have correctly inferred that the input type for f must contain 2.
Implicit subtyping
Coercion from an xtype to a super type is automatic in OCamlDuce. However, this automatic subsumption does not carry over to OCaml type constructor, even if there are covariant. Consider:
# let f (x : {{ Int }} * {{ Int }}) = 1;; val f : {{Int}} * {{Int}} > int = <fun> # let g (x : {{ 0 }} * {{ 0 }}) = f x;; This expression has type {{0}} * {{0}} but is here used with type {{Int}} * {{Int}} # let g (x : {{ 0 }} * {{ 0 }}) = let a,b = x in f (a,b);; val g : {{0}} * {{0}} > int = <fun> # let g (x : {{ 0 }} * {{ 0 }}) = f (x :> {{ Int }} * {{ Int }});; val g : {{0}} * {{0}} > int = <fun>
The first attempt to define g fails because the type for x is not an xtype and thus subsumption does not apply. In the second attempt, we extract the two components of the pair; since they are inferred to be xvalues, subtyping applies to both of them. Thus, when the pair (a,b) is reconstructed, it is legal to unify its type with the input type of f. The third definition for g gives an alternative solution: using explicit OCaml type coercions.
Exchanging values
OCamlDuce strongly seperates regular OCaml values from the new xvalues. They have different syntax, expressions, types, patterns, and even typechecking algorithms. This strong segregation is key point which allowed a simple integration between very different type systems.
At some point, it is still necessary to cross the frontier and translate OCaml values to xvalues or the opposite.
Fortunately, OCamlDuce provides automatic translations in both directions. Instead of double curly braces, you can enclose xexpressions in curly brace+colon {: ... :} (here, the ... is an xexpression). The effect is to translate the result of the xexpression (which must be an xvalue) to an OCaml value. Similarly, in an xexpression, you can obtain the xtranslation of an OCaml value with the same syntax {: ... :} (here, the ... is an OCaml expression).
Here is how the translation works. To each OCaml type t, we associate an xtype T(t) and a pair of translation function between t and T(t). Actually, not all the features are supported. For instance, free type variables, abstract types, object types, nonregular recursive types cannot be translated. In particular, since type variables are not allowed, the OCaml type must be fully known.
The translation for an OCaml type t is defined by structural induction on t. Sum types are translated to union types: a constant constructor A is translated to the qualified name `A; a nonconstant constructor A of t1 * ... * tn is translated to <A>[ T(t1) ... T(tn) ]. Closed polymorphic variants have the same translation. Record types are translated to closed record xtypes. Some other translations:

Here is an example:
# let f (x : {{ Int }}) = {{ x + 1 }} in List.map f {: [ 1 2 3 ] :};;  : {{Int}} list = [{{2}}; {{3}}; {{4}}]
In this example, the result type of the translation is inferred to be {{ Int }} list (because the type for f is given). The corresponding xtype is {{ [Int*] }}.
The standard library
In OCamlDuce, the Num library from OCaml is included in the standard library. In addition, there are two new module called Ocamlduce and Cduce_types in the standard library.
The module Cduce_types gives access to the internal representation of xvalues. It is currently undocumented.
The module Ocamlduce provides several useful functionality xvalues. See the ocamldoc generated documentation for a description of its interface.
Marshaling
OCamlDuce use some tricks on its internal representation of xvalues to reduce memory usage and improve performance. You need to pay special attention if you want to use OCaml serialization functions (module Marshal, functions input_value/output_value) on xvalues. In addition to your values, you also need to save and restore some piece of internal data using the functions Cduce_types.Value.extract_all and Cduce_types.Value.intract_all. Of course, this also applies if the value to be serialized contains deeply nested xvalues.
Here are generic serialization/deserializations functions that illustrate how to do it:
let my_output_value oc v = let p = Cduce_types.Value.extract_all () in output_value oc (p,v) let my_input_value ic = let (p,v) = input_value ic in Cduce_types.Value.intract_all p; v
Performance
Strings
OCaml users might be surprised by the fact that xstrings are simply represented as sequences in OCamlDuce. Does this mean that they are actually stored in memory as linked list? Certainly not! The internal representation of sequence values uses several tricks to improve performance and memory usage. In particular, a special form in the representation can store strings as byte buffers, as in OCaml. It an XML document is loaded, or if a Caml string is converted to an xvalue, this compact representation will be used.
Concatenation
Similarly, OCaml users might be relectutant to use the sequence concatenation @ on sequences. In OCaml, the complexity of this operator is linear in the size of its first argument (which need to be copied). OCamlDuce use a special form in its internal representation to store concatenation in a lazy way. The concatenation will really by computed only when the value is accessed. This means that it's perfectly ok to build a long sequence by adding new elements at the end one by one, as long as you don't simultaneously inspect the sequence.
Pattern matching
Another point which is worth knowing when programming in OCamlDuce is that patterns can be written in a declarative style without affective performance. The compiler uses static type information about matched values to produce efficient code for pattern matching. To illustrate this, consider the following sample:
x.ml: type a = {{ <a>[ a* ] }} type b = {{ <b>[ b* ] }} let f : {{ ab }} > int = function {{ a }} > 0  {{ _ }} > 1
y.ml: type a = {{ <a>[ a* ] }} type b = {{ <b>[ b* ] }} let f : {{ ab }} > int = function {{ <a>_ }} > 0  {{ _ }} > 1
The two functions have exactly the same semantics, but the first implementation is more declarative: it uses type checks to distinguish between a and b instead of saying how to distinguish between these two types. Imagine that the definition of these types change to:
type a = {{ <x kind="a">[ a* ] }} type b = {{ <x kind="b">[ b* ] }}
Then the first implementation still works as expected, but the second one needs to be rewritten.
Now one might believe that the second implementation is more efficient because it tells the compiler to check only the root tag, whereas the first implementation would force the compiler to produce code to check that all tags in the tree are as. But this is not what happens! Actually, you can check that the compiler will produce exactly the same code for both implementations. It considers the static type information about the argument of the pattern matching (here, the input type of the function), and computes an efficient way to evaluate patterns for the values of this type.
The map iterator
The map ... with ... iterator is implemented in a tailrecursive way. You can safely use it on very long sequences.
OCaml and OCamlDuce
Since the 3.08.4 release, OCamlDuce is binary compatible with the corresponding OCaml release. This means that OCamlDuce can use OCamlgenerated .cmi files and that it produces an OCamlcompatible .cmi file if the interface does not use any xtype (this file is equal to what would have been obtained by using OCaml).
It is thus possible to use existing libraries which were compiled for OCaml. It is also possible to use OCamlDuce to compile some modules and use them in an OCaml project provided their interface is pure OCaml.