Key concepts

CDuce is a strongly-typed functional programming language adapted to the manipulation of XML documents. Its syntax is reminiscent of the ML family, but CDuce has a completely different type system.

Let us directly introduce some key concepts:

  • Values are the objects manipulated by CDuce programs; we can distinguish several kinds of values:
    • Basic values: integers, characters.
    • XML documents and fragments: elements, tag names, strings.
    • Constructed values: pairs, records, sequences.
    • Functional values.
  • Types denote sets of values that share common structural and/or behavioral properties. For instance, Int denotes the set of all integers, and <a href=String>[] denotes XML elements with tag a that have an attribute href (whose content is a string) and no sub-element.
    &, |, \ are the usual set-theoretic connectives for intersection, union, and difference, respectively. So the intersection of two types denotes the set of all values that have both types, while their union denotes the union of their values.
  • Expressions are fragments of CDuce programs that produce values. For instance, the expression 1 + 3 evaluates to the value 4. Note that values can be seen either as special cases of expressions, or as the result of evaluating expressions.
  • Patterns are "types + capture variables". They make it possible to extract sub-values from an input value, which can then be used in the rest of the program. For instance, the pattern <a href=x>[] extracts the value of the href attribute and binds it to the value identifier x.
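
As a small sketch of the last point, the pattern <a href=x>[] can be applied to a concrete element by means of the local binding construct introduced in the next section. The URL is only an illustrative value chosen for this sketch:

let <a href=x>[] = <a href="http://www.cduce.org/">[] in x

Here the element on the right of = has tag a, an href attribute, and no sub-element, so it matches the pattern; x is bound to the content of the href attribute, and that string is the value of the whole expression.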

A first example

let x = "Hello, " in
let y = "world!" in
x @ y

The expression binds two strings to value identifiers x and y, and then concatenates them. The general form of the local binding is:

let p = e in e'

where p is a pattern and e, e' are expressions.

Note: A small aside about the examples in this tutorial and their usage. The first program, which prints "Hello, world!", can be tried directly on the on-line prototype: just select and copy it, click on the link to the on-line interpreter in the side bar (we suggest you open it in a new window), paste it in the execution window, and run it. The second example, instead, cannot be run. This is visually signaled by the fact that it contains text in italics. We use italics for meta notation: e and e' stand for generic expressions, so it is useless to run this code (you would just obtain an error signaling that e is not bound or that the quote in e' is not closed). The same holds in general in what follows: code without italicized text can be copied and pasted into the on-line prototype as it is (of course, you must first paste the declarations of the types it uses); this is not possible whenever the code contains italicized text.

Patterns are much more than simple variables. They can be used to decompose values. For instance, if the words Hello and world are in the two elements of a pair, we can capture each of them and concatenate them as follows:

let (x,y) = ("Hello, " , "world!") in x @ y

Patterns can also check types. So for instance

let (x & String, y) = e in x 

would return a (static) type error if the first projection of e does not have the static type String.

The form let x&t = e in e' is used so often that we introduced a special syntax for it:

let x : t = e in e'

Note the blank spaces around the colon [1]. They are needed because the XML recommendation allows colons to occur in identifiers: see the User's Manual section on namespaces. (The same holds for the functional arrow symbol ->, which must be surrounded by blanks, and for colons in the formal parameters of a function: see the User's Manual.)
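
To make the two forms concrete, here is a minimal sketch with illustrative values (the values are assumptions of this sketch, not part of the tutorial's running example). The first expression checks the first component of a pair against String while capturing it:

let (x & String, y) = ("Hello, ", 42) in x

The second uses the abbreviated syntax, with a blank on each side of the colon:

let x : String = "Hello, " in x @ "world!"

Replacing "Hello, " by an integer such as 42 in either expression should be rejected statically, since 42 does not have type String.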

XML documents

CDuce uses its own notation to denote XML documents. Below we present an XML document, followed by the same document in CDuce notation:

<?xml version="1.0"?>
<parentbook>
  <person gender="F">
    <name>Clara</name>
    <children>
      <person gender="M">
        <name>Pål André</name>
        <children/>
      </person>
    </children>
    <email>clara@lri.fr</email>
    <tel>314-1592654</tel>
  </person>
  <person gender="M">
    <name> Bob </name>
    <children>
      <person gender="F">
        <name>Alice</name>
        <children/>
      </person>
      <person gender="M">
        <name>Anne</name>
        <children>
          <person gender="M">
            <name>Charlie</name>
            <children/>
          </person>
        </children>
      </person>
    </children>
    <tel kind="work">271828</tel>
    <tel kind="home">66260</tel>
  </person>
</parentbook>

let parents : ParentBook =
<parentbook>[
  <person gender="F">[
    <name>"Clara"
    <children>[
      <person gender="M">[
        <name>['Pål ' 'André']
        <children>[]
      ]
    ]
    <email>['clara@lri.fr']
    <tel>"314-1592654"
  ] 
  <person gender="M">[
    <name>"Bob"
    <children>[
      <person gender="F">[
        <name>"Alice" 
        <children>[]
      ]
      <person gender="M">[
        <name>"Anne"
        <children>[
          <person gender="M">[
            <name>"Charlie"
            <children>[]
          ] 
        ] 
      ] 
    ] 
    <tel kind="work">"271828"
    <tel kind="home">"66260"
  ] 
] 

Note the straightforward correspondence between the two notations: instead of using a closing tag, we enclose the content of each element in square brackets. In CDuce square brackets denote sequences, that is, heterogeneous (ordered) lists of blank-separated elements. In CDuce strings are not a primitive data type but sequences of characters.

For the purposes of the example we used different notations for strings: in CDuce "xyz", ['xyz'], ['x' 'y' 'z'], [ 'xy' 'z' ], and [ 'x' 'yz' ] all define the same string literal. Note also that the "Pål André" string is accepted because CDuce supports Unicode characters.
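
Since these notations denote the same sequence of characters, any of them can be used where a String is expected. The following sketch (with an arbitrary choice of notation) binds a bracketed literal to an identifier of type String and concatenates it with a double-quoted one:

let s : String = [ 'xy' 'z' ] in s @ "!"

The type annotation is accepted because [ 'xy' 'z' ] is a sequence of characters, and the result is the four-character string made of x, y, z and an exclamation mark.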

Loading XML files

The CDuce program in the previous section starts by binding the variable parents to the XML document. It also specifies that parents has the type ParentBook: this is optional, but it usually allows earlier detection of type errors.

If the XML document of the previous section is stored in a file, say parents.xml, then it can be loaded by load_xml "parents.xml": the builtin function load_xml converts an XML document stored in a file into the CDuce expression representing it. However, load_xml has type String -> Any, where Any is the type of all values. Therefore, if we try to reproduce the same binding as above by writing the following declaration

let parents : ParentBook = load_xml "parents.xml" 

we would obtain a type error, because we would be trying to use an expression of type Any where an expression of type ParentBook is expected. The right way to reproduce the binding above is:

let parents : ParentBook =
     match load_xml "parents.xml" with
          x & ParentBook -> x
       |  _ -> raise "parents.xml is not a document of type ParentBook"

Before assigning the result of the load_xml expression to the variable parents, this expression matches it against the type ParentBook. If the match succeeds (i.e., if the XML document in the file has type ParentBook), then it performs the assignment (the variable x is bound to the result of the load_xml expression by the pattern x & ParentBook); otherwise it raises an exception.

Of course, an exception such as "parents.xml is not a document of type ParentBook" is not very informative about why the document failed the match and where the error might be. In CDuce it is possible to ask the program to perform this check and raise an informative exception (a string that describes and localizes the problem) by using the dynamic type check construction (e :? t), which checks whether the expression e has type t and either returns the result of e or raises an informative exception:

let parents  = load_xml "parents.xml" :? ParentBook

This performs the same test as the previous program but, in case of failure, gives the programmer information on why the type check failed. The dynamic type check can also be used in a let construction, as follows

let parents :? ParentBook = load_xml "parents.xml"

which is completely equivalent to the previous one.

The command load_xml "parents.xml" is just an abbreviated form of load_xml "file://parents.xml". If CDuce is compiled with netclient or curl support, then it is also possible to use other URI schemes such as http:// or ftp://. A special scheme string: is always supported: the string following the scheme is parsed as it is. [2] So, for instance, load_xml "string:exp" parses the literal XML code exp (it corresponds to XQuery's { exp }), while load_xml ("string:" @ x) parses the XML code associated with the string variable x. Thus the following definition of x

let x : Any = <person>[ <name>"Alice" <children>[] ]

is completely equivalent to this one

let x = load_xml "string:<person><name>Alice</name> <children/></person>"
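
The string: scheme combines naturally with the dynamic type check described above. The following is a minimal sketch; the type name Leaf and the identifier alice are hypothetical names introduced only for this illustration (type declarations are explained in the next section):

(* Leaf: a person element with a name and an empty children element *)
type Leaf = <person>[ <name>[ PCDATA ] <children>[] ]

(* parse the literal XML text and check the result against Leaf *)
let alice :? Leaf = load_xml "string:<person><name>Alice</name><children/></person>"

If the parsed document did not have type Leaf (for instance, if the children element were missing), the check would be expected to raise an exception describing the mismatch instead of binding alice.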

Type declarations

First, we declare some types:

type ParentBook = <parentbook>[Person*]
type Person = FPerson | MPerson 
type FPerson = <person gender="F">[ Name Children (Tel | Email)*] 
type MPerson = <person gender="M">[ Name Children (Tel | Email)*] 
type Name = <name>[ PCDATA ]
type Children = <children>[Person*] 
type Tel = <tel kind=?"home"|"work">['0'--'9'+ '-'? '0'--'9'+]
type Echar = 'a'--'z' | 'A'--'Z' | '_' | '0'--'9'
type Email = <email>[ Echar+ ('.' Echar+)* '@' Echar+ ('.' Echar+)+ ]

The type ParentBook describes XML documents that store information about persons. A tag <tag attr1=... attr2=... ...> followed by a sequence type denotes an XML document type. Sequence types classify ordered lists of heterogeneous elements, and they are denoted by square brackets that enclose regular expressions over types (note that a regular expression over types is not a type: it just describes the content of a sequence type, and is therefore meaningless unless it is enclosed in square brackets). The definitions above state that a ParentBook element is formed by a possibly empty sequence of persons. A person is either of type FPerson or of type MPerson, according to the value of the gender attribute. An equivalent definition for Person would thus be:

<person gender="F"|"M">[ Name Children (Tel | Email)*] 

A person element is composed of a sequence formed of a name element, a children element, and zero or more telephone and e-mail elements, in this order.

Name elements contain strings. These are encoded as sequences of characters. The PCDATA keyword is equivalent to the regexp Char*, hence String, [Char*], [PCDATA], [PCDATA* PCDATA], ..., are all equivalent notations. Children elements are composed of zero or more Person elements. Telephone elements have an optional (as indicated by =?) string attribute whose value is either "home" or "work", and their content is a single string made of two non-empty sequences of numeric characters separated by an optional dash character. Had we wanted to state that a phone number is an integer with at least, say, 5 digits (of course this is meaningful only if no phone number starts with 0), we would have used an interval type such as <tel kind=?"home"|"work">[10000--*], where * denotes plus infinity, while on the left-hand side of -- (as in *--100) it denotes minus infinity.
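
As a sketch of how these definitions are used, the following declarations check a few telephone elements against their type. The values are taken from the parents document; the names t1, t2, t3 and TelInt are hypothetical and introduced only for this illustration:

type Tel = <tel kind=?"home"|"work">['0'--'9'+ '-'? '0'--'9'+]

(* the kind attribute is optional, and so is the dash *)
let t1 : Tel = <tel kind="work">"271828"
let t2 : Tel = <tel>"314-1592654"

(* the interval variant: the content is one integer of at least 5 digits *)
type TelInt = <tel kind=?"home"|"work">[10000--*]
let t3 : TelInt = <tel kind="home">[66260]

Under these definitions an element such as <tel kind="mobile">"271828" would be expected to fail the check against Tel, since "mobile" is neither "home" nor "work".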

Echar is the type of characters in e-mail addresses. It is used in the regular expression defining Email to precisely constrain the form of the addresses. An XML document satisfying these constraints is shown in the XML documents section above.

[1] Actually only the first blank is necessary: CDuce accepts let x :t = e in e' as well.

[2] All these schemes are available for load_html and load_file as well.

You can cut and paste the code on this page and test it on the online interpreter.