ℂDuce: Documentation: User's manual: Types and patterns

## Types and patterns

In CDuce, a type denotes a set of values, and a pattern extracts sub-values from a value. Syntactically, types and patterns are very close. Indeed, any type can be seen as a pattern (which accepts any value and extracts nothing), and a pattern without any capture variable is nothing but a type.

Moreover, values
also share a common syntax with types and patterns. This is motivated
by the fact that basic and constructed values (that is, any values without
functional values inside) are themselves singleton types.
For instance ** (1,2)** is both a value, a type and a pattern.
As a type, it can be interpreted as a singleton type,
or as a pair type made of two singleton types.
As a pattern, it can be interpreted as a type constraint,
or as a pair pattern of two type constraints.

In this page, we present all the types and patterns that CDuce recognizes. It is also the occasion to present the CDuce values themselves, the corresponding expression constructions, and fundamental operations on them.

## Capture variables and default patterns

A value identifier inside a pattern behaves as a capture variable: it accepts and bind any value.

Another form of capture variable is the default value pattern
** ( x := c )** where

**is a capture variable (that is, an identifier), and**

*x***is a scalar constant. The semantics of this pattern is to bind the capture variable to the constant, disregarding the matched value (and accepting any value).**

*c*
Such a pattern is useful in conjunction with the first match policy
(see below) to define "default cases". For instance, the pattern
** ((x & Int) | (x := 0), (y & Int) | (y := 0))**
accepts any pair and bind

**to the left component if it is an integer (and**

`x`**otherwise), and similarly for**

`0`**with the right component of the pair.**

`y`## Boolean connectives

CDuce recognize the full set of boolean connectives, whose interpretation is purely set-theoretic.

denotes the empty type (no value).`Empty`and`Any`denote the universal type (all the values); the preferred notation is`_`for types and`Any`for patterns, but they are strictly equivalent.`_`is the conjunction boolean connective. The type`&`has all the values that belongs to*t1*&*t2*and to*t1*. Similarly, the pattern*t2*accepts all the values accepted by both sub-patterns; a capture variable cannot appear on both side of this pattern.*p1*&*p2*is the disjunction boolean connective. The type`|`has all the values that belongs either to*t1*|*t2*or to*t1*. Similarly, the pattern*t2*accepts all the values accepted by any of the two sub-patterns; if both match, the first match policy applies, and*p1*|*p2*dictates how to capture sub-values. The two sub-patterns must have the same set of capture variables.*p1*is the difference boolean connective. The left hand-side can be a type or a pattern, but the right-hand side is necessarily a type (no capture variable).`\`

**Priorities**

All boolean connectives have the same priority. Their priority is higher than the function type constructor ** ->** and lower than the XML type constructor. Therefore

**is parsed as**

`Int -> Int & Bool -> Bool`**while**

`Int -> (Int & Bool) -> Bool`**is parsed as**

`<a>[Int*]|String`

`(<a>[Int*])|String`## Recursive types and patterns

A set of mutually recursive types can be defined by toplevel type declarations, as in:

type T1 = <a>[ T2* ] type T2 = <b>[ T1 T1 ]

It is also possible to use the syntax
**T where T1 = t1 and ... and Tn = tn**
where

**and the**

*T***are type identifiers and the**

*Ti***are type expressions. The same notation works for recursive patterns (for which there is no toplevel declarations).**

*ti*
There is an important restriction concerning recursive types:
any cycle must cross a *type constructor* (pairs, records, XML
elements, arrows). Boolean connectives do *not* count as type
constructors! The code sample above is a correct definition.
The one below is invalid, because there is an unguarded cycle
between ** T** and

**.**

`S`type T = S | (S,S) (* INVALID! *) type S = T (* INVALID! *)

## Scalar types

CDuce has three kind of atomic (scalar) values: integers, characters, and atoms. To each kind corresponds a family of types.

**Integers**.

CDuce integers are arbitrarily large. An integer literal is a sequence of decimal digits, plus an optional leading unary minus () character.`-`: all the integers.`Int`(where*i*--*j*and*i*are integer literals, or*j*for infinity): integer interval. E.g.:`*`,`100--*`[1] (note that`*--0`stands both for plus and minus infinity).`*`(where*i*is an integer literal): integer singleton type.*i*

**Floats**.

CDuce provider minimal features for floats. The only way to construct a value of typeis by the function`Float``float_of : String -> Float`**Characters**.

CDuce manipulates Unicode characters. A character literal is enclosed in single quotes, e.g.. The single quote and the backslash character must be escaped by a backslash:`'a', 'b', 'c'`,`'\''`. The double quote can also be escaped, but this is not mandatory. The usual`'\\'`are recognized. Arbitrary Unicode codepoints can be written in decimal`'\n', '\t', '\r'`(`'\`*i*;'is an decimal integer; note that the code is ended by a semicolon) or in hexadecimal*i*. Any other occurrence of a backslash character is prohibited.`'\x`*i*;': all the Unicode character set.`Char`(where*c*--*d*and*d*are character literals): interval of Unicode character set. E.g.:*d*.`'a'--'z'`(where*c*is an integer literal): character singleton type.*c*: all the Latin1 character set (equivalent to`Byte`).`'\0;'--'\255;'`

**Atoms**.

Atoms are symbolic elements. They are used in particular to denote XML tag names, and also to simulate ML sum type constructors and exceptions names. An atomic is writtenwhere```*xxx*follows the rules for CDuce identifiers. E.g.:*xxx*. The atom``yes, `No, `my-name`is used to denote empty sequences.``nil`: all the atoms.`Atom`(where*a*is an atom literal): atom singleton type.*a*: the two atoms`Bool`and``true`.``false`- See also: XML Namespaces.

## Pairs

Pairs is a fundamental notion in CDuce, as they constitute a building block for sequence. Even if syntactic sugar somewhat hides pairs when you use sequences, it is good to know the existence of pairs.

A pair expression is written ** (e1,e2)**
where

**and**

*e1***are expressions.**

*e2*
Similarly, pair types and patterns are written
** (t1,t2)** where

**and**

*t1***are types or patterns. E.g.:**

*t2***.**

`(Int,Char)`
When a capture variable **x** appears on both
side of a pair pattern

**, the semantics is the following one: when a value match**

*p*= (*p1*,*p2*)**, if**

*p***is bound to**

*x***by**

*v1***and to**

*p1***by**

*v2***, then**

*p2***is bound to the pair**

*x***by**

*(v1,v2)***.**

*p*
Tuples are syntactic sugar for pairs. For instance,
** (1,2,3,4)** denotes

**.**

`(1,(2,(3,4)))`## Sequences

### Values and expressions

Sequences are fundamental in CDuce. They represents the content of XML elements, and also character strings. Actually, they are only syntactic sugar over pairs.

Sequences expressions are written inside square brackets; element
are simply separated by whitespaces:
** [ e1 e2 ... en ]**.
Such an expression is syntactic sugar for:

**. E.g.:**

`(`*e1*,(*e2*,*...*(*en*,`nil)*...*))**.**

`[ 1 2 3 4 ]`
The binary operator ** @** denotes sequence concatenation.
E.g.:

**evaluates to**

`[ 1 2 3 ] @ [ 4 5 6 ]`**.**

`[ 1 2 3 4 5 6 ]`
It is possible to specify a terminator different from ** `nil**;
for instance

**denotes**

`[ 1 2 3 4 ;`*q*]**, and is equivalent to**

`(1,(2,(3,(4,`*q*))))**.**

`[ 1 2 3 4 ] @`*q*
Inside the square brackets of a sequence expression, it is possible
to have elements of the form ** ! e** (which is not
an expression by itself), where

**is an expression which should evaluate to a sequence. The semantics is to "open"**

*e***. For instance:**

*e***evaluates to**

`[ 1 2 ![ 3 4 ] 5 ]`**. Consequently, the concatenation of two sequences**

`[ 1 2 3 4 5 ]`**can also be written**

*e1*@*e2***or**

`[ !`*e1*!*e2*]**.**

`[ !`*e1*;*e2*]### Types and patterns

In CDuce, a sequence can be heterogeneous: the element can all have
different types. Types and patterns for sequences are specified
by regular expressions over types or patterns. The syntax is
** [ R ]** where

**is a regular expression, which can be:**

*R*- A type or a pattern, which correspond to a single element in the
sequence (in particular,
represents sequences of length 1,`[ _ ]`*not*arbitrary sequences). - A juxtaposition of regular expressions
which represents concatenation.*R1**R2* - A union of regular expressions
.*R1*|*R2* - A postfix repetition operator; the greedy operators are
,*R*?,*R*+, and the ungreedy operators are:*R**,*R*??,*R*+?. For types, there is no distinction in semantics between greedy and ungreedy.*R**? - A sequence capture variable
(only for patterns, of course). The semantics is to capture in*x*::*R*the subsequence matched by*x*. The same sequence capture variable can appear several times inside a regular expression, including under repetition operators; in that case, all the corresponding subsequences are concatenated together. Two instances of the same sequence capture variable cannot be nested, as in*R*.`[x :: (1 x :: Int)]`

Note the difference betweenand`[ x::Int ]`. Both accept sequences made of a single integer, but the first one binds`[ (x & Int) ]`to a sequence (of a single integer), whereas the second one binds it to the integer itself.`x` -
Grouping
. E.g.:`(`*R*).`[ x::(Int Int) y ]` -
Tail predicate
. The type/pattern`/`*p*applies to the current tail of the sequence (the subsequence starting at the current position). E.g.:*p*will bind`[ (Int /(x:=1) | /(x:=2)) _* ]`to`x`if the sequence starts with an integer and`1`otherwise.`2` -
Repetition
where*R****n*is a positive integer constant, which is just a shorthand for the concatenation of*n*copies of*n*.*R*

Sequence types and patterns also accepts the ** [ ...; ... ]**
notation. This is a convenient way to discard the tail of a sequence
in a pattern, e.g.:

**, which is equivalent to**

`[ x::Int* ; _ ]`**.**

`[ x::Int* _* ]`
It is possible to use the ** @**
operator (sequence concatenation) on types, including in recursive
definitions. E.g.:

type t = [ <a>(t @ t) ? ] (* [s?] where s=<a>[ s? s? ] *) type x = [ Int* ] type y = x @ [ Char* ] (* [ Int* Char* ] *) type t = ([Int] @ t) | [] (* [ Int* ] *)

however when used in recursive definitions ** @** but must be right linear so for instance the following definition are not allowed:

type t = (t @ [Int]) | [] (* ERROR: Ill-formed concatenation loop *) type t = t @ t (* ERROR: Ill-formed concatenation loop *)

## Strings

In CDuce, character strings are nothing but sequences of characters.
The type ** String** is pre-defined as

**. This allows to use the full power of regular expression pattern matching with strings.**

`[ Char* ]`
Inside a regular expression type or pattern, it is possible
to use ** PCDATA** instead of

**(note that both are not types on their own, they only make sense inside square brackets, contrary to**

`Char*`**).**

`String`
The type ** Latin1** is the subtype of

**defined as**

`String`**; it denotes strings that can be represented in the ISO-8859-1 encoding, that is, strings made only of characters from the Latin1 character set.**

`[ Byte* ]`
Several consecutive characters literal in a sequence can be
merged together between two single quotes:
** [ 'abc' ]** instead of

**. Also it is possible to avoid square brackets by using double quotes:**

`[ 'a' 'b' 'c' ]`**. The same escaping rules applies inside double quotes, except that single quotes may be escaped (but must not), and double quotes must be.**

`"abc"`## Records

Records are set of finite (name,value) bindings. They are used in particular to represent XML attribute sets. Names are actually Qualified Names (see XML Namespaces).

The syntax of a record expression is
** { l1=e1; ...; ln=en }**
where the

**are label names (same lexical conventions as for identifiers), and the**

*li***are expressions. When an expression**

*vi***is simply a variable whose name match the field label**

*ei***, it is possible to omit it. E.g.:**

*li***is equivalent to**

`{ x; y = 10; z }`**. The semi-colons between fields are optional.**

`{ x = x; y = 10; z = z }`
They are two kinds of record types. Open record types
are written ** { l1=t1; ...; ln=tn; ..
}**, and closed record types are written

**. Both denote all the record values where the labels**

`{`*l1*=*t1*;*...*;*ln*=*tn*}**are present and the associated values are in the corresponding type. The distinction is that that open type allows extra fields, whereas the closed type gives a strict enumeration of the possible fields. The semi-colon between fields is optional.**

*li*
Additionally, both for open and close record types,
it is possible to specify optional fields by using ** =?**
instead of

**between a label and a type. For instance,**

`=`**represents records with a**

`{ x =? Int; y = Bool }`**field of type**

`y`**, and an optional field**

`Bool`**(that when it is present, has type**

`y`**), and no other field.**

`Int`
The syntax is the same for patterns. Note that capture variables
cannot appear in an optional field. A common idiom is to bind
default values to replace missing optional fields: **
({ x = a } | (a := 1)) & { y = b }**. A special syntax
makes this idiom more convenient:

**.**

`{ x = a else (a:=1); y = b }`
As for record expressions, when the pattern
is simply a capture variable whose name match the field label,
it is possible to omit it. E.g.: ** { x; y = b; z }**
is equivalent to

**.**

`{ x = x; y = b; z = z }`
The ** +** operator (record concatenation, with priority given
to the right argument in case of overlapping) is available on record
types and patterns. This operator can be used to make a close
record type/pattern open, or to add fields:

type t = { a=Int b=Char } type s = t + {..} (* { a=Int b=Char .. } type u = s + { c=Float } (* { a=Int b=Char c=Float .. } *) type v = t + { c=Float } (* { a=Int b=Char c=Float } *)

## XML elements

In CDuce, the general form of an XML element is
** <(tag) (attr)>content** where

**,**

*tag***and**

*attr***are three expressions. Usually,**

*content***is a tag literal**

*tag***, and in this case, instead of writing**

```*xxx***, you can write:**

`<(``*tag*)>**. Similarly, when**

`<`*tag*>**is a record literal, you can omit the surrounding**

*attr***, and also the semicolon between attributes, E.g:**

`({...})`**.**

`<a href="http://..." dir="ltr">[]`
The syntax for XML elements types and patterns follows closely
the syntax for expressions:
** <(tag) (attr)>content**
where

**,**

*tag***and**

*attr***are three types or patterns. As for expressions, it is possible to simplify the notations for tags and attributes. For instance,**

*content***can be written:**

`<(`a) ({ href=String })>[]`**.**

`<a href=String>[]`The following sample shows several way to write XML types.

type A = <a x=String y=String ..>[ A* ] type B = <(`x | `y) ..>[ ] type C = <c x = String; y = String>[ ] type U = { x = String y =? String ..} type V = [ W* ] type W = <v (U)>V

## Functions

CDuce is an higher-order functional languages: functions are first-class citizen values, and can be passed as argument or returned as result, stored in data structure, etc...

A functional type has the form **t -> s**
where

**and**

*t***are types. Intuitively, this type corresponds to functions that accept (at least) any argument of type**

*s***, and for such an argument, returns a value of type**

*t***. For instance, the type**

*s***denotes functions that maps any pair of integer to an integer, and any pair of characters to a character.**

`((Int,Int) -> Int) & ((Char,Char) -> Char)`
The explanation above gives the intuition behind the interpretation
of functional types. It is sufficient to understand which
subtyping relations and equivalences hold between (boolean
combination) of functional types. For instance,
** (Int -> Int) & (Char -> Char)** is a subtype
of

**because with the intuition above, a function of the first type, when given a value of type**

`(Int|Char) -> (Int|Char)`**returns a value of type**

`Int|Char`**or of type**

`Int`**(depending on the argument).**

`Char`
Formally, the type **t -> s** denotes
the set of all CDuce abstractions

**such that**

`fun (`*t1*->*s1*;*...*;*tn*->*sn*)...**is a subtype of**

`(`*t1*->*s1*) &*...*& (*tn*->*sn*)**.**

*t*->*s*Functional types have no counterpart in patterns.

## References

References are mutable memory cells. CDuce has no built-in
reference type. Instead, references are implemented
in an object-oriented way. The type ** ref T**
denotes references of values of type

**. It is only syntactic sugar for the type**

*T***.**

`{ get = [] -> T ; set = T -> [] }`## OCaml abstract types

The notation ** !t** is used by the
CDuce/OCaml interface
to denote the OCaml abstract type

**.**

`t`## Type syntax outline

Below we give a brief and incomplete summary of the syntax of types. For the complete syntax, please refer to the preceding sections.

TYPESt::=vsingleton type (v is a scalar value)| `satom (s is a string)| Atomall the atoms| Boolbooleans (equivalent to`true |`false)| Charcharacters|c--dUnicode intervals (c, d character literals)| ByteLatin1 characters| Intintegers|i--jinteger intervals (i, j integer literals or *)|t->tfunctions| (t,t)pairs| {l=t;...;l=t}closed record types| {l=t;...;l=t; ..}open record types|t+trecord type concatenation| <tt>tXML types| [R]sequences (where R is a type regexp)|t@tsequence concatenation| Stringstrings (equivalent to[Char*])| Latin1Latin1 strings (equivalent to[Byte*])| reftreferences (equivalent to{ get = []->t; set =t->[] })|t&tintersection|t|tunion|t\tdifference| Emptyempty type| Anytop type|Xtype recursion variable|twhereX=trecursive typesandX=t: andX=t|'atype variables (development version only)REGULAR EXPRESSIONS ON TYPESR::=RRconcatenation|R|Runion|R+one or more times|R*zero or more times|R?zero or one time

ℂDuce: Documentation: User's manual: Types and patterns