CDuce: Types and patterns

Introduction

In CDuce, a type denotes a set of values, and a pattern extracts sub-values from a value. Syntactically, patterns and types are very similar, with two notable exceptions: type variables cannot occur in patterns, and capture (expression) variables cannot occur in types. Indeed, any closed type (i.e., a type without type variables) can be seen as a pattern which accepts any value of that type and extracts nothing, and a pattern without any capture variable is nothing but a closed type.

Moreover, values also share a common syntax with types and patterns. This is motivated by the fact that basic and constructed values (that is, any values without functional values inside) are themselves singleton types. For instance, (1,2) is at once a value, a type, and a pattern. As a type, it can be interpreted as a singleton type, or as a pair type made of two singleton types. As a pattern, it can be interpreted as a type constraint, or as a pair pattern of two type constraints.
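
As a small illustration of this shared syntax, the sketch below uses (1,2) in each of its three roles. The names v, T and is_one_two are hypothetical, and the function-declaration form is assumed from the Expressions part of the manual rather than from this page.

```cduce
let v = (1,2)                          (* (1,2) as a value *)

type T = (1,2)                         (* (1,2) as a singleton type *)

(* (1,2) as a pattern: it accepts exactly the pair (1,2) and binds nothing *)
let is_one_two (p : (Int,Int)) : Bool =
  match p with
    (1,2) -> `true
  | _     -> `false
```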

In this page, we present all the types and patterns that CDuce recognizes. It is also the occasion to present the CDuce values themselves, the corresponding expression constructions, and fundamental operations on them.

Type variables, instances, and subtyping

Since version 1.0 type variables may occur in types. A type variable is written 'xxx where “'” is the single quote character and xxx follows the rules for CDuce identifiers. E.g.: 'a, 'my-type, 'This.

A type s is an instance of a type t if s can be obtained from t by applying one or more type substitutions to it, that is, by substituting a given type for all occurrences of a type variable in t.

A type is closed if no type variable occurs in it. A type is open if it is not closed. Intuitively, an open type denotes the set of all its closed instances, and if an expression has an open type t, it has all the types that are instances of t (and, by subsumption, all their supertypes, too: see below).

Type variables cannot occur in patterns.

Like any other type, type variables can occur in expressions. For the scoping rules of type variables see Polymorphism, type variables scope.

A type s is a subtype of a type t (equivalently, t is a supertype of s) if every value of type s is also a value of type t. Since a type is a set of values (i.e., the set of all values that have that type), subtyping corresponds to set containment on types as sets of values. Intuitively, an expression of a given type can be safely used wherever an expression of a supertype is expected: this property is called subsumption.

Type substitutions preserve subtyping: if two types are in a subtyping relation, so is every pair of instances obtained by applying the same type substitution to both types.
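
The notions of instance and subsumption can be sketched as follows. This is only an illustration: the names identity and double are hypothetical, and the declaration syntax with a type variable in the signature is assumed to be available in the version of CDuce being used.

```cduce
(* 'a -> 'a is an open type; Int -> Int and String -> String are two of its instances *)
let identity (x : 'a) : 'a = x

let n = identity 3
let s = identity "abc"

(* 0--9 is a subtype of Int, so a digit can be used where an Int is expected *)
let double (x : Int) : Int = x + x
let d = double (3 : 0--9)
```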

Capture variables and default patterns

A value identifier inside a pattern behaves as a capture variable: it accepts and binds any value.

Another form of capture variable is the default value pattern ( x := c ) where x is a capture variable (that is, an identifier), and c is a scalar constant. The semantics of this pattern is to bind the capture variable to the constant, disregarding the matched value (and accepting any value).

Such a pattern is useful in conjunction with the first match policy (see below) to define "default cases". For instance, the pattern ((x & Int) | (x := 0), (y & Int) | (y := 0)) accepts any pair and binds x to the left component if it is an integer (and to 0 otherwise), and similarly for y with the right component of the pair.
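
A minimal sketch of this pattern in use is given here. The function name ints_or_zero and the sample arguments are hypothetical; the pattern itself is the one from the text.

```cduce
(* Accepts any pair; non-integer components are replaced by the default 0 *)
let ints_or_zero (p : (Any,Any)) : (Int,Int) =
  match p with
    ((x & Int) | (x := 0), (y & Int) | (y := 0)) -> (x, y)

let a = ints_or_zero (1, "two")    (* left component is an integer, right one is not *)
let b = ints_or_zero (`nil, 7)     (* right component is an integer, left one is not *)
```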

Boolean connectives

CDuce recognizes the full set of boolean connectives, whose interpretation is purely set-theoretic.

- Empty denotes the empty type (it contains no value).
- Any and _ denote the universal type (all the values); the preferred notation is Any for types and _ for patterns, but they are strictly equivalent.
- & is the conjunction boolean connective. The type t1 & t2 has all the values that belong to both t1 and t2. Similarly, the pattern p1 & p2 accepts all the values accepted by both sub-patterns; a capture variable cannot appear on both sides of this pattern.
- | is the disjunction boolean connective. The type t1 | t2 has all the values that belong either to t1 or to t2. Similarly, the pattern p1 | p2 accepts all the values accepted by either of the two sub-patterns; if both match, the first match policy applies, and p1 dictates how to capture sub-values. The two sub-patterns must have the same set of capture variables.
- \ is the difference boolean connective. The type t1 \ t2 has all the values that belong to t1 but not to t2. The left-hand side can be a type or a pattern, but the right-hand side is necessarily a type (no capture variable).

Priorities
For Boolean connectives, intersection has higher priority than union and lower priority than difference. All Boolean connectives have higher priority than the function type constructor -> and lower priority than the XML type constructor. Therefore Int -> Int & Bool -> Bool is parsed as Int -> (Int & Bool) -> Bool, while <a>[Int*]|String\Bool is parsed as (<a>[Int*])|(String\Bool).

A type with type variables is empty if and only if all its instances are empty.
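
The connectives of this section can be combined freely in type declarations and in patterns. The following sketch uses hypothetical names (Digit, NonDigit, classify) and relies on the first match policy to distinguish digits from other integers.

```cduce
type Digit    = 0--9
type NonDigit = Int \ Digit            (* every integer except 0 to 9 *)

let classify (x : Int | String) : String =
  match x with
    Digit -> "digit"
  | Int   -> "other integer"           (* only reached when Digit did not match *)
  | _     -> "string"
```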

Type declarations, recursive types and recursive patterns

A set of mutually recursive types can be defined by toplevel type declarations, as in:

type T1 = <a>[ T2* ]
type T2 = <b>[ T1 T1 ]

It is also possible to use the syntax T where T1 = t1 and ... and Tn = tn where T and the Ti are type identifiers and the ti are type expressions. The same notation works for recursive patterns (for which there is no toplevel declaration).
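
As an illustration of the where notation, the sketch below defines a list of integers with a local recursion variable. The names IntList and L are hypothetical, and the extra parentheses are only there to avoid relying on the relative priority of where and |.

```cduce
(* L is local to the definition; the cycle on L crosses the pair constructor *)
type IntList = L where L = ((Int, L) | `nil)

let l = ((1,(2,`nil)) : IntList)
let m = ([ 1 2 ] : IntList)            (* the same value, written as a sequence *)
```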

There is an important restriction concerning recursive types: any cycle must cross a type constructor (pairs, records, XML elements, arrows). Boolean connectives do not count as type constructors! The code sample above is a correct definition. The one below is invalid, because there is an unguarded cycle between T and S.

type T = S | (S,S)  (* INVALID! *)
type S = T          (* INVALID! *)

(What follows is for the development version only:)

Type declarations can be parametrized by one or more type variables. The general syntax is:

type name('a1,...,'an) = t

where name is a type identifier, immediately followed by an opening parenthesis and a list of type variables 'a1, ..., 'an that must occur in t.

For instance, a possible definition for associative maps, parametric in the types of the two components of each pair, is:

type Map('a,'b) = ( ('a,'b) , Map('a,'b) ) | `Nil

The same restriction for cycles applies also in this case for the whole parametric type. In this case the recursion variable is given by the name and the type parameters which must always appear in the same order (e.g., in the definition above Map('a,'b)). So for instance a definition such as

type Map('a,'b) = ( ('a,'b) , Map('b,'a) ) | `Nil  (* ERROR: Invalid instantiation of type 'Map' *)

is rejected since Map('a,'b) and Map('b,'a) are not textually the same.
Parametric type variables can then be instantiated by any type, as in Map(String , ('a,'a) -> Bool). The parametric notation is just a shorthand for the whole definition; an instance is therefore completely equivalent to the right-hand side of the definition of the parametric type in which the type variables are replaced by the specified types. This also means that parameters that are not used in the definition are useless.
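
A hedged sketch of an instantiation follows. It repeats the Map definition from above so that it is self-contained; the names Phonebook and pb and the sample entries are hypothetical, and, as noted, parametric declarations may require the development version.

```cduce
type Map('a,'b) = ( ('a,'b) , Map('a,'b) ) | `Nil

(* Instance where keys are strings and values are integers *)
type Phonebook = Map(String, Int)

let pb = ((("alice", 42), (("bob", 7), `Nil)) : Phonebook)
```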

Scalar types

CDuce has three kinds of atomic (scalar) values: integers, characters, and atoms. To each kind corresponds a family of types.

Integers.
CDuce integers are arbitrarily large. An integer literal is a sequence of decimal digits, plus an optional leading unary minus (-) character.
- Int: all the integers.
- i--j (where i and j are integer literals, or * for infinity): integer interval. E.g.: 100--*, *--0 [1] (note that * stands both for plus and minus infinity).
- i (where i is an integer literal): integer singleton type.
Floats.
CDuce provides minimal features for floats. The only way to construct a value of type Float is by the function float_of : String -> Float.
Characters.
CDuce manipulates Unicode characters. A character literal is enclosed in single quotes, e.g. 'a', 'b', 'c'. The single quote and the backslash character must be escaped by a backslash: '\'', '\\'. The double quote can also be escaped, but this is not mandatory. The usual '\n', '\t', '\r' are recognized. Arbitrary Unicode codepoints can be written in decimal '\i;' (i is a decimal integer; note that the code is terminated by a semicolon) or in hexadecimal '\xi;'. Any other occurrence of a backslash character is prohibited.
- Char: all the Unicode character set.
- c--d (where c and d are character literals): interval of Unicode character set. E.g.: 'a'--'z'.
- c (where c is a character literal): character singleton type.
- Byte: all the Latin1 character set (equivalent to '\0;'--'\255;').
Atoms.
Atoms are symbolic elements. They are used in particular to denote XML tag names, and also to simulate ML sum type constructors and exception names. An atom is written `xxx (where “`” is the backquote character and xxx follows the rules for CDuce identifiers). E.g.: `yes, `No, `my-name. The atom `nil is used to denote empty sequences.
- Atom: all the atoms.
- a (where a is an atom literal): atom singleton type.
- Bool: the two atoms `true and `false.
- See also: XML Namespaces.
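
The scalar types listed above can be sketched as follows. All type and value names are hypothetical; (e : t) is a type assertion, assumed from the Expressions part of the manual.

```cduce
type Percent = 0--100          (* integer interval *)
type Natural = 0--*            (* here * stands for plus infinity *)
type Lower   = 'a'--'z'        (* character interval *)
type Answer  = `yes | `no      (* union of two atom singleton types *)

let p = (42 : Percent)
let c = ('q' : Lower)
let a = (`yes : Answer)

let newline = '\n'
let e_acute = '\233;'          (* code point 233 written in decimal *)
```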

Pairs

The notion of pair is fundamental in CDuce, since pairs constitute a building block for sequences. Even if syntactic sugar somehow hides pairs when you use sequences, it is important to know that the underlying representation of a sequence is made of nested pairs.

A pair expression is written (e1,e2) where e1 and e2 are expressions.

Similarly, pair types and patterns are written (t1,t2) where t1 and t2 are types or patterns. E.g.: (Int,Char).

When a capture variable x appears on both sides of a pair pattern p = (p1,p2), the semantics is the following: when a value matches p, if x is bound to v1 by p1 and to v2 by p2, then x is bound to the pair (v1,v2) by p.

Tuples are syntactic sugar for pairs. For instance, (1,2,3,4) denotes (1,(2,(3,4))).

Pair: the type of all pairs of values (it is equivalent to (Any,Any)).
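
A short sketch of pair expressions and patterns, under the rules stated in this section. The names swap, t, last and both are hypothetical.

```cduce
let swap (p : (Int,Char)) : (Char,Int) =
  match p with (x, y) -> (y, x)

let t = (1,2,3,4)                            (* sugar for (1,(2,(3,4))) *)
let last = match t with (_,(_,(_,d))) -> d   (* matches the nested-pair form *)

(* x occurs on both sides of the pair pattern, so by the rule above
   it is bound to the pair made of the two captured components *)
let both = match (1,2) with (x, x) -> x
```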

Sequences

Values and expressions

Sequences are a key ingredient of CDuce. They represent both the content of XML elements and strings of characters. Actually, as said before, they are only syntactic sugar over pairs.

Sequence expressions are written inside square brackets; elements are simply separated by whitespace: [ e1 e2 ... en ]. Such an expression is syntactic sugar for: (e1,(e2, ... (en,`nil) ...)). E.g.: [ 1 2 3 4 ].

The binary operator @ denotes sequence concatenation. E.g.: [ 1 2 3 ] @ [ 4 5 6 ] evaluates to [ 1 2 3 4 5 6 ].

It is possible to specify a terminator different from `nil; for instance [ 1 2 3 4 ; q ] denotes (1,(2,(3,(4,q)))), and is equivalent to [ 1 2 3 4 ] @ q.

Inside the square brackets of a sequence expression, it is possible to have elements of the form ! e (which is not an expression by itself), where e is an expression which should evaluate to a sequence. The semantics is to "open" e. For instance: [ 1 2 ![ 3 4 ] 5 ] evaluates to [ 1 2 3 4 5 ]. Consequently, the concatenation of two sequences e1 @ e2 can also be written [ !e1 !e2 ] or [ !e1 ; e2 ].
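
The constructions described above can be gathered in one sketch. The binding names s1 to s5 are hypothetical; every right-hand side uses only notation introduced in this section.

```cduce
let s1 = [ 1 2 3 4 ]
let s2 = (1,(2,(3,(4,`nil))))      (* the nested pairs that s1 stands for *)
let s3 = [ 1 2 3 ] @ [ 4 5 6 ]     (* concatenation *)
let s4 = [ 1 2 ![ 3 4 ] 5 ]        (* the inner sequence is opened in place *)
let s5 = [ !s1 ; s3 ]              (* another way to write s1 @ s3 *)
```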

Types and patterns

In CDuce, a sequence can be heterogeneous: the elements can all have different types. Types and patterns for sequences are specified by regular expressions over types or patterns. The syntax is [ R ] where R is a regular expression, which can be:

- A type or a pattern, which corresponds to a single element in the sequence (in particular, [ _ ] represents sequences of length 1, not arbitrary sequences).
- A juxtaposition of regular expressions R1 R2, which represents concatenation.
- A union of regular expressions R1|R2.
- A postfix repetition operator; the greedy operators are R?, R+, R*, and the ungreedy operators are R??, R+?, R*?. For types, there is no distinction in semantics between greedy and ungreedy.
- A sequence capture variable x::R (only for patterns, of course). The semantics is to capture in x the subsequence matched by R. The same sequence capture variable can appear several times inside a regular expression, including under repetition operators; in that case, all the corresponding subsequences are concatenated together. Two instances of the same sequence capture variable cannot be nested, as in [x :: (1 x :: Int)].
  Note the difference between [ x::Int ] and [ (x & Int) ]. Both accept sequences made of a single integer, but the first one binds x to a sequence (of a single integer), whereas the second one binds it to the integer itself.
- Grouping (R). E.g.: [ x::(Int Int) y ].
- Tail predicate /p. The type/pattern p applies to the current tail of the sequence (the subsequence starting at the current position) without consuming it. E.g.: [ (Int /(x:=1) | /(x:=2)) _* ] will bind x to 1 if the sequence starts with an integer and to 2 otherwise (notice the presence of _* to consume the tail of the sequence).
- Repetition R ** n where n is a positive integer constant, which is just a shorthand for the concatenation of n copies of R.

Sequence types and patterns also accept the [ ...; ... ] notation. This is a convenient way to discard the tail of a sequence in a pattern, e.g.: [ x::Int* ; _ ], which is equivalent to [ x::Int* _* ].
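
The following sketch exercises a few of these regular expression patterns. The function and binding names are hypothetical, and the examples are meant as an illustration of the rules above rather than as a reference.

```cduce
(* Sequence capture variables: x and y are bound to subsequences *)
let split_ints (s : [ Int* Char* ]) : ([ Int* ], [ Char* ]) =
  match s with [ x::Int* y::Char* ] -> (x, y)

(* Element capture with a discarded tail, plus a default branch *)
let first_or_zero (s : [ Any* ]) : Int =
  match s with
    [ (x & Int) ; _ ] -> x
  | _ -> 0

let as_seq  = match [ 5 ] with [ x::Int ] -> x       (* x is bound to a sequence *)
let as_elem = match [ 5 ] with [ (x & Int) ] -> x    (* x is bound to the integer itself *)
```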

It is possible to use the @ operator (sequence concatenation) on types, including in recursive definitions. E.g.:

type t = [ <a>(t @ t) ? ]    (* [s?] where s=<a>[ s? s? ] *)

type x = [ Int* ]
type y = x @ [ Char* ]       (* [ Int* Char* ] *)

type t = ([Int] @ t) | []    (* [ Int* ] *)

However, when used in recursive definitions, @ must be right-linear; for instance, the following definitions are not allowed:

type t = (t @ [Int]) | []      (* ERROR: Ill-formed concatenation loop *)
type t = t @ t                 (* ERROR: Ill-formed concatenation loop *)

Strings

In CDuce, character strings are nothing but sequences of characters. The type String is pre-defined as [ Char* ]. This makes it possible to use the full power of regular expression pattern matching with strings.

Inside a regular expression type or pattern, it is possible to use PCDATA instead of Char* (note that both are not types on their own, they only make sense inside square brackets, contrary to String).

The type Latin1 is the subtype of String defined as [ Byte* ]; it denotes strings that can be represented in the ISO-8859-1 encoding, that is, strings made only of characters from the Latin1 character set.

Several consecutive character literals in a sequence can be merged together between two single quotes: [ 'abc' ] instead of [ 'a' 'b' 'c' ]. It is also possible to avoid square brackets by using double quotes: "abc". The same escaping rules apply inside double quotes, except that single quotes may be escaped (but need not be), and double quotes must be.
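
Since strings are sequences of characters, sequence operators and patterns apply to them directly. A minimal sketch, with hypothetical names:

```cduce
let s1 = "abc"
let s2 = [ 'abc' ]                 (* same value as s1 *)
let s3 = [ 'a' 'b' 'c' ]           (* same value again, one literal per character *)

let greet (name : String) : String = "Hello, " @ name

(* First character of a string, or '?' for the empty string *)
let initial (s : String) : Char =
  match s with
    [ c ; _ ] -> c
  | _ -> '?'
```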

Records

Records are finite sets of (name,value) bindings. They are used in particular to represent XML attribute sets. Names are actually Qualified Names (see XML Namespaces).

The syntax of a record expression is { l1=e1; ...; ln=en } where the li are label names (same lexical conventions as for identifiers), and the ei are expressions. When an expression ei is simply a variable whose name matches the field label li, it is possible to omit it. E.g.: { x; y = 10; z } is equivalent to { x = x; y = 10; z = z }. The semi-colons between fields are optional.

There are two kinds of record types. Open record types are written { l1=t1; ...; ln=tn; .. }, and closed record types are written { l1 = t1; ...; ln = tn }. Both denote all the record values where the labels li are present and the associated values are in the corresponding type. The semi-colon between fields is optional. The distinction is that the open type allows extra fields, whereas the closed type gives a strict enumeration of the possible fields. As a consequence, {..} is the type of all record values.

Additionally, for both open and closed record types, it is possible to specify optional fields by using =? instead of = between a label and a type. For instance, { x =? Int; y = Bool } represents records with a y field of type Bool, an optional field x (which, when present, has type Int), and no other field.

The syntax is the same for patterns. Note that capture variables cannot appear in an optional field. A common idiom is to bind default values to replace missing optional fields: ({ x = a } | (a := 1)) & { y = b }. A special syntax makes this idiom more convenient: { x = a else (a:=1); y = b }.

As for record expressions, when the pattern is simply a capture variable whose name matches the field label, it is possible to omit it. E.g.: { x; y = b; z } is equivalent to { x = x; y = b; z = z }.
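
The record notations described so far can be sketched as follows. The names Point, HasName, p and sum_xy are hypothetical; the else form in the pattern is the special syntax mentioned above.

```cduce
type Point   = { x = Int; y = Int }        (* closed: exactly the fields x and y *)
type HasName = { name = String; .. }       (* open: extra fields are allowed *)

let x = 1
let p = ({ x; y = 10 } : Point)            (* { x; ... } abbreviates { x = x; ... } *)

(* x is optional in the argument type; a defaults to 0 when x is missing *)
let sum_xy (r : { x =? Int; y = Int }) : Int =
  match r with { x = a else (a := 0); y = b } -> a + b
```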

The + operator (record concatenation, with priority given to the right argument in case of overlapping) is available on record types and patterns. This operator can be used to make a closed record type/pattern open, or to add fields:

type t = { a=Int b=Char }
type s = t + {..}               (* { a=Int b=Char .. } *)
type u = s + { c=Float }        (* { a=Int b=Char c=Float .. } *)
type v = t + { c=Float }        (* { a=Int b=Char c=Float } *)

XML elements

In CDuce, the general form of an XML element is <(tag) (attr)>content where tag, attr and content are three expressions. Usually, tag is a tag literal `xxx, and in this case, instead of writing <(`tag)>, you can write: <tag>. Similarly, when attr is a record literal, you can omit the surrounding ({...}), and also the semicolon between attributes, e.g.: <a href="http://..." dir="ltr">[].

The syntax for XML elements types and patterns follows closely the syntax for expressions: <(tag) (attr)>content where tag, attr and content are three types or patterns. As for expressions, it is possible to simplify the notations for tags and attributes. For instance, <(`a) ({ href=String })>[] can be written: <a href=String>[].

The following sample shows several ways to write XML types.

type A = <a x=String y=String ..>[ A* ]
type B = <(`x | `y) ..>[ ]
type C = <c x = String; y = String>[ ]
type U = { x = String y =? String ..}
type V = [ W* ]
type W = <v (U)>V

AnyXML: the type of all XML values (it is equivalent to <(Atom) ({..})>[Any*]).
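
A hedged sketch of XML expressions, types and patterns used together. The type Link, the function target and the example.org URL are all hypothetical and only serve the illustration.

```cduce
type Link = <a href=String>[ PCDATA ]

(* Abbreviated notation: tag literal and attribute record written inline *)
let l1 = (<a href="http://example.org/">"home" : Link)

(* The same element in the general form <(tag) (attr)>content *)
let l2 = <(`a) ({ href = "http://example.org/" })>[ 'home' ]

(* An XML pattern that captures the href attribute and ignores the content *)
let target (l : Link) : String =
  match l with <a href=h>_ -> h
```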

Functions

CDuce is a higher-order functional language: functions are first-class values and, as such, they can be passed as arguments or returned as results, stored in data structures, etc.

A functional type has the form t -> s where t and s are types. Intuitively, this type corresponds to functions that accept (at least) any argument of type t, and for such an argument on which they terminate they return a value of type s. For instance, the type ((Int,Int) -> Int) & ((Char,Char) -> Char) denotes functions that map every pair of integers to an integer, and every pair of characters to a character.

The explanation above gives the intuition behind the interpretation of functional types. It is sufficient to understand which subtyping relations and equivalences hold between (Boolean combinations of) functional types. For instance, (Int -> Int) & (Char -> Char) is a subtype of (Int|Char) -> (Int|Char) because, with the intuition above, a function of the first type, when given a value of type Int|Char, returns a value of type Int or of type Char (depending on the argument).

Formally, the type t -> s denotes the set of all CDuce abstractions fun (t1 -> s1; ...; tn -> sn)... such that (t1 -> s1) & ... & (tn -> sn) is a subtype of t -> s.

Arrow: the type of all functions (it is equivalent to Empty -> Any).

Functional types have no counterpart in patterns.
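
The intersection-of-arrows reading can be illustrated with an overloaded function. This is a sketch: the names succ_or_same and general are hypothetical, and the interface syntax (t1 -> s1; t2 -> s2) is the one used in the formal statement above.

```cduce
(* Has type (Int -> Int) & (Char -> Char) *)
let succ_or_same (Int -> Int; Char -> Char)
    x & Int -> x + 1
  | c -> c

(* By subsumption, the same function can be used at the supertype
   (Int|Char) -> (Int|Char) *)
let general = (succ_or_same : (Int|Char) -> (Int|Char))
```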

References

References are mutable memory cells. CDuce has no built-in reference type. Instead, references are implemented in an object-oriented way. The type ref t denotes references to values of type t. It is only syntactic sugar for the type { get = [] -> t ; set = t -> [] }.

Currently polymorphic reference types are not supported. Therefore, CDuce accepts only reference types ref t where t is a closed type.
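
A minimal sketch of how a reference might be created and used. The constructs ref T e (creation), !e (dereferencing) and e1 := e2 (assignment) are not described on this page; they are assumed here from the Expressions part of the manual, and the name counter is hypothetical.

```cduce
let counter = ref Int 0                    (* a cell of type ref Int, initially 0 *)
let before  = !counter                     (* read the current content *)
let _       = (counter := (before + 1))    (* store a new integer in the cell *)
let after   = !counter                     (* read the cell again *)
```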

OCaml abstract types

The notation !t is used by the CDuce/OCaml interface to denote the OCaml abstract type t.

TO BE DONE

Type syntax outline

Below we give a brief and incomplete summary of the syntax of types. For the complete syntax, please refer to the preceding sections.

TYPES
  
 t ::= v                  singleton type (v is a scalar value) 
     | `s                 atom (s is a string)
     | Atom               all the atoms
     | Bool               booleans (equivalent to `true |`false)
     | Char               characters
     | c--d               Unicode intervals (c, d character literals)
     | Byte               Latin1 characters
     | Int                integers
     | i--j               integer intervals (i, j integer literals or *)

     | t -> t             functions
     | (t , t)            pairs
     | {l=t;...;l=t}      closed record types
     | {l=t;...;l=t; ..}  open record types
     | t + t              record type concatenation
     | <t t> t            XML types
     | [ R ]              sequences (where R is a type regexp)
     | t @ t              sequence concatenation
     | String             strings (equivalent to [Char*])
     | Latin1             Latin1 strings (equivalent to [Byte*])
     | ref t              references (equivalent to { get = []->t; set = t->[] } )

     
     | t & t              intersection
     | t | t              union
     | t \ t              difference
     | Empty              empty type
     | Any                top type

     | X                  type recursion variable
     | t where X = t      recursive types
           and X = t
                 :
           and X = t

     | 'a                 type variables (development version only)


REGULAR EXPRESSIONS ON TYPES
     
 R ::= R R                 concatenation
     | R|R                 union
     | R+                  one or more times    
     | R*                  zero or more times    
     | R?                  zero or one time

[1] You should be careful when putting parentheses around a type of the form *--i. Indeed, (*--i) would be parsed as a comment, because (* opens a comment. You have to put a whitespace after the left parenthesis.
