ℂDuce: Documentation: Papers and talks

## Papers

### Thesis

- Théorie, conception et réalisation d'un langage adapté à
XML. A. Frisch. Ph.D thesis (in French). See this page.
Abstract:
**(English)**This thesis describes the theoretical foundations of a type-safe and higher-order functional language, adapted to the manipulation of XML documents. The first part presents the semantical bases: type algebra with recursive types, boolean combination, arrow and product constructors; definition of a semantic subtyping relation via a set-theoretic notion of model for types; description of the functional kernel of the language, in particular its type system and its type-driven dynamic semantics. The second part focuses on the algorithmical aspects: computing the subtyping relation and compiling pattern matching with optimizations. The third part presents the CDuce language, built on top of the functional kernel, together with some of the original techniques used in its implementation.**(French)**Cette thèse décrit les fondements théoriques d'un langage de programmation fonctionnel d'ordre supérieur, typé, adapté à la manipulation de documents XML. La première partie présente les bases sémantiques: algèbre de types avec types récursifs, combinaisons boolénnes et constructeurs flèche et produit; définition d'une relation de sous-typage sémantique en passant par une notion de modèle ensembliste des types; présentation du noyau fonctionnel du langage, en particulier son système de types et sa sémantique dynamique dirigée par les types. La deuxième partie étudie les aspects algorithmiques: calcul de la relation de sous-typage et compilation optimisée du filtrage par motifs. La troisième partie présente le langage CDuce, construit au dessus du noyau fonctionnel, ainsi que certaines des techniques originales mises en oeuvre dans son implémentation. - Parametric Polymorphism for XML Processing Languages. Zhiwu Xu. Joint Paris Diderot University and Chinese Academy of Science Ph.D thesis. May 2013.
Abstract:
XML (eXtensible Markup Language) is a current standard format for exchanging semi-structured data, which has been applied to web services, database, research on formal methods, and so on. For a better processing of XML, recently there emerge many statically typed functional languages, such as XDuce, CDuce, XJ, XTatic, XACT, XHaskell, OCamlDuce and so on. But most of these languages lack parametric polymorphism or present it in a limited form. While parametric polymorphism is needed by XML processing, this is witnessed by the fact that it has repeatedly been requested to and discussed in various working groups of standards (e.g., RELAX NG and XQuery). We study in this thesis the techniques to extend parametric polymorphism into XML processing languages. Our solution consists of two parts: a definition of a polymorphic semantic subtyping relation and a definition of a polymorphic calculus. In the first part, we define and study a polymorphic semantic subtyping relation for a type system with recursive, product and arrow types and set-theoretic type connectives (i.e., intersection, union and negation). We introduce the notion of "convexity" on which our solution is built up and prove there exists at least one model that satisfies convexity. We also propose a sound, complete and terminating subtyping algorithm. The second part is devoted to the theoretical definition of a polymorphic calculus, which takes advance of the subtyping relation. The novelty of the polymorphic calculus is to decorate lambda-abstractions with sets of type-substitutions and to lazily propagate type-substitutions at the moment of the reduction. The second part also explores a semi-decidable local inference algorithm to infer the set of type-substitutions as well as the compilation of the polymorphic calculus into a variety of CDuce.

### Language Description

- CDuce: An XML-Centric General-Purpose Language. V. Benzaken, G. Castagna, and A. Frisch.
Proceedings of the
*ACM International Conference on Functional Programming*, 2003.Abstract:We present the functional language CDuce, discuss some design issues, and show its adequacy for working with XML documents. Distinctive features of CDuce are a powerful pattern matching, first class functions, overloaded functions, a very rich type system (arrows, sequences, pairs, records, intersections, unions, differences), precise type inference for patterns and error localization, and a natural interpretation of types as sets of values. We also outline some important implementation issues; in particular, a dispatch algorithm that demonstrates how static type information can be used to obtain very efficient compilation schemas.

- CDuce: a white paper. V. Benzaken, G. Castagna, and A. Frisch.
Workshop
*PLAN-X: Programming Language Technologies for XML*Pittsburgh PA, Oct. 2002.Abstract:Superseded by the previous paper

- A Full Pattern-based Paradigm for XML Query Processing. V. Benzaken, G. Castagna, and C. Miachon.
Proceedings of the
*7th International Symposium on Practical Aspects of Declarative Languages*, LNCS 3350, 2005.Abstract:In this article we investigate a novel execution paradigm---ML-like pattern-matching---for XML query processing and use it to extend the CDuce language with query capabilities. We show that such a paradigm is well adapted for a common and frequent set of queries and advocate that it constitutes a candidate for efficient execution of XML queries far better than the current XPath-based query mechanisms. We support our claim by comparing performances of XPath-based queries with pattern based ones, and by comparing the latter with the two efficiency-best XQuery processor we are aware of.

- Typed Iterators for XML. G. Castagna, and K. Nguyyễn.
Proceedings of the
*International Conference on Functionnal Programming ICFP*, 2008.Abstract:XML transformations are very sensitive to types: XML types describe the tags and attributes of XML elements as well as the number, kind, and order of their sub-elements. Therefore, operations, even simple ones, that modify these features may affect the types of documents. Operations on XML documents are performed by iterators that, to be useful, need to be typed by a kind of polymorphism that goes beyond what currently exists. For this reason these iterators are not programmed but, rather, hard-coded in the languages. However, this approach soon reaches its limits, as the hard-coded iterators cannot cover fairly standard usage scenarios. As a solution to this problem we propose a generic language to deﬁne iterators for XML data. This language can either be used as a compilation target (e.g., for XPATH) or it can be grafted on any statically typed host programming language (as long as this has product types) to endow it with XML processing capabilities. We show that our language mostly offers the required degree of polymorphism, study its formal properties, and show its expressiveness and practical impact by providing several usage examples and encodings.

### Theoretic Foundations

- Set-theoretic Foundation of Parametric Polymorphism and Subtyping.. G. Castagna, and Z. Xu. In ICFP '11: 16th ACM-SIGPLAN International Conference on Functional Programming, pag. 94-106, September, 2011. This paper received the CACM Research Highlights nomination by ACM SIGPLAN (see SIGPLAN Awards)Abstract: Working with XML data often yields to practical situations in which the programmer would need to assign parametric polymorphic types to higher-order functions. However, up to date, there is no satisfactory way to do it. The indirect purpose of this article is to define a system to remedy this lack. Its actual goal is to study parametric polymorphism for a type system with recursive, product, union, intersection, negation, and function types (the first three constructions are sufficient to encode XML types). We first recall why the definition of such a system was considered hard―when not impossible―and then present the main ideas at the basis of our solution. In particular, we introduce the notion of ``convexity'' on which our solution is built up and discuss its connections with parametricity as defined by Reynolds to whose study our work sheds new light..
- Semantic Subtyping: dealing set-theoretically with function, union, intersection, and negation types. A. Frisch, G. Castagna, and V. Benzaken. The Journal of the ACM, ACM Press. To appear. Abstract: Subtyping relations are usually defined either syntactically by a formal system or semantically by an interpretation of types into an untyped denotational model. This work shows how to define a subtyping relation semantically in the presence of Boolean connectives, functional types and dynamic dispatch on types, without the complexity of denotational models, and how to derive a complete subtyping algorithm.
- A Gentle Introduction to Semantic Subtyping. G. Castagna, and A. Frisch. Joint PPDP-ICALP '05 Keynote Talk. Proc. of PPDP 2005, ACM Press. Abstract:
Subtyping relations are usually defined either syntactically by a formal system or semantically by an interpretation of types into an untyped denotational model. In this work we show step by step how to define a subtyping relation semantically in the presence of functional types and dynamic dispatch on types, without the complexity of denotational models, and how to derive a complete subtyping algorithm.

The presentation is voluntarily kept informal and discursive and the technical details are reduced to a minimum since we rather insist on the motivations, the intuition, and the guidelines to apply the approach, the technical details being already published in the paper below.

- Semantic subtyping. A. Frisch, G. Castagna, and V. Benzaken. In
*LICS 2002*.Abstract:Usually subtyping relations are defined either syntactically by a formal system or semantically by an interpretation of types in an untyped denotational model. In this work we show how to define a subtyping relation semantically, for a language whose

*operational*semantics is*driven by types*; we consider a rich type algebra, with product, arrow, recursive, intersection, union and complement types. Our approach is to "bootstrap" the subtyping relation through a notion of set-theoretic model*of the type algebra*.The advantages of the semantic approach are manifold. Foremost we get "for free" many properties (e.g., the transitivity of subtyping) that, with axiomatized subtyping, would require tedious and error prone proofs. Equally important is that the semantic approach allows one to

*derive*complete algorithms for the subtyping relation or the propagation of types through patterns. As the subtyping relation has a natural (inasmuch as semantic) interpretation, the type system can give informative error messages when static type-checking fails. Last but not least the approach has an immediate impact in the definition*and the implementation*of languages manipulating XML documents, as this was our original motivation. - Semantic subtyping for the pi-calculus. G. Castagna, R. De Nicola, and D. Varacca.
*LICS 2005*.Abstract:Subtyping relations for the pi-calculus are usually defined in a syntactic way, by means of structural rules. We propose a semantic characterisation of channel types and use it to derive a subtyping relation. The type system we consider includes read-only and write-only channel types, product types, recursive types, as well as unions, intersections, and negations of types which are interpreted as the corresponding set-theoretic operations. We prove the decidability of the subtyping relation, formally describe the subtyping algorithm, and use the techniques developped for the decidability of subtyping to prove the decidability of the typing relation.

In order to fully exploit the expressiveness of the new type system (which subsumes several existing ones), we endow the pi-calculus with structured channels where communication is subjected to pattern matching that performs dynamic typecase. This paves the way toward a novel integration of functional and concurrent features, obtained by combining the pi-calculus with CDuce, a functional programming language for XML manipulation that is based on semantic subtyping.

- The Relevance of Semantic Subtyping. M. Dezani-Ciancaglini, A. Frisch, E. Giovannetti, and Y. Motohama.
In
*Intersection Types and Related Systems*Electronic Notes in Theoretical Computer Science 70 No.1 (2002).Abstract:We compare Meyer and Routley's minimal relevant logic B+ with the recent semantics-based approach to subtyping introduced by Frisch, Castagna and Benzaken in the definition of a type system with intersection and union. We show that - for the functional core of the system - such notion of subtyping, which is defined in purely set-theoretical terms, coincides with the relevant entailment of the logic B+.

- Types récursifs, combinaisons booléennes et fonctions
surchargées: application au typage de XML. A. Frisch. In french. Mémoire du DEA « Programmation: Sémantique,
Preuves et Langages » (Paris VII) Abstract:
Nous étudions un lambda-calcul typé avec une opération de filtrage qui permet d'exprimer des fonctions surchargées. L'algèbre de types a des types de base, les types produit et flèche, les types récursifs, les combinaisons booléennes finies arbitraires. Nous considérons une notion de sous-typage sémantique, issue de l'interprétation des types comme ensembles de valeurs du langage.

Les caractéristiques présentes dans le calcul et l'algèbre de types sont motivées par l'utilisation du calcul comme un noyau pour un langage adapté à la manipulation de documents XML.

Nous utilisons une approche sémantique pour étudier l'algèbre de types, tout en conservant une preuve syntaxique de préservation du typage par réduction.

### Implementation algorithms

- Regular tree language recognition with static information. A. Frisch. In
*TCS 2004*(Toulouse).Abstract:This paper presents our compilation strategy to produce efficient code for pattern matching in the CDuce compiler, taking into account static information provided by the type system.

- Greedy regular expression matching. A. Frisch, and L. Cardelli. In
*ICALP 2004*(Turku).Abstract:This paper studies the problem of matching sequences against regular expressions in order to produce structured values.

- Regular tree language recognition with static information. A. Frisch. In
*PLAN-X 2004*.Abstract:This paper presents our compilation strategy to produce efficient code for pattern matching in the CDuce compiler, taking into account static information provided by the type system. Indeed, this information allows in many cases to compute the result (that is, to decide which branch to consider) by looking only at a small fragment of the tree. Formally, we introduce a new kind of deterministic tree automata that can efficiently recognize regular tree languages with static information about the trees and we propose a compilation algorithm to produce these automata.

- Greedy regular expression matching. A. Frisch, and L. Cardelli. In
*PLAN-X 2004*.Abstract:This paper studies the problem of matching sequences against regular expressions in order to produce structured values. More specifically, we formalize in an abstract way a greedy disambiguation policy and propose efficient matching algorithms. We also formalize and address a folklore problem of non-termination in naive implementations of the greedy semantics.

Regular expression types and patterns have been introduced in the setting of XML-oriented functional languages. Traditionnaly, all the XML values and sequences share a common uniform runtime representation. Our work suggests an alternative implementation technique, where regular expression types define not only a set of abstract flat sequences, but also a custom structured representation for such values. This paves the way to a variety of language designs and implementations to integrate XML structural types in existing languages (class-based OO languages, imperative features, constrained runtime environment, ..).

### Security

- Security analysis for XML transformations. V. Benzaken, M. Burelle, and G. Castagna. In
*Asian '03*, LNCS, 2003.Abstract:In this article we give a formal definition of information flows in the context of XML transformations and, more generally, in the presence of type dependent computations. We formalize a sound technique to detect XML document transformations that may leak private or confidential information. By defining security annotations and by relating various kind of analyses to different query scenarios, we also establish a general framework for checking middleware-located information flows.

### Extensions

- Error Mining for Regular Expression Patterns. Giuseppe Castagna, Dario Colazzo, and Alain Frisch. In the
*9th Italian Conference On Theoretical Computer Science*, ICTCS 2005.Abstract: In the design of type systems for XML programming languages based on regular expression types and patterns the focus has been over result analysis, with the main aim of statically checking that a transformation always yields data of an expected output type. While being crucial for correct program composition, result analysis is not sufficient to guarantee that patterns used in the transformation are correct. In this paper we motivate the need of static detection of incorrect patterns, and provide a formal characterization based on pattern matching operational semantics, together with locally exact type analysis techniques to statically detect them.

- Parametric polymorphism for XML. H. Hosoya, A. Frisch, and G. Castagna. In The
*32nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages*, 2005.Abstract:Although several type systems have been investigated for XML, parametric polymorphism is rarely treated. This well-established typing discipline can also be useful in XML processing in particular for programs involving "parametric schemas," i.e., schemas parameterized over other schemas (e.g., SOAP). The difficulty in treating polymorphism for XML lies in how to extend the "semantic" approach used in the mainstream (monomorphic) XML type systems. A naive extension would be "semantic" quantification over all substitutions for type variables. However, this approach reduces to an NEXPTIME-complete problem for which no practical algorithm is known, and, moreover, the type system obtained in this way is characterized by a hardly useful expressiveness. In this paper, we propose a lighter-weight way of extending the semantic approach, where we interpret type variables as markings in data values for indicating the parameterized subparts. As a result, we can construct a sensible polymorphic type system both with a semantic flavor and a set of practical algorithms needed for typechecking. Most of these algorithms can be obtained by local modifications to existing ones for a monomorphic system.

- OCaml + XDuce. A. Frisch. ICFP 2006Abstract:
This paper presents the core type system and type inference algorithm of OCamlDuce, a merger between OCaml and XDuce. The challenge was to combine two type checkers of very different natures while preserving the best properties of both (principality and automatic type reconstruction on one side; very precise types and implicit subtyping on the other side). Type inference can be described by two successive passes: the first one is an ML-like unification-based algorithm which also extracts data flow constraints about XML values; the second one is an XDuce-like algorithm which computes XML types in a direct way. An optional preprocessing pass, called strengthening, can be added to allow more implicit use of XML subtyping. This pass is also very similar to an ML type checker.

## Slides

- A Gentle Introduction to Semantic Subtyping. G. Castagna. Joint PPDP-ICALP '05 Keynote Talk. Proc. of PPDP 2005, ACM Press.
- Théorie, conception et réalisation d'un langage adapté à XML. A. Frisch. Slides of the PhD defense (Dec. 2004). In French.
- CDuce: un langage fonctionnel pour manipuler des documents XML. A. Frisch. Slides of the presentation given at the IRIT Seminar (Dec. 2004). In French.
- CDuce: un aperçu. A. Frisch. Slides of the presentation given at the Cristal working group (Oct. 2004). In French.
- CDuce. A. Frisch, G. Castagna, and V. Benzaken. Slides of the presentation given at the second PSD Workshop (Feb. 2004)
- Interfacing CDuce and OCaml. G. Castagna, J. Demouth, and A. Frisch. Slides of the presentation given at the second PSD Workshop (Feb. 2004)
- CDuce. A. Frisch, G. Castagna, and V. Benzaken. Slides of the presentation given at ICFP03 (Aug. 2003)
- Semantic Subtyping. A. Frisch, G. Castagna, and V. Benzaken. Slides of a presentation given at LRI Database Group (2002-12)
- CDuce: a white paper. V. Benzaken, G. Castagna, and A. Frisch. Slides of the presentation given at the PLANX 2002 Workshop in Pittsburgh (2002-09)
- CDuce: a white paper. V. Benzaken, G. Castagna, and A. Frisch. Slides of the presentation given at MyThS meeting in Bertinoro (2002-12)
- CDuce: document security. V. Benzaken, M. Burelle, and G. Castagna. Slides of the presentation given at FOSAD school (september 2002)
- Regular tree language recognition with static information. A. Frisch. Slides of the presentation given at the PLANX 2004 Workshop in Venice (2004-01-13)
- Greedy regular expression matching. A. Frisch, and L. Cardelli. Slides of the presentation given at the PLANX 2004 Workshop in Venice (2004-01-13)
- OOP and XML. A. Frisch. Slides of my presentation for the ECOOP 2004 discussion panel on OOP and XML
- Greedy regular expression matching. A. Frisch, and L. Cardelli. Slides of the presentation given at the ICALP 2004 conference in Turku (2004-07-15)

ℂDuce: Documentation: Papers and talks