module Ulexing: sig .. end
Runtime support for lexers generated by ulex.
This module is roughly equivalent to the module Lexing from the OCaml standard library, except that its lex buffers handle Unicode code points (OCaml type: int, in the range 0..0x10ffff) instead of bytes (OCaml type: char).
It is possible to have ulex-generated lexers work on a custom implementation of lex buffers. To do this, define a module L which implements the start, next, mark and backtrack functions (see the Internal Interface section below for a specification), and the Error exception. They need not work on a type named lexbuf: you can use any type name you want. Then, in your ulex-processed source, write the following before the first lexer specification:
module Ulexing = L
Of course, you'll probably want to define functions like lexeme to be used in the lexers' semantic actions.
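As an illustration of the custom lex buffer described above, here is a minimal sketch of such a module L. It reads from an in-memory Latin1 string, so each byte is one code point. The record layout, the type name t, and the of_string and lexeme helpers are assumptions made for this example; only start, next, mark, backtrack and Error are named by the text.

```ocaml
(* A hypothetical custom lex buffer over a Latin1 string.
   Only start, next, mark, backtrack and Error are required by the text;
   the type name t and the helpers of_string and lexeme are our own. *)
module L = struct
  exception Error

  type t = {
    src : string;              (* whole input, one byte per code point *)
    mutable start_pos : int;   (* where the current lexeme begins *)
    mutable pos : int;         (* current read position *)
    mutable marked_pos : int;  (* the "backtrack" position *)
    mutable marked_val : int;  (* the internal integer slot *)
  }

  let of_string s =
    { src = s; start_pos = 0; pos = 0; marked_pos = 0; marked_val = -1 }

  (* Begin a new lexeme at the current position and reset the slot. *)
  let start b =
    b.start_pos <- b.pos;
    b.marked_pos <- b.pos;
    b.marked_val <- (-1)

  (* Return the next code point, or -1 when the input is exhausted. *)
  let next b =
    if b.pos >= String.length b.src then -1
    else begin
      let c = Char.code b.src.[b.pos] in
      b.pos <- b.pos + 1;
      c
    end

  (* Remember the integer i together with the current position. *)
  let mark b i =
    b.marked_pos <- b.pos;
    b.marked_val <- i

  (* Go back to the remembered position and return the remembered integer. *)
  let backtrack b =
    b.pos <- b.marked_pos;
    b.marked_val

  (* Helper for semantic actions: the text between start and current position. *)
  let lexeme b = String.sub b.src b.start_pos (b.pos - b.start_pos)
end
```

With a module of this shape in scope, the line module Ulexing = L shown above would make the generated lexers use it in place of the default buffers.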
type lexbuf

exception Error
When a lexer raises this exception, Ulexing.lexeme_start (resp. Ulexing.lexeme_end) can be used to find the position of the first code point of the current matched substring (resp. of the first code point that yields the error).

exception InvalidCodepoint of int

val create : (int array -> int -> int -> int) -> lexbuf
Creates a lexer buffer that obtains its input from the given function. The function receives an array a, a position pos and a code point count n. It should put n code points or fewer in a, starting at position pos, and return the number of code points provided. A return value of 0 means end of input.

val from_stream : int Stream.t -> lexbuf
val from_int_array : int array -> lexbuf

val from_latin1_stream : char Stream.t -> lexbuf
The stream is read as Latin1, i.e. as a stream of Unicode code points in the range 0..255.

val from_latin1_channel : Pervasives.in_channel -> lexbuf
val from_latin1_string : string -> lexbuf
val from_utf8_stream : char Stream.t -> lexbuf
val from_utf8_channel : Pervasives.in_channel -> lexbuf
val from_utf8_string : string -> lexbuf
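The function expected by Ulexing.create has type int array -> int -> int -> int. A minimal sketch of one such function, fed from an in-memory array of code points, might look as follows. The name refill_of_array is a hypothetical helper chosen for this example and is not part of Ulexing.

```ocaml
(* Build a refill function that serves code points from an in-memory array.
   refill_of_array is a hypothetical name, not a Ulexing function. *)
let refill_of_array (src : int array) : int array -> int -> int -> int =
  let consumed = ref 0 in               (* how many code points were handed out *)
  fun a pos n ->
    let remaining = Array.length src - !consumed in
    let k = min n remaining in          (* provide n code points or fewer *)
    Array.blit src !consumed a pos k;   (* write them into a, starting at pos *)
    consumed := !consumed + k;
    k                                   (* 0 once the source is exhausted *)
```

Assuming ulex is installed, the result could then be passed to the constructor, for instance as Ulexing.create (refill_of_array [| 0x61; 0x62 |]).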

type enc =
| Ascii
| Latin1
| Utf8

val from_var_enc_stream : enc Pervasives.ref -> char Stream.t -> lexbuf
In Ascii mode, non-ASCII bytes (i.e. >127) in the stream raise an InvalidCodepoint exception.
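The Ascii rule above can be sketched as a per-byte check driven by a mutable encoding reference. This is only an illustration of the stated behaviour, not Ulexing's actual implementation. The enc type and the InvalidCodepoint exception are redeclared locally as stand-ins so that the block is self-contained, code_point_of_byte is a hypothetical name, and multi-byte UTF-8 decoding is deliberately left out.

```ocaml
(* Local stand-ins mirroring the declarations documented above. *)
exception InvalidCodepoint of int
type enc = Ascii | Latin1 | Utf8

(* Interpret one byte under the encoding currently stored in the reference.
   code_point_of_byte is a hypothetical helper, not a Ulexing function. *)
let code_point_of_byte (enc : enc ref) (c : char) : int =
  let b = Char.code c in
  match !enc with
  | Ascii -> if b > 127 then raise (InvalidCodepoint b) else b
  | Latin1 -> b  (* every byte is a code point in 0..255 *)
  | Utf8 ->
    if b < 128 then b
    else invalid_arg "multi-byte UTF-8 decoding is outside this sketch"
```

Because the encoding is read through the reference on every call, assigning a new value to it changes how later bytes are interpreted.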

val from_var_enc_string : enc Pervasives.ref -> string -> lexbuf
Same as Ulexing.from_var_enc_stream with a string as input.

val from_var_enc_channel : enc Pervasives.ref -> Pervasives.in_channel -> lexbuf
Same as Ulexing.from_var_enc_stream with a channel as input.

The following functions are applied to the argument lexbuf, which, in the code generated by ulex, is bound to the lexer buffer passed to the parsing function.
These functions can also be called when capturing a Ulexing.Error exception to retrieve the problematic string.

val lexeme_start : lexbuf -> int
Ulexing.lexeme_start lexbuf returns the offset in the input stream of the first code point of the matched string. The first code point of the stream has offset 0.

val lexeme_end : lexbuf -> int
Ulexing.lexeme_end lexbuf returns the offset in the input stream of the code point following the last code point of the matched string. The first code point of the stream has offset 0.

val loc : lexbuf -> int * int
Ulexing.loc lexbuf returns the pair (Ulexing.lexeme_start lexbuf, Ulexing.lexeme_end lexbuf).

val lexeme_length : lexbuf -> int
Ulexing.lexeme_length lexbuf returns the difference (Ulexing.lexeme_end lexbuf) - (Ulexing.lexeme_start lexbuf), that is, the length (in code points) of the matched string.

val lexeme : lexbuf -> int array
Ulexing.lexeme lexbuf returns the string matched by the regular expression as an array of Unicode code points.

val get_buf : lexbuf -> int array
val get_start : lexbuf -> int
val get_pos : lexbuf -> int

val lexeme_char : lexbuf -> int -> int
Ulexing.lexeme_char lexbuf pos returns code point number pos in the matched string.

val sub_lexeme : lexbuf -> int -> int -> int array
Ulexing.sub_lexeme lexbuf pos len returns a substring of the string matched by the regular expression as an array of Unicode code points.

val latin1_lexeme : lexbuf -> string
As Ulexing.lexeme, with a result encoded in Latin1. This function raises an InvalidCodepoint exception if it is not possible to encode the result in Latin1.

val latin1_sub_lexeme : lexbuf -> int -> int -> string
As Ulexing.sub_lexeme, with a result encoded in Latin1. This function raises an InvalidCodepoint exception if it is not possible to encode the result in Latin1.

val latin1_lexeme_char : lexbuf -> int -> char
As Ulexing.lexeme_char, with a result encoded in Latin1. This function raises an InvalidCodepoint exception if it is not possible to encode the result in Latin1.

val utf8_lexeme : lexbuf -> string
As Ulexing.lexeme, with a result encoded in UTF-8.

val utf8_sub_lexeme : lexbuf -> int -> int -> string
As Ulexing.sub_lexeme, with a result encoded in UTF-8.

val rollback : lexbuf -> unit
Ulexing.rollback lexbuf puts lexbuf back in its configuration before the last lexeme was matched. It is then possible to use another lexer to parse the same characters again. The other functions above in this section should not be used in the semantic action after a call to Ulexing.rollback.

Internal interface
The functions below are the ones called by lexers generated by ulex. The lexer buffers have a single internal slot that can store an integer. They also store a "backtrack" position.

val start : lexbuf -> unit
Ulexing.start lexbuf informs the lexer buffer that any code points up to the current position can be discarded. The current position becomes the "start" position as returned by Ulexing.lexeme_start. Moreover, the internal slot is set to -1 and the backtrack position is set to the current position.

val next : lexbuf -> int
Ulexing.next lexbuf extracts the next code point from the lexer buffer and increments the current position. If the input stream is exhausted, the function returns -1.

val mark : lexbuf -> int -> unit
Ulexing.mark lexbuf i stores the integer i in the internal slot. The backtrack position is set to the current position.

val backtrack : lexbuf -> int
Ulexing.backtrack lexbuf returns the value stored in the internal slot of the buffer, and performs backtracking (the current position is set to the value of the backtrack position).
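
To show how start, next, mark and backtrack fit together, here is a sketch of a hand-written longest-match recogniser built only on those four functions. It is not the code that ulex emits. The BUFFER signature, the Number functor, the ArrayBuf stand-in buffer and the token numbering (0 for an integer, 1 for a decimal, -1 for no match) are all assumptions made for this example.

```ocaml
(* The four internal functions, as a signature (hypothetical name). *)
module type BUFFER = sig
  type lexbuf
  val start : lexbuf -> unit
  val next : lexbuf -> int
  val mark : lexbuf -> int -> unit
  val backtrack : lexbuf -> int
end

(* Longest-match recogniser for  digit+ ('.' digit+)?
   Returns 0 for an integer, 1 for a decimal, -1 if nothing matches. *)
module Number (B : BUFFER) = struct
  let is_digit c = c >= Char.code '0' && c <= Char.code '9'

  let token buf =
    B.start buf;                     (* slot = -1, backtrack position = start *)
    let rec digits () =
      B.mark buf 0;                  (* accepting: an integer ends here *)
      let c = B.next buf in
      if is_digit c then digits ()
      else if c = Char.code '.' then after_dot ()
      else B.backtrack buf
    and after_dot () =               (* a '.' must be followed by a digit *)
      if is_digit (B.next buf) then fraction () else B.backtrack buf
    and fraction () =
      B.mark buf 1;                  (* accepting: a decimal ends here *)
      if is_digit (B.next buf) then fraction () else B.backtrack buf
    in
    if is_digit (B.next buf) then digits () else B.backtrack buf
end

(* A small stand-in buffer over an int array, for demonstration only. *)
module ArrayBuf = struct
  type lexbuf = {
    src : int array;
    mutable pos : int;
    mutable bt_pos : int;
    mutable slot : int;
  }
  let of_string s =
    { src = Array.init (String.length s) (fun i -> Char.code s.[i]);
      pos = 0; bt_pos = 0; slot = -1 }
  let start b = b.bt_pos <- b.pos; b.slot <- (-1)
  let next b =
    if b.pos >= Array.length b.src then -1
    else begin
      let c = b.src.(b.pos) in
      b.pos <- b.pos + 1;
      c
    end
  let mark b i = b.slot <- i; b.bt_pos <- b.pos
  let backtrack b = b.pos <- b.bt_pos; b.slot
end

module N = Number (ArrayBuf)
```

On the input "12.x" the recogniser reads past the dot, fails to find a digit, and backtracks to the position marked after "12". It therefore returns 0, and the current position is left just after the matched integer.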