-
Notifications
You must be signed in to change notification settings - Fork 6
License
ocaml-community/ulex
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Ulex: a OCaml lexer generator for Unicode License: Copyright (C) 2003, 2004, 2005, 2006 Alain Frisch distributed under the terms of an MIT-like license : see LICENSE Author: email: [email protected] web: http://www.eleves.ens.fr/home/frisch, http://www.cduce.org -------------------------------------------------------------------------- Overview - ulex is a lexer generator. - it is implemented as an OCaml syntax extension: lexer specifications are embedded in regular OCaml code. - the lexers work with a new kind of "lexbuf" that supports Unicode; a single lexer can work with arbitrary encodings of the input stream. -------------------------------------------------------------------------- Lexer specifications ulex adds a new kind of expression to OCaml: lexer definitions. The syntax for the new construction is: lexer R1 -> e1 | R2 -> e2 ... | Rn -> en where the Ri are regular expressions and the ei are OCaml expressions (called actions). The keyword lexer can optionally be followed by a vertical var. The type of this expression if Ulexing.lexbuf -> t where t is the common type of all the actions. Unlike ocamllex, lexers work on stream of Unicode code points, not bytes. The actions have access to a variable "lexbuf", of type Ulexing.lexbuf. They can call function from the Ulexing module to extract (parts of) the matched lexeme, in the desired encoding. It is legal to define mutually recursive lexers with additional arguments: let rec lex1 x y = lexer 'a' -> lex2 0 1 lexbuf | _ -> ... and lex2 a b = lexer ... The syntax of regular expressions is derived from ocamllex. Additional features: - integer literals, where character literal are expected. They represent a Unicode code point. E.g.: [ 'a'-'z' 1000-1500 ] 65 - inside square brackets, a string represents the union of all its characters Note: OCaml source files are supposed to be encoded in Latin1. It is possible to define named regular expressions with the following construction, that can appear in place of of structure item: let regexp n = R where n is the regexp name to be defined. Because ulex is implemented as a syntax extension, it can deal with both original and revised syntax (and possibly others). Note that lexer specifications need not be named. Here is an example: (lexer ("#!" [^ '\n']* "\n")? -> ()) lexbuf -------------------------------------------------------------------------- Scoping rules for regular expressions declarations Regexp declarations look pretty-much like regular OCaml let-bindings. However, they can only be used as structure item (no local binding let regexp n = R in ...). Moreover, they don't respect OCaml scoping rule. Indeed, the lexical scope of a given "let regexp" is the whole source file following the declaration. Also the regexps names are not exported by a module, and you cannot use qualified names A.n (where n is a regexp defined in module A). -------------------------------------------------------------------------- Predefined regexps ulex provides a set of predefined regexps: - eof: the virtual end-of-file character - xml_letter, xml_digit, xml_extender, xml_base_char, xml_ideographic, xml_combining_char, xml_blank: as defined by the XML recommandation - tr8876_ident_char: characters names in identifiers from ISO TR8876 -------------------------------------------------------------------------- Running a lexer To run a lexer, you must call it on a Ulexing.lexbuf. Such an object represents a Unicode buffer. It can be created from Latin1-encoded or utf8-encoded strings, stream, or channels, or from integer arrays or streams (which represent Unicode code points). See the interface of the module Ulexing. There is also some support for parsing Utf-16 encoded streams and manipulating utf16 strings. See the interface of the module Utf16. It is possible to work with a custom implementation for lex buffers. To do this, you just have to ensure that a module called Ulexing is in scope of your lexer specifications. See custom_ulexing.ml in the distribution for an example, and the interface of Ulexing for a specification of what the module should export. -------------------------------------------------------------------------- Using ulex The first thing to do is to compile and install ulex. You need recent versions of OCaml. make all make all.opt (* optional *) 1. With findlib If you have findlib, you can use it to install and use ulex. The name of the findlib package is "ulex". Installation: make install Compilation of OCaml files with lexer specifications: ocamlfind ocamlc -c -package ulex -syntax camlp4o my_file.ml When linking, you must also include the ulex package: ocamlfind ocamlc -o my_prog -linkpkg -package ulex my_file.cmo 2. Without findlib You can use ulex without findlib. To compile, you need to run the source file through the Camlp4 syntax extension pa_ulex.cma. Moreover, you need to link the application with the runtime support library for ulex (ulexing.cma / ulexing.cmxa). -------------------------------------------------------------------------- Acknowledgments Thanks to Benus Becker for contributing an implementation of Utf16.
About
No description, website, or topics provided.
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published