Home My Page Projects Code Snippets Project Openings SML/NJ
Summary Activity Forums Tracker Lists Tasks Docs Surveys News SCM Files

SCM Repository

[smlnj] View of /smlnj-lib/trunk/Doc/src/RegExp/tutorial.adoc
ViewVC logotype

View of /smlnj-lib/trunk/Doc/src/RegExp/tutorial.adoc

Parent Directory Parent Directory | Revision Log Revision Log

Revision 7542 - (download) (annotate)
Mon Jul 4 19:50:50 2022 UTC (5 weeks, 4 days ago) by jhr
File size: 3690 byte(s)
documentation improvements
= Using the RegExp Library
:Author: John Reppy
:Date: {release-date}
:stem: latexmath
:source-highlighter: pygments
:VERSION: {smlnj-version}

== Introduction

The *RegExp Library* is designed for flexibility; it allows mixing and
matching of different front-end syntax with back-end engines, as well
as supporting arbitrary input sources.
This flexibility, however, comes at the cost of making some of the
simple applications a bit less obvious.  This tutorial shows how the
*RegExp Library* can be used in a variety of common applications.

== Assembling an Regular Expression Matcher

Before we can do anything else, we must assemble a regular-expression
matcher.  For the purposes of this tutorial, we use a combination
of the `AwkSyntax` front-end and the `BackTrackEngine` back-end.

structure RE = RegExpFn(
    structure P = AwkSyntax
    structure E = BackTrackEngine)

== Match trees

Regular expressions may contain xref:str-RegExpSyntax.adoc#con:Group[grouping]
operators.  When a pattern matches a string, these groups induce a nested
tree structure on the matched string.
The xref:str-MatchTree.adoc[``MatchTree``] structure defines
a polymorphic representation of this structure, along with a
number of utility functions for extracting information from
a match.

structure MT = MatchTree

== Example: scanning tokens

The `match` function in the `REGEXP` signature allows one to distinguish
between a set of possible regular expression matches.  One application of
this mechanism is a simple scanner.  Let us define a datatype for tokens,
which can be white space, numbers, or identifiers.

datatype tok
  = WS | NUM of IntInf.int | ID of string

We can then define the scanner as follows:

fun scanner getc gets = let
      fun getMatch cons (MT.Match({pos, len}, _)) = cons (gets (pos, len))
        RE.match [
            ("[ \t\n]+", getMatch (fn _ => WS)),
            ("[0-9]+", getMatch (fn s => NUM(valOf(IntInf.fromString s)))),
            ("[a-zA-Z][a-zA-Z0-9]*", getMatch ID)
          ] getc

Here the `getc` parameter is the standard character reader; we have also included
the `gets` parameter, which is a function of type

'strm * int -> string

for getting a string from a stream.  For many input sources, the `gets` function
has an efficient and direct implementation, but it can also be implemented in
terms of the `getc` function as follows:

fun gets getc (strm, n) = let
      fun getChrs (0, _, chrs) = String.implodeRev chrs
        | getChrs (n, strm, chrs) = (case getc strm
             of NONE => raise Fail "empty stream"
              | SOME(c, strm) => getChrs (n-1, strm, c::chrs)
            (* end case *))
        getChrs (n, strm, [])

Because this function is only called *after* the `scanner` function has matched
a sequence of `n` characters from `strm`, the `"empty stream"` case will not
occur for well behaving input streams.

Here is an example of using the scanner to tokenize strings, where we use the
*Basis Library* substring type to implement the stream type:

fun tokens s = let
      fun gets (ss, n) = Substring.string(Substring.slice (ss, 0, SOME n))
      val scan = scanner Substring.getc gets
      fun lp (ss, toks) = (case scan ss
             of SOME(tok, ss') => lp (ss', tok::toks)
              | NONE => List.rev toks
            (* end case *))
        lp (Substring.full s, [])

ViewVC Help
Powered by ViewVC 1.0.0