\documentclass[titlepage]{article}
\usepackage{times}
\usepackage{epsfig}

\marginparwidth0pt\oddsidemargin0pt\evensidemargin0pt\marginparsep0pt
\topmargin0pt\advance\topmargin by-\headheight\advance\topmargin by-\headsep
\textwidth6.7in\textheight9.1in %\renewcommand{\baselinestretch}{1.2}
\columnsep0.25in

\author{Matthias Blume \\
Research Institute for Mathematical Sciences \\
Kyoto University}

\title{{\bf CM}\\
The SML/NJ Compilation and Library Manager \\
{\it\small (for SML/NJ version 110.26.3 and later)} \\
User Manual}

\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 3pt minus 2pt}

\newcommand{\nt}[1]{{\it #1}}
\newcommand{\tl}[1]{{\underline{\bf #1}}}
\newcommand{\ttl}[1]{{\underline{\tt #1}}}
\newcommand{\ar}{$\rightarrow$\ }
\newcommand{\vb}{~$|$~}

\begin{document}

\bibliographystyle{alpha}

\maketitle

\pagebreak

\tableofcontents

\pagebreak

\section{Introduction}

This manual describes a new implementation of CM, the ``Compilation
and Library Manager'' for Standard ML of New Jersey (SML/NJ).  Like its
previous version, CM is in charge of managing separate compilation and
facilitates access to stable libraries.

Programming projects that use CM are typically composed of separate
{\em libraries}.  Libraries are collections of ML compilation units
and themselves can be internally sub-structured using CM's notion of
{\em groups}.  Using libraries and groups, programs can be viewed as a
{\em hierarchy of modules}.  The organization of large projects tends
to benefit from this approach~\cite{blume:appel:cm99}.

CM uses {\em cutoff} techniques~\cite{tichy94} to minimize
recompilation work and provides automatic dependency analysis to free
the programmer from having to specify a detailed module dependency
graph by hand~\cite{blume:depend99}.

This new version of CM emphasizes {\em working with libraries}.  This
contrasts with the previous implementation where the focus was on
compilation management while libraries were added as an afterthought.
From now on, CM takes a very library-centric view of the world.  In
fact, the implementation of SML/NJ itself has been restructured to
reflect this approach.

\section{The CM model}

A CM library is a (possibly empty) collection of ML source files and
may also contain references to other libraries.  Each library comes
with an explicit export interface which lists all toplevel-defined
symbols of the library that shall be exported to its clients.  A
library is described by the contents of its {\em description file}.

\noindent Example:

\begin{verbatim}
Library
    signature BAR
    structure Foo
is
    bar.sig
    foo.sml
    helper.sml
    basis.cm
\end{verbatim}

This library exports two definitions, one for a structure named {\tt
Foo} and one for a signature named {\tt BAR}.  The specifications for
such exports appear between the keywords {\tt Library} and {\tt is}.
The {\em members} of the library are specified after the keyword {\tt
is}.  Here we have three ML source files ({\tt bar.sig}, {\tt
foo.sml}, and {\tt helper.sml}) as well as a reference to one external
library ({\tt basis.cm}).  The entry {\tt basis.cm} typically denotes
the description file for the {\it Standard ML Basis
Library}~\cite{reppy99:basis}; most programs will want to list it in
their own description file(s).

\subsection{Library descriptions}

Members of a library do not have to be listed in any particular order
since CM will automatically calculate the dependency graph.  Some
minor restrictions on the source language are necessary to make this
work:
\begin{enumerate}
\item All top-level definitions must be {\em module} definitions
(structures, signatures, functors, or functor signatures).  In other
words, there can be no top-level type-, value-, or infix-definitions.
\item For a given symbol, there can be at most one ML source file per
library (or---more correctly---one file per library component; see
Section~\ref{sec:groups}) that defines the symbol at top level.
\item If more than one sub-library or sub-group is exporting the same
symbol, then the definition (i.e., the ML source file that actually
defines the symbol) must be identical in all cases.
\label{rule:diamond}
\item The use of ML's {\bf open} construct is not permitted at top
level.
\end{enumerate}

Note that these rules do not require the exports of sub-groups or
sub-libraries to be distinct from the exports of ML source files.  If
an ML source file re-defines an imported name, then the disambiguating
rule is that the definition from the ML source takes precedence over
the definition imported from the group or library.

Rule~\ref{rule:diamond} may come as a bit of a surprise considering
that each ML source file can be a member of at most one group or
library (see section~\ref{sec:multioccur}).  However, it is indeed
possible for two libraries to export the ``same'' definition provided
they both import that definition from a third library.  For example,
let us assume that {\tt a.cm} exports a structure {\tt X} which was
defined in {\tt x.sml}---one of {\tt a.cm}'s members.  Now, if both
{\tt b.cm} and {\tt c.cm} re-export that same structure {\tt X} after
importing it from {\tt a.cm}, it is legal for a fourth library {\tt
d.cm} to import from both {\tt b.cm} and {\tt c.cm}.
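
To make this concrete, here is a sketch of the four description files
in this scenario.  The member {\tt y.sml} and the export {\tt Y} of
{\tt d.cm} are invented for illustration; only the relationship
between the libraries matters:

\begin{verbatim}
(* a.cm *)
Library
    structure X
is
    x.sml
    basis.cm

(* b.cm: re-exports X, which it imports from a.cm *)
Library
    structure X
is
    a.cm

(* c.cm: likewise re-exports X imported from a.cm *)
Library
    structure X
is
    a.cm

(* d.cm: may legally import from both b.cm and c.cm *)
Library
    structure Y
is
    y.sml
    b.cm
    c.cm
    basis.cm
\end{verbatim}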

The full syntax for library description files also includes provisions
for a simple ``conditional compilation'' facility (see
Section~\ref{sec:preproc}), for access control (see
Section~\ref{sec:access}), and accepts ML-style nestable comments
delimited by \verb|(*| and \verb|*)|.

\subsection{Name visibility}

In general, all definitions exported from members of a library are
visible in all ML source files of that library.  The source code in
those source files can refer to them directly without further
qualification.  Here, ``exported'' means either a top-level definition
within an ML source file or a definition listed in a (sub-)library's
export list.

If a library is structured into library components using {\em groups}
(see Section~\ref{sec:groups}), then---as far as name visibility is
concerned---each component (group) is treated like a separate library.

Cyclic dependencies among libraries, library components, or ML source
files within a library are detected and flagged as errors.

\subsection{Groups}
\label{sec:groups}

CM's group model eliminates a whole class of potential naming problems
by providing control over name spaces for program linkage.  The group
model in full generality sometimes requires bindings to be renamed at
the time of import. As has been described
separately~\cite{blume:appel:cm99}, in the case of ML this can also be
achieved using ``administrative'' libraries, which is why CM can get
away with not providing more direct support for renaming.

However, under CM, the term ``library'' refers not only to namespace
management (as it would from the point of view of the pure group
model) but also to actual file system objects.  It would be
inconvenient if name resolution problems resulted in a
proliferation of additional library files.  Therefore, CM also
provides the notion of groups (or: ``library components'').  Name
resolution for groups works like name resolution for entire libraries,
but grouping is entirely internal to each library.

During development, each group has its own description file which will
be referred to by the surrounding library or other components thereof.
The syntax of group description files is the same as that of library
description files with the following exceptions:

\begin{itemize}
\item The initial keyword {\tt Library} is replaced with {\tt Group}.
It is followed by the name of the surrounding library's description
file in parentheses.
\item The export list can be left empty, in which case CM will
provide a default export list: all exports from ML source files plus
all exports from sub-components of the component.  (Note that this does
not include the exports of other libraries.)
\item There are some small restrictions on access control
specifications (see Section~\ref{sec:access}).
\end{itemize}

As an example, let us assume that {\tt foo-utils.cm} contains the
following text:

\begin{verbatim}
Group (foo-lib.cm)
is
    set-util.sml
    map-util.sml
    basis.cm
\end{verbatim}

Here, the library description file {\tt foo-lib.cm} would list {\tt
foo-utils.cm} as one of its members:

\begin{verbatim}
Library
    signature FOO
    structure Foo
is
    foo.sig
    foo.sml
    foo-utils.cm
    basis.cm
\end{verbatim}

\subsection{Multiple occurrences of the same member}
\label{sec:multioccur}

The following rules apply to multiple occurrences of the same ML source
file, the same library, or the same group within a program:

\begin{itemize}
\item Within the same description file, each member can be specified
at most once.
\item Libraries can be referred to freely from as many other groups or
libraries as the programmer desires.
\item A group cannot be used from outside the (uniquely defined)
library that it is a component of.  However, within that library it
can be referred to from arbitrarily many other groups.
\item The same ML source file cannot appear more than once.  If an ML
source file is to be referred to by multiple clients, it must first be
``wrapped'' into a library (or---if all references are from within the
same library---a group).
\end{itemize}

\subsection{Top-level groups}

Mainly to facilitate some superficial backward-compatibility, CM also
allows groups to appear at top level, i.e., outside of any library.
Such groups must omit the parenthetical library specification and then
cannot also be used within libraries. One could think of the top level
itself as a ``virtual unnamed library'' whose components are these
top-level groups.
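
For example, a top-level group description (with invented member
names) looks just like an ordinary group description, minus the
parenthesized library name:

\begin{verbatim}
Group
    structure Main
is
    main.sml
    basis.cm
\end{verbatim}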

\section{Naming objects in the file system}

\subsection{Motivation}

File naming has been an area notorious for its problems and was the
cause of most of the gripes from CM's users.  With this in mind, CM
now takes a different approach to file name resolution.

The main difficulty lies in the fact that files or even whole
directories may move after CM has already partially (but not fully)
processed them.  For example, this happens when the {\em autoloader}
(see Section~\ref{sec:autoload}) has been invoked and the session
(including CM's internal state) is then frozen (i.e., saved to a file)
via {\tt SMLofNJ.exportML}.  The new CM is now able to resume such a
session even when operating in a different environment, perhaps on a
different machine with different file systems mounted, or with the
SML/NJ installation in a different location.

To make this possible, CM provides a configurable mechanism for
locating file system objects.  Moreover, it always invokes this
mechanism as late as possible and is prepared to re-invoke it after the
configuration changes.

\subsection{Basic rules}

CM uses its own ``standard'' syntax for pathnames which happens to be
the same as the one used by most Unix-like systems: path name
components are separated by ``{\bf /}'', paths beginning with ``{\bf
/}'' are considered {\em absolute} while other paths are {\em
relative}.

Since this standard syntax does not cover system-specific aspects such
as volume names, it is also possible to revert to ``native'' syntax by
enclosing the name in double-quotes.  Of course, description files
that use path names in native syntax are not portable across operating
systems.

Absolute pathnames are resolved in the usual manner specific to the
operating system.  However, it is advisable to avoid absolute
pathnames because they are certain to ``break'' if the corresponding
file moves to a different location.

The resolution of relative pathnames is more complicated:

\begin{itemize}
\item If the first component of a relative pathname is a
``configuration anchor'' (see Section~\ref{sec:anchors}), then we call
the path {\em anchored}.  In this case the
whole name will be resolved relative to the value associated with that
anchor.  For example, if the path is {\tt foo/bar/baz} and {\tt
foo} is known as an anchor mapped to {\tt /usr/local}, then the
full name of the actual file system object referred to is {\tt
/usr/local/foo/bar/baz}. Note that the {\tt foo} component is not
stripped away during the resolution process; different anchors that
map to the same directory still remain different.
\item Otherwise, if the relative name appears in some description file
whose name is {\it path}{\tt /}{\it file}{\tt .cm}, then it will be
resolved relative to {\it path}, i.e., relative to the directory that
contains the description file.
\item If a non-anchored relative path is entered interactively, for
example as an argument to one of CM's interface functions, then it
will be resolved in the OS-specific manner, i.e., relative to the
current working directory.  However, CM will internally represent the
name in such a way that it remembers the corresponding working
directory.  Should the working directory change during an ongoing CM
session while there still is a reference to the name, then CM will
switch its mode of operation and prepend the path of the original
working directory. As a result, two names specified using identical
strings but at different times when different working directories were
in effect will be kept distinct and continue to refer to the file
system location that they referred to when they were first seen.
\end{itemize}

\subsection{Anchor configuration}
\label{sec:anchors}

The association of path name anchors with their corresponding
directory names is a simple one-way mapping.  At startup time, this
mapping is initialized by reading two configuration files: an
installation-specific one and a user-specific one.  After that, the
mapping can be maintained using CM's interface functions {\tt
CM.Anchor.anchor} and {\tt CM.Anchor.reset} (see
Section~\ref{sec:api}).

The default location of the installation-specific configuration file
is {\tt /usr/lib/smlnj-pathconfig}.  However, normally this default
gets replaced (via an environment variable named {\tt
CM\_PATHCONFIG\_DEFAULT}) at installation time by a path pointing to
wherever the installation actually puts the configuration file.
The user can specify a new location at startup time using the
environment variable {\tt CM\_PATHCONFIG}.

The default location of the user-specific configuration file is {\tt
.smlnj-pathconfig} in the user's home directory (which must be given
by the {\tt HOME} environment variable).  At startup time, this
default can be overridden by a fixed location which must be given as
the value of the environment variable {\tt CM\_LOCAL\_PATHCONFIG}.

The syntax of all configuration files is identical.  Lines are
processed from top to bottom. White space divides lines into tokens.
\begin{itemize}
\item A line with exactly two tokens associates an anchor (the first
token) with a directory in native syntax (the second token).  Neither
anchor nor directory name may contain white space and the anchor
should not contain a {\bf /}.  If the directory name is a relative
name, then it will be expanded by prepending the name of the directory
that contains the configuration file.
\item A line containing exactly one token that is the name of an
anchor cancels any existing association of that anchor with a
directory.
\item A line with a single token that consists of a single minus sign
{\bf -} cancels all existing anchors.  This typically makes sense only
at the beginning of the user-specific configuration file and
erases any settings that were made by the installation-specific
configuration file.
\item Lines with no token (i.e., empty lines) will be silently ignored.
\item Any other line is considered malformed and will cause a warning
but will otherwise be ignored.
\end{itemize}
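
For illustration, a hypothetical user-specific {\tt .smlnj-pathconfig}
file (anchor names and directories are made up) could read:

\begin{verbatim}
-
foo /usr/local
mylib sml/lib/mylib
oldlib
\end{verbatim}

\noindent The first line discards all anchors set by the
installation-specific file, the next two lines bind {\tt foo} and {\tt
mylib} (the latter relative to the directory containing the
configuration file, i.e., the home directory), and the last line
cancels any existing binding for {\tt oldlib}.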

\section{Using CM}

\subsection{Structure CM}
\label{sec:api}

Functions that control CM's operation are accessible as members of a
structure named {\tt CM}.  This structure itself is exported from a
library called {\tt smlnj/cm/full.cm} (or, alternatively, {\tt
smlnj/cm.cm}).  Other libraries can exploit CM's functionality simply
by putting a {\tt smlnj/cm/full.cm} entry into their own description file.
Section~\ref{sec:dynlink} shows one interesting use of this feature.

Initially, only a ``minimal'' version of structure {\tt CM} (exported
by library {\tt smlnj/cm/minimal.cm}) is pre-registered at the interactive
prompt.  To make the full version of structure {\tt CM} available, one
must explicitly load {\tt smlnj/cm/full.cm} using {\tt CM.autoload} or {\tt
CM.make}, both of which are also available in the minimal version.
(The minimal structure {\tt CM} contains four members: {\tt
CM.recomp}, {\tt CM.stabilize}, {\tt CM.make}, and {\tt CM.autoload}.)

Here is a description of all members:

\subsubsection*{Compiling}

The two main activities when using CM are compiling ML source code and
building stable libraries:

\begin{verbatim}
  val recomp : string -> bool
  val stabilize : bool -> string -> bool
\end{verbatim}

{\tt CM.recomp} takes the name of a program's ``root'' description
file and compiles or recompiles all ML source files that are necessary
to provide definitions for the root library's export list.

{\tt CM.stabilize} takes a boolean flag and then the name of a library
and {\em stabilizes} this library.  A library is stabilized by writing
all information pertaining to it (including all of its library
components) into a single file.  Later, when the library is used in
other programs, all members of the library are guaranteed to be
up-to-date; no dependency analysis work and no recompilation work will
be necessary.  If the boolean flag is {\tt false}, then all
sub-libraries of the library must already be stable.  If the flag is
{\tt true}, then CM will recursively stabilize all libraries reachable
from the given root.

After a library has been stabilized it can be used even if none of its
original sources---including the description file---are present.

The boolean result of {\tt CM.recomp} and {\tt CM.stabilize} indicates
success or failure of the operation ({\tt true} = success).
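
For example, at the interactive prompt one might type the following
(the description file names are, of course, made up):

\begin{verbatim}
  CM.recomp "sources.cm";        (* (re)compile what the exports need *)
  CM.stabilize false "mylib.cm"; (* sub-libraries must already be stable *)
  CM.stabilize true "mylib.cm";  (* stabilize recursively *)
\end{verbatim}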

\subsubsection*{Linking}

In SML/NJ, linking means executing top-level code of each compilation
unit.  The resulting bindings can then be registered at the interactive top
level.

\begin{verbatim}
  val make : string -> bool
  val autoload : string -> bool
\end{verbatim}

{\tt CM.make} first acts like {\tt CM.recomp}.  If the (re-)compilation
is successful, then it proceeds by linking all modules.  Provided
there are no link-time errors, it finally introduces new bindings at
top level.

During the course of the same {\tt CM.make}, the code of each
compilation module will be executed at most once.  Code in units that
are marked as {\it private} (see Section~\ref{sec:sharing}) will be
executed exactly once.  Code in other units will be executed only if
the unit has been recompiled since it was executed last time or if it
depends on another compilation unit whose code has been executed
since.

In effect, different invocations of {\tt CM.make} (and {\tt
CM.autoload}) will share dynamic state created at link time as much as
possible unless the compilation units in question have been explicitly
marked private.

{\tt CM.autoload} acts like {\tt CM.make}, only ``lazily''. See
Section~\ref{sec:autoload} for more information.

As before, the result of {\tt CM.make} indicates success or failure of
the operation.  The result of {\tt CM.autoload} indicates success or
failure of the {\em registration}.  (It does not know yet whether
loading will actually succeed.)
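
A typical interactive use (again with a made-up description file name)
simply inspects the boolean result:

\begin{verbatim}
  if CM.make "sources.cm" then
      print "sources.cm compiled and linked\n"
  else
      print "compilation or linking failed\n";
\end{verbatim}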

\subsubsection*{Registers}

Several internal registers control the operation of CM.  A register of
type $T$ is accessible via a variable of type $T$ {\tt controller},
i.e., a pair of {\tt get} and {\tt set} functions.  Any invocation of
the corresponding {\tt get} function reads the current value of the
register.  An invocation of the {\tt set} function replaces the
current value with the argument given to {\tt set}.

Controllers are members of {\tt CM.Control}, a sub-structure of
structure {\tt CM}.

\begin{verbatim}
  type 'a controller = { get: unit -> 'a, set: 'a -> unit }
  structure Control : sig
    val verbose : bool controller
    val debug : bool controller
    val keep_going : bool controller
    val parse_caching : int controller
    val warn_obsolete : bool controller
    val conserve_memory : bool controller
  end
\end{verbatim}

{\tt CM.Control.verbose} can be used to turn off CM's progress
messages.  The default is {\em true} and can be overridden at startup
time by the environment variable {\tt CM\_VERBOSE}.

In the case of a compile-time error, {\tt CM.Control.keep\_going}
instructs the {\tt CM.recomp} phase to continue working on parts of
the dependency graph that are not related to the error.  (This does
not work for outright syntax errors because a correct parse is needed
before CM can construct the dependency graph.)  The default is {\em
false} and can be overridden at startup by the environment variable
{\tt CM\_KEEP\_GOING}.

{\tt CM.Control.parse\_caching} sets a limit on how many parse trees
are cached in main memory.  In certain cases CM must parse source
files in order to be able to calculate the dependency graph.  Later,
the same files may need to be compiled, in which case an existing
parse tree saves the time to parse the file again.  Keeping parse
trees can be expensive in terms of memory usage.  Moreover, CM makes
special efforts to avoid re-parsing files in the first place unless
they have actually been modified.  Therefore, it may not make much
sense to set this value very high.  The default is {\em 100} and can
be overridden at startup time by the environment variable {\tt
CM\_PARSE\_CACHING}.

This version of CM uses an ML-inspired syntax for expressions in its
conditional compilation subsystem (see Section~\ref{sec:preproc}).
However, for the time being it will accept most of the original
C-inspired expressions but produces a warning for each occurrence of
an old-style operator. {\tt CM.Control.warn\_obsolete} can be used to
turn these warnings off. The default is {\em true} and can be
overridden at startup time by the environment variable {\tt
CM\_WARN\_OBSOLETE}.

{\tt CM.Control.debug} can be used to turn on debug mode.  This
currently has the effect of dumping a trace of the master-slave
protocol for parallel and distributed compilation (see
Section~\ref{sec:parmake}) to {\tt TextIO.stdOut}. The default is {\em
false} and can be overridden at startup time by the environment
variable {\tt CM\_DEBUG}.

Using {\tt CM.Control.conserve\_memory}, CM can be told to be slightly
more conservative with its use of main memory at the expense of
occasionally incurring additional input from stable library files.
This does not save very much and, therefore, is normally turned off.
The default ({\em false}) can be overridden at startup by the
environment variable {\tt CM\_CONSERVE\_MEMORY}.
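
Since controllers are ordinary {\tt get}-{\tt set}-pairs, they can be
used directly at the interactive prompt.  For instance:

\begin{verbatim}
  #get (CM.Control.verbose) ();        (* query the current setting *)
  #set (CM.Control.verbose) false;     (* silence progress messages *)
  #set (CM.Control.parse_caching) 10;  (* keep at most 10 parse trees *)
\end{verbatim}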

\subsubsection*{Path anchors}

Structure {\tt CM} also provides functions to explicitly manipulate
the path anchor configuration.  These functions are members of
structure {\tt CM.Anchor}.

\begin{verbatim}
  structure Anchor : sig
    val anchor : string -> string option controller
    val reset : unit -> unit
  end
\end{verbatim}

{\tt CM.Anchor.anchor} returns a pair of {\tt get} and {\tt set}
functions that can be used to query and modify the status of the named
anchor.  Note that the {\tt get}-{\tt set}-pair operates over type
{\tt string option}; a value of {\tt NONE} means that the anchor is
currently not bound (or, in the case of {\tt set}, that it is being
cancelled).  The (optional) string given to {\tt set} must be a
directory name in native syntax.  If it is specified as a relative
path name, then it will be expanded by prepending the name of the
current working directory.

{\tt CM.Anchor.reset} erases the entire existing path configuration
mapping.
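
For example, to query, set, and finally cancel the anchor {\tt foo}
used in the example of Section~\ref{sec:anchors}:

\begin{verbatim}
  #get (CM.Anchor.anchor "foo") ();                   (* currently bound? *)
  #set (CM.Anchor.anchor "foo") (SOME "/usr/local");  (* bind the anchor *)
  #set (CM.Anchor.anchor "foo") NONE;                 (* cancel it again *)
\end{verbatim}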

\subsubsection*{Setting CM variables}

CM variables are used by the conditional compilation system (see
Section~\ref{sec:cmvars}).  Some of these variables are predefined,
but the user can add new ones and alter or remove those that already
exist.

\begin{verbatim}
  val symval : string -> int option controller
\end{verbatim}

Function {\tt CM.symval} returns a {\tt get}-{\tt set}-pair for the
symbol whose name string was specified as the argument.  Note that the
{\tt get}-{\tt set}-pair operates over type {\tt int option}; a value
of {\tt NONE} means that the variable is not defined.

\noindent Examples:
\begin{verbatim}
#get (CM.symval "X") ();       (* query value of X *)
#set (CM.symval "Y") (SOME 1); (* set Y to 1 *)
#set (CM.symval "Z") NONE;     (* remove definition for Z *)
\end{verbatim}

Some care is necessary as {\tt CM.symval} does not check whether the
syntax of the argument string is valid.  (However, the worst thing
that could happen is that a variable defined via {\tt CM.symval} is
not accessible\footnote{from within CM's description files} because
there is no legal syntax to name it.)

\subsubsection*{Library registry}

To be able to share associated data structures, CM maintains an
internal registry of all stable libraries that it has encountered
during an ongoing interactive session.  The {\tt CM.Library}
sub-structure of structure {\tt CM} provides access to this registry.

\begin{verbatim}
  structure Library : sig
    type lib
    val known : unit -> lib list
    val descr : lib -> string
    val osstring : lib -> string
    val dismiss : lib -> unit
  end
\end{verbatim}

{\tt CM.Library.known}, when called, produces a list of currently
known stable libraries.  Each such library is represented by an
element of the abstract data type {\tt CM.Library.lib}.

{\tt CM.Library.descr} extracts a string describing the location of
the CM description file associated with the given library.  The syntax
of this string is the same that is also being used by CM's
master-slave protocol (see section~\ref{sec:pathencode}).

{\tt CM.Library.osstring} produces a string denoting the given
library's description file using the underlying operating system's
native pathname syntax.  In other words, the result of a call to {\tt
CM.Library.osstring} is suitable as an argument to {\tt
TextIO.openIn}.

{\tt CM.Library.dismiss} is used to remove a stable library from CM's
internal registry.  Although removing a library from the registry may
recover considerable amounts of main memory, doing so also eliminates
any chance of sharing the associated data structures with later
references to the same library.  Therefore, it is not always in the
interest of memory-conscious users to use this feature.

Sharing of link-time state created by the library is {\em not}
affected by this.
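
A short interactive session illustrating these functions (whether
dismissing the libraries is a good idea depends, as explained above, on
how they will be used later):

\begin{verbatim}
  val libs = CM.Library.known ();  (* all stable libraries seen so far *)
  map CM.Library.descr libs;       (* where does each one come from? *)
  app CM.Library.dismiss libs;     (* drop them all from the registry *)
\end{verbatim}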

\subsubsection*{Internal state}

For CM to work correctly, it must maintain an up-to-date picture of
the state of the surrounding world (as far as that state affects CM's
operation).  Most of the time, this happens automatically and should be
transparent to the user.  However, occasionally it may become
necessary to intervene explicitly.

Access to CM's internal state is facilitated by members of the {\tt
CM.State} structure.

\begin{verbatim}
  structure State : sig
    val pending : unit -> string list
    val synchronize : unit -> unit
    val reset : unit -> unit
  end
\end{verbatim}

{\tt CM.State.pending} produces a list of strings, each string naming
one of the symbols that are currently bound but not yet resolved by
the autoloading mechanism.

{\tt CM.State.synchronize} updates tables internal to CM to reflect
changes in the file system.  In particular, this will be necessary
when the association of file names to ``file IDs'' (in Unix: inode
numbers) changes during an ongoing session.  In practice, the need for
this tends to be rare.

{\tt CM.State.reset} completely erases all internal state in CM.  To
do this is not very advisable since it will also break the association
with pre-loaded libraries.  It may be a useful tool for determining
the amount of space taken up by the internal state, though.

\subsubsection*{Compile servers}

On Unix-like systems, CM supports parallel compilation.  For computers
connected using a LAN, this can be extended to distributed compilation
using a network file system and the operating system's ``rsh''
facility.  For a detailed discussion, see Section~\ref{sec:parmake}.

Sub-structure {\tt CM.Server} provides access to and manipulation of
compile servers.  Each attached server is represented by a value of
type {\tt CM.Server.server}.

\begin{verbatim}
  structure Server : sig
    type server
    val start : { name: string,
                  cmd: string * string list,
                  pathtrans: (string -> string) option,
                  pref: int } -> server option
    val stop : server -> unit
    val kill : server -> unit
    val name : server -> string
  end
\end{verbatim}

CM is put into ``parallel'' mode by attaching at least one compile
server.  Compile servers are attached using invocations of {\tt
CM.Server.start}.  The function takes the name of the server (as an
arbitrary but unique string) ({\tt name}), the Unix command used to
start the server in a form suitable as an argument to {\tt
Unix.execute} ({\tt cmd}), an optional ``path transformation
function'' for converting local path names to remote pathnames ({\tt
pathtrans}), and a numeric ``preference'' value that is used to choose
servers at times when more than one is idle ({\tt pref}).  The
optional result is the handle representing the successfully attached
server.

An existing server can be shut down and detached using {\tt
CM.Server.stop} or {\tt CM.Server.kill}.  The argument in either case
must be the result of an earlier call to {\tt CM.Server.start}.
Function {\tt CM.Server.stop} uses CM's master-slave protocol to
instruct the server to shut down gracefully.  Only if this fails may it
become necessary to use {\tt CM.Server.kill}, which sends a
Unix TERM signal to destroy the server.

Given a server handle, function {\tt CM.Server.name} returns the
string that was originally given to the call of {\tt CM.Server.start}
used to create the server.
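
The following sketch attaches a single local server.  The path to the
{\tt sml} command and the argument that puts it into slave mode (shown
here as {\tt @CMslave}) are placeholders; see Section~\ref{sec:parmake}
for the details of how slaves are actually started:

\begin{verbatim}
  val server =
      CM.Server.start { name = "local-1",
                        cmd = ("/usr/local/bin/sml", ["@CMslave"]),
                        pathtrans = NONE, (* no translation needed locally *)
                        pref = 0 };
\end{verbatim}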

\subsubsection*{Plug-ins}

As an alternative to {\tt CM.make} or {\tt CM.autoload}, where the
main purpose is to subsequently be able to access the library from
interactively entered code, one can instruct CM to load libraries
``for effect''.

\begin{verbatim}
  val load_plugin : string -> bool
\end{verbatim}

Function {\tt CM.load\_plugin} acts exactly like {\tt CM.make} except
that even in the case of success no new symbols will be bound in the
interactive top-level environment.  That means that link-time
side-effects will be visible, but none of the exported definitions
become available.  This mechanism can be used for ``plug-in'' modules:
a core library provides hooks where additional functionality can be
registered later via side-effects; extensions to this core are
implemented as additional libraries which, when loaded, register
themselves with those hooks.  By using {\tt CM.load\_plugin} instead
of {\tt CM.make}, one can avoid polluting the interactive top-level
environment with spurious exports of the extension module.

CM itself uses plug-in modules in its member-class subsystem (see
section~\ref{sec:classes}).  This makes it possible to add new classes
and tools very easily without having to reconfigure or recompile CM,
not to mention modify its source code.

\subsubsection*{Building stand-alone programs}

CM can be used to build stand-alone programs.  In fact, SML/NJ
itself---including CM---is an example of this.  (The interactive
system cannot rely on an existing compilation manager when starting
up.)

A stand-alone program is constructed by the runtime system from
existing binfiles or members of existing stable libraries.  CM must
prepare those binfiles or libraries together with a list that
describes them to the runtime system.

\begin{verbatim}
  val mk_standalone : bool option -> string -> string list option
\end{verbatim}

Depending on the optional boolean argument, function {\tt
CM.mk\_standalone} first acts like either {\tt CM.recomp} or {\tt
CM.stabilize}.  {\tt NONE} means {\tt CM.recomp}, and {\tt (SOME $r$)}
means {\tt CM.stabilize $r$}.  After recompilation (or stabilization)
is successful, {\tt CM.mk\_standalone} constructs a topologically
sorted list of strings that, when written to a file, can be passed to the
runtime system in order to perform stand-alone linkage of the given
program. Upon failure, {\tt CM.mk\_standalone} returns {\tt NONE}.
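
A sketch of how this might be used, assuming a root description file
{\tt myprog.cm} and assuming that the resulting list is written to a
file with one entry per line:

\begin{verbatim}
  case CM.mk_standalone NONE "myprog.cm" of
      NONE => print "mk_standalone failed\n"
    | SOME names => let
          val out = TextIO.openOut "myprog.list"
      in
          app (fn n => TextIO.output (out, n ^ "\n")) names;
          TextIO.closeOut out
      end;
\end{verbatim}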

\subsection{The autoloader}
\label{sec:autoload}

From the user's point of view, a call to {\tt CM.autoload} acts very
much like the corresponding call to {\tt CM.make} because the same
bindings that {\tt CM.make} would introduce into the top-level
environment are also introduced by {\tt CM.autoload}.  However, most
work will be deferred until some code that is entered later refers to
one or more of these bindings.  Only then will CM go and perform just
the minimal work necessary to provide the actual definitions.
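
For example (with a made-up library {\tt my-utils.cm} exporting a
structure {\tt MyUtil}):

\begin{verbatim}
  CM.autoload "my-utils.cm";  (* bindings registered, nothing loaded yet *)
  (* ... later, the first reference triggers the actual work: *)
  MyUtil.greet "world";
\end{verbatim}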

The autoloader plays a central role for the interactive system.
Unlike in earlier versions, it cannot be turned off since it provides
many of the standard pre-defined top-level bindings.

The autoloader is a convenient mechanism for virtually ``loading'' an
entire library without incurring an undue increase in memory
consumption for library modules that are not actually being used.

\subsection{Sharing of state}
\label{sec:sharing}

Whenever it is legal to do so, CM lets multiple invocations of {\tt
CM.make} or {\tt CM.autoload} share dynamic state created by link-time
effects.  Of course, sharing is not possible (and hence not ``legal'')
if the compilation unit in question has recently been recompiled or
depends on another compilation unit whose code has recently been
re-executed.  The programmer can explicitly mark certain ML files as
{\em shared}, in which case CM will issue a warning whenever the
unit's code has to be re-executed.

State created by compilation units marked as {\em private} is never
shared across multiple calls to {\tt CM.make} or {\tt CM.autoload}.
To understand this behavior it is useful to introduce the notion of a
{\em traversal}.  A traversal is the process of traversing the
dependency graph on behalf of {\tt CM.make} or {\tt CM.autoload}.
Several traversals can be executed interleaved with each other because
a {\tt CM.autoload} traversal normally stays suspended and is
performed incrementally driven by input from the interactive top level
loop.

As far as sharing is concerned, the rule is that during one traversal
each compilation unit will be executed at most once.  In other words,
the same ``program'' will not see multiple instantiations of the same
compilation unit (where ``program'' refers to the code managed by one
call to {\tt CM.make} or {\tt CM.autoload}).  Each compilation unit
will be linked at most once during a traversal and private state
will not be confused with private state of other traversals that might
be active at the same time.

% Need a good example here.

\subsubsection*{Sharing annotations}

ML source files can be specified as being {\em private} or {\em
shared}.  This is done by adding a {\em tool parameter} specification
for the file in the library- or group description file (see
Section~\ref{sec:classes}).  In other words, to mark an ML file as
{\em private}, follow the file name with the word {\tt private} in
parentheses.  For {\em shared} ML files, replace {\tt private} with
{\tt shared}.
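
For example, in the following (hypothetical) library the link-time
state of {\tt registry.sml} is never shared, and {\tt client.sml},
although unannotated, is not treated as shared either because it
depends on the private {\tt registry.sml}:

\begin{verbatim}
Library
    structure Registry
    structure Client
is
    registry.sml (private)
    client.sml
    basis.cm
\end{verbatim}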

An ML source file that is not annotated will typically be treated as
{\em shared} unless it statically depends on some other {\em private}
source.  It is an error for a {\em shared} source to depend on a {\em
private} source.

\subsubsection*{Sharing with the interactive system}

The SML/NJ interactive system, which includes the compiler, is itself
created by linking modules from various libraries. Some of these
libraries can also be used in user programs.  Examples are the
Standard ML Basis Library {\tt basis.cm}, the SML/NJ library {\tt
smlnj-lib.cm}, and the ML-Yacc library {\tt ml-yacc-lib.cm}.

If a module from a library is used by both the interactive system and
a user program running under control of the interactive system, then
CM will let them share code and dynamic state.

\section{Member classes and tools}
\label{sec:classes}

Most members of groups and libraries are either plain ML files or
other description files.  However, it is possible to incorporate other
types of files---as long as their contents can in some way be expanded
into ML code or CM descriptions.  The expansion is carried out by CM's
{\it tools} facility.

CM maintains an internal registry of {\em classes} and associated {\em
rules}.  Each class represents the set of source files that its
corresponding rule is applicable to.  For example, the class {\tt
mlyacc} is responsible for files that contain input for the parser
generator ML-Yacc~\cite{tarditi90:yacc}.  The rule for {\tt mlyacc}
takes care of expanding an ML-Yacc specification such as {\tt foo.grm} by
invoking the auxiliary program {\tt ml-yacc}.  The resulting ML files
{\tt foo.grm.sig} and {\tt foo.grm.sml} are then used as if their
names had directly been specified in place of {\tt foo.grm}.

CM knows a small number of built-in classes.  In many situations these
classes will be sufficient, but in more complicated cases it may be
worthwhile to add a new class.  Since class rules are programmed in
ML, adding a class is not as simple a matter as writing a rule for
{\sc Unix}' {\tt make} program~\cite{feldman79}.  Of course,
using ML also has advantages because it keeps CM extremely flexible in
what rules can do.  Moreover, it is not necessary to learn yet another
``little language'' in order to be able to program CM's tool facility.

When looking at a member of a description file, CM determines which
tool to use by looking at clues like the file name suffix.  However,
it is also possible to specify the class of a member explicitly.  For
this, the member name is followed by a colon {\bf :} and the name of
the member class.  All class names are case-insensitive.

In addition to genuine tool classes, there are two member classes
that refer to facilities internal to CM: {\tt sml} is the class of
ordinary ML source files and {\tt cm} is the class of CM library or
group description files.

CM automatically classifies files with a {\tt .sml} or {\tt .sig}
suffix as ML source code and file names ending in {\tt .cm} as CM
descriptions.
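
For example, in the following (hypothetical) library the member class
of {\tt parser.grammar} must be given explicitly because its suffix is
not recognized, while the class of {\tt tokens.sml} is inferred:

\begin{verbatim}
Library
    structure Parser
is
    parser.grammar : mlyacc
    tokens.sml
    basis.cm
\end{verbatim}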

\subsection{Tool parameters}

In many cases the name of the member that caused a rule to be invoked
is the only input to that rule.  However, rules can be written in such
a way that they take additional parameters.  Those parameters, if
present, must be specified in the CM description file between
parentheses following the name of the member and the optional member
class.

CM's core mechanism parses these tool options and breaks them up into
a list of items, where each item is either a filename (i.e., {\em
looks} like a filename) or a named list of sub-options.  However, CM
itself does not interpret the result but passes it on to the tool's
rule function.  It is each rule's own responsibility to assign
meaning to its options.

The {\tt sml} class accepts one parameter which must be either the
word {\tt shared} or the word {\tt private}.  (Technically, the
strings {\tt private} and {\tt shared} fall under the {\em filename}
category from above, but the tool ignores that aspect and uses the
name directly.) If {\tt shared} is specified, then dynamic state
created by the compilation unit at link-time must be shared across
invocations of {\tt CM.make} or {\tt CM.autoload}.  As explained
earlier (Section~\ref{sec:sharing}), the {\tt private} annotation
means that dynamic state cannot be shared across such calls to {\tt
CM.make} or {\tt CM.autoload}.

The {\tt cm} class does not accept any parameters.

Named sub-option lists are specified by a name string followed by a
colon {\bf :} and a parenthesized list of other tool options.  If the
list contains precisely one element, the parentheses may be omitted.

\subsection{Built-in tools}

\subsubsection*{The ML-Yacc tool}

The ML-Yacc tool is responsible for files that are input to the
ML-Yacc parser generator.  Its class name is {\tt mlyacc}.  Recognized
file name suffixes are {\tt .grm} and {\tt .y}.  For a source file
$f$, the tool produces two targets $f${\tt .sig} and $f${\tt .sml},
both of which are always treated as ML source files.  Parameters are
passed on without change to the $f${\tt .sml} file but not to the
$f${\tt .sig} file.  This means that the parameter can either be the
word {\tt private} or the word {\tt shared}, and that this sharing
annotation will apply to the $f${\tt .sml} file.

The tool invokes the {\tt ml-yacc} command if the targets are
``outdated''.  A target is outdated if it is missing or older than the
source.  Unless anchored using the path anchor mechanism (see
Section~\ref{sec:anchors}), the command {\tt ml-yacc} will be located
using the operating system's path search mechanism (e.g., the {\tt
\$PATH} environment variable).

\subsubsection*{ML-Lex}

The ML-Lex tool governs files that are input to the ML-Lex lexical
analyzer generator~\cite{appel89:lex}.  Its class name is {\tt mllex}.
Recognized file name suffixes are {\tt .lex} and {\tt .l}.  For a
source file $f$, the tool produces one target $f${\tt .sml} which
will always be treated as ML source code.  Tool parameters are passed
on without change to that file.

The tool invokes the {\tt ml-lex} command if the target is outdated
(just like in the case of ML-Yacc).  Unless anchored using the path
anchor mechanism (see Section~\ref{sec:anchors}), the command {\tt
ml-lex} will be located using the operating system's path search
mechanism (e.g., the {\tt \$PATH} environment variable).

\subsubsection*{ML-Burg}

The ML-Burg tool deals with files that are input to the ML-Burg
code-generator generator~\cite{mlburg93}.  Its class name is {\tt
mlburg}.  The only recognized file name suffix is {\tt .burg}.  For a
source file $f${\tt .burg}, the tool produces one target $f${\tt
.sml} which will always be treated as ML source code.  Any tool
parameters are passed on without change to the target.

The tool invokes the {\tt ml-burg} command if the target is outdated.
Unless anchored using the path anchor mechanism (see
Section~\ref{sec:anchors}), the command {\tt ml-burg} will be located
using the operating system's path search mechanism (e.g., the {\tt
\$PATH} environment variable).

\subsubsection*{Shell}

The Shell tool can be used to specify arbitrary shell commands to be
invoked on behalf of a given file.  The name of the class is {\tt
shell}.  There are no recognized file name suffixes.  This means that
in order to use the shell tool one must always specify the {\tt shell}
member class explicitly.

The rule for the {\tt shell} class relies on tool parameters.  The
parameter list must be given in parentheses and follow the {\tt shell}
class specification.

Consider the following example:

\begin{verbatim}
  foo.pp : shell (target:foo.sml options:(shared)
                        /lib/cpp -P -Dbar=baz %s %t)
\end{verbatim}

This member specification says that file {\tt foo.sml} can be obtained
from {\tt foo.pp} by running it through the C preprocessor {\tt cpp}.
The fact that the target file is given as a tool parameter implies
that the member itself is the source.  The named parameter {\tt
options} lists the tool parameters to be used for that target. (In the
example, the parentheses around {\tt shared} are optional because it
is the only element of the list.) The command line itself is given by
the remaining non-keyword parameters.  Here, a single {\bf \%s} is
replaced by the source file name, and a single {\bf \%t} is replaced
by the target file name; any other string beginning with {\bf \%} is
shortened by its first character.

In the specification one can swap the positions of source and target
(i.e., let the member name be the target) by using a {\tt source}
parameter:

\begin{verbatim}
  foo.sml : shell (source:foo.pp options:shared
                         /lib/cpp -P -Dbar=baz %s %t)
\end{verbatim}

Exactly one of the {\tt source} and {\tt target} parameters must be
specified; the other one is taken to be the member name itself.  The
target class can be given by writing a {\tt class} parameter whose
single sub-option must be the desired class name.

The usual distinction between native and standard filename syntax
applies to any given {\tt source} or {\tt target} parameter.

For example, if one were working on a Win32 system and the target file
is supposed to be in the root directory on volume {\tt D:},
then one must use native syntax to write it.  One way of doing this
would be:

\begin{verbatim}
  "D:\\foo.sml" : shell (source : foo.pp options : shared
                               cpp -P -Dbar=baz %s %t)
\end{verbatim}

\noindent As a result, {\tt foo.sml} is interpreted using native
syntax while {\tt foo.pp} uses standard conventions (although in this
case it does not make a difference).  Had we used the {\tt target}
version from above, one would have to write:

\begin{verbatim}
  foo.pp : shell (target : "D:\\foo.sml" options : shared
                                 cpp -P -Dbar=baz %s %t)
\end{verbatim}

The shell tool invokes its command whenever the target is outdated
with respect to the source.

\subsubsection*{Make}

The Make tool (class {\tt make}) can (almost) be seen as a specialized
version of the Shell tool.  It has no source and one target (the
member itself) which is always considered outdated.  As with the Shell
tool, it is possible to specify target class and parameters using the
{\tt class} and {\tt options} keyword parameters.

The tool invokes the shell command {\tt make} on the target.  Unless
anchored using the path anchor mechanism (see Section~\ref{sec:anchors}), the
command will be located using the operating system's path search
mechanism (e.g., the {\tt \$PATH} environment variable).

Any parameters other than the {\tt class} and {\tt options}
specifications must be plain strings and are given as additional
command line arguments to {\tt make}.  The target name is always the
last command line argument.

Example:

\begin{verbatim}
  bar-grm : make (class:mlyacc -f bar-grm.mk)
\end{verbatim}

Here, file {\tt bar-grm} is generated (and kept up-to-date) by
invoking the command:
\begin{verbatim}
make -f bar-grm.mk bar-grm
\end{verbatim}
\noindent The target file is then treated as input for {\tt ml-yacc}.

Cascading Shell- and Make-tools is easily possible.  Here is an
example that first uses Make to build {\tt bar.pp} and then filters
the contents of {\tt bar.pp} through the C preprocessor to arrive at
{\tt bar.sml}:

\begin{verbatim}
  bar.pp : make (class:shell
                     options:(target:bar.sml cpp -Dbar=baz %s %t)
                 -f bar-pp.mk)
\end{verbatim}

\section{Conditional compilation}
\label{sec:preproc}

In its description files, CM offers a simple conditional compilation
facility inspired by the pre-processor for the C language~\cite{k&r2}.
However, it is not really a {\it pre}-processor, and the syntax of the
controlling expressions is borrowed from SML.

Sequences of members can be guarded by {\tt \#if}-{\tt \#endif}
brackets with optional {\tt \#elif} and {\tt \#else} lines in between.
The same guarding syntax can also be used to conditionalize the export
list.  {\tt \#if}-, {\tt \#elif}-, {\tt \#else}-, and {\tt
\#endif}-lines must start in the first column and always
extend to the end of the current line.  {\tt \#if} and {\tt \#elif}
must be followed by a boolean expression.

Boolean expressions can be formed by comparing arithmetic expressions
(using operators {\tt <}, {\tt <=}, {\tt =}, {\tt >=}, {\tt >}, or
{\tt <>}), by logically combining two other boolean expressions (using
operators {\tt andalso}, {\tt orelse}, {\tt =}, or {\tt <>}), by
querying the existence of a CM symbol definition, or by querying the
existence of an exported ML definition.

Arithmetic expressions can be numbers or references to CM symbols, or
can be formed from other arithmetic expressions using operators {\tt
+}, {\tt -} (subtraction), \verb|*|, {\tt div}, {\tt mod}, or $\tilde{~}$
(unary minus).  All arithmetic is done on signed integers.

Any expression (arithmetic or boolean) can be surrounded by
parentheses to enforce precedence.

\subsection{CM variables}
\label{sec:cmvars}

CM provides a number of ``variables'' (names that stand for certain
integers). These variables may appear in expressions of the
conditional-compilation facility. The exact set of variables provided
depends on SML/NJ version number, machine architecture, and
operating system.  A reference to a CM variable is considered an
arithmetic expression. If the variable is not defined, then it
evaluates to 0.  The expression {\tt defined}($v$) is a boolean
expression that yields true if and only if $v$ is a defined CM
variable.

The names of CM variables are formed starting with a letter followed
by zero or more occurrences of letters, decimal digits, apostrophes, or
underscores.

The following variables will be defined and bound to 1:
\begin{itemize}
\item depending on the operating system: {\tt OPSYS\_UNIX}, {\tt
OPSYS\_WIN32}, {\tt OPSYS\_MACOS}, {\tt OPSYS\_OS2}, or {\tt
OPSYS\_BEOS}
\item depending on processor architecture: {\tt ARCH\_SPARC}, {\tt
ARCH\_ALPHA32}, {\tt ARCH\_MIPS}, {\tt ARCH\_X86}, {\tt ARCH\_HPPA},
{\tt ARCH\_RS6000}, or {\tt ARCH\_PPC}
\item depending on the processor's endianness: {\tt BIG\_ENDIAN} or
{\tt LITTLE\_ENDIAN}
\item depending on the native word size of the implementation: {\tt
SIZE\_32} or {\tt SIZE\_64}
\item the symbol {\tt NEW\_CM}
\end{itemize}

Furthermore, the symbol {\tt SMLNJ\_VERSION} will be bound to the
major version number of SML/NJ (i.e., the number before the first dot)
and {\tt SMLNJ\_MINOR\_VERSION} will be bound to the system's minor
version number (i.e., the number after the first dot).

Using the {\tt CM.symval} interface one can define additional
variables or modify existing ones.
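
For example, a description file might select an implementation file
based on the operating system (the member names are made up):

\begin{verbatim}
#if defined(OPSYS_UNIX)
    posix-glue.sml
#elif defined(OPSYS_WIN32)
    win32-glue.sml
#endif
\end{verbatim}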

\subsection{Querying exported definitions}

An expression of the form {\tt defined}($n$ $s$), where $s$ is an ML
symbol and $n$ is an ML namespace specifier, is a boolean expression
that yields true if and only if any member included before this test
exports a definition under this name.  Therefore, order among members
matters after all (but it remains unrelated to the problem of
determining static dependencies)!  The namespace specifier must be one
of: {\tt structure}, {\tt signature}, {\tt functor}, or {\tt funsig}.

If the query takes place in the ``exports'' section of a description
file, then it yields true if {\em any} of the included members exports
the named symbol.

\noindent Example:

\begin{verbatim}
Library
  structure Foo
#if defined(structure Bar)
  structure Bar
#endif
is
#if SMLNJ_VERSION > 110
  new-foo.sml
#else
  old-foo.sml
#endif
#if defined(structure Bar)
  bar-client.sml
#else
  no-bar-so-far.sml
#endif
\end{verbatim}

Here, the file {\tt bar-client.sml} gets included if {\tt
SMLNJ\_VERSION} is greater than 110 and {\tt new-foo.sml} exports a
structure {\tt Bar} {\em or} if {\tt SMLNJ\_VERSION <= 110} and {\tt
old-foo.sml} exports structure {\tt Bar}.  Otherwise {\tt
no-bar-so-far.sml} gets included instead.  In addition,
the export of structure {\tt Bar} is guarded by its own existence.
(Structure {\tt Bar} could also be defined by {\tt no-bar-so-far.sml}
in which case it would get exported regardless of the outcome of the
other {\tt defined} test.)

\subsection{Explicit errors}

A pseudo-member of the form {\tt \#error $\ldots$}, which---like other
{\tt \#}-items---starts in the first column and extends to the end of
the line, causes an explicit error message unless it gets excluded by
the conditional compilation logic.  The error message is given by the
remainder of the line after the word {\tt error}.
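
For example:

\begin{verbatim}
#if SMLNJ_VERSION < 110
#error This library requires SML/NJ version 110 or later.
#endif
\end{verbatim}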

\subsection{EBNF for expressions}

\begin{tabular}{rcl}
\nt{letter} &\ar& \tl{A} \vb $\ldots$ \vb \tl{Z} \vb \tl{a} \vb $\ldots$ \vb \tl{z} \\
\nt{digit}  &\ar& \tl{0} \vb $\ldots$ \vb \tl{9} \\
\nt{ldau}   &\ar& \nt{letter} \vb \nt{digit} \vb \tl{'} \vb \tl{\_} \\
\\
\nt{number} &\ar& \nt{digit} \{\nt{digit}\} \\
\nt{sym}    &\ar& \nt{letter} \{\nt{ldau}\} \\
\\
\nt{aatom}  &\ar& \nt{number} \vb \nt{sym} \vb \tl{(} \nt{asum} \tl{)} \vb \tl{$\tilde{~}$} \nt{aatom} \\
\nt{aprod}  &\ar& \{\nt{aatom} (\tl{*} \vb \tl{div} \vb \tl{mod})\} \nt{aatom} \\
\nt{asum}   &\ar& \{\nt{aprod} (\tl{+} \vb \tl{-})\} \nt{aprod} \\
\\
\nt{ns}     &\ar& \tl{structure} \vb \tl{signature} \vb \tl{functor} \vb \tl{funsig} \\
\nt{mlsym}  &\ar& {\em a Standard ML identifier} \\
\nt{query}  &\ar& \tl{defined} \tl{(} \nt{sym} \tl{)} \vb \tl{defined} \tl{(} \nt{ns} \nt{mlsym} \tl{)} \\
\\
\nt{acmp}   &\ar& \nt{aexp} (\ttl{<} \vb \ttl{<=} \vb \ttl{>} \vb \ttl{>=} \vb \ttl{=} \vb \ttl{<>}) \nt{aexp} \\
\\
\nt{batom}  &\ar& \nt{query} \vb \nt{acmp} \vb \tl{not} \nt{batom} \vb \tl{(} \nt{bdisj} \tl{)} \\
\nt{bcmp}   &\ar& \nt{batom} [(\ttl{=} \vb \ttl{<>}) \nt{batom}] \\
\nt{bconj}  &\ar& \{\nt{bcmp} \tl{andalso}\} \nt{bcmp} \\
\nt{bdisj}  &\ar& \{\nt{bconj} \tl{orelse}\} \nt{bconj} \\
\\
\nt{expression} &\ar& \nt{bdisj}
\end{tabular}

\section{Access control}
\label{sec:access}

The basic idea behind CM's access control is the following: In their
description files, groups and libraries can specify a list of
{\em privileges} that the client must have in order to be able to use them.
Privileges at this level are just names (strings) and must be written
in front of the initial keyword {\tt Library} or {\tt Group}.  If one
group or library imports from another group or library, then
privileges (or rather: privilege requirements) are being inherited.
In effect, to be able to use a program, one must have all privileges
for all its libraries, sub-libraries and library components,
components of sub-libraries, and so on.

Of course, this alone would not yet be satisfactory.  The main service
of the access control system is that it can let a client use an
``unsafe'' library ``safely''.  For example, a library {\tt LSafe.cm}
could ``wrap'' all the unsafe operations in {\tt LUnsafe.cm} with
enough error checking that they become safe.  Therefore, a user of
{\tt LSafe.cm} should not also be required to possess the privileges
that would be required if one were to use {\tt LUnsafe.cm} directly.

In CM's access control model it is possible for a library to ``wrap''
privileges.  If a privilege $P$ has been wrapped, then the user of the
library does not need to have privilege $P$ even though the library is
using another library that requires privilege $P$.  In essence, the
library acts as a ``proxy'' who provides the necessary credentials for
privilege $P$ to the sub-library.

Of course, not everybody can be allowed to establish a library with
such a ``wrapped'' privilege $P$.  The programmer who does that should at
least herself have privilege $P$ (but perhaps better, she should have
{\em permission to wrap $P$}---a stronger requirement).

In CM, wrapping a privilege is done by specifying the name of that
privilege within parentheses.  The wrapping becomes effective once the
library gets stabilized via {\tt CM.stabilize}.  The (not yet
implemented) enforcement mechanism must ensure that anyone who
stabilizes a library that wraps $P$ has permission to wrap $P$.
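
As a sketch (privilege and file names are invented), a library {\tt
l-unsafe.cm} might require privilege {\tt unsafeio} of its clients,
while {\tt l-safe.cm} wraps that privilege so that its own clients
need no privileges at all:

\begin{verbatim}
(* l-unsafe.cm *)
unsafeio
Library
    structure UnsafeIO
is
    unsafe-io.sml
    basis.cm

(* l-safe.cm *)
(unsafeio)
Library
    structure SafeIO
is
    safe-io.sml
    l-unsafe.cm
    basis.cm
\end{verbatim}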

Note that privileges cannot be wrapped at the level of CM groups.

Access control is a new feature. At the moment, only the basic
mechanisms are implemented, but there is no enforcement.  In other
words, everybody is assumed to have every possible privilege.  CM
merely reports which privileges ``would have been required''.

\section{The pervasive environment}

The {\em pervasive environment} can be thought of as a compilation
unit that all compilation units implicitly depend upon.  The pervasive
environment exports all non-modular bindings (types, values, infix
operators, overloaded symbols) that are mandated by the specification
for the Standard ML Basis Library~\cite{reppy99:basis}.  (All other
bindings of the Basis Library are exported by {\tt basis.cm} which is
a genuine CM library.)

The pervasive environment is the only place where CM conveys
non-modular bindings from one compilation unit to another.

\section{Files}

CM uses three kinds of files to store derived information during and
between sessions:

\begin{enumerate}
\item {\it Skeleton files} are used to store a highly abbreviated
version of each ML source file's abstract syntax tree---just barely
sufficient to drive CM's dependency analysis.  Skeleton files are much
smaller and easier to read than actual ML source code.  Therefore, the
existence of valid skeleton files makes CM considerably faster because
most parsing operations can then be avoided.
\item {\it Binfiles} are the SML/NJ equivalent of object files.  They
contain executable code and a symbol table for the associated ML
source file.
\item {\it Library files} (sometimes called {\em stablefiles}) contain
the dependency graph, executable code, and symbol tables for an entire
CM library, including all of its components (groups).
\end{enumerate}

Normally, all these files are stored in a subdirectory of directory
{\tt CM}. {\tt CM} itself is a subdirectory of the directory where the
original ML source file or---in the case of library files---the
original CM description file is located.

Skeleton files are machine- and operating system-independent.
Therefore, they are always placed into the same directory {\tt
CM/SKEL}. Parsing (for the purpose of dependency analysis) will be
done only once even if the same file system is accessible from
machines of different types.

Binfiles and library files contain executable code and other
information that is potentially system- and architecture-dependent.
Therefore, they are stored under {\tt CM/}{\it arch}{\tt -}{\it os}
where {\it arch} is a string indicating the type of the current
CPU architecture and {\it os} a string denoting the current operating
system type.
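
For illustration, here is the layout that results for a single ML
source file; the concrete {\it arch}{\tt -}{\it os} string ({\tt
x86-unix}) and the names of the individual derived files are shown
only as an assumption:

\begin{verbatim}
foo.sml                  ML source file
CM/SKEL/foo.sml          skeleton file (machine-independent)
CM/x86-unix/foo.sml      binfile for CPU type "x86" under OS type "unix"
\end{verbatim}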

Library files are a bit of an exception in the sense that they do not
require any source files or any other derived files of the same
library to exist.  As a consequence, the location of such a library
file is best described as being relative to ``the location of the
original CM description file if that description file still existed''.
(Of course, nothing precludes the CM description file from actually
existing, but in the presence of a corresponding library file CM will
not take any notice.)

\subsection{Time stamps}

For skeleton files and binfiles, CM uses file system time stamps to
determine whether a file has become outdated.  The rule is that in
order to be considered ``up-to-date'' the time stamp on skeleton file
and binfile has to be exactly the same as the one on the ML source
file.  This guarantees that all changes to a source will be
noticed\footnote{except for the pathological case where two different
versions of the same source file have exactly the same time stamp}.

CM also uses time stamps to decide whether tools such as ML-Yacc or
ML-Lex need to be run (see Section~\ref{sec:tools}).  However, the
difference is that a file is considered outdated if it is older than
its source.  Some care on the programmer's part is necessary since this
scheme does not allow CM to detect the situation where a source file
gets replaced by an older version of itself.

\section{Tools}
\label{sec:tools}

CM's tool set is extensible: new tools can be added by writing a few
lines of ML code.  The necessary hooks for this are provided by a
structure {\tt Tools} which is exported by the {\tt smlnj/cm/tools.cm}
library.

If the tool is implemented as a ``typical'' shell command, then all
that needs to be done is a single call to:

\begin{verbatim}
Tools.registerStdShellCmdTool
\end{verbatim}

For example, suppose you have made a
new, improved version of ML-Yacc (``New-ML-Yacc'') and want to
register it under a class called {\tt nmlyacc}.  Here is what you
write:

\begin{verbatim}
val _ = Tools.registerStdShellCmdTool
  { tool = "New-ML-Yacc",
    class = "nmlyacc",
    suffixes = ["ngrm", "ny"],
    cmdStdPath = "new-ml-yacc",
    template = NONE,
    extensionStyle =
        Tools.EXTEND [("sig", SOME "sml", fn _ => NONE),
                      ("sml", SOME "sml", fn x => x)],
    dflopts = [] }
\end{verbatim}

This code can either be packaged as a CM library or entered at the
interactive top level after loading the {\tt smlnj/cm/tools.cm} library
(via {\tt CM.make} or {\tt CM.load\_plugin}).

In our example, the shell command name for our tool is {\tt
new-ml-yacc}.  When looking for this command in the file system, CM
first tries to treat it as a path anchor (see
section~\ref{sec:anchors}).  For example, suppose {\tt new-ml-yacc} is
mapped to {\tt /bin}.  In this case the command to be
invoked would be {\tt /bin/new-ml-yacc}.  If path anchor resolution
fails, then the command name will be used as-is.  Normally this
causes the shell's path search mechanism to be used as a fallback.

{\tt Tools.registerStdShellCmdTool} creates the class and installs the
tool for it.  The arguments must be specified as follows:

\begin{description}
\item[tool] a descriptive name of the tool (used in error messages)
\item[class] the name of the class; the string must not contain
upper-case letters
\item[suffixes] a list of file name suffixes that let CM automatically
recognize files of the class
\item[cmdStdPath] the command string from above
\item[template] an optional string that describes how the command line
is to be constructed from pieces; \\
The string is taken verbatim except for embedded \% format specifiers:
  \begin{description}\setlength{\itemsep}{0pt}
  \item[\%c] the command name (i.e., the elaboration of {\tt cmdStdPath})
  \item[\%s] the source file name in native pathname syntax
  \item[\%$n$t] the $n$-th target file in native pathname syntax; \\
    ($n$ is specified as a decimal number, counting starts at $1$, and
    each target file name is constructed from the corresponding {\tt
    extensionStyle} entry; if $n$ is $0$ (or missing), then all
    targets---separated by single spaces---are inserted;
    if $n$ is not in the range between $0$ and the number of available
    targets, then {\bf \%$n$t} expands into itself) 
  \item[\%$n$o] the $n$-th tool parameter; \\
    (named sub-option parameters are ignored;
     $n$ is specified as a decimal number, counting starts at $1$;
     if $n$ is $0$ (or missing), then all options---separated by single
     spaces---are inserted;
     if $n$ is not in the range between $0$ and the number of available
     options, then {\bf \%$n$o} expands into itself) 
  \item[\%$x$] the character $x$ (where $x$ is none of {\bf c},
    {\bf s}, {\bf t}, or {\bf o})
  \end{description}
If no template string is given, then it defaults to {\tt "\%c \%s"}.
\item[extensionStyle] a specification of how the names of files
generated by the tool relate to the name of the tool input file; \\
Currently, there are two possible cases:
\begin{enumerate}
\item ``{\tt Tools.EXTEND} $l$'' says that if the tool source file is
{\it file} then for each suffix {\it sfx} in {\tt (map \#1 $l$)} there
will be one tool output file named {\it file}{\tt .}{\it sfx}.  The
list $l$ consists of triplets where the first component specifies the
suffix string, the second component optionally specifies the
member class name of the corresponding derived file, and the
third component is a function to calculate tool options for the 
target from those of the source. (Argument and result type of these
functions is {\tt Tools.toolopts option}.)
\item ``{\tt Tools.REPLACE }$(l_1, l_2)$'' specifies that given the
base name {\it base} there will be one tool output file {\it base}{\tt
.}{\it sfx} for each suffix {\it sfx} in {\tt (map \#1 $l_2$)}.  Here,
{\it base} is determined by the following rule: If the name of the
tool input file has a suffix that occurs in $l_1$, then {\it base} is
the name without that suffix.  Otherwise the whole file name is taken
as {\it base} (just like in the case of {\tt Tools.EXTEND}).  As with
{\tt Tools.EXTEND}, the second components of the elements of $l_2$ can
optionally specify the member class name of the corresponding derived
file, and the third component maps source options to target options.
\end{enumerate}
\item[dflopts] a list of strings which is used for
substituting {\bf \%$n$o} fields in {\tt template} (see above) if no
options were specified.  (Note that the value of {\tt dflopts} is never
passed to the option mappers in {\tt Tools.EXTEND} or {\tt
Tools.REPLACE}.)
\end{description}
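
To illustrate how {\tt template}, {\tt extensionStyle}, and the \%
specifiers play together, here is a sketch of a registration for a
made-up class {\tt xyzgen} whose command line names its output file
explicitly (the tool, its {\tt -o} option, and all file suffixes are
hypothetical):

\begin{verbatim}
(* For a source file foo.xyz the single target is foo.xyz.sml, so
   the template "%c -o %1t %s" expands to something like
   "xyzgen -o foo.xyz.sml foo.xyz". *)
val _ = Tools.registerStdShellCmdTool
  { tool = "XYZ-Gen",
    class = "xyzgen",
    suffixes = ["xyz"],
    cmdStdPath = "xyzgen",
    template = SOME "%c -o %1t %s",
    extensionStyle =
        Tools.EXTEND [("sml", SOME "sml", fn x => x)],
    dflopts = [] }
\end{verbatim}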

Less common kinds of rules can also be defined using the generic
interface {\tt Tools.registerClass}.

\subsection{Plug-in Tools}

If CM comes across a member class name $c$ that it does not know
about, then it tries to load a plugin module named $c${\tt -tool.cm}.
If it sees a file whose name ends in suffix $s$ for which no member
class has been specified and for which member classification fails,
then it tries to load a plugin module named $s${\tt -ext.cm}.  The
module loaded in this way can then register the required tool, which
enables CM to deal successfully with the previously unknown member.

This mechanism makes it possible for new tools to be added by simply
placing appropriately-named plug-in libraries in such a way that CM
can find them.  This can be done in one of two ways:

\begin{enumerate}
\item For general-purpose tools that are installed in some central
place, corresponding tool description files $c${\tt -tool.cm} and
$s${\tt -ext.cm} should be registered using the path anchor mechanism.
If this is done, actual description files can be placed in arbitrary
locations.
\item For special-purpose tools that are part of a specific program
and for which there is no need for central installation, one should
simply put the tool description files into the same directory as the
one that contains their ``client'' description file.
\end{enumerate}
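
For instance, the New-ML-Yacc registration from above could be turned
into a plug-in for its class by packaging it as {\tt nmlyacc-tool.cm}
(the structure name {\tt NMLYaccTool} and the file name {\tt
nmlyacc-tool.sml} are merely placeholders):

\begin{verbatim}
Library
        structure NMLYaccTool
is
        nmlyacc-tool.sml   (* wraps the registration code from above *)
        smlnj/cm/tools.cm
\end{verbatim}

\noindent Here, {\tt nmlyacc-tool.sml} simply puts the call to {\tt
Tools.registerStdShellCmdTool} shown earlier inside {\tt structure
NMLYaccTool = struct ... end}.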

\section{Parallel and distributed compilation}
\label{sec:parmake}

To speed up recompilation of large projects with many ML source files,
CM can exploit parallelism that is inherent in the dependency graph.
Currently, the only kind of operating system for which this is
implemented is Unix ({\tt OPSYS\_UNIX}), where separate processes are
used.  From there, one can distribute the work across a network of
machines by taking advantage of the network file system and the
``rsh'' facility.

To perform parallel compilations, one must attach ``compile servers'' to
CM.  This is done using function {\tt CM.Server.start} with the following
signature:

\begin{verbatim}
  structure Server : sig
    type server
    val start : { name: string,
                  cmd: string * string list,
                  pathtrans: (string -> string) option,
                  pref: int } -> server option
  end
\end{verbatim}

Here, {\tt name} is a string uniquely identifying the server and {\tt
cmd} is a value suitable as argument to {\tt Unix.execute}.

The program to be specified by {\tt cmd} should be another instance of
CM---running in ``slave mode''.  To start CM in slave mode, start {\tt
sml} with a single command-line argument of {\tt @CMslave}.  For
example, if {\tt sml} is installed as {\tt /path/to/smlnj/bin/sml}, then a
server process on the local machine could be started by

\begin{verbatim}
  CM.Server.start { name = "A", pathtrans = NONE, pref = 0,
                    cmd = ("/path/to/smlnj/bin/sml",
                           ["@CMslave"]) };
\end{verbatim}

To run a process on a remote machine, e.g., ``thatmachine'', as a
compile server, one can use {\tt rsh}.\footnote{On certain systems it
may be necessary to wrap {\tt rsh} in a script that protects it
from interrupt signals.}  Unfortunately, at the moment it
is necessary to specify the full path to {\tt rsh} because {\tt
Unix.execute} (and therefore {\tt CM.Server.start}) does not perform
a {\tt PATH} search.  The remote machine
must share its file system with the local machine, for example via NFS.

\begin{verbatim}
  CM.Server.start { name = "thatmachine",
                    pathtrans = NONE, pref = 0,
                    cmd = ("/usr/ucb/rsh",
                           ["thatmachine",
                            "/path/to/smlnj/bin/sml",
                            "@CMslave"]) };
\end{verbatim}

You can start as many servers as you want, but they all must have
different names.  If you attach any servers at all, then you should
attach at least two (unless you want to attach one that runs on a
machine vastly more powerful than your local one).  Local servers make
sense on multi-CPU machines: start as many servers as there are CPUs.
Parallel make is most effective on multiprocessor machines because
network latencies can have a severely limiting effect on what can be
gained in the distributed case.
(Be careful, though.  Since there is no memory-sharing to speak of
between separate instances of {\tt sml}, you should be sure to check
that your machine has enough main memory.)

If servers on machines of different power are attached, one can give
some preference to faster ones by setting the {\tt pref} value higher.
(But since the {\tt pref} value is consulted only in the rare case
that more than one server is idle, this will rarely lead to vastly
better throughput.) All attached servers must use the same
architecture-OS combination as the controlling machine.

In parallel mode, the master process itself normally does not compile
anything.  Therefore, if you want to utilize the master's CPU for
compilation, you should start a compile server on the same machine
that the master runs on (even if it is a uniprocessor machine).

The {\tt pathtrans} argument is used when connecting to a machine with
a different file-system layout.  For local servers, it can safely be
left at {\tt NONE}.  The ``path transformation'' function is used to
translate local path names to their remote counterparts.  This can be
a bit tricky to get right, especially if the machines use automounters
or similar devices.  The {\tt pathtrans} function consumes and
produces names in CM's internal ``protocol encoding'' (see
Section~\ref{sec:pathencode}).
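
As a sketch (the remote mount point {\tt /net/mymachine} is of course
only an assumption), a path transformation for an NFS setup in which
the local file system appears under a fixed prefix on the remote
machine could look like this:

\begin{verbatim}
(* Absolute names, which start with "/" in CM's protocol encoding,
   get the remote mount prefix; all other names pass through. *)
fun remotize p =
    if String.isPrefix "/" p then "/net/mymachine" ^ p else p

val _ = CM.Server.start { name = "thatmachine",
                          pathtrans = SOME remotize, pref = 0,
                          cmd = ("/usr/ucb/rsh",
                                 ["thatmachine",
                                  "/path/to/smlnj/bin/sml",
                                  "@CMslave"]) };
\end{verbatim}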

Once servers have been attached, one can invoke functions like
{\tt CM.recomp}, {\tt CM.make}, and {\tt CM.stabilize}.  They should
work the way they always do, but during compilation they will take
advantage of parallelism.

When CM is interrupted using Control-C (or such), one will sometimes
experience a certain delay if servers are currently attached and busy.
This is because the interrupt-handling code will wait for the servers
to finish what they are currently doing and bring them back to an
``idle'' state first.

\subsection{Pathname protocol encoding}
\label{sec:pathencode}

The master-slave protocol encodes pathnames in the following way:

A pathname consists of {\bf /}-separated arcs (like Unix pathnames).
The first arc can be interpreted relative to the current working directory,
relative to the root of the file system, relative to the root of a
volume (on systems that support separate volumes), or relative to a
directory that corresponds to a pathname anchor.  The first character
of the pathname is used to distinguish between these cases.

\begin{itemize}
\item If the name starts with {\bf ./}, then the name is relative to
the working directory.
\item If the name starts with {\bf /}, then the name is relative to
the file system root.
\item If the name starts with {\bf \%}, then the substring between this
{\bf \%} and the first {\bf /} is used as the name of a volume and the
remaining arcs are interpreted relative to the root of that volume.
\item If the name starts with {\bf \$}, then the substring between
this {\bf \$} and the first {\bf /} must be the name of a pathname
anchor.  The remaining arcs are interpreted relative to the directory
that (on the slave side) is denoted by the anchor.
\item Any other name is interpreted relative to the current working
directory.
\end{itemize}
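
A few illustrative examples (the volume and anchor names are made up):

\begin{verbatim}
./src/foo.sml      relative to the current working directory
src/foo.sml        also relative to the current working directory
/usr/lib/foo.cm    relative to the file system root
%c/foo.cm          relative to the root of volume c
$myanchor/foo.cm   relative to the directory denoted by anchor myanchor
\end{verbatim}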

\subsection{Parallel bootstrap compilation}

The bootstrap compiler\footnote{otherwise not mentioned in this
document}, with its main function {\tt CMB.make}, and its
cross-compilation variants will also use any
attached compile servers.  If one intends to use only the
bootstrap compiler, one can even attach servers that run on machines
with a different architecture or operating system.

Since the master-slave protocol is fairly simple, it cannot handle
complicated scenarios such as the one necessary for compiling the
``init group'' (i.e., the small set of files necessary for setting up
the ``pervasive'' environment) during {\tt CMB.make}.  Therefore, this
will always be done locally by the master process.

\section{Example: Dynamic linking}
\label{sec:dynlink}

Autoloading is convenient and avoids wasting memory on modules that
should be available at the interactive prompt but have not actually
been used so far.  However, sometimes one wants to be even more
aggressive and defer loading the code for a function until---at
runtime---that function is actually invoked.

CM does not provide immediate support for this kind of {\em dynamic
linking}, but it is quite simple to achieve the effect by carefully
arranging some helper libraries and associated stub code.

Consider the following module:
\begin{verbatim}
structure F = struct
    fun f (x: int): int =
        G.g x + H.h (2 * x + 1)
end
\end{verbatim}

Let us further assume that the implementations of structures {\tt G}
and {\tt H} are rather large so that it would be worthwhile to avoid
loading the code for {\tt G} and {\tt H} until {\tt F.f} is called
with some actual argument.  Of course, if {\tt F} were bigger, then we
would also want to avoid loading {\tt F} itself.

To achieve this goal, we first define a {\em hook} module which will
be the place where the actual implementation of our function will be
registered once it has been loaded.  This hook module is then wrapped
into a hook library.  Thus, we have {\tt f-hook.cm}:
\begin{verbatim}
Library
        structure F_Hook
is
        f-hook.sml
\end{verbatim}

and {\tt f-hook.sml}:

\begin{verbatim}
structure F_Hook = struct
    local
        fun placeholder (i: int) : int =
            raise Fail "F_Hook.f: unitinialized"
        val r = ref placeholder
    in
        fun init f = r := f
        fun f x = !r x
    end
end
\end{verbatim}

The hook module provides a reference cell into which a function of
type equal to {\tt F.f} can be installed.  Here we have chosen to hide
the actual reference cell behind a {\bf local} construct.  Accessor
functions are provided to install something into the hook
({\tt init}) and to invoke the so-installed value ({\tt f}).

With this preparation we can write the implementation module {\tt f-impl.sml}
in such a way that it not only provides the actual
code but also installs itself into the hook:
\begin{verbatim}
structure F_Impl = struct
    local
        fun f (x: int): int =
            G.g x + H.h (2 * x + 1)
    in
        val _ = F_Hook.init f
    end
end
\end{verbatim}
\noindent The implementation module is wrapped into its implementation
library {\tt f-impl.cm}:
\begin{verbatim}
Library
        structure F_Impl
is
        f-impl.sml
        f-hook.cm
        g.cm       (* imports G *)
        h.cm       (* imports H *)
\end{verbatim}
\noindent Note that {\tt f-impl.cm} must mention {\tt f-hook.cm} for
{\tt f-impl.sml} to be able to access structure {\tt F\_Hook}.

Finally, we replace the original contents of {\tt f.sml} with a stub
module that defines structure {\tt F}:
\begin{verbatim}
structure F = struct
    local
        val initialized = ref false
    in
        fun f x =
            (if !initialized then ()
              else if CM.make "f-impl.cm" then initialized := true
              else raise Fail "dynamic linkage for F.f failed";
             F_Hook.f x)
    end
end
\end{verbatim}
\noindent The trick here is to explicitly invoke {\tt CM.make} the
first time {\tt F.f} is called.  This will then cause {\tt f-impl.cm}
(and therefore {\tt g.cm} and also {\tt h.cm}) to be loaded and the
``real'' implementation of {\tt F.f} to be registered with the hook
module from where it will then be available to this and future calls
of {\tt F.f}.

For the new {\tt f.sml} to be compiled successfully it must be placed
into a library {\tt f.cm} that mentions {\tt f-hook.cm} and {\tt
smlnj/cm/full.cm}.  As we have seen, {\tt f-hook.cm} exports structure
{\tt F\_Hook}, and {\tt smlnj/cm/full.cm} is needed\footnote{The reduced
version of structure {\tt CM} as exported by library {\tt smlnj/cm/minimal.cm}
would have been sufficient, too.} because it exports {\tt CM.make}:
\begin{verbatim}
Library
        structure F
is
        f.sml
        f-hook.cm
        smlnj/cm/full.cm
\end{verbatim}

\noindent{\bf Beware!}  This solution makes use of {\tt smlnj/cm/full.cm},
which in turn requires the SML/NJ compiler to be present.  Therefore, it
is worthwhile only for really large program modules where the benefit
of not loading them is not outweighed by the need for the compiler.

\section{Some history}

Although its programming model is more general, CM's implementation is
closely tied to the Standard ML programming language~\cite{milner97}
and its SML/NJ implementation~\cite{appel91:sml}.

The current version is preceded by several other compilation managers,
the most recent going by the same name ``CM''~\cite{blume95:cm}, while
earlier ones were known as IRM ({\it Incremental Recompilation
Manager})~\cite{harper94:irm} and SC (for {\it Separate
Compilation})~\cite{harper-lee-pfenning-rollins-CM}.  CM owes many
ideas to SC and IRM.

Separate compilation in the SML/NJ system heavily relies on mechanisms
for converting static environments (i.e., the compiler's symbol
tables) into linear byte streams suitable for storage on
disks~\cite{appel94:sepcomp}.  However, unlike all its predecessors,
the current implementation of CM is integrated into the main compiler
and no longer relies on the {\em Visible Compiler} interface.

\pagebreak

\bibliography{blume,appel,ml}

\end{document}
