[smlnj] View of /sml/trunk/src/cm/Doc/manual.tex
Revision 434
Mon Sep 13 08:40:49 1999 UTC (20 years, 3 months ago) by blume
File size: 47551 byte(s)
CMB.symval added; manual update
\documentclass{article}
\usepackage{times}
\usepackage{epsfig}

\marginparwidth0pt\oddsidemargin0pt\evensidemargin0pt\marginparsep0pt
\topmargin0pt\advance\topmargin by-\headheight\advance\topmargin by-\headsep
\textwidth6.7in\textheight9.1in %\renewcommand{\baselinestretch}{1.2}
\columnsep0.25in

\author{Matthias Blume \\
Research Institute for Mathematical Sciences \\
Kyoto University}

\title{{\bf CM}\\
The SML/NJ Compilation and Library Manager \\
{\it\small (for SML/NJ version 110.20 and later)} \\
User Manual}

\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 3pt minus 2pt}

\newcommand{\nt}[1]{{\it #1}}
\newcommand{\tl}[1]{{\underline{\bf #1}}}
\newcommand{\ttl}[1]{{\underline{\tt #1}}}
\newenvironment{syntax}{\begin{tabbing} xyzwww \=\kill}{\end{tabbing}}
\newcommand{\ar}{$\rightarrow$\ }
\newcommand{\vb}{~$|$~}

\begin{document}

\bibliographystyle{alpha}

\maketitle

\section{Introduction}

This manual describes a new implementation of CM, the ``Compilation
and Library Manager'' for Standard ML of New Jersey (SML/NJ).  Like its
previous version, CM is in charge of managing separate compilation and
facilitates access to stable libraries.

Programming projects that use CM are typically composed of separate
{\em libraries}.  Libraries themselves can be internally
sub-structured using CM's notion of {\em groups}.  Using libraries and
groups, programs can be viewed as a {\em hierarchy of modules}.  The
organization of large projects tends to benefit from this
approach~\cite{blume:appel:cm99}.

CM uses {\em cutoff} techniques~\cite{tichy94} to minimize
recompilation work and provides automatic dependency analysis to free
the programmer from having to specify a detailed module dependency
graph by hand~\cite{blume:depend99}.

This new version of CM puts its emphasis on {\em working with libraries}.  This
contrasts with the previous implementation where the focus was on
compilation management while libraries were added as an afterthought.
Beginning now, CM takes a very library-centric view of the world.  In
fact, the implementation of SML/NJ itself has been restructured to
meet this approach.

\section{The CM model}

A CM library is a collection of ML source files and references to
other libraries together with an explicit export interface.  The
export interface lists all toplevel-defined symbols of the library
that shall be exported to its clients.  A library is described by the
contents of its {\em description file}.

\noindent Example:

\begin{verbatim}
Library
    signature FOO
    structure Foo
is
    foo.sig
    foo.sml
    helper.sml
    basis.cm
\end{verbatim}

This library exports two definitions, one for a structure named {\tt
Foo} and one for a signature named {\tt FOO}.  The specification for
such exports appears between the keywords {\tt Library} and {\tt is}.
The {\em members} of the library are specified after the keyword {\tt
is}.  Here we have three ML source files ({\tt foo.sig}, {\tt
foo.sml}, and {\tt helper.sml}) and a reference to one external
library ({\tt basis.cm}).  The entry {\tt basis.cm} typically denotes
the description file for the {\it Standard ML Basis
Library}~\cite{reppy99:basis}; most programs will want to list it in
their own description file(s).

\subsection{Library descriptions}

Members of a library do not have to be listed in any particular order
since CM will automatically calculate the dependency graph.  Some
minor restrictions on the source language are necessary to make this
work:
\begin{enumerate}
\item All top-level definitions must be {\em module} definitions
(structures, signatures, functors, or functor signatures).  In other
words, there can be no top-level type-, value-, or infix-definitions.
\item For a given symbol, there can be at most one ML source file per
library (or---more correctly---one file per library component; see
Section~\ref{sec:groups}) that defines the symbol at top level.
\item For a given symbol, there can be at most one sub-library or one
sub-group that exports that symbol.
\item The use of ML's {\bf open} construct is not permitted at top
level.
\end{enumerate}
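
\noindent As an illustration of the first rule, a file that would
otherwise contain top-level type and value definitions can wrap them
in a structure.  The following is only a sketch; the file name {\tt
counter.sml} and the structure {\tt Counter} are made up for this
example:

\begin{verbatim}
(* counter.sml -- hypothetical example file *)
structure Counter = struct
    type counter = int ref
    fun new () : counter = ref 0
    (* increment the counter and return its new value *)
    fun tick (c : counter) = (c := !c + 1; !c)
end
\end{verbatim}

Other source files of the same library can then refer to {\tt
Counter.new} and {\tt Counter.tick} directly, without using {\bf open}
at top level.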

Note that these rules do not require the exports of sub-groups or
sub-libraries to be distinct from the exports of ML source files.
Here, the disambiguating rule is that the definition from the ML
source overrides the definition imported from the group or library.

The full syntax for library description files also includes provisions
for a simple ``conditional compilation'' facility (see
Section~\ref{sec:preproc}), for access control (see
Section~\ref{sec:access}), and accepts ML-style nestable comments
delimited by \verb|(*| and \verb|*)|.

\subsection{Name visibility}

In general, all definitions exported from members of a library are
visible in all ML source files of that library.  The source code in
those source files can refer to them directly.  Here, ``exported''
means either a top-level definition within an ML source file or a
definition listed in a (sub-)library's export list.

If a library is structured into library components using {\em groups}
(see Section~\ref{sec:groups}), then---as far as name visibility is
concerned---each component (group) is treated like a separate library.

Cyclic dependencies among libraries, library components, or ML source
files within a library are detected and flagged as errors.

\subsection{Groups}
\label{sec:groups}

CM's group model eliminates a whole class of potential naming problems
by providing control over name spaces for program linkage.  This has
been described separately~\cite{blume:appel:cm99} but it sometimes
involves the use of ``administrative'' libraries whose sole purpose is
to rename certain definitions.

However, under CM, ``library'' does not only refer to namespace
management but often also to an actual file system object.  It would
be inconvenient if name resolution problems resulted in a
proliferation of additional library files.  Therefore, CM also
provides the notion of groups (or: library components).  Name
resolution for groups works like name resolution for entire libraries,
but grouping is entirely internal to each library.

During development, each group has its own description file which will
be referred to by the surrounding library or other components thereof.
The syntax of group description files is the same as that of library
description files with the following exceptions:

\begin{itemize}
\item The initial keyword {\tt Library} is replaced with {\tt Group}
followed by the name of the surrounding library's description file in
parentheses.
\item The export list can be left empty, in which case CM will
provide a default export list: all exports from ML source files plus
all exports from sub-components of the component.  (Note that this does
not include the exports of other libraries.)
\item There are some small restrictions on access control
specifications (see Section~\ref{sec:access}).
\end{itemize}

As an example, let us assume that {\tt foo-utils.cm} contains the
following text:

\begin{verbatim}
Group (foo-lib.cm)
is
    set-util.sml
    map-util.sml
    basis.cm
\end{verbatim}

Here, the library description file {\tt foo-lib.cm} would list {\tt
foo-utils.cm} as one of its members:

\begin{verbatim}
Library
    signature FOO
    structure Foo
is
    foo.sig
    foo.sml
    foo-utils.cm
    basis.cm
\end{verbatim}

\subsection{Multiple occurrences of the same member}

The following rules apply to multiple occurrences of the same ML source
file, the same library, or the same group within a program:

\begin{itemize}
\item Within the same description file, each member can be specified
at most once.
\item Libraries can be referred to freely from as many other groups or
libraries as the programmer desires.
\item A group cannot be used from outside the (uniquely defined)
library that it is a component of.  However, within that library it
can be referred to from arbitrarily many other groups.
\item The same ML source file cannot appear more than once.  If an ML
source file is to be referred to by multiple clients it must first be
``wrapped'' into a library (or---if that's sufficient---a group).
\end{itemize}
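
\noindent For instance, if several groups within {\tt foo-lib.cm} need
the same source file, that file can be wrapped into a group of its
own.  The names used below ({\tt shared-util.cm}, {\tt
shared-util.sml}, {\tt parser.cm}, and {\tt parser.sml}) are
hypothetical and only serve to sketch the idea:

\begin{verbatim}
(* shared-util.cm: wraps the single source file *)
Group (foo-lib.cm)
is
    shared-util.sml
    basis.cm
\end{verbatim}

\begin{verbatim}
(* parser.cm: one of possibly many clients within foo-lib.cm *)
Group (foo-lib.cm)
is
    parser.sml
    shared-util.cm
    basis.cm
\end{verbatim}

Any further group of {\tt foo-lib.cm} may list {\tt shared-util.cm} in
the same way, whereas listing {\tt shared-util.sml} itself in a second
description file would violate the last rule.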

\subsection{Top-level groups}

Mainly to facilitate some superficial backward-compatibility, CM also
allows groups to appear at top level, i.e., outside of any library.
Such groups must omit the parenthetical library specification and then
cannot also be used within libraries. One could think of the top level
itself as a ``virtual unnamed library''.  Top-level groups are then
components of this virtual library.
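
\noindent A top-level group might therefore look as follows.  This is
a minimal sketch; the member {\tt main.sml} is made up:

\begin{verbatim}
Group
is
    main.sml
    basis.cm
\end{verbatim}

Since the export list is left empty here, the default export list for
groups applies.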

\section{Naming objects in the file system}

\subsection{Motivation}

File naming has been an area notorious for its problems and the cause of
most of the gripes from CM's users.  With this in mind, CM now takes a
different approach to file name resolution.

The main difficulty lies in the fact that files or even whole
directories may move after CM has already partially (but not fully)
processed them.  For example, this happens when the {\em autoloader}
(see Section~\ref{sec:autoload}) has been used before saving an ML
session via {\tt SMLofNJ.exportML}.  Under a correct installation, CM
will now be able to resume such a session even when operating in a
different environment, perhaps on a different machine with different
file system mounts, or a different location of the SML/NJ
installation.

For this, CM provides a configurable mechanism for locating file
system objects.  Moreover, it invokes this mechanism as late as
possible and is prepared to re-invoke it if the configuration changes.

\subsection{Basic rules}

CM uses its own ``standard'' syntax for pathnames which happens to be
the same as the one used by most Unix-like systems: path name
components are separated by ``{\bf /}'', paths beginning with ``{\bf
/}'' are considered {\em absolute} while other paths are {\em
relative}.

Since this standard syntax does not cover system-specific aspects such
as volume names, it is also possible to revert to ``native'' syntax by
enclosing the name in double-quotes.  Of course, description files
that use path names in native syntax are not portable across operating
systems.

Absolute pathnames are resolved in the usual manner specific to the
operating system.  However, it is advisable to avoid absolute
pathnames because they are certain to ``break'' if the corresponding
file moves to a different location.

The resolution of relative pathnames is more complicated:

\begin{itemize}
\item If the first component of a relative pathname is a
``configuration anchor'' (see Section~\ref{sec:anchors}), then we call
the path {\em anchored}.  In this case the
whole name will be resolved relative to the value associated with that
anchor.  For example, if the path is {\tt foo/bar/baz} and {\tt
foo} is known as an anchor mapped to {\tt /usr/local}, then the
full name of the actual file system object referred to is {\tt
/usr/local/foo/bar/baz}. Note that the {\tt foo} component is not
stripped away during the resolution process; different anchors that
map to the same directory still remain different.
\item Otherwise, if the relative name appears in some description file
whose name is {\it path}{\tt /}{\it file}{\tt .cm}, then it will be
resolved relative to {\it path}, i.e., relative to the directory that
contains the description file.
\item If a non-anchored relative path is entered interactively, for
example as an argument to one of CM's interface functions, then it
will be resolved in the OS-specific manner, i.e., relative to the
current working directory.  However, CM will internally represent the
name in such a way that it remembers the corresponding working
directory.  Should the working directory change during an ongoing CM
session while there still is a reference to the name, then CM will
switch its mode of operation and prepend the path of the original working
directory. As a result, two names specified using identical
strings but with different working directories in effect will be kept
distinct and continue to refer to those file system locations that they
referred to when they were first seen.
\end{itemize}
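
\noindent To illustrate the first two rules, consider a description
file {\tt /home/alice/proj/lib.cm} with the contents shown below.  All
file and directory names are invented for this sketch, and it is
assumed that {\tt tools} is a configured anchor mapped to {\tt /opt}
while neither {\tt foo.sml} nor {\tt util} is an anchor.  The comments
show the file system object that each member would then refer to:

\begin{verbatim}
Library
    structure Foo
is
    foo.sml               (* /home/alice/proj/foo.sml *)
    util/helper.sml       (* /home/alice/proj/util/helper.sml *)
    tools/gen/gen-lib.cm  (* /opt/tools/gen/gen-lib.cm *)
\end{verbatim}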

\subsection{Anchor configuration}
\label{sec:anchors}

The association of path name anchors with their corresponding
directory names is a simple one-way mapping.  At startup time, this
mapping is initialized by reading two configuration files: an
installation-specific one and a user-specific one.  After that, the
mapping can be maintained using CM's interface functions {\tt
CM.setAnchor}, {\tt CM.cancelAnchor}, and {\tt CM.resetPathConfig}
(see Section~\ref{sec:api}).

The default location of the installation specific configuration file
is {\tt /usr/lib/smlnj-pathconfig}.  However, normally this default
gets replaced (via an environment variable named {\tt
CM\_PATHCONFIG\_DEFAULT}) at installation time by a path pointing to
wherever the installation actually puts the configuration file.
The user can specify a new location at startup time using the
environment variable {\tt CM\_PATHCONFIG}.

The default location of the user-specific configuration file is {\tt
.smlnj-pathconfig} in the user's home directory (which must be given
by the {\tt HOME} environment variable).  At startup time, this
default can be overridden by a fixed location which must be given as
the value of the environment variable {\tt CM\_LOCAL\_PATHCONFIG}.

The syntax of all configuration files is identical.  Lines are
processed from top to bottom. White space divides lines into tokens.
\begin{itemize}
\item A line with exactly two tokens associates an anchor (the first
token) with a directory in native syntax (the second token).  Neither
anchor nor directory name may contain white space and the anchor
should not contain a {\bf /}.  If the directory name is a relative
name, then it will be expanded by prepending the name of the directory
that contains the configuration file.
\item A line containing exactly one token that is the name of an
anchor cancels any existing association of that anchor with a
directory.
\item A line with a single token that consists of a single minus sign
{\bf -} cancels all existing anchors.  This typically makes sense only
at the beginning of the user-specific configuration file and
erases any settings that were made by the installation-specific
configuration file.
\item Lines with no token (i.e., empty lines) will be silently ignored.
\item Any other line is considered malformed and will cause a warning
but will otherwise be ignored.
\end{itemize}
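
\noindent As a sketch, a user-specific configuration file could look
like this (all anchor and directory names are hypothetical):

\begin{verbatim}
-
mylibs   /home/alice/sml
tools    ../opt
scratch  /tmp/scratch
scratch
\end{verbatim}

The first line discards all anchors set by the installation-specific
file.  The next three lines define anchors; the relative directory
{\tt ../opt} is expanded relative to the directory that contains this
configuration file.  The last line cancels the {\tt scratch} anchor
again, so only {\tt mylibs} and {\tt tools} remain defined.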

\section{Using CM}

\subsection{Structure CM}
\label{sec:api}

Functions that control CM's operation are accessible as members of a
structure named {\tt CM}.  Here is a description of the members of
this structure:

\subsubsection*{Compiling}

The two main activities when using CM are to compile ML source code and to
build stable libraries:

\begin{verbatim}
  val recomp : string -> bool
  val stabilize : bool -> string -> bool
\end{verbatim}

{\tt CM.recomp} takes the name of a program's ``root'' description
file and compiles or recompiles all ML source files that are necessary
to provide definitions for the root library's export list.

{\tt CM.stabilize} takes a boolean flag and then the name of a library
and {\em stabilizes} this library.  A library is stabilized by writing
all information pertaining to it (including all of its library
components) into a single file.  Later, when the library is used in
other programs, all members of the library are guaranteed to be
up-to-date; no dependency analysis work and no recompilation work will
be necessary.  If the boolean flag is {\tt false}, then all
sub-libraries of the library must already be stable.  If the flag is
{\tt true}, then CM will recursively stabilize all libraries reachable
from the given root.

After a library has been stabilized it can be used even if none of its
original sources---including the description file---are present.

The boolean result of {\tt CM.recomp} and {\tt CM.stabilize} indicates
success or failure of the operation ({\tt true} = success).
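
\noindent A typical interaction might look as follows.  This is only a
sketch; it assumes that a description file {\tt foo-lib.cm} exists in
the current working directory:

\begin{verbatim}
(* compile whatever is needed for the exports of foo-lib.cm *)
val compiled = CM.recomp "foo-lib.cm";

(* on success, stabilize foo-lib.cm together with all
 * libraries reachable from it *)
val stable = compiled andalso CM.stabilize true "foo-lib.cm";
\end{verbatim}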

\subsubsection*{Linking}

In SML/NJ, linking means executing top-level code of each compilation
unit.  The resulting bindings can then be bound at the interactive top
level.

\begin{verbatim}
  val make : string -> bool
  val autoload : string -> bool
\end{verbatim}

{\tt CM.make} first acts like {\tt CM.recomp}.  If the (re-)compilation
is successful, then it proceeds by linking all modules.  Provided
there are no link-time errors, it finally introduces new bindings at
top level.

During the course of the same {\tt CM.make}, the code of each
compilation unit will be executed at most once.  Code in units that
are marked as {\it private} (see Section~\ref{sec:sharing}) will be
executed exactly once.  Code in other units will be executed only if
the unit has been recompiled since it was executed last time or if it
depends on another compilation unit whose code has been executed
since.

In effect, different invocations of {\tt CM.make} (and {\tt
CM.autoload}) will share dynamic state created at link time as much as
possible unless the compilation units in question have been explicitly
marked private.

{\tt CM.autoload} acts like {\tt CM.make}, only ``lazily''. See
Section~\ref{sec:autoload} for more information.

As before, the result of {\tt CM.make} indicates success or failure of
the operation.  The result of {\tt CM.autoload} indicates success or
failure of the {\em registration}.  (It does not know yet whether
loading will actually succeed.)
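
\noindent For example, assuming again a description file {\tt
foo-lib.cm} in the current working directory, the two functions could
be used like this (a sketch, not a prescribed workflow):

\begin{verbatim}
(* compile, link, and bind the exports of foo-lib.cm at top level *)
if CM.make "foo-lib.cm"
then print "foo-lib.cm is ready\n"
else print "CM.make failed\n";

(* alternatively: only register the exports; the work is deferred *)
val registered = CM.autoload "foo-lib.cm";
\end{verbatim}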

\subsubsection*{Flags}

Several flags control the operation of CM.  Any invocation of the
corresponding {\tt get} function reads the current value of the flag.  An
invocation of the {\tt set} function replaces the current value with
the argument given to {\tt set}.

\begin{verbatim}
  val verbose : { get: unit -> bool, set: bool -> unit }
  val debug : { get: unit -> bool, set: bool -> unit }
  val keep_going : { get: unit -> bool, set: bool -> unit }
  val parse_caching : { get: unit -> int, set: int -> unit }
  val warn_obsolete : { get: unit -> bool, set: bool -> unit }
\end{verbatim}

{\tt CM.verbose} can be used to turn off CM's progress messages.  The
default is {\em true} and can be overridden at startup time by the
environment variable {\tt CM\_VERBOSE}.

In the case of a compile-time error {\tt CM.keep\_going} instructs the
{\tt CM.recomp} phase to continue working on parts of the dependency
graph that are not related to the error.  (This does not work for
outright syntax errors because a correct parse is needed before CM can
construct the dependency graph.)  The default is {\em false} and can
be overridden at startup by the environment variable {\tt
CM\_KEEP\_GOING}.

{\tt CM.parse\_caching} sets a limit on how many parse trees are
cached in main memory.  In certain cases CM must parse source files in
order to be able to calculate the dependency graph.  Later, the same
files may need to be compiled, in which case an existing parse tree
saves the time to parse the file again.  Keeping parse trees can be
expensive in terms of memory usage.  Moreover, CM makes special
efforts to avoid re-parsing files in the first place unless they have
actually been modified.  Therefore, it may not make much sense to set
this value very high.  The default is {\em 100} and can be overridden
at startup time by the environment variable {\tt CM\_PARSE\_CACHING}.

This version of CM uses an ML-inspired syntax for expressions in its
conditional compilation subsystem (see Section~\ref{sec:preproc}).
However, for the time being it will accept most of the original
C-inspired expressions but produces a warning for each occurrence of
an old-style operator. {\tt CM.warn\_obsolete} can be used to turn
these warnings off. The default is {\em true} and can be overridden at
startup time by the environment variable {\tt CM\_WARN\_OBSOLETE}.

{\tt CM.debug} can be used to turn on debug mode.  This currently has
no effect since there is no debug code in the implementation. The
default is {\em false} and can be overridden at startup time by the
environment variable {\tt CM\_DEBUG}.
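
\noindent Since each flag is a record of a {\tt get} and a {\tt set}
function, it can be accessed with ML's record selectors.  The values
chosen in the following sketch are arbitrary:

\begin{verbatim}
#set CM.verbose false;         (* suppress progress messages *)
#set CM.keep_going true;       (* continue after compile-time errors *)
#set CM.parse_caching 50;      (* cache at most 50 parse trees *)
val v = #get CM.verbose ();    (* read the current value of a flag *)
\end{verbatim}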

\subsubsection*{Path anchors}

Structure {\tt CM} also provides functions to explicitly manipulate
the path anchor configuration.

\begin{verbatim}
  val setAnchor : string * string -> unit
  val cancelAnchor : string -> unit
  val resetPathConfig : unit -> unit
\end{verbatim}

{\tt CM.setAnchor} creates a new association or replaces an existing
association of an anchor name with a directory name.  Both names must
be given as strings---the directory name in native syntax.  If the
directory name is a relative path name, then it will be expanded by
prepending the name of the current working directory.

{\tt CM.cancelAnchor} deletes the association of the given anchor name
with its directory should such an association currently exist.
Otherwise it will do nothing.

{\tt CM.resetPathConfig} erases the entire existing path configuration
mapping.
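
\noindent The following sketch shows the three functions in use.  The
anchor name {\tt mylibs} and the directory are hypothetical, and the
argument order of {\tt CM.setAnchor} (anchor name first, directory
name second) is an assumption based on the description above:

\begin{verbatim}
(* assumed argument order: (anchor name, directory name) *)
CM.setAnchor ("mylibs", "/home/alice/sml");
CM.cancelAnchor "mylibs";     (* remove the association again *)
CM.resetPathConfig ();        (* erase all remaining anchors *)
\end{verbatim}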

\subsubsection*{Setting CM variables}

CM variables are used by the conditional compilation system (see
Section~\ref{sec:cmvars}).   Some of these variables are predefined,
but the user can add new ones and alter or remove those that already
exist.

\begin{verbatim}
  val symval : string ->
        { get: unit -> int option, set: int option -> unit }
\end{verbatim}

Function {\tt CM.symval} returns a {\tt get}-{\tt set}-pair for the
symbol whose name string was specified as the argument.  Note that the
{\tt get}-{\tt set}-pair operates over type {\tt int option}; a value
of {\tt NONE} means that the variable is not defined.

\noindent Examples:
\begin{verbatim}
#get (CM.symval "X") ();       (* query value of X *)
#set (CM.symval "Y") (SOME 1); (* set Y to 1 *)
#set (CM.symval "Z") NONE;     (* remove definition for Z *)
\end{verbatim}

Some care is necessary as {\tt CM.symval} does not check whether the
syntax of the argument string is valid.  (However, the worst thing
that could happen is that the name of a variable cannot be written out
in CM's description files and that, therefore, the associated value
cannot be queried.)

\subsubsection*{Status inspection}

CM keeps a lot of internal state.  Some of this state can be inspected.

\begin{verbatim}
  val showPending : unit -> unit
  val listLibs : unit -> unit
\end{verbatim}

{\tt CM.showPending} lists to standard output the names of all symbols
which are currently registered as being bound at top level via the
autoloading mechanism and which so far have not actually been
resolved.

{\tt CM.listLibs} lists to standard output the path names of library
description files for those stable libraries that are currently known
to CM.  This list includes libraries which have been accessed
``implicitly'' by virtue of being a sub-library of another library
that has been accessed in the past.  Library state can take up
considerable space in main memory.  Use {\tt CM.dismissLib} (see
below) to remove a library from CM's registry.

\subsubsection*{Altering CM's internal state}

Sometimes it can become necessary to explicitly change or update CM's
internal state.

\begin{verbatim}
  val dismissLib : string -> unit
  val synchronize : unit -> unit
  val reset : unit -> unit
\end{verbatim}

{\tt CM.dismissLib} is used to remove a stable library from CM's
internal registry.  See the discussion of {\tt CM.listLibs} above.
Although removing a library from the registry may recover considerable
amounts of main memory, doing so also eliminates any chance of sharing
the associated data structures with later references to the same
library.  Therefore, it is not always in the interest of
memory-conscious users to use this feature.

Sharing of link-time state created by the library is {\em not}
affected by this.

{\tt CM.synchronize} updates tables internal to CM to reflect changes
in the file system.  In particular, this will be necessary when the
association of file names to ``file IDs'' (in Unix: inode numbers)
changes during an ongoing session.  In practice, the need for this
tends to be rare.

{\tt CM.reset} completely erases all internal state in CM.  This is
not very advisable since it will also break the association with
pre-loaded libraries.  It may be a useful tool for determining the
amount of space taken up by the internal state, though.
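
\noindent A short sketch of how these functions might be combined.  It
assumes that {\tt foo-lib.cm} has been stabilized and accessed
earlier, and that {\tt CM.dismissLib} accepts the library's path name
in the form in which {\tt CM.listLibs} reports it:

\begin{verbatim}
CM.listLibs ();                (* show stable libraries known to CM *)
CM.dismissLib "foo-lib.cm";    (* drop one of them from the registry *)
CM.synchronize ();             (* update tables after file system changes *)
\end{verbatim}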

\subsection{The autoloader}
\label{sec:autoload}

From the user's point of view, a call to {\tt CM.autoload} acts very
much like the corresponding call to {\tt CM.make} because the same
bindings that {\tt CM.make} would introduce into the top-level
environment are also introduced by {\tt CM.autoload}.  However, most
work will be deferred until some code entered later at the interactive
top level refers to one or more of these bindings.  Only then will CM
go and perform just the minimal work necessary to provide the actual
definitions.

In this version of CM the autoloader plays a central role.  Unlike
before, it cannot be turned off since it provides many of the standard
pre-defined top-level bindings in the interactive system.

The autoloader is a convenient tool for virtually ``loading'' an
entire library without incurring an undue increase in memory
consumption for library modules that are not actually being used.
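
\noindent The following session sketch illustrates the deferred
behavior.  It assumes a library {\tt foo-lib.cm} that exports {\tt
structure Foo}; the exact output of {\tt CM.showPending} is not shown
here:

\begin{verbatim}
CM.autoload "foo-lib.cm";   (* registers the exports, defers the work *)
CM.showPending ();          (* lists symbols not yet resolved *)
structure F = Foo;          (* first reference to Foo: CM now does
                             * the work needed to provide it *)
CM.showPending ();          (* Foo should no longer be pending *)
\end{verbatim}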

\subsection{Sharing of state}
\label{sec:sharing}

By default, CM tries to let multiple invocations of {\tt CM.make} or
{\tt CM.autoload} share dynamic state created by link-time effects.
Of course, this is not possible if the compilation unit in question
has recently been recompiled or depends on another compilation unit
whose code has recently been re-executed.  The programmer can
explicitly mark certain ML files as {\em shared}, in which case CM
will issue a warning whenever the unit's code has to be re-executed.

State created by compilation units marked as {\em private} is never
shared across multiple calls to {\tt CM.make} or {\tt CM.autoload}.
However, each such call incurs an associated {\em traversal} of the
dependency graph, and during such a traversal each compilation unit
will be executed at most once.  In other words, the same ``program''
will not see multiple instantiations of the same compilation unit
(where ``program'' refers to the code managed by one call to {\tt
CM.make} or {\tt CM.autoload}).

As long as only {\tt CM.make} is involved, this behavior is not
difficult to describe since each traversal will have completed when
the call to {\tt CM.make} returns.  However, that is not true in the
case of {\tt CM.autoload}.  Like {\tt CM.make}, {\tt CM.autoload}
initiates a traversal. But unlike in the case of {\tt CM.make}, that
traversal remains ``suspended'' and will be performed incrementally as
necessary---driven by code entered at the interactive top level.  And
yet, it is still the case that each compilation unit will be linked at
most once during this traversal and private state will not be confused
with private state of other traversals that might be active at the
same time.

% Need a good example here.

\subsubsection*{Sharing annotations}

ML source files can be specified as being either {\em private} or {\em
shared}.  This is done by adding a {\em member class} specification
for the file in the library- or group description file (see
Section~\ref{sec:classes}).  In other words, to mark an ML file as
{\em private}, follow the file name with a colon {\bf :} and the word
{\tt private}.  For {\em shared} ML files, replace {\tt private} with
{\tt shared}.

An ML source file that is not annotated will typically be treated as
{\em shared} unless it statically depends on some other {\em private}
source.  It is an error for a {\em shared} source to depend on a {\em
private} source.
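
\noindent As a sketch, a description file with sharing annotations
could look like this (the file names are hypothetical):

\begin{verbatim}
Library
    structure Foo
is
    foo.sml
    counter.sml : private
    table.sml : shared
    basis.cm
\end{verbatim}

Here, {\tt table.sml} must not depend on {\tt counter.sml}, since a
{\em shared} source cannot depend on a {\em private} one.  The
unannotated {\tt foo.sml} may depend on either of them.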

\subsubsection*{Sharing with the interactive system}

The SML/NJ interactive system, which includes the compiler, is itself
created by linking various libraries, and some of these libraries can
also be used in user programs.  Examples are the Standard ML Basis
Library {\tt basis.cm}, the SML/NJ library {\tt smlnj-lib.cm}, and the
ML-Yacc library {\tt ml-yacc-lib.cm}.

If a module from a library is used by both the interactive system and
a user program running under control of the interactive system, then
CM will let them share code and dynamic state.

\section{Member classes and tools}
\label{sec:classes}

In addition to using existing ML source files, CM can also invoke
tools that generate ML source code.  Examples are
program-generating programs such as ML-Yacc~\cite{tarditi90:yacc} or
ML-Lex~\cite{appel89:lex}, literate programming tools like
noweb~\cite{ramsey:simplified}, but also more generic ``generators''
such as the checkout program {\bf co} for RCS archives.
(Currently, CM knows ML-Yacc, ML-Lex, and ML-Burg, but other tools can
be added easily.)

Typically, CM determines which tool to use by looking at clues like
the file name suffix.  However, it is also possible to explicitly tell
CM which tool to use by specifying the {\em member class} of the
source in the description file.  For this, the file name is followed
by a colon {\bf :} and the name of the member class.  Class names are
case-insensitive.

In addition to genuine tool classes, there are a few member classes
that refer to facilities internal to CM: {\tt sml} is the class of
ordinary ML source files without sharing annotation, {\tt shared} is
the class of ML source files whose dynamic state must be shared across
invocations of {\tt CM.make} or {\tt CM.autoload}, {\tt private} is
the class of ML source files whose dynamic state cannot be shared
across invocations of {\tt CM.make} or {\tt CM.autoload}, and {\tt cm}
is the class of CM library or group description files.  Known tool
classes currently are {\tt mlyacc} for ML-Yacc sources, {\tt mllex}
for ML-Lex sources, and {\tt mlburg} for ML-Burg
sources~\cite{mlburg93}.

CM automatically classifies files with a {\tt .sml} or {\tt .sig} suffix
as (unannotated) ML-source, file names ending in {\tt .cm} as CM
descriptions, {\tt .grm} or {\tt .y} files as input to ML-Yacc, {\tt
.lex} or {\tt .l} as input to ML-Lex, and file names ending in {\tt
.burg} as ML-Burg specifications.
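
\noindent The following sketch combines automatic classification with
an explicit member class.  All member names except {\tt
ml-yacc-lib.cm} and {\tt basis.cm} are made up:

\begin{verbatim}
Library
    structure Calc
is
    calc.grm              (* ML-Yacc input, classified by suffix *)
    calc-lexer : mllex    (* no suffix clue: class given explicitly *)
    calc.sml              (* ordinary ML source *)
    ml-yacc-lib.cm
    basis.cm
\end{verbatim}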

\section{Conditional compilation}
\label{sec:preproc}

In its description files, CM offers a simple conditional
compilation facility inspired by the pre-processor for the C
language~\cite{k&r2}.  However, it is not really a pre-processor, and
the syntax of the controlling expressions is borrowed from SML.

Sequences of members can be guarded by {\tt \#if}-{\tt \#endif}
brackets with optional {\tt \#elif} and {\tt \#else} lines in between.
The same guarding syntax can also be used to conditionalize the export
list.  {\tt \#if}-, {\tt \#elif}-, {\tt \#else}-, and {\tt
\#endif}-lines must start in the first column and always
extend to the end of the current line.  {\tt \#if} and {\tt \#elif}
must be followed by a boolean expression.

Boolean expressions can be formed by comparing arithmetic expressions
(using operators {\tt <}, {\tt <=}, {\tt =}, {\tt >=}, {\tt >}, or
{\tt <>}), by logically combining two other boolean expressions (using
operators {\tt andalso}, {\tt orelse}, {\tt =}, or {\tt <>}), by
querying the existence of a CM symbol definition, or by querying the
existence of an exported ML definition.

Arithmetic expressions can be numbers or references to CM symbols, or
can be formed from other arithmetic expressions using operators {\tt
+}, {\tt -} (subtraction), \verb|*|, {\tt div}, {\tt mod}, or $\tilde{~}$
(unary minus).  All arithmetic is done on signed integers.

Any expression (arithmetic or boolean) can be surrounded by
parentheses to enforce precedence.

\subsection{CM variables}
\label{sec:cmvars}

CM provides a number of names that stand for certain integers.  The
exact set of provided variable names depends on SML/NJ version number,
machine architecture, and operating system.  A reference to a CM
variable is considered an arithmetic expression. If the variable is
not defined, then it evaluates to 0.  The expression {\tt
defined}($v$) is a boolean expression that yields true if and only if
$v$ is a defined CM variable.

The names of CM variables are formed starting with a letter followed
by zero or more occurrences of letters, decimal digits, apostrophes, or
underscores.

The following variables will be defined and bound to 1:
\begin{itemize}
\item depending on the operating system: {\tt OPSYS\_UNIX}, {\tt
OPSYS\_WIN32}, {\tt OPSYS\_MACOS}, {\tt OPSYS\_OS2}, or {\tt
OPSYS\_BEOS}
\item depending on processor architecture: {\tt ARCH\_SPARC}, {\tt
ARCH\_ALPHA32}, {\tt ARCH\_MIPS}, {\tt ARCH\_X86}, {\tt ARCH\_HPPA},
{\tt ARCH\_RS6000}, or {\tt ARCH\_PPC}
\item depending on the processor's endianness: {\tt BIG\_ENDIAN} or
{\tt LITTLE\_ENDIAN}
\item depending on the native word size of the implementation: {\tt
SIZE\_32} or {\tt SIZE\_64}
\item the symbol {\tt NEW\_CM}
\end{itemize}

Furthermore, the symbol {\tt SMLNJ\_VERSION} will be bound to the
major version number of SML/NJ (i.e., the number before the first dot)
and {\tt SMLNJ\_MINOR\_VERSION} will be bound to the system's minor
version number (i.e., the number after the first dot).
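
For example, a member list could select an implementation based on
these variables.  The following is only a sketch; the file names are
hypothetical:

\begin{verbatim}
#if defined(OPSYS_UNIX) andalso SIZE_32 > 0
  unix-32bit.sml
#elif SMLNJ_VERSION >= 110 andalso SMLNJ_MINOR_VERSION >= 20
  recent-generic.sml
#else
  fallback.sml
#endif
\end{verbatim}

Because an undefined variable evaluates to 0, the test {\tt SIZE\_32 >
0} is simply false on implementations where {\tt SIZE\_32} is not
defined.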

Using the {\tt CM.symval} interface one can define additional
variables or modify existing ones.

\subsection{Querying exported definitions}

An expression of the form {\tt defined}($n$ $s$) where $s$ is an ML
symbol and $n$ is an ML namespace specifier is a boolean expression
that yields true if and only if any member included before this test
exports a definition under this name.  Therefore, order among members
matters after all (but it remains unrelated to the problem of
determining static dependencies)!  The namespace specifier must be one
of: {\tt structure}, {\tt signature}, {\tt functor}, or {\tt funsig}.

If the query takes place in the ``exports'' section of a description
file, then it yields true if {\em any} of the included members exports
the named symbol.

\noindent Example:

\begin{verbatim}
Library
  structure Foo
#if defined(structure Bar)
  structure Bar
#endif
is
#if SMLNJ_VERSION > 110
  new-foo.sml
#else
  old-foo.sml
#endif
#if defined(structure Bar)
  bar-client.sml
#else
  no-bar-so-far.sml
#endif
\end{verbatim}

Here, the file {\tt bar-client.sml} gets included if {\tt
SMLNJ\_VERSION} is greater than 110 and {\tt new-foo.sml} exports a
structure {\tt Bar} {\em or} if {\tt SMLNJ\_VERSION <= 110} and {\tt
old-foo.sml} exports structure {\tt Bar}. \\ Otherwise {\tt
no-bar-so-far.sml} gets included instead.  In addition, the export of
structure {\tt Bar} is guarded by its own existence.  (Structure {\tt
Bar} could also be defined by {\tt no-bar-so-far.sml} in which case it
would get exported regardless of the outcome of the other {\tt
defined} test.)

\subsection{Explicit errors}

A pseudo-member of the form {\tt \#error $\ldots$} which---like other
{\tt \#}-items---starts in the first column and extends to the end of
the line causes an explicit error message unless it gets excluded by
the conditional compilation logic.  The error message is given by the
remainder of the line after the word {\tt error}.
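
For instance, a member list might use {\tt \#error} to reject
unsupported platforms (a sketch with hypothetical file names and
message text):

\begin{verbatim}
#if defined(OPSYS_UNIX)
  unix-io.sml
#elif defined(OPSYS_WIN32)
  win32-io.sml
#else
#error no I/O implementation for this operating system
#endif
\end{verbatim}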

\subsection{BNF for expressions}

\begin{tabbing}
\nt{non-terminal}~\= \ar \kill
\nt{letter} \> \ar \tl{A} \vb $\ldots$ \vb \tl{Z} \vb \tl{a} \vb $\ldots$ \vb \tl{z} \\
\nt{digit}  \> \ar \tl{0} \vb $\ldots$ \vb \tl{9} \\
\nt{ldau}   \> \ar \nt{letter} \vb \nt{digit} \vb \tl{'} \vb \tl{\_} \\
\\
\nt{number} \> \ar \nt{digit} \{\nt{digit}\} \\
\nt{sym}    \> \ar \nt{letter} \{\nt{ldau}\} \\
\\
\nt{aatom}  \> \ar \nt{number} \vb \nt{sym} \vb \tl{(} \nt{asum} \tl{)} \vb \tl{$\tilde{~}$} \nt{aatom} \\
\nt{aprod}  \> \ar \{\nt{aatom} (\tl{*} \vb \tl{div} \vb \tl{mod})\} \nt{aatom} \\
\nt{asum}   \> \ar \{\nt{aprod} (\tl{+} \vb \tl{-})\} \nt{aprod} \\
\\
\nt{ns}     \> \ar \tl{structure} \vb \tl{signature} \vb \tl{functor} \vb \tl{funsig} \\
\nt{mlsym}  \> \ar {\em a Standard ML identifier} \\
\nt{query}  \> \ar \tl{defined} \tl{(} \nt{sym} \tl{)} \vb \tl{defined} \tl{(} \nt{ns} \nt{mlsym} \tl{)} \\
\\
\nt{acmp}   \> \ar \nt{asum} (\ttl{<} \vb \ttl{<=} \vb \ttl{>} \vb \ttl{>=} \vb \ttl{=} \vb \ttl{<>}) \nt{asum} \\
\\
\nt{batom}  \> \ar \nt{query} \vb \nt{acmp} \vb \tl{not} \nt{batom} \vb \tl{(} \nt{bdisj} \tl{)} \\
\nt{bcmp}   \> \ar \nt{batom} [(\ttl{=} \vb \ttl{<>}) \nt{batom}] \\
\nt{bconj}  \> \ar \{\nt{bcmp} \tl{andalso}\} \nt{bcmp} \\
\nt{bdisj}  \> \ar \{\nt{bconj} \tl{orelse}\} \nt{bconj} \\
\\
\nt{expression} \> \ar \nt{bdisj}
\end{tabbing}
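
As an informal illustration of the precedence levels that this grammar
is meant to encode (the symbols {\tt FOO}, {\tt A}, and {\tt B} are
hypothetical), the first line below should be read as if it were
parenthesized like the second:

\begin{verbatim}
not defined(FOO) andalso A + 2 * B <= 10
(not defined(FOO)) andalso ((A + (2 * B)) <= 10)
\end{verbatim}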

\section{Access control}
\label{sec:access}

The basic idea behind CM's access control is the following: In their
description files groups and libraries can specify a list of
{\em privileges} that the client must have in order to be able to use them.
Privileges at this level are just names (strings) and must be written
in front of the initial keyword {\tt Library} or {\tt Group}.  If one
group or library imports from another group or library, then
privileges (or rather: privilege requirements) are being inherited.
In effect, to be able to use a program, one must have all privileges
for all its libraries, sub-libraries and library components,
components of sub-libraries, and so on.

Of course, this alone would not yet be satisfactory because there
should also be the possibility of setting up a ``safety wall:'' a
library {\tt LSafe.cm} could ``wrap'' all the unsafe operations in
{\tt LUnsafe.cm} with enough error checking that they become safe.
Therefore, a user of {\tt LSafe.cm} should not also be required to
possess the privileges that would be required if one were to use {\tt
LUnsafe.cm} directly.

In CM's access control model it is possible for a library to ``wrap''
privileges.  If a privilege $P$ has been wrapped, then the user of the
library does not need to have privilege $P$ even though the library is
using another library that requires privilege $P$.  In essence, the
library acts as a ``proxy'' who provides the necessary credentials for
privilege $P$ to the sub-library.

Of course, not everybody can be allowed to establish a library with
such a ``wrapped'' privilege $P$.  The programmer who does that should at
least herself have privilege $P$ (but perhaps better, she should have
{\em permission to wrap $P$}---a stronger requirement).

In CM, wrapping a privilege is done by specifying the name of that
privilege within parentheses.  The wrapping becomes effective once the
library gets stabilized via {\tt CM.stabilize}.  The (not yet
implemented) enforcement mechanism must ensure that anyone who
stabilizes a library that wraps $P$ has permission to wrap $P$.

Note that privileges cannot be wrapped at the level of CM groups.
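
The following sketch shows how the two description files from the
``safety wall'' scenario might look.  The privilege name {\tt unsafe},
the exported structures, and the ML file names are hypothetical, and
the placement of the parenthesized privilege in front of the keyword
{\tt Library} is an assumption based on the description above.  The
first file plays the role of {\tt LUnsafe.cm}, the second that of
{\tt LSafe.cm}:

\begin{verbatim}
unsafe
Library
  structure Unsafe
is
  unsafe.sml
\end{verbatim}

\begin{verbatim}
(unsafe)
Library
  structure Safe
is
  LUnsafe.cm
  safe.sml
\end{verbatim}

Under these assumptions, a client of {\tt LUnsafe.cm} needs privilege
{\tt unsafe}, while a client of {\tt LSafe.cm}, once that library has
been stabilized, does not.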

Access control is a new feature. At the moment, only the basic
mechanisms are implemented, but there is no enforcement.  In other
words, everybody is assumed to have every possible privilege.  CM
merely reports which privileges ``would have been required''.

\section{The pervasive environment and primitive modules}

\subsection{The pervasive environment}

The {\em pervasive environment} can be thought of as a library that
all compilation units implicitly depend upon.  The pervasive
environment exports all non-modular bindings (types, values, infix
operators, overloaded symbols) that are mandated by the specification
for the Standard ML Basis Library~\cite{reppy99:basis}.  (All other
bindings of the Basis Library are exported by {\tt basis.cm} which is
a genuine CM library.)

The pervasive environment is the only place where CM conveys
non-modular bindings from one compilation unit to another.

\subsection{Primitive modules}

CM also knows about some ``primitive'' modules.  These modules give
access to certain compiler internals which are implemented in a way
that is outside the usual CM compilation model.  A user program can
access a primitive module by listing its name as one of the members in
its CM description files.  However, usage of any primitive module $M$
is protected by requiring the client to possess privilege $M$
(see Section~\ref{sec:access}), i.e., a privilege that goes by the
same name as the module itself.

Currently, the following primitive module names are known: {\tt
built-in}, {\tt print-hook}, {\tt use-hook}, {\tt exn-info-hook}, {\tt
core}, {\tt init-utils}.

\section{Files}

CM uses three kinds of files to store derived information during and
between sessions:

\begin{enumerate}
\item {\it Skeleton files} are used to store a highly abbreviated
version of each ML source file's abstract syntax tree---just barely
sufficient to drive CM's dependency analysis.  Skeleton files are much
smaller and easier to read than actual ML source code.  Therefore, the
existence of valid skeleton files makes CM a lot faster because
usually most parsing operations can be avoided that way.
\item {\it Binfiles} are the SML/NJ equivalent of object files.  They
contain executable code and a symbol table for the associated ML
source file.
\item {\it Library files} (sometimes called: {\em stablefiles})
contain the dependency graph, executable code, and symbol tables for
an entire CM library including all of its components (groups).
\end{enumerate}

Normally, all these files are stored in a subdirectory of directory
{\tt CM} which itself is a subdirectory of the directory where the
original ML source file or---in the case of library files---the
original CM description file is located.

Skeleton files are machine- and operating system-independent.
Therefore, they are always placed into the same directory {\tt
CM/SKEL}. Parsing (for the purpose of dependency analysis) will be
done only once even if the same file system is accessible from
machines of different type.

Binfiles and library files contain executable code and other
information that is potentially system- and architecture-dependent.
Therefore, they are stored under {\tt CM/}{\it arch}{\tt -}{\it os}
where {\it arch} is a string indicating the type of the current
CPU architecture and {\it os} a string denoting the current operating
system type.
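
For a hypothetical project directory containing one ML source file and
one description file, the resulting layout can be pictured roughly as
follows.  Only the directory names are taken from the description
above; {\tt foo.sml} and {\tt lib.cm} are made up, {\tt arch-os}
stands for the architecture and operating system strings, and the
names of the files inside the {\tt CM} subdirectories are deliberately
left out:

\begin{verbatim}
project/
  foo.sml
  lib.cm
  CM/
    SKEL/         skeleton files
    arch-os/      binfiles and library files
\end{verbatim}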

Library files are a bit of an exception in the sense that they do not
require any source files or any other derived files of the same
library to exist.  As a consequence, the location of such a library
file is best described as being relative to ``the location of the
original CM description file if that description file still existed''.
(Of course, nothing precludes the CM description file from actually
existing, but in the presence of a corresponding library file CM will
not take any notice.)

\subsection{Time stamps}

For skeleton files and binfiles, CM uses file system time stamps to
determine whether a file has become outdated.  The rule is that in
order to be considered ``up-to-date'' the time stamp on a skeleton file
or binfile has to be exactly the same as the one on the ML source
file.  This guarantees that all changes to a source will be
noticed\footnote{except for the pathological case where two different
versions of the same source file have exactly the same time stamp}.

CM also uses time stamps to decide whether tools such as ML-Yacc or
ML-Lex need to be run (see Section~\ref{sec:tools}).  However, the
difference is that a file is considered outdated if it is older than
its source.  Some care on the programmer's side is necessary since this
scheme does not allow CM to detect the situation where a source file
gets replaced by an older version of itself.
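
The two freshness rules can be summarized by the following sketch.  It
is not CM's actual code; the function names are hypothetical and only
{\tt Time.compare} and {\tt Time.<} from the Basis Library are assumed:

\begin{verbatim}
(* skeleton file or binfile: valid only if stamps are identical *)
fun derivedIsCurrent (src : Time.time, derived : Time.time) =
    Time.compare (src, derived) = EQUAL

(* tool output: outdated only if strictly older than its source *)
fun toolOutputIsOutdated (src : Time.time, output : Time.time) =
    Time.< (output, src)
\end{verbatim}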

\section{Tools}
\label{sec:tools}

CM's tool set is extensible: new tools can be added by writing a few
lines of ML code.  The necessary hooks for this are provided by a
structure {\tt Tools} which is exported by the {\tt cm-tools.cm}
library.

If the tool is implemented as a ``typical'' shell command, then all
that needs to be done is a single call to:

\begin{verbatim}
Tools.registerStdShellCmdTool
\end{verbatim}

For example, suppose you have made a
new, improved version of ML-Yacc (``New-ML-Yacc'') and want to
register it under a class called {\tt nmlyacc}.  Here is what you
write:

\begin{verbatim}
val command = Tools.newCmdGetterSetter ("NYACC", "new-ml-yacc")
val _ = Tools.registerStdShellCmdTool
  { tool = "New-ML-Yacc",
    class = "nmlyacc",
    suffixes = ["ngrm", "ny"],
    command = command,
    extensionStyle = Tools.EXTEND ["sig", "sml"],
    sml = true }
\end{verbatim}

This code can either be packaged as a CM library or entered at the
interactive top level after loading the {\tt cm-tools.cm} library
(e.g., via {\tt CM.autoload}).

The call to {\tt Tools.newCmdGetterSetter} makes a ``command
getter-setter'' which is a value of type {\tt \{ get: unit -> string,
set: string -> unit \} }. It can be invoked to query or set the
command string for the tool.  Here, the default string is {\tt
new-ml-yacc} and can be customized at startup time using the
environment variable {\tt CM\_NYACC}.
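
For instance, with {\tt command} bound as above, the command string
could be inspected and changed at the interactive top level like this
(the path is hypothetical):

\begin{verbatim}
val current = #get command ()
val _ = #set command "/usr/local/bin/new-ml-yacc"
\end{verbatim}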

{\tt Tools.registerStdShellCmdTool} creates the class and installs the
tool for it.  The arguments must be specified as follows:

\begin{description}
\item[tool] a descriptive name of the tool (used in error messages)
\item[class] the name of the class; the string must not contain
upper-case letters
\item[suffixes] a list of file name suffixes that let CM automatically
recognize files of the class
\item[command] the command getter-setter from above
\item[extensionStyle] a specification of how the names of files
generated by the tool relate to the name of the tool input file; \\
Currently, there are two possible cases:
\begin{enumerate}
\item ``{\tt Tools.EXTEND} $l$'' says that if the tool source file is
{\it file} then for each suffix {\it sfx} in $l$ there will be one tool
output file named {\it file}{\tt .}{\it sfx}.
\item ``{\tt Tools.REPLACE }$(l_1, l_2)$'' specifies that given the
base name {\it base} there will be one tool output file {\it base}{\tt
.}{\it sfx} for each suffix {\it sfx} in $l_2$.  Here, {\it base} is
determined by the following rule:  If the name of the tool input file
has a suffix that occurs in $l_1$ then {\it base} is the name without
that suffix.  Otherwise, the whole file name is taken as {\it base}
(just like in the case of {\tt Tools.EXTEND}).
\end{enumerate}
\item[sml] a boolean flag that indicates whether or not the tool
output is always to be considered ML source code; \\
If the flag is set to {\tt false}, then CM will take the names of the
output files and apply its usual classification mechanism---possibly
resulting in cascaded tool application.
\end{description}
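
As a sketch of the second extension style, the registration of the
hypothetical New-ML-Yacc tool from above could use {\tt Tools.REPLACE}
instead of {\tt Tools.EXTEND}:

\begin{verbatim}
val command = Tools.newCmdGetterSetter ("NYACC", "new-ml-yacc")
val _ = Tools.registerStdShellCmdTool
  { tool = "New-ML-Yacc",
    class = "nmlyacc",
    suffixes = ["ngrm", "ny"],
    command = command,
    extensionStyle = Tools.REPLACE (["ngrm", "ny"], ["sig", "sml"]),
    sml = true }
\end{verbatim}

According to the rule above, an input file {\tt foo.ngrm} would then
be expected to produce {\tt foo.sig} and {\tt foo.sml}, whereas with
{\tt Tools.EXTEND} the outputs would be named {\tt foo.ngrm.sig} and
{\tt foo.ngrm.sml}.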

Less common kinds of rules can also be defined using the generic
interface {\tt Tools.registerClass}.

\section{Some history}

Although its programming model is more general, CM's implementation is
closely tied to the Standard ML programming language~\cite{milner97}
and its SML/NJ implementation~\cite{appel91:sml}.

The current version is preceded by several other compilation managers,
the most recent going by the same name ``CM''~\cite{blume95:cm}, while
earlier ones were known as IRM ({\it Incremental Recompilation
Manager})~\cite{harper94:irm} and SC (for {\it Separate
Compilation})~\cite{harper-lee-pfenning-rollins-CM}.  CM owes many
ideas to SC and IRM.

Separate compilation in the SML/NJ system heavily relies on mechanisms
for converting static environments (i.e., the compiler's symbol
tables) into a linear byte stream suitable for storage on
disks~\cite{appel94:sepcomp}.  However, unlike all its predecessors,
the current implementation of CM is integrated into the main compiler
and no longer relies on the {\em Visible Compiler} interface.

\cleardoublepage

\tableofcontents

\pagebreak

\bibliography{blume,appel,ml}

\end{document}

