% Revision 408, Thu Sep 2 14:22:37 1999 UTC, by blume (file size: 25878 bytes)
% Log message: some more manual writing
\documentclass{article}
\usepackage{times}
\usepackage{epsfig}

\marginparwidth0pt\oddsidemargin0pt\evensidemargin0pt\marginparsep0pt
\topmargin0pt\advance\topmargin by-\headheight\advance\topmargin by-\headsep
\textwidth6.7in\textheight9.1in %\renewcommand{\baselinestretch}{1.2}
\columnsep0.25in

\author{Matthias Blume \\
Research Institute for Mathematical Sciences \\
Kyoto University}

\title{{\bf CM}\\
The SML/NJ Compilation and Library Manager \\
{\it\small (for SML/NJ version 110.20 and later)} \\
User Manual}

\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 3pt minus 2pt}

\begin{document}

\bibliographystyle{alpha}

\maketitle

\section{Introduction}

This manual describes a new implementation of CM, the ``Compilation
and Library Manager'' for Standard ML of New Jersey (SML/NJ).  Like its
previous version, CM is in charge of managing separate compilation and
facilitates access to stable libraries.

Programming projects that use CM are typically composed of separate
{\em libraries}.  Libraries themselves can be internally
sub-structured using CM's notion of {\em groups}.  Using libraries and
groups, programs can be viewed as a {\em hierarchy of modules}.  The
organization of large projects tends to benefit from this
approach~\cite{blume:appel:cm99}.

CM uses {\em cutoff} techniques~\cite{tichy94} to minimize
recompilation work and provides automatic dependency analysis to free
the programmer from having to specify a detailed module dependency
graph by hand~\cite{blume:depend99}.

The most important change with respect to the previous (``old'')
implementation of CM is a change of emphasis.  Until now the focus was
on compilation management while libraries were added as an
afterthought.  Beginning now, CM takes a very library-centric view of
the world.  In fact, the implementation of SML/NJ itself has been
restructured to meet this approach.

\section{The CM model}

When working with CM, the most important concept is the concept of a
{\em library}.  A library is a collection of ML source files and
references to other libraries together with an explicit export
interface.  The export interface lists all toplevel-defined symbols of
the library that shall be exported to its clients.  A library is
described by its {\em description file}.

\noindent Example:

\begin{verbatim}
Library
    signature FOO
    structure Foo
is
    foo.sig
    foo.sml
    helper.sml
    basis.cm
\end{verbatim}

This library exports two definitions, one for a structure named {\tt
Foo} and one for a signature named {\tt FOO}.  The specification for
such exports appears between the keywords {\tt Library} and {\tt is}.
The {\em members} of the library are specified after the keyword {\tt
is}.  Here we have three ML source files ({\tt foo.sig}, {\tt
foo.sml}, and {\tt helper.sml}) and a reference to one external
library ({\tt basis.cm}).  The entry {\tt basis.cm} typically denotes
the description file for the {\it Standard ML Basis
Library}~\cite{reppy99:basis}; most programs will want to list it in
their own description file(s).

\subsection{Library descriptions}

Members of a library do not have to be listed in any particular order
since CM will automatically calculate the dependency graph.  Four
minor restrictions on the source language are necessary to make this
work:
\begin{enumerate}
\item All top-level definitions must be {\em module} definitions
(structures, signatures, functors, or functor signatures).  In other
words, there can be no top-level type-, value-, or infix-definitions.
\item For a given symbol, there can be at most one ML source file per
library (or---more correctly---one file per library component; see
Section~\ref{sec:groups}) that defines the symbol at top level.
\item For a given symbol, there can be at most one sub-library or one
sub-group that exports that symbol.
\item The use of ML's {\bf open} construct is not permitted at top
level.
\end{enumerate}

Note that these rules do not require the exports of sub-groups or
sub-libraries to be distinct from the exports of ML source files.
Here, the disambiguating rule is that the definition from the ML
source overrides the definition imported from the group or library.
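
To illustrate the first restriction, the following sketch shows how
type and value definitions that one might otherwise write at top level
can be wrapped in a structure.  The file name {\tt helper.sml} and all
names inside it are hypothetical and serve only as an example:

\begin{verbatim}
(* helper.sml: a hypothetical member file.
 * Types, values, and functions live inside a module,
 * so the file contains only a top-level structure definition. *)
structure Helper = struct
    type counter = int ref
    val initial = 0
    fun new () : counter = ref initial
    fun bump (c : counter) = (c := !c + 1; !c)
end
\end{verbatim}

Other members of the same library would then refer to these
definitions as {\tt Helper.new}, {\tt Helper.bump}, and so on.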

The full syntax for library description files also includes provisions
for a simple ``conditional compilation'' facility (see
Section~\ref{sec:preproc}), for access control (see
Section~\ref{sec:access}), and accepts ML-style nestable comments
delimited by \verb|(*| and \verb|*)|.

\subsection{Name visibility}

In general, all definitions exported from members of a library are
visible in all ML source files of that library.  The source code in
those source files can refer to them directly.  Here, ``exported''
means either a top-level definition within an ML source file or a
definition listed in a (sub-)library's export list.

If a library is structured into library components using {\em groups}
(see Section~\ref{sec:groups}), then each component (group) is treated
like a separate library.

Cyclic dependencies among libraries, library components, or ML source
files within a library are detected and flagged as errors.

\subsection{Groups}
\label{sec:groups}

CM's group model eliminates a whole class of potential naming problems
by providing control over name spaces for program linkage.  This has
been described separately~\cite{blume:appel:cm99} but it sometimes
involves the use of ``administrative'' libraries whose sole purpose is
to rename certain definitions.

However, under CM, ``library'' does not only refer to a concept but
often also to an actual file system object.  It would be inconvenient
if name resolution problems resulted in a proliferation of
additional library files.  Therefore, CM also provides the notion of
groups (or: library components).  Name resolution for groups works
like name resolution for entire libraries, but grouping is entirely
internal to each library.

During development, each group has its own description file which will
be referred to by the surrounding library or other components thereof.
The syntax of group description files is the same as that of library
description files with the following exceptions:

\begin{itemize}
\item The initial keyword {\tt Library} is replaced with {\tt Group}
followed by the name of the surrounding library's description file in
parentheses.
\item The export list can be left empty, in which case CM will
provide a default export list: all exports from ML source files plus
all exports from sub-components of the component.  (Note that this does
not include the exports of other libraries.)
\item There are some small restrictions on access control
specifications (see Section~\ref{sec:access}).
\end{itemize}

As an example, let us assume that {\tt foo-utils.cm} contains the
following text:

\begin{verbatim}
Group (foo-lib.cm)
is
    set-util.sml
    map-util.sml
    basis.cm
\end{verbatim}

Here, the library description file {\tt foo-lib.cm} would list {\tt
foo-utils.cm} as one of its members:

\begin{verbatim}
Library
    signature FOO
    structure Foo
is
    foo.sig
    foo.sml
    foo-utils.cm
    basis.cm
\end{verbatim}
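
If the group should export only some of the definitions of its
members, an explicit export list can be given in the same way as for a
library.  The following variant of {\tt foo-utils.cm} is a sketch; the
structure names {\tt SetUtil} and {\tt MapUtil} are hypothetical and
assumed to be defined by the two source files:

\begin{verbatim}
Group (foo-lib.cm)
    structure SetUtil
    structure MapUtil
is
    set-util.sml
    map-util.sml
    basis.cm
\end{verbatim}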

\subsection{Multiple occurrences of the same member}

The following rules apply to multiple occurrences of the same ML source
file, the same library, or the same group within a program:

\begin{itemize}
\item Within the same description file, each member can be specified
at most once.
\item Libraries can be referred to freely from as many other groups or
libraries as the programmer desires.
\item Each group cannot be used from outside the (uniquely defined)
library that it is a component of.  However, within that library it
can be referred to from arbitrarily many other groups.
\item The same ML source file cannot appear more than once.  If an ML
source file is to be referred to by multiple clients it must first be
``wrapped'' into a library or (if sufficient) a group.
\end{itemize}
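
As an illustration of the last rule, suppose that a source file {\tt
common.sml} is needed by two components of {\tt foo-lib.cm}.  Instead
of listing {\tt common.sml} twice, one could wrap it into a group of
its own, here called {\tt common.cm} (both file names are assumptions
made for this example):

\begin{verbatim}
Group (foo-lib.cm)
is
    common.sml
    basis.cm
\end{verbatim}

Each client component would then list {\tt common.cm} as a member
instead of {\tt common.sml}.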

\subsection{Top-level groups}

Mainly to facilitate some superficial backward-compatibility, CM also
allows groups to appear at top level, i.e., outside of any library.
Such groups must omit the parenthetical library specification and then
cannot also be used within libraries.  One could think of the top level
itself as a ``virtual unnamed library'' whose components are the
top-level groups.
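
A top-level group might therefore look as follows.  This is only a
sketch: {\tt main.sml} is a hypothetical source file, and {\tt
foo-lib.cm} stands for the library from the earlier example.

\begin{verbatim}
Group
is
    main.sml
    foo-lib.cm
    basis.cm
\end{verbatim}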

\section{Naming objects in the file system}

\subsection{Motivation}

File naming has been an area notorious for its problems and the cause
of most of the gripes from CM's users.  With this in mind, CM now takes a
different approach to file name resolution.

The main difficulty lies in the fact that files or even whole
directories may move after CM has already partially (but not fully)
processed them.  For example, this happens when the {\em auto loader}
(see Section~\ref{sec:autoload}) has been used before saving an ML
session via {\tt SMLofNJ.exportML}.  Under a correct installation, CM
will now be able to resume such a session even when operating in a
different environment, perhaps on a different machine with different
file system mounts, or a different location of the SML/NJ
installation.

For this, CM provides a configurable mechanism for locating file
system objects.  Moreover, it invokes this mechanism as late as
possible and is prepared to re-invoke it if the configuration changes.

\subsection{Basic rules}

CM uses its own ``standard'' syntax for pathnames which happens to be
the same as the one used by most Unix-like systems: path name
components are separated by ``{\bf /}'', paths beginning with ``{\bf
/}'' are considered {\em absolute} while other paths are {\em
relative}.

Since this standard syntax does not cover system-specific aspects such
as volume names, it is also possible to revert to ``native'' syntax by
enclosing the name in double-quotes.  Of course, description files
that use path names in native syntax are not portable across operating
systems.

Absolute pathnames are resolved in the usual operating-system-specific
manner.  However, it is advisable to avoid absolute pathnames because
they are certain to ``break'' if the corresponding file moves to a
different location.

The resolution of relative pathnames is more complicated.

\begin{itemize}
\item If the first component of a relative pathname is a
``configuration anchor'' (see Section~\ref{sec:anchors}), then we call
the path {\em anchored}.  In this case  the
whole name will be resolved relative to the value associated with that
anchor.  For example, if the path is {\tt foo/bar/baz} and {\tt
foo} is known as an anchor mapped to {\tt /usr/local}, then the
full name of the actual file system object referred to is {\tt
/usr/local/foo/bar/baz}. Note that the {\tt foo} component is not
stripped away during the resolution process; different anchors that
map to the same directory still remain different.
\item Otherwise, if the relative name appears in some description file
whose name is {\it path}{\tt /}{\it file}{\tt .cm}, then it will be
resolved relative to {\it path}, i.e., relative to the directory that
contains the description file.
\item If a non-anchored relative path is entered interactively, for
example as an argument to one of CM's interface functions, then it
will be resolved in the OS-specific manner, i.e., relative to the
current working directory.  However, CM will remember what that
directory is at the time the name was first seen.  Should the working
directory change during an ongoing CM session, then CM will switch its
mode of operation for that name and prepend the name of the original
working directory.  In effect, the name will continue to refer to the
same file system object regardless of what the current working
directory is.
\end{itemize}
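
The first two rules can be summarized by the following small ML
function.  It is a simplified model written for this explanation, not
CM's actual implementation: the function {\tt resolve} and its
argument names are hypothetical, native (quoted) path names are not
handled, and the special treatment of interactively entered names is
left out.

\begin{verbatim}
(* anchors : list of (anchor name, directory) pairs
 * descDir : directory containing the description file
 * path    : member name in CM's standard syntax *)
fun resolve (anchors : (string * string) list)
            (descDir : string) (path : string) =
    if String.isPrefix "/" path then path      (* absolute *)
    else
        let val first = hd (String.fields (fn c => c = #"/") path)
        in
            case List.find (fn (a, _) => a = first) anchors of
                (* anchored: the anchor name itself is kept *)
                SOME (_, dir) => dir ^ "/" ^ path
                (* otherwise relative to the description file *)
              | NONE => descDir ^ "/" ^ path
        end

val p1 = resolve [("foo", "/usr/local")] "/home/me/proj" "foo/bar/baz"
val p2 = resolve [("foo", "/usr/local")] "/home/me/proj" "util/set.sml"
\end{verbatim}

Here {\tt p1} is {\tt /usr/local/foo/bar/baz}, as in the example
above, while {\tt p2} is {\tt /home/me/proj/util/set.sml}.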

\subsection{Anchor configuration}
\label{sec:anchors}

The configuration of path name anchors to their corresponding
directory names is a simple one-way mapping.  At startup time, this
mapping is initialized by reading two configuration files: an
installation-specific one and a user-specific one.  After that, the
mapping can be maintained using CM's interface functions {\tt
CM.setAnchor}, {\tt CM.cancelAnchor}, and {\tt CM.resetPathConfig}
(see Section~\ref{sec:api}).

The default location of the installation-specific configuration file
is {\tt /usr/lib/smlnj-pathconfig}.  However, normally this default
gets replaced (via an environment variable named {\tt
CM\_PATHCONFIG\_DEFAULT}) at installation time by a path pointing to
wherever the installation actually puts the configuration file.
The user can specify a new location at startup time using the
environment variable {\tt CM\_PATHCONFIG}.

The default location of the user-specific configuration file is {\tt
.smlnj-pathconfig} in the user's home directory (which must be given
by the {\tt HOME} environment variable).  At startup time, this
default can be overridden by a fixed location which must be given as
the value of the environment variable {\tt CM\_LOCAL\_PATHCONFIG}.

The syntax of all configuration files is identical.  Lines are
processed from top to bottom. White space divides lines into tokens.
\begin{itemize}
\item A line with exactly two tokens associates an anchor (the first
token) with a directory in native syntax (the second token).  Neither
anchor nor directory name may contain white space and the anchor
should not contain a {\bf /}.  If the directory name is a relative
name, then it will be expanded by prepending the name of the directory
that contains the configuration file.
\item A line containing exactly one token that is the name of an
anchor cancels any existing association of that anchor with a
directory.
\item A line with a single token that consists of a single minus sign
{\bf -} cancels all existing anchors.  This typically makes sense only
at the beginning of the user-specific configuration file and
eradicates any settings that were made by the installation-specific
configuration file.
\item Lines with no token (i.e., empty lines) will be silently ignored.
\item Any other line is considered malformed and will cause a warning
but will otherwise be ignored.
\end{itemize}
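
As a sketch, a user-specific configuration file following these rules
might read as shown below.  All anchor and directory names are
hypothetical.  Since the format described above makes no provision for
comments, the explanation is given here rather than in the file: the
first line cancels all anchors set by the installation-specific file,
the second line maps the anchor {\tt basis.cm} to an absolute
directory, the third line maps {\tt mylib} to a directory {\tt libs}
relative to the directory containing the configuration file, and the
last two lines first define and then cancel the anchor {\tt scratch}.

\begin{verbatim}
-
basis.cm /usr/local/sml/lib
mylib libs
scratch /tmp/scratch
scratch
\end{verbatim}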

\section{Using CM}

\subsection{Structure CM}
\label{sec:api}

Functions that control CM's operation are accessible as members of a
structure named {\tt CM}.  Here is a description of the members of
this structure:

\subsubsection*{Compiling}

The two main activities when using CM are compiling ML source code and
building stable libraries:

\begin{verbatim}
  val recomp : string -> bool
  val stabilize : bool -> string -> bool
\end{verbatim}

{\tt CM.recomp} takes the name of a program's ``root'' description
file and compiles or recompiles all ML source files that are necessary
to provide definitions for the root library's export list.

{\tt CM.stabilize} takes a boolean flag and then the name of a library
and {\em stabilizes} this library.  A library is stabilized by writing
all information pertaining to it (including all of its library
components) into a single file.  Later, when the library is used in
other programs, all members of the library are guaranteed to be
up-to-date; no dependency analysis work and no recompilation work will
be necessary.  If the boolean flag is {\tt false}, then all
sub-libraries of the library must already be stable.  If the flag is
{\tt true}, then CM will recursively stabilize all libraries reachable
from the given root.

After a library has been stabilized it can be used even if none of its
original sources---including the description file---are present.
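
A typical interactive sequence might look like the following sketch.
The file name {\tt foo-lib.cm} is taken from the earlier example; the
code assumes that the boolean results indicate success, which is not
stated explicitly above.

\begin{verbatim}
(* compile everything needed for the exports of foo-lib.cm *)
val compiled = CM.recomp "foo-lib.cm";

(* if that worked, stabilize foo-lib.cm together with all
 * libraries reachable from it (flag = true) *)
val stable = compiled andalso CM.stabilize true "foo-lib.cm";
\end{verbatim}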

\subsubsection*{Linking}

In SML/NJ, linking means executing top-level code of each compilation
unit.  The resulting bindings can then be bound at the interactive top
level.

\begin{verbatim}
  val make : string -> bool
  val autoload : string -> bool
\end{verbatim}

{\tt CM.make} first acts like {\tt CM.recomp}.  If the (re-)compilation
is successful, then it proceeds by linking all modules.  Provided
there are no link-time errors, it finally introduces new bindings at
top level.

During the course of the same {\tt CM.make}, the code of each
compilation module will be executed at most once.  Code in units that
are marked as {\it private} (see Section~\ref{sec:sharing}) will be
executed exactly once.  Code in other units will be executed only if
the unit has been recompiled since it was executed last time or if it
depends on another compilation unit whose code has been executed
since.

In effect, different invocations of {\tt CM.make} (and {\tt
CM.autoload}) will share dynamic state created at link time as much as
possible unless the compilation units in question have been explicitly
marked private.

{\tt CM.autoload} acts like {\tt CM.make}, only ``lazily''. See
Section~\ref{sec:autoload} for more information.
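
For example, with the library description {\tt foo-lib.cm} shown
earlier, a call to {\tt CM.make} could be used as sketched below.  As
before, it is an assumption of this example that the boolean result
indicates success.

\begin{verbatim}
if CM.make "foo-lib.cm"
then print "Foo and FOO are now bound at top level\n"
else print "compilation or linking failed\n";
\end{verbatim}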

\subsubsection*{Flags}

Several flags control the operation of CM.  Any invocation of the
corresponding function reads the current value of the flag.  An
invocation with {\tt NONE} just reads it, an invocation with {\tt
SOME} $v$ reads it and then replaces it with a new value $v$.

\begin{verbatim}
  val verbose : bool option -> bool
  val debug : bool option -> bool
  val keep_going : bool option -> bool
  val parse_caching : int option -> int
  val warn_obsolete : bool option -> bool
\end{verbatim}

{\tt CM.verbose} can be used to turn off CM's progress messages.  The
default is {\em true} and can be overridden at startup time by the
environment variable {\tt CM\_VERBOSE}.

In the case of a compile-time error {\tt CM.keep\_going} instructs the
{\tt CM.recomp} phase to continue working on parts of the dependency
graph that are not related to the error.  (This does not work for
syntax errors because a correct parse is needed before CM can
construct its dependency graph.)  The default is {\em false} and can
be overridden at startup by the environment variable {\tt CM\_KEEP\_GOING}.

{\tt CM.parse\_caching} sets a limit on how many parse trees are
cached in main memory.  In certain cases CM must parse source files in
order to be able to calculate the dependency graph.  Later, the same
files may need to be compiled, in which case an existing parse tree
saves the time to parse the file again.  Keeping parse trees can be
expensive in memory usage.  Moreover, CM makes special efforts to
avoid parsing files unless they have actually been modified.
Therefore, it may not make much sense to set this value very high.
The default is {\em 100} and can be overridden at startup time by the
environment variable {\tt CM\_PARSE\_CACHING}.

This version of CM uses an ML-inspired syntax for expressions in its
conditional compilation subsystem.  However, for the time being it
will accept old C-inspired expressions but produce a warning for each
occurrence. {\tt CM.warn\_obsolete} can be used to turn these warnings
off. The default is {\em true} and can be overridden at startup time by
the environment variable {\tt CM\_WARN\_OBSOLETE}.

{\tt CM.debug} can be used to turn on debug mode.  This currently has
no effect since there is no separate debug mode. The default is {\em
false} and can be overridden at startup time by the environment
variable {\tt CM\_DEBUG}.
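
The following sketch shows the two ways of invoking a flag function.
It assumes, as the description above suggests, that a call with {\tt
SOME} returns the value that was read before the replacement, so that
the previous setting can be restored later.  The name {\tt foo-lib.cm}
is again only an example.

\begin{verbatim}
(* read a flag without changing it *)
val limit = CM.parse_caching NONE;

(* silence progress messages for one build,
 * remembering the previous setting *)
val old = CM.verbose (SOME false);
val _ = CM.make "foo-lib.cm";
val _ = CM.verbose (SOME old);      (* restore *)
\end{verbatim}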

\subsubsection*{Path anchors}

Structure {\tt CM} also provides functions to explicitly manipulate
the path anchor configuration.

\begin{verbatim}
  val setAnchor : string * string -> unit
  val cancelAnchor : string -> unit
  val resetPathConfig : unit -> unit
\end{verbatim}

{\tt CM.setAnchor} creates a new association or replaces an existing
association of an anchor name with a directory name.  Both names must
be given as strings---the directory name in native syntax.  If the
directory name is a relative path name, then it will be expanded by
prepending the name of the current working directory.

{\tt CM.cancelAnchor} deletes the association of the given anchor name
with its directory should such an association currently exist.
Otherwise it will do nothing.

{\tt CM.resetPathConfig} erases the entire existing path configuration
mapping.
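
A short session using these functions might look as follows.  The
anchor name {\tt mylib} and the directory are hypothetical.

\begin{verbatim}
(* associate the anchor "mylib" with a directory *)
val _ = CM.setAnchor ("mylib", "/home/me/sml");
(* a member written as mylib/util.cm would now denote
 * /home/me/sml/mylib/util.cm *)

(* remove this one association again *)
val _ = CM.cancelAnchor "mylib";

(* or start over with an empty mapping *)
val _ = CM.resetPathConfig ();
\end{verbatim}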

\subsubsection*{Status inspection}

CM keeps a lot of internal state.  Some of this state can be inspected.

\begin{verbatim}
  val showPending : unit -> unit
  val listLibs : unit -> unit
\end{verbatim}

{\tt CM.showPending} lists to standard output the names of all symbols
which are currently registered as being bound at top level via the
autoloading mechanism and which so far have not actually been
resolved.

{\tt CM.listLibs} lists to standard output the path names of library
description files for those stable libraries that are currently known
to CM.  This list includes those libraries which have been accessed
``implicitly'' by virtue of being a sub-library of another library
that has been accessed in the past.  Library state can take up
considerable space in main memory.  Use {\tt CM.dismissLib} (see
below) to remove a library from CM's registry.

\subsubsection*{Altering CM's internal state}

Sometimes it can become necessary to explicitly change or update CM's
internal state.

\begin{verbatim}
  val dismissLib : string -> unit
  val synchronize : unit -> unit
  val reset : unit -> unit
\end{verbatim}

{\tt CM.dismissLib} is used to remove a stable library from CM's
internal registry.  See the discussion of {\tt CM.listLibs} above.
Although removing a library from the registry may recover considerable
amounts of main memory, doing so also eliminates any chance of sharing
the associated data structures with later references to the same
library.  Therefore, it is not always in the interest of
memory-conscious users to use this feature.

Sharing of dynamic state created by the library is {\em not} affected
by this.
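
The following sketch combines {\tt CM.listLibs} and {\tt
CM.dismissLib}.  It assumes that the string argument of {\tt
CM.dismissLib} is the path name of the library's description file, as
printed by {\tt CM.listLibs}; the name {\tt foo-lib.cm} is
hypothetical.

\begin{verbatim}
(* print the stable libraries currently known to CM *)
val _ = CM.listLibs ();

(* drop one of them from the registry to free memory *)
val _ = CM.dismissLib "foo-lib.cm";
\end{verbatim}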

{\tt CM.synchronize} updates tables internal to CM to reflect changes
in the file system.  In particular, this will be necessary when the
association of file names to ``file IDs'' (in Unix: inode numbers)
changes during an ongoing session.  In practice, the need for this
tends to be rare.

{\tt CM.reset} completely erases all internal state in CM.  This is
not very advisable since it will also break the association with
pre-loaded libraries.  It may be a useful tool for determining the
amount of space taken up by the internal state, though.

\subsection{The auto loader}
\label{sec:autoload}

From the user's point of view, a call to {\tt CM.autoload} acts very
much like the corresponding call to {\tt CM.make} because the same
bindings that {\tt CM.make} would introduce into the top-level
environment are also introduced by {\tt CM.autoload}.  However, most
work will be deferred until some code entered later at the interactive
top level refers to one or more of these bindings.  Only then will CM
go and perform just the minimal work necessary to provide the actual
definitions.

In this version of CM the autoloader plays a central role.  Unlike
before, it cannot be turned off since it provides many of the standard
pre-defined top-level bindings in the interactive system.

In essence, the autoloader is a convenient tool for virtually
``loading'' an entire library without incurring an undue increase in
memory consumption for library modules that are not actually being
used.
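
The effect can be observed with {\tt CM.showPending}, as in the
following sketch.  It assumes the library {\tt foo-lib.cm} from the
earlier example, which exports a structure {\tt Foo}.

\begin{verbatim}
(* register the exports of foo-lib.cm for autoloading *)
val _ = CM.autoload "foo-lib.cm";

(* list the symbols that are registered but not yet resolved *)
val _ = CM.showPending ();

(* the first reference to Foo makes CM do the deferred work *)
structure F = Foo;
\end{verbatim}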

\subsection{Sharing of state}
\label{sec:sharing}

By default, CM tries to let multiple invocations of {\tt CM.make} or
{\tt CM.autoload} share dynamic state created by link-time effects.
Of course, this is not possible if the compilation unit in question
has recently been recompiled or depends on another compilation unit
whose code has recently been re-executed.  The programmer can
explicitly mark certain ML files as {\em shared}, in which case CM
will issue a warning whenever the unit's code gets re-executed.

State created by compilation units marked as {\em private} is never
shared across multiple calls to {\tt CM.make} or {\tt CM.autoload}.
However, each such call incurs an associated {\em traversal} of the
dependency graph, and during such a traversal each compilation unit
will be executed at most once.  In other words, the same program will
not see multiple instantiations of the same compilation unit.

As long as only {\tt CM.make} is involved, this is not difficult to
describe since each traversal will have completed when the call to
{\tt CM.make} returns.  However, that is not true in the case of {\tt
CM.autoload}.  {\tt CM.autoload} also initiates a traversal, but that
traversal remains ``suspended'' and will be performed incrementally as
necessary---driven by code compiled at the interactive top level.  And
yet, it is still the case that each compilation unit will be linked at
most once during this traversal and private state will not be confused
with private state of other traversals that might be active at the same
time.

% Need a good example here.

\subsubsection*{Sharing annotations}



\section{Conditional compilation}
\label{sec:preproc}

\section{Access control}
\label{sec:access}

\section{Some history}

Although its programming model is more general, CM's implementation is
closely tied to the Standard ML programming language~\cite{milner97}
and its SML/NJ implementation~\cite{appel91:sml}.

The current version is preceded by several other compilation managers,
the most recent going by the same name ``CM''~\cite{blume95:cm}, while
earlier ones were known as IRM ({\it Incremental Recompilation
Manager})~\cite{harper94:irm} and SC (for {\it Separate
Compilation})~\cite{harper-lee-pfenning-rollins-CM}.  CM owes many
ideas to SC and IRM.

Separate compilation in the SML/NJ system heavily relies on mechanisms
for converting static environments (i.e., the compiler's symbol
tables) into linear byte streams suitable for storage on
disks~\cite{appel94:sepcomp}.  However, unlike all its predecessors,
the current implementation of CM is integrated into the main compiler
and no longer relies on the {\em Visible Compiler} interface.

\cleardoublepage

\tableofcontents

\pagebreak

\bibliography{blume,appel,ml}

\end{document}

