[smlnj] View of /sml/trunk/src/system/README
Revision 430
Wed Sep 8 09:47:00 1999 UTC by monnier
File size: 23211 byte(s)
This commit was generated by cvs2svn to compensate for changes in r429,
which included commits to RCS files with non-trunk default branches.
As an SML/NJ compiler developer, please read this document carefully.
The new CM has a lot of good things to offer, but you must be aware
of the many changes that it incurs to the process of compiling the
SML/NJ compiler.
			Matthias Blume (July 1999)

* Libraries

The new way of building the compiler is heavily library-oriented.
Aside from a tiny portion of code that is responsible for defining the
pervasive environment, _everything_ lives in libraries.  Building the
compiler means compiling and stabilizing these libraries first.  Some
of the libraries exist just to organize the code; the others are
potentially useful in their own right.  Therefore, as a beneficial
side-effect of compiling the compiler, you will end up with stable
versions of these libraries.

At the moment, the following libraries are constructed when compiling
the compiler ("*" means that I consider the library potentially useful
in its own right):

	* basis.cm	- The SML'97 basis library
	- cm-hook.cm	- an internal library for organizational purposes
	- cm-lib.cm	- the library that implements CM's functionality
	* comp-lib.cm	- a helper library for the compiler, MLRISC, and CM
	* host-cm.cm	- the library that exports the public interface to
			  the compilation manager (i.e., structure CM)
	* host-cmb.cm	- the library that exports the public interface to
			  the bootstrap compiler (i.e., structure CMB)
	- host-compiler-0.cm
			- an internal library for organizational purposes
	* host-compiler.cm
			- the library that exports the public interface to
			  the visible compiler (i.e., structure Compiler)
	- intsys.cm	- an internal library for organizational purposes
			  (In fact, it's the "root" of the main hierarchy.)
	* ml-yacc-lib.cm - needs no further comment
	* smlnj-lib.cm	- needs no further comment
	* target-compilers.cm
			- library exporting target-specific versions of
			  structure Compiler and of structure CMB
			  (The existence of this library is the moral
			   equivalent of "CMB.retarget" in the old CM.)
	* viscomp-lib.cm - library that implements the compiler
			  (At the moment, its interface is rather thin.  We
			   should think about how to structure the interface
			   in such a way that it becomes a useful equivalent
			   to the old "full" compiler.)

* Before you can use the bootstrap compiler (CMB)...

To be able to use CMB at all, you must first say

	CM.autoload "host-cmb.cm";

after you start sml.

* Compiling the compiler -- a two-step procedure

Until now (with the old CM), once we managed to run CMB.make() to
completion we had a directory full of binfiles that were ready to be
used by the boot procedure.  This is no longer the case.

The boot procedure now wants to use stable libraries (except for the
part that makes up the pervasive environment).  Having stable
libraries around during development of these very libraries would be a
bit annoying because if CM sees a stable library it will no longer
bother to check the corresponding source files -- even if they have
changed.  Therefore, libraries are not stabilized until you think you
are ready for that.  Thus, you should run:

	CMB.make ();

until you no longer get compile errors.  CMB.make will return true in
this case.  Then you say:

	CMB.deliver ();

This command creates a second directory parallel to the "bin"
directory -- the "boot" directory.  It will hold everything necessary
to bootstrap a new heap image.  You will probably find that
CMB.deliver() compiles a number of additional files even though
CMB.make completed successfully.  This is because CMB.make compiles
just those modules that will actually go into the heap image, but
CMB.deliver must also build the remaining files -- files that are part
of libraries to be stabilized but which are not used by the compiler.

After you have made the boot directory, if you want to continue
developing the compiler (i.e., make changes to some sources,
recompile, etc.), you must first get rid of that boot directory.
Running the "makeml" script (see below) will automatically remove the
boot directory.

The names of "bin" and "boot" directories are

	<prefix>.bin.<arch>-<os>
and
	<prefix>.boot.<arch>-<os>

respectively, with "comp" being the default for <prefix>.  To change
the prefix, use CMB.make' and CMB.deliver' with the new prefix
provided as the optional string argument to these functions.
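
As a small illustration (a sketch, not part of the distribution), the
following shell helper composes a boot directory name and removes that
directory before further CMB.make runs.  The function names
boot_dir_name and remove_boot_dir are hypothetical, the
".boot.<arch>-<os>" suffix is assumed from the default name
"comp.boot.<arch>-<os>" that makeml refers to, and "x86" and "unix" in
the usage comment are only assumed example values for <arch> and <os>.

```sh
# boot_dir_name PREFIX ARCH OS
# Compose the boot directory name.  An empty PREFIX means the default "comp".
# Assumption: the name follows the pattern <prefix>.boot.<arch>-<os>.
boot_dir_name() {
    prefix=${1:-comp}
    printf '%s.boot.%s-%s\n' "$prefix" "$2" "$3"
}

# remove_boot_dir PREFIX ARCH OS
# Remove a leftover boot directory so that development can continue.
remove_boot_dir() {
    dir=$(boot_dir_name "$1" "$2" "$3")
    if [ -d "$dir" ]; then
        rm -r "$dir"
        echo "removed $dir"
    else
        echo "no boot directory $dir"
    fi
}

# Example (hypothetical architecture and OS names):
#   remove_boot_dir comp x86 unix
```

Running makeml without "-rebuild" has the same effect on the boot
directory, so a helper like this is only of interest when makeml is
not run in between.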

* Making the heap image

The heap image is made by running the "makeml" script that you find
here in this directory.  By default it will try to refer to the
comp.boot.<arch>-<os> directory.  You can change this using the -boot
argument (which takes the full name of the boot directory to be used).

The "feel" of using makeml should be mostly as it used to be.  However,
internally, there are some changes that you should be aware of:

1. The script will make a heap image and also move its associated
   libraries into a separate directory.

2. There is no "-full" option anymore.  This functionality should
   eventually be provided by a library with a sufficiently rich export
   interface.

3. No image will be generated if you use the -rebuild option.
   Instead, the script quits after making new bin and new boot
   directories.  You must re-invoke makeml with a suitable "-boot"
   option to actually make the image.  The argument to "-rebuild"
   is the <prefix> for the new bin and boot directories (see above).

4. Unless you use "-rebuild", makeml will delete the boot directory
   (thus readying you for further "CMB.make();" runs).
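
Items 3 and 4 together mean that a rebuild takes two makeml
invocations.  A hedged sketch of that sequence follows.  The wrapper
name rebuild_and_boot and the MAKEML override variable are made up for
illustration; only the "-rebuild" and "-boot" options come from the
description above, and the boot directory name is assumed to follow
the <prefix>.boot.<arch>-<os> pattern.

```sh
# rebuild_and_boot PREFIX ARCH_OS
# Pass 1: make new bin and boot directories under PREFIX (no image yet).
# Pass 2: make the heap image from the freshly built boot directory.
# MAKEML is a hypothetical override so that the sequence can be dry-run.
rebuild_and_boot() {
    prefix=$1
    arch_os=$2
    makeml=${MAKEML:-./makeml}
    "$makeml" -rebuild "$prefix" || return 1
    "$makeml" -boot "$prefix.boot.$arch_os"
}

# Dry run that only prints the arguments each pass would receive:
#   MAKEML=echo rebuild_and_boot new x86-unix
```

Since the second pass does not use "-rebuild", it also deletes the
boot directory it was given once it is done (item 4).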

* Testing a newly generated heap image

If you use a new heap image by saying "sml @SMLload=..." then things
will not go as you may expect because along with the new heap image
should go those new stable libraries, but unless you do something
about it, the new CM will look for its stable libraries in places
where you stored your _old_ stable libraries.

After you have made the new heap image, the new libraries are in a
separate directory whose name is derived from the name of the heap
image.  The "testml" script that you also find here will run the heap
image and instruct it to look for its libraries in that new library
directory.  "testml" takes the name of the heap image as its single
argument.  It expects the library directory to be the one that makeml
builds.

* Installing a heap image for more permanent use

Since you have been using the new CM already, it can be assumed that
you have already set up a correct pathname configuration.  (For more
information on pathname configurations, see below.)  With a correct
pathname configuration in place, you can "install" a newly generated
heap image by replacing the old image with the new one _AND AT THE
SAME TIME_ replacing the old stable libraries with the new ones.

* Cross-compiling

All cross-compilers live in the "target-compilers.cm" library.  You
must first say

	CM.autoload "target-compilers.cm";

before you can access them.  (This step corresponds to the old
CMB.retarget call.)  After that, _all_ cross-compilers are available
at the same time.  However, the ones that you are not using don't take
up any undue space because they only get loaded once you actually
mention them at the top-level.  The names of the structures currently
exported by target-compilers.cm are:

	structure Alpha32UnixCMB
	structure HppaUnixCMB
	structure PPCMacOSCMB
	structure PPCUnixCMB
	structure SparcUnixCMB
	structure X86UnixCMB
	structure X86Win32CMB

	structure Alpha32Compiler
	structure HppaCompiler
	structure PPCCompiler
	structure SparcCompiler
	structure X86Compiler

(PPCMacOSCMB is not very useful at the moment because there is no
implementation of the basis library for the MacOS.)

* Path configuration

+ Basics:

One of the new features of CM is its handling of path names.  In the
old CM, one particular point of trouble was the autoloader.  It
analyzes a group or library and remembers the locations of associated
files.  Later, when the necessity arises, those files will be read.
Therefore, one was asking for trouble if the current working directory
was changed between analysis- and load-time, or, worse, if files
actually moved about (as is often the case if build- and
installation-directories are different, or, to put it more generally,
if CM's state is frozen into a heap image and used in a different
environment).

Maybe it would have been possible to work around most of these
problems by fixing the path-lookup mechanism in the old CM and using
it extensively.  But path-lookup (as in the Unix-shell's "PATH") is
inherently dangerous because one can never be too sure what it will be
that is found on the path.  A new file in one of the directories early
in the path can upset the program that hopes to find something under
the same name later on the path.  Even when ignoring security-issues
like trojan horses and such, this definitely opens the door for
various unpleasant surprises.  (Who has not ever named a test version
of a program "test" and found that it acts strangely only to discover
later that /bin/test was run instead?)

Thus, the new scheme used by CM is a fixed mapping of what I call
"configuration anchors" to corresponding directories.  The mapping can
be changed, but one must do so explicitly.  In effect, it does not
depend on the contents of the file system.  Here is how it works:

If I specify a relative pathname in one of CM's description files
where the first component (the first arc) of that pathname is known to
CM as a configuration anchor, then the corresponding directory
(according to CM's mapping) is prepended to the path.  Suppose the
path name is "a/foo.sml" and "a" is a known anchor that maps to
"/usr/lib/smlnj", then the resulting complete pathname is
"/usr/lib/smlnj/a/foo.sml".  The pathname can be a single arc (but
does not have to be).  For example, the anchor "basis.cm" is typically
mapped to the directory where the basis library is stored.

Now, the important point is that one can change the mapping of the
anchor, and the path name will also change accordingly -- even very
late in the game.  CM avoids "elaborating" path names until it really
needs them when it is time to open files.  CM is also willing to
re-elaborate the same names if there is reason to do so. Thus, the
"basis.cm" library that was analyzed "here" but then moved "there"
will also be found "there" if the anchor has been re-set accordingly.
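
To make the anchor mechanism concrete, here is a toy model in shell.
It is not CM's implementation: the ANCHORS table format, the resolve
function and the "/opt/smlnj" directory are invented for illustration.
It shows the two properties described above: the directory for a known
first arc is prepended to the path, and resolution happens at the time
of use, so re-setting an anchor changes the result for the same name.

```sh
# Toy anchor table (hypothetical format): one "anchor directory" pair per line.
ANCHORS='a /usr/lib/smlnj
basis.cm /usr/lib/smlnj'

# Resolve a pathname at the time of use.  If the first arc is a known
# anchor, prepend the directory that the anchor maps to; otherwise
# return the name unchanged.
resolve() {
    path=$1
    first=${path%%/*}
    dir=$(printf '%s\n' "$ANCHORS" |
          awk -v a="$first" '$1 == a { print $2; exit }')
    if [ -n "$dir" ]; then
        printf '%s/%s\n' "$dir" "$path"
    else
        printf '%s\n' "$path"
    fi
}

resolve a/foo.sml     # prints /usr/lib/smlnj/a/foo.sml
resolve basis.cm      # prints /usr/lib/smlnj/basis.cm

# Re-set the anchor: the same name now resolves to the new place.
ANCHORS='a /usr/lib/smlnj
basis.cm /opt/smlnj'
resolve basis.cm      # prints /opt/smlnj/basis.cm
resolve other/x.sml   # prints other/x.sml (no such anchor)
```

Nothing in this model consults the file system, which mirrors the
point that the mapping does not depend on what happens to exist in any
directory.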

+ Different configurations at different times:

During compilation of the compiler, CMB uses a path configuration that
is read from the file "pathconfig" located here in this directory.
Warning: The names in that pathconfig file are relative pathnames and
will work only if you are in this directory.  (This will typically be
the case since you are compiling the compiler. Normally, however, path
configurations should map anchors to absolute pathnames.)

At bootstrap time, the same anchors are mapped to the corresponding
sub-directory of the "boot" directory: basis.cm is mapped to
comp.boot.<arch>-<os>/basis.cm -- which means that CM will look for a
library named comp.boot.<arch>-<os>/basis.cm/basis.cm -- and so forth.

By the way, you will perhaps notice that there is no file
	comp.boot.<arch>-<os>/basis.cm/basis.cm
but there _is_ a corresponding stable archive.  CM always looks for
stable archives first.

This mapping (from anchors to names in the boot directory) is the one
that will get frozen into the generated heap image at boot time.
Thus, unless it is changed, CM will look for its libraries in the boot
directory.  The aforementioned "testml" script will make sure that
the mapping is changed to the one specified in a new "pathconfig" file
which was created by makeml and placed into the test library
directory.  It points all anchors to the corresponding entry in the
test library directory.  Thus, "testml" will let a new heap image run
with its corresponding new libraries.

Normally, however, CM consults other pathconfig files at startup --
files that live in standard locations.  These files are used to modify
the path configuration to let anchors point to their "usual" places.
The names of the files that are read (if present) are configurable via
environment variables.  At the moment they default to
The first one is configurable via CM_PATHCONFIG (and the default is
configurable at boot time via CM_PATHCONFIG_DEFAULT); the last is
In fact, the makeml script sets the CM_PATHCONFIG_DEFAULT variable
before making the heap image.  Therefore, heap images generated by
makeml will look for their global pathconfig file in


For example, I always keep my "good" libraries in `pwd`/../../lib --
where both the main "install" script and the "installml" script (see
below) also put them -- so I don't have to do anything special about
my pathconfig file.

Once I have a new heap image and libraries working, I replace the old
"good" image with the new one:

  mv <image>.<arch>-<osvariant> ../../bin/.heap/sml.<arch>-<osvariant>

and then:

  rm -r ../../lib/*.cm
  mv <image>.libs/*.cm ../../lib

For convenience, there is a script called "installml" that automates
this task.  Using the script has the added advantage that it will not
clobber libraries that belong to architectures other than the current one.
(The rather heavy-handed "rm/mv" approach above will delete all stable
libraries for all architectures.)
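
For readers who prefer to see the manual steps in one place, here is a
sketch that wraps the "rm/mv" commands shown above in a shell function.
It is not the installml script and shares the heavy-handed behavior
just described.  The function name install_image and its ROOT argument
(standing in for "../..") are hypothetical; the checks at the top are
an added precaution so that image and libraries are only ever replaced
together.

```sh
# install_image IMAGE ARCH_OS ROOT
# Replace the installed heap image and the installed stable libraries
# under ROOT with the new ones named after IMAGE.
install_image() {
    image=$1
    arch_os=$2
    root=$3
    heap="$image.$arch_os"
    libs="$image.libs"
    # Refuse to touch anything unless both halves of the new system exist.
    [ -f "$heap" ] || { echo "missing heap image $heap" >&2; return 1; }
    [ -d "$libs" ] || { echo "missing library directory $libs" >&2; return 1; }
    mv "$heap" "$root/bin/.heap/sml.$arch_os" &&
    rm -rf "$root"/lib/*.cm &&
    mv "$libs"/*.cm "$root/lib"
}

# Example (hypothetical image name and architecture):
#   install_image sml x86-unix ../..
```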

Of course, you can organize things differently for yourself -- the
path configuration mechanism should be sufficiently flexible.

* Libraries vs. Groups

With the old CM, "group" was the primary concept while "library" and
"stabilization" could be considered afterthoughts.  This has changed.
Now "library" is the primary concept, "stabilization" is semantically
significant, and "groups" are a secondary mechanism.

Libraries are used to "structure the world"; groups are used to give
structure to libraries.  Each group can be used either in precisely
one library (in which case it cannot be used at the interactive
toplevel) or at the toplevel (in which case it cannot be used in any
library).  In other words, if you count the toplevel as a library,
then each group has a unique "owner" library.  Of course, there still
is no limit on how many times a group can be mentioned as a member of
other groups -- as long as all these other groups belong to the same
owner library.

If you want to take a collection of files whose purpose fits that of a
library, then, please, make them into a library (i.e., not a group!).
The purpose of groups is to deal with name-space issues _within_
libraries.

Aside from the fact that I find this design quite natural, there is
actually a technical reason for it: when you stabilize a library
(groups cannot be stabilized), then all its sub-groups (not
sub-libraries!)  get "sucked into" the stable archive of the library.
In other words, even if you have n+1 CM description files (1 for the
library, n for n sub-groups), there will be just one file representing
the one stable archive (per architecture/os) for the whole thing.  For
example, I structured the standard basis into one library with two
sub-groups, but once you compile it (CMB.deliver) there is only one
stable file that represents the whole basis library.  If groups were
allowed to appear in more than one library, then stabilization would
duplicate the group (its code, its environment data structures, and
even its dynamic state).

There is a small change to the syntax of group description files: they
must explicitly state which library they belong to. CM will verify
that.  The owner library is specified in parentheses after the "group"
keyword.  If the specification is missing (that's the "old" syntax),
then the owner will be taken to be the interactive toplevel.
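
For illustration, the two forms might look like the files written by
the following shell snippet.  This is a sketch only: the file names
my-group.cm, my-lib.cm, toplevel-group.cm, helper.sml and scratch.sml
and the structure names are hypothetical, and the capitalized spelling
"Group" is assumed by analogy with "Library" in the preprocessor
example later in this document.

```sh
# New syntax: a group description that names its owner library.
cat > my-group.cm <<'EOF'
Group (my-lib.cm)
  structure Helper
is
  helper.sml
  basis.cm
EOF

# Old syntax for comparison: with no owner given, the owner is taken
# to be the interactive toplevel.
cat > toplevel-group.cm <<'EOF'
Group
  structure Scratch
is
  scratch.sml
EOF
```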

There are several examples of this throughout the system's source
hierarchy.  One notable case is MLRISC.  It should probably be made
into a library of its own, but I leave this job to Lal.  At the moment
MLRISC.cm is a sub-group of viscomp-lib.cm.

* Pervasive environment, core environment, other "primitive" environments

Just a handful of files is compiled at the beginning in order to
establish a number of "primitive" environments -- including the
"pervasive" environment and the "core" environment.  The pervasive
environment no longer includes the entire basis library but only
non-modular bindings (top-level bindings of variables and types).

CM cannot automatically determine dependencies for these initial
source files, but it still does use its regular cutoff recompilation
mechanism.  Therefore, dependencies must be given explicitly.  This is
done by a special description file which currently lives in
Init/init.cmi.  See the long comment at the beginning of that file for
more details.

* Autoloader

The new system heavily relies on the autoloader.  As a result, almost
no static environments need to get unpickled at bootstrap time.  The
construction of such environments is deferred until they become
necessary.  Because of this, I was able to reduce the size of the heap
image by more than one megabyte (depending on the architecture).  The
downside (although not really terribly bad) is that there is a short
wait when you first touch an identifier that hasn't been touched
before.  (I acknowledge that the notion of "short" may depend on your
sense of urgency. :-)

The reliance on the autoloader (and therefore CM's library mechanism)
means that in order to be able to use the system, your paths must be
properly configured.

Two libraries get pre-registered at bootstrap time: the basis library
("basis.cm") and CM itself ("host-cm.cm").  The latter is crucial:
without it one wouldn't be able to register any other libraries
via CM.autoload.  The registration of basis.cm is a mere convenience.

Here are some other useful libraries that are not pre-registered but
which can easily be made accessible via CM.autoload (or, non-lazily,
via CM.make):

	host-compiler.cm	- provides "structure Compiler"
	host-cmb.cm		- provides "structure CMB"
	target-compilers.cm	- provides "structure <Arch>Compiler" and
				  "structure <Arch><OS>CMB" for various
				  values of <Arch> and <OS>
	smlnj-lib.cm		- the SML/NJ library

* Internal sharing

Dynamic values of loaded modules are shared.  This is true even for
those modules that are used by the interactive compiler itself.  If
you load a module from a library that is also used by the interactive
compiler, then "loading" means "loading the static environment" -- it
does not mean "loading the code and linking it".  Instead, you get to
share the compiler's dynamic values (and therefore the executable
code as well).

Of course, if you load a module that hasn't been loaded before and
also isn't used by the interactive system, then CM will get the code
and link (execute) it.

* Access control

In some places, you will find that the "group" and "library" keywords
in description files are preceded by certain strings, sometimes in
parentheses.  These strings are the names of "privileges".  Don't
worry about them too much at the moment.  For the time being, access
control is not enforced, but the infrastructure is in place.

* Preprocessor

The syntax of expressions in #if and #elif clauses is now more ML-ish
instead of C-ish.  (Hey, this is ML after all!)  In particular, you
must use "andalso", "orelse", and "not" instead of "&&", "||" and "!".
Unary minus is "~".
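
When converting old description files, the operator change can be
applied mechanically.  The following is a rough sketch, not a tool
that ships with the system: the function name mlify_conditionals and
the symbol OLD_BASIS in the usage comment are hypothetical.  It
rewrites only "&&", "||" and "!" on #if and #elif lines; anything else
(including "!=" and unary minus) is left for manual review.

```sh
# mlify_conditionals FILE
# Print FILE with C-style operators on #if/#elif lines replaced by
# their ML-style spellings.  Other lines are passed through unchanged.
mlify_conditionals() {
    sed -E '/^#(if|elif)[[:space:]]/ {
        s/&&/ andalso /g
        s/\|\|/ orelse /g
        s/!([^=])/not \1/g
        s/  +/ /g
    }' "$1"
}

# Example: a line such as
#   #if SMLNJ_VERSION > 110 && !OLD_BASIS
# comes out as
#   #if SMLNJ_VERSION > 110 andalso not OLD_BASIS
```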

A more interesting change is that you can now query the exports of
members:

  - Within the "members" section of the description (i.e., after "is"):
    The expression
	defined(<namespace> <name>)
    is true if any of the included members preceding this clause exports
    a symbol "<namespace> <name>".
  - Within the "exports" section of the description (i.e., before "is"):
    The same expression is true if _any_ of the members exports the
    named symbol.
    (It would be more logical if the exports section followed the
     members section, but for esthetic reasons I prefer the exports
     section to come first.)


Example:

	Library
	  structure Foo
	#if defined(structure Bar)
	  structure Bar
	#endif
	is
	#if SMLNJ_VERSION > 110
	  new-foo.sml
	#else
	  old-foo.sml
	#endif
	#if defined(structure Bar)
	  bar-client.sml
	#else
	  no-bar-so-far.sml
	#endif

Here, the file "bar-client.sml" gets included if SMLNJ_VERSION is
greater than 110 and new-foo.sml exports a structure Bar _or_ if
SMLNJ_VERSION <= 110 and old-foo.sml exports structure Bar. Otherwise
"no-bar-so-far.sml" gets included instead.  In addition, the export of
structure Bar is guarded by its own existence.  (Structure Bar could
also be defined by "no-bar-so-far.sml" in which case it would get
exported regardless of the outcome of the other "defined" test.)

Some things to note:

  - For the purpose of the pre-processor, order among members is
    significant.  (For the purpose of dependency analysis, order continues
    to be not significant).
  - As a consequence, in some cases pre-processor dependencies and
    compilation-dependencies may end up being opposites of each other.
    (This is not a problem; it may very well be a feature.)

* The Basis Library is no longer built-in

The SML'97 basis is no longer built-in.  If you want to use it, you
must specify "basis.cm" as a member of your group/library.

* No more aliases

The "alias" feature is no longer with us.  At first I thought I could
keep it, but it turns out that it causes some fairly fundamental
problems with the autoloader.  However, I don't think that this is a
big loss because path anchors make up for most of it.  Moreover,
stable libraries can now easily be moved to convenient locations
without having to move large source trees at the same time. (See my
new build/install.sh script for examples of that.)

* Don't use relative or absolute pathnames to refer to libraries

Don't use relative or absolute pathnames to refer to libraries.  If
you do it anyway, you'll get an appropriate warning at the time when
you do CMB.deliver().  If you use relative or absolute pathnames to
refer to library B from library A, you will be committed to keeping B
in the same relative (to A) or absolute location.  This, clearly,
would be undesirable.
