SCM Repository
Annotation of /sml/trunk/ckit/doc/overview.html
Parent Directory
|
Revision Log
Revision 597 - (view) (download) (as text)
1 : | dbm | 597 | <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> |
2 : | <html> | ||
3 : | <head> | ||
4 : | <title>ckit Overview</title> | ||
5 : | </head> | ||
6 : | |||
7 : | <body> | ||
8 : | <center> | ||
9 : | <h1>ckit: A Front End for C in SML</h1> | ||
10 : | </center> | ||
11 : | <h3>1. Getting Started</h3> | ||
12 : | <p> | ||
13 : | On unpacking the ckit sources, you should see a src directory, a doc directory | ||
14 : | and a README file (and possibly other directories, depending on the distribution). | ||
15 : | <p> | ||
16 : | The src directory contains the following subdirectories: | ||
17 : | <dl> | ||
18 : | <dt>parser/ | ||
19 : | <dd>lexer and parser, parse trees. | ||
20 : | <dt>ast/ | ||
21 : | <dd>abstract syntax trees (Ast), type-checker, pretty-printer. | ||
22 : | <dt>variants/ | ||
23 : | <dd>flags for controlling the parser and type-checker. | ||
24 : | </dl> | ||
25 : | To build the system, cd to src, run SML/NJ and type | ||
26 : | <pre> | ||
27 : | - CM.make(); | ||
28 : | </pre> | ||
29 : | To test the parser on "test.c", type | ||
30 : | <pre> | ||
31 : | - ParseToAst.fileToAst "test.c"; | ||
32 : | </pre> | ||
33 : | This parses and typechecks "test.c" and returns an abstract syntax tree for | ||
34 : | "test.c". Alternatively, to parse, type-check and then pretty-print "test.c", | ||
35 : | type | ||
36 : | <pre> | ||
37 : | - ParseToAst.fileToC "test.c"; | ||
38 : | </pre> | ||
39 : | <p> | ||
40 : | <h3>2. Using the Frontend</h3> | ||
41 : | <p> | ||
42 : | C source programs are processed in two steps. The lexer and parser | ||
43 : | translate the source to parse trees (Parser.parseFile), and the | ||
44 : | "elaboration" or static semantics phase (BuildAst.makeAst) performs | ||
45 : | type checking and translates to abstract syntax. The parse tree | ||
46 : | datatypes are defined in parse/parse-tree-sig.sml and the abstract | ||
47 : | syntax types in ast/ast-sig.sml. These definitions are fairly | ||
48 : | straightforward and should be self-explanatory. | ||
49 : | <p> | ||
50 : | Top level driving functions are in module <code>ParseToAst</code> (see | ||
51 : | ast/parse-to-ast-sig.sml). The following subsections describe some | ||
52 : | commonly used ckit functions. | ||
53 : | <p> | ||
54 : | <h4>2.1. <code>ParseToAst.fileToAst: string -> ParseToAst.astBundle</code></h4> | ||
55 : | <p> | ||
56 : | This is the main function to parse a file and produce abstract syntax. | ||
57 : | When applied to a string (the C source file name), it produces a | ||
58 : | bundle of information of type <code>astBundle</code>: | ||
59 : | <pre> | ||
60 : | type astBundle = | ||
61 : | {ast: Ast.ast, | ||
62 : | tidtab: Bindings.tidBinding Tidtab.uidtab, | ||
63 : | errorCount: int, | ||
64 : | warningCount: int, | ||
65 : | auxiliaryInfo: {astTypes: Tables.astIndexTab, | ||
66 : | implicits: Tables.astIndexTab, | ||
67 : | env: State.symtab}} | ||
68 : | </pre> | ||
69 : | where: | ||
70 : | <menu> | ||
71 : | <li> <code>ast</code> is the abstract syntax tree. | ||
72 : | <li> <code>tidtab</code> is the type identifier table that maps type identifiers into their meanings. | ||
73 : | <li> <code>errorCount</code> is the count of all errors encountered during parsing and type checking. | ||
74 : | <li> <code>warningCount</code> is the count of all warnings encountered during parsing and type checking. | ||
75 : | <li> <code>astTypes</code> is a table mapping ast indexes into the types of the corresponding ast expressions. | ||
76 : | <li> <code>env</code> is used to carry over global symbol information in some mult-file parsing applications. | ||
77 : | </menu> | ||
78 : | |||
79 : | <h4>2.2. <code>ParseToAst.fileToC : string -> unit</code></h4> | ||
80 : | <p> | ||
81 : | Process a file and pretty print the resulting ast. | ||
82 : | |||
83 : | <h4>2.3. <code>Parser.parseFile : Error.errorState -> string -> ParseTree.externalDecl list</code></h4> | ||
84 : | To get a hold of a parse tree (parser/parse-tree-sig.sml), | ||
85 : | use <code>Parser.parseFile</code> (see parser/parser-sig.sml). | ||
86 : | This function takes an <code>errorState</code> and the | ||
87 : | name of a (preprocessed) C source file and returns a list of external | ||
88 : | declaration parse trees corresponding to the top-level declarations in | ||
89 : | the source file. See parser/parse-tree-sig.sml for definitions of | ||
90 : | the parse tree types and parser/util/error-sig.sml for documentation | ||
91 : | on Error.errorState. | ||
92 : | <p> | ||
93 : | |||
94 : | <h3>3. System Structure</h3> | ||
95 : | <p> | ||
96 : | The frontend consists of a number of phases. | ||
97 : | The first phase consists of a lexer/parser (written using ml-lex and ml-yacc | ||
98 : | respectively). The output of this phase is a data-structure (parse tree) | ||
99 : | that is a simple "unprocessed" form that closely follows the structure of C's | ||
100 : | grammar. The next phase inputs the parse tree data-structure, type checks it, | ||
101 : | and produces a "processed" abstract syntax tree representation (Ast). | ||
102 : | <p> | ||
103 : | <h4>3.1. The Lexer and Parser</h4> | ||
104 : | <p> | ||
105 : | These are built using ml-lex and ml-yacc. The lex and yacc files can be found | ||
106 : | in src/parser/grammar/[c.lex,c.grm]. The parser performs only a minimal amount | ||
107 : | of syntactic processing. Many syntactic restrictions are enforced during the | ||
108 : | type-checking phase e.g restrictions on the number and combination of type | ||
109 : | specifiers used in a type. | ||
110 : | <p> | ||
111 : | Similarly, most scoping issues are addressed during type-checking. | ||
112 : | One exception is typedef. This must be handled during parsing because typedefs | ||
113 : | introduce new types and these can dramatically alter the shape of parse trees. | ||
114 : | In principle, the scoping of typedefs could be delayed till later processing, | ||
115 : | but in practice this is not feasible: in particular, if typedefs are not | ||
116 : | processed during parsing, then we cannot distinguish between | ||
117 : | declaration forms and expressions. Consider, the following program. | ||
118 : | <pre> | ||
119 : | char x; | ||
120 : | f() { | ||
121 : | typedef int x; | ||
122 : | { | ||
123 : | x * x; | ||
124 : | } | ||
125 : | } | ||
126 : | </pre> | ||
127 : | Here, "<code>x * x</code>" declares <code>x</code> as a pointer to an integer. | ||
128 : | However, if the typedef is commented out, then | ||
129 : | "<code>x * x</code>" is interpreted as an expression. | ||
130 : | <p> | ||
131 : | The treatment of typedefs involves a subtle interaction between the parser and | ||
132 : | lexer. When the parser recognizes a typedef for an identifier, it communicates | ||
133 : | to the lexer that the identifier should now be treated as a "type". | ||
134 : | Parser lookahead introduces additional complication: we cannot lex a token until | ||
135 : | any preceding typedefs have been processed. In particular, we must limit | ||
136 : | lookahead to one symbol. In fact, this only works because C's grammar requires | ||
137 : | typedefs to end in a semicolon --- this semicolon acts as a buffer so that a | ||
138 : | typedef will be completely processed before any use of the new type is lexed. | ||
139 : | Note that typedefs can be scoped (e.g. see the above program), and so the parser | ||
140 : | must tell the lexer to undo the effect of a typedef when the typedef's scope is | ||
141 : | exited. Another complication is the error recovery mechanism of ml-yacc. | ||
142 : | <p> | ||
143 : | The parser produces parse trees (see src/parser/parse-tree-sig.sml). | ||
144 : | This data structure is a simple "unprocessed" form that closely follows the | ||
145 : | structure of C's grammar. These parse trees are built up by the actions of the | ||
146 : | ml-yacc grammar. | ||
147 : | <p> | ||
148 : | Any language extensions is likely to involve extensions to the lexer, | ||
149 : | parser and to the parse tree datatype. When extending the lexer and | ||
150 : | parser, care must be taken to preserve the interaction between the | ||
151 : | lexer, the parser, and the use of one-token lookahead. Extensions to | ||
152 : | the parse tree datatype are supported via a collection of "Ext" | ||
153 : | constructors in the parse tree datatypes. The file | ||
154 : | extensions/c/parse-tree-ext.sml contains the default "empty extension" | ||
155 : | for standard C. | ||
156 : | <p> | ||
157 : | Files: | ||
158 : | <dl> | ||
159 : | <dt>parser/parser-tree-sig.sml, parser-tree.sml | ||
160 : | <dd>definition of parse tree types | ||
161 : | <dt>parser/grammar/c.lex, c.grm | ||
162 : | <dd>lex and yacc specifications | ||
163 : | <dt>parser/util/sourcemap-sig.sml, sourcemap.sml | ||
164 : | <dd>mapping source file locations | ||
165 : | <dt>parser/util/error-sig.sml, error.sml | ||
166 : | <dd>error reporting functions | ||
167 : | </dl> | ||
168 : | <p> | ||
169 : | <h4>3.2. Abstract Syntax Trees (AST'S) And BuildAst</h4> | ||
170 : | <p> | ||
171 : | BuildAst (src/ast/build-ast.sml) consumes parse trees and builds up abstract | ||
172 : | syntax trees (Ast's) while performing type checking. Ast's (src/ast/ast.sml) | ||
173 : | are defined so that each of the major syntactic categories (statements, | ||
174 : | expressions, and declarations) have a unique integer index associated with them. | ||
175 : | These indices are used to associate information with specific parts of the | ||
176 : | code. Care must be taken to preserve their uniqueness when performing code | ||
177 : | transformations. | ||
178 : | <p> | ||
179 : | Objects (global variables, local variables, functions, etc) and struct/union | ||
180 : | fields are assigned globally unique integers called program identifiers | ||
181 : | (pids). This simplifies treatment of scope in Ast. Similarly, types introduced | ||
182 : | by structs, unions, enums and typedefs are assigned globally unique | ||
183 : | integers called type identifiers (tids). | ||
184 : | <p> | ||
185 : | BuildAst performs the following tasks: | ||
186 : | <ol> | ||
187 : | <li> Scoping: scoping of variables, structs, unions, fields and enums | ||
188 : | is resolved. | ||
189 : | <p> | ||
190 : | <li> Type Checking: Full ANSIC C type checking is performed, and | ||
191 : | appropriate errors and warnings are generated. Errors and warnings | ||
192 : | are suppressed in the case where there are parse errors. The | ||
193 : | behaviour of the type checker can be customized using a collection of | ||
194 : | flags in the TypeCheckControl structure defined in | ||
195 : | src/variants/ansic/config.sml. BuildAst incrementally constructs a | ||
196 : | mapping between expression indices and types that records the type of | ||
197 : | each expression. | ||
198 : | <p> | ||
199 : | <li> Type Sizes And Memory Layout: BuildAst computes the sizes of the | ||
200 : | objects declared in the program. It also optionally reduces sizeof | ||
201 : | expressions to integer constants (the flag BuildAst.reduce_sizeof can | ||
202 : | be used to enable this feature; the default setting does not reduce | ||
203 : | sizeof constructs). BuildAst also computes the layout and alignment | ||
204 : | properties of all objects, including the offsets for fields of | ||
205 : | structs. Type size and memory layout is architecture and compiler | ||
206 : | specific. The behaviour of this aspect of BuildAst is specified in | ||
207 : | Sizes structure defined in src/variants/ansic/config.sml. | ||
208 : | <p> | ||
209 : | <li> Initializer Normalization: The meaning of an object initializer | ||
210 : | is partly determined by the type of the object begin initialized. | ||
211 : | BuildAst normalizes initializers so that they are easier to implement. | ||
212 : | Moreover, certain aspects of the type of an object are inferred from | ||
213 : | an initializer (e.g. int x[] = {1,2,3}). | ||
214 : | </ol> | ||
215 : | Files: | ||
216 : | <dl> | ||
217 : | <dt>ast/ast-sig.sml, ast.sml | ||
218 : | <dd>definition of abstract syntax datatypes. | ||
219 : | <dt>ast/build-ast.sml | ||
220 : | <dd>translation from parse trees to abstract syntax, with type | ||
221 : | checking and other static semantics processing. | ||
222 : | <dt>extensions/c/ | ||
223 : | <dd>dummy extension structures for C | ||
224 : | <dt>variants/ansic/config.sml | ||
225 : | <dd>various flags controlling error checking, type checking, etc. | ||
226 : | </dl> | ||
227 : | <p> | ||
228 : | <h4>3.3. Pretty Printer for AST</h4> | ||
229 : | Ast comes equipped with a pretty-printer (ast/pp/pp-ast-sig.sml). Not | ||
230 : | only is this useful for debugging purposes, but it also is an integral | ||
231 : | component of source-to-source applications of the frontend. When | ||
232 : | pretty printing Ast, pids and tids can be optionally printed. The | ||
233 : | following flags control this behavior: | ||
234 : | <pre> | ||
235 : | PPLib.suppressPidUnderscores: controls printing of pids | ||
236 : | PPLib.suppressPidGlobalUnderscores: controls printing of pids for global objects | ||
237 : | PPLib.suppressTidUnderscores: controls printing of tids. | ||
238 : | </pre> | ||
239 : | Files: | ||
240 : | <dl> | ||
241 : | <dt>pp/pp-ast-fn.sml | ||
242 : | <dd>the generic pretty printing code for ast | ||
243 : | <dt>pp/pp-ast-sig.sml | ||
244 : | <dd>pretty printing signature | ||
245 : | <dt>pp/pp-ast.sml | ||
246 : | <dd>default pretty printer | ||
247 : | <dt>pp/pp-ast-adornment.sml | ||
248 : | <dd>pretty printer for printing ast interspersed with adornment info | ||
249 : | <dt>pp/pp-lib.sml | ||
250 : | <dd>pretty printing for identifiers; some pretty printing combinators. | ||
251 : | </dl> | ||
252 : | <p> | ||
253 : | <h4>3.4. AST-UTILS [Not distributed yet]</h4> | ||
254 : | <p> | ||
255 : | Files: | ||
256 : | <dl> | ||
257 : | <dt>ast-utils/copy/ | ||
258 : | <dd>copying ast types | ||
259 : | <dt>ast-utils/equality/ | ||
260 : | <dd>equality for ast types | ||
261 : | <dt>ast-utils/simplifier/ | ||
262 : | <dd>ast simplifier | ||
263 : | </dl> | ||
264 : | <p> | ||
265 : | |||
266 : | <h3>4. Location Info</h5> | ||
267 : | <p> | ||
268 : | Program phrases (expressions, declarations, statements) are annotated | ||
269 : | in the abstract syntax with source code locations, which are | ||
270 : | represented by a data structure that determines a region within a | ||
271 : | source file. See src/parser/sourcemap-sig.sml. | ||
272 : | |||
273 : | <hr> | ||
274 : | <address><a href="mailto:dbm@research.bell-labs.com">Dave MacQueen</a></address> | ||
275 : | <!-- Created: Tue Dec 7 11:11:32 EST 1999 --> | ||
276 : | <!-- hhmts start --> | ||
277 : | Last modified: Tue Dec 7 15:39:13 EST 1999 | ||
278 : | <!-- hhmts end --> | ||
279 : | </body> | ||
280 : | </html> |
root@smlnj-gforge.cs.uchicago.edu | ViewVC Help |
Powered by ViewVC 1.0.0 |