11 |
Date: |
Date: |
12 |
Tag: <post-commit CVS tag> |
Tag: <post-commit CVS tag> |
13 |
Description: |
Description: |
14 |
|
|
15 |
|
---------------------------------------------------------------------- |
16 |
|
Name: Allen Leung |
17 |
|
Date: 2000/03/29 18:00:00 |
18 |
|
Tag: leunga-20000327-mlriscGen_hppa_alpha_x86 |
19 |
|
Description: |
20 |
|
|
21 |
|
This update contains *MAJOR* changes to the way code is generated from CPS |
22 |
|
in the module mlriscGen, and in various backend modules. |
23 |
|
|
24 |
|
CHANGES |
25 |
|
======= |
26 |
|
|
27 |
|
1. MLRiscGen: forward propagation fix. |
28 |
|
|
29 |
|
There was a bug in forward propagation introduced at about the same time |
30 |
|
as the MLRISC x86 backend, which prohibits coalescing to be |
31 |
|
performed effectively in loops. |
32 |
|
|
33 |
|
Effect: speed up of loops in RISC architectures. |
34 |
|
By itself, this actually slowed down certain benchmarks on the x86. |
35 |
|
|
36 |
|
2. MLRiscGen: forward propagating addresses from consing. |
37 |
|
|
38 |
|
I've changed the way consing code is generated. Basically I separated |
39 |
|
out the initialization part: |
40 |
|
|
41 |
|
store tag, offset(allocptr) |
42 |
|
store elem1, offset+4(allocptr) |
43 |
|
store elem2, offset+8(allocptr) |
44 |
|
... |
45 |
|
store elemn, offset+4n(allocptr) |
46 |
|
|
47 |
|
and the address computation part: |
48 |
|
|
49 |
|
celladdr <- offset+4+alloctpr |
50 |
|
|
51 |
|
and move the address computation part |
52 |
|
|
53 |
|
Effect: register pressure is generally lower as a result. This |
54 |
|
makes compilation of certain expressions much faster, such as |
55 |
|
long lists with non-trivial elements. |
56 |
|
|
57 |
|
[(0,0), (0,0), .... (0,0)] |
58 |
|
|
59 |
|
3. MLRiscGen: base pointer elimination. |
60 |
|
|
61 |
|
As part of the linkage mechanism, we generate the sequence: |
62 |
|
|
63 |
|
L: ... <- start of the code fragment |
64 |
|
|
65 |
|
L1: |
66 |
|
base pointer <- linkreg - L1 + L |
67 |
|
|
68 |
|
The base pointer was then used for computing relocatable addresses |
69 |
|
in the code fragment. Frequently (such as in lots of continuations) |
70 |
|
this is not needed. We now eliminate this sequence whenever possible. |
71 |
|
|
72 |
|
For compile time efficiency, I'm using a very stupid local heuristic. |
73 |
|
But in general, this should be done as a control flow analysis. |
74 |
|
|
75 |
|
Effect: Smaller code size. Speed up of most programs. |
76 |
|
|
77 |
|
4. Hppa back end |
78 |
|
|
79 |
|
Long jumps in span dependence resolution used to depend on the existence |
80 |
|
of the base pointer. |
81 |
|
|
82 |
|
A jump to a long label L was expanded into the following sequence: |
83 |
|
|
84 |
|
LDIL %hi(L-8192), %r29 |
85 |
|
LDO %lo(L-8192)(%r29), %r29 |
86 |
|
ADD %r29, baseptr, %r29 |
87 |
|
BV,n %r0(%r29) |
88 |
|
|
89 |
|
In the presence of change (3) above, this will not work. I've changed |
90 |
|
it so that the following sequence of instructions are generated, which |
91 |
|
doesn't mention the base pointer at all: |
92 |
|
|
93 |
|
BL,n L', %r29 /* branch and link, L' + 4 -> %r29 */ |
94 |
|
L': ADDIL L-(L'+4), %r29 /* Compute address of L */ |
95 |
|
BV,n %r0(%r29) /* Jump */ |
96 |
|
|
97 |
|
5. Alpha back end |
98 |
|
|
99 |
|
New alpha instructions LDB/LDW have been added, as per Fermin's |
100 |
|
suggestions. This is unrelated to all other changes. |
101 |
|
|
102 |
|
6. X86 back end |
103 |
|
|
104 |
|
I've changed andl to testl in the floating point test sequence |
105 |
|
whenever appropriate. The Intel optimization guide states that |
106 |
|
testl is perferable to andl. |
107 |
|
|
108 |
|
7. RA (x86 only) |
109 |
|
|
110 |
|
I've improved the spill propagation algorithm, using an approximation |
111 |
|
of maximal weighted independent sets. This seems to be necessary to |
112 |
|
alleviate the negative effect in light of the slow down in (1). |
113 |
|
|
114 |
|
I'll write down the algorithm one of these days. |
115 |
|
|
116 |
|
8. MLRiscGen: frequencies |
117 |
|
|
118 |
|
I've added an annotation that states that all call gc blocks have zero |
119 |
|
execution frequencies. This improves register allocation on the x86. |
120 |
|
|
121 |
|
BENCHMARKS |
122 |
|
========== |
123 |
|
|
124 |
|
I've only perform the comparison on 110.25. |
125 |
|
|
126 |
|
The platforms are: |
127 |
|
|
128 |
|
HPPA A four processor HP machine (E9000) with 5G of memory. |
129 |
|
X86 A 300Hhz Pentium II with 128M of memory, and |
130 |
|
SPARC An Ultra sparc 2 with 512M of memory. |
131 |
|
|
132 |
|
I used the following parameters for the SML benchmarks: |
133 |
|
|
134 |
|
@SMLalloc |
135 |
|
HPPA 256k |
136 |
|
SPARC 512k |
137 |
|
X86 256k |
138 |
|
|
139 |
|
COMPILATION TIME |
140 |
|
---------------- |
141 |
|
Here are the numbers comparing the compilation times of the compilers. |
142 |
|
I've only compared 110.25 compiling the new sources versus |
143 |
|
a fixpoint version of the new compiler compiling the same. |
144 |
|
|
145 |
|
110.25 New |
146 |
|
Total Time in RA Spill+Reload Total Time In RA Spill+Reload |
147 |
|
HPPA 627s 116s 2684+3584 599s 95s 1003+1879 |
148 |
|
SPARC 892s 173s 2891+3870 708s 116s 1004+1880 |
149 |
|
X86 999s 315s 94006+130691 987s 296s 108877+141957 |
150 |
|
|
151 |
|
110.25 New |
152 |
|
Code Size Code Size |
153 |
|
HPPA 8596736 8561421 |
154 |
|
SPARC 8974299 8785143 |
155 |
|
X86 9029180 8716783 |
156 |
|
|
157 |
|
So in summary, things are at least as good as before. Dramatic |
158 |
|
reduction in compilation is obtained on the Sparc; I can't explain it, |
159 |
|
but it is reproducible. Perhaps someone should try to reproduce this |
160 |
|
on their own machines. |
161 |
|
|
162 |
|
SML BENCHMARKS |
163 |
|
-------------- |
164 |
|
|
165 |
|
On the average, all benchmarks perform at least as well as before. |
166 |
|
|
167 |
|
HPPA Compilation Time Spill+Reload Run Time |
168 |
|
110.25 New 110.25 New 110.25 New |
169 |
|
|
170 |
|
barnesHut 3.158 3.015 4.75% 1+1 0+0 2.980 2.922 2.00% |
171 |
|
boyer 6.152 5.708 7.77% 0+0 0+0 0.218 0.213 2.34% |
172 |
|
count-graphs 1.168 1.120 4.32% 0+0 0+0 22.705 23.073 -1.60% |
173 |
|
fft 0.877 0.792 10.74% 1+3 1+3 0.602 0.587 2.56% |
174 |
|
knuthBendix 3.180 2.857 11.32% 0+0 0+0 0.675 0.662 2.02% |
175 |
|
lexgen 6.190 5.290 17.01% 0+0 0+0 0.913 0.788 15.86% |
176 |
|
life 0.803 0.703 14.22% 25+25 0+0 0.153 0.140 9.52% |
177 |
|
logic 2.048 2.007 2.08% 6+6 1+1 4.133 4.008 3.12% |
178 |
|
mandelbrot 0.077 0.080 -4.17% 0+0 0+0 0.765 0.712 7.49% |
179 |
|
mlyacc 22.932 20.937 9.53% 154+181 32+57 0.468 0.430 8.91% |
180 |
|
nucleic 5.183 5.060 2.44% 2+2 0+0 0.125 0.120 4.17% |
181 |
|
ratio-regions 3.357 3.142 6.84% 0+0 0+0 116.225 113.173 2.70% |
182 |
|
ray 1.283 1.290 -0.52% 0+0 0+0 2.887 2.855 1.11% |
183 |
|
simple 6.307 6.032 4.56% 28+30 5+7 3.705 3.658 1.28% |
184 |
|
tsp 0.888 0.862 3.09% 0+0 0+0 7.040 6.893 2.13% |
185 |
|
vliw 24.378 23.455 3.94% 106+127 25+45 2.758 2.707 1.91% |
186 |
|
-------------------------------------------------------------------------- |
187 |
|
Average 6.12% 4.09% |
188 |
|
|
189 |
|
SPARC Compilation Time Spill+Reload Run Time |
190 |
|
110.25 New 110.25 New 110.25 New |
191 |
|
|
192 |
|
barnesHut 3.778 3.592 5.20% 2+2 0+0 3.648 3.453 5.65% |
193 |
|
boyer 6.632 6.110 8.54% 0+0 0+0 0.258 0.242 6.90% |
194 |
|
count-graphs 1.435 1.325 8.30% 0+0 0+0 33.672 34.737 -3.07% |
195 |
|
fft 0.980 0.940 4.26% 3+9 2+6 0.838 0.827 1.41% |
196 |
|
knuthBendix 3.590 3.138 14.39% 0+0 0+0 0.962 0.967 -0.52% |
197 |
|
lexgen 6.593 6.072 8.59% 1+1 0+0 1.077 1.078 -0.15% |
198 |
|
life 0.972 0.868 11.90% 26+26 0+0 0.143 0.140 2.38% |
199 |
|
logic 2.525 2.387 5.80% 7+7 1+1 5.625 5.158 9.05% |
200 |
|
mandelbrot 0.090 0.093 -3.57% 0+0 0+0 0.855 0.728 17.39% |
201 |
|
mlyacc 26.732 23.827 12.19% 162+189 32+57 0.550 0.560 -1.79% |
202 |
|
nucleic 6.233 6.197 0.59% 3+3 0+0 0.163 0.173 -5.77% |
203 |
|
ratio-regions 3.780 3.507 7.79% 0+0 0+0 133.993 131.035 2.26% |
204 |
|
ray 1.595 1.550 2.90% 1+1 0+0 3.440 3.418 0.63% |
205 |
|
simple 6.972 6.487 7.48% 29+32 5+7 3.523 3.525 -0.05% |
206 |
|
tsp 1.115 1.063 4.86% 0+0 0+0 7.393 7.265 1.77% |
207 |
|
vliw 27.765 24.818 11.87% 110+135 25+45 2.265 2.135 6.09% |
208 |
|
---------------------------------------------------------------------------- |
209 |
|
Average 6.94% 2.64% |
210 |
|
|
211 |
|
X86 Compilation Time Spill+Reload Run Time |
212 |
|
110.25 New 110.25 New 110.25 New |
213 |
|
|
214 |
|
barnesHut 5.530 5.420 2.03% 593+893 597+915 3.532 3.440 2.66% |
215 |
|
boyer 8.768 7.747 13.19% 493+199 301+289 0.327 0.297 10.11% |
216 |
|
count-graphs 2.040 2.010 1.49% 298+394 315+457 26.578 28.660 -7.26% |
217 |
|
fft 1.327 1.302 1.92% 112+209 115+210 1.055 0.962 9.71% |
218 |
|
knuthBendix 5.218 5.475 -4.69% 451+598 510+650 0.928 0.932 -0.36% |
219 |
|
lexgen 9.970 9.623 3.60% 1014+841 1157+885 0.947 0.928 1.97% |
220 |
|
life 1.183 1.183 0.00% 162+182 145+148 0.127 0.103 22.58% |
221 |
|
logic 3.285 3.512 -6.45% 514+684 591+836 5.682 5.577 1.88% |
222 |
|
mandelbrot 0.147 0.143 2.33% 38+41 33+54 0.703 0.690 1.93% |
223 |
|
mlyacc 35.457 32.763 8.22% 3496+4564 3611+4860 0.552 0.550 0.30% |
224 |
|
nucleic 7.100 6.888 3.07% 239+168 201+158 0.175 0.173 0.96% |
225 |
|
ratio-regions 6.388 6.843 -6.65% 1182+257 981+300 120.142 120.345 -0.17% |
226 |
|
ray 2.332 2.338 -0.29% 346+398 402+494 3.593 3.540 1.51% |
227 |
|
simple 9.912 9.903 0.08% 1475+941 1579+1168 3.057 3.178 -3.83% |
228 |
|
tsp 1.623 1.532 5.98% 266+200 250+211 8.045 7.878 2.12% |
229 |
|
vliw 33.947 35.470 -4.29% 2629+2774 2877+3171 2.072 1.890 9.61% |
230 |
|
---------------------------------------------------------------------------- |
231 |
|
Average 1.22% 3.36% |
232 |
|
|
233 |
---------------------------------------------------------------------- |
---------------------------------------------------------------------- |
234 |
Name: Allen Leung |
Name: Allen Leung |
235 |
Date: 2000/03/23 16:25:00 |
Date: 2000/03/23 16:25:00 |