[smlnj] Diff of /sml/trunk/src/MLRISC/x86/instructions/x86Shuffle.sml

revision 796, Tue Mar 6 00:04:33 2001 UTC revision 797, Fri Mar 16 00:00:17 2001 UTC
# Line 1  Line 1 
1    (* NOTE on xchg on the x86
2     *
3     * From Allen Leung:
4     * Here's why I didn't use xchg:
5     *
6     * o  According to the optimization guide xchg mem, reg is complex,
7     *    cannot be pipelined or paired at all. xchg reg, reg requires 3 uops.
8     *    In contrast, mov mem, reg requires 1 or 2 uops.
9     *    So xchg loses out, at least on paper.
10     *    [I haven't done any measurements though]
11     *
12     * o  Secondly, unlike other architectures, parallel copies are split
13     *    into individual copies during instruction selection.  Here's why
14     *    I did this:  I found that more copies are retained and more spills
15     *    are generated when keeping the parallel copies.   My guess on this is
16     *    that the copy temporary for parallel copies creates additional
17     *    interferences [even when they are not needed.]
18     *    This is not a problem on RISC machines, because of plentiful registers.
19     *
20     * o  Spilling of parallel copies is also a very complex business when
21     *    memory coalescing is turned on.  I think I have implemented a solution
22     *    to this, but not using parallel copies keeps life simple.   This problem
23     *    could be simpler with xchg...but I haven't thought about it much.
24     *
25     * From Fermin Reig:
26     * On the java@gcc.gnu.org and GC mailing lists there has been a discussion
27     * about the costs of xchg. Here are some extracts from it:
28     *
29     * ----------------
30     * > From: Emery Berger [mailto:emery@cs.utexas.edu]
31     * >
32     * > http://developer.intel.com/design/pentium4/manuals/24547203.pdf
33     * >
34     * > See Chapter 7.1. "For the P6 family processors, locked
35     * > operations serialize
36     * > all outstanding load and store operations (that is, wait for them to
37     * > complete). This rule is also true for the Pentium 4
38     * > processor, with one
39     * > exception: load operations that reference weakly ordered
40     * > memory types (such
41     * > as the WC memory type) may not be serialized. "
42     * >
43     * -----------------
44     * I just tried this on a 500 MHz Pentium III.  I get about 23 cycles for
45     *
46     * lock; cmpxchg
47     *
48     * :
49     * and about 19 or 20 cycles for xchg (which has an implicit lock prefix).
50     *
51     * I got consistent results by timing a loop and by looking at an instruction
52     * level profile.  Putting other stuff in the loop didn't seem to affect the
53     * time taken by xchg much.  Here's the code in case someone else wants to try.
54     * (This requires Linux/gcc)
55     * -------------------
56     * Chris Dodd pointed out on the GC mailing list that on recent Intel X86
57     * processors:
58     *
59     * - cmpxchg without a lock prefix is much faster (roughly 3x or close to 15
60     * cycles by my measurements) than either xchg (implied lock prefix) or lock;
61     * cmpxchg .
62     *
63     * - cmpxchg without the lock prefix is atomic on uniprocessors, i.e. it's not
64     * interruptible.
65     *
66     * As far as I can tell, none of the GNU libraries currently take advantage of
67     * this fact.  Should they?
68     *
69     * This argues, for example, that I could get noticeable additional speedup from
70     * Java hash synchronization on X86 by overwriting a few strategic "lock"
71     * prefixes with "nop"s when I notice that there's only one processor.
72     *
73     *
74     * From John Reppy:
75     *
76     * Disregard what I said.  The xchg instruction has an implicit lock prefix,
77     * so it is not useful for normal programming tasks.
78     *)
79    
80  functor X86Shuffle(I : X86INSTR) : X86SHUFFLE =
81  struct
82    structure I = I
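
The note above explains why register shuffles on the x86 are expressed as plain mov sequences rather than xchg. As an illustration only, here is a minimal Standard ML sketch of how a parallel copy can be turned into sequential moves using one scratch register. It is not the MLRISC implementation (the body of X86Shuffle is not shown in this diff); the names `reg`, `sequentialize`, and `tmp` are hypothetical, registers are modelled as plain integers, and the sketch assumes all destinations are distinct and that `tmp` does not appear in any copy.

```sml
(* Hypothetical model: a register is an int, a copy is (dst, src). *)
type reg = int

(* Turn a parallel copy into an equivalent list of sequential moves.
 * Assumes distinct destinations and that tmp occurs in no copy. *)
fun sequentialize (tmp : reg) (copies : (reg * reg) list) : (reg * reg) list =
  let
    (* Moves of the form r <- r do nothing and are dropped. *)
    val copies = List.filter (fn (d, s) => d <> s) copies

    (* Is register r still read by some pending copy? *)
    fun isSrc r cs = List.exists (fn (_, s) => s = r) cs

    fun loop ([], acc) = List.rev acc
      | loop (pending, acc) =
          (* A copy is safe to emit once no pending copy reads its destination. *)
          (case List.partition (fn (d, _) => not (isSrc d pending)) pending of
             (ready as (_ :: _), blocked) =>
               loop (blocked, List.revAppend (ready, acc))
           | ([], (d, s) :: rest) =>
               (* Only cycles remain: save d in tmp and make the readers of d
                * read tmp instead, which unblocks the copy d <- s. *)
               let
                 val rest' =
                   map (fn (d', s') => (d', if s' = d then tmp else s')) rest
               in
                 loop ((d, s) :: rest', (tmp, d) :: acc)
               end
           | ([], []) => List.rev acc)
  in
    loop (copies, [])
  end

(* Swapping r1 and r2 with scratch register 99 yields three moves:
 * [(99, 1), (1, 2), (2, 99)], i.e. tmp <- r1; r1 <- r2; r2 <- tmp. *)
val swap = sequentialize 99 [(1, 2), (2, 1)]
```

The swap case is where a register-register xchg could replace the three moves. The note above argues against that on cost grounds, and the scratch register is what introduces the extra interferences it mentions.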

