Home My Page Projects Code Snippets Project Openings SML/NJ
Summary Activity Forums Tracker Lists Tasks Docs Surveys News SCM Files

SCM Repository

[smlnj] View of /sml/trunk/src/MLRISC/x86/instructions/x86Shuffle.sml
ViewVC logotype

View of /sml/trunk/src/MLRISC/x86/instructions/x86Shuffle.sml

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1003 - (download) (annotate)
Fri Dec 7 02:45:32 2001 UTC (18 years, 1 month ago) by george
File size: 4869 byte(s)
Changed the representation of instructions from being fully abstract
to being partially concrete. That is to say:

  from
	type instruction

  to
	type instr				(* machine instruction *)

	datatype instruction =
	    LIVE of {regs: C.cellset, spilled: C.cellset}
          | KILL of {regs: C.cellset, spilled: C.cellset}
          | COPYXXX of {k: CB.cellkind, dst: CB.cell list, src: CB.cell list}
          | ANNOTATION of {i: instruction, a: Annotations.annotation}
          | INSTR of instr

This makes the handling of certain special instructions that appear on
all architectures easier and uniform.

LIVE and KILL say that a list of registers are live or killed at the
program point where they appear. No spill code is generated when an
element of the 'regs' field is spilled, but the register is moved to
the 'spilled' (which is present, more for debugging than anything else).

LIVE replaces the (now deprecated) DEFFREG instruction on the alpha.
We used to generate:

	DEFFREG f1
	f1 := f2 + f3
        trapb

but now generate:

	f1 := f2 + f3
	trapb
	LIVE {regs=[f1,f2,f3], spilled=[]}

Furthermore, the DEFFREG (hack) required that all floating point instruction
use all registers mentioned in the instruction. Therefore f1 := f2 + f3,
defines f1 and uses [f1,f2,f3]! This hack is no longer required resulting
in a cleaner alpha implementation. (Hopefully, intel will not get rid of
this architecture).

COPYXXX is intended to replace the parallel COPY and FCOPY  available on
all the architectures. This will result in further simplification of the
register allocator that must be aware of them for coalescing purposes, and
will also simplify certain aspects of the machine description that provides
callbacks related to parallel copies.

ANNOTATION should be obvious, and now INSTR represents the honest to God
machine instruction set!

The <arch>/instructions/<arch>Instr.sml files define certain utility
functions for making porting easier -- essentially converting upper case
to lower case. All machine instructions (of type instr) are in upper case,
and the lower case form generates an MLRISC instruction. For example on
the alpha we have:

  datatype instr =
     LDA of {r:cell, b:cell, d:operand}
   | ...

  val lda : {r:cell, b:cell, d:operand} -> instruction
    ...

where lda is just (INSTR o LDA), etc.
(* NOTE on xchg on the x86
 *
 * From Allen Leung:
 * Here's why I didn't use xchg: 
 * 
 * o  According to the optimization guide xchg mem, reg is complex,
 *    cannot be pipelined or paired at all. xchg reg, reg requires 3 uops.
 *    In contrast, mov mem, reg requires 1 or 2 uops.  
 *    So xchgs loses out, at least on paper.  
 *    [I haven't done any measurements though] 
 * 
 * o  Secondly, unlike other architectures, parallel copies are split 
 *    into individual copies during instruction selection.  Here's why
 *    I did this:  I found that more copies are retained and more spills 
 *    are generated when keeping the parallel copies.   My guess on this is 
 *    that the copy temporary for parallel copies create addition 
 *    interferences [even when they are not needed.]  
 *    This is not a problem on RISC machines, because of plentiful registers.
 *   
 * o  Spilling of parallel copies is also a very complex business when
 *    memory coalescing is turned on.  I think I have implemented a solution
 *    to this, but not using parallel copies keep life simple.   This problem
 *    could be simpler with xchg...but I haven't thought about it much.
 * 
 * From Fermin Reig:
 * In the java@gcc.gnu.org, GC  mailing lists there's been a discussion about
 * the costs of xcgh. Here's some extracts of it:
 * 
 * ----------------
 * > From: Emery Berger [mailto:emery@cs.utexas.edu]
 * > 
 * > http://developer.intel.com/design/pentium4/manuals/24547203.pdf
 * > 
 * > See Chapter 7.1. "For the P6 family processors, locked 
 * > operations serialize
 * > all outstanding load and store operations (that is, wait for them to
 * > complete). This rule is also true for the Pentium 4 
 * > processor, with one
 * > exception: load operations that reference weakly ordered 
 * > memory types (such
 * > as the WC memory type) may not be serialized. "
 * > 
 * -----------------
 * I just tried this on a 500 MHz Pentium III.  I get about 23 cycles for
 *  
 * lock; cmpxchg
 *  
 * :
 * and about 19 or 20 cycles for xchg (which has an implicit lock prefix).
 *  
 * I got consistent results by timing a loop and by looking at an instruction
 * level profile.  Putting other stuff in the loop didn't seem to affect the
 * time taken by xchg much.  Here's the code in case someone else wants to try.
 * (This requires Linux/gcc)
 * -------------------
 * Chris Dodd pointed out on the GC mailing list that on recent Intel X86
 * processors:
 *  
 * - cmpxchg without a lock prefix is much faster (roughly 3x or close to 15
 * cycles by my measurements) than either xchg (implied lock prefix) or lock;
 * cmpxchg .
 *  
 * - cmpxchg without the lock prefix is atomic on uniprocessors, i.e. it's not
 * interruptable.
 *  
 * As far as I can tell, none of the GNU libraries currently take advantage of
 * this fact.  Should they?
 *  
 * This argues, for example, that I could get noticable additional speedup from
 * Java hash synchronization on X86 by overwriting a few strategic "lock"
 * prefixes with "nop"s when I notice that there's only one processor
 *
 *
 * From John Reppy:
 *
 * Disregard what I said.  The xchg instruction has an implicit lock prefix,
 * so it is not useful for normal programming tasks.
 *)

functor X86Shuffle(I : X86INSTR) : X86SHUFFLE =
struct
  structure I = I
  structure C = I.C
  structure CB = CellsBasis
  structure Shuffle = Shuffle(I)

  type t = {tmp:I.ea option, dst:CB.cell list, src:CB.cell list}

  exception foo
  val shuffle =
    Shuffle.shuffle
        {mvInstr=fn{dst, src} => [I.move{mvOp=I.MOVL, src=src, dst=dst}],
	 ea=I.Direct}

  (*
   * These assume that the ''registers'' are mapped onto the memory
   *)

  (* Note, this only works with double precision floating point *) 
  val shufflefpNormalAndSlow = 
    Shuffle.shuffle
        {mvInstr=fn{dst, src} => [I.fldl src, I.fstpl dst],
	 ea = I.FDirect}

  (* 
   * This version makes use of the x86 floating point stack for hardware
   * renaming! 
   *)
  fun shufflefpNormal{tmp, src, dst} = let
    val n =  length src
  in 
    if n <= 7 then let 
	fun gen(s::ss, d::ds, pushes, pops) = 
  	    if CB.sameColor(s,d) then gen(ss, ds, pushes, pops)
	    else 
	      gen(ss, ds, 
		    I.fldl(I.FDirect s)::pushes,
		    I.fstpl(I.FDirect d)::pops)
	  | gen(_, _, pushes, pops) = List.revAppend(pushes, pops)
      in  gen(src, dst, [], []) 
      end
    else shufflefpNormalAndSlow{tmp=tmp, src=src, dst=dst}
  end

  (*
   * These assume that the ''registers'' are mapped onto the pseudo 
   * %fpr register.  Only works with double precision floating point for 
   * now...
   *)
  val shufflefpFast = 
       Shuffle.shuffle
         {mvInstr=fn{dst, src} => [I.fmove{fsize=I.FP64,src=src, dst=dst}],
	  ea = I.FPR}

  fun shufflefp(x as {tmp=SOME(I.FPR _), ...}) = shufflefpFast x
    | shufflefp x = shufflefpNormal x

end


root@smlnj-gforge.cs.uchicago.edu
ViewVC Help
Powered by ViewVC 1.0.0