public inbox for cgen@sourceware.org
* exposed pipeline patch (long!)
@ 2003-01-09  3:27 Ben Elliston
From: Ben Elliston @ 2003-01-09  3:27 UTC
  To: cgen

I'm posting this patch on behalf of Graydon Hoare, who wrote this
exposed pipeline support last year.  It's a more generalised form of
the (delay ..) rtx and has been used for a couple of ports already.

Rather than just commit it, I thought I would post it for review.
Okay to commit?

Ben
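
For readers new to the idea, here is a minimal sketch of how delayed
writes can be queued in a ring of per-tick stacks and looked up before
they land. It is not the code this patch generates: the names
PendingWrite, DelayedWrites, schedule, lookahead and writeback are
hypothetical, registers are modelled as a plain vector of int, and
ticks are assumed to be non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of one delayed register write.
struct PendingWrite {
    int index;  // register number (assumption: a single index dimension)
    int value;  // value to store once the delay has elapsed
};

// Ring of max_delay + 1 slots; slot (tick % pipe_sz) holds the writes
// that become visible at that tick.
class DelayedWrites {
public:
    explicit DelayedWrites(int max_delay)
        : pipe_sz_(max_delay + 1),
          slots_(static_cast<std::size_t>(max_delay + 1)) {}

    // Queue a write issued at `tick` that lands `delay` ticks later.
    void schedule(int tick, int delay, int index, int value) {
        assert(delay >= 0 && delay < pipe_sz_);
        slots_[slot(tick + delay)].push_back({index, value});
    }

    // Return the value of the farthest-out pending write to `index`
    // that lands within `dist` ticks of `tick`, or `fallback` if none.
    int lookahead(int tick, int dist, int index, int fallback) const {
        for (int d = dist; d > 0; --d) {
            const std::vector<PendingWrite>& s = slots_[slot(tick + d)];
            // Scan newest to oldest, including the entry at position 0.
            for (auto it = s.rbegin(); it != s.rend(); ++it)
                if (it->index == index) return it->value;
        }
        return fallback;
    }

    // Apply and discard every write scheduled to land at `tick`.
    // Entries are popped newest first, so the oldest one is applied last.
    void writeback(int tick, std::vector<int>& regs) {
        std::vector<PendingWrite>& s = slots_[slot(tick)];
        while (!s.empty()) {
            regs[static_cast<std::size_t>(s.back().index)] = s.back().value;
            s.pop_back();
        }
    }

private:
    std::size_t slot(int t) const {
        return static_cast<std::size_t>(t % pipe_sz_);
    }

    int pipe_sz_;
    std::vector<std::vector<PendingWrite>> slots_;
};
```

In this model a write scheduled at tick 0 with a delay of 2 is invisible
to writeback at ticks 0 and 1, is visible to a lookahead of distance 2
from tick 0, and reaches the register file at tick 2.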

2001-06-05  graydon hoare  <graydon@redhat.com>

        * utils.scm (foldl): Define.
        (foldr): Define.
        (union): Define.
        (intersection): Simplify.
        * sid.scm: Set APPLICATION to SID-SIMULATOR.
        (-op-gen-delayed-set-maybe-trace): Define.
        (<operand> 'gen-set-{quiet,trace}): Delegate to
        op-gen-delayed-set-quiet etc. Note: this is still a little tangled
        up and needs cleaning.
        (-with-parallel?): Hardwire with-parallel to #t.
        (<operand> 'cxmake-get): Replace with lookahead-aware code.
        * sid-decode.scm: Remove per-insn writeback fns.
        (-gen-idesc-decls): Redefine sem_fn type.
        * sid-cpu.scm (gen-write-stack-structure): Replace parexec stuff
        with write stack stuff.
        (cgen-write.cxx): Replace per-insn writebacks with single write
        stack writeback. Add write stack reset function.
        (-gen-scache-semantic-fn insn): Replace parexec stuff with write
        stack stuff.
        * rtl-c.scm (xop): Clone operand into delayed operand if #:delayed
        estate attribute set.
        (delay): Set #:delayed attribute to calculated delay, update
        maximum delay of cpu, check (delay ...) usage.
        * operand.scm (<operand>): Add delayed slot to <operand>.
        * mach.scm (<cpu>): Add max-delay slot to <cpu>.
        * dev.scm (load-sid): Set APPLICATION to SID-SIMULATOR.
        * doc/rtl.texi (Expressions): Add section on (delay ...).
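
The fold-based union and intersection listed for utils.scm build their
result by consing onto the front of the accumulator, so element order
differs from the recursive intersection they replace. The sketch below
mirrors that behaviour in C++ as an illustration only: std::vector<int>
stands in for a Scheme list, std::accumulate plays the role of foldl,
and the helper names cons, memq, intersection and list_union are
hypothetical (union is a reserved word in C++).

```cpp
#include <algorithm>
#include <numeric>
#include <utility>
#include <vector>

using List = std::vector<int>;

// Prepend e to l, like Scheme's cons on a list.
List cons(int e, List l) {
    l.insert(l.begin(), e);
    return l;
}

// True if e occurs in l, like a boolean memq.
bool memq(int e, const List& l) {
    return std::find(l.begin(), l.end(), e) != l.end();
}

// Left fold over b, keeping the elements that also occur in a.
List intersection(const List& a, const List& b) {
    return std::accumulate(b.begin(), b.end(), List{},
        [&a](List acc, int e) {
            return memq(e, a) ? cons(e, std::move(acc)) : acc;
        });
}

// Left fold over b starting from a, adding elements not yet present.
List list_union(const List& a, const List& b) {
    return std::accumulate(b.begin(), b.end(), a,
        [](List acc, int e) {
            return memq(e, acc) ? acc : cons(e, std::move(acc));
        });
}
```

With a = {1, 2, 3} and b = {2, 3, 4}, this intersection yields {3, 2}
and this union yields {4, 1, 2, 3}. Callers that treat the results as
sets are unaffected, but any caller relying on the order produced by
the old intersection would see a change.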

Index: utils.scm
===================================================================
RCS file: /cvs/src/src/cgen/utils.scm,v
retrieving revision 1.7
diff -u -p -r1.7 utils.scm
--- utils.scm	7 Jan 2002 08:23:59 -0000	1.7
+++ utils.scm	9 Jan 2003 03:22:12 -0000
@@ -78,6 +78,10 @@
 
 (define (spaces n) (make-string n #\space))
 
+; simple list-generators
+(define (seq p q) (if (> p q) '() (cons p (seq (+ p 1) q))))
+(define (fill x n) (if (> n 0) (cons x (fill x (- n 1))) '()))
+
 ; Write N spaces to PORT, or the current output port if elided.
 
 (define (write-spaces n . port)
@@ -471,6 +475,17 @@
   (reverse! (list-drop n (reverse l)))
 )
 
+;; left fold
+(define (foldl kons accum lis) 
+  (if (null? lis) accum 
+      (foldl kons (kons accum (car lis)) (cdr lis))))
+
+;; right fold
+(define (foldr kons knil lis) 
+  (if (null? lis) knil 
+      (kons (car lis) (foldr kons knil (cdr lis)))))
+
+
 ; APL's +\ operation on a vector of numbers.
 
 (define (plus-scan l)
@@ -540,12 +555,13 @@
 
 ; Return intersection of two lists.
 
-(define (intersection l1 l2)
-  (cond ((null? l1) l1)
-	((null? l2) l2)
-	((memq (car l1) l2) (cons (car l1) (intersection (cdr l1) l2)))
-	(else (intersection (cdr l1) l2)))
-)
+(define (intersection a b) 
+  (foldl (lambda (l e) (if (memq e a) (cons e l) l)) '() b))
+
+; Return union of two lists.
+
+(define (union a b) 
+  (foldl (lambda (l e) (if (memq e l) l (cons e l))) a b))
 
 ; Return a count of the number of elements of list L1 that are in list L2.
 ; Uses memq.
Index: sid.scm
===================================================================
RCS file: /cvs/src/src/cgen/sid.scm,v
retrieving revision 1.7
diff -u -p -r1.7 sid.scm
--- sid.scm	7 Jan 2002 08:23:59 -0000	1.7
+++ sid.scm	9 Jan 2003 03:22:18 -0000
@@ -10,7 +10,7 @@
 ; [It still does but that's to be fixed.]
 
 ; Specify which application.
-(set! APPLICATION 'SIMULATOR)
+(set! APPLICATION 'SID-SIMULATOR)
 
 ; Misc. state info.
 
@@ -118,7 +118,7 @@
 ; While processing operand reading (or writing), parallel execution support
 ; needs to be turned off, so it is up to the appropriate cgen-foo.c proc to
 ; set-with-parallel?! appropriately.
-(define -with-parallel? #f)
+(define -with-parallel? #t)
 (define (with-parallel?) -with-parallel?)
 (define (set-with-parallel?! flag) (set! -with-parallel? flag))
 
@@ -924,43 +924,6 @@
 	 (rtl-c++ INT yes? nil #:rtl-cover-fns? #t)))
 )
 
-; For parallel write post-processing, we don't want to defer setting the pc.
-; ??? Not sure anymore.
-;(method-make!
-; <pc> 'gen-set-quiet
-; (lambda (self estate mode index selector newval)
-;   (-op-gen-set-quiet self estate mode index selector newval)))
-;(method-make!
-; <pc> 'gen-set-trace
-; (lambda (self estate mode index selector newval)
-;   (-op-gen-set-trace self estate mode index selector newval)))
-
-; Name of C macro to access parallel execution operand support.
-
-(define -par-operand-macro "OPRND")
-
-; Return C code to fetch an operand's value and save it away for the
-; semantic handler.  This is used to handle parallel execution of several
-; instructions where all inputs of all insns are read before any outputs are
-; written.
-; For operands, the word `read' is only used in this context.
-
-(define (op:read op sfmt)
-  (let ((estate (estate-make-for-normal-rtl-c++ nil nil)))
-    (send op 'gen-read estate sfmt -par-operand-macro))
-)
-
-; Return C code to write an operand's value.
-; This is used to handle parallel execution of several instructions where all
-; outputs are written to temporary spots first, and then a final
-; post-processing pass is run to update cpu state.
-; For operands, the word `write' is only used in this context.
-
-(define (op:write op sfmt)
-  (let ((estate (estate-make-for-normal-rtl-c++ nil nil)))
-    (send op 'gen-write estate sfmt -par-operand-macro))
-)
-
 ; Default gen-read method.
 ; This is used to help support targets with parallel insns.
 ; Either this or gen-write (but not both) is used.
@@ -1010,36 +973,46 @@
 (method-make!
  <operand> 'cxmake-get
  (lambda (self estate mode index selector)
-   (let ((mode (if (mode:eq? 'DFLT mode)
-		   (send self 'get-mode)
-		   mode))
-	 (index (if index index (op:index self)))
-	 (selector (if selector selector (op:selector self))))
-     ; If the object is marked with the RAW attribute, access the hardware
-     ; object directly.
+   (let* ((mode (if (mode:eq? 'DFLT mode)
+		    (send self 'get-mode)
+		    mode))
+	  (hw (op:type self))
+	  (index (if index index (op:index self)))
+	  (selector (if selector selector (op:selector self)))
+	  (delayval (op:delay self))
+	  (md (mode:c-type mode))
+	  (name (if 
+		 (eq? (obj:name hw) 'h-memory)
+		 (string-append md "_memory")
+		 (gen-c-symbol (obj:name hw))))
+	  (getter (op:getter self))
+	  (def-val (cond ((obj-has-attr? self 'RAW)
+			  (send hw 'cxmake-get-raw estate mode index selector))
+			 (getter
+			  (let ((args (car getter))
+				(expr (cadr getter)))
+			    (rtl-c-expr mode expr
+					(if (= (length args) 0) nil
+					    (list (list (car args) 'UINT index)))
+					#:rtl-cover-fns? #t
+					#:output-language (estate-output-language estate))))
+			 (else
+			  (send hw 'cxmake-get estate mode index selector)))))
+     
      (logit 4 "<operand> cxmake-get self=" (obj:name self) " mode=" (obj:name mode)
 	    " index=" (obj:name index) " selector=" selector "\n")
-     (cond ((obj-has-attr? self 'RAW)
-	    (send (op:type self) 'cxmake-get-raw estate mode index selector))
-	   ; If the instruction could be parallely executed with others and
-	   ; we're doing read pre-processing, the operand has already been
-	   ; fetched, we just have to grab the cached value.
-	   ((with-parallel-read?)
-	    (cx:make-with-atlist mode
-				 (string-append -par-operand-macro
-						" (" (gen-sym self) ")")
-				 nil)) ; FIXME: want CACHED attr if present
-	   ((op:getter self)
-	    (let ((args (car (op:getter self)))
-		  (expr (cadr (op:getter self))))
-	      (rtl-c-expr mode expr
-			  (if (= (length args) 0)
-			      nil
-			      (list (list (car args) 'UINT index)))
-			  #:rtl-cover-fns? #t
-			  #:output-language (estate-output-language estate))))
-	   (else
-	    (send (op:type self) 'cxmake-get estate mode index selector)))))
+     
+     (if delayval
+	 (if (derived-operand? self)
+	     (error "delayed derived operands currently unsupported: " self)
+	     (let ((idx (if index (string-append ", " (-gen-hw-index index estate)) "")))	   
+	       (cx:make mode (string-append "lookahead ("
+					    (number->string delayval)
+					    ", tick, " 
+					    "buf." name "_writes, " 
+					    (cx:c def-val) 
+					    idx ")"))))
+	 def-val)))
 )
 
 
@@ -1049,16 +1022,9 @@
   (send (op:type op) 'gen-set-quiet estate mode index selector newval)
 )
 
-(define (-op-gen-set-quiet-parallel op estate mode index selector newval)
-  (string-append
-   (if (op-save-index? op)
-       (string-append "    " -par-operand-macro " (" (-op-index-name op) ")"
-		      " = " (-gen-hw-index index estate) ";\n")
-       "")
-   "    "
-   -par-operand-macro " (" (gen-sym op) ")"
-   " = " (cx:c newval) ";\n")
-)
+(define (-op-gen-delayed-set-quiet op estate mode index selector newval)
+  (-op-gen-delayed-set-maybe-trace op estate mode index selector newval #f))
+
 
 (define (-op-gen-set-trace op estate mode index selector newval)
   (string-append
@@ -1079,12 +1045,7 @@
        ;else
        (send (op:type op) 'gen-set-quiet estate mode index selector
 		(cx:make-with-atlist mode "opval" (cx:atlist newval))))
-   (if (and (with-profile?)
-	    (op:cond? op))
-       (string-append "    written |= (1ULL << "
-		      (number->string (op:num op))
-		      ");\n")
-       "")
+   
 ; TRACE_RESULT_<MODE> (cpu, abuf, hwnum, opnum, value);
 ; For each insn record array of operand numbers [or indices into
 ; operand instance table].
@@ -1122,21 +1083,41 @@
    "  }\n")
 )
 
-(define (-op-gen-set-trace-parallel op estate mode index selector newval)
-  (string-append
-   "  {\n"
-   "    " (mode:c-type mode) " opval = " (cx:c newval) ";\n"
-   (if (op-save-index? op)
-       (string-append "    " -par-operand-macro " (" (-op-index-name op) ")"
-		      " = " (-gen-hw-index index estate) ";\n")
-       "")
-   "    " -par-operand-macro " (" (gen-sym op) ")"
-   " = opval;\n"
-   (if (op:cond? op)
-       (string-append "    written |= (1ULL << "
-		      (number->string (op:num op))
-		      ");\n")
-       "")
+(define (-op-gen-delayed-set-trace op estate mode index selector newval)
+  (-op-gen-delayed-set-maybe-trace op estate mode index selector newval #t))
+
+(define (-op-gen-delayed-set-maybe-trace op estate mode index selector newval do-trace?)
+  (let* ((pad "    ")
+	 (hw (op:type op))
+	 (delayval (op:delay op))
+	 (md (mode:c-type mode))
+	 (name (if 
+		(eq? (obj:name hw) 'h-memory)
+		(string-append md "_memory")
+		(gen-c-symbol (obj:name hw))))
+	 (val (cx:c newval))
+	 (idx (if index (-gen-hw-index index estate) ""))
+	 (idx-args (if (equal? idx "") "" (string-append ", " idx)))
+	 )
+    
+    (string-append
+     "  {\n"
+
+     (if delayval 
+
+	 ;; delayed write: push it to the appropriate buffer
+	 (string-append	    
+	  pad md " opval = " val ";\n"
+	  pad "buf." name "_writes [(tick + " (number->string delayval)
+	  ") % @prefix@::pipe_sz].push (@prefix@::write<" md ">(pc, opval" idx-args "));\n")
+
+	 ;; else, uh, we should never have been called!
+	 (error "-op-gen-delayed-set-maybe-trace called on non-delayed operand"))       
+     
+     
+     (if do-trace?
+
+	 (string-append
 ; TRACE_RESULT_<MODE> (cpu, abuf, hwnum, opnum, value);
 ; For each insn record array of operand numbers [or indices into
 ; operand instance table].
@@ -1169,8 +1150,8 @@
 	   ""))
    "opval << dec << \"  \";\n"
    "  }\n")
-)
-
+	 ;; else no tracing is emitted
+	 ""))))
 
 ; Return C code to set the value of an operand.
 ; NEWVAL is a <c-expr> object of the value to store.
@@ -1189,8 +1170,8 @@
 	 (selector (if selector selector (op:selector self))))
      (cond ((obj-has-attr? self 'RAW)
 	    (send (op:type self) 'gen-set-quiet-raw estate mode index selector newval))
-	   ((with-parallel-write?)
-	    (-op-gen-set-quiet-parallel self estate mode index selector newval))
+	   ((op:delay self)
+	    (-op-gen-delayed-set-quiet self estate mode index selector newval))
 	   (else
 	    (-op-gen-set-quiet self estate mode index selector newval)))))
 )
@@ -1212,26 +1193,12 @@
 	 (selector (if selector selector (op:selector self))))
      (cond ((obj-has-attr? self 'RAW)
 	    (send (op:type self) 'gen-set-quiet-raw estate mode index selector newval))
-	   ((with-parallel-write?)
-	    (-op-gen-set-trace-parallel self estate mode index selector newval))
+	   ((op:delay self)
+	    (-op-gen-delayed-set-trace self estate mode index selector newval))
 	   (else
 	    (-op-gen-set-trace self estate mode index selector newval)))))
 )
 
-; Define and undefine C macros to tuck away details of instruction format used
-; in the parallel execution functions.  See gen-define-field-macro for a
-; similar thing done for extraction/semantic functions.
-
-(define (gen-define-parallel-operand-macro sfmt)
-  (string-append "#define " -par-operand-macro "(f) "
-		 "par_exec->operands."
-		 (gen-sym sfmt)
-		 ".f\n")
-)
-
-(define (gen-undef-parallel-operand-macro sfmt)
-  (string-append "#undef " -par-operand-macro "\n")
-)
 \f
 ; Operand profiling and parallel execution support.
 
Index: sid-decode.scm
===================================================================
RCS file: /cvs/src/src/cgen/sid-decode.scm,v
retrieving revision 1.8
diff -u -p -r1.8 sid-decode.scm
--- sid-decode.scm	7 Feb 2002 18:46:19 -0000	1.8
+++ sid-decode.scm	9 Jan 2003 03:22:18 -0000
@@ -47,10 +47,7 @@ bool @prefix@_idesc::idesc_table_initial
 	       (if pbb?
 		   "0, "
 		   (string-append (-gen-sem-fn-name insn) ", "))
-	       "")
-           (if (with-parallel?)
-               (string-append (-gen-write-fn-name sfmt) ", ")
-               "")
+	       "") 
 	   "\"" (string-upcase name) "\", "
 	   (gen-cpu-insn-enum (current-cpu) insn)
 	   ", "
@@ -131,25 +128,6 @@ bool @prefix@_idesc::idesc_table_initial
 )
 
 
-;; and the same for writeback functions
-
-(define (-gen-write-fn-name sfmt)
-  (string-append "@prefix@_write_" (gen-sym sfmt))
-)
-
-
-(define (-gen-write-fn-decls)
-  (string-write
-   "// Decls of each writeback fn.\n\n"
-   "using @cpu@::@prefix@_write_fn;\n"
-   (string-list-map (lambda (sfmt)
-		      (string-list "extern @prefix@_write_fn "
-				   (-gen-write-fn-name sfmt)
-				   ";\n"))
-		    (current-sfmt-list))
-   "\n"
-   )
-)
 
 \f
 ; idesc, argbuf, and scache types
@@ -164,14 +142,9 @@ struct @cpu@_cpu;
 struct @prefix@_scache;
 "
    (if (with-parallel?)
-       "struct @prefix@_parexec;\n" "")
-   (if (with-parallel?)
-       "typedef void (@prefix@_sem_fn) (@cpu@_cpu* cpu, @prefix@_scache* sem, @prefix@_parexec* par_exec);"
+       "typedef void (@prefix@_sem_fn) (@cpu@_cpu* cpu, @prefix@_scache* sem, int tick, @prefix@::write_stacks &buf);"
        "typedef sem_status (@prefix@_sem_fn) (@cpu@_cpu* cpu, @prefix@_scache* sem);")
    "\n"
-   (if (with-parallel?)
-       "typedef sem_status (@prefix@_write_fn) (@cpu@_cpu* cpu, @prefix@_scache* sem, @prefix@_parexec* par_exec);"
-       "")
    "\n"   
 "
 // Instruction descriptor.
@@ -192,12 +165,6 @@ struct @prefix@_idesc {
   @prefix@_sem_fn* execute;\n\n"
        "")
 
-   (if (with-parallel?)
-       "\
-  // scache write executor for this insn
-  @prefix@_write_fn* writeback;\n\n"
-       "")
-
    "\
   const char* insn_name;
   enum @prefix@_insn_type sem_index;
@@ -300,15 +267,6 @@ struct @prefix@_scache {
   // argument buffer
   @prefix@_sem_fields fields;
 
-" (if (or (with-profile?) (with-parallel-write?))
-      (string-append "
-  // writeback flags
-  // Only used if profiling or parallel execution support enabled during
-  // file generation.
-  unsigned long long written;
-")
-      "") "
-
   // decode given instruction
   void decode (@cpu@_cpu* current_cpu, PCADDR pc, @prefix@_insn_word base_insn, @prefix@_insn_word entire_insn);
 };
@@ -718,6 +676,11 @@ void
 #ifndef @PREFIX@_DECODE_H
 #define @PREFIX@_DECODE_H
 
+namespace @prefix@ {
+// forward declaration of struct in -defs.h
+struct write_stacks;
+}
+
 namespace @cpu@ {
 
 using namespace cgen;
@@ -739,10 +702,6 @@ typedef UINT @prefix@_insn_word;
    ; There's no pressing need for it though.
    (if (with-scache?)
        -gen-sem-fn-decls
-       "")
-
-   (if (with-parallel?)
-       -gen-write-fn-decls
        "")
 
    "\
Index: sid-cpu.scm
===================================================================
RCS file: /cvs/src/src/cgen/sid-cpu.scm,v
retrieving revision 1.7
diff -u -p -r1.7 sid-cpu.scm
--- sid-cpu.scm	7 Feb 2002 18:46:19 -0000	1.7
+++ sid-cpu.scm	9 Jan 2003 03:22:23 -0000
@@ -199,6 +199,34 @@ namespace @arch@ {
    (-gen-hardware-struct #f (find hw-need-storage? (current-hw-list))))
 )
 
+(define (-gen-hw-stream-and-destream-fns) 
+  (let* ((sa string-append)
+	 (regs (find hw-need-storage? (current-hw-list)))
+	 (reg-dim (lambda (r) 
+		    (let ((dims (-hw-vector-dims r)))
+		      (if (equal? 0 (length dims)) 
+			  "0"
+			  (number->string (car dims))))))
+	 (stream-reg (lambda (r) 
+		       (let ((rname (sa "hardware." (gen-c-symbol (obj:name r)))))
+			 (if (hw-scalar? r)
+			     (sa "    ost << " rname " << ' ';\n")
+			     (sa "    for (int i = 0; i < " (reg-dim r) 
+				 "; i++)\n      ost << " rname "[i] << ' ';\n")))))
+	 (destream-reg (lambda (r) 
+			 (let ((rname (sa "hardware." (gen-c-symbol (obj:name r)))))
+			   (if (hw-scalar? r)
+			       (sa "    ist >> " rname ";\n")
+			       (sa "    for (int i = 0; i < " (reg-dim r) 
+				   "; i++)\n      ist >> " rname "[i];\n"))))))
+    (sa
+     "  void stream_cgen_hardware (std::ostream &ost) const \n  {\n"
+     (string-map stream-reg regs)
+     "  }\n"
+     "  void destream_cgen_hardware (std::istream &ist) \n  {\n"
+     (string-map destream-reg regs)
+     "  }\n")))
+
 ; Generate <cpu>-cpu.h
 
 (define (cgen-cpu.h)
@@ -222,6 +250,8 @@ public:
 
    -gen-hardware-types
 
+   -gen-hw-stream-and-destream-fns
+
    "  // C++ register access function templates\n"
    "#define current_cpu this\n\n"
    (lambda ()
@@ -295,68 +325,161 @@ typedef struct {
    )
 )
 
-; Utility of gen-parallel-exec-type to generate the definition of one
-; structure in PAREXEC.
-; SFMT is an <sformat> object.
 
-(define (gen-parallel-exec-elm sfmt)
-  (string-append
-   "    struct { /* " (obj:comment sfmt) " */\n"
-   (let ((sem-ops
-	  ((if (with-parallel-write?) sfmt-out-ops sfmt-in-ops) sfmt)))
-     (if (null? sem-ops)
-	 "      int empty;\n"
-	 (string-map
-	  (lambda (op)
-	    (logit 2 "Processing operand " (obj:name op) " of format "
-		   (obj:name sfmt) " ...\n")
-	      (if (with-parallel-write?)
-		  (let ((index-type (and (op-save-index? op)
-					 (gen-index-type op sfmt))))
-		    (string-append "      " (gen-type op)
-				   " " (gen-sym op) ";\n"
-				   (if index-type
-				       (string-append "      " index-type 
-						      " " (gen-sym op) "_idx;\n")
-				       "")))
-		  (string-append "      "
-				 (gen-type op)
-				 " "
-				 (gen-sym op)
-				 ";\n")))
-	  sem-ops)))
-   "    } " (gen-sym sfmt) ";\n"
-   )
-)
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;; begin stack-based write schedule
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(define useful-mode-names '(BI QI HI SI DI UQI UHI USI UDI SF DF))
+
+;(define (-calculated-memory-write-buffer-size)
+;  (let* ((is-mem? (lambda (op) (eq? (hw-sem-name (op:type op)) 'h-memory)))
+;	 (count-mem-writes
+;	  (lambda (sfmt) (length (find is-mem? (sfmt-out-ops sfmt))))))
+;    (apply max (append '(0) (map count-mem-writes (current-sfmt-list))))))
+
+
+;; note: this doesn't really correctly approximate the worst case. user-supplied functions
+;; might rewrite the pipeline extensively while it's running. 
+;(define (-worst-case-number-of-writes-to hw-name)
+;  (let* ((sfmts (current-sfmt-list))
+;	 (out-ops (map sfmt-out-ops sfmts))
+;	 (pred (lambda (op) (equal? hw-name (gen-c-symbol (obj:name (op:type op))))))
+;	 (filtered-ops (map (lambda (ops) (find pred ops)) out-ops)))
+;    (apply max (cons 0 (map (lambda (ops) (length ops)) filtered-ops)))))
+	 
+(define (-hw-gen-write-stack-decl nm mode)
+  (let* (
+; for the time being, we're disabling this size-estimation stuff and just
+; requiring the user to supply a parameter WRITE_BUF_SZ before they include -defs.h
+;	 (pipe-sz (+ 1 (max-delay (cpu-max-delay (current-cpu)))))
+;	 (sz (* pipe-sz (-worst-case-number-of-writes-to nm))))
+	 
+	 (mode-pad (spaces (- 4 (string-length mode))))
+	 (stack-name (string-append nm "_writes")))
+    (string-append
+     "  write_stack< write<" mode "> >" mode-pad "\t" stack-name "\t[pipe_sz];\n")))
+
+
+(define (-hw-gen-write-struct-decl)
+  (let* ((dims (-worst-case-index-dims))
+	 (sa string-append)
+	 (ns number->string)
+	 (idxs (seq 0 (- dims 1)))
+	 (ctor (sa "write (PCADDR _pc, MODE _val"
+		   (string-map (lambda (x) (sa ", USI _idx" (ns x) "=0")) idxs)
+		   ") : pc(_pc), val(_val)"
+		   (string-map (lambda (x) (sa ", idx" (ns x) "(_idx" (ns x) ")")) idxs)
+		   " {} \n"))
+	 (idx-fields (string-map (lambda (x) (sa "    USI idx" (ns x) ";\n")) idxs)))
+    (sa
+     "\n\n"
+     "  template <typename MODE>\n"
+     "  struct write\n"
+     "  {\n"
+     "    USI pc;\n"
+     "    MODE val;\n"
+     idx-fields
+     "    " ctor 
+     "    write() {}\n"
+     "  };\n" )))
+	       
+(define (-hw-vector-dims hw) (elm-get (hw-type hw) 'dimensions))			    
+(define (-worst-case-index-dims)
+  (apply max
+	 (append '(1) ; for memory accesses
+		 (map (lambda (hw) (length (-hw-vector-dims hw))) 
+		      (find (lambda (hw) (not (scalar? hw))) (current-hw-list))))))
+
+(define (-gen-writestacks)
+  (let* ((hw (find register? (current-hw-list)))
+	 (modes useful-mode-names) 
+	 (hw-pairs (map (lambda (h) (list (gen-c-symbol (obj:name h))
+					    (obj:name (hw-mode h)))) 
+			hw))
+	 (mem-pairs (map (lambda (m) (list (string-append m "_memory") m)) 
+			 modes))
+	 (all-pairs (append mem-pairs hw-pairs))
+
+	 (h1 "\n\n// write stacks used in parallel execution\n\n  struct write_stacks\n  {\n  // types of stacks\n\n")
+	 (wb (string-append
+	      "\n\n  // unified writeback function (defined in @prefix@-write.cc)"
+	        "\n  void writeback (int tick, @cpu@::@cpu@_cpu* current_cpu);"
+		"\n  // unified write-stack clearing function (defined in @prefix@-write.cc)"
+	        "\n  void reset ();"))
+	 (zz "\n\n  }; // end struct @prefix@::write_stacks \n\n")
+	 (st (string-append 
+	      "  std::ostream &operator<< (std::ostream &ost, const @prefix@::write_stacks &s);\n"
+	      "  std::istream &operator>> (std::istream &ist, @prefix@::write_stacks &s);\n"))
+	 )
+    (string-append	
+     (-hw-gen-write-struct-decl)
+     (foldl (lambda (s pair) (string-append s (apply -hw-gen-write-stack-decl pair))) h1 all-pairs)	  
+     wb
+     zz
+     st)))
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;; end stack-based write schedule
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+	  
 
 ; Generate the definition of the structure that holds register values, etc.
-; for use during parallel execution.  When instructions are executed parallelly
-; either
-; - their inputs are read before their outputs are written.  Thus we have to
-; fetch the input values of several instructions before executing any of them.
-; - or their outputs are queued here first and then written out after all insns
-; have executed.
-; The fetched/queued values are stored in an array of PAREXEC structs, one
-; element per instruction.
+; for use during parallel execution.  
 
-(define (gen-parallel-exec-type)
-  (logit 2 "Generating PAREXEC type ...\n")
-  (string-append
-   (if (with-parallel-write?)
-       "/* Queued output values of an instruction.  */\n"
-       "/* Fetched input values of an instruction.  */\n")
-   "\
+(define (gen-write-stack-structure)
+  (let (;(membuf-sz (-calculated-memory-write-buffer-size))
+	(max-delay (cpu-max-delay (current-cpu))))
+    (logit 2 "Generating write stack structure ...\n")
+    (string-append
+     "  static const int max_delay = "   
+     (number->string max-delay) ";\n"
+     "  static const int pipe_sz = "     
+     (number->string (+ 1 max-delay)) "; // max_delay + 1\n"
 
-struct @prefix@_parexec {
-  union {\n"
-   (string-map gen-parallel-exec-elm (current-sfmt-list))
-   "\
-  } operands;
-  /* For conditionally written operands, bitmask of which ones were.  */
-  unsigned long long written;
-};\n\n"
-   )
-)
+"
+#ifndef WRITE_BUF_SZ
+#define WRITE_BUF_SZ 1
+#endif
+
+  template <typename ELT> 
+  struct write_stack 
+  {
+    int t;
+    const int sz;
+    ELT buf[WRITE_BUF_SZ];
+
+    write_stack       ()             : t(-1), sz(WRITE_BUF_SZ) {}
+    inline bool empty ()             { return (t == -1); }
+    inline void clear ()             { t = -1; }
+    inline void pop   ()             { assert (t > -1); t--;}
+    inline void push  (const ELT &e) { assert (t+1 < sz); buf [++t] = e;}
+    inline ELT &top   ()             { return buf [t>0 ? ( t<sz ? t : sz-1) : 0];}
+  };
+
+  // look ahead for latest write with index = idx, where time of write is
+  // <= dist steps from base (present) in write_stack array st.
+  // returning def if no scheduled write is found.
+
+  template <typename STKS, typename VAL>
+  inline VAL lookahead (int dist, int base, STKS &st, VAL def, int idx=0)
+  {
+    for (; dist > 0; --dist)
+    {
+      write_stack <VAL> &v = st [(base + dist) % pipe_sz];
+      for (int i = v.t; i > 0; --i) 
+	  if (v.buf [i].idx0 == idx) return v.buf [i];
+    }
+    return def;
+  }
+
+"
+ 
+     (-gen-writestacks)     
+     )))
 
 ; Generate the TRACE_RECORD struct definition.
 
@@ -375,16 +498,26 @@ typedef struct @prefix@_trace_record {
 
 ; Generate <cpu>-defs.h
 
+(define semantics-processed? #f)
+
 (define (cgen-defs.h)
   (logit 1 "Generating " (gen-cpu-name) " defs.h ...\n")
   (assert-keep-one)
-
+  
   ; Turn parallel execution support on if cpu needs it.
   (set-with-parallel?! (state-parallel-exec?))
 
   ; Initialize rtl->c generation.
   (rtl-c-config! #:rtl-cover-fns? #t)
 
+  (sim-analyze-insns!)
+
+  ; ensure semantic analysis has happened, in time
+  ; for the pipeline size to be calculated
+  (if (and (with-parallel?)
+	   (not semantics-processed?))
+      (error "defs.h must be generated after sem.cxx for parallel-execution type CPUs"))
+
   (string-write
    (gen-copyright "CPU family header for @cpu@ / @prefix@."
 		  copyright-red-hat package-red-hat-simulators)
@@ -392,15 +525,26 @@ typedef struct @prefix@_trace_record {
 #ifndef DEFS_@PREFIX@_H
 #define DEFS_@PREFIX@_H
 
+#include <stack>
+#include \"cgen-types.h\"
+
+// forward declaration\n\n  
 namespace @cpu@ {
+struct @cpu@_cpu;
+}
+
+namespace @prefix@ {
+
+using namespace cgen;
+
 \n"
 
    (if (with-parallel?)
-       gen-parallel-exec-type
-       "")
+       gen-write-stack-structure
+       "// no parallel-execution support\n")
 
    "\
-} // end @cpu@ namespace
+} // end @prefix@ namespace
 
 #endif /* DEFS_@PREFIX@_H */\n"
    )
@@ -417,47 +561,132 @@ namespace @cpu@ {
 ; Return C code to fetch and save all output operands to instructions with
 ; <sformat> SFMT.
 
-(define (-gen-write-args sfmt)
-  (string-map (lambda (op) (op:write op sfmt))
-	      (sfmt-out-ops sfmt))
-)
+; Generate <cpu>-write.cxx.
 
-; Utility of gen-write-fns to generate a writer function for <sformat> SFMT.
 
-(define (-gen-write-fn sfmt)
-  (logit 2 "Processing write function for \"" (obj:name sfmt) "\" ...\n")
-  (string-list
-   "\nsem_status\n"
-   (-gen-write-fn-name sfmt) " (@cpu@_cpu* current_cpu, @prefix@_scache* sem, @prefix@_parexec* par_exec)\n"
-   "{\n"
-   (if (with-scache?)
-       (gen-define-field-macro sfmt)
-       "")
-   (gen-define-parallel-operand-macro sfmt)
-   "  @prefix@_scache* abuf = sem;\n"
-   "  unsigned long long written = abuf->written;\n"
-   "  PCADDR pc = abuf->addr;\n"
-   "  PCADDR npc = 0; // dummy value for branches\n"
-   "  sem_status status = SEM_STATUS_NORMAL; // ditto\n"
-   "\n"
-   (-gen-write-args sfmt)
-   "\n"
-   "  return status;\n"
-   (gen-undef-parallel-operand-macro sfmt)
-   (if (with-scache?)
-       (gen-undef-field-macro sfmt)
-       "")
-   "}\n\n")
-)
+(define (-gen-register-writer nm mode dims)
+  (let* ((pad "    ")
+	 (sa string-append)
+	 (idx-args (string-map (lambda (x) (sa "w.idx" (number->string x) ", ")) 
+			       (seq 0 (- dims 1)))))
+    (sa pad "while (! " nm "_writes[tick].empty())\n"
+	pad "{\n"
+	pad "  write<" mode "> &w = " nm "_writes[tick].top();\n"
+	pad "  current_cpu->" nm "_set(" idx-args "w.val);\n"
+	pad "  " nm "_writes[tick].pop();\n"
+	pad "}\n\n")))
+
+(define (-gen-memory-writer nm mode dims)
+  (let* ((pad "    ")
+	 (sa string-append)
+	 (idx-args (string-map (lambda (x) (sa ", w.idx" (number->string x) "")) 
+			       (seq 0 (- dims 1)))))
+    (sa pad "while (! " nm "_writes[tick].empty())\n"
+	pad "{\n"
+	pad "  write<" mode "> &w = " nm "_writes[tick].top();\n"
+	pad "  current_cpu->SETMEM" mode " (w.pc" idx-args ", w.val);\n"
+	pad "  " nm "_writes[tick].pop();\n"
+	pad "}\n\n")))
+
+
+(define (-gen-reset-fn)
+  (let* ((sa string-append)
+	 (objs (append (map (lambda (h) (gen-c-symbol (obj:name h))) 
+			    (find register? (current-hw-list)))
+		       (map (lambda (m) (sa m "_memory")) useful-mode-names)))
+	 (clr (lambda (elt) (sa "    clear_stacks (" elt "_writes);\n"))))
+    (sa 
+     "  template <typename ST> \n"
+     "  static void clear_stacks (ST &st)\n"
+     "  {\n"
+     "    for (int i = 0; i < @prefix@::pipe_sz; i++)\n"
+     "      st[i].clear();\n"
+     "  }\n\n"
+     "  void @prefix@::write_stacks::reset ()\n  {\n"
+     (string-map clr objs)
+     "  }")))
+
+(define (-gen-unified-write-fn) 
+  (let* ((hw (find register? (current-hw-list)))
+	 (modes useful-mode-names)	
+	 (hw-triples (map (lambda (h) (list (gen-c-symbol (obj:name h))
+					    (obj:name (hw-mode h))
+					    (length (-hw-vector-dims h)))) 
+			hw))
+	 (mem-triples (map (lambda (m) (list (string-append m "_memory") m 1)) 
+			 modes)))
 
-(define (-gen-write-fns)
-  (logit 2 "Processing writer functions ...\n")
-  (string-write-map (lambda (sfmt) (-gen-write-fn sfmt))
-		    (current-sfmt-list))
-)
+    (logit 2 "Generating writer function ...\n") 
+    (string-append
+     "
+
+  void @prefix@::write_stacks::writeback (int tick, @cpu@::@cpu@_cpu* current_cpu) 
+  {
+"
+     "\n    // register writeback loops\n"
+     (string-map (lambda (t) (apply -gen-register-writer t)) hw-triples)
+     "\n    // memory writeback loops\n"
+     (string-map (lambda (t) (apply -gen-memory-writer t)) mem-triples)
+"
+  }
+")))
 
 
-; Generate <cpu>-write.cxx.
+(define (-gen-stacks-stream-and-destream-fns) 
+  (let* ((sa string-append)
+	 (regs (find hw-need-storage? (current-hw-list)))
+	 (reg-dim (lambda (r) 
+		    (let ((dims (-hw-vector-dims r)))
+		      (if (equal? 0 (length dims)) 
+			  "0"
+			  (number->string (car dims))))))
+	 (write-stacks 
+	  (map (lambda (n) (sa n "_writes"))
+	       (append (map (lambda (r) (gen-c-symbol (obj:name r))) regs)
+		       (map (lambda (m) (sa m "_memory")) useful-mode-names))))
+	 (stream-stacks (lambda (s) (sa "    stream_stacks ( s." s ", ost);\n")))
+	 (destream-stacks (lambda (s) (sa "    destream_stacks ( s." s ", ist);\n")))
+	 (stack-boilerplate
+	  (sa
+	   "  template <typename ST> \n"
+	   "  void stream_stacks (const ST &st, std::ostream &ost)\n"
+	   "  {\n"
+	   "    for (int i = 0; i < @prefix@::pipe_sz; i++)\n"
+	   "    {\n"
+	   "      ost << st[i].t << ' ';\n"
+	   "      for (int j = 0; j <= st[i].t; j++)\n"
+	   "      {\n"
+	   "        ost << st[i].buf[j].pc << ' ';\n"
+	   "        ost << st[i].buf[j].val << ' ';\n"
+	   "        ost << st[i].buf[j].idx0 << ' ';\n"
+	   "      }\n"
+	   "    }\n"
+	   "  }\n"
+	   "  \n"
+	   "  template <typename ST> \n"
+	   "  void destream_stacks (ST &st, std::istream &ist)\n"
+	   "  {\n"
+	   "    for (int i = 0; i < @prefix@::pipe_sz; i++)\n"
+	   "    {\n"
+	   "      ist >> st[i].t;\n"
+	   "      for (int j = 0; j <= st[i].t; j++)\n"
+	   "      {\n"
+	   "        ist >> st[i].buf[j].pc;\n"
+	   "        ist >> st[i].buf[j].val;\n"
+	   "        ist >> st[i].buf[j].idx0;\n"
+	   "      }\n"
+	   "    }\n"
+	   "  }\n"
+	   "  \n")))
+    (sa stack-boilerplate
+	"  std::ostream & @prefix@::operator<< (std::ostream &ost, const @prefix@::write_stacks &s)\n   {\n"
+	(string-map stream-stacks write-stacks)
+	"\n    return ost;\n"
+	"  }\n"
+	"  std::istream & @prefix@::operator>> (std::istream &ist, @prefix@::write_stacks &s)\n   {\n"
+	(string-map destream-stacks write-stacks)
+	"\n    return ist;\n"
+	"  }\n")))
 
 (define (cgen-write.cxx)
   (logit 1 "Generating " (gen-cpu-name) " write.cxx ...\n")
@@ -465,8 +694,8 @@ namespace @cpu@ {
 
   (sim-analyze-insns!)
 
-  ; Turn parallel execution support off.
-  (set-with-parallel?! #f)
+  ; Turn parallel execution support on if needed.
+  (set-with-parallel?! (state-parallel-exec?))
 
   ; Tell the rtx->c translator we are the simulator.
   (rtl-c-config! #:rtl-cover-fns? #t)
@@ -478,12 +707,18 @@ namespace @cpu@ {
    "\
 
 #include \"@cpu@.h\"
-using namespace @cpu@;
-
+#include <iostream>
 "
-   -gen-write-fns
+   (if (with-parallel?) 
+       (string-append
+	 (-gen-reset-fn)
+	 (-gen-unified-write-fn)
+	 (-gen-stacks-stream-and-destream-fns))
+
+       "// no write-stack functions required\n")
    )
 )
+
 \f
 ; ******************
 ; cgen-semantics.cxx
@@ -521,19 +756,14 @@ using namespace @cpu@;
 	 "sem_status\n")
      "@prefix@_sem_" (gen-sym insn)
      (if (with-parallel?)
-	 " (@cpu@_cpu* current_cpu, @prefix@_scache* sem, @prefix@_parexec* par_exec)\n"
+	 (string-append " (@cpu@_cpu* current_cpu, @prefix@_scache* sem, const int tick, \n\t"
+			"@prefix@::write_stacks &buf)\n")
 	 " (@cpu@_cpu* current_cpu, @prefix@_scache* sem)\n")
      "{\n"
      (gen-define-field-macro (insn-sfmt insn))
-     (if (with-parallel?)
-	 (gen-define-parallel-operand-macro (insn-sfmt insn))
-	 "")
      "  sem_status status = SEM_STATUS_NORMAL;\n"
      "  @prefix@_scache* abuf = sem;\n"
-     ; Unconditionally written operands are not recorded here.
-     (if (or (with-profile?) (with-parallel-write?))
-	 "  unsigned long long written = 0;\n"
-	 "")
+
      ; The address of this insn, needed by extraction and semantic code.
      ; Note that the address recorded in the cpu state struct is not used.
      ; For faster engines that copy will be out of date.
@@ -542,23 +772,12 @@ using namespace @cpu@;
      "\n"
      (gen-semantic-code insn)
      "\n"
-     ; Only update what's been written if some are conditionally written.
-     ; Otherwise we know they're all written so there's no point in
-     ; keeping track.
-     (if (or (with-profile?) (with-parallel-write?))
-	 (if (-any-cond-written? (insn-sfmt insn))
-	     "  abuf->written = written;\n"
-	     "")
-	 "")
      (if cti?
 	 "  current_cpu->done_cti_insn (npc, status);\n"
 	 "  current_cpu->done_insn (npc, status);\n")
      (if (with-parallel?)
 	 ""
 	 "  return status;\n")
-     (if (with-parallel?)
-	 (gen-undef-parallel-operand-macro (insn-sfmt insn))
-	 "")
      (gen-undef-field-macro (insn-sfmt insn))
      "}\n\n"
      ))
@@ -576,13 +795,14 @@ using namespace @cpu@;
 ; Each instruction is implemented in its own function.
 
 (define (cgen-semantics.cxx)
-  (logit 1 "Generating " (gen-cpu-name) " semantics.cxx ...\n")
+  (logit 1 "Generating " (gen-cpu-name) " semantics.cxx ")
   (assert-keep-one)
 
   (sim-analyze-insns!)
 
   ; Turn parallel execution support on if cpu needs it.
   (set-with-parallel?! (state-parallel-exec?))
+  (logit 1 (if (state-parallel-exec?) " (parallel) ...\n" "...\n"))
 
   ; Tell the rtx->c translator we are the simulator.
   (rtl-c-config! #:rtl-cover-fns? #t)
@@ -590,6 +810,8 @@ using namespace @cpu@;
   ; Indicate we're currently not generating a pbb engine.
   (set-current-pbb-engine?! #f)
 
+  (set! semantics-processed? #t)
+
   (string-write
    (gen-copyright "Simulator instruction semantics for @prefix@."
 		  copyright-red-hat package-red-hat-simulators)
@@ -598,6 +820,7 @@ using namespace @cpu@;
 #include \"@cpu@.h\"
 
 using namespace @cpu@; // FIXME: namespace organization still wip
+using namespace @prefix@; // FIXME: namespace organization still wip
 
 #define GET_ATTR(name) GET_ATTR_##name ()
 
@@ -655,9 +878,6 @@ using namespace @cpu@; // FIXME: namespa
      (if (with-scache?)
 	 (gen-define-field-macro (insn-sfmt insn))
 	 "")
-     (if parallel?
-	 (gen-define-parallel-operand-macro (insn-sfmt insn))
-	 "")
      ; Unconditionally written operands are not recorded here.
      (if (or (with-profile?) (with-parallel-write?))
 	 "      unsigned long long written = 0;\n"
@@ -694,9 +914,6 @@ using namespace @cpu@; // FIXME: namespa
 	 (string-append "      pbb_br_npc = npc;\n"
 			"      pbb_br_status = br_status;\n")
 	 "")
-     (if parallel?
-	 (gen-undef-parallel-operand-macro (insn-sfmt insn))
-	 "")
      (if (with-scache?)
 	 (gen-undef-field-macro (insn-sfmt insn))
 	 "")
@@ -950,9 +1167,6 @@ struct @prefix@_pbb_label {
 			"      vpc = vpc + 1;\n")
 	 "")
      (gen-define-field-macro (sfrag-sfmt frag))
-     (if parallel?
-	 (gen-define-parallel-operand-macro (sfrag-sfmt frag))
-	 "")
      ; Unconditionally written operands are not recorded here.
      (if (or (with-profile?) (with-parallel-write?))
 	 "      unsigned long long written = 0;\n"
@@ -992,9 +1206,6 @@ struct @prefix@_pbb_label {
 	      (sfrag-trailer? frag))
 	 (string-append "      pbb_br_npc = npc;\n"
 			"      pbb_br_status = br_status;\n")
-	 "")
-     (if parallel?
-	 (gen-undef-parallel-operand-macro (sfrag-sfmt frag))
 	 "")
      (gen-undef-field-macro (sfrag-sfmt frag))
      "    }\n"
Index: rtl-c.scm
===================================================================
RCS file: /cvs/src/src/cgen/rtl-c.scm,v
retrieving revision 1.4
diff -u -p -r1.4 rtl-c.scm
--- rtl-c.scm	8 Sep 2000 22:18:37 -0000	1.4
+++ rtl-c.scm	9 Jan 2003 03:22:25 -0000
@@ -1304,7 +1304,23 @@
 			"bad arg to `operand'" object-or-name)))
 )
 
-(define-fn xop (estate options mode object) object)
+(define-fn xop (estate options mode object) 
+  (let ((delayed (assoc '#:delay (estate-modifiers estate))))
+    (if (and delayed
+	     (equal? APPLICATION 'SID-SIMULATOR)
+	     (operand? object))
+	;; if we're looking at an operand inside a (delay ...) rtx, then we
+	;; are talking about a _delayed_ operand, which is a different
+	;; beast.  rather than try to work out what context we were
+	;; constructed within, we just clone the operand instance and set
+	;; the new one to have a delayed value. the setters and getters
+	;; will work it out.
+	(let ((obj (object-copy object))
+	      (amount (cadr delayed)))
+	  (op:set-delay! obj amount)
+	  obj)
+	;; else return the normal object
+	object)))
 
 (define-fn local (estate options mode object-or-name)
   (cond ((rtx-temp? object-or-name)
@@ -1363,9 +1379,38 @@
   (cx:make VOID "; /*clobber*/\n")
 )
 
-(define-fn delay (estate options mode n rtx)
-  (s-sequence (estate-with-modifiers estate '((#:delay))) VOID '() rtx) ; wip!
-)
+
+(define-fn delay (estate options mode num-node rtx)
+  (case APPLICATION
+    ((SID-SIMULATOR)
+     (let* ((n (cadddr num-node))
+	    (old-delay (let ((old (assoc '#:delay (estate-modifiers estate))))
+			 (if old (cadr old) 0)))
+	    (new-delay (+ n old-delay)))    
+       (begin
+	 ;; check for proper usage
+     	 (if (let* ((hw (case (car rtx) 
+			  ((operand) (op:type (rtx-operand-obj rtx)))
+			  ((xop) (op:type (rtx-xop-obj rtx)))
+			  (else #f))))		    	       
+	       (not (and hw (or (pc? hw) (memory? hw) (register? hw)))))
+	     (context-error 
+	      (estate-context estate) 
+	      (string-append 
+	       "(delay ...) rtx applied to wrong type of operand '" (symbol->string (car rtx)) "'. should be pc, register or memory")))
+	 ;; signal an error if we're delayed and not in a "parallel-insns" CPU
+	 (if (not (with-parallel?)) 
+	     (context-error 	      
+	      (estate-context estate) 
+	      "delayed operand in a non-parallel cpu"))
+	 ;; update cpu-global pipeline bound
+	 (cpu-set-max-delay! (current-cpu) (max (cpu-max-delay (current-cpu)) new-delay))      
+	 ;; pass along new delay to embedded rtx
+	 (rtx-eval-with-estate rtx mode (estate-with-modifiers estate `((#:delay ,new-delay)))))))
+
+    ;; not in sid-land
+    (else (s-sequence (estate-with-modifiers estate '((#:delay))) VOID '() rtx))))
+
 
 ; Gets expanded as a macro.
 ;(define-fn annul (estate yes?)
Index: operand.scm
===================================================================
RCS file: /cvs/src/src/cgen/operand.scm,v
retrieving revision 1.5
diff -u -p -r1.5 operand.scm
--- operand.scm	20 Dec 2002 06:39:04 -0000	1.5
+++ operand.scm	9 Jan 2003 03:22:29 -0000
@@ -90,6 +90,9 @@
 		; referenced.  #f means the operand is always referenced by
 		; the instruction.
 		(cond? . #f)
+		
+		; whether (and by how much) this instance of the operand is delayed.
+		(delayed . #f)
 		)
 	      nil)
 )
@@ -135,6 +138,8 @@
 (define op:set-num! (elm-make-setter <operand> 'num))
 (define op:cond? (elm-make-getter <operand> 'cond?))
 (define op:set-cond?! (elm-make-setter <operand> 'cond?))
+(define op:delay (elm-make-getter <operand> 'delayed))
+(define op:set-delay! (elm-make-setter <operand> 'delayed))
 
 ; Compute the hardware type lazily.
 ; FIXME: op:type should be named op:hwtype or some such.
Index: mach.scm
===================================================================
RCS file: /cvs/src/src/cgen/mach.scm,v
retrieving revision 1.2
diff -u -p -r1.2 mach.scm
--- mach.scm	12 Jul 2001 02:32:25 -0000	1.2
+++ mach.scm	9 Jan 2003 03:22:31 -0000
@@ -755,8 +755,7 @@
   (apply min (cons 65535
 		   (map insn-length (find (lambda (insn)
 					    (and (not (has-attr? insn 'ALIAS))
-						 (eq? (obj-attr-value insn 'ISA)
-						      (obj:name isa))))
+						 (isa-supports? isa insn)))
 					  (non-multi-insns (current-insn-list))))))
 )
 
@@ -765,9 +764,8 @@
   ; [a language with infinite precision can't have max-reduce-iota-0 :-)]
   (apply max (cons 0
 		   (map insn-length (find (lambda (insn)
-					    (and (not (has-attr? insn 'ALIAS))
-						 (eq? (obj-attr-value insn 'ISA)
-						      (obj:name isa))))
+					  (and (not (has-attr? insn 'ALIAS))
+						 (isa-supports? isa insn)))
 					  (non-multi-insns (current-insn-list))))))
 )
 
@@ -1008,13 +1006,19 @@
 		; Allow a cpu family to override the isa parallel-insns spec.
 		; ??? Concession to the m32r port which can go away, in time.
 		parallel-insns
+
+		; Computed: maximum number of insns which may pass before
+		; an insn writes back its output operands.
+		max-delay
+
 		)
 	      nil)
 )
 
 ; Accessors.
 
-(define-getters <cpu> cpu (word-bitsize insn-chunk-bitsize file-transform parallel-insns))
+(define-getters <cpu> cpu (word-bitsize insn-chunk-bitsize file-transform parallel-insns max-delay))
+(define-setters <cpu> cpu (max-delay))
 
 ; Return endianness of instructions.
 
@@ -1064,7 +1068,9 @@
 	      word-bitsize
 	      insn-chunk-bitsize
 	      file-transform
-	      parallel-insns)
+	      parallel-insns
+	      0 ; default max-delay. will compute correct value
+	      )
 	(begin
 	  (logit 2 "Ignoring " name ".\n")
 	  #f))) ; cpu is not to be kept
@@ -1284,13 +1290,13 @@
   ; Assert only one cpu family has been selected.
   (assert-keep-one)
 
-  (let ((par-insns (map isa-parallel-insns (current-isa-list)))
+  (let ((false->zero (lambda (x) (if x x 0)))
+	(par-insns (map isa-parallel-insns (current-isa-list)))
 	(cpu-par-insns (cpu-parallel-insns (current-cpu))))
     ; ??? The m32r does have parallel execution, but to keep support for the
     ; base mach simpler, a cpu family is allowed to override the isa spec.
-    (or cpu-par-insns
-	; FIXME: ensure all have same value.
-	(car par-insns)))
+    (max (false->zero cpu-par-insns) 
+	 (apply max (map false->zero par-insns))))
 )
 
 ; Return boolean indicating if parallel execution support is required.
Index: dev.scm
===================================================================
RCS file: /cvs/src/src/cgen/dev.scm,v
retrieving revision 1.5
diff -u -p -r1.5 dev.scm
--- dev.scm	21 Dec 2002 22:22:33 -0000	1.5
+++ dev.scm	9 Jan 2003 03:22:31 -0000
@@ -115,7 +115,7 @@
   (load "sid-model")
   (load "sid-decode")
   (set! verbose-level 3)
-  (set! APPLICATION 'SIMULATOR)
+  (set! APPLICATION 'SID-SIMULATOR)
 )
 
 (define (load-sim)
Index: doc/rtl.texi
===================================================================
RCS file: /cvs/src/src/cgen/doc/rtl.texi,v
retrieving revision 1.17
diff -u -p -r1.17 rtl.texi
--- doc/rtl.texi	22 Dec 2002 04:49:26 -0000	1.17
+++ doc/rtl.texi	9 Jan 2003 03:22:34 -0000
@@ -1833,7 +1833,7 @@ This is a character string consisting of
 Fields are denoted by @code{$operand} or
 @code{$@{operand@}}@footnote{Support for @code{$@{operand@}} is
 work-in-progress.}.  If a @samp{$} is required in the syntax, it is
-specified with @samp{\$}.  At most one white-space character may be
+specified with @samp{$$}.  At most one white-space character may be
 present and it must be a blank separating the instruction mnemonic from
 the operands.  This doesn't restrict the user's assembler, this is
 @c Is this reasonable?
@@ -2257,10 +2257,39 @@ first argument.
 Indicate that @samp{object} is written in mode @samp{mode}, without
 saying how. This could be useful in conjunction with the C escape hooks.
 
-@item (delay mode num expr)
-Indicate that there are @samp{num} delay slots in the processing of
-@samp{expr}.  When using this rtx in instruction semantics, CGEN will
-infer that the instruction has the DELAY-SLOT attribute.
+@item (delay num expr)
+In older "sim" simulators, indicates that there are @samp{num} delay
+slots in the processing of @samp{expr}. When using this rtx in instruction
+semantics, CGEN will infer that the instruction has the DELAY-SLOT
+attribute.  
+
+In newer "sid" simulators, evaluates to the writeback queue for hardware
+operand @samp{expr}, at @samp{num} instruction cycles in the
+future. @samp{expr} @emph{must} be a hardware operand in this case. 
+
+For example, @code{(set (delay 3 pc) (+ pc 1))} will schedule a write to
+the @samp{pc} register in the writeback phase of the 3rd instruction
+after the current one. Alternatively, @code{(set gr1 (delay 3 gr2))} will
+immediately update the @samp{gr1} register with the @emph{latest write}
+to the @samp{gr2} register scheduled between the present and 3
+instructions in the future. @code{(delay 0 ...)}  refers to the
+writeback phase of the current instruction.
+
+This effect is modeled with a circular buffer of "write stacks" for each
+hardware element (register banks get a single stack). The size of the
+circular buffer is calculated from the uses of @code{(delay ...)} 
+rtxs. When a delayed write occurs, the simulator pushes the write onto
+the appropriate write stack in the "future" of the circular buffer for
+the written-to hardware element. At the end of each instruction cycle,
+the simulator executes all writes in all write stacks for the time slice
+just ending. When a delayed read (essentially a pipeline bypass) occurs,
+the simulator looks ahead in the circular buffer for any writes
+scheduled in the future write stack. If it doesn't find one, it
+progressively backs off towards the "current" instruction cycle's write
+stack, and if it still finds no scheduled writes then it returns the
+current state of the CPU. Thus while delayed writes are fast, delayed
+reads are potentially slower in a simulator with long pipelines and very
+large register banks.
 
 @item (annul yes?)
 @c FIXME: put annul into the glossary.
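
For readers following the rtl.texi text above, the write-stack scheme can
be sketched in standalone C++ roughly as follows.  This is only an
illustration under assumed names: pending_write, write_pipe and the fixed
pipe_sz of 4 are hypothetical and are not what sid-cpu.scm generates.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Assumed for illustration: circular buffer length = largest delay + 1.
constexpr int pipe_sz = 4;

// One scheduled write: the insn that issued it, the value, and the
// register index it targets.
struct pending_write { unsigned pc; int val; int idx; };

struct write_pipe {
  // One "write stack" per future time slice.
  std::array<std::vector<pending_write>, pipe_sz> slots;

  // Delayed write: push onto the stack `delay` cycles ahead of `tick`.
  void push(int tick, int delay, pending_write w) {
    assert(delay >= 0 && delay < pipe_sz);
    slots[(tick + delay) % pipe_sz].push_back(w);
  }

  // Delayed read: start at the furthest slice and back off towards the
  // current one; fall back to the current CPU state if nothing is queued.
  int lookahead(int delay, int tick, int idx, int current) const {
    for (int d = delay; d >= 0; --d) {
      const auto& s = slots[(tick + d) % pipe_sz];
      for (auto it = s.rbegin(); it != s.rend(); ++it)
        if (it->idx == idx) return it->val;
    }
    return current;
  }

  // End of cycle: commit every write queued for this time slice.
  void writeback(int tick, int* regs) {
    auto& s = slots[tick % pipe_sz];
    for (const pending_write& w : s) regs[w.idx] = w.val;
    s.clear();
  }
};
```

In this sketch a write pushed with a delay of 3 at tick 0 is visible to a
lookahead of 3 straight away, but only reaches the register array when
writeback runs for tick 3.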


* exposed pipeline patch (long!)
  2003-01-09  3:27 exposed pipeline patch (long!) Ben Elliston
@ 2003-01-09  6:41 ` Doug Evans
  2003-01-09 17:55   ` Frank Ch. Eigler
  2003-01-09  6:55 ` Doug Evans
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Doug Evans @ 2003-01-09  6:41 UTC (permalink / raw)
  To: Ben Elliston; +Cc: cgen

I have big problems with this patch.

The SID stuff I don't much care about.
Each application's developers must be free to do things as they see fit,
provided they do so in the domain of that application and do not intrude
into cgen proper.

One must have compelling reasons for moving or putting
application specific stuff into the non-application specific parts of cgen
(and to be honest I don't think they exist in this case).
Before this patch goes in I think someone needs to justify the following:

Referring to APPLICATION in rtl-c.scm.  Blech.

 > -(define-fn xop (estate options mode object) object)
 > +(define-fn xop (estate options mode object) 
 > +  (let ((delayed (assoc '#:delay (estate-modifiers estate))))
 > +    (if (and delayed
 > +	     (equal? APPLICATION 'SID-SIMULATOR)
 > +	     (operand? object))

and here

 > +(define-fn delay (estate options mode num-node rtx)
 > +  (case APPLICATION
 > +    ((SID-SIMULATOR)

and references in rtl.texi.

 > +@item (delay num expr)
 > +In older "sim" simulators, indicates that there are @samp{num} delay
 > +slots in the processing of @samp{expr}. When using this rtx in instruction
 > +semantics, CGEN will infer that the instruction has the DELAY-SLOT
 > +attribute.  
 > +
 > +In newer "sid" simulators, evaluates to the writeback queue for hardware
 > +operand @samp{expr}, at @samp{num} instruction cycles in the
 > +future. @samp{expr} @emph{must} be a hardware operand in this case. 

rtl.texi shall not mention any particular app, and especially not go
into details about implementation.
[it can certainly make general references to classes of apps,
e.g. simulators, but that's it]

rtl shall define the ISA in an application independent way.
We can't have `delay' mean one thing to one app and one thing to another app.

Can the people who want this patch to go in try to come up with
a different way to do things?  At the very least define `delay'
in an application independent manner.  If any reasonable form of
`delay' is insufficient to describe every architecture we are interested
in then clearly we need something more.  [But obviously before
creating new rtl, one should make sure it's warranted.]

IMO it is not ok to commit this.

Ben Elliston writes:
 > I'm posting this patch on behalf of Graydon Hoare, who wrote this
 > exposed pipeline support last year.  It's a more generalised form of
 > the (delay ..) rtx and has been used for a couple of ports already.
 > 
 > Rather than just commit it, I thought I would post it for review.
 > Okay to commit?
 > 
 > Ben
 > 
 > 2001-06-05  graydon hoare  <graydon@redhat.com>
 > 
 >         * utils.scm (foldl): Define.
 >         (foldr): Define.
 >         (union): Define.
 >         (intersection): Simplify.
 >         * sid.scm : Set APPLICATION to SID-SIMULATOR.
 >         (-op-gen-delayed-set-maybe-trace): Define.
 >         (<operand> 'gen-set-{quiet,trace}): Delegate to
 >         op-gen-delayed-set-quiet etc. Note: this is still a little tangled
 >         up and needs cleaning.
 >         (-with-parallel?): Hardwire with-parallel to #t.
 >         (<operand> 'cxmake-get): Replace with lookahead-aware code
 >         * sid-decode.scm: Remove per-insn writeback fns.
 >         (-gen-idesc-decls): Redefine sem_fn type.
 >         * sid-cpu.scm (gen-write-stack-structure): Replace parexec stuff
 >         with write stack stuff.
 >         (cgen-write.cxx): Replace per-insn writebacks with single write
 >         stack writeback. Add write stack reset function.
 >         (-gen-scache-semantic-fn insn): Replace parexec stuff with write
 >         stack stuff.
 >         * rtl-c.scm (xop): Clone operand into delayed operand if #:delayed
 >         estate attribute set.
 >         (delay): Set #:delayed attribute to calculated delay, update
 >         maximum delay of cpu, check (delay ...) usage.
 >         * operand.scm (<operand>): Add delayed slot to <operand>.
 >         * mach.scm (<cpu>): Add max-delay slot to <cpu>.
 >         * dev.scm (load-sid): Set APPLICATION to SID-SIMULATOR.
 >         * doc/rtl.texi (Expressions): Add section on (delay ...).
 > 
 > Index: utils.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/utils.scm,v
 > retrieving revision 1.7
 > diff -u -p -r1.7 utils.scm
 > --- utils.scm	7 Jan 2002 08:23:59 -0000	1.7
 > +++ utils.scm	9 Jan 2003 03:22:12 -0000
 > @@ -78,6 +78,10 @@
 >  
 >  (define (spaces n) (make-string n #\space))
 >  
 > +; simple list-generators
 > +(define (seq p q) (if (> p q) '() (cons p (seq (+ p 1) q))))
 > +(define (fill x n) (if (> n 0) (cons x (fill x (- n 1))) '()))
 > +
 >  ; Write N spaces to PORT, or the current output port if elided.
 >  
 >  (define (write-spaces n . port)
 > @@ -471,6 +475,17 @@
 >    (reverse! (list-drop n (reverse l)))
 >  )
 >  
 > +;; left fold
 > +(define (foldl kons accum lis) 
 > +  (if (null? lis) accum 
 > +      (foldl kons (kons accum (car lis)) (cdr lis))))
 > +
 > +;; right fold
 > +(define (foldr kons knil lis) 
 > +  (if (null? lis) knil 
 > +      (kons (car lis) (foldr kons knil (cdr lis)))))
 > +
 > +
 >  ; APL's +\ operation on a vector of numbers.
 >  
 >  (define (plus-scan l)
 > @@ -540,12 +555,13 @@
 >  
 >  ; Return intersection of two lists.
 >  
 > -(define (intersection l1 l2)
 > -  (cond ((null? l1) l1)
 > -	((null? l2) l2)
 > -	((memq (car l1) l2) (cons (car l1) (intersection (cdr l1) l2)))
 > -	(else (intersection (cdr l1) l2)))
 > -)
 > +(define (intersection a b) 
 > +  (foldl (lambda (l e) (if (memq e a) (cons e l) l)) '() b))
 > +
 > +; Return union of two lists.
 > +
 > +(define (union a b) 
 > +  (foldl (lambda (l e) (if (memq e l) l (cons e l))) a b))
 >  
 >  ; Return a count of the number of elements of list L1 that are in list L2.
 >  ; Uses memq.
 > Index: sid.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/sid.scm,v
 > retrieving revision 1.7
 > diff -u -p -r1.7 sid.scm
 > --- sid.scm	7 Jan 2002 08:23:59 -0000	1.7
 > +++ sid.scm	9 Jan 2003 03:22:18 -0000
 > @@ -10,7 +10,7 @@
 >  ; [It still does but that's to be fixed.]
 >  
 >  ; Specify which application.
 > -(set! APPLICATION 'SIMULATOR)
 > +(set! APPLICATION 'SID-SIMULATOR)
 >  
 >  ; Misc. state info.
 >  
 > @@ -118,7 +118,7 @@
 >  ; While processing operand reading (or writing), parallel execution support
 >  ; needs to be turned off, so it is up to the appropriate cgen-foo.c proc to
 >  ; set-with-parallel?! appropriately.
 > -(define -with-parallel? #f)
 > +(define -with-parallel? #t)
 >  (define (with-parallel?) -with-parallel?)
 >  (define (set-with-parallel?! flag) (set! -with-parallel? flag))
 >  
 > @@ -924,43 +924,6 @@
 >  	 (rtl-c++ INT yes? nil #:rtl-cover-fns? #t)))
 >  )
 >  
 > -; For parallel write post-processing, we don't want to defer setting the pc.
 > -; ??? Not sure anymore.
 > -;(method-make!
 > -; <pc> 'gen-set-quiet
 > -; (lambda (self estate mode index selector newval)
 > -;   (-op-gen-set-quiet self estate mode index selector newval)))
 > -;(method-make!
 > -; <pc> 'gen-set-trace
 > -; (lambda (self estate mode index selector newval)
 > -;   (-op-gen-set-trace self estate mode index selector newval)))
 > -
 > -; Name of C macro to access parallel execution operand support.
 > -
 > -(define -par-operand-macro "OPRND")
 > -
 > -; Return C code to fetch an operand's value and save it away for the
 > -; semantic handler.  This is used to handle parallel execution of several
 > -; instructions where all inputs of all insns are read before any outputs are
 > -; written.
 > -; For operands, the word `read' is only used in this context.
 > -
 > -(define (op:read op sfmt)
 > -  (let ((estate (estate-make-for-normal-rtl-c++ nil nil)))
 > -    (send op 'gen-read estate sfmt -par-operand-macro))
 > -)
 > -
 > -; Return C code to write an operand's value.
 > -; This is used to handle parallel execution of several instructions where all
 > -; outputs are written to temporary spots first, and then a final
 > -; post-processing pass is run to update cpu state.
 > -; For operands, the word `write' is only used in this context.
 > -
 > -(define (op:write op sfmt)
 > -  (let ((estate (estate-make-for-normal-rtl-c++ nil nil)))
 > -    (send op 'gen-write estate sfmt -par-operand-macro))
 > -)
 > -
 >  ; Default gen-read method.
 >  ; This is used to help support targets with parallel insns.
 >  ; Either this or gen-write (but not both) is used.
 > @@ -1010,36 +973,46 @@
 >  (method-make!
 >   <operand> 'cxmake-get
 >   (lambda (self estate mode index selector)
 > -   (let ((mode (if (mode:eq? 'DFLT mode)
 > -		   (send self 'get-mode)
 > -		   mode))
 > -	 (index (if index index (op:index self)))
 > -	 (selector (if selector selector (op:selector self))))
 > -     ; If the object is marked with the RAW attribute, access the hardware
 > -     ; object directly.
 > +   (let* ((mode (if (mode:eq? 'DFLT mode)
 > +		    (send self 'get-mode)
 > +		    mode))
 > +	  (hw (op:type self))
 > +	  (index (if index index (op:index self)))
 > +	  (selector (if selector selector (op:selector self)))
 > +	  (delayval (op:delay self))
 > +	  (md (mode:c-type mode))
 > +	  (name (if 
 > +		 (eq? (obj:name hw) 'h-memory)
 > +		 (string-append md "_memory")
 > +		 (gen-c-symbol (obj:name hw))))
 > +	  (getter (op:getter self))
 > +	  (def-val (cond ((obj-has-attr? self 'RAW)
 > +			  (send hw 'cxmake-get-raw estate mode index selector))
 > +			 (getter
 > +			  (let ((args (car getter))
 > +				(expr (cadr getter)))
 > +			    (rtl-c-expr mode expr
 > +					(if (= (length args) 0) nil
 > +					    (list (list (car args) 'UINT index)))
 > +					#:rtl-cover-fns? #t
 > +					#:output-language (estate-output-language estate))))
 > +			 (else
 > +			  (send hw 'cxmake-get estate mode index selector)))))
 > +     
 >       (logit 4 "<operand> cxmake-get self=" (obj:name self) " mode=" (obj:name mode)
 >  	    " index=" (obj:name index) " selector=" selector "\n")
 > -     (cond ((obj-has-attr? self 'RAW)
 > -	    (send (op:type self) 'cxmake-get-raw estate mode index selector))
 > -	   ; If the instruction could be parallely executed with others and
 > -	   ; we're doing read pre-processing, the operand has already been
 > -	   ; fetched, we just have to grab the cached value.
 > -	   ((with-parallel-read?)
 > -	    (cx:make-with-atlist mode
 > -				 (string-append -par-operand-macro
 > -						" (" (gen-sym self) ")")
 > -				 nil)) ; FIXME: want CACHED attr if present
 > -	   ((op:getter self)
 > -	    (let ((args (car (op:getter self)))
 > -		  (expr (cadr (op:getter self))))
 > -	      (rtl-c-expr mode expr
 > -			  (if (= (length args) 0)
 > -			      nil
 > -			      (list (list (car args) 'UINT index)))
 > -			  #:rtl-cover-fns? #t
 > -			  #:output-language (estate-output-language estate))))
 > -	   (else
 > -	    (send (op:type self) 'cxmake-get estate mode index selector)))))
 > +     
 > +     (if delayval
 > +	 (if (derived-operand? self)
 > +	     (error "delayed derived operands currently unsupported: " self)
 > +	     (let ((idx (if index (string-append ", " (-gen-hw-index index estate)) "")))	   
 > +	       (cx:make mode (string-append "lookahead ("
 > +					    (number->string delayval)
 > +					    ", tick, " 
 > +					    "buf." name "_writes, " 
 > +					    (cx:c def-val) 
 > +					    idx ")"))))
 > +	 def-val)))
 >  )
 >  
 >  
 > @@ -1049,16 +1022,9 @@
 >    (send (op:type op) 'gen-set-quiet estate mode index selector newval)
 >  )
 >  
 > -(define (-op-gen-set-quiet-parallel op estate mode index selector newval)
 > -  (string-append
 > -   (if (op-save-index? op)
 > -       (string-append "    " -par-operand-macro " (" (-op-index-name op) ")"
 > -		      " = " (-gen-hw-index index estate) ";\n")
 > -       "")
 > -   "    "
 > -   -par-operand-macro " (" (gen-sym op) ")"
 > -   " = " (cx:c newval) ";\n")
 > -)
 > +(define (-op-gen-delayed-set-quiet op estate mode index selector newval)
 > +  (-op-gen-delayed-set-maybe-trace op estate mode index selector newval #f))
 > +
 >  
 >  (define (-op-gen-set-trace op estate mode index selector newval)
 >    (string-append
 > @@ -1079,12 +1045,7 @@
 >         ;else
 >         (send (op:type op) 'gen-set-quiet estate mode index selector
 >  		(cx:make-with-atlist mode "opval" (cx:atlist newval))))
 > -   (if (and (with-profile?)
 > -	    (op:cond? op))
 > -       (string-append "    written |= (1ULL << "
 > -		      (number->string (op:num op))
 > -		      ");\n")
 > -       "")
 > +   
 >  ; TRACE_RESULT_<MODE> (cpu, abuf, hwnum, opnum, value);
 >  ; For each insn record array of operand numbers [or indices into
 >  ; operand instance table].
 > @@ -1122,21 +1083,41 @@
 >     "  }\n")
 >  )
 >  
 > -(define (-op-gen-set-trace-parallel op estate mode index selector newval)
 > -  (string-append
 > -   "  {\n"
 > -   "    " (mode:c-type mode) " opval = " (cx:c newval) ";\n"
 > -   (if (op-save-index? op)
 > -       (string-append "    " -par-operand-macro " (" (-op-index-name op) ")"
 > -		      " = " (-gen-hw-index index estate) ";\n")
 > -       "")
 > -   "    " -par-operand-macro " (" (gen-sym op) ")"
 > -   " = opval;\n"
 > -   (if (op:cond? op)
 > -       (string-append "    written |= (1ULL << "
 > -		      (number->string (op:num op))
 > -		      ");\n")
 > -       "")
 > +(define (-op-gen-delayed-set-trace op estate mode index selector newval)
 > +  (-op-gen-delayed-set-maybe-trace op estate mode index selector newval #t))
 > +
 > +(define (-op-gen-delayed-set-maybe-trace op estate mode index selector newval do-trace?)
 > +  (let* ((pad "    ")
 > +	 (hw (op:type op))
 > +	 (delayval (op:delay op))
 > +	 (md (mode:c-type mode))
 > +	 (name (if 
 > +		(eq? (obj:name hw) 'h-memory)
 > +		(string-append md "_memory")
 > +		(gen-c-symbol (obj:name hw))))
 > +	 (val (cx:c newval))
 > +	 (idx (if index (-gen-hw-index index estate) ""))
 > +	 (idx-args (if (equal? idx "") "" (string-append ", " idx)))
 > +	 )
 > +    
 > +    (string-append
 > +     "  {\n"
 > +
 > +     (if delayval 
 > +
 > +	 ;; delayed write: push it to the appropriate buffer
 > +	 (string-append	    
 > +	  pad md " opval = " val ";\n"
 > +	  pad "buf." name "_writes [(tick + " (number->string delayval)
 > +	  ") % @prefix@::pipe_sz].push (@prefix@::write<" md ">(pc, opval" idx-args "));\n")
 > +
 > +	 ;; else, uh, we should never have been called!
 > +	 (error "-op-gen-delayed-set-maybe-trace called on non-delayed operand"))       
 > +     
 > +     
 > +     (if do-trace?
 > +
 > +	 (string-append
 >  ; TRACE_RESULT_<MODE> (cpu, abuf, hwnum, opnum, value);
 >  ; For each insn record array of operand numbers [or indices into
 >  ; operand instance table].
 > @@ -1169,8 +1150,8 @@
 >  	   ""))
 >     "opval << dec << \"  \";\n"
 >     "  }\n")
 > -)
 > -
 > +	 ;; else no tracing is emitted
 > +	 ""))))
 >  
 >  ; Return C code to set the value of an operand.
 >  ; NEWVAL is a <c-expr> object of the value to store.
 > @@ -1189,8 +1170,8 @@
 >  	 (selector (if selector selector (op:selector self))))
 >       (cond ((obj-has-attr? self 'RAW)
 >  	    (send (op:type self) 'gen-set-quiet-raw estate mode index selector newval))
 > -	   ((with-parallel-write?)
 > -	    (-op-gen-set-quiet-parallel self estate mode index selector newval))
 > +	   ((op:delay self)
 > +	    (-op-gen-delayed-set-quiet self estate mode index selector newval))
 >  	   (else
 >  	    (-op-gen-set-quiet self estate mode index selector newval)))))
 >  )
 > @@ -1212,26 +1193,12 @@
 >  	 (selector (if selector selector (op:selector self))))
 >       (cond ((obj-has-attr? self 'RAW)
 >  	    (send (op:type self) 'gen-set-quiet-raw estate mode index selector newval))
 > -	   ((with-parallel-write?)
 > -	    (-op-gen-set-trace-parallel self estate mode index selector newval))
 > +	   ((op:delay self)
 > +	    (-op-gen-delayed-set-trace self estate mode index selector newval))
 >  	   (else
 >  	    (-op-gen-set-trace self estate mode index selector newval)))))
 >  )
 >  
 > -; Define and undefine C macros to tuck away details of instruction format used
 > -; in the parallel execution functions.  See gen-define-field-macro for a
 > -; similar thing done for extraction/semantic functions.
 > -
 > -(define (gen-define-parallel-operand-macro sfmt)
 > -  (string-append "#define " -par-operand-macro "(f) "
 > -		 "par_exec->operands."
 > -		 (gen-sym sfmt)
 > -		 ".f\n")
 > -)
 > -
 > -(define (gen-undef-parallel-operand-macro sfmt)
 > -  (string-append "#undef " -par-operand-macro "\n")
 > -)
 >  \f
 >  ; Operand profiling and parallel execution support.
 >  
 > Index: sid-decode.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/sid-decode.scm,v
 > retrieving revision 1.8
 > diff -u -p -r1.8 sid-decode.scm
 > --- sid-decode.scm	7 Feb 2002 18:46:19 -0000	1.8
 > +++ sid-decode.scm	9 Jan 2003 03:22:18 -0000
 > @@ -47,10 +47,7 @@ bool @prefix@_idesc::idesc_table_initial
 >  	       (if pbb?
 >  		   "0, "
 >  		   (string-append (-gen-sem-fn-name insn) ", "))
 > -	       "")
 > -           (if (with-parallel?)
 > -               (string-append (-gen-write-fn-name sfmt) ", ")
 > -               "")
 > +	       "") 
 >  	   "\"" (string-upcase name) "\", "
 >  	   (gen-cpu-insn-enum (current-cpu) insn)
 >  	   ", "
 > @@ -131,25 +128,6 @@ bool @prefix@_idesc::idesc_table_initial
 >  )
 >  
 >  
 > -;; and the same for writeback functions
 > -
 > -(define (-gen-write-fn-name sfmt)
 > -  (string-append "@prefix@_write_" (gen-sym sfmt))
 > -)
 > -
 > -
 > -(define (-gen-write-fn-decls)
 > -  (string-write
 > -   "// Decls of each writeback fn.\n\n"
 > -   "using @cpu@::@prefix@_write_fn;\n"
 > -   (string-list-map (lambda (sfmt)
 > -		      (string-list "extern @prefix@_write_fn "
 > -				   (-gen-write-fn-name sfmt)
 > -				   ";\n"))
 > -		    (current-sfmt-list))
 > -   "\n"
 > -   )
 > -)
 >  
 >  \f
 >  ; idesc, argbuf, and scache types
 > @@ -164,14 +142,9 @@ struct @cpu@_cpu;
 >  struct @prefix@_scache;
 >  "
 >     (if (with-parallel?)
 > -       "struct @prefix@_parexec;\n" "")
 > -   (if (with-parallel?)
 > -       "typedef void (@prefix@_sem_fn) (@cpu@_cpu* cpu, @prefix@_scache* sem, @prefix@_parexec* par_exec);"
 > +       "typedef void (@prefix@_sem_fn) (@cpu@_cpu* cpu, @prefix@_scache* sem, int tick, @prefix@::write_stacks &buf);"
 >         "typedef sem_status (@prefix@_sem_fn) (@cpu@_cpu* cpu, @prefix@_scache* sem);")
 >     "\n"
 > -   (if (with-parallel?)
 > -       "typedef sem_status (@prefix@_write_fn) (@cpu@_cpu* cpu, @prefix@_scache* sem, @prefix@_parexec* par_exec);"
 > -       "")
 >     "\n"   
 >  "
 >  // Instruction descriptor.
 > @@ -192,12 +165,6 @@ struct @prefix@_idesc {
 >    @prefix@_sem_fn* execute;\n\n"
 >         "")
 >  
 > -   (if (with-parallel?)
 > -       "\
 > -  // scache write executor for this insn
 > -  @prefix@_write_fn* writeback;\n\n"
 > -       "")
 > -
 >     "\
 >    const char* insn_name;
 >    enum @prefix@_insn_type sem_index;
 > @@ -300,15 +267,6 @@ struct @prefix@_scache {
 >    // argument buffer
 >    @prefix@_sem_fields fields;
 >  
 > -" (if (or (with-profile?) (with-parallel-write?))
 > -      (string-append "
 > -  // writeback flags
 > -  // Only used if profiling or parallel execution support enabled during
 > -  // file generation.
 > -  unsigned long long written;
 > -")
 > -      "") "
 > -
 >    // decode given instruction
 >    void decode (@cpu@_cpu* current_cpu, PCADDR pc, @prefix@_insn_word base_insn, @prefix@_insn_word entire_insn);
 >  };
 > @@ -718,6 +676,11 @@ void
 >  #ifndef @PREFIX@_DECODE_H
 >  #define @PREFIX@_DECODE_H
 >  
 > +namespace @prefix@ {
 > +// forward declaration of struct in -defs.h
 > +struct write_stacks;
 > +}
 > +
 >  namespace @cpu@ {
 >  
 >  using namespace cgen;
 > @@ -739,10 +702,6 @@ typedef UINT @prefix@_insn_word;
 >     ; There's no pressing need for it though.
 >     (if (with-scache?)
 >         -gen-sem-fn-decls
 > -       "")
 > -
 > -   (if (with-parallel?)
 > -       -gen-write-fn-decls
 >         "")
 >  
 >     "\
 > Index: sid-cpu.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/sid-cpu.scm,v
 > retrieving revision 1.7
 > diff -u -p -r1.7 sid-cpu.scm
 > --- sid-cpu.scm	7 Feb 2002 18:46:19 -0000	1.7
 > +++ sid-cpu.scm	9 Jan 2003 03:22:23 -0000
 > @@ -199,6 +199,34 @@ namespace @arch@ {
 >     (-gen-hardware-struct #f (find hw-need-storage? (current-hw-list))))
 >  )
 >  
 > +(define (-gen-hw-stream-and-destream-fns) 
 > +  (let* ((sa string-append)
 > +	 (regs (find hw-need-storage? (current-hw-list)))
 > +	 (reg-dim (lambda (r) 
 > +		    (let ((dims (-hw-vector-dims r)))
 > +		      (if (equal? 0 (length dims)) 
 > +			  "0"
 > +			  (number->string (car dims))))))
 > +	 (stream-reg (lambda (r) 
 > +		       (let ((rname (sa "hardware." (gen-c-symbol (obj:name r)))))
 > +			 (if (hw-scalar? r)
 > +			     (sa "    ost << " rname " << ' ';\n")
 > +			     (sa "    for (int i = 0; i < " (reg-dim r) 
 > +				 "; i++)\n      ost << " rname "[i] << ' ';\n")))))
 > +	 (destream-reg (lambda (r) 
 > +			 (let ((rname (sa "hardware." (gen-c-symbol (obj:name r)))))
 > +			   (if (hw-scalar? r)
 > +			       (sa "    ist >> " rname ";\n")
 > +			       (sa "    for (int i = 0; i < " (reg-dim r) 
 > +				   "; i++)\n      ist >> " rname "[i];\n"))))))
 > +    (sa
 > +     "  void stream_cgen_hardware (std::ostream &ost) const \n  {\n"
 > +     (string-map stream-reg regs)
 > +     "  }\n"
 > +     "  void destream_cgen_hardware (std::istream &ist) \n  {\n"
 > +     (string-map destream-reg regs)
 > +     "  }\n")))
 > +
 >  ; Generate <cpu>-cpu.h
 >  
 >  (define (cgen-cpu.h)
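For anyone reading the generated C++ rather than the Scheme, here is a rough sketch of the pair of member functions that -gen-hw-stream-and-destream-fns would emit, assuming a port with one scalar register and one four-element register file. The names Hardware, Cpu, h_pc and h_gr are made up for illustration; only stream_cgen_hardware and destream_cgen_hardware come from the patch.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical hardware block: one scalar and one vector register.
struct Hardware {
  unsigned h_pc = 0;
  int h_gr[4] = {0, 0, 0, 0};
};

struct Cpu {
  Hardware hardware;

  // Scalars are written as "value "; vectors as one "value " per element.
  void stream_cgen_hardware(std::ostream &ost) const {
    ost << hardware.h_pc << ' ';
    for (int i = 0; i < 4; i++)
      ost << hardware.h_gr[i] << ' ';
  }

  // Reads the fields back in exactly the order they were written.
  void destream_cgen_hardware(std::istream &ist) {
    ist >> hardware.h_pc;
    for (int i = 0; i < 4; i++)
      ist >> hardware.h_gr[i];
  }
};
```

Streaming a Cpu into a std::ostringstream and destreaming the resulting string into a second Cpu should reproduce the same register values, since both loops walk the hardware in the same order.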
 > @@ -222,6 +250,8 @@ public:
 >  
 >     -gen-hardware-types
 >  
 > +   -gen-hw-stream-and-destream-fns
 > +
 >     "  // C++ register access function templates\n"
 >     "#define current_cpu this\n\n"
 >     (lambda ()
 > @@ -295,68 +325,161 @@ typedef struct {
 >     )
 >  )
 >  
 > -; Utility of gen-parallel-exec-type to generate the definition of one
 > -; structure in PAREXEC.
 > -; SFMT is an <sformat> object.
 >  
 > -(define (gen-parallel-exec-elm sfmt)
 > -  (string-append
 > -   "    struct { /* " (obj:comment sfmt) " */\n"
 > -   (let ((sem-ops
 > -	  ((if (with-parallel-write?) sfmt-out-ops sfmt-in-ops) sfmt)))
 > -     (if (null? sem-ops)
 > -	 "      int empty;\n"
 > -	 (string-map
 > -	  (lambda (op)
 > -	    (logit 2 "Processing operand " (obj:name op) " of format "
 > -		   (obj:name sfmt) " ...\n")
 > -	      (if (with-parallel-write?)
 > -		  (let ((index-type (and (op-save-index? op)
 > -					 (gen-index-type op sfmt))))
 > -		    (string-append "      " (gen-type op)
 > -				   " " (gen-sym op) ";\n"
 > -				   (if index-type
 > -				       (string-append "      " index-type 
 > -						      " " (gen-sym op) "_idx;\n")
 > -				       "")))
 > -		  (string-append "      "
 > -				 (gen-type op)
 > -				 " "
 > -				 (gen-sym op)
 > -				 ";\n")))
 > -	  sem-ops)))
 > -   "    } " (gen-sym sfmt) ";\n"
 > -   )
 > -)
 > +
 > +
 > +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 > +;;; begin stack-based write schedule
 > +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 > +
 > +(define useful-mode-names '(BI QI HI SI DI UQI UHI USI UDI SF DF))
 > +
 > +;(define (-calculated-memory-write-buffer-size)
 > +;  (let* ((is-mem? (lambda (op) (eq? (hw-sem-name (op:type op)) 'h-memory)))
 > +;	 (count-mem-writes
 > +;	  (lambda (sfmt) (length (find is-mem? (sfmt-out-ops sfmt))))))
 > +;    (apply max (append '(0) (map count-mem-writes (current-sfmt-list))))))
 > +
 > +
 > +;; note: this doesn't really correctly approximate the worst case. user-supplied functions
 > +;; might rewrite the pipeline extensively while it's running. 
 > +;(define (-worst-case-number-of-writes-to hw-name)
 > +;  (let* ((sfmts (current-sfmt-list))
 > +;	 (out-ops (map sfmt-out-ops sfmts))
 > +;	 (pred (lambda (op) (equal? hw-name (gen-c-symbol (obj:name (op:type op))))))
 > +;	 (filtered-ops (map (lambda (ops) (find pred ops)) out-ops)))
 > +;    (apply max (cons 0 (map (lambda (ops) (length ops)) filtered-ops)))))
 > +	 
 > +(define (-hw-gen-write-stack-decl nm mode)
 > +  (let* (
 > +; for the time being, we're disabling this size-estimation stuff and just
 > +; requiring the user to supply a parameter WRITE_BUF_SZ before they include -defs.h
 > +;	 (pipe-sz (+ 1 (max-delay (cpu-max-delay (current-cpu)))))
 > +;	 (sz (* pipe-sz (-worst-case-number-of-writes-to nm))))
 > +	 
 > +	 (mode-pad (spaces (- 4 (string-length mode))))
 > +	 (stack-name (string-append nm "_writes")))
 > +    (string-append
 > +     "  write_stack< write<" mode "> >" mode-pad "\t" stack-name "\t[pipe_sz];\n")))
 > +
 > +
 > +(define (-hw-gen-write-struct-decl)
 > +  (let* ((dims (-worst-case-index-dims))
 > +	 (sa string-append)
 > +	 (ns number->string)
 > +	 (idxs (seq 0 (- dims 1)))
 > +	 (ctor (sa "write (PCADDR _pc, MODE _val"
 > +		   (string-map (lambda (x) (sa ", USI _idx" (ns x) "=0")) idxs)
 > +		   ") : pc(_pc), val(_val)"
 > +		   (string-map (lambda (x) (sa ", idx" (ns x) "(_idx" (ns x) ")")) idxs)
 > +		   " {} \n"))
 > +	 (idx-fields (string-map (lambda (x) (sa "    USI idx" (ns x) ";\n")) idxs)))
 > +    (sa
 > +     "\n\n"
 > +     "  template <typename MODE>\n"
 > +     "  struct write\n"
 > +     "  {\n"
 > +     "    USI pc;\n"
 > +     "    MODE val;\n"
 > +     idx-fields
 > +     "    " ctor 
 > +     "    write() {}\n"
 > +     "  };\n" )))
 > +	       
 > +(define (-hw-vector-dims hw) (elm-get (hw-type hw) 'dimensions))			    
 > +(define (-worst-case-index-dims)
 > +  (apply max
 > +	 (append '(1) ; for memory accesses
 > +		 (map (lambda (hw) (length (-hw-vector-dims hw))) 
 > +		      (find (lambda (hw) (not (scalar? hw))) (current-hw-list))))))
 > +
 > +(define (-gen-writestacks)
 > +  (let* ((hw (find register? (current-hw-list)))
 > +	 (modes useful-mode-names) 
 > +	 (hw-pairs (map (lambda (h) (list (gen-c-symbol (obj:name h))
 > +					    (obj:name (hw-mode h)))) 
 > +			hw))
 > +	 (mem-pairs (map (lambda (m) (list (string-append m "_memory") m)) 
 > +			 modes))
 > +	 (all-pairs (append mem-pairs hw-pairs))
 > +
 > +	 (h1 "\n\n// write stacks used in parallel execution\n\n  struct write_stacks\n  {\n  // types of stacks\n\n")
 > +	 (wb (string-append
 > +	      "\n\n  // unified writeback function (defined in @prefix@-write.cc)"
 > +	        "\n  void writeback (int tick, @cpu@::@cpu@_cpu* current_cpu);"
 > +		"\n  // unified write-stack clearing function (defined in @prefix@-write.cc)"
 > +	        "\n  void reset ();"))
 > +	 (zz "\n\n  }; // end struct @prefix@::write_stacks \n\n")
 > +	 (st (string-append 
 > +	      "  std::ostream &operator<< (std::ostream &ost, const @prefix@::write_stacks &s);\n"
 > +	      "  std::istream &operator>> (std::istream &ist, @prefix@::write_stacks &s);\n"))
 > +	 )
 > +    (string-append	
 > +     (-hw-gen-write-struct-decl)
 > +     (foldl (lambda (s pair) (string-append s (apply -hw-gen-write-stack-decl pair))) h1 all-pairs)	  
 > +     wb
 > +     zz
 > +     st)))
 > +
 > +
 > +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 > +;;; end stack-based write schedule
 > +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 > +	  
 >  
 >  ; Generate the definition of the structure that holds register values, etc.
 > -; for use during parallel execution.  When instructions are executed parallelly
 > -; either
 > -; - their inputs are read before their outputs are written.  Thus we have to
 > -; fetch the input values of several instructions before executing any of them.
 > -; - or their outputs are queued here first and then written out after all insns
 > -; have executed.
 > -; The fetched/queued values are stored in an array of PAREXEC structs, one
 > -; element per instruction.
 > +; for use during parallel execution.  
 >  
 > -(define (gen-parallel-exec-type)
 > -  (logit 2 "Generating PAREXEC type ...\n")
 > -  (string-append
 > -   (if (with-parallel-write?)
 > -       "/* Queued output values of an instruction.  */\n"
 > -       "/* Fetched input values of an instruction.  */\n")
 > -   "\
 > +(define (gen-write-stack-structure)
 > +  (let (;(membuf-sz (-calculated-memory-write-buffer-size))
 > +	(max-delay (cpu-max-delay (current-cpu))))
 > +    (logit 2 "Generating write stack structure ...\n")
 > +    (string-append
 > +     "  static const int max_delay = "   
 > +     (number->string max-delay) ";\n"
 > +     "  static const int pipe_sz = "     
 > +     (number->string (+ 1 max-delay)) "; // max_delay + 1\n"
 >  
 > -struct @prefix@_parexec {
 > -  union {\n"
 > -   (string-map gen-parallel-exec-elm (current-sfmt-list))
 > -   "\
 > -  } operands;
 > -  /* For conditionally written operands, bitmask of which ones were.  */
 > -  unsigned long long written;
 > -};\n\n"
 > -   )
 > -)
 > +"
 > +#ifndef WRITE_BUF_SZ
 > +#define WRITE_BUF_SZ 1
 > +#endif
 > +
 > +  template <typename ELT> 
 > +  struct write_stack 
 > +  {
 > +    int t;
 > +    const int sz;
 > +    ELT buf[WRITE_BUF_SZ];
 > +
 > +    write_stack       ()             : t(-1), sz(WRITE_BUF_SZ) {}
 > +    inline bool empty ()             { return (t == -1); }
 > +    inline void clear ()             { t = -1; }
 > +    inline void pop   ()             { assert (t > -1); t--;}
 > +    inline void push  (const ELT &e) { assert (t+1 < sz); buf [++t] = e;}
 > +    inline ELT &top   ()             { return buf [t>0 ? ( t<sz ? t : sz-1) : 0];}
 > +  };
 > +
 > +  // look ahead for latest write with index = idx, where time of write is
 > +  // <= dist steps from base (present) in write_stack array st.
 > +  // returning def if no scheduled write is found.
 > +
 > +  template <typename STKS, typename VAL>
 > +  inline VAL lookahead (int dist, int base, STKS &st, VAL def, int idx=0)
 > +  {
 > +    for (; dist > 0; --dist)
 > +    {
 > +      write_stack <VAL> &v = st [(base + dist) % pipe_sz];
 > +      for (int i = v.t; i >= 0; --i)
 > +	  if (v.buf [i].idx0 == idx) return v.buf [i];
 > +    }
 > +    return def;
 > +  }
 > +
 > +"
 > + 
 > +     (-gen-writestacks)     
 > +     )))
 >  
 >  ; Generate the TRACE_RECORD struct definition.
 >  
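A minimal standalone sketch of the write-stack ring that gen-write-stack-structure emits, assuming a single register file with one index. Write, WriteStack, RegWrites, schedule and lookahead_latest are stand-in names for illustration, not the generated identifiers; kBufSz plays the role of WRITE_BUF_SZ, and the inner scan here walks from the top element down to slot 0.

```cpp
#include <cassert>

// Hypothetical stand-in for the generated write<MODE> record.
struct Write {
  unsigned pc;
  int val;
  unsigned idx0;
};

constexpr int kMaxDelay = 2;            // assumed largest (delay N ...) in the port
constexpr int kPipeSz = kMaxDelay + 1;  // mirrors pipe_sz = max_delay + 1
constexpr int kBufSz = 4;               // plays the role of WRITE_BUF_SZ

// Fixed-capacity stack; t is the index of the top element, -1 when empty.
struct WriteStack {
  int t = -1;
  Write buf[kBufSz];

  bool empty() const { return t == -1; }
  void push(const Write &w) { assert(t + 1 < kBufSz); buf[++t] = w; }
  const Write &top() const { assert(t > -1); return buf[t]; }
  void pop() { assert(t > -1); --t; }
};

// One stack per pipeline slot. A write delayed by `delay` ticks lands in
// slot (tick + delay) % kPipeSz, as in the emitted push expression.
struct RegWrites {
  WriteStack slot[kPipeSz];

  void schedule(int tick, int delay, const Write &w) {
    assert(delay >= 0 && delay <= kMaxDelay);
    slot[(tick + delay) % kPipeSz].push(w);
  }

  // Value of the latest pending write to register idx that is due within
  // `dist` ticks of `tick`, or `def` if no such write is scheduled.
  int lookahead_latest(int dist, int tick, unsigned idx, int def) const {
    for (; dist > 0; --dist) {
      const WriteStack &s = slot[(tick + dist) % kPipeSz];
      for (int i = s.t; i >= 0; --i)
        if (s.buf[i].idx0 == idx) return s.buf[i].val;
    }
    return def;
  }
};
```

With a write to register 3 scheduled two ticks out, a lookahead of distance 2 finds it, while a distance of 1 or a different register index falls back to the default.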
 > @@ -375,16 +498,26 @@ typedef struct @prefix@_trace_record {
 >  
 >  ; Generate <cpu>-defs.h
 >  
 > +(define semantics-processed? #f)
 > +
 >  (define (cgen-defs.h)
 >    (logit 1 "Generating " (gen-cpu-name) " defs.h ...\n")
 >    (assert-keep-one)
 > -
 > +  
 >    ; Turn parallel execution support on if cpu needs it.
 >    (set-with-parallel?! (state-parallel-exec?))
 >  
 >    ; Initialize rtl->c generation.
 >    (rtl-c-config! #:rtl-cover-fns? #t)
 >  
 > +  (sim-analyze-insns!)
 > +
 > +  ; ensure semantic analysis has happened, in time
 > +  ; for the pipeline size to be calculated
 > +  (if (and (with-parallel?)
 > +	   (not semantics-processed?))
 > +      (error "defs.h must be generated after sem.cxx for parallel-execution type CPUs"))
 > +
 >    (string-write
 >     (gen-copyright "CPU family header for @cpu@ / @prefix@."
 >  		  copyright-red-hat package-red-hat-simulators)
 > @@ -392,15 +525,26 @@ typedef struct @prefix@_trace_record {
 >  #ifndef DEFS_@PREFIX@_H
 >  #define DEFS_@PREFIX@_H
 >  
 > +#include <stack>
 > +#include \"cgen-types.h\"
 > +
 > +// forward declaration\n\n  
 >  namespace @cpu@ {
 > +struct @cpu@_cpu;
 > +}
 > +
 > +namespace @prefix@ {
 > +
 > +using namespace cgen;
 > +
 >  \n"
 >  
 >     (if (with-parallel?)
 > -       gen-parallel-exec-type
 > -       "")
 > +       gen-write-stack-structure
 > +       "// no parallel-execution support\n")
 >  
 >     "\
 > -} // end @cpu@ namespace
 > +} // end @prefix@ namespace
 >  
 >  #endif /* DEFS_@PREFIX@_H */\n"
 >     )
 > @@ -417,47 +561,132 @@ namespace @cpu@ {
 >  ; Return C code to fetch and save all output operands to instructions with
 >  ; <sformat> SFMT.
 >  
 > -(define (-gen-write-args sfmt)
 > -  (string-map (lambda (op) (op:write op sfmt))
 > -	      (sfmt-out-ops sfmt))
 > -)
 > +; Generate <cpu>-write.cxx.
 >  
 > -; Utility of gen-write-fns to generate a writer function for <sformat> SFMT.
 >  
 > -(define (-gen-write-fn sfmt)
 > -  (logit 2 "Processing write function for \"" (obj:name sfmt) "\" ...\n")
 > -  (string-list
 > -   "\nsem_status\n"
 > -   (-gen-write-fn-name sfmt) " (@cpu@_cpu* current_cpu, @prefix@_scache* sem, @prefix@_parexec* par_exec)\n"
 > -   "{\n"
 > -   (if (with-scache?)
 > -       (gen-define-field-macro sfmt)
 > -       "")
 > -   (gen-define-parallel-operand-macro sfmt)
 > -   "  @prefix@_scache* abuf = sem;\n"
 > -   "  unsigned long long written = abuf->written;\n"
 > -   "  PCADDR pc = abuf->addr;\n"
 > -   "  PCADDR npc = 0; // dummy value for branches\n"
 > -   "  sem_status status = SEM_STATUS_NORMAL; // ditto\n"
 > -   "\n"
 > -   (-gen-write-args sfmt)
 > -   "\n"
 > -   "  return status;\n"
 > -   (gen-undef-parallel-operand-macro sfmt)
 > -   (if (with-scache?)
 > -       (gen-undef-field-macro sfmt)
 > -       "")
 > -   "}\n\n")
 > -)
 > +(define (-gen-register-writer nm mode dims)
 > +  (let* ((pad "    ")
 > +	 (sa string-append)
 > +	 (idx-args (string-map (lambda (x) (sa "w.idx" (number->string x) ", ")) 
 > +			       (seq 0 (- dims 1)))))
 > +    (sa pad "while (! " nm "_writes[tick].empty())\n"
 > +	pad "{\n"
 > +	pad "  write<" mode "> &w = " nm "_writes[tick].top();\n"
 > +	pad "  current_cpu->" nm "_set(" idx-args "w.val);\n"
 > +	pad "  " nm "_writes[tick].pop();\n"
 > +	pad "}\n\n")))
 > +
 > +(define (-gen-memory-writer nm mode dims)
 > +  (let* ((pad "    ")
 > +	 (sa string-append)
 > +	 (idx-args (string-map (lambda (x) (sa ", w.idx" (number->string x) "")) 
 > +			       (seq 0 (- dims 1)))))
 > +    (sa pad "while (! " nm "_writes[tick].empty())\n"
 > +	pad "{\n"
 > +	pad "  write<" mode "> &w = " nm "_writes[tick].top();\n"
 > +	pad "  current_cpu->SETMEM" mode " (w.pc" idx-args ", w.val);\n"
 > +	pad "  " nm "_writes[tick].pop();\n"
 > +	pad "}\n\n")))
 > +
 > +
 > +(define (-gen-reset-fn)
 > +  (let* ((sa string-append)
 > +	 (objs (append (map (lambda (h) (gen-c-symbol (obj:name h))) 
 > +			    (find register? (current-hw-list)))
 > +		       (map (lambda (m) (sa m "_memory")) useful-mode-names)))
 > +	 (clr (lambda (elt) (sa "    clear_stacks (" elt "_writes);\n"))))
 > +    (sa 
 > +     "  template <typename ST> \n"
 > +     "  static void clear_stacks (ST &st)\n"
 > +     "  {\n"
 > +     "    for (int i = 0; i < @prefix@::pipe_sz; i++)\n"
 > +     "      st[i].clear();\n"
 > +     "  }\n\n"
 > +     "  void @prefix@::write_stacks::reset ()\n  {\n"
 > +     (string-map clr objs)
 > +     "  }")))
 > +
 > +(define (-gen-unified-write-fn) 
 > +  (let* ((hw (find register? (current-hw-list)))
 > +	 (modes useful-mode-names)	
 > +	 (hw-triples (map (lambda (h) (list (gen-c-symbol (obj:name h))
 > +					    (obj:name (hw-mode h))
 > +					    (length (-hw-vector-dims h)))) 
 > +			hw))
 > +	 (mem-triples (map (lambda (m) (list (string-append m "_memory") m 1)) 
 > +			 modes)))
 >  
 > -(define (-gen-write-fns)
 > -  (logit 2 "Processing writer functions ...\n")
 > -  (string-write-map (lambda (sfmt) (-gen-write-fn sfmt))
 > -		    (current-sfmt-list))
 > -)
 > +    (logit 2 "Generating writer function ...\n") 
 > +    (string-append
 > +     "
 > +
 > +  void @prefix@::write_stacks::writeback (int tick, @cpu@::@cpu@_cpu* current_cpu) 
 > +  {
 > +"
 > +     "\n    // register writeback loops\n"
 > +     (string-map (lambda (t) (apply -gen-register-writer t)) hw-triples)
 > +     "\n    // memory writeback loops\n"
 > +     (string-map (lambda (t) (apply -gen-memory-writer t)) mem-triples)
 > +"
 > +  }
 > +")))
 >  
 >  
 > -; Generate <cpu>-write.cxx.
 > +(define (-gen-stacks-stream-and-destream-fns) 
 > +  (let* ((sa string-append)
 > +	 (regs (find hw-need-storage? (current-hw-list)))
 > +	 (reg-dim (lambda (r) 
 > +		    (let ((dims (-hw-vector-dims r)))
 > +		      (if (equal? 0 (length dims)) 
 > +			  "0"
 > +			  (number->string (car dims))))))
 > +	 (write-stacks 
 > +	  (map (lambda (n) (sa n "_writes"))
 > +	       (append (map (lambda (r) (gen-c-symbol (obj:name r))) regs)
 > +		       (map (lambda (m) (sa m "_memory")) useful-mode-names))))
 > +	 (stream-stacks (lambda (s) (sa "    stream_stacks ( s." s ", ost);\n")))
 > +	 (destream-stacks (lambda (s) (sa "    destream_stacks ( s." s ", ist);\n")))
 > +	 (stack-boilerplate
 > +	  (sa
 > +	   "  template <typename ST> \n"
 > +	   "  void stream_stacks (const ST &st, std::ostream &ost)\n"
 > +	   "  {\n"
 > +	   "    for (int i = 0; i < @prefix@::pipe_sz; i++)\n"
 > +	   "    {\n"
 > +	   "      ost << st[i].t << ' ';\n"
 > +	   "      for (int j = 0; j <= st[i].t; j++)\n"
 > +	   "      {\n"
 > +	   "        ost << st[i].buf[j].pc << ' ';\n"
 > +	   "        ost << st[i].buf[j].val << ' ';\n"
 > +	   "        ost << st[i].buf[j].idx0 << ' ';\n"
 > +	   "      }\n"
 > +	   "    }\n"
 > +	   "  }\n"
 > +	   "  \n"
 > +	   "  template <typename ST> \n"
 > +	   "  void destream_stacks (ST &st, std::istream &ist)\n"
 > +	   "  {\n"
 > +	   "    for (int i = 0; i < @prefix@::pipe_sz; i++)\n"
 > +	   "    {\n"
 > +	   "      ist >> st[i].t;\n"
 > +	   "      for (int j = 0; j <= st[i].t; j++)\n"
 > +	   "      {\n"
 > +	   "        ist >> st[i].buf[j].pc;\n"
 > +	   "        ist >> st[i].buf[j].val;\n"
 > +	   "        ist >> st[i].buf[j].idx0;\n"
 > +	   "      }\n"
 > +	   "    }\n"
 > +	   "  }\n"
 > +	   "  \n")))
 > +    (sa stack-boilerplate
 > +	"  std::ostream & @prefix@::operator<< (std::ostream &ost, const @prefix@::write_stacks &s)\n   {\n"
 > +	(string-map stream-stacks write-stacks)
 > +	"\n    return ost;\n"
 > +	"  }\n"
 > +	"  std::istream & @prefix@::operator>> (std::istream &ist, @prefix@::write_stacks &s)\n   {\n"
 > +	(string-map destream-stacks write-stacks)
 > +	"\n    return ist;\n"
 > +	"  }\n")))
 >  
 >  (define (cgen-write.cxx)
 >    (logit 1 "Generating " (gen-cpu-name) " write.cxx ...\n")
 > @@ -465,8 +694,8 @@ namespace @cpu@ {
 >  
 >    (sim-analyze-insns!)
 >  
 > -  ; Turn parallel execution support off.
 > -  (set-with-parallel?! #f)
 > +  ; Turn parallel execution support on if needed.
 > +  (set-with-parallel?! (state-parallel-exec?))
 >  
 >    ; Tell the rtx->c translator we are the simulator.
 >    (rtl-c-config! #:rtl-cover-fns? #t)
 > @@ -478,12 +707,18 @@ namespace @cpu@ {
 >     "\
 >  
 >  #include \"@cpu@.h\"
 > -using namespace @cpu@;
 > -
 > +#include <iostream>
 >  "
 > -   -gen-write-fns
 > +   (if (with-parallel?) 
 > +       (string-append
 > +	 (-gen-reset-fn)
 > +	 (-gen-unified-write-fn)
 > +	 (-gen-stacks-stream-and-destream-fns))
 > +
 > +       "// no write-stack functions required\n")
 >     )
 >  )
 > +
 >  \f
 >  ; ******************
 >  ; cgen-semantics.cxx
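As a rough illustration of the per-tick drain loop that -gen-register-writer produces inside the unified writeback function, here is a sketch assuming a single register file. Cpu, h_gr_set and WriteStacks::h_gr_writes are hypothetical names, std::vector stands in for the fixed-size write_stack, and the sketch reduces tick modulo the pipe size itself, whereas the generated loop indexes the array with tick as given.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the generated write<MODE> record.
struct Write {
  unsigned pc;
  int val;
  unsigned idx0;
};

// Hypothetical CPU with a four-entry register file and a generated-style setter.
struct Cpu {
  int gr[4] = {0, 0, 0, 0};
  void h_gr_set(unsigned idx, int val) { gr[idx] = val; }
};

constexpr int kPipeSz = 3;  // assumed max_delay + 1

struct WriteStacks {
  std::vector<Write> h_gr_writes[kPipeSz];

  // Apply and discard every write due in this tick's slot, top of stack first.
  void writeback(int tick, Cpu *cpu) {
    std::vector<Write> &s = h_gr_writes[tick % kPipeSz];
    while (!s.empty()) {
      const Write w = s.back();
      cpu->h_gr_set(w.idx0, w.val);
      s.pop_back();
    }
  }
};
```

Because each slot is drained as a stack, two writes to the same register that fall due on the same tick are applied in LIFO order, so in this sketch the one pushed first is the one whose value remains.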
 > @@ -521,19 +756,14 @@ using namespace @cpu@;
 >  	 "sem_status\n")
 >       "@prefix@_sem_" (gen-sym insn)
 >       (if (with-parallel?)
 > -	 " (@cpu@_cpu* current_cpu, @prefix@_scache* sem, @prefix@_parexec* par_exec)\n"
 > +	 (string-append " (@cpu@_cpu* current_cpu, @prefix@_scache* sem, const int tick, \n\t"
 > +			"@prefix@::write_stacks &buf)\n")
 >  	 " (@cpu@_cpu* current_cpu, @prefix@_scache* sem)\n")
 >       "{\n"
 >       (gen-define-field-macro (insn-sfmt insn))
 > -     (if (with-parallel?)
 > -	 (gen-define-parallel-operand-macro (insn-sfmt insn))
 > -	 "")
 >       "  sem_status status = SEM_STATUS_NORMAL;\n"
 >       "  @prefix@_scache* abuf = sem;\n"
 > -     ; Unconditionally written operands are not recorded here.
 > -     (if (or (with-profile?) (with-parallel-write?))
 > -	 "  unsigned long long written = 0;\n"
 > -	 "")
 > +
 >       ; The address of this insn, needed by extraction and semantic code.
 >       ; Note that the address recorded in the cpu state struct is not used.
 >       ; For faster engines that copy will be out of date.
 > @@ -542,23 +772,12 @@ using namespace @cpu@;
 >       "\n"
 >       (gen-semantic-code insn)
 >       "\n"
 > -     ; Only update what's been written if some are conditionally written.
 > -     ; Otherwise we know they're all written so there's no point in
 > -     ; keeping track.
 > -     (if (or (with-profile?) (with-parallel-write?))
 > -	 (if (-any-cond-written? (insn-sfmt insn))
 > -	     "  abuf->written = written;\n"
 > -	     "")
 > -	 "")
 >       (if cti?
 >  	 "  current_cpu->done_cti_insn (npc, status);\n"
 >  	 "  current_cpu->done_insn (npc, status);\n")
 >       (if (with-parallel?)
 >  	 ""
 >  	 "  return status;\n")
 > -     (if (with-parallel?)
 > -	 (gen-undef-parallel-operand-macro (insn-sfmt insn))
 > -	 "")
 >       (gen-undef-field-macro (insn-sfmt insn))
 >       "}\n\n"
 >       ))
 > @@ -576,13 +795,14 @@ using namespace @cpu@;
 >  ; Each instruction is implemented in its own function.
 >  
 >  (define (cgen-semantics.cxx)
 > -  (logit 1 "Generating " (gen-cpu-name) " semantics.cxx ...\n")
 > +  (logit 1 "Generating " (gen-cpu-name) " semantics.cxx ")
 >    (assert-keep-one)
 >  
 >    (sim-analyze-insns!)
 >  
 >    ; Turn parallel execution support on if cpu needs it.
 >    (set-with-parallel?! (state-parallel-exec?))
 > +  (logit 1 (if (state-parallel-exec?) " (parallel) ...\n" "...\n"))
 >  
 >    ; Tell the rtx->c translator we are the simulator.
 >    (rtl-c-config! #:rtl-cover-fns? #t)
 > @@ -590,6 +810,8 @@ using namespace @cpu@;
 >    ; Indicate we're currently not generating a pbb engine.
 >    (set-current-pbb-engine?! #f)
 >  
 > +  (set! semantics-processed? #t)
 > +
 >    (string-write
 >     (gen-copyright "Simulator instruction semantics for @prefix@."
 >  		  copyright-red-hat package-red-hat-simulators)
 > @@ -598,6 +820,7 @@ using namespace @cpu@;
 >  #include \"@cpu@.h\"
 >  
 >  using namespace @cpu@; // FIXME: namespace organization still wip
 > +using namespace @prefix@; // FIXME: namespace organization still wip
 >  
 >  #define GET_ATTR(name) GET_ATTR_##name ()
 >  
 > @@ -655,9 +878,6 @@ using namespace @cpu@; // FIXME: namespa
 >       (if (with-scache?)
 >  	 (gen-define-field-macro (insn-sfmt insn))
 >  	 "")
 > -     (if parallel?
 > -	 (gen-define-parallel-operand-macro (insn-sfmt insn))
 > -	 "")
 >       ; Unconditionally written operands are not recorded here.
 >       (if (or (with-profile?) (with-parallel-write?))
 >  	 "      unsigned long long written = 0;\n"
 > @@ -694,9 +914,6 @@ using namespace @cpu@; // FIXME: namespa
 >  	 (string-append "      pbb_br_npc = npc;\n"
 >  			"      pbb_br_status = br_status;\n")
 >  	 "")
 > -     (if parallel?
 > -	 (gen-undef-parallel-operand-macro (insn-sfmt insn))
 > -	 "")
 >       (if (with-scache?)
 >  	 (gen-undef-field-macro (insn-sfmt insn))
 >  	 "")
 > @@ -950,9 +1167,6 @@ struct @prefix@_pbb_label {
 >  			"      vpc = vpc + 1;\n")
 >  	 "")
 >       (gen-define-field-macro (sfrag-sfmt frag))
 > -     (if parallel?
 > -	 (gen-define-parallel-operand-macro (sfrag-sfmt frag))
 > -	 "")
 >       ; Unconditionally written operands are not recorded here.
 >       (if (or (with-profile?) (with-parallel-write?))
 >  	 "      unsigned long long written = 0;\n"
 > @@ -992,9 +1206,6 @@ struct @prefix@_pbb_label {
 >  	      (sfrag-trailer? frag))
 >  	 (string-append "      pbb_br_npc = npc;\n"
 >  			"      pbb_br_status = br_status;\n")
 > -	 "")
 > -     (if parallel?
 > -	 (gen-undef-parallel-operand-macro (sfrag-sfmt frag))
 >  	 "")
 >       (gen-undef-field-macro (sfrag-sfmt frag))
 >       "    }\n"
 > Index: rtl-c.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/rtl-c.scm,v
 > retrieving revision 1.4
 > diff -u -p -r1.4 rtl-c.scm
 > --- rtl-c.scm	8 Sep 2000 22:18:37 -0000	1.4
 > +++ rtl-c.scm	9 Jan 2003 03:22:25 -0000
 > @@ -1304,7 +1304,23 @@
 >  			"bad arg to `operand'" object-or-name)))
 >  )
 >  
 > -(define-fn xop (estate options mode object) object)
 > +(define-fn xop (estate options mode object) 
 > +  (let ((delayed (assoc '#:delay (estate-modifiers estate))))
 > +    (if (and delayed
 > +	     (equal? APPLICATION 'SID-SIMULATOR)
 > +	     (operand? object))
 > +	;; if we're looking at an operand inside a (delay ...) rtx, then we
 > +	;; are talking about a _delayed_ operand, which is a different
 > +	;; beast.  rather than try to work out what context we were
 > +	;; constructed within, we just clone the operand instance and set
 > +	;; the new one to have a delayed value. the setters and getters
 > +	;; will work it out.
 > +	(let ((obj (object-copy object))
 > +	      (amount (cadr delayed)))
 > +	  (op:set-delay! obj amount)
 > +	  obj)
 > +	;; else return the normal object
 > +	object)))
 >  
 >  (define-fn local (estate options mode object-or-name)
 >    (cond ((rtx-temp? object-or-name)
 > @@ -1363,9 +1379,38 @@
 >    (cx:make VOID "; /*clobber*/\n")
 >  )
 >  
 > -(define-fn delay (estate options mode n rtx)
 > -  (s-sequence (estate-with-modifiers estate '((#:delay))) VOID '() rtx) ; wip!
 > -)
 > +
 > +(define-fn delay (estate options mode num-node rtx)
 > +  (case APPLICATION
 > +    ((SID-SIMULATOR)
 > +     (let* ((n (cadddr num-node))
 > +	    (old-delay (let ((old (assoc '#:delay (estate-modifiers estate))))
 > +			 (if old (cadr old) 0)))
 > +	    (new-delay (+ n old-delay)))    
 > +       (begin
 > +	 ;; check for proper usage
 > +     	 (if (let* ((hw (case (car rtx) 
 > +			  ((operand) (op:type (rtx-operand-obj rtx)))
 > +			  ((xop) (op:type (rtx-xop-obj rtx)))
 > +			  (else #f))))		    	       
 > +	       (not (and hw (or (pc? hw) (memory? hw) (register? hw)))))
 > +	     (context-error 
 > +	      (estate-context estate) 
 > +	      (string-append 
 > +	       "(delay ...) rtx applied to wrong type of operand '" (car rtx) "'. should be pc, register or memory")))
 > +	 ;; signal an error if we're delayed and not in a "parallel-insns" CPU
 > +	 (if (not (with-parallel?)) 
 > +	     (context-error 	      
 > +	      (estate-context estate) 
 > +	      "delayed operand in a non-parallel cpu"))
 > +	 ;; update cpu-global pipeline bound
 > +	 (cpu-set-max-delay! (current-cpu) (max (cpu-max-delay (current-cpu)) new-delay))      
 > +	 ;; pass along new delay to embedded rtx
 > +	 (rtx-eval-with-estate rtx mode (estate-with-modifiers estate `((#:delay ,new-delay)))))))
 > +
 > +    ;; not in sid-land
 > +    (else (s-sequence (estate-with-modifiers estate '((#:delay))) VOID '() rtx))))
 > +
 >  
 >  ; Gets expanded as a macro.
 >  ;(define-fn annul (estate yes?)
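The bookkeeping in the SID branch of delay can be restated in a few lines of C++: nested delays add, and the CPU-wide maximum determines the pipe size used by the write stacks. DelayState, enter_delay and pipe_sz are illustrative names only; this is a sketch of the arithmetic, not of the rtx evaluation itself.

```cpp
#include <algorithm>
#include <cassert>

struct DelayState {
  int max_delay = 0;  // plays the role of the <cpu> max-delay slot

  // Entering (delay n ...) inside a context already delayed by `outer`
  // gives an effective delay of n + outer and raises the CPU-wide bound.
  int enter_delay(int n, int outer) {
    const int effective = n + outer;
    max_delay = std::max(max_delay, effective);
    return effective;
  }

  // One slot per possible delay, plus one for the current tick.
  int pipe_sz() const { return max_delay + 1; }
};
```

For instance, a (delay 2 ...) nested inside a (delay 1 ...) yields an effective delay of 3, so a port whose deepest nesting is that pair would end up with a pipe size of 4.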
 > Index: operand.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/operand.scm,v
 > retrieving revision 1.5
 > diff -u -p -r1.5 operand.scm
 > --- operand.scm	20 Dec 2002 06:39:04 -0000	1.5
 > +++ operand.scm	9 Jan 2003 03:22:29 -0000
 > @@ -90,6 +90,9 @@
 >  		; referenced.  #f means the operand is always referenced by
 >  		; the instruction.
 >  		(cond? . #f)
 > +		
 > +		; whether (and by how much) this instance of the operand is delayed.
 > +		(delayed . #f)
 >  		)
 >  	      nil)
 >  )
 > @@ -135,6 +138,8 @@
 >  (define op:set-num! (elm-make-setter <operand> 'num))
 >  (define op:cond? (elm-make-getter <operand> 'cond?))
 >  (define op:set-cond?! (elm-make-setter <operand> 'cond?))
 > +(define op:delay (elm-make-getter <operand> 'delayed))
 > +(define op:set-delay! (elm-make-setter <operand> 'delayed))
 >  
 >  ; Compute the hardware type lazily.
 >  ; FIXME: op:type should be named op:hwtype or some such.
 > Index: mach.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/mach.scm,v
 > retrieving revision 1.2
 > diff -u -p -r1.2 mach.scm
 > --- mach.scm	12 Jul 2001 02:32:25 -0000	1.2
 > +++ mach.scm	9 Jan 2003 03:22:31 -0000
 > @@ -755,8 +755,7 @@
 >    (apply min (cons 65535
 >  		   (map insn-length (find (lambda (insn)
 >  					    (and (not (has-attr? insn 'ALIAS))
 > -						 (eq? (obj-attr-value insn 'ISA)
 > -						      (obj:name isa))))
 > +						 (isa-supports? isa insn)))
 >  					  (non-multi-insns (current-insn-list))))))
 >  )
 >  
 > @@ -765,9 +764,8 @@
 >    ; [a language with infinite precision can't have max-reduce-iota-0 :-)]
 >    (apply max (cons 0
 >  		   (map insn-length (find (lambda (insn)
 > -					    (and (not (has-attr? insn 'ALIAS))
 > -						 (eq? (obj-attr-value insn 'ISA)
 > -						      (obj:name isa))))
 > +					  (and (not (has-attr? insn 'ALIAS))
 > +						 (isa-supports? isa insn)))
 >  					  (non-multi-insns (current-insn-list))))))
 >  )
 >  
 > @@ -1008,13 +1006,19 @@
 >  		; Allow a cpu family to override the isa parallel-insns spec.
 >  		; ??? Concession to the m32r port which can go away, in time.
 >  		parallel-insns
 > +
 > +		; Computed: maximum number of insns which may pass before
 > +		; an insn writes back its output operands.
 > +		max-delay
 > +
 >  		)
 >  	      nil)
 >  )
 >  
 >  ; Accessors.
 >  
 > -(define-getters <cpu> cpu (word-bitsize insn-chunk-bitsize file-transform parallel-insns))
 > +(define-getters <cpu> cpu (word-bitsize insn-chunk-bitsize file-transform parallel-insns max-delay))
 > +(define-setters <cpu> cpu (max-delay))
 >  
 >  ; Return endianness of instructions.
 >  
 > @@ -1064,7 +1068,9 @@
 >  	      word-bitsize
 >  	      insn-chunk-bitsize
 >  	      file-transform
 > -	      parallel-insns)
 > +	      parallel-insns
 > +	      0 ; default max-delay. will compute correct value
 > +	      )
 >  	(begin
 >  	  (logit 2 "Ignoring " name ".\n")
 >  	  #f))) ; cpu is not to be kept
 > @@ -1284,13 +1290,13 @@
 >    ; Assert only one cpu family has been selected.
 >    (assert-keep-one)
 >  
 > -  (let ((par-insns (map isa-parallel-insns (current-isa-list)))
 > +  (let ((false->zero (lambda (x) (if x x 0)))
 > +	(par-insns (map isa-parallel-insns (current-isa-list)))
 >  	(cpu-par-insns (cpu-parallel-insns (current-cpu))))
 >      ; ??? The m32r does have parallel execution, but to keep support for the
 >      ; base mach simpler, a cpu family is allowed to override the isa spec.
 > -    (or cpu-par-insns
 > -	; FIXME: ensure all have same value.
 > -	(car par-insns)))
 > +    (max (false->zero cpu-par-insns) 
 > +	 (apply max (map false->zero par-insns))))
 >  )
 >  
 >  ; Return boolean indicating if parallel execution support is required.
 > Index: dev.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/dev.scm,v
 > retrieving revision 1.5
 > diff -u -p -r1.5 dev.scm
 > --- dev.scm	21 Dec 2002 22:22:33 -0000	1.5
 > +++ dev.scm	9 Jan 2003 03:22:31 -0000
 > @@ -115,7 +115,7 @@
 >    (load "sid-model")
 >    (load "sid-decode")
 >    (set! verbose-level 3)
 > -  (set! APPLICATION 'SIMULATOR)
 > +  (set! APPLICATION 'SID-SIMULATOR)
 >  )
 >  
 >  (define (load-sim)
 > Index: doc/rtl.texi
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/doc/rtl.texi,v
 > retrieving revision 1.17
 > diff -u -p -r1.17 rtl.texi
 > --- doc/rtl.texi	22 Dec 2002 04:49:26 -0000	1.17
 > +++ doc/rtl.texi	9 Jan 2003 03:22:34 -0000
 > @@ -1833,7 +1833,7 @@ This is a character string consisting of
 >  Fields are denoted by @code{$operand} or
 >  @code{$@{operand@}}@footnote{Support for @code{$@{operand@}} is
 >  work-in-progress.}.  If a @samp{$} is required in the syntax, it is
 > -specified with @samp{\$}.  At most one white-space character may be
 > +specified with @samp{$$}.  At most one white-space character may be
 >  present and it must be a blank separating the instruction mnemonic from
 >  the operands.  This doesn't restrict the user's assembler, this is
 >  @c Is this reasonable?
 > @@ -2257,10 +2257,39 @@ first argument.
 >  Indicate that @samp{object} is written in mode @samp{mode}, without
 >  saying how. This could be useful in conjunction with the C escape hooks.
 >  
 > -@item (delay mode num expr)
 > -Indicate that there are @samp{num} delay slots in the processing of
 > -@samp{expr}.  When using this rtx in instruction semantics, CGEN will
 > -infer that the instruction has the DELAY-SLOT attribute.
 > +@item (delay num expr)
 > +In older "sim" simulators, indicates that there are @samp{num} delay
 > +slots in the processing of @samp{expr}. When using this rtx in instruction
 > +semantics, CGEN will infer that the instruction has the DELAY-SLOT
 > +attribute.  
 > +
 > +In newer "sid" simulators, evaluates to the writeback queue for hardware
 > +operand @samp{expr}, at @samp{num} instruction cycles in the
 > +future. @samp{expr} @emph{must} be a hardware operand in this case. 
 > +
 > +For example, @code{(set (delay 3 pc) (+ pc 1))} will schedule a write to
 > +the @samp{pc} register in the writeback phase of the 3rd instruction
 > +after the current. Alternatively, @code{(set gr1 (delay 3 gr2))} will
 > +immediately update the @samp{gr1} register with the @emph{latest write}
 > +to the @samp{gr2} register scheduled between the present and 3
 > +instructions in the future. @code{(delay 0 ...)}  refers to the
 > +writeback phase of the current instruction.
 > +
 > +This effect is modeled with a circular buffer of "write stacks" for each
 > +hardware element (register banks get a single stack). The size of the
 > +circular buffer is calculated from the uses of @code{(delay ...)} 
 > +rtxs. When a delayed write occurs, the simulator pushes the write onto
 > +the appropriate write stack in the "future" of the circular buffer for
 > +the written-to hardware element. At the end of each instruction cycle,
 > +the simulator executes all writes in all write stacks for the time slice
 > +just ending. When a delayed read (essentially a pipeline bypass) occurs,
 > +the simulator looks ahead in the circular buffer for any writes
 > +scheduled in the future write stack. If it doesn't find one, it
 > +progressively backs off towards the "current" instruction cycle's write
 > +stack, and if it still finds no scheduled writes then it returns the
 > +current state of the CPU. Thus while delayed writes are fast, delayed
 > +reads are potentially slower in a simulator with long pipelines and very
 > +large register banks.
 >  
 >  @item (annul yes?)
 >  @c FIXME: put annul into the glossary.
 > 
 > 
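The write-stack scheme described in the rtl.texi text above can be modelled
in a few lines.  What follows is only a minimal Python sketch, not the code
sid generates: the class name WriteQueue, its methods and the dict used as
register file are all hypothetical, and for brevity it keeps one stack per
time slice instead of one per hardware element.

```python
class WriteQueue:
    """Toy model of a circular buffer of write stacks (names are hypothetical)."""

    def __init__(self, max_delay):
        # One slot per time slice: the current cycle plus max_delay future ones.
        self.size = max_delay + 1
        self.slots = [[] for _ in range(self.size)]
        self.now = 0      # index of the current cycle's write stack
        self.regs = {}    # committed CPU state

    def write(self, delay, reg, value):
        # Delayed write: push onto the stack `delay` cycles in the future.
        assert 0 <= delay < self.size
        self.slots[(self.now + delay) % self.size].append((reg, value))

    def read(self, delay, reg):
        # Delayed read: start at the furthest-future slot, then back off
        # towards the current cycle; within a slot the latest push wins.
        assert 0 <= delay < self.size
        for d in range(delay, -1, -1):
            for r, v in reversed(self.slots[(self.now + d) % self.size]):
                if r == reg:
                    return v
        # No scheduled write found: fall back to the committed state.
        return self.regs.get(reg)

    def end_cycle(self):
        # Writeback: apply every write queued for the slice just ending.
        for reg, value in self.slots[self.now]:
            self.regs[reg] = value
        self.slots[self.now] = []
        self.now = (self.now + 1) % self.size
```

In this model a write is a single append, while a read may scan up to
delay + 1 slots, which matches the cost asymmetry the documentation mentions.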

^ permalink raw reply	[flat|nested] 12+ messages in thread

* exposed pipeline patch (long!)
  2003-01-09  3:27 exposed pipeline patch (long!) Ben Elliston
  2003-01-09  6:41 ` Doug Evans
@ 2003-01-09  6:55 ` Doug Evans
  2003-01-09  7:24 ` Doug Evans
  2003-01-12 17:54 ` Doug Evans
  3 siblings, 0 replies; 12+ messages in thread
From: Doug Evans @ 2003-01-09  6:55 UTC (permalink / raw)
  To: Ben Elliston; +Cc: cgen

Ben Elliston writes:
 > I'm posting this patch on behalf of Graydon Hoare, who wrote this
 > exposed pipeline support last year.  It's a more generalised form of
 > the (delay ..) rtx and has been used for a couple of ports already.

Also, nit.  I'd rather not add seq when there's already iota.

 > Index: utils.scm
 > +(define (seq p q) (if (> p q) '() (cons p (seq (+ p 1) q))))
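For comparison, the inclusive range that seq builds can be expressed through
iota.  The sketch below is Python, not cgen code, and it assumes iota takes a
count and an optional start in the SRFI-1 argument order; the helper name
seq_via_iota is made up for illustration.

```python
def seq(p, q):
    """Inclusive integer range p..q, mirroring the patch's recursive (seq p q)."""
    return [] if p > q else [p] + seq(p + 1, q)


def iota(count, start=0):
    """`count` consecutive integers beginning at `start` (assumed signature)."""
    return list(range(start, start + count))


def seq_via_iota(p, q):
    # The inclusive range p..q has q - p + 1 elements; clamp at 0 when p > q.
    return iota(max(0, q - p + 1), p)
```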

^ permalink raw reply	[flat|nested] 12+ messages in thread

* exposed pipeline patch (long!)
  2003-01-09  3:27 exposed pipeline patch (long!) Ben Elliston
  2003-01-09  6:41 ` Doug Evans
  2003-01-09  6:55 ` Doug Evans
@ 2003-01-09  7:24 ` Doug Evans
  2003-01-09 17:17   ` Frank Ch. Eigler
  2003-01-12 17:54 ` Doug Evans
  3 siblings, 1 reply; 12+ messages in thread
From: Doug Evans @ 2003-01-09  7:24 UTC (permalink / raw)
  To: Ben Elliston; +Cc: cgen

Handling exposed pipelines can get really messy
when one takes bypass networks into account.

Question: For the ports in question, are the delays ISA related
or implementation related?

If they're ISA related then specifying the delays in rtl is appropriate.

If they're implementation related (e.g. related to the depth of
the pipeline), then I think rtl isn't the way to go.
[I suppose an ISA could specify the depth of the pipeline
but that wouldn't be the norm.]
One way to go would be to specify the hazards independently of the rtl.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: exposed pipeline patch (long!)
  2003-01-09  7:24 ` Doug Evans
@ 2003-01-09 17:17   ` Frank Ch. Eigler
  0 siblings, 0 replies; 12+ messages in thread
From: Frank Ch. Eigler @ 2003-01-09 17:17 UTC (permalink / raw)
  To: Doug Evans; +Cc: cgen


dje wrote:

> Handling exposed pipelines can get really messy
> when one takes bypass networks into account.

... right, which is why we don't model bypass networks at all in RTL.

> Question: For the ports in question, are the delays ISA related
> or implementation related?
> If they're ISA related then specifying the delays in rtl is appropriate.
> [...]

That's right.

> One way to go would be to specify the hazards independently of the rtl.

Right, and to some extent, the cgen "model" machinery may enable this.
They are orthogonal animals.


- FChE

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: exposed pipeline patch (long!)
  2003-01-09  6:41 ` Doug Evans
@ 2003-01-09 17:55   ` Frank Ch. Eigler
  2003-01-09 18:36     ` Doug Evans
  2003-01-09 19:12     ` graydon hoare
  0 siblings, 2 replies; 12+ messages in thread
From: Frank Ch. Eigler @ 2003-01-09 17:55 UTC (permalink / raw)
  To: Doug Evans; +Cc: cgen


dje wrote:

> I have big problems with this patch.
> [...]

> One must have compelling reasons for moving or putting application
> specific stuff into the non-application specific parts of cgen [...]
> Referring to APPLICATION in rtl-c.scm.  Blech.
> 
>  > -(define-fn xop (estate options mode object) object)
>  > +(define-fn xop (estate options mode object) 
>  > +  (let ((delayed (assoc '#:delay (estate-modifiers estate))))
>  > +    (if (and delayed
>  > +	     (equal? APPLICATION 'SID-SIMULATOR)
>  > +	     (operand? object))

I believe this was added with good intentions: because the "delay"
operator name was already in some token use for older sim ports, and
we did not want to break them.  The new delay operator actually does
something, and when/if sim-side support is added, this rtl-c hack can
go away.  IIRC, the old delay operator did nothing except signal that
an abstract delay slot exists for the instruction in whose RTL the
operator appears someplace.  If someone is genuinely fond of this
meaning, then I propose renaming it to something else.


- FChE

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: exposed pipeline patch (long!)
  2003-01-09 17:55   ` Frank Ch. Eigler
@ 2003-01-09 18:36     ` Doug Evans
  2003-01-09 19:14       ` Frank Ch. Eigler
  2003-01-09 19:12     ` graydon hoare
  1 sibling, 1 reply; 12+ messages in thread
From: Doug Evans @ 2003-01-09 18:36 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: cgen

Frank Ch. Eigler writes:
 > 
 > dje wrote:
 > 
 > > I have big problems with this patch.
 > > [...]
 > 
 > > One must have compelling reasons for moving or putting application
 > > specific stuff into the non-application specific parts of cgen [...]
 > > Referring to APPLICATION in rtl-c.scm.  Blech.
 > > 
 > >  > -(define-fn xop (estate options mode object) object)
 > >  > +(define-fn xop (estate options mode object) 
 > >  > +  (let ((delayed (assoc '#:delay (estate-modifiers estate))))
 > >  > +    (if (and delayed
 > >  > +	     (equal? APPLICATION 'SID-SIMULATOR)
 > >  > +	     (operand? object))
 > 
 > I believe this was added with good intentions: because the "delay"
 > operator name was already in some token use for older sim ports, and
 > we did not want to break them.  The new delay operator actually does
 > something, and when/if sim-side support is added, this rtl-c hack can
 > go away.  IIRC, the old delay operator did nothing except signal that
 > an abstract delay slot exists for the instruction in whose RTL the
 > operator appears someplace.  If someone is genuinely fond of this
 > meaning, then I propose renaming it to something else.

Red Hat can keep this as a local mod of course, but
the FSF would never readily accept a target-specific hack like this to gcc
(#ifdef TARGET_Z8000 in expr.c for example).  I think the same principle
should apply here.

I don't think the situation is all that bleak though.

First, we need to separate architecture description from application usage.
RTL is all about abstract description.

What are the ports in question?
Can I see the rtl where the new `delay' is used?  (Are they checked in?)

If they're still under NDA I'm sure you can come up with an independent
example of how one would write rtl with the new `delay'.
[You'd have to do that anyway for anyone wanting to use the new `delay'.]
Then let's compare it, at the rtl level, with the old delay,
and go from there.

I'm guessing the difference is that the existing delay specifies when
an assignment happens (more or less) and the new delay specifies
when an operand is ready.  If there is no reasonable way to merge them
then clearly we need new rtl (assuming the new delay is for ISA-related
delays and not implementation-related delays: I didn't get an answer to
my question: for the ports that need this patch, are the delays ISA
related or implementation-related?  I think
you answered it in a previous message but I couldn't be sure).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: exposed pipeline patch (long!)
  2003-01-09 17:55   ` Frank Ch. Eigler
  2003-01-09 18:36     ` Doug Evans
@ 2003-01-09 19:12     ` graydon hoare
  2003-01-12 17:21       ` Doug Evans
  1 sibling, 1 reply; 12+ messages in thread
From: graydon hoare @ 2003-01-09 19:12 UTC (permalink / raw)
  To: cgen public list

On Thu, 2003-01-09 at 12:55, Frank Ch. Eigler wrote:

> I believe this was added with good intentions: because the "delay"
> operator name was already in some token use for older sim ports, and
> we did not want to break them.

yes. the semantics were enhanced to a superset of those in "sim" common.
unfortunately rtl trees seem to be given a fixed operational semantics
in terms of the C code they generate. there is, from my limited
understanding, no other place in which an application is *able* to alter
interpretation of the (delay...) rtl node, or its sub-nodes, without
replacing parts of "the" rtl-evaluator.

I certainly would have liked to put all these changes in *-sid.scm, I
just didn't find any practical way to do so. any suggestions?

-graydon

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: exposed pipeline patch (long!)
  2003-01-09 18:36     ` Doug Evans
@ 2003-01-09 19:14       ` Frank Ch. Eigler
  2003-01-12 17:05         ` Doug Evans
  0 siblings, 1 reply; 12+ messages in thread
From: Frank Ch. Eigler @ 2003-01-09 19:14 UTC (permalink / raw)
  To: Doug Evans; +Cc: cgen


dje wrote:

> [...]
> First, we need to separate architecture description from application usage.
> RTL is all about abstract description.

Right.

> What are the ports in question?
> Can I see the rtl where the new `delay' is used?  (Are they checked in?)

Hmm, that's an oversight.  There are indeed a couple of ports that
haven't been released yet.  Here is a generic example:

NEW:  (if (test) (set (delay 1 foo) bar))
OLD:  (delay 1 (if (test) (set foo bar)))

The new meaning makes "delay" a property of an lvalue being assigned
to at a future time (== enqueuing position).  The old one applies
"delay" to a generic bunch of computation.  The former definition is
sufficient to model ordinary branch delay slots, or functional units
that expose taking their sweet time.  The latter was never actually
implemented in a way a reader may expect: there is no delay in the
"if" or anything associated with foo or bar; there is no use of the
"1" value.


> I'm guessing the difference is that the existing delay specifies when
> an assignment happens (more or less) 

(Well, less - no actual deferred computation mechanism had existed in
cgen or the runtimes.)

> and the new delay specifies when an operand is ready. 

Right.

> [...] for the ports that need this patch, are the delays ISA related [...]

These are ISA defined.
That's what makes the pipelines exposed to the asm programmer.
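
To make the NEW form above concrete, here is a small Python model of an
ISA-defined branch delay slot written as (if (test) (set (delay 1 pc) target)).
It is a sketch under stated assumptions, not simulator code: the run and
branch functions, the flag register and the target address 10 are all
invented for the example.

```python
def run(program, steps, flag):
    """Execute `steps` instructions and return the sequence of pc values."""
    regs = {"pc": 0, "flag": flag}
    pending = {}    # cycle number -> list of (reg, value) writes due then
    trace = []
    for cycle in range(steps):
        pc = regs["pc"]
        trace.append(pc)
        insn = program.get(pc)
        # An instruction returns (delay, reg, value) triples to enqueue.
        for delay, reg, value in (insn(regs) if insn else []):
            pending.setdefault(cycle + delay, []).append((reg, value))
        # Default fall-through, then the writeback phase for this cycle.
        regs["pc"] = pc + 1
        for reg, value in pending.pop(cycle, []):
            regs[reg] = value
    return trace


def branch(regs):
    # NEW: (if (test) (set (delay 1 pc) target)), with target = 10
    return [(1, "pc", 10)] if regs["flag"] else []
```

With the branch placed at address 1 and the test true, run({1: branch}, 5, 1)
visits addresses 0, 1, 2, 10, 11: the instruction at 2 still executes before
the pc write lands.  With the test false, nothing is enqueued at all.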


- FChE

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: exposed pipeline patch (long!)
  2003-01-09 19:14       ` Frank Ch. Eigler
@ 2003-01-12 17:05         ` Doug Evans
  0 siblings, 0 replies; 12+ messages in thread
From: Doug Evans @ 2003-01-12 17:05 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: cgen

Frank Ch. Eigler writes:
 > > Can I see the rtl where the new `delay' is used?  (Are they checked in?)
 > 
 > [...] Here is a generic example:
 > 
 > NEW:  (if (test) (set (delay 1 foo) bar))
 > OLD:  (delay 1 (if (test) (set foo bar)))
 > 
 > The new meaning makes "delay" a property of an lvalue being assigned
 > to at a future time (== enqueuing position).  The old one applies
 > "delay" to a generic bunch of computation.  The former definition is
 > sufficient to model ordinary branch delay slots, or functional units
 > that expose taking their sweet time.

 > The latter was never actually
 > implemented in a way a reader may expect: there is no delay in the
 > "if" or anything associated with foo or bar; there is no use of the
 > "1" value.

For the purposes of this discussion (deciding whether the two delays
are mergeable) I'm separating description from application.

I have this feeling that branch delay slots are a sufficiently different beast
than the delays of exposed pipelines.
Sure, one can argue the former is just an example of the latter.
But there are various semantics associated with branch delay slots.
OTOH, I don't have too strong an opinion I guess.

So next question.  The sid behaviour is special cased for fear
of breaking existing ports (or some such IIRC).
Presumably we can do a sufficiently reasonable job just by
comparing generated files before/after (and in particular if in
the first pass we take steps to remove all diffs in the generated
files then the confidence goes up high enough for me to make the change).
So _if_ delays are mergeable and _if_ the new definition of delay is what we
want, why not just go with it?
[I still have questions of the implementation though (to follow).]

 > > [...] for the ports that need this patch, are the delays ISA related [...]
 > 
 > These are ISA defined.
 > That's what makes the pipelines exposed to the asm programmer.

An ISA might specify that the pipeline is exposed, but leave the
actual delay to fall out from whatever the implementation ends
up being.  I wanted to make sure you're not talking about this case,
and that you really want the delays in the rtl.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: exposed pipeline patch (long!)
  2003-01-09 19:12     ` graydon hoare
@ 2003-01-12 17:21       ` Doug Evans
  0 siblings, 0 replies; 12+ messages in thread
From: Doug Evans @ 2003-01-12 17:21 UTC (permalink / raw)
  To: graydon hoare; +Cc: cgen public list

graydon hoare writes:
 > On Thu, 2003-01-09 at 12:55, Frank Ch. Eigler wrote:
 > 
 > > I believe this was added with good intentions: because the "delay"
 > > operator name was already in some token use for older sim ports, and
 > > we did not want to break them.
 > 
 > yes. the semantics were enhanced to a superset of those in "sim" common.
 > unfortunately rtl trees seem to be given a fixed operational semantics
 > in terms of the C code they generate. there is, from my limited
 > understanding, no other place in which an application is *able* to alter
 > interpretation of the (delay...) rtl node, or its sub-nodes, without
 > replacing parts of "the" rtl-evaluator.
 > 
 > I certainly would have liked to put all these changes in *-sid.scm, I
 > just didn't find any practical way to do so. any suggestions?

Adding hooks in various places for applications to use is something
I've thought about.  Have wanted to postpone doing so until the
need was compelling.

OTOH, as long as the internal rtl records what's in the .cpu file,
it's up to the file generators to deal with interpretation of the rtl.
File generators shouldn't be thinking they need to "alter interpretation"
while rtl is being processed by cgen-proper.
cgen-proper records some basic things while processing rtl, but that's it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* exposed pipeline patch (long!)
  2003-01-09  3:27 exposed pipeline patch (long!) Ben Elliston
                   ` (2 preceding siblings ...)
  2003-01-09  7:24 ` Doug Evans
@ 2003-01-12 17:54 ` Doug Evans
  3 siblings, 0 replies; 12+ messages in thread
From: Doug Evans @ 2003-01-12 17:54 UTC (permalink / raw)
  To: cgen

Ben Elliston writes:
 > --- rtl-c.scm	8 Sep 2000 22:18:37 -0000	1.4
 > +++ rtl-c.scm	9 Jan 2003 03:22:25 -0000
 > @@ -1304,7 +1304,23 @@
 >  			"bad arg to `operand'" object-or-name)))
 >  )
 >  
 > -(define-fn xop (estate options mode object) object)
 > +(define-fn xop (estate options mode object) 
 > +  (let ((delayed (assoc '#:delay (estate-modifiers estate))))
 > +    (if (and delayed
 > +	     (equal? APPLICATION 'SID-SIMULATOR)
 > +	     (operand? object))
 > +	;; if we're looking at an operand inside a (delay ...) rtx, then we
 > +	;; are talking about a _delayed_ operand, which is a different
 > +	;; beast.  rather than try to work out what context we were
 > +	;; constructed within, we just clone the operand instance and set
 > +	;; the new one to have a delayed value. the setters and getters
 > +	;; will work it out.
 > +	(let ((obj (object-copy object))
 > +	      (amount (cadr delayed)))
 > +	  (op:set-delay! obj amount)
 > +	  obj)
 > +	;; else return the normal object
 > +	object)))

This feels like something semantic-compile would do.
Any reason to not do this there?

 > -(define-fn delay (estate options mode n rtx)
 > -  (s-sequence (estate-with-modifiers estate '((#:delay))) VOID '() rtx) ; wip!
 > -)
 > +(define-fn delay (estate options mode num-node rtx)
 > +  (case APPLICATION
 > +    ((SID-SIMULATOR)
 > +     (let* ((n (cadddr num-node))
 > +	    (old-delay (let ((old (assoc '#:delay (estate-modifiers estate))))
 > +			 (if old (cadr old) 0)))
 > +	    (new-delay (+ n old-delay)))    
 > +       (begin
 > +	 ;; check for proper usage
 > +     	 (if (let* ((hw (case (car rtx) 
 > +			  ((operand) (op:type (rtx-operand-obj rtx)))
 > +			  ((xop) (op:type (rtx-xop-obj rtx)))
 > +			  (else #f))))		    	       
 > +	       (not (and hw (or (pc? hw) (memory? hw) (register? hw)))))
 > +	     (context-error 
 > +	      (estate-context estate) 
 > +	      (string-append 
 > +	       "(delay ...) rtx applied to wrong type of operand '" (car rtx) "'. should be pc, register or memory")))
 > +	 ;; signal an error if we're delayed and not in a "parallel-insns" CPU
 > +	 (if (not (with-parallel?)) 
 > +	     (context-error 	      
 > +	      (estate-context estate) 
 > +	      "delayed operand in a non-parallel cpu"))
 > +	 ;; update cpu-global pipeline bound
 > +	 (cpu-set-max-delay! (current-cpu) (max (cpu-max-delay (current-cpu)) new-delay))      
 > +	 ;; pass along new delay to embedded rtx
 > +	 (rtx-eval-with-estate rtx mode (estate-with-modifiers estate `((#:delay ,new-delay)))))))
 > +
 > +    ;; not in sid-land
 > +    (else (s-sequence (estate-with-modifiers estate '((#:delay))) VOID '() rtx))))

The check for with-parallel? needs to be removed.
[If you want to move it to sid go for it.]

Calling cpu-set-max-delay! here is wrong.
If we want it done in cgen-proper, the general place to put this is in
semantic-compile.

NOTE: Apps are perfectly free to have their own post processing pass
of the rtl and have their own file like rtl-c.scm that works on the
post-processed form.  This is one perfectly legit way to go, and
it doesn't require hooks.
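
As a rough illustration of such a post-processing pass, the maximum delay
could be computed by walking the rtl once, with no side effects during code
generation.  The Python sketch below assumes rtx trees are plain nested lists
with integer delay counts (in the patch the count is itself an rtx node), and
the function name max_delay is hypothetical.

```python
def max_delay(rtx, outer=0):
    """Largest accumulated delay found anywhere in a nested-list rtx."""
    if not isinstance(rtx, (list, tuple)) or not rtx:
        return outer
    if rtx[0] == "delay" and len(rtx) == 3:
        # (delay num expr): nested delays add up, like new-delay in the patch.
        _, num, expr = rtx
        return max_delay(expr, outer + num)
    # Any other node: take the maximum over its sub-expressions.
    return max([outer] + [max_delay(sub, outer) for sub in rtx[1:]])
```

A cpu-wide bound would then be the maximum of max_delay over the semantics
of every instruction.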

 >  ; Gets expanded as a macro.
 >  ;(define-fn annul (estate yes?)
 > Index: operand.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/operand.scm,v
 > retrieving revision 1.5
 > diff -u -p -r1.5 operand.scm
 > --- operand.scm	20 Dec 2002 06:39:04 -0000	1.5
 > +++ operand.scm	9 Jan 2003 03:22:29 -0000
 > @@ -90,6 +90,9 @@
 >  		; referenced.  #f means the operand is always referenced by
 >  		; the instruction.
 >  		(cond? . #f)
 > +		
 > +		; whether (and by how much) this instance of the operand is delayed.
 > +		(delayed . #f)
 >  		)
 >  	      nil)
 >  )
 > @@ -135,6 +138,8 @@
 >  (define op:set-num! (elm-make-setter <operand> 'num))
 >  (define op:cond? (elm-make-getter <operand> 'cond?))
 >  (define op:set-cond?! (elm-make-setter <operand> 'cond?))
 > +(define op:delay (elm-make-getter <operand> 'delayed))
 > +(define op:set-delay! (elm-make-setter <operand> 'delayed))

I _think_ adding `delayed' to <operand> is ok.
Guess I don't have a strong opinion.

 >  ; Compute the hardware type lazily.
 >  ; FIXME: op:type should be named op:hwtype or some such.
 > Index: mach.scm
 > ===================================================================
 > RCS file: /cvs/src/src/cgen/mach.scm,v
 > retrieving revision 1.2
 > diff -u -p -r1.2 mach.scm
 > --- mach.scm	12 Jul 2001 02:32:25 -0000	1.2
 > +++ mach.scm	9 Jan 2003 03:22:31 -0000
 > @@ -755,8 +755,7 @@
 >    (apply min (cons 65535
 >  		   (map insn-length (find (lambda (insn)
 >  					    (and (not (has-attr? insn 'ALIAS))
 > -						 (eq? (obj-attr-value insn 'ISA)
 > -						      (obj:name isa))))
 > +						 (isa-supports? isa insn)))
 >  					  (non-multi-insns (current-insn-list))))))
 >  )
 >  
 > @@ -765,9 +764,8 @@
 >    ; [a language with infinite precision can't have max-reduce-iota-0 :-)]
 >    (apply max (cons 0
 >  		   (map insn-length (find (lambda (insn)
 > -					    (and (not (has-attr? insn 'ALIAS))
 > -						 (eq? (obj-attr-value insn 'ISA)
 > -						      (obj:name isa))))
 > +					  (and (not (has-attr? insn 'ALIAS))
 > +						 (isa-supports? isa insn)))
 >  					  (non-multi-insns (current-insn-list))))))
 >  )

I'm guessing these are just no-op simplifications.
They can go in of course; file them as a separate patch.
[Note: I don't mind this being included here.
Just trying to help y'all make forward progress.]

 > @@ -1008,13 +1006,19 @@
 >  		; Allow a cpu family to override the isa parallel-insns spec.
 >  		; ??? Concession to the m32r port which can go away, in time.
 >  		parallel-insns
 > +
 > +		; Computed: maximum number of insns which may pass before
 > +		; an insn writes back its output operands.
 > +		max-delay
 > +
 >  		)
 >  	      nil)
 >  )
 >  
 >  ; Accessors.
 >  
 > -(define-getters <cpu> cpu (word-bitsize insn-chunk-bitsize file-transform parallel-insns))
 > +(define-getters <cpu> cpu (word-bitsize insn-chunk-bitsize file-transform parallel-insns max-delay))
 > +(define-setters <cpu> cpu (max-delay))

I dunno about this one, but I guess I don't have a strong opinion.

 > @@ -1284,13 +1290,13 @@
 >    ; Assert only one cpu family has been selected.
 >    (assert-keep-one)
 >  
 > -  (let ((par-insns (map isa-parallel-insns (current-isa-list)))
 > +  (let ((false->zero (lambda (x) (if x x 0)))
 > +	(par-insns (map isa-parallel-insns (current-isa-list)))
 >  	(cpu-par-insns (cpu-parallel-insns (current-cpu))))
 >      ; ??? The m32r does have parallel execution, but to keep support for the
 >      ; base mach simpler, a cpu family is allowed to override the isa spec.
 > -    (or cpu-par-insns
 > -	; FIXME: ensure all have same value.
 > -	(car par-insns)))
 > +    (max (false->zero cpu-par-insns) 
 > +	 (apply max (map false->zero par-insns))))
 >  )

this is ok.
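
For reference, the two versions of this expression can be compared side by
side.  This is a Python transcription for illustration only: None stands in
for Scheme's #f, and the function names are made up.

```python
def par_insns_old(cpu, isas):
    # (or cpu-par-insns (car par-insns)): cpu override, else the first ISA.
    return cpu if cpu is not None else isas[0]


def par_insns_new(cpu, isas):
    # (max (false->zero cpu-par-insns) (apply max (map false->zero par-insns)))
    def false_to_zero(x):
        return x if x is not None else 0

    return max(false_to_zero(cpu), max(false_to_zero(x) for x in isas))
```

The two agree when the cpu family gives no value and there is a single ISA.
They differ when the cpu family names a smaller value than some ISA (the new
form takes the larger), and the new form yields 0 where the old one could
yield #f.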

^ permalink raw reply	[flat|nested] 12+ messages in thread

