public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [Patch,AVR]: PR49687 (better widening 32-bit mul)
@ 2011-07-25 10:08 Georg-Johann Lay
  2011-07-25 14:47 ` Weddington, Eric
  0 siblings, 1 reply; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-25 10:08 UTC (permalink / raw)
  To: gcc-patches
  Cc: Anatoly Sokolov, Denis Chertykov, Eric Weddington, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 3126 bytes --]

This is the second part for a better widening multiply for AVR,
namely widening to 32 bit when a MUL instructions are available.

This as a bit more complicated than the 16-bit case because the
multiplications are emit as implicit libgcc calls and involve
hard registers.  Thus, all splits and expansion has to be done
before register allocation.

The patch includes widening multiply from QI to SI, too.
If a QI is involved the extension is done in two steps:
An explicit, inlined QI->HI extension and an implicit HI->SI
extension in the library routine.

The __mulsi3 is rewritten; it now runs a bit slower and needs a
bit more code because __umulhisi3 and __muluhisi3 are factored out
to faciliate code-reuse.  In particular, multiplication
with a small constant (i.e. 17-bit signed -65536...65536) perform
better and will reuse such functions.

Eric, can you review the assembler routines and say if such reuse is ok
or if you'd prefer a speed-optimized version of __mulsi3 like in the current libgcc?

The new multiplication routines aim at a minimal register usage footprint:
No registers need to be clobbered except R26/R27 for __mulhi3.

The patch passes without regressions, of course.

Moreover, I drove individual tests of the routines against the old implementation
before integrating then into libgcc to run regression tests.

Ok to install?

Johann


	PR target/49687
	* config/avr/t-avr (LIB1ASMFUNCS): Remove _xmulhisi3_exit.
	Add _muluhisi3, _mulshisi3, _usmulhisi3.
	* config/avr/libgcc.S (__mulsi3): Rewrite.
	(__mulhisi3): Rewrite.
	(__umulhisi3): Rewrite.
	(__usmulhisi3): New.
	(__muluhisi3): New.
	(__mulshisi3): New.
	(__mulohisi3): New.
	(__mulqi3, __mulqihi3, __umulqihi3, __mulhi3): Use DEFUN/ENDF to
	declare.
	* config/avr/predicates.md (pseudo_register_operand): Rewrite.
	(pseudo_register_or_const_int_operand): New.
	(combine_pseudo_register_operand): New.
	(u16_operand): New.
	(s16_operand): New.
	(o16_operand): New.
	* config/avr/avr.c (avr_rtx_costs): Handle costs for mult:SI.
	* config/avr/avr.md (QIHI, QIHI2): New mode iterators.
	(any_extend, any_extend2): New code iterators.
	(extend_prefix): New code attribute.
	(mulsi3): Rewrite. Turn insn to expander.
	(mulhisi3): Ditto.
	(umulhisi3): Ditto.
	(usmulhisi3): New expander.
	(*mulsi3): New insn-and-split.
	(mulu<mode>si3): New insn-and-split.
	(muls<mode>si3): New insn-and-split.
	(mulohisi3): New insn-and-split.
	(*uumulqihisi3, *uumulhiqisi3, *uumulhihisi3, *uumulqiqisi3,
	*usmulqihisi3, *usmulhiqisi3, *usmulhihisi3, *usmulqiqisi3,
	*sumulqihisi3, *sumulhiqisi3, *sumulhihisi3, *sumulqiqisi3,
	*ssmulqihisi3, *ssmulhiqisi3, *ssmulhihisi3, *ssmulqiqisi3): New
	insn-and-split.
	(*mulsi3_call): Rewrite.
	(*mulhisi3_call): Rewrite.
	(*umulhisi3_call): Rewrite.
	(*usmulhisi3_call): New insn.
	(*muluhisi3_call): New insn.
	(*mulshisi3_call): New insn.
	(*mulohisi3_call): New insn.
	(extendqihi2): Use combine_pseudo_register_operand as predicate
	for operand 1.
	(extendqisi2): Ditto.
	(zero_extendqihi2): Ditto.
	(zero_extendqisi2): Ditto.
	(zero_extendhisi2): Ditto.
	(extendhisi2): Ditto. Don't early-clobber operand 0.


[-- Attachment #2: opt-mul32.diff --]
[-- Type: text/x-patch, Size: 32024 bytes --]

Index: config/avr/predicates.md
===================================================================
--- config/avr/predicates.md	(revision 176624)
+++ config/avr/predicates.md	(working copy)
@@ -155,10 +155,34 @@ (define_predicate "call_insn_operand"
        (ior (match_test "register_operand (XEXP (op, 0), mode)")
             (match_test "CONSTANT_ADDRESS_P (XEXP (op, 0))"))))
 
+;; For some insns we must ensure that no hard register is inserted
+;; into their operands because the insns are split and the split
+;; involves hard registers.  An example are divmod insn that are
+;; split to insns that represent implicit library calls.
+
 ;; True for register that is pseudo register.
 (define_predicate "pseudo_register_operand"
-  (and (match_code "reg")
-       (match_test "!HARD_REGISTER_P (op)")))
+  (and (match_operand 0 "register_operand")
+       (not (and (match_code "reg")
+                 (match_test "HARD_REGISTER_P (op)")))))
+
+;; True for operand that is pseudo register or CONST_INT.
+(define_predicate "pseudo_register_or_const_int_operand"
+  (ior (match_operand 0 "const_int_operand")
+       (match_operand 0 "pseudo_register_operand")))
+
+;; We keep combiner from inserting hard registers into the input of sign- and
+;; zero-extends.  A hard register in the input operand is not wanted because
+;; 32-bit multiply patterns clobber some hard registers and extends with a
+;; hard register that overlaps these clobbers won't combine to a widening
+;; multiplication.  There is no need for combine to propagate or insert
+;; hard registers, register allocation can do it just as well.
+
+;; True for operand that is pseudo register at combine time.
+(define_predicate "combine_pseudo_register_operand"
+  (ior (match_operand 0 "pseudo_register_operand")
+       (and (match_operand 0 "register_operand")
+            (match_test "reload_completed || reload_in_progress"))))
 
 ;; Return true if OP is a constant integer that is either
 ;; 8 or 16 or 24.
@@ -189,3 +213,18 @@ (define_predicate "s9_operand"
 (define_predicate "register_or_s9_operand"
   (ior (match_operand 0 "register_operand")
        (match_operand 0 "s9_operand")))
+
+;; Unsigned CONST_INT that fits in 16 bits, i.e. 0..65536.
+(define_predicate "u16_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), 0, (1<<16)-1)")))
+
+;; Signed CONST_INT that fits in 16 bits, i.e. -32768..32767.
+(define_predicate "s16_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), -(1<<15), (1<<15)-1)")))
+
+;; One-extended CONST_INT that fits in 16 bits, i.e. -65536..-1.
+(define_predicate "o16_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), -(1<<16), -1)")))
Index: config/avr/libgcc.S
===================================================================
--- config/avr/libgcc.S	(revision 176624)
+++ config/avr/libgcc.S	(working copy)
@@ -72,10 +72,11 @@ see the files COPYING3 and COPYING.RUNTI
 .endm
 
 \f
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 /* Note: mulqi3, mulhi3 are open-coded on the enhanced core.  */
 #if !defined (__AVR_HAVE_MUL__)
 /*******************************************************
-               Multiplication  8 x 8
+    Multiplication  8 x 8  without MUL
 *******************************************************/
 #if defined (L_mulqi3)
 
@@ -83,9 +84,7 @@ see the files COPYING3 and COPYING.RUNTI
 #define	r_arg1 	r24		/* multiplier */
 #define r_res	__tmp_reg__	/* result */
 
-	.global	__mulqi3
-	.func	__mulqi3
-__mulqi3:
+DEFUN __mulqi3
 	clr	r_res		; clear result
 __mulqi3_loop:
 	sbrc	r_arg1,0
@@ -97,18 +96,16 @@ __mulqi3_loop:
 __mulqi3_exit:	
 	mov	r_arg1,r_res	; result to return register
 	ret
+ENDF __mulqi3
 
 #undef r_arg2  
 #undef r_arg1  
 #undef r_res   
 	
-.endfunc
 #endif 	/* defined (L_mulqi3) */
 
 #if defined (L_mulqihi3)
-	.global	__mulqihi3
-	.func	__mulqihi3
-__mulqihi3:
+DEFUN __mulqihi3
 	clr	r25
 	sbrc	r24, 7
 	dec	r25
@@ -116,21 +113,19 @@ __mulqihi3:
 	sbrc	r22, 7
 	dec	r22
 	rjmp	__mulhi3
-	.endfunc
+ENDF __mulqihi3:
 #endif /* defined (L_mulqihi3) */
 
 #if defined (L_umulqihi3)
-	.global	__umulqihi3
-	.func	__umulqihi3
-__umulqihi3:
+DEFUN __umulqihi3
 	clr	r25
 	clr	r23
 	rjmp	__mulhi3
-	.endfunc
+ENDF __umulqihi3
 #endif /* defined (L_umulqihi3) */
 
 /*******************************************************
-               Multiplication  16 x 16
+    Multiplication  16 x 16  without MUL
 *******************************************************/
 #if defined (L_mulhi3)
 #define	r_arg1L	r24		/* multiplier Low */
@@ -140,9 +135,7 @@ __umulqihi3:
 #define r_resL	__tmp_reg__	/* result Low */
 #define r_resH  r21		/* result High */
 
-	.global	__mulhi3
-	.func	__mulhi3
-__mulhi3:
+DEFUN __mulhi3
 	clr	r_resH		; clear result
 	clr	r_resL		; clear result
 __mulhi3_loop:
@@ -166,6 +159,7 @@ __mulhi3_exit:
 	mov	r_arg1H,r_resH	; result to return register
 	mov	r_arg1L,r_resL
 	ret
+ENDF __mulhi3
 
 #undef r_arg1L
 #undef r_arg1H
@@ -174,168 +168,51 @@ __mulhi3_exit:
 #undef r_resL 	
 #undef r_resH 
 
-.endfunc
 #endif /* defined (L_mulhi3) */
-#endif /* !defined (__AVR_HAVE_MUL__) */
 
 /*******************************************************
-      Widening Multiplication  32 = 16 x 16
+    Widening Multiplication  32 = 16 x 16  without MUL
 *******************************************************/
-                              
+
 #if defined (L_mulhisi3)
 DEFUN __mulhisi3
-#if defined (__AVR_HAVE_MUL__)
-
-;; r25:r22 = r19:r18 * r21:r20
-
-#define A0 18
-#define B0 20
-#define C0 22
-
-#define A1 A0+1
-#define B1 B0+1
-#define C1 C0+1
-#define C2 C0+2
-#define C3 C0+3
- 
-    ; C = (signed)A1 * (signed)B1
-    muls  A1, B1
-    movw  C2, R0
-
-    ; C += A0 * B0
-    mul   A0, B0
-    movw  C0, R0
-
-    ; C += (signed)A1 * B0
-    mulsu A1, B0
-    sbci  C3, 0
-    add   C1, R0
-    adc   C2, R1
-    clr   __zero_reg__
-    adc   C3, __zero_reg__
-
-    ; C += (signed)B1 * A0
-    mulsu B1, A0
-    sbci  C3, 0
-    XJMP  __xmulhisi3_exit
-
-#undef A0
-#undef A1
-#undef B0
-#undef B1
-#undef C0
-#undef C1
-#undef C2
-#undef C3
-
-#else /* !__AVR_HAVE_MUL__ */
 ;;; FIXME: This is dead code (noone calls it)
-	mov_l	r18, r24
-	mov_h	r19, r25
-	clr	r24
-	sbrc	r23, 7
-	dec	r24
-	mov	r25, r24
-	clr	r20
-	sbrc	r19, 7
-	dec	r20
-	mov	r21, r20
-	XJMP	__mulsi3
-#endif /* __AVR_HAVE_MUL__ */
+    mov_l   r18, r24
+    mov_h   r19, r25
+    clr     r24
+    sbrc    r23, 7
+    dec     r24
+    mov     r25, r24
+    clr     r20
+    sbrc    r19, 7
+    dec     r20
+    mov     r21, r20
+    XJMP    __mulsi3
 ENDF __mulhisi3
 #endif /* defined (L_mulhisi3) */
 
 #if defined (L_umulhisi3)
 DEFUN __umulhisi3
-#if defined (__AVR_HAVE_MUL__)
-
-;; r25:r22 = r19:r18 * r21:r20
-
-#define A0 18
-#define B0 20
-#define C0 22
-
-#define A1 A0+1
-#define B1 B0+1
-#define C1 C0+1
-#define C2 C0+2
-#define C3 C0+3
-
-    ; C = A1 * B1
-    mul   A1, B1
-    movw  C2, R0
-
-    ; C += A0 * B0
-    mul   A0, B0
-    movw  C0, R0
-
-    ; C += A1 * B0
-    mul   A1, B0
-    add   C1, R0
-    adc   C2, R1
-    clr   __zero_reg__
-    adc   C3, __zero_reg__
-
-    ; C += B1 * A0
-    mul   B1, A0
-    XJMP  __xmulhisi3_exit
-
-#undef A0
-#undef A1
-#undef B0
-#undef B1
-#undef C0
-#undef C1
-#undef C2
-#undef C3
-
-#else /* !__AVR_HAVE_MUL__ */
 ;;; FIXME: This is dead code (noone calls it)
-	mov_l	r18, r24
-	mov_h	r19, r25
-	clr	r24
-	clr	r25
-	clr	r20
-	clr	r21
-	XJMP	__mulsi3
-#endif /* __AVR_HAVE_MUL__ */
+    mov_l   r18, r24
+    mov_h   r19, r25
+    clr     r24
+    clr     r25
+    mov_l   r20, r24
+    mov_h   r21, r25
+    XJMP    __mulsi3
 ENDF __umulhisi3
 #endif /* defined (L_umulhisi3) */
 
-#if defined (L_xmulhisi3_exit)
-
-;;; Helper for __mulhisi3 resp. __umulhisi3.
-
-#define C0 22
-#define C1 C0+1
-#define C2 C0+2
-#define C3 C0+3
-
-DEFUN __xmulhisi3_exit
-    add   C1, R0
-    adc   C2, R1
-    clr   __zero_reg__
-    adc   C3, __zero_reg__
-    ret
-ENDF __xmulhisi3_exit
-
-#undef C0
-#undef C1
-#undef C2
-#undef C3
-
-#endif /* defined (L_xmulhisi3_exit) */
-
 #if defined (L_mulsi3)
 /*******************************************************
-               Multiplication  32 x 32
+    Multiplication  32 x 32  without MUL
 *******************************************************/
 #define r_arg1L  r22		/* multiplier Low */
 #define r_arg1H  r23
 #define	r_arg1HL r24
 #define	r_arg1HH r25		/* multiplier High */
 
-
 #define	r_arg2L  r18		/* multiplicand Low */
 #define	r_arg2H  r19	
 #define	r_arg2HL r20
@@ -346,43 +223,7 @@ ENDF __xmulhisi3_exit
 #define r_resHL	 r30
 #define r_resHH  r31		/* result High */
 
-	
-	.global	__mulsi3
-	.func	__mulsi3
-__mulsi3:
-#if defined (__AVR_HAVE_MUL__)
-	mul	r_arg1L, r_arg2L
-	movw	r_resL, r0
-	mul	r_arg1H, r_arg2H
-	movw	r_resHL, r0
-	mul	r_arg1HL, r_arg2L
-	add	r_resHL, r0
-	adc	r_resHH, r1
-	mul	r_arg1L, r_arg2HL
-	add	r_resHL, r0
-	adc	r_resHH, r1
-	mul	r_arg1HH, r_arg2L
-	add	r_resHH, r0
-	mul	r_arg1HL, r_arg2H
-	add	r_resHH, r0
-	mul	r_arg1H, r_arg2HL
-	add	r_resHH, r0
-	mul	r_arg1L, r_arg2HH
-	add	r_resHH, r0
-	clr	r_arg1HH	; use instead of __zero_reg__ to add carry
-	mul	r_arg1H, r_arg2L
-	add	r_resH, r0
-	adc	r_resHL, r1
-	adc	r_resHH, r_arg1HH ; add carry
-	mul	r_arg1L, r_arg2H
-	add	r_resH, r0
-	adc	r_resHL, r1
-	adc	r_resHH, r_arg1HH ; add carry
-	movw	r_arg1L, r_resL
-	movw	r_arg1HL, r_resHL
-	clr	r1		; __zero_reg__ clobbered by "mul"
-	ret
-#else
+DEFUN __mulsi3
 	clr	r_resHH		; clear result
 	clr	r_resHL		; clear result
 	clr	r_resH		; clear result
@@ -414,13 +255,13 @@ __mulsi3_exit:
 	mov_h	r_arg1H,r_resH
 	mov_l	r_arg1L,r_resL
 	ret
-#endif /* defined (__AVR_HAVE_MUL__) */
+ENDF __mulsi3
+
 #undef r_arg1L 
 #undef r_arg1H 
 #undef r_arg1HL
 #undef r_arg1HH
              
-             
 #undef r_arg2L 
 #undef r_arg2H 
 #undef r_arg2HL
@@ -431,9 +272,183 @@ __mulsi3_exit:
 #undef r_resHL 
 #undef r_resHH 
 
-.endfunc
 #endif /* defined (L_mulsi3) */
+
+#endif /* !defined (__AVR_HAVE_MUL__) */
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+\f
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#if defined (__AVR_HAVE_MUL__)    
+#define A0 26
+#define B0 18
+#define C0 22
+
+#define A1 A0+1
+
+#define B1 B0+1
+#define B2 B0+2
+#define B3 B0+3
+
+#define C1 C0+1
+#define C2 C0+2
+#define C3 C0+3
+
+/*******************************************************
+    Widening Multiplication  32 = 16 x 16
+*******************************************************/
+                              
+#if defined (L_mulhisi3)
+;;; R25:R22 = (signed long) R27:R26 * (signed long) R19:R18
+;;; C3:C0   = (signed long) A1:A0   * (signed long) B1:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __mulhisi3
+    XCALL   __umulhisi3
+    ;; Sign-extend B
+    tst     B1
+    brpl    1f
+    sub     C2, A0
+    sbc     C3, A1
+1:  ;; Sign-extend A
+    XJMP __usmulhisi3_tail
+ENDF __mulhisi3
+#endif /* L_mulhisi3 */
+
+#if defined (L_usmulhisi3)
+;;; R25:R22 = (signed long) R27:R26 * (unsigned long) R19:R18
+;;; C3:C0   = (signed long) A1:A0   * (unsigned long) B1:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __usmulhisi3
+    XCALL   __umulhisi3
+    ;; FALLTHRU
+ENDF __usmulhisi3
+
+DEFUN __usmulhisi3_tail
+    ;; Sign-extend A
+    sbrs    A1, 7
+    ret
+    sub     C2, B0
+    sbc     C3, B1
+    ret
+ENDF __usmulhisi3_tail
+#endif /* L_usmulhisi3 */
+
+#if defined (L_umulhisi3)
+;;; R25:R22 = (unsigned long) R27:R26 * (unsigned long) R19:R18
+;;; C3:C0   = (unsigned long) A1:A0   * (unsigned long) B1:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __umulhisi3
+    mul     A0, B0
+    movw    C0, r0
+    mul     A1, B1
+    movw    C2, r0
+    mul     A0, B1
+    add     C1, r0
+    adc     C2, r1
+    clr     __zero_reg__
+    adc     C3, __zero_reg__
+    mul     A1, B0
+    add     C1, r0
+    adc     C2, r1
+    clr     __zero_reg__
+    adc     C3, __zero_reg__
+    ret
+ENDF __umulhisi3
+#endif /* L_umulhisi3 */
+
+/*******************************************************
+    Widening Multiplication  32 = 16 x 32
+*******************************************************/
+
+#if defined (L_mulshisi3)
+;;; R25:R22 = (signed long) R27:R26 * R21:R18
+;;; (C3:C0) = (signed long) A1:A0   * B3:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __mulshisi3
+#ifdef __AVR_HAVE_JMP_CALL__
+    ;; Some cores have problem skipping 2-word instruction
+    tst     A1
+    brmi    __mulohisi3
+#else
+    sbrs    A1, 7
+#endif /* __AVR_HAVE_JMP_CALL__ */
+    XJMP    __muluhisi3
+    ;; FALLTHRU
+ENDF __mulshisi3
+    
+;;; R25:R22 = (one-extended long) R27:R26 * R21:R18
+;;; (C3:C0) = (one-extended long) A1:A0   * B3:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __mulohisi3
+    XCALL   __muluhisi3
+    ;; One-extend R27:R26 (A1:A0)
+    sub     C2, B0
+    sbc     C3, B1
+    ret
+ENDF __mulohisi3
+#endif /* L_mulshisi3 */
+
+#if defined (L_muluhisi3)
+;;; R25:R22 = (unsigned long) R27:R26 * R21:R18
+;;; (C3:C0) = (unsigned long) A1:A0   * B3:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __muluhisi3
+    XCALL   __umulhisi3
+    mul     A0, B3
+    add     C3, r0
+    mul     A1, B2
+    add     C3, r0
+    mul     A0, B2
+    add     C2, r0
+    adc     C3, r1
+    clr     __zero_reg__
+    ret
+ENDF __muluhisi3
+#endif /* L_muluhisi3 */
+
+/*******************************************************
+    Multiplication  32 x 32
+*******************************************************/
+
+#if defined (L_mulsi3)
+;;; R25:R22 = R25:R22 * R21:R18
+;;; (C3:C0) = C3:C0   * B3:B0
+;;; Clobbers: R26, R27, __tmp_reg__
+DEFUN __mulsi3
+    movw    A0, C0
+    push    C2
+    push    C3
+    XCALL   __muluhisi3
+    pop     A1
+    pop     A0
+    ;; A1:A0 now contains the high word of A
+    mul     A0, B0
+    add     C2, r0
+    adc     C3, r1
+    mul     A0, B1
+    add     C3, r0
+    mul     A1, B0
+    add     C3, r0
+    clr     __zero_reg__
+    ret
+#endif /* L_mulsi3 */
+
+#undef A0
+#undef A1
+
+#undef B0
+#undef B1
+#undef B2
+#undef B3
+
+#undef C0
+#undef C1
+#undef C2
+#undef C3
+
+#endif /* __AVR_HAVE_MUL__ */
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 	
+\f
 /*******************************************************
        Division 8 / 8 => (result + remainder)
 *******************************************************/
Index: config/avr/avr.md
===================================================================
--- config/avr/avr.md	(revision 176624)
+++ config/avr/avr.md	(working copy)
@@ -126,12 +126,25 @@ (define_attr "length" ""
 		       (const_int 2))]
         (const_int 2)))
 
-;; Define mode iterator
+;; Define mode iterators
+(define_mode_iterator QIHI  [(QI "") (HI "")])
+(define_mode_iterator QIHI2 [(QI "") (HI "")])
 (define_mode_iterator QISI [(QI "") (HI "") (SI "")])
 (define_mode_iterator QIDI [(QI "") (HI "") (SI "") (DI "")])
 (define_mode_iterator HIDI [(HI "") (SI "") (DI "")])
 (define_mode_iterator HISI [(HI "") (SI "")])
 
+;; Define code iterators
+;; Define two incarnations so that we can build the cross product.
+(define_code_iterator any_extend  [sign_extend zero_extend])
+(define_code_iterator any_extend2 [sign_extend zero_extend])
+
+;; Define code attributes
+(define_code_attr extend_prefix
+  [(sign_extend "s")
+   (zero_extend "u")])
+
+
 ;;========================================================================
 ;; The following is used by nonlocal_goto and setjmp.
 ;; The receiver pattern will create no instructions since internally
@@ -1349,69 +1362,310 @@ (define_insn "*mulhi3_call"
 
 ;; Operand 2 (reg:SI 18) not clobbered on the enhanced core.
 ;; All call-used registers clobbered otherwise - normal library call.
+;;    To support widening multiplicatioon with constant we postpone
+;; expanding to the implicit library call until post combine and
+;; prior to register allocation.  Clobber all hard registers that
+;; might be used by the (widening) multiply until it is split and
+;; it's final register footprint is worked out.
+
 (define_expand "mulsi3"
-  [(set (reg:SI 22) (match_operand:SI 1 "register_operand" ""))
-   (set (reg:SI 18) (match_operand:SI 2 "register_operand" ""))
-   (parallel [(set (reg:SI 22) (mult:SI (reg:SI 22) (reg:SI 18)))
-	      (clobber (reg:HI 26))
-	      (clobber (reg:HI 30))])
-   (set (match_operand:SI 0 "register_operand" "") (reg:SI 22))]
+  [(parallel [(set (match_operand:SI 0 "register_operand" "")
+                   (mult:SI (match_operand:SI 1 "register_operand" "")
+                            (match_operand:SI 2 "nonmemory_operand" "")))
+              (clobber (reg:DI 18))])]
   "AVR_HAVE_MUL"
-  "")
+  {
+    if (u16_operand (operands[2], SImode))
+      {
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_muluhisi3 (operands[0], operands[2], operands[1]));
+        DONE;
+      }
 
-(define_insn "*mulsi3_call"
-  [(set (reg:SI 22) (mult:SI (reg:SI 22) (reg:SI 18)))
-   (clobber (reg:HI 26))
-   (clobber (reg:HI 30))]
-  "AVR_HAVE_MUL"
-  "%~call __mulsi3"
-  [(set_attr "type" "xcall")
-   (set_attr "cc" "clobber")])
+    if (o16_operand (operands[2], SImode))
+      {
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_mulohisi3 (operands[0], operands[2], operands[1]));
+        DONE;
+      }
+  })
 
-(define_expand "mulhisi3"
-  [(set (reg:HI 18)
-        (match_operand:HI 1 "register_operand" ""))
-   (set (reg:HI 20)
-        (match_operand:HI 2 "register_operand" ""))
+(define_insn_and_split "*mulsi3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                      "=r")
+        (mult:SI (match_operand:SI 1 "pseudo_register_operand"              "r")
+                 (match_operand:SI 2 "pseudo_register_or_const_int_operand" "rn")))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (reg:SI 18)
+        (match_dup 1))
    (set (reg:SI 22) 
-        (mult:SI (sign_extend:SI (reg:HI 18))
-                 (sign_extend:SI (reg:HI 20))))
-   (set (match_operand:SI 0 "register_operand" "") 
+        (match_dup 2))
+   (parallel [(set (reg:SI 22)
+                   (mult:SI (reg:SI 22)
+                            (reg:SI 18)))
+              (clobber (reg:HI 26))])
+   (set (match_dup 0)
+        (reg:SI 22))]
+  {
+    if (u16_operand (operands[2], SImode))
+      {
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_muluhisi3 (operands[0], operands[2], operands[1]));
+        DONE;
+      }
+
+    if (o16_operand (operands[2], SImode))
+      {
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_mulohisi3 (operands[0], operands[2], operands[1]));
+        DONE;
+      }
+  })
+
+;; "muluqisi3"
+;; "muluhisi3"
+(define_insn_and_split "mulu<mode>si3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                           "=r")
+        (mult:SI (zero_extend:SI (match_operand:QIHI 1 "pseudo_register_operand" "r"))
+                 (match_operand:SI 2 "pseudo_register_or_const_int_operand"      "rn")))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (reg:HI 26)
+        (match_dup 1))
+   (set (reg:SI 18)
+        (match_dup 2))
+   (set (reg:SI 22)
+        (mult:SI (zero_extend:SI (reg:HI 26))
+                 (reg:SI 18)))
+   (set (match_dup 0)
+        (reg:SI 22))]
+  {
+    /* Do the QI -> HI extension explicitely before the multiplication.  */
+    /* Do the HI -> SI extension implicitely and after the multiplication.  */
+       
+    if (QImode == <MODE>mode)
+      operands[1] = gen_rtx_ZERO_EXTEND (HImode, operands[1]);
+
+    if (u16_operand (operands[2], SImode))
+      {
+        operands[1] = force_reg (HImode, operands[1]);
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_umulhisi3 (operands[0], operands[1], operands[2]));
+        DONE;
+      }
+  })
+
+;; "mulsqisi3"
+;; "mulshisi3"
+(define_insn_and_split "muls<mode>si3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                           "=r")
+        (mult:SI (sign_extend:SI (match_operand:QIHI 1 "pseudo_register_operand" "r"))
+                 (match_operand:SI 2 "pseudo_register_or_const_int_operand"      "rn")))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (reg:HI 26)
+        (match_dup 1))
+   (set (reg:SI 18)
+        (match_dup 2))
+   (set (reg:SI 22)
+        (mult:SI (sign_extend:SI (reg:HI 26))
+                 (reg:SI 18)))
+   (set (match_dup 0)
+        (reg:SI 22))]
+  {
+    /* Do the QI -> HI extension explicitely before the multiplication.  */
+    /* Do the HI -> SI extension implicitely and after the multiplication.  */
+       
+    if (QImode == <MODE>mode)
+      operands[1] = gen_rtx_SIGN_EXTEND (HImode, operands[1]);
+
+    if (u16_operand (operands[2], SImode)
+        || s16_operand (operands[2], SImode))
+      {
+        rtx xop2 = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+
+        operands[1] = force_reg (HImode, operands[1]);
+
+        if (u16_operand (operands[2], SImode))
+          emit_insn (gen_usmulhisi3 (operands[0], xop2, operands[1]));
+        else
+          emit_insn (gen_mulhisi3 (operands[0], operands[1], xop2));
+
+        DONE;
+      }
+  })
+
+;; One-extend operand 1
+
+(define_insn_and_split "mulohisi3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                          "=r")
+        (mult:SI (not:SI (zero_extend:SI 
+                          (not:HI (match_operand:HI 1 "pseudo_register_operand" "r"))))
+                 (match_operand:SI 2 "pseudo_register_or_const_int_operand"     "rn")))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (reg:HI 26)
+        (match_dup 1))
+   (set (reg:SI 18)
+        (match_dup 2))
+   (set (reg:SI 22)
+        (mult:SI (not:SI (zero_extend:SI (not:HI (reg:HI 26))))
+                 (reg:SI 18)))
+   (set (match_dup 0)
         (reg:SI 22))]
+  "")
+
+(define_expand "mulhisi3"
+  [(parallel [(set (match_operand:SI 0 "register_operand" "")
+                   (mult:SI (sign_extend:SI (match_operand:HI 1 "register_operand" ""))
+                            (sign_extend:SI (match_operand:HI 2 "register_operand" ""))))
+              (clobber (reg:DI 18))])]
   "AVR_HAVE_MUL"
   "")
 
 (define_expand "umulhisi3"
+  [(parallel [(set (match_operand:SI 0 "register_operand" "")
+                   (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" ""))
+                            (zero_extend:SI (match_operand:HI 2 "register_operand" ""))))
+              (clobber (reg:DI 18))])]
+  "AVR_HAVE_MUL"
+  "")
+
+(define_expand "usmulhisi3"
+  [(parallel [(set (match_operand:SI 0 "register_operand" "")
+                   (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" ""))
+                            (sign_extend:SI (match_operand:HI 2 "register_operand" ""))))
+              (clobber (reg:DI 18))])]
+  "AVR_HAVE_MUL"
+  "")
+
+;; "*uumulqihisi3" "*uumulhiqisi3" "*uumulhihisi3" "*uumulqiqisi3"
+;; "*usmulqihisi3" "*usmulhiqisi3" "*usmulhihisi3" "*usmulqiqisi3"
+;; "*sumulqihisi3" "*sumulhiqisi3" "*sumulhihisi3" "*sumulqiqisi3"
+;; "*ssmulqihisi3" "*ssmulhiqisi3" "*ssmulhihisi3" "*ssmulqiqisi3"
+(define_insn_and_split
+  "*<any_extend:extend_prefix><any_extend2:extend_prefix>mul<QIHI:mode><QIHI2:mode>si3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                            "=r")
+        (mult:SI (any_extend:SI (match_operand:QIHI 1 "pseudo_register_operand"   "r"))
+                 (any_extend2:SI (match_operand:QIHI2 2 "pseudo_register_operand" "r"))))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
   [(set (reg:HI 18)
-        (match_operand:HI 1 "register_operand" ""))
-   (set (reg:HI 20)
-        (match_operand:HI 2 "register_operand" ""))
-   (set (reg:SI 22) 
-        (mult:SI (zero_extend:SI (reg:HI 18))
-                 (zero_extend:SI (reg:HI 20))))
-   (set (match_operand:SI 0 "register_operand" "") 
+        (match_dup 1))
+   (set (reg:HI 26)
+        (match_dup 2))
+   (set (reg:SI 22)
+        (mult:SI (match_dup 3)
+                 (match_dup 4)))
+   (set (match_dup 0)
         (reg:SI 22))]
+  {
+    rtx xop1 = operands[1];
+    rtx xop2 = operands[2];
+
+    /* Do the QI -> HI extension explicitely before the multiplication.  */
+    /* Do the HI -> SI extension implicitely and after the multiplication.  */
+       
+    if (QImode == <QIHI:MODE>mode)
+      xop1 = gen_rtx_fmt_e (<any_extend:CODE>, HImode, xop1);
+
+    if (QImode == <QIHI2:MODE>mode)
+      xop2 = gen_rtx_fmt_e (<any_extend2:CODE>, HImode, xop2);
+
+    if (<any_extend:CODE> == <any_extend2:CODE>
+        || <any_extend:CODE> == ZERO_EXTEND)
+      {
+        operands[1] = xop1;
+        operands[2] = xop2;
+        operands[3] = gen_rtx_fmt_e (<any_extend:CODE>, SImode, gen_rtx_REG (HImode, 18));
+        operands[4] = gen_rtx_fmt_e (<any_extend2:CODE>, SImode, gen_rtx_REG (HImode, 26));
+      }
+    else
+      {
+        /* <any_extend:CODE>  = SIGN_EXTEND */
+        /* <any_extend2:CODE> = ZERO_EXTEND */
+
+        operands[1] = xop2;
+        operands[2] = xop1;
+        operands[3] = gen_rtx_ZERO_EXTEND (SImode, gen_rtx_REG (HImode, 18));
+        operands[4] = gen_rtx_SIGN_EXTEND (SImode, gen_rtx_REG (HImode, 26));
+      }
+  })
+
+(define_insn "*mulsi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (reg:SI 22)
+                 (reg:SI 18)))
+   (clobber (reg:HI 26))]
   "AVR_HAVE_MUL"
-  "")
+  "%~call __mulsi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
 
 (define_insn "*mulhisi3_call"
-  [(set (reg:SI 22) 
+  [(set (reg:SI 22)
         (mult:SI (sign_extend:SI (reg:HI 18))
-                 (sign_extend:SI (reg:HI 20))))]
+                 (sign_extend:SI (reg:HI 26))))]
   "AVR_HAVE_MUL"
   "%~call __mulhisi3"
   [(set_attr "type" "xcall")
    (set_attr "cc" "clobber")])
 
 (define_insn "*umulhisi3_call"
-  [(set (reg:SI 22) 
+  [(set (reg:SI 22)
         (mult:SI (zero_extend:SI (reg:HI 18))
-                 (zero_extend:SI (reg:HI 20))))]
+                 (zero_extend:SI (reg:HI 26))))]
   "AVR_HAVE_MUL"
   "%~call __umulhisi3"
   [(set_attr "type" "xcall")
    (set_attr "cc" "clobber")])
 
+(define_insn "*usmulhisi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (zero_extend:SI (reg:HI 18))
+                 (sign_extend:SI (reg:HI 26))))]
+  "AVR_HAVE_MUL"
+  "%~call __usmulhisi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
+(define_insn "*muluhisi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (zero_extend:SI (reg:HI 26))
+                 (reg:SI 18)))]
+  "AVR_HAVE_MUL"
+  "%~call __muluhisi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
+(define_insn "*mulshisi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (sign_extend:SI (reg:HI 26))
+                 (reg:SI 18)))]
+  "AVR_HAVE_MUL"
+  "%~call __mulshisi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
+(define_insn "*mulohisi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (not:SI (zero_extend:SI (not:HI (reg:HI 26))))
+                 (reg:SI 18)))]
+  "AVR_HAVE_MUL"
+  "%~call __mulohisi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
 ; / % / % / % / % / % / % / % / % / % / % / % / % / % / % / % / % / % / % / %
 ; divmod
 
@@ -2399,9 +2653,16 @@ (define_insn "one_cmplsi2"
 ;; xx<---x xx<---x xx<---x xx<---x xx<---x xx<---x xx<---x xx<---x xx<---x
 ;; sign extend
 
+;; We keep combiner from inserting hard registers into the input of sign- and
+;; zero-extends.  A hard register in the input operand is not wanted because
+;; 32-bit multiply patterns clobber some hard registers and extends with a
+;; hard register that overlaps these clobbers won't be combined to a widening
+;; multiplication.  There is no need for combine to propagate hard registers,
+;; register allocation can do it just as well.
+
 (define_insn "extendqihi2"
   [(set (match_operand:HI 0 "register_operand" "=r,r")
-        (sign_extend:HI (match_operand:QI 1 "register_operand" "0,*r")))]
+        (sign_extend:HI (match_operand:QI 1 "combine_pseudo_register_operand" "0,*r")))]
   ""
   "@
 	clr %B0\;sbrc %0,7\;com %B0
@@ -2411,7 +2672,7 @@ (define_insn "extendqihi2"
 
 (define_insn "extendqisi2"
   [(set (match_operand:SI 0 "register_operand" "=r,r")
-        (sign_extend:SI (match_operand:QI 1 "register_operand" "0,*r")))]
+        (sign_extend:SI (match_operand:QI 1 "combine_pseudo_register_operand" "0,*r")))]
   ""
   "@
 	clr %B0\;sbrc %A0,7\;com %B0\;mov %C0,%B0\;mov %D0,%B0
@@ -2420,8 +2681,8 @@ (define_insn "extendqisi2"
    (set_attr "cc" "set_n,set_n")])
 
 (define_insn "extendhisi2"
-  [(set (match_operand:SI 0 "register_operand"               "=r,&r")
-        (sign_extend:SI (match_operand:HI 1 "register_operand" "0,*r")))]
+  [(set (match_operand:SI 0 "register_operand"                               "=r,r")
+        (sign_extend:SI (match_operand:HI 1 "combine_pseudo_register_operand" "0,*r")))]
   ""
   "@
 	clr %C0\;sbrc %B0,7\;com %C0\;mov %D0,%C0
@@ -2438,7 +2699,7 @@ (define_insn "extendhisi2"
 
 (define_insn_and_split "zero_extendqihi2"
   [(set (match_operand:HI 0 "register_operand" "=r")
-        (zero_extend:HI (match_operand:QI 1 "register_operand" "r")))]
+        (zero_extend:HI (match_operand:QI 1 "combine_pseudo_register_operand" "r")))]
   ""
   "#"
   "reload_completed"
@@ -2454,7 +2715,7 @@ (define_insn_and_split "zero_extendqihi2
 
 (define_insn_and_split "zero_extendqisi2"
   [(set (match_operand:SI 0 "register_operand" "=r")
-        (zero_extend:SI (match_operand:QI 1 "register_operand" "r")))]
+        (zero_extend:SI (match_operand:QI 1 "combine_pseudo_register_operand" "r")))]
   ""
   "#"
   "reload_completed"
@@ -2469,8 +2730,8 @@ (define_insn_and_split "zero_extendqisi2
 })
 
 (define_insn_and_split "zero_extendhisi2"
-  [(set (match_operand:SI 0 "register_operand" "=r")
-        (zero_extend:SI (match_operand:HI 1 "register_operand" "r")))]
+  [(set (match_operand:SI 0 "register_operand"                               "=r")
+        (zero_extend:SI (match_operand:HI 1 "combine_pseudo_register_operand" "r")))]
   ""
   "#"
   "reload_completed"
Index: config/avr/t-avr
===================================================================
--- config/avr/t-avr	(revision 176624)
+++ config/avr/t-avr	(working copy)
@@ -41,7 +41,9 @@ LIB1ASMFUNCS = \
 	_mulhi3 \
 	_mulhisi3 \
 	_umulhisi3 \
-	_xmulhisi3_exit \
+	_usmulhisi3 \
+	_muluhisi3 \
+	_mulshisi3 \
 	_mulsi3 \
 	_udivmodqi4 \
 	_divmodqi4 \
Index: config/avr/avr.c
===================================================================
--- config/avr/avr.c	(revision 176624)
+++ config/avr/avr.c	(working copy)
@@ -5512,6 +5512,34 @@ avr_rtx_costs (rtx x, int codearg, int o
 	    return false;
 	  break;
 
+	case SImode:
+	  if (AVR_HAVE_MUL)
+            {
+              if (!speed)
+                {
+                  /* Add some additional costs besides CALL like moves etc.  */
+
+                  *total = COSTS_N_INSNS (AVR_HAVE_JMP_CALL ? 5 : 4);
+                }
+              else
+                {
+                  /* Just a rough estimate.  Even with -O2 we don't want bulky
+                     code expanded inline.  */
+
+                  *total = COSTS_N_INSNS (25);
+                }
+            }
+          else
+            {
+              if (speed)
+                *total = COSTS_N_INSNS (300);
+              else
+                /* Add some additional costs besides CALL like moves etc.  */
+                *total = COSTS_N_INSNS (AVR_HAVE_JMP_CALL ? 5 : 4);
+            }
+          
+          return true;
+          
 	default:
 	  return false;
 	}

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-25 10:08 [Patch,AVR]: PR49687 (better widening 32-bit mul) Georg-Johann Lay
@ 2011-07-25 14:47 ` Weddington, Eric
  2011-07-25 17:06   ` Georg-Johann Lay
  0 siblings, 1 reply; 17+ messages in thread
From: Weddington, Eric @ 2011-07-25 14:47 UTC (permalink / raw)
  To: Georg-Johann Lay, gcc-patches
  Cc: Anatoly Sokolov, Denis Chertykov, Richard Henderson



> -----Original Message-----
> From: Georg-Johann Lay [mailto:avr@gjlay.de]
> Sent: Monday, July 25, 2011 3:32 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Anatoly Sokolov; Denis Chertykov; Weddington, Eric; Richard Henderson
> Subject: [Patch,AVR]: PR49687 (better widening 32-bit mul)
> 
> Eric, can you review the assembler routines and say if such reuse is ok
> or if you'd prefer a speed-optimized version of __mulsi3 like in the
> current libgcc?

Hi Johann,

Typically a penalty on speed is preferred over a penalty on code size. Do you already have information on how it compares on code size with the old routines?

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-25 14:47 ` Weddington, Eric
@ 2011-07-25 17:06   ` Georg-Johann Lay
  2011-07-25 21:10     ` Weddington, Eric
  0 siblings, 1 reply; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-25 17:06 UTC (permalink / raw)
  To: Weddington, Eric
  Cc: gcc-patches, Anatoly Sokolov, Denis Chertykov, Richard Henderson

Weddington, Eric wrote:
> 
>> Eric, can you review the assembler routines and say if such reuse is ok or if you'd prefer a
>> speed-optimized version of __mulsi3 like in the current libgcc?
> 
> Hi Johann,
> 
> Typically a penalty on speed is preferred over a penalty on code size. Do you already have
> information on how it compares on code size with the old routines?
> 
> Eric

The old sizes are

62 __mulsi3
26 __mulhisi3
22 __umulhisi3
10 __xmulhisi3

where the __[u]mulhisi3 will drag in __xmulhisi3 and the insns don't combine
with constants.

The new implementation has more fragments, the indented modules are dragged
in i.e. used by respective function:

12 __mulhisi3
         __umulhisi3
         __usmulhisi3_tail

30 __umulhisi3

02 __usmulhisi3
10 __usmulhisi3_tail

20 __muluhisi3
         __umulhisi3

08 __mulohisi3
04 __mulshisi3
         __muluhisi3

30 __mulsi3
         __muluhisi3

This means that a pure __mulsi3 will have 30+30+20 = 80 bytes (+18).

If all functions are used they occupy 116 bytes (-4), so they actually
save a little space if they are used all with the benefit that they also
can one-extend, extend 32 = 16*32 as well as 32=16*16 and work for
small (17 bit signed) constants.

__umulhisi3 reads:

DEFUN __umulhisi3
    mul     A0, B0
    movw    C0, r0
    mul     A1, B1
    movw    C2, r0
    mul     A0, B1
    add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    mul     A1, B0
    add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    ret
ENDF __umulhisi3

It could be compressed to the following sequence, i.e.
24 bytes instead of 30, but I think that's too much of
quenching the last byte out of the code:

DEFUN __umulhisi3
    mul     A0, B0
    movw    C0, r0
    mul     A1, B1
    movw    C2, r0
    mul     A0, B1
    rcall   1f
    mul     A1, B0
1:  add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    ret
ENDF __umulhisi3


In that lack of real-world-code that uses 32-bit arithmetic I trust
my intuition that code size will decrease in general ;-)

Tiny examples are sometimes misleading because of additional moves from
unpleasant register allocation, bit that's a different story...

Johann

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-25 17:06   ` Georg-Johann Lay
@ 2011-07-25 21:10     ` Weddington, Eric
  2011-07-25 23:00       ` Georg-Johann Lay
  2011-07-27 14:08       ` Georg-Johann Lay
  0 siblings, 2 replies; 17+ messages in thread
From: Weddington, Eric @ 2011-07-25 21:10 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: gcc-patches, Anatoly Sokolov, Denis Chertykov, Richard Henderson



> -----Original Message-----
> From: Georg-Johann Lay [mailto:avr@gjlay.de]
> Sent: Monday, July 25, 2011 10:29 AM
> To: Weddington, Eric
> Cc: gcc-patches@gcc.gnu.org; Anatoly Sokolov; Denis Chertykov; Richard
> Henderson
> Subject: Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
> 

> 
> This means that a pure __mulsi3 will have 30+30+20 = 80 bytes (+18).
> 
> If all functions are used they occupy 116 bytes (-4), so they actually
> save a little space if they are used all with the benefit that they also
> can one-extend, extend 32 = 16*32 as well as 32=16*16 and work for
> small (17 bit signed) constants.
> 
> __umulhisi3 reads:
> 
> DEFUN __umulhisi3
>     mul     A0, B0
>     movw    C0, r0
>     mul     A1, B1
>     movw    C2, r0
>     mul     A0, B1
>     add     C1, r0
>     adc     C2, r1
>     clr     __zero_reg__
>     adc     C3, __zero_reg__
>     mul     A1, B0
>     add     C1, r0
>     adc     C2, r1
>     clr     __zero_reg__
>     adc     C3, __zero_reg__
>     ret
> ENDF __umulhisi3
> 
> It could be compressed to the following sequence, i.e.
> 24 bytes instead of 30, but I think that's too much of
> quenching the last byte out of the code:
> 
> DEFUN __umulhisi3
>     mul     A0, B0
>     movw    C0, r0
>     mul     A1, B1
>     movw    C2, r0
>     mul     A0, B1
>     rcall   1f
>     mul     A1, B0
> 1:  add     C1, r0
>     adc     C2, r1
>     clr     __zero_reg__
>     adc     C3, __zero_reg__
>     ret
> ENDF __umulhisi3
> 
> 
> In that lack of real-world-code that uses 32-bit arithmetic I trust
> my intuition that code size will decrease in general ;-)
> 

Hi Johann,

I would agree with you that it seems that overall code size will decrease in general.

However, I also like your creative compression in the second sequence above, and I think that it would be best to implement that sequence and try to find others like that where possible.

Remember that to AVR users, code size is *everything*. Even saving 6 bytes here or there has a positive effect.

I'll let Richard (or Denis if he's back from vacation) do the actual approval of the patch, as they are a lot more technically competent in this area. But I'm ok with the general tactic of the code reuse with looking at further ways to reduce code size like the example above.

Eric Weddington

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-25 21:10     ` Weddington, Eric
@ 2011-07-25 23:00       ` Georg-Johann Lay
  2011-07-26 13:00         ` Georg-Johann Lay
  2011-07-27 14:08       ` Georg-Johann Lay
  1 sibling, 1 reply; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-25 23:00 UTC (permalink / raw)
  To: Weddington, Eric
  Cc: gcc-patches, Anatoly Sokolov, Denis Chertykov, Richard Henderson

Eric Weddington a écrit:

>>__umulhisi3 reads:
>>
>>DEFUN __umulhisi3
>>    mul     A0, B0
>>    movw    C0, r0
>>    mul     A1, B1
>>    movw    C2, r0
>>    mul     A0, B1
>>    add     C1, r0
>>    adc     C2, r1
>>    clr     __zero_reg__
>>    adc     C3, __zero_reg__
>>    mul     A1, B0
>>    add     C1, r0
>>    adc     C2, r1
>>    clr     __zero_reg__
>>    adc     C3, __zero_reg__
>>    ret
>>ENDF __umulhisi3
>>
>>It could be compressed to the following sequence, i.e.
>>24 bytes instead of 30, but I think that's too much of
>>quenching the last byte out of the code:
>>
>>DEFUN __umulhisi3
>>    mul     A0, B0
>>    movw    C0, r0
>>    mul     A1, B1
>>    movw    C2, r0
>>    mul     A0, B1
>>    rcall   1f
>>    mul     A1, B0
>>1:  add     C1, r0
>>    adc     C2, r1
>>    clr     __zero_reg__
>>    adc     C3, __zero_reg__
>>    ret
>>ENDF __umulhisi3
> 
> However, I also like your creative compression in the second sequence
> above, and I think that it would be best to implement that sequence and
> try to find others like that where possible.

Maybe you see a sequence that I have overlooked?

> Remember that to AVR users, code size is *everything*. Even saving
> 6 bytes here or there has a positive effect.

Actually I had that self tail-call in my initial version.
But look at the code flow for a 32-bit multiply:

user -> mulsi3 -> muluhisi3 -> umulhisi3 -> umulhisi3.tail

That are 4 call-levels!  And mulsi3 pushes 2 bytes so that
a mulsi3 costs at least 10 bytes of stack.  (I prefered pushs/pops
over clobbering Z).  The three calls (one of the four is inevitable)
and pushs/pops will cost already ~30 ticks!
I found that too painful, and on devices with >= 8k flash the
self-tail-call will just save 4 bytes.

One way would be to depend the self tail-call on
!__AVR_HAVE_JMP_CALL__, i.e. just do it on small devices.

Also note that on tiny devices without MUL instruction
nothing changes, anyway.  The respective mulsi3 is unchanged.

Johann

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-25 23:00       ` Georg-Johann Lay
@ 2011-07-26 13:00         ` Georg-Johann Lay
  0 siblings, 0 replies; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-26 13:00 UTC (permalink / raw)
  To: Weddington, Eric; +Cc: gcc-patches, Denis Chertykov, Richard Henderson

Georg-Johann Lay wrote:

> I found that too painful, and on devices with >= 8k flash the
> self-tail-call will just save 4 bytes.

That' not correct: even on devices >= 8k an rcall will always reach
the destination, so that the self tail-call always saves 6 bytes.

Johann

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-25 21:10     ` Weddington, Eric
  2011-07-25 23:00       ` Georg-Johann Lay
@ 2011-07-27 14:08       ` Georg-Johann Lay
  2011-07-27 15:58         ` Richard Henderson
  2011-07-29 11:58         ` [Committed,AVR]: Addendum to fix thinko in PR49687 Georg-Johann Lay
  1 sibling, 2 replies; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-27 14:08 UTC (permalink / raw)
  To: Weddington, Eric
  Cc: gcc-patches, Anatoly Sokolov, Denis Chertykov, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 5022 bytes --]

http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02113.html

Weddington, Eric wrote:
> 
>> -----Original Message-----
>> From: Georg-Johann Lay
>>
>> This means that a pure __mulsi3 will have 30+30+20 = 80 bytes (+18).
>>
>> If all functions are used they occupy 116 bytes (-4), so they actually
>> save a little space if they are used all with the benefit that they also
>> can one-extend, extend 32 = 16*32 as well as 32=16*16 and work for
>> small (17 bit signed) constants.
>>
>> __umulhisi3 reads:
>>
>> DEFUN __umulhisi3
>>     mul     A0, B0
>>     movw    C0, r0
>>     mul     A1, B1
>>     movw    C2, r0
>>     mul     A0, B1
>>     add     C1, r0
>>     adc     C2, r1
>>     clr     __zero_reg__
>>     adc     C3, __zero_reg__
>>     mul     A1, B0
>>     add     C1, r0
>>     adc     C2, r1
>>     clr     __zero_reg__
>>     adc     C3, __zero_reg__
>>     ret
>> ENDF __umulhisi3
>>
>> It could be compressed to the following sequence, i.e.
>> 24 bytes instead of 30, but I think that's too much of
>> quenching the last byte out of the code:
>>
>> DEFUN __umulhisi3
>>     mul     A0, B0
>>     movw    C0, r0
>>     mul     A1, B1
>>     movw    C2, r0
>>     mul     A0, B1
>>     rcall   1f
>>     mul     A1, B0
>> 1:  add     C1, r0
>>     adc     C2, r1
>>     clr     __zero_reg__
>>     adc     C3, __zero_reg__
>>     ret
>> ENDF __umulhisi3
>>
>>
>> In that lack of real-world-code that uses 32-bit arithmetic I trust
>> my intuition that code size will decrease in general ;-)
>>
> 
> Hi Johann,
> 
> I would agree with you that it seems that overall code size will decrease in general.
> 
> However, I also like your creative compression in the second sequence above, and I think that it would be best to implement that sequence and try to find others like that where possible.
> 
> Remember that to AVR users, code size is *everything*. Even saving 6 bytes here or there has a positive effect.
> 
> I'll let Richard (or Denis if he's back from vacation) do the actual approval of the patch, as they are a lot more technically competent in this area. But I'm ok with the general tactic of the code reuse with looking at further ways to reduce code size like the example above.
> 
> Eric Weddington


This is a revised patch for review with the changes proposed by Eric,
i.e. __umulhisi3 is calling it's own tail.

A pure __mulsi3 will now cost 30+24+20 = 74 bytes (+12).

Using all functions will cost 110 bytes (-10).

__mulsi3 missed a final ENDF __mulsi3, I added it.

The rest of the patch is just technical:

* postponing emit of implicit library call from expand to split1,
  i.e. after combiner but prior to reload, of course.

* The patch covers QI->SI extensions where such extensions are
  done in two steps:  First an explicit QI-HI extension expanded
  inline and second the implicit HI->SI extension as by, e.g.
  __muluhisi3 (32 = 16 * 32)

* There is a bunch of possible HI/QI combinations. This is done
  with help of code iterators; the cross product covers all 16
  cases of QI->SI resp. HI->SI as signed resp. unsigned extension
  for operand1 resp. operand2.

* extendhisi2 need not to early-clobber the output because HI will
  always start in even register.

Tested without regressions.

Ok to install?

Johann

	PR target/49687
	* config/avr/t-avr (LIB1ASMFUNCS): Remove _xmulhisi3_exit.
	Add _muluhisi3, _mulshisi3, _usmulhisi3.
	* config/avr/libgcc.S (__mulsi3): Rewrite.
	(__mulhisi3): Rewrite.
	(__umulhisi3): Rewrite.
	(__usmulhisi3): New.
	(__muluhisi3): New.
	(__mulshisi3): New.
	(__mulohisi3): New.
	(__mulqi3, __mulqihi3, __umulqihi3, __mulhi3): Use DEFUN/ENDF to
	declare.
	* config/avr/predicates.md (pseudo_register_operand): Rewrite.
	(pseudo_register_or_const_int_operand): New.
	(combine_pseudo_register_operand): New.
	(u16_operand): New.
	(s16_operand): New.
	(o16_operand): New.
	* config/avr/avr.c (avr_rtx_costs): Handle costs for mult:SI.
	* config/avr/avr.md (QIHI, QIHI2): New mode iterators.
	(any_extend, any_extend2): New code iterators.
	(extend_prefix): New code attribute.
	(mulsi3): Rewrite. Turn insn to expander.
	(mulhisi3): Ditto.
	(umulhisi3): Ditto.
	(usmulhisi3): New expander.
	(*mulsi3): New insn-and-split.
	(mulu<mode>si3): New insn-and-split.
	(muls<mode>si3): New insn-and-split.
	(mulohisi3): New insn-and-split.
	(*uumulqihisi3, *uumulhiqisi3, *uumulhihisi3, *uumulqiqisi3,
	*usmulqihisi3, *usmulhiqisi3, *usmulhihisi3, *usmulqiqisi3,
	*sumulqihisi3, *sumulhiqisi3, *sumulhihisi3, *sumulqiqisi3,
	*ssmulqihisi3, *ssmulhiqisi3, *ssmulhihisi3, *ssmulqiqisi3): New
	insn-and-split.
	(*mulsi3_call): Rewrite.
	(*mulhisi3_call): Rewrite.
	(*umulhisi3_call): Rewrite.
	(*usmulhisi3_call): New insn.
	(*muluhisi3_call): New insn.
	(*mulshisi3_call): New insn.
	(*mulohisi3_call): New insn.
	(extendqihi2): Use combine_pseudo_register_operand as predicate
	for operand 1.
	(extendqisi2): Ditto.
	(zero_extendqihi2): Ditto.
	(zero_extendqisi2): Ditto.
	(zero_extendhisi2): Ditto.
	(extendhisi2): Ditto. Don't early-clobber operand 0.


[-- Attachment #2: opt-mul32.diff --]
[-- Type: text/x-patch, Size: 31959 bytes --]

Index: config/avr/predicates.md
===================================================================
--- config/avr/predicates.md	(revision 176818)
+++ config/avr/predicates.md	(working copy)
@@ -155,10 +155,34 @@ (define_predicate "call_insn_operand"
        (ior (match_test "register_operand (XEXP (op, 0), mode)")
             (match_test "CONSTANT_ADDRESS_P (XEXP (op, 0))"))))
 
+;; For some insns we must ensure that no hard register is inserted
+;; into their operands because the insns are split and the split
+;; involves hard registers.  An example are divmod insn that are
+;; split to insns that represent implicit library calls.
+
 ;; True for register that is pseudo register.
 (define_predicate "pseudo_register_operand"
-  (and (match_code "reg")
-       (match_test "!HARD_REGISTER_P (op)")))
+  (and (match_operand 0 "register_operand")
+       (not (and (match_code "reg")
+                 (match_test "HARD_REGISTER_P (op)")))))
+
+;; True for operand that is pseudo register or CONST_INT.
+(define_predicate "pseudo_register_or_const_int_operand"
+  (ior (match_operand 0 "const_int_operand")
+       (match_operand 0 "pseudo_register_operand")))
+
+;; We keep combiner from inserting hard registers into the input of sign- and
+;; zero-extends.  A hard register in the input operand is not wanted because
+;; 32-bit multiply patterns clobber some hard registers and extends with a
+;; hard register that overlaps these clobbers won't combine to a widening
+;; multiplication.  There is no need for combine to propagate or insert
+;; hard registers, register allocation can do it just as well.
+
+;; True for operand that is pseudo register at combine time.
+(define_predicate "combine_pseudo_register_operand"
+  (ior (match_operand 0 "pseudo_register_operand")
+       (and (match_operand 0 "register_operand")
+            (match_test "reload_completed || reload_in_progress"))))
 
 ;; Return true if OP is a constant integer that is either
 ;; 8 or 16 or 24.
@@ -189,3 +213,18 @@ (define_predicate "s9_operand"
 (define_predicate "register_or_s9_operand"
   (ior (match_operand 0 "register_operand")
        (match_operand 0 "s9_operand")))
+
+;; Unsigned CONST_INT that fits in 16 bits, i.e. 0..65536.
+(define_predicate "u16_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), 0, (1<<16)-1)")))
+
+;; Signed CONST_INT that fits in 16 bits, i.e. -32768..32767.
+(define_predicate "s16_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), -(1<<15), (1<<15)-1)")))
+
+;; One-extended CONST_INT that fits in 16 bits, i.e. -65536..-1.
+(define_predicate "o16_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), -(1<<16), -1)")))
Index: config/avr/libgcc.S
===================================================================
--- config/avr/libgcc.S	(revision 176818)
+++ config/avr/libgcc.S	(working copy)
@@ -72,10 +72,11 @@ see the files COPYING3 and COPYING.RUNTI
 .endm
 
 \f
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 /* Note: mulqi3, mulhi3 are open-coded on the enhanced core.  */
 #if !defined (__AVR_HAVE_MUL__)
 /*******************************************************
-               Multiplication  8 x 8
+    Multiplication  8 x 8  without MUL
 *******************************************************/
 #if defined (L_mulqi3)
 
@@ -83,9 +84,7 @@ see the files COPYING3 and COPYING.RUNTI
 #define	r_arg1 	r24		/* multiplier */
 #define r_res	__tmp_reg__	/* result */
 
-	.global	__mulqi3
-	.func	__mulqi3
-__mulqi3:
+DEFUN __mulqi3
 	clr	r_res		; clear result
 __mulqi3_loop:
 	sbrc	r_arg1,0
@@ -97,18 +96,16 @@ __mulqi3_loop:
 __mulqi3_exit:	
 	mov	r_arg1,r_res	; result to return register
 	ret
+ENDF __mulqi3
 
 #undef r_arg2  
 #undef r_arg1  
 #undef r_res   
 	
-.endfunc
 #endif 	/* defined (L_mulqi3) */
 
 #if defined (L_mulqihi3)
-	.global	__mulqihi3
-	.func	__mulqihi3
-__mulqihi3:
+DEFUN __mulqihi3
 	clr	r25
 	sbrc	r24, 7
 	dec	r25
@@ -116,21 +113,19 @@ __mulqihi3:
 	sbrc	r22, 7
 	dec	r22
 	rjmp	__mulhi3
-	.endfunc
+ENDF __mulqihi3:
 #endif /* defined (L_mulqihi3) */
 
 #if defined (L_umulqihi3)
-	.global	__umulqihi3
-	.func	__umulqihi3
-__umulqihi3:
+DEFUN __umulqihi3
 	clr	r25
 	clr	r23
 	rjmp	__mulhi3
-	.endfunc
+ENDF __umulqihi3
 #endif /* defined (L_umulqihi3) */
 
 /*******************************************************
-               Multiplication  16 x 16
+    Multiplication  16 x 16  without MUL
 *******************************************************/
 #if defined (L_mulhi3)
 #define	r_arg1L	r24		/* multiplier Low */
@@ -140,9 +135,7 @@ __umulqihi3:
 #define r_resL	__tmp_reg__	/* result Low */
 #define r_resH  r21		/* result High */
 
-	.global	__mulhi3
-	.func	__mulhi3
-__mulhi3:
+DEFUN __mulhi3
 	clr	r_resH		; clear result
 	clr	r_resL		; clear result
 __mulhi3_loop:
@@ -166,6 +159,7 @@ __mulhi3_exit:
 	mov	r_arg1H,r_resH	; result to return register
 	mov	r_arg1L,r_resL
 	ret
+ENDF __mulhi3
 
 #undef r_arg1L
 #undef r_arg1H
@@ -174,168 +168,51 @@ __mulhi3_exit:
 #undef r_resL 	
 #undef r_resH 
 
-.endfunc
 #endif /* defined (L_mulhi3) */
-#endif /* !defined (__AVR_HAVE_MUL__) */
 
 /*******************************************************
-      Widening Multiplication  32 = 16 x 16
+    Widening Multiplication  32 = 16 x 16  without MUL
 *******************************************************/
-                              
+
 #if defined (L_mulhisi3)
 DEFUN __mulhisi3
-#if defined (__AVR_HAVE_MUL__)
-
-;; r25:r22 = r19:r18 * r21:r20
-
-#define A0 18
-#define B0 20
-#define C0 22
-
-#define A1 A0+1
-#define B1 B0+1
-#define C1 C0+1
-#define C2 C0+2
-#define C3 C0+3
- 
-    ; C = (signed)A1 * (signed)B1
-    muls  A1, B1
-    movw  C2, R0
-
-    ; C += A0 * B0
-    mul   A0, B0
-    movw  C0, R0
-
-    ; C += (signed)A1 * B0
-    mulsu A1, B0
-    sbci  C3, 0
-    add   C1, R0
-    adc   C2, R1
-    clr   __zero_reg__
-    adc   C3, __zero_reg__
-
-    ; C += (signed)B1 * A0
-    mulsu B1, A0
-    sbci  C3, 0
-    XJMP  __xmulhisi3_exit
-
-#undef A0
-#undef A1
-#undef B0
-#undef B1
-#undef C0
-#undef C1
-#undef C2
-#undef C3
-
-#else /* !__AVR_HAVE_MUL__ */
 ;;; FIXME: This is dead code (noone calls it)
-	mov_l	r18, r24
-	mov_h	r19, r25
-	clr	r24
-	sbrc	r23, 7
-	dec	r24
-	mov	r25, r24
-	clr	r20
-	sbrc	r19, 7
-	dec	r20
-	mov	r21, r20
-	XJMP	__mulsi3
-#endif /* __AVR_HAVE_MUL__ */
+    mov_l   r18, r24
+    mov_h   r19, r25
+    clr     r24
+    sbrc    r23, 7
+    dec     r24
+    mov     r25, r24
+    clr     r20
+    sbrc    r19, 7
+    dec     r20
+    mov     r21, r20
+    XJMP    __mulsi3
 ENDF __mulhisi3
 #endif /* defined (L_mulhisi3) */
 
 #if defined (L_umulhisi3)
 DEFUN __umulhisi3
-#if defined (__AVR_HAVE_MUL__)
-
-;; r25:r22 = r19:r18 * r21:r20
-
-#define A0 18
-#define B0 20
-#define C0 22
-
-#define A1 A0+1
-#define B1 B0+1
-#define C1 C0+1
-#define C2 C0+2
-#define C3 C0+3
-
-    ; C = A1 * B1
-    mul   A1, B1
-    movw  C2, R0
-
-    ; C += A0 * B0
-    mul   A0, B0
-    movw  C0, R0
-
-    ; C += A1 * B0
-    mul   A1, B0
-    add   C1, R0
-    adc   C2, R1
-    clr   __zero_reg__
-    adc   C3, __zero_reg__
-
-    ; C += B1 * A0
-    mul   B1, A0
-    XJMP  __xmulhisi3_exit
-
-#undef A0
-#undef A1
-#undef B0
-#undef B1
-#undef C0
-#undef C1
-#undef C2
-#undef C3
-
-#else /* !__AVR_HAVE_MUL__ */
 ;;; FIXME: This is dead code (noone calls it)
-	mov_l	r18, r24
-	mov_h	r19, r25
-	clr	r24
-	clr	r25
-	clr	r20
-	clr	r21
-	XJMP	__mulsi3
-#endif /* __AVR_HAVE_MUL__ */
+    mov_l   r18, r24
+    mov_h   r19, r25
+    clr     r24
+    clr     r25
+    mov_l   r20, r24
+    mov_h   r21, r25
+    XJMP    __mulsi3
 ENDF __umulhisi3
 #endif /* defined (L_umulhisi3) */
 
-#if defined (L_xmulhisi3_exit)
-
-;;; Helper for __mulhisi3 resp. __umulhisi3.
-
-#define C0 22
-#define C1 C0+1
-#define C2 C0+2
-#define C3 C0+3
-
-DEFUN __xmulhisi3_exit
-    add   C1, R0
-    adc   C2, R1
-    clr   __zero_reg__
-    adc   C3, __zero_reg__
-    ret
-ENDF __xmulhisi3_exit
-
-#undef C0
-#undef C1
-#undef C2
-#undef C3
-
-#endif /* defined (L_xmulhisi3_exit) */
-
 #if defined (L_mulsi3)
 /*******************************************************
-               Multiplication  32 x 32
+    Multiplication  32 x 32  without MUL
 *******************************************************/
 #define r_arg1L  r22		/* multiplier Low */
 #define r_arg1H  r23
 #define	r_arg1HL r24
 #define	r_arg1HH r25		/* multiplier High */
 
-
 #define	r_arg2L  r18		/* multiplicand Low */
 #define	r_arg2H  r19	
 #define	r_arg2HL r20
@@ -346,43 +223,7 @@ ENDF __xmulhisi3_exit
 #define r_resHL	 r30
 #define r_resHH  r31		/* result High */
 
-	
-	.global	__mulsi3
-	.func	__mulsi3
-__mulsi3:
-#if defined (__AVR_HAVE_MUL__)
-	mul	r_arg1L, r_arg2L
-	movw	r_resL, r0
-	mul	r_arg1H, r_arg2H
-	movw	r_resHL, r0
-	mul	r_arg1HL, r_arg2L
-	add	r_resHL, r0
-	adc	r_resHH, r1
-	mul	r_arg1L, r_arg2HL
-	add	r_resHL, r0
-	adc	r_resHH, r1
-	mul	r_arg1HH, r_arg2L
-	add	r_resHH, r0
-	mul	r_arg1HL, r_arg2H
-	add	r_resHH, r0
-	mul	r_arg1H, r_arg2HL
-	add	r_resHH, r0
-	mul	r_arg1L, r_arg2HH
-	add	r_resHH, r0
-	clr	r_arg1HH	; use instead of __zero_reg__ to add carry
-	mul	r_arg1H, r_arg2L
-	add	r_resH, r0
-	adc	r_resHL, r1
-	adc	r_resHH, r_arg1HH ; add carry
-	mul	r_arg1L, r_arg2H
-	add	r_resH, r0
-	adc	r_resHL, r1
-	adc	r_resHH, r_arg1HH ; add carry
-	movw	r_arg1L, r_resL
-	movw	r_arg1HL, r_resHL
-	clr	r1		; __zero_reg__ clobbered by "mul"
-	ret
-#else
+DEFUN __mulsi3
 	clr	r_resHH		; clear result
 	clr	r_resHL		; clear result
 	clr	r_resH		; clear result
@@ -414,13 +255,13 @@ __mulsi3_exit:
 	mov_h	r_arg1H,r_resH
 	mov_l	r_arg1L,r_resL
 	ret
-#endif /* defined (__AVR_HAVE_MUL__) */
+ENDF __mulsi3
+
 #undef r_arg1L 
 #undef r_arg1H 
 #undef r_arg1HL
 #undef r_arg1HH
              
-             
 #undef r_arg2L 
 #undef r_arg2H 
 #undef r_arg2HL
@@ -431,9 +272,181 @@ __mulsi3_exit:
 #undef r_resHL 
 #undef r_resHH 
 
-.endfunc
 #endif /* defined (L_mulsi3) */
+
+#endif /* !defined (__AVR_HAVE_MUL__) */
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+\f
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#if defined (__AVR_HAVE_MUL__)    
+#define A0 26
+#define B0 18
+#define C0 22
+
+#define A1 A0+1
+
+#define B1 B0+1
+#define B2 B0+2
+#define B3 B0+3
+
+#define C1 C0+1
+#define C2 C0+2
+#define C3 C0+3
+
+/*******************************************************
+    Widening Multiplication  32 = 16 x 16
+*******************************************************/
+                              
+#if defined (L_mulhisi3)
+;;; R25:R22 = (signed long) R27:R26 * (signed long) R19:R18
+;;; C3:C0   = (signed long) A1:A0   * (signed long) B1:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __mulhisi3
+    XCALL   __umulhisi3
+    ;; Sign-extend B
+    tst     B1
+    brpl    1f
+    sub     C2, A0
+    sbc     C3, A1
+1:  ;; Sign-extend A
+    XJMP __usmulhisi3_tail
+ENDF __mulhisi3
+#endif /* L_mulhisi3 */
+
+#if defined (L_usmulhisi3)
+;;; R25:R22 = (signed long) R27:R26 * (unsigned long) R19:R18
+;;; C3:C0   = (signed long) A1:A0   * (unsigned long) B1:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __usmulhisi3
+    XCALL   __umulhisi3
+    ;; FALLTHRU
+ENDF __usmulhisi3
+
+DEFUN __usmulhisi3_tail
+    ;; Sign-extend A
+    sbrs    A1, 7
+    ret
+    sub     C2, B0
+    sbc     C3, B1
+    ret
+ENDF __usmulhisi3_tail
+#endif /* L_usmulhisi3 */
+
+#if defined (L_umulhisi3)
+;;; R25:R22 = (unsigned long) R27:R26 * (unsigned long) R19:R18
+;;; C3:C0   = (unsigned long) A1:A0   * (unsigned long) B1:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __umulhisi3
+    mul     A0, B0
+    movw    C0, r0
+    mul     A1, B1
+    movw    C2, r0
+    mul     A0, B1
+    rcall   1f
+    mul     A1, B0
+1:  add     C1, r0
+    adc     C2, r1
+    clr     __zero_reg__
+    adc     C3, __zero_reg__
+    ret
+ENDF __umulhisi3
+#endif /* L_umulhisi3 */
+
+/*******************************************************
+    Widening Multiplication  32 = 16 x 32
+*******************************************************/
+
+#if defined (L_mulshisi3)
+;;; R25:R22 = (signed long) R27:R26 * R21:R18
+;;; (C3:C0) = (signed long) A1:A0   * B3:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __mulshisi3
+#ifdef __AVR_HAVE_JMP_CALL__
+    ;; Some cores have problem skipping 2-word instruction
+    tst     A1
+    brmi    __mulohisi3
+#else
+    sbrs    A1, 7
+#endif /* __AVR_HAVE_JMP_CALL__ */
+    XJMP    __muluhisi3
+    ;; FALLTHRU
+ENDF __mulshisi3
+    
+;;; R25:R22 = (one-extended long) R27:R26 * R21:R18
+;;; (C3:C0) = (one-extended long) A1:A0   * B3:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __mulohisi3
+    XCALL   __muluhisi3
+    ;; One-extend R27:R26 (A1:A0)
+    sub     C2, B0
+    sbc     C3, B1
+    ret
+ENDF __mulohisi3
+#endif /* L_mulshisi3 */
+
+#if defined (L_muluhisi3)
+;;; R25:R22 = (unsigned long) R27:R26 * R21:R18
+;;; (C3:C0) = (unsigned long) A1:A0   * B3:B0
+;;; Clobbers: __tmp_reg__
+DEFUN __muluhisi3
+    XCALL   __umulhisi3
+    mul     A0, B3
+    add     C3, r0
+    mul     A1, B2
+    add     C3, r0
+    mul     A0, B2
+    add     C2, r0
+    adc     C3, r1
+    clr     __zero_reg__
+    ret
+ENDF __muluhisi3
+#endif /* L_muluhisi3 */
+
+/*******************************************************
+    Multiplication  32 x 32
+*******************************************************/
+
+#if defined (L_mulsi3)
+;;; R25:R22 = R25:R22 * R21:R18
+;;; (C3:C0) = C3:C0   * B3:B0
+;;; Clobbers: R26, R27, __tmp_reg__
+DEFUN __mulsi3
+    movw    A0, C0
+    push    C2
+    push    C3
+    XCALL   __muluhisi3
+    pop     A1
+    pop     A0
+    ;; A1:A0 now contains the high word of A
+    mul     A0, B0
+    add     C2, r0
+    adc     C3, r1
+    mul     A0, B1
+    add     C3, r0
+    mul     A1, B0
+    add     C3, r0
+    clr     __zero_reg__
+    ret
+ENDF __mulsi3
+#endif /* L_mulsi3 */
+
+#undef A0
+#undef A1
+
+#undef B0
+#undef B1
+#undef B2
+#undef B3
+
+#undef C0
+#undef C1
+#undef C2
+#undef C3
+
+#endif /* __AVR_HAVE_MUL__ */
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 	
+\f
 /*******************************************************
        Division 8 / 8 => (result + remainder)
 *******************************************************/
Index: config/avr/avr.md
===================================================================
--- config/avr/avr.md	(revision 176818)
+++ config/avr/avr.md	(working copy)
@@ -126,12 +126,25 @@ (define_attr "length" ""
 		       (const_int 2))]
         (const_int 2)))
 
-;; Define mode iterator
+;; Define mode iterators
+(define_mode_iterator QIHI  [(QI "") (HI "")])
+(define_mode_iterator QIHI2 [(QI "") (HI "")])
 (define_mode_iterator QISI [(QI "") (HI "") (SI "")])
 (define_mode_iterator QIDI [(QI "") (HI "") (SI "") (DI "")])
 (define_mode_iterator HIDI [(HI "") (SI "") (DI "")])
 (define_mode_iterator HISI [(HI "") (SI "")])
 
+;; Define code iterators
+;; Define two incarnations so that we can build the cross product.
+(define_code_iterator any_extend  [sign_extend zero_extend])
+(define_code_iterator any_extend2 [sign_extend zero_extend])
+
+;; Define code attributes
+(define_code_attr extend_prefix
+  [(sign_extend "s")
+   (zero_extend "u")])
+
+
 ;;========================================================================
 ;; The following is used by nonlocal_goto and setjmp.
 ;; The receiver pattern will create no instructions since internally
@@ -1349,69 +1362,310 @@ (define_insn "*mulhi3_call"
 
 ;; Operand 2 (reg:SI 18) not clobbered on the enhanced core.
 ;; All call-used registers clobbered otherwise - normal library call.
+;;    To support widening multiplicatioon with constant we postpone
+;; expanding to the implicit library call until post combine and
+;; prior to register allocation.  Clobber all hard registers that
+;; might be used by the (widening) multiply until it is split and
+;; it's final register footprint is worked out.
+
 (define_expand "mulsi3"
-  [(set (reg:SI 22) (match_operand:SI 1 "register_operand" ""))
-   (set (reg:SI 18) (match_operand:SI 2 "register_operand" ""))
-   (parallel [(set (reg:SI 22) (mult:SI (reg:SI 22) (reg:SI 18)))
-	      (clobber (reg:HI 26))
-	      (clobber (reg:HI 30))])
-   (set (match_operand:SI 0 "register_operand" "") (reg:SI 22))]
+  [(parallel [(set (match_operand:SI 0 "register_operand" "")
+                   (mult:SI (match_operand:SI 1 "register_operand" "")
+                            (match_operand:SI 2 "nonmemory_operand" "")))
+              (clobber (reg:DI 18))])]
   "AVR_HAVE_MUL"
-  "")
+  {
+    if (u16_operand (operands[2], SImode))
+      {
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_muluhisi3 (operands[0], operands[2], operands[1]));
+        DONE;
+      }
 
-(define_insn "*mulsi3_call"
-  [(set (reg:SI 22) (mult:SI (reg:SI 22) (reg:SI 18)))
-   (clobber (reg:HI 26))
-   (clobber (reg:HI 30))]
-  "AVR_HAVE_MUL"
-  "%~call __mulsi3"
-  [(set_attr "type" "xcall")
-   (set_attr "cc" "clobber")])
+    if (o16_operand (operands[2], SImode))
+      {
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_mulohisi3 (operands[0], operands[2], operands[1]));
+        DONE;
+      }
+  })
 
-(define_expand "mulhisi3"
-  [(set (reg:HI 18)
-        (match_operand:HI 1 "register_operand" ""))
-   (set (reg:HI 20)
-        (match_operand:HI 2 "register_operand" ""))
+(define_insn_and_split "*mulsi3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                      "=r")
+        (mult:SI (match_operand:SI 1 "pseudo_register_operand"              "r")
+                 (match_operand:SI 2 "pseudo_register_or_const_int_operand" "rn")))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (reg:SI 18)
+        (match_dup 1))
    (set (reg:SI 22) 
-        (mult:SI (sign_extend:SI (reg:HI 18))
-                 (sign_extend:SI (reg:HI 20))))
-   (set (match_operand:SI 0 "register_operand" "") 
+        (match_dup 2))
+   (parallel [(set (reg:SI 22)
+                   (mult:SI (reg:SI 22)
+                            (reg:SI 18)))
+              (clobber (reg:HI 26))])
+   (set (match_dup 0)
+        (reg:SI 22))]
+  {
+    if (u16_operand (operands[2], SImode))
+      {
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_muluhisi3 (operands[0], operands[2], operands[1]));
+        DONE;
+      }
+
+    if (o16_operand (operands[2], SImode))
+      {
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_mulohisi3 (operands[0], operands[2], operands[1]));
+        DONE;
+      }
+  })
+
+;; "muluqisi3"
+;; "muluhisi3"
+(define_insn_and_split "mulu<mode>si3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                           "=r")
+        (mult:SI (zero_extend:SI (match_operand:QIHI 1 "pseudo_register_operand" "r"))
+                 (match_operand:SI 2 "pseudo_register_or_const_int_operand"      "rn")))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (reg:HI 26)
+        (match_dup 1))
+   (set (reg:SI 18)
+        (match_dup 2))
+   (set (reg:SI 22)
+        (mult:SI (zero_extend:SI (reg:HI 26))
+                 (reg:SI 18)))
+   (set (match_dup 0)
+        (reg:SI 22))]
+  {
+    /* Do the QI -> HI extension explicitely before the multiplication.  */
+    /* Do the HI -> SI extension implicitely and after the multiplication.  */
+       
+    if (QImode == <MODE>mode)
+      operands[1] = gen_rtx_ZERO_EXTEND (HImode, operands[1]);
+
+    if (u16_operand (operands[2], SImode))
+      {
+        operands[1] = force_reg (HImode, operands[1]);
+        operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+        emit_insn (gen_umulhisi3 (operands[0], operands[1], operands[2]));
+        DONE;
+      }
+  })
+
+;; "mulsqisi3"
+;; "mulshisi3"
+(define_insn_and_split "muls<mode>si3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                           "=r")
+        (mult:SI (sign_extend:SI (match_operand:QIHI 1 "pseudo_register_operand" "r"))
+                 (match_operand:SI 2 "pseudo_register_or_const_int_operand"      "rn")))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (reg:HI 26)
+        (match_dup 1))
+   (set (reg:SI 18)
+        (match_dup 2))
+   (set (reg:SI 22)
+        (mult:SI (sign_extend:SI (reg:HI 26))
+                 (reg:SI 18)))
+   (set (match_dup 0)
+        (reg:SI 22))]
+  {
+    /* Do the QI -> HI extension explicitely before the multiplication.  */
+    /* Do the HI -> SI extension implicitely and after the multiplication.  */
+       
+    if (QImode == <MODE>mode)
+      operands[1] = gen_rtx_SIGN_EXTEND (HImode, operands[1]);
+
+    if (u16_operand (operands[2], SImode)
+        || s16_operand (operands[2], SImode))
+      {
+        rtx xop2 = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
+
+        operands[1] = force_reg (HImode, operands[1]);
+
+        if (u16_operand (operands[2], SImode))
+          emit_insn (gen_usmulhisi3 (operands[0], xop2, operands[1]));
+        else
+          emit_insn (gen_mulhisi3 (operands[0], operands[1], xop2));
+
+        DONE;
+      }
+  })
+
+;; One-extend operand 1
+
+(define_insn_and_split "mulohisi3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                          "=r")
+        (mult:SI (not:SI (zero_extend:SI 
+                          (not:HI (match_operand:HI 1 "pseudo_register_operand" "r"))))
+                 (match_operand:SI 2 "pseudo_register_or_const_int_operand"     "rn")))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (reg:HI 26)
+        (match_dup 1))
+   (set (reg:SI 18)
+        (match_dup 2))
+   (set (reg:SI 22)
+        (mult:SI (not:SI (zero_extend:SI (not:HI (reg:HI 26))))
+                 (reg:SI 18)))
+   (set (match_dup 0)
         (reg:SI 22))]
+  "")
+
+(define_expand "mulhisi3"
+  [(parallel [(set (match_operand:SI 0 "register_operand" "")
+                   (mult:SI (sign_extend:SI (match_operand:HI 1 "register_operand" ""))
+                            (sign_extend:SI (match_operand:HI 2 "register_operand" ""))))
+              (clobber (reg:DI 18))])]
   "AVR_HAVE_MUL"
   "")
 
 (define_expand "umulhisi3"
+  [(parallel [(set (match_operand:SI 0 "register_operand" "")
+                   (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" ""))
+                            (zero_extend:SI (match_operand:HI 2 "register_operand" ""))))
+              (clobber (reg:DI 18))])]
+  "AVR_HAVE_MUL"
+  "")
+
+(define_expand "usmulhisi3"
+  [(parallel [(set (match_operand:SI 0 "register_operand" "")
+                   (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" ""))
+                            (sign_extend:SI (match_operand:HI 2 "register_operand" ""))))
+              (clobber (reg:DI 18))])]
+  "AVR_HAVE_MUL"
+  "")
+
+;; "*uumulqihisi3" "*uumulhiqisi3" "*uumulhihisi3" "*uumulqiqisi3"
+;; "*usmulqihisi3" "*usmulhiqisi3" "*usmulhihisi3" "*usmulqiqisi3"
+;; "*sumulqihisi3" "*sumulhiqisi3" "*sumulhihisi3" "*sumulqiqisi3"
+;; "*ssmulqihisi3" "*ssmulhiqisi3" "*ssmulhihisi3" "*ssmulqiqisi3"
+(define_insn_and_split
+  "*<any_extend:extend_prefix><any_extend2:extend_prefix>mul<QIHI:mode><QIHI2:mode>si3"
+  [(set (match_operand:SI 0 "pseudo_register_operand"                            "=r")
+        (mult:SI (any_extend:SI (match_operand:QIHI 1 "pseudo_register_operand"   "r"))
+                 (any_extend2:SI (match_operand:QIHI2 2 "pseudo_register_operand" "r"))))
+   (clobber (reg:DI 18))]
+  "AVR_HAVE_MUL && !reload_completed"
+  { gcc_unreachable(); }
+  "&& 1"
   [(set (reg:HI 18)
-        (match_operand:HI 1 "register_operand" ""))
-   (set (reg:HI 20)
-        (match_operand:HI 2 "register_operand" ""))
-   (set (reg:SI 22) 
-        (mult:SI (zero_extend:SI (reg:HI 18))
-                 (zero_extend:SI (reg:HI 20))))
-   (set (match_operand:SI 0 "register_operand" "") 
+        (match_dup 1))
+   (set (reg:HI 26)
+        (match_dup 2))
+   (set (reg:SI 22)
+        (mult:SI (match_dup 3)
+                 (match_dup 4)))
+   (set (match_dup 0)
         (reg:SI 22))]
+  {
+    rtx xop1 = operands[1];
+    rtx xop2 = operands[2];
+
+    /* Do the QI -> HI extension explicitely before the multiplication.  */
+    /* Do the HI -> SI extension implicitely and after the multiplication.  */
+       
+    if (QImode == <QIHI:MODE>mode)
+      xop1 = gen_rtx_fmt_e (<any_extend:CODE>, HImode, xop1);
+
+    if (QImode == <QIHI2:MODE>mode)
+      xop2 = gen_rtx_fmt_e (<any_extend2:CODE>, HImode, xop2);
+
+    if (<any_extend:CODE> == <any_extend2:CODE>
+        || <any_extend:CODE> == ZERO_EXTEND)
+      {
+        operands[1] = xop1;
+        operands[2] = xop2;
+        operands[3] = gen_rtx_fmt_e (<any_extend:CODE>, SImode, gen_rtx_REG (HImode, 18));
+        operands[4] = gen_rtx_fmt_e (<any_extend2:CODE>, SImode, gen_rtx_REG (HImode, 26));
+      }
+    else
+      {
+        /* <any_extend:CODE>  = SIGN_EXTEND */
+        /* <any_extend2:CODE> = ZERO_EXTEND */
+
+        operands[1] = xop2;
+        operands[2] = xop1;
+        operands[3] = gen_rtx_ZERO_EXTEND (SImode, gen_rtx_REG (HImode, 18));
+        operands[4] = gen_rtx_SIGN_EXTEND (SImode, gen_rtx_REG (HImode, 26));
+      }
+  })
+
+(define_insn "*mulsi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (reg:SI 22)
+                 (reg:SI 18)))
+   (clobber (reg:HI 26))]
   "AVR_HAVE_MUL"
-  "")
+  "%~call __mulsi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
 
 (define_insn "*mulhisi3_call"
-  [(set (reg:SI 22) 
+  [(set (reg:SI 22)
         (mult:SI (sign_extend:SI (reg:HI 18))
-                 (sign_extend:SI (reg:HI 20))))]
+                 (sign_extend:SI (reg:HI 26))))]
   "AVR_HAVE_MUL"
   "%~call __mulhisi3"
   [(set_attr "type" "xcall")
    (set_attr "cc" "clobber")])
 
 (define_insn "*umulhisi3_call"
-  [(set (reg:SI 22) 
+  [(set (reg:SI 22)
         (mult:SI (zero_extend:SI (reg:HI 18))
-                 (zero_extend:SI (reg:HI 20))))]
+                 (zero_extend:SI (reg:HI 26))))]
   "AVR_HAVE_MUL"
   "%~call __umulhisi3"
   [(set_attr "type" "xcall")
    (set_attr "cc" "clobber")])
 
+(define_insn "*usmulhisi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (zero_extend:SI (reg:HI 18))
+                 (sign_extend:SI (reg:HI 26))))]
+  "AVR_HAVE_MUL"
+  "%~call __usmulhisi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
+(define_insn "*muluhisi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (zero_extend:SI (reg:HI 26))
+                 (reg:SI 18)))]
+  "AVR_HAVE_MUL"
+  "%~call __muluhisi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
+(define_insn "*mulshisi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (sign_extend:SI (reg:HI 26))
+                 (reg:SI 18)))]
+  "AVR_HAVE_MUL"
+  "%~call __mulshisi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
+(define_insn "*mulohisi3_call"
+  [(set (reg:SI 22)
+        (mult:SI (not:SI (zero_extend:SI (not:HI (reg:HI 26))))
+                 (reg:SI 18)))]
+  "AVR_HAVE_MUL"
+  "%~call __mulohisi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
 ; / % / % / % / % / % / % / % / % / % / % / % / % / % / % / % / % / % / % / %
 ; divmod
 
@@ -2399,9 +2653,16 @@ (define_insn "one_cmplsi2"
 ;; xx<---x xx<---x xx<---x xx<---x xx<---x xx<---x xx<---x xx<---x xx<---x
 ;; sign extend
 
+;; We keep combiner from inserting hard registers into the input of sign- and
+;; zero-extends.  A hard register in the input operand is not wanted because
+;; 32-bit multiply patterns clobber some hard registers and extends with a
+;; hard register that overlaps these clobbers won't be combined to a widening
+;; multiplication.  There is no need for combine to propagate hard registers,
+;; register allocation can do it just as well.
+
 (define_insn "extendqihi2"
   [(set (match_operand:HI 0 "register_operand" "=r,r")
-        (sign_extend:HI (match_operand:QI 1 "register_operand" "0,*r")))]
+        (sign_extend:HI (match_operand:QI 1 "combine_pseudo_register_operand" "0,*r")))]
   ""
   "@
 	clr %B0\;sbrc %0,7\;com %B0
@@ -2411,7 +2672,7 @@ (define_insn "extendqihi2"
 
 (define_insn "extendqisi2"
   [(set (match_operand:SI 0 "register_operand" "=r,r")
-        (sign_extend:SI (match_operand:QI 1 "register_operand" "0,*r")))]
+        (sign_extend:SI (match_operand:QI 1 "combine_pseudo_register_operand" "0,*r")))]
   ""
   "@
 	clr %B0\;sbrc %A0,7\;com %B0\;mov %C0,%B0\;mov %D0,%B0
@@ -2420,8 +2681,8 @@ (define_insn "extendqisi2"
    (set_attr "cc" "set_n,set_n")])
 
 (define_insn "extendhisi2"
-  [(set (match_operand:SI 0 "register_operand"               "=r,&r")
-        (sign_extend:SI (match_operand:HI 1 "register_operand" "0,*r")))]
+  [(set (match_operand:SI 0 "register_operand"                               "=r,r")
+        (sign_extend:SI (match_operand:HI 1 "combine_pseudo_register_operand" "0,*r")))]
   ""
   "@
 	clr %C0\;sbrc %B0,7\;com %C0\;mov %D0,%C0
@@ -2438,7 +2699,7 @@ (define_insn "extendhisi2"
 
 (define_insn_and_split "zero_extendqihi2"
   [(set (match_operand:HI 0 "register_operand" "=r")
-        (zero_extend:HI (match_operand:QI 1 "register_operand" "r")))]
+        (zero_extend:HI (match_operand:QI 1 "combine_pseudo_register_operand" "r")))]
   ""
   "#"
   "reload_completed"
@@ -2454,7 +2715,7 @@ (define_insn_and_split "zero_extendqihi2
 
 (define_insn_and_split "zero_extendqisi2"
   [(set (match_operand:SI 0 "register_operand" "=r")
-        (zero_extend:SI (match_operand:QI 1 "register_operand" "r")))]
+        (zero_extend:SI (match_operand:QI 1 "combine_pseudo_register_operand" "r")))]
   ""
   "#"
   "reload_completed"
@@ -2469,8 +2730,8 @@ (define_insn_and_split "zero_extendqisi2
 })
 
 (define_insn_and_split "zero_extendhisi2"
-  [(set (match_operand:SI 0 "register_operand" "=r")
-        (zero_extend:SI (match_operand:HI 1 "register_operand" "r")))]
+  [(set (match_operand:SI 0 "register_operand"                               "=r")
+        (zero_extend:SI (match_operand:HI 1 "combine_pseudo_register_operand" "r")))]
   ""
   "#"
   "reload_completed"
Index: config/avr/t-avr
===================================================================
--- config/avr/t-avr	(revision 176818)
+++ config/avr/t-avr	(working copy)
@@ -41,7 +41,9 @@ LIB1ASMFUNCS = \
 	_mulhi3 \
 	_mulhisi3 \
 	_umulhisi3 \
-	_xmulhisi3_exit \
+	_usmulhisi3 \
+	_muluhisi3 \
+	_mulshisi3 \
 	_mulsi3 \
 	_udivmodqi4 \
 	_divmodqi4 \
Index: config/avr/avr.c
===================================================================
--- config/avr/avr.c	(revision 176818)
+++ config/avr/avr.c	(working copy)
@@ -5515,6 +5515,34 @@ avr_rtx_costs (rtx x, int codearg, int o
 	    return false;
 	  break;
 
+	case SImode:
+	  if (AVR_HAVE_MUL)
+            {
+              if (!speed)
+                {
+                  /* Add some additional costs besides CALL like moves etc.  */
+
+                  *total = COSTS_N_INSNS (AVR_HAVE_JMP_CALL ? 5 : 4);
+                }
+              else
+                {
+                  /* Just a rough estimate.  Even with -O2 we don't want bulky
+                     code expanded inline.  */
+
+                  *total = COSTS_N_INSNS (25);
+                }
+            }
+          else
+            {
+              if (speed)
+                *total = COSTS_N_INSNS (300);
+              else
+                /* Add some additional costs besides CALL like moves etc.  */
+                *total = COSTS_N_INSNS (AVR_HAVE_JMP_CALL ? 5 : 4);
+            }
+          
+          return true;
+          
 	default:
 	  return false;
 	}

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-27 14:08       ` Georg-Johann Lay
@ 2011-07-27 15:58         ` Richard Henderson
  2011-07-27 16:12           ` Georg-Johann Lay
  2011-07-29 12:14           ` Georg-Johann Lay
  2011-07-29 11:58         ` [Committed,AVR]: Addendum to fix thinko in PR49687 Georg-Johann Lay
  1 sibling, 2 replies; 17+ messages in thread
From: Richard Henderson @ 2011-07-27 15:58 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: Weddington, Eric, gcc-patches, Anatoly Sokolov, Denis Chertykov

On 07/27/2011 06:21 AM, Georg-Johann Lay wrote:
> +(define_insn_and_split "*mulsi3"
> +  [(set (match_operand:SI 0 "pseudo_register_operand"                      "=r")
> +        (mult:SI (match_operand:SI 1 "pseudo_register_operand"              "r")
> +                 (match_operand:SI 2 "pseudo_register_or_const_int_operand" "rn")))
> +   (clobber (reg:DI 18))]
> +  "AVR_HAVE_MUL && !reload_completed"
> +  { gcc_unreachable(); }
> +  "&& 1"
> +  [(set (reg:SI 18)
> +        (match_dup 1))

That seems like it's guaranteed to force an unnecessary move.
Have you tried defining special-purpose register classes to
force reload to move the data into the right hard regs?

E.g.  "Y" prefix
      "QHS" size
      two digit starting register number, as needed.

You'll probably end up with quite a few register classes 
out of this, but hopefully reload can do a better job than
you can manually...


r~

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-27 15:58         ` Richard Henderson
@ 2011-07-27 16:12           ` Georg-Johann Lay
  2011-07-27 17:06             ` Richard Henderson
  2011-07-29 12:14           ` Georg-Johann Lay
  1 sibling, 1 reply; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-27 16:12 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Weddington, Eric, gcc-patches, Anatoly Sokolov, Denis Chertykov

Richard Henderson wrote:
> On 07/27/2011 06:21 AM, Georg-Johann Lay wrote:
>> +(define_insn_and_split "*mulsi3"
>> +  [(set (match_operand:SI 0 "pseudo_register_operand"                      "=r")
>> +        (mult:SI (match_operand:SI 1 "pseudo_register_operand"              "r")
>> +                 (match_operand:SI 2 "pseudo_register_or_const_int_operand" "rn")))
>> +   (clobber (reg:DI 18))]
>> +  "AVR_HAVE_MUL && !reload_completed"
>> +  { gcc_unreachable(); }
>> +  "&& 1"
>> +  [(set (reg:SI 18)
>> +        (match_dup 1))
> 
> That seems like it's guaranteed to force an unnecessary move.

It's the same as with the present implementation:


long mul (long a, long b)
{
    return a*b;
}

long mul2 (long a, long b)
{
    return b*a;
}

translates -Os -mmcu=atmega8 to

mul:
	rcall __mulsi3
	ret

mul2:
	push r12
	push r13
	push r14
	push r15
	movw r12,r22
	movw r14,r24
	movw r24,r20
	movw r22,r18
	movw r20,r14
	movw r18,r12
	rcall __mulsi3
	pop r15
	pop r14
	pop r13
	pop r12
	ret

> Have you tried defining special-purpose register classes to
> force reload to move the data into the right hard regs?
> 
> E.g.  "Y" prefix
>       "QHS" size
>       two digit starting register number, as needed.

I already thought about such register classes/constraints with
almost the same nomenclature, i.e. with prefix "R".

> You'll probably end up with quite a few register classes 
> out of this, but hopefully reload can do a better job than
> you can manually...

Agreed.

insns that will benefit are insns with two input operands that
commute, i.e. mulsi3, umulhisi3, mulhisi3, mulhi3.

Maybe even other 2-input insns could benefit because there's no
predetermined order in which the moves are accomplished; e.g.
moving R24 before R22 in udivmodqi4.  I don't know if register
allocator is smart enough to swap the assignments if that is
better.

Moreover, it would reduce the number of insns resp. split
patterns and help cleanup md.

I'd prefer to do that work in a separate patch.  The current patch
behaves the same as the old code, so it's not a performance
regression of the current patch.

Johann




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-27 16:12           ` Georg-Johann Lay
@ 2011-07-27 17:06             ` Richard Henderson
  2011-07-27 21:40               ` Georg-Johann Lay
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Henderson @ 2011-07-27 17:06 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: Weddington, Eric, gcc-patches, Anatoly Sokolov, Denis Chertykov

On 07/27/2011 08:57 AM, Georg-Johann Lay wrote:
>> > You'll probably end up with quite a few register classes 
>> > out of this, but hopefully reload can do a better job than
>> > you can manually...
> Agreed.
> 
> insns that will benefit are insns with two input operands that
> commute, i.e. mulsi3, umulhisi3, mulhisi3, mulhi3.
> 
> Maybe even other 2-input insns could benefit because there's no
> predetermined order in which the moves are accomplished; e.g.
> moving R24 before R22 in udivmodqi4.  I don't know if register
> allocator is smart enough to swap the assignments if that is
> better.
> 
> Moreover, it would reduce the number of insns resp. split
> patterns and help cleanup md.
> 
> I'd prefer to do that work in a separate patch.  The current patch
> behaves the same as the old code, so it's not a performance
> regression of the current patch.

Fair enough.

I didn't review the asm code, but the rest of the patch look ok to me.


r~

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-27 17:06             ` Richard Henderson
@ 2011-07-27 21:40               ` Georg-Johann Lay
  2011-07-28  1:57                 ` Weddington, Eric
  0 siblings, 1 reply; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-27 21:40 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Weddington, Eric, gcc-patches, Anatoly Sokolov, Denis Chertykov

http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02391.html

Richard Henderson schrieb:

> On 07/27/2011 08:57 AM, Georg-Johann Lay wrote:
> 
>>>>You'll probably end up with quite a few register classes 
>>>>out of this, but hopefully reload can do a better job than
>>>>you can manually...
>>
>>Agreed.
>>
>>insns that will benefit are insns with two input operands that
>>commute, i.e. mulsi3, umulhisi3, mulhisi3, mulhi3.
>>
>>Maybe even other 2-input insns could benefit because there's no
>>predetermined order in which the moves are accomplished; e.g.
>>moving R24 before R22 in udivmodqi4.  I don't know if register
>>allocator is smart enough to swap the assignments if that is
>>better.
>>
>>Moreover, it would reduce the number of insns resp. split
>>patterns and help cleanup md.
>>
>>I'd prefer to do that work in a separate patch.  The current patch
>>behaves the same as the old code, so it's not a performance
>>regression of the current patch.
> 
> Fair enough.
> 
> I didn't review the asm code, but the rest of the patch look ok to me.
> 
> r~

Thanks, Eric will review the asm part  :-)

Johann



^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-27 21:40               ` Georg-Johann Lay
@ 2011-07-28  1:57                 ` Weddington, Eric
  2011-07-28  9:40                   ` Georg-Johann Lay
  0 siblings, 1 reply; 17+ messages in thread
From: Weddington, Eric @ 2011-07-28  1:57 UTC (permalink / raw)
  To: Georg-Johann Lay, Richard Henderson
  Cc: gcc-patches, Anatoly Sokolov, Denis Chertykov


> -----Original Message-----
> From: Georg-Johann Lay [mailto:avr@gjlay.de]
> Sent: Wednesday, July 27, 2011 3:00 PM
> To: Richard Henderson
> Cc: Weddington, Eric; gcc-patches@gcc.gnu.org; Anatoly Sokolov; Denis
> Chertykov
> Subject: Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
> 
> >
> > Fair enough.
> >
> > I didn't review the asm code, but the rest of the patch look ok to me.
> >
> > r~
> 
> Thanks, Eric will review the asm part  :-)

LOL
I trust you on the asm stuff. Ok by me.

However, how is our test coverage in this area?

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-28  1:57                 ` Weddington, Eric
@ 2011-07-28  9:40                   ` Georg-Johann Lay
  0 siblings, 0 replies; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-28  9:40 UTC (permalink / raw)
  To: Weddington, Eric
  Cc: Richard Henderson, gcc-patches, Anatoly Sokolov, Denis Chertykov,
	Anatoly Sokolov, Jörg Wunsch

Weddington, Eric wrote:
>> Subject: Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
>>
>>> I didn't review the asm code, but the rest of the patch look ok to me.
>>>
>>> r~
>> Thanks, Eric will review the asm part  :-)
> 
> LOL
> I trust you on the asm stuff. Ok by me.

Ok, I installed it.  Don't forget to rebuild *all* your libraries
including avr-libc after upgrading!

> However, how is our test coverage in this area?
> 
> Eric

As I wrote in the initial mail

http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02113.html

> The patch passes without regressions, of course.
> 
> Moreover, I drove individual tests of the routines against the old implementation
> before integrating them into libgcc to run regression tests.

so I think there is reasonable test coverage.

libgcc.S already contained routines for widening multiply for the case when no
multiplier is available; these parts are dead code up to now; IMHO small devices
would benefit from supporting them in the compiler; in particular the two
16 = 8 * 8 cases.

I did not yet try to run the testsuite for a target without MUL, i.e. compile
for a target without MUL but simulate on ATmega128.

Did you ever run testsuite for a target without MUL?

Johann

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-29 12:14           ` Georg-Johann Lay
@ 2011-07-28 16:03             ` Richard Henderson
  2011-07-31 18:06             ` Denis Chertykov
  1 sibling, 0 replies; 17+ messages in thread
From: Richard Henderson @ 2011-07-28 16:03 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: Weddington, Eric, gcc-patches, Anatoly Sokolov, Denis Chertykov,
	Vladimir Makarov

On 07/28/2011 07:59 AM, Georg-Johann Lay wrote:
> So it appears that IRA is not as smart as we thought and not
> prepared for this...
> 
> Or did I do something fundamentally wrong?

It sure doesn't look like you've done anything wrong.


r~

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Committed,AVR]: Addendum to fix thinko in PR49687
  2011-07-27 14:08       ` Georg-Johann Lay
  2011-07-27 15:58         ` Richard Henderson
@ 2011-07-29 11:58         ` Georg-Johann Lay
  1 sibling, 0 replies; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-29 11:58 UTC (permalink / raw)
  To: Weddington, Eric
  Cc: gcc-patches, Anatoly Sokolov, Denis Chertykov, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 619 bytes --]

http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02391.html

(reg:DI 18) does not cover (reg:HI 26) which also contributes to
the register footprint of implicit libgcc calls.

I should return to elementary school and learn counting again...

Installed as obvious, passed without regressions.

http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=176923

Johann

	PR target/49687
	* config/avr/avr.md (mulsi3, *mulsi3, mulu<mode>si3,
	muls<mode>si3, mulohisi3, mulhisi3, umulhisi3, usmulhisi3,
	*<any_extend:extend_prefix><any_extend2:extend_prefix>mul<QIHI:mode><QIHI2:mode>si3):
	Add X to register footprint: Clobber r26/r27.

[-- Attachment #2: opt-mul32-clobber.diff --]
[-- Type: text/x-patch, Size: 3845 bytes --]

Index: config/avr/avr.md
===================================================================
--- config/avr/avr.md	(revision 176920)
+++ config/avr/avr.md	(working copy)
@@ -1373,6 +1373,7 @@ (define_expand "mulsi3"
   [(parallel [(set (match_operand:SI 0 "register_operand" "")
                    (mult:SI (match_operand:SI 1 "register_operand" "")
                             (match_operand:SI 2 "nonmemory_operand" "")))
+              (clobber (reg:HI 26))
               (clobber (reg:DI 18))])]
   "AVR_HAVE_MUL"
   {
@@ -1395,6 +1396,7 @@ (define_insn_and_split "*mulsi3"
   [(set (match_operand:SI 0 "pseudo_register_operand"                      "=r")
         (mult:SI (match_operand:SI 1 "pseudo_register_operand"              "r")
                  (match_operand:SI 2 "pseudo_register_or_const_int_operand" "rn")))
+   (clobber (reg:HI 26))
    (clobber (reg:DI 18))]
   "AVR_HAVE_MUL && !reload_completed"
   { gcc_unreachable(); }
@@ -1431,6 +1433,7 @@ (define_insn_and_split "mulu<mode>si3"
   [(set (match_operand:SI 0 "pseudo_register_operand"                           "=r")
         (mult:SI (zero_extend:SI (match_operand:QIHI 1 "pseudo_register_operand" "r"))
                  (match_operand:SI 2 "pseudo_register_or_const_int_operand"      "rn")))
+   (clobber (reg:HI 26))
    (clobber (reg:DI 18))]
   "AVR_HAVE_MUL && !reload_completed"
   { gcc_unreachable(); }
@@ -1466,6 +1469,7 @@ (define_insn_and_split "muls<mode>si3"
   [(set (match_operand:SI 0 "pseudo_register_operand"                           "=r")
         (mult:SI (sign_extend:SI (match_operand:QIHI 1 "pseudo_register_operand" "r"))
                  (match_operand:SI 2 "pseudo_register_or_const_int_operand"      "rn")))
+   (clobber (reg:HI 26))
    (clobber (reg:DI 18))]
   "AVR_HAVE_MUL && !reload_completed"
   { gcc_unreachable(); }
@@ -1509,6 +1513,7 @@ (define_insn_and_split "mulohisi3"
         (mult:SI (not:SI (zero_extend:SI 
                           (not:HI (match_operand:HI 1 "pseudo_register_operand" "r"))))
                  (match_operand:SI 2 "pseudo_register_or_const_int_operand"     "rn")))
+   (clobber (reg:HI 26))
    (clobber (reg:DI 18))]
   "AVR_HAVE_MUL && !reload_completed"
   { gcc_unreachable(); }
@@ -1528,6 +1533,7 @@ (define_expand "mulhisi3"
   [(parallel [(set (match_operand:SI 0 "register_operand" "")
                    (mult:SI (sign_extend:SI (match_operand:HI 1 "register_operand" ""))
                             (sign_extend:SI (match_operand:HI 2 "register_operand" ""))))
+              (clobber (reg:HI 26))
               (clobber (reg:DI 18))])]
   "AVR_HAVE_MUL"
   "")
@@ -1536,6 +1542,7 @@ (define_expand "umulhisi3"
   [(parallel [(set (match_operand:SI 0 "register_operand" "")
                    (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" ""))
                             (zero_extend:SI (match_operand:HI 2 "register_operand" ""))))
+              (clobber (reg:HI 26))
               (clobber (reg:DI 18))])]
   "AVR_HAVE_MUL"
   "")
@@ -1544,6 +1551,7 @@ (define_expand "usmulhisi3"
   [(parallel [(set (match_operand:SI 0 "register_operand" "")
                    (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" ""))
                             (sign_extend:SI (match_operand:HI 2 "register_operand" ""))))
+              (clobber (reg:HI 26))
               (clobber (reg:DI 18))])]
   "AVR_HAVE_MUL"
   "")
@@ -1557,6 +1565,7 @@ (define_insn_and_split
   [(set (match_operand:SI 0 "pseudo_register_operand"                            "=r")
         (mult:SI (any_extend:SI (match_operand:QIHI 1 "pseudo_register_operand"   "r"))
                  (any_extend2:SI (match_operand:QIHI2 2 "pseudo_register_operand" "r"))))
+   (clobber (reg:HI 26))
    (clobber (reg:DI 18))]
   "AVR_HAVE_MUL && !reload_completed"
   { gcc_unreachable(); }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-27 15:58         ` Richard Henderson
  2011-07-27 16:12           ` Georg-Johann Lay
@ 2011-07-29 12:14           ` Georg-Johann Lay
  2011-07-28 16:03             ` Richard Henderson
  2011-07-31 18:06             ` Denis Chertykov
  1 sibling, 2 replies; 17+ messages in thread
From: Georg-Johann Lay @ 2011-07-29 12:14 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Weddington, Eric, gcc-patches, Anatoly Sokolov, Denis Chertykov,
	Anatoly Sokolov, Vladimir Makarov

[-- Attachment #1: Type: text/plain, Size: 5763 bytes --]

Richard Henderson wrote:
> On 07/27/2011 06:21 AM, Georg-Johann Lay wrote:
>> +(define_insn_and_split "*mulsi3"
>> +  [(set (match_operand:SI 0 "pseudo_register_operand"                      "=r")
>> +        (mult:SI (match_operand:SI 1 "pseudo_register_operand"              "r")
>> +                 (match_operand:SI 2 "pseudo_register_or_const_int_operand" "rn")))
>> +   (clobber (reg:DI 18))]
>> +  "AVR_HAVE_MUL && !reload_completed"
>> +  { gcc_unreachable(); }
>> +  "&& 1"
>> +  [(set (reg:SI 18)
>> +        (match_dup 1))
> 
> That seems like it's guaranteed to force an unnecessary move.
> Have you tried defining special-purpose register classes to
> force reload to move the data into the right hard regs?
> 
> E.g.  "Y" prefix
>       "QHS" size
>       two digit starting register number, as needed.
> 
> You'll probably end up with quite a few register classes 
> out of this, but hopefully reload can do a better job than
> you can manually...
> 
> 
> r~

Waaaaaahh, I introduced register classes and constraints to tell register allocator
what's the intention if the insns, the ... parts just dealing with CONST_INTs
and not needed in the remainder:

(define_expand "mulsi3"
  [(parallel [(set (match_operand:SI 0 "register_operand" "")
                   (mult:SI (match_operand:SI 1 "register_operand" "")
                            (match_operand:SI 2 "nonmemory_operand" "")))
              (clobber (reg:HI 26))])]
  "AVR_HAVE_MUL"
  {
     ...
  })

(define_insn_and_split "*mulsi3"
  [(set (match_operand:SI 0 "register_operand"           "=RS22")
        (mult:SI (match_operand:SI 1 "register_operand"  "%RS22")
                 (match_operand:SI 2 "nonmemory_operand"  "RS18")))
   (clobber (reg:HI 26))]
  "AVR_HAVE_MUL"
  "%~call __mulsi3"
  "&& !reload_completed"
  [(clobber (const_int 0))]
  {
    ...
    FAIL;
  }
  [(set_attr "type" "xcall")
   (set_attr "cc" "clobber")])


Again, I used the simple test case from above:

long mul (long a, long b)
{
    return a*b;
}

long mul2 (long a, long b)
{
    return b*a;
}

Compiled with -Os -mmcu=atmega8 -fno-split-wide-types:

mul:
/* prologue: function */
	rcall __mulsi3	 ;  7	*mulsi3	[length = 1]
/* epilogue start */
	ret	 ;  21	return	[length = 1]

mul2:
	push r8	 ;  22	*pushqi/1	[length = 1]
	push r9	 ;  23	*pushqi/1	[length = 1]
	push r10	 ;  24	*pushqi/1	[length = 1]
	push r11	 ;  25	*pushqi/1	[length = 1]
	push r12	 ;  26	*pushqi/1	[length = 1]
	push r13	 ;  27	*pushqi/1	[length = 1]
	push r14	 ;  28	*pushqi/1	[length = 1]
	push r15	 ;  29	*pushqi/1	[length = 1]
	push r28	 ;  30	*pushqi/1	[length = 1]
	push r29	 ;  31	*pushqi/1	[length = 1]
	rcall .	 ;  35	*addhi3_sp_R_pc2	[length = 2]
	rcall .
	in r28,__SP_L__	 ;  36	*movhi_sp/2	[length = 2]
	in r29,__SP_H__
/* prologue: function */
/* frame size = 4 */
/* stack size = 14 */
.L__stack_usage = 14
	movw r12,r22	 ;  2	*movsi/1	[length = 2]
	movw r14,r24
	movw r24,r20	 ;  19	*movsi/1	[length = 2]
	movw r22,r18
	movw r20,r14	 ;  21	*movsi/1	[length = 2]
	movw r18,r12
	rcall __mulsi3	 ;  7	*mulsi3	[length = 1]
/* epilogue start */
	pop __tmp_reg__	 ;  41	*addhi3_sp_R_pc2	[length = 4]
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop r29	 ;  42	popqi	[length = 1]
	pop r28	 ;  43	popqi	[length = 1]
	pop r15	 ;  44	popqi	[length = 1]
	pop r14	 ;  45	popqi	[length = 1]
	pop r13	 ;  46	popqi	[length = 1]
	pop r12	 ;  47	popqi	[length = 1]
	pop r11	 ;  48	popqi	[length = 1]
	pop r10	 ;  49	popqi	[length = 1]
	pop r9	 ;  50	popqi	[length = 1]
	pop r8	 ;  51	popqi	[length = 1]
	ret	 ;  52	return_from_epilogue	[length = 1]


With -fsplit-wide-types (which is on per default) the code is even
worse and the first function inflates to unacceptable code, too.

Using constraints "=RS22,%0,RS18" instead of "=RS22,%RS22,RS18"
the code of the second function is a bit better:

mul2:
	push r28	 ;  20	*pushqi/1	[length = 1]
	push r29	 ;  21	*pushqi/1	[length = 1]
	rcall .	 ;  25	*addhi3_sp_R_pc2	[length = 2]
	rcall .
	in r28,__SP_L__	 ;  26	*movhi_sp/2	[length = 2]
	in r29,__SP_H__
/* prologue: function */
/* frame size = 4 */
/* stack size = 6 */
.L__stack_usage = 6
	std Y+1,r22	 ;  2	*movsi/4	[length = 4]
	std Y+2,r23
	std Y+3,r24
	std Y+4,r25
	movw r24,r20	 ;  3	*movsi/1	[length = 2]
	movw r22,r18
	ldd r18,Y+1	 ;  19	*movsi/3	[length = 4]
	ldd r19,Y+2
	ldd r20,Y+3
	ldd r21,Y+4
	rcall __mulsi3	 ;  7	*mulsi3	[length = 1]
/* epilogue start */
	pop __tmp_reg__	 ;  31	*addhi3_sp_R_pc2	[length = 4]
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop r29	 ;  32	popqi	[length = 1]
	pop r28	 ;  33	popqi	[length = 1]
	ret	 ;  34	return_from_epilogue	[length = 1]


Remember that without register constraints and explicit hard-reg
move expansion the result was (both with -fsplit-wide-types and
-fno-split-wide-types):

mul:
	push r12	 ;  35	*pushqi/1	[length = 1]
	push r13	 ;  36	*pushqi/1	[length = 1]
	push r14	 ;  37	*pushqi/1	[length = 1]
	push r15	 ;  38	*pushqi/1	[length = 1]
/* prologue: function */
/* frame size = 0 */
/* stack size = 4 */
.L__stack_usage = 4
	movw r12,r22	 ;  2	*movsi/1	[length = 2]
	movw r14,r24
	movw r24,r20	 ;  3	*movsi/1	[length = 2]
	movw r22,r18
	movw r20,r14	 ;  26	*movsi/1	[length = 2]
	movw r18,r12
	rcall __mulsi3	 ;  28	*mulsi3_call	[length = 1]
/* epilogue start */
	pop r15	 ;  41	popqi	[length = 1]
	pop r14	 ;  42	popqi	[length = 1]
	pop r13	 ;  43	popqi	[length = 1]
	pop r12	 ;  44	popqi	[length = 1]
	ret	 ;  45	return_from_epilogue	[length = 1]

So it appears that IRA is not as smart as we thought and not
prepared for this...

Or did I do something fundamentally wrong?

Attached the changes that I made.  Note that md has just some changes
for the simple example as I had a look at what the compiler produces
and I was not amused.

Johann

[-- Attachment #2: reg-constraints.diff --]
[-- Type: text/x-patch, Size: 18254 bytes --]

Index: config/avr/avr.md
===================================================================
--- config/avr/avr.md	(revision 176865)
+++ config/avr/avr.md	(working copy)
@@ -726,7 +726,7 @@ (define_insn "*addhi3_zero_extend1"
 (define_insn "*addhi3_sp_R_pc2"
   [(set (match_operand:HI 1 "stack_register_operand" "=q")
         (plus:HI (match_operand:HI 2 "stack_register_operand" "q")
-                 (match_operand:HI 0 "avr_sp_immediate_operand" "R")))]
+                 (match_operand:HI 0 "avr_sp_immediate_operand" "C65")))]
   "AVR_2_BYTE_PC"
   "*{
       if (CONST_INT_P (operands[0]))
@@ -795,7 +795,7 @@ (define_insn "*addhi3_sp_R_pc2"
 (define_insn "*addhi3_sp_R_pc3"
   [(set (match_operand:HI 1 "stack_register_operand" "=q")
         (plus:HI (match_operand:HI 2 "stack_register_operand" "q")
-                 (match_operand:QI 0 "avr_sp_immediate_operand" "R")))]
+                 (match_operand:QI 0 "avr_sp_immediate_operand" "C65")))]
   "AVR_3_BYTE_PC"
   "*{
       if (CONST_INT_P (operands[0]))
@@ -1372,7 +1372,7 @@ (define_expand "mulsi3"
   [(parallel [(set (match_operand:SI 0 "register_operand" "")
                    (mult:SI (match_operand:SI 1 "register_operand" "")
                             (match_operand:SI 2 "nonmemory_operand" "")))
-              (clobber (reg:DI 18))])]
+              (clobber (reg:HI 26))])]
   "AVR_HAVE_MUL"
   {
     if (u16_operand (operands[2], SImode))
@@ -1391,23 +1391,14 @@ (define_expand "mulsi3"
   })
 
 (define_insn_and_split "*mulsi3"
-  [(set (match_operand:SI 0 "pseudo_register_operand"                      "=r")
-        (mult:SI (match_operand:SI 1 "pseudo_register_operand"              "r")
-                 (match_operand:SI 2 "pseudo_register_or_const_int_operand" "rn")))
-   (clobber (reg:DI 18))]
-  "AVR_HAVE_MUL && !reload_completed"
-  { gcc_unreachable(); }
-  "&& 1"
-  [(set (reg:SI 18)
-        (match_dup 1))
-   (set (reg:SI 22) 
-        (match_dup 2))
-   (parallel [(set (reg:SI 22)
-                   (mult:SI (reg:SI 22)
-                            (reg:SI 18)))
-              (clobber (reg:HI 26))])
-   (set (match_dup 0)
-        (reg:SI 22))]
+  [(set (match_operand:SI 0 "register_operand"           "=RS22")
+        (mult:SI (match_operand:SI 1 "register_operand"  "%0")
+                 (match_operand:SI 2 "nonmemory_operand"  "RS18")))
+   (clobber (reg:HI 26))]
+  "AVR_HAVE_MUL"
+  "%~call __mulsi3"
+  "&& !reload_completed"
+  [(clobber (const_int 0))]
   {
     if (u16_operand (operands[2], SImode))
       {
@@ -1422,37 +1413,43 @@ (define_insn_and_split "*mulsi3"
         emit_insn (gen_mulohisi3 (operands[0], operands[2], operands[1]));
         DONE;
       }
-  })
 
-;; "muluqisi3"
-;; "muluhisi3"
-(define_insn_and_split "mulu<mode>si3"
-  [(set (match_operand:SI 0 "pseudo_register_operand"                           "=r")
-        (mult:SI (zero_extend:SI (match_operand:QIHI 1 "pseudo_register_operand" "r"))
-                 (match_operand:SI 2 "pseudo_register_or_const_int_operand"      "rn")))
-   (clobber (reg:DI 18))]
-  "AVR_HAVE_MUL && !reload_completed"
+    FAIL;
+  }
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
+(define_insn "muluhisi3"
+  [(set (match_operand:SI 0 "register_operand"                           "=RS22")
+        (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand"   "x"))
+                 (match_operand:SI 2 "register_operand"                   "RS18")))]
+  "AVR_HAVE_MUL"
+  "%~call __muluhisi3"
+  [(set_attr "type" "xcall")
+   (set_attr "cc" "clobber")])
+
+;; "*muluqisi3.split"
+;; "*muluhisi3.split"
+(define_insn_and_split "*mulu<mode>si3.split"
+  [(set (match_operand:SI 0 "register_operand"                           "=r")
+        (mult:SI (zero_extend:SI (match_operand:QIHI 1 "register_operand" "r"))
+                 (match_operand:SI 2 "nonmemory_operand"                  "rn")))
+   (clobber (reg:HI 26))]
+  "AVR_HAVE_MUL"
   { gcc_unreachable(); }
-  "&& 1"
-  [(set (reg:HI 26)
-        (match_dup 1))
-   (set (reg:SI 18)
-        (match_dup 2))
-   (set (reg:SI 22)
-        (mult:SI (zero_extend:SI (reg:HI 26))
-                 (reg:SI 18)))
-   (set (match_dup 0)
-        (reg:SI 22))]
+  "&& !reload_completed"
+  [(set (match_dup 0)
+        (mult:SI (zero_extend:SI (match_dup 1))
+                 (match_dup 2)))]
   {
     /* Do the QI -> HI extension explicitely before the multiplication.  */
     /* Do the HI -> SI extension implicitely and after the multiplication.  */
        
     if (QImode == <MODE>mode)
-      operands[1] = gen_rtx_ZERO_EXTEND (HImode, operands[1]);
+      operands[1] = force_reg (HImode, gen_rtx_ZERO_EXTEND (HImode, operands[1]));
 
     if (u16_operand (operands[2], SImode))
       {
-        operands[1] = force_reg (HImode, operands[1]);
         operands[2] = force_reg (HImode, gen_int_mode (INTVAL (operands[2]), HImode));
         emit_insn (gen_umulhisi3 (operands[0], operands[1], operands[2]));
         DONE;
@@ -1602,16 +1599,6 @@ (define_insn_and_split
       }
   })
 
-(define_insn "*mulsi3_call"
-  [(set (reg:SI 22)
-        (mult:SI (reg:SI 22)
-                 (reg:SI 18)))
-   (clobber (reg:HI 26))]
-  "AVR_HAVE_MUL"
-  "%~call __mulsi3"
-  [(set_attr "type" "xcall")
-   (set_attr "cc" "clobber")])
-
 (define_insn "*mulhisi3_call"
   [(set (reg:SI 22)
         (mult:SI (sign_extend:SI (reg:HI 18))
@@ -1639,15 +1626,6 @@ (define_insn "*usmulhisi3_call"
   [(set_attr "type" "xcall")
    (set_attr "cc" "clobber")])
 
-(define_insn "*muluhisi3_call"
-  [(set (reg:SI 22)
-        (mult:SI (zero_extend:SI (reg:HI 26))
-                 (reg:SI 18)))]
-  "AVR_HAVE_MUL"
-  "%~call __muluhisi3"
-  [(set_attr "type" "xcall")
-   (set_attr "cc" "clobber")])
-
 (define_insn "*mulshisi3_call"
   [(set (reg:SI 22)
         (mult:SI (sign_extend:SI (reg:HI 26))
Index: config/avr/constraints.md
===================================================================
--- config/avr/constraints.md	(revision 176864)
+++ config/avr/constraints.md	(working copy)
@@ -17,14 +17,74 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; <http://www.gnu.org/licenses/>.
 
-;; Register constraints
+;; 1-Register Register Constraints
 
 (define_register_constraint "t" "R0_REG"
   "Temporary register r0")
 
+(define_register_constraint "RQ18" "R18_REG"
+  "Register r18")
+
+(define_register_constraint "RQ19" "R19_REG"
+  "Register r19")
+
+(define_register_constraint "RQ20" "R20_REG"
+  "Register r20")
+
+(define_register_constraint "RQ21" "R21_REG"
+  "Register r21")
+
+(define_register_constraint "RQ22" "R22_REG"
+  "Register r22")
+
+(define_register_constraint "RQ23" "R23_REG"
+  "Register r23")
+
+(define_register_constraint "RQ24" "R24_REG"
+  "Register r24")
+
+(define_register_constraint "RQ25" "R25_REG"
+  "Register r25")
+
+;; 2-Register Register Constraints
+
+(define_register_constraint "RH18" "R18_R19_REGS"
+  "Register pair r18--r19")
+
+(define_register_constraint "RH20" "R20_R21_REGS"
+  "Register pair r20--r21")
+
+(define_register_constraint "RH22" "R22_R23_REGS"
+  "Register pair r22--r23")
+
+(define_register_constraint "RH24" "R24_R25_REGS"
+  "Register pair r24--r25")
+
+(define_register_constraint "x" "POINTER_X_REGS"
+  "Register pair X (r27:r26).")
+
+(define_register_constraint "y" "POINTER_Y_REGS"
+  "Register pair Y (r29:r28).")
+
+(define_register_constraint "z" "POINTER_Z_REGS"
+  "Register pair Z (r31:r30).")
+
+(define_register_constraint "q" "STACK_REG"
+  "Stack pointer register (SPH:SPL).")
+
+;; 4-Register Register Constraints
+
+(define_register_constraint "RS18" "R18_R21_REGS"
+  "Register r18--r21")
+
+(define_register_constraint "RS22" "R22_R25_REGS"
+  "Register r22--r25")
+
 (define_register_constraint "b" "BASE_POINTER_REGS"
   "Base pointer registers (r28--r31)")
 
+;; Even more Register Constraints
+
 (define_register_constraint "e" "POINTER_REGS"
   "Pointer registers (r26--r31)")
 
@@ -41,18 +101,6 @@ (define_register_constraint "l" "NO_LD_R
 (define_register_constraint "a" "SIMPLE_LD_REGS"
   "Registers from r16 to r23.")
 
-(define_register_constraint "x" "POINTER_X_REGS"
-  "Register pair X (r27:r26).")
-
-(define_register_constraint "y" "POINTER_Y_REGS"
-  "Register pair Y (r29:r28).")
-
-(define_register_constraint "z" "POINTER_Z_REGS"
-  "Register pair Z (r31:r30).")
-
-(define_register_constraint "q" "STACK_REG"
-  "Stack pointer register (SPH:SPL).")
-
 (define_constraint "I"
   "Integer constant in the range 0 @dots{} 63."
   (and (match_code "const_int")
@@ -98,7 +146,7 @@ (define_constraint "G"
   (and (match_code "const_double")
        (match_test "op == CONST0_RTX (SFmode)")))
 
-(define_constraint "R"
+(define_constraint "C65"
   "Integer constant in the range -6 @dots{} 5."
   (and (match_code "const_int")
        (match_test "ival >= -6 && ival <= 5")))
Index: config/avr/avr.c
===================================================================
--- config/avr/avr.c	(revision 176864)
+++ config/avr/avr.c	(working copy)
@@ -277,22 +277,6 @@ avr_option_override (void)
   init_machine_status = avr_init_machine_status;
 }
 
-/*  return register class from register number.  */
-
-static const enum reg_class reg_class_tab[]={
-  GENERAL_REGS,GENERAL_REGS,GENERAL_REGS,GENERAL_REGS,GENERAL_REGS,
-  GENERAL_REGS,GENERAL_REGS,GENERAL_REGS,GENERAL_REGS,GENERAL_REGS,
-  GENERAL_REGS,GENERAL_REGS,GENERAL_REGS,GENERAL_REGS,GENERAL_REGS,
-  GENERAL_REGS, /* r0 - r15 */
-  LD_REGS,LD_REGS,LD_REGS,LD_REGS,LD_REGS,LD_REGS,LD_REGS,
-  LD_REGS,                      /* r16 - 23 */
-  ADDW_REGS,ADDW_REGS,          /* r24,r25 */
-  POINTER_X_REGS,POINTER_X_REGS, /* r26,27 */
-  POINTER_Y_REGS,POINTER_Y_REGS, /* r28,r29 */
-  POINTER_Z_REGS,POINTER_Z_REGS, /* r30,r31 */
-  STACK_REG,STACK_REG           /* SPL,SPH */
-};
-
 /* Function to set up the backend function structure.  */
 
 static struct machine_function *
@@ -306,7 +290,30 @@ avr_init_machine_status (void)
 enum reg_class
 avr_regno_reg_class (int r)
 {
-  if (r <= 33)
+  static const enum reg_class
+    reg_class_tab[] =
+    {
+      /* r0 - r15 */
+      R0_REG,     NO_LD_REGS, NO_LD_REGS, NO_LD_REGS,
+      NO_LD_REGS, NO_LD_REGS, NO_LD_REGS, NO_LD_REGS,
+      NO_LD_REGS, NO_LD_REGS, NO_LD_REGS, NO_LD_REGS,
+      NO_LD_REGS, NO_LD_REGS, NO_LD_REGS, NO_LD_REGS,
+      /* r16 - r17 */
+      SIMPLE_LD_REGS, SIMPLE_LD_REGS,
+      /* r18 - r25 */
+      R18_REG, R19_REG, R20_REG, R21_REG,
+      R22_REG, R23_REG, R24_REG, R25_REG,
+      /* r26 - r27 */
+      POINTER_X_REGS, POINTER_X_REGS,
+      /* r28 - r29 */
+      POINTER_Y_REGS, POINTER_Y_REGS,
+      /* r30 - r31 */
+      POINTER_Z_REGS, POINTER_Z_REGS,
+      /* SPL, SPH */
+      STACK_REG, STACK_REG
+    };
+
+  if (r <= 33) 
     return reg_class_tab[r];
   return ALL_REGS;
 }
Index: config/avr/avr.h
===================================================================
--- config/avr/avr.h	(revision 176864)
+++ config/avr/avr.h	(working copy)
@@ -235,59 +235,102 @@ extern GTY(()) section *progmem_section;
 
 enum reg_class {
   NO_REGS,
-  R0_REG,			/* r0 */
-  POINTER_X_REGS,		/* r26 - r27 */
-  POINTER_Y_REGS,		/* r28 - r29 */
-  POINTER_Z_REGS,		/* r30 - r31 */
-  STACK_REG,			/* STACK */
-  BASE_POINTER_REGS,		/* r28 - r31 */
-  POINTER_REGS,			/* r26 - r31 */
-  ADDW_REGS,			/* r24 - r31 */
-  SIMPLE_LD_REGS,		/* r16 - r23 */
-  LD_REGS,			/* r16 - r31 */
-  NO_LD_REGS,			/* r0 - r15 */
-  GENERAL_REGS,			/* r0 - r31 */
-  ALL_REGS, LIM_REG_CLASSES
+  R0_REG,               /* r0 */
+  R18_REG,              /* r18 */
+  R19_REG,              /* r19 */
+  R20_REG,              /* r20 */
+  R21_REG,              /* r21 */
+  R22_REG,              /* r22 */
+  R23_REG,              /* r23 */
+  R24_REG,              /* r24 */
+  R25_REG,              /* r25 */
+  R18_R19_REGS,         /* r18-r19 */
+  R20_R21_REGS,         /* r20-r21 */
+  R22_R23_REGS,         /* r22-r23 */
+  R24_R25_REGS,         /* r24-r25 */
+  R18_R21_REGS,         /* r18-r21 */
+  R22_R25_REGS,         /* r22-r25 */
+  POINTER_X_REGS,       /* r26 - r27 */
+  POINTER_Y_REGS,       /* r28 - r29 */
+  POINTER_Z_REGS,       /* r30 - r31 */
+  STACK_REG,            /* STACK */
+  BASE_POINTER_REGS,    /* r28 - r31 */
+  POINTER_REGS,         /* r26 - r31 */
+  ADDW_REGS,            /* r24 - r31 */
+  SIMPLE_LD_REGS,       /* r16 - r23 */
+  LD_REGS,              /* r16 - r31 */
+  NO_LD_REGS,           /* r0 - r15 */
+  GENERAL_REGS,         /* r0 - r31 */
+  ALL_REGS,
+  LIM_REG_CLASSES
 };
 
 
 #define N_REG_CLASSES (int)LIM_REG_CLASSES
 
-#define REG_CLASS_NAMES {					\
-		 "NO_REGS",					\
-		   "R0_REG",	/* r0 */                        \
-		   "POINTER_X_REGS", /* r26 - r27 */		\
-		   "POINTER_Y_REGS", /* r28 - r29 */		\
-		   "POINTER_Z_REGS", /* r30 - r31 */		\
-		   "STACK_REG",	/* STACK */			\
-		   "BASE_POINTER_REGS",	/* r28 - r31 */		\
-		   "POINTER_REGS", /* r26 - r31 */		\
-		   "ADDW_REGS",	/* r24 - r31 */			\
-                   "SIMPLE_LD_REGS", /* r16 - r23 */            \
-		   "LD_REGS",	/* r16 - r31 */			\
-                   "NO_LD_REGS", /* r0 - r15 */                 \
-		   "GENERAL_REGS", /* r0 - r31 */		\
-		   "ALL_REGS" }
-
-#define REG_CLASS_CONTENTS {						\
-  {0x00000000,0x00000000},	/* NO_REGS */				\
-  {0x00000001,0x00000000},	/* R0_REG */                            \
-  {3 << REG_X,0x00000000},      /* POINTER_X_REGS, r26 - r27 */		\
-  {3 << REG_Y,0x00000000},      /* POINTER_Y_REGS, r28 - r29 */		\
-  {3 << REG_Z,0x00000000},      /* POINTER_Z_REGS, r30 - r31 */		\
-  {0x00000000,0x00000003},	/* STACK_REG, STACK */			\
-  {(3 << REG_Y) | (3 << REG_Z),						\
-     0x00000000},		/* BASE_POINTER_REGS, r28 - r31 */	\
-  {(3 << REG_X) | (3 << REG_Y) | (3 << REG_Z),				\
-     0x00000000},		/* POINTER_REGS, r26 - r31 */		\
-  {(3 << REG_X) | (3 << REG_Y) | (3 << REG_Z) | (3 << REG_W),		\
-     0x00000000},		/* ADDW_REGS, r24 - r31 */		\
-  {0x00ff0000,0x00000000},	/* SIMPLE_LD_REGS r16 - r23 */          \
-  {(3 << REG_X)|(3 << REG_Y)|(3 << REG_Z)|(3 << REG_W)|(0xff << 16),	\
-     0x00000000},	/* LD_REGS, r16 - r31 */			\
-  {0x0000ffff,0x00000000},	/* NO_LD_REGS  r0 - r15 */              \
-  {0xffffffff,0x00000000},	/* GENERAL_REGS, r0 - r31 */		\
-  {0xffffffff,0x00000003}	/* ALL_REGS */				\
+#define REG_CLASS_NAMES {                       \
+  "NO_REGS",                                    \
+  "R0_REG",           /* r0 */                  \
+  "R18_REG",          /* r18 */                 \
+  "R19_REG",          /* r19 */                 \
+  "R20_REG",          /* r20 */                 \
+  "R21_REG",          /* r21 */                 \
+  "R22_REG",          /* r22 */                 \
+  "R23_REG",          /* r23 */                 \
+  "R24_REG",          /* r24 */                 \
+  "R25_REG",          /* r25 */                 \
+  "R18_R19_REGS",     /* r18 - r19 */           \
+  "R20_R21_REGS",     /* r20 - r21 */           \
+  "R22_R23_REGS",     /* r22 - r23 */           \
+  "R24_R25_REGS",     /* r24 - r25 */           \
+  "R18_R21_REGS",     /* r18 - r21 */           \
+  "R22_R25_REGS",     /* r22 - r25 */           \
+  "POINTER_X_REGS",   /* r26 - r27 */           \
+  "POINTER_Y_REGS",   /* r28 - r29 */           \
+  "POINTER_Z_REGS",   /* r30 - r31 */           \
+  "STACK_REG",        /* STACK */               \
+  "BASE_POINTER_REGS",/* r28 - r31 */           \
+  "POINTER_REGS",     /* r26 - r31 */           \
+  "ADDW_REGS",        /* r24 - r31 */           \
+  "SIMPLE_LD_REGS",   /* r16 - r23 */           \
+  "LD_REGS",          /* r16 - r31 */           \
+  "NO_LD_REGS",       /* r0  - r15 */           \
+  "GENERAL_REGS",     /* r0  - r31 */           \
+  "ALL_REGS"                                    \
+}
+
+#define REG_CLASS_CONTENTS {                                            \
+  { 0, 0 },             /* NO_REGS */                                   \
+  {  1 << 0,  0 },      /* R0_REG */                                    \
+  {  1 << 18, 0 },      /* r18 */                                       \
+  {  1 << 19, 0 },      /* r19 */                                       \
+  {  1 << 20, 0 },      /* r20 */                                       \
+  {  1 << 21, 0 },      /* r21 */                                       \
+  {  1 << 22, 0 },      /* r22 */                                       \
+  {  1 << 23, 0 },      /* r23 */                                       \
+  {  1 << 24, 0 },      /* r24 */                                       \
+  {  1 << 25, 0 },      /* r25 */                                       \
+  {  3 << 18, 0 },      /* r18 - r19 */                                 \
+  {  3 << 20, 0 },      /* r20 - r21 */                                 \
+  {  3 << 22, 0 },      /* r22 - r23 */                                 \
+  {  3 << 24, 0 },      /* r24 - r25 */                                 \
+  { 15 << 18, 0 },      /* r28 - r21 */                                 \
+  { 15 << 22, 0 },      /* r22 - r25 */                                 \
+  {  3 << REG_X, 0 },   /* POINTER_X_REGS, r26 - r27 */                 \
+  {  3 << REG_Y, 0 },   /* POINTER_Y_REGS, r28 - r29 */                 \
+  {  3 << REG_Z, 0 },   /* POINTER_Z_REGS, r30 - r31 */                 \
+  { 0, 0x3 },           /* STACK_REG, Stack Pointer  */                 \
+  /* BASE_POINTER_REGS, r28-r31 */                                      \
+  { (3 << REG_Y) | (3 << REG_Z), 0 },                                   \
+  /* POINTER_REGS, r26 - r31 */                                         \
+  { (3 << REG_X) | (3 << REG_Y) | (3 << REG_Z), 0 },                    \
+  /* ADDW_REGS, r24 - r31 */                                            \
+  { (3 << REG_X) | (3 << REG_Y) | (3 << REG_Z) | (3 << REG_W), 0 },     \
+  { 0x00ff0000, 0 },    /* SIMPLE_LD_REGS, r16 - r23 */                 \
+  { 0xffff0000, 0 },    /* LD_REGS,        r16 - r31 */                 \
+  { 0x0000ffff, 0 },    /* NO_LD_REGS,     r0  - r15  */                \
+  { 0xffffffff, 0 },    /* GENERAL_REGS,   r0  - r31  */                \
+  { 0xffffffff, 0x3 }   /* ALL_REGS */                                  \
 }
 
 #define REGNO_REG_CLASS(R) avr_regno_reg_class(R)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
  2011-07-29 12:14           ` Georg-Johann Lay
  2011-07-28 16:03             ` Richard Henderson
@ 2011-07-31 18:06             ` Denis Chertykov
  1 sibling, 0 replies; 17+ messages in thread
From: Denis Chertykov @ 2011-07-31 18:06 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: Richard Henderson, Weddington, Eric, gcc-patches,
	Anatoly Sokolov, Vladimir Makarov

2011/7/28 Georg-Johann Lay <avr@gjlay.de>:
> Richard Henderson wrote:
>> On 07/27/2011 06:21 AM, Georg-Johann Lay wrote:
>>> +(define_insn_and_split "*mulsi3"
>>> +  [(set (match_operand:SI 0 "pseudo_register_operand"                      "=r")
>>> +        (mult:SI (match_operand:SI 1 "pseudo_register_operand"              "r")
>>> +                 (match_operand:SI 2 "pseudo_register_or_const_int_operand" "rn")))
>>> +   (clobber (reg:DI 18))]
>>> +  "AVR_HAVE_MUL && !reload_completed"
>>> +  { gcc_unreachable(); }
>>> +  "&& 1"
>>> +  [(set (reg:SI 18)
>>> +        (match_dup 1))
>>
>> That seems like it's guaranteed to force an unnecessary move.
>> Have you tried defining special-purpose register classes to
>> force reload to move the data into the right hard regs?
>>
>> E.g.  "Y" prefix
>>       "QHS" size
>>       two digit starting register number, as needed.
>>
>> You'll probably end up with quite a few register classes
>> out of this, but hopefully reload can do a better job than
>> you can manually...
>>
>>
>> r~
>
> Waaaaaahh, I introduced register classes and constraints to tell register allocator
> what's the intention if the insns, the ... parts just dealing with CONST_INTs
> and not needed in the remainder:
>
> (define_expand "mulsi3"
>  [(parallel [(set (match_operand:SI 0 "register_operand" "")
>                   (mult:SI (match_operand:SI 1 "register_operand" "")
>                            (match_operand:SI 2 "nonmemory_operand" "")))
>              (clobber (reg:HI 26))])]
>  "AVR_HAVE_MUL"
>  {
>     ...
>  })
>
> (define_insn_and_split "*mulsi3"
>  [(set (match_operand:SI 0 "register_operand"           "=RS22")
>        (mult:SI (match_operand:SI 1 "register_operand"  "%RS22")
>                 (match_operand:SI 2 "nonmemory_operand"  "RS18")))
>   (clobber (reg:HI 26))]
>  "AVR_HAVE_MUL"
>  "%~call __mulsi3"
>  "&& !reload_completed"
>  [(clobber (const_int 0))]
>  {
>    ...
>    FAIL;
>  }
>  [(set_attr "type" "xcall")
>   (set_attr "cc" "clobber")])
>
>
> Again, I used the simple test case from above:
>
> long mul (long a, long b)
> {
>    return a*b;
> }
>
> long mul2 (long a, long b)
> {
>    return b*a;
> }
>
> Compiled with -Os -mmcu=atmega8 -fno-split-wide-types:
>
> mul:
> /* prologue: function */
>        rcall __mulsi3   ;  7   *mulsi3 [length = 1]
> /* epilogue start */
>        ret      ;  21  return  [length = 1]
>
> mul2:
>        push r8  ;  22  *pushqi/1       [length = 1]
>        push r9  ;  23  *pushqi/1       [length = 1]
>        push r10         ;  24  *pushqi/1       [length = 1]
>        push r11         ;  25  *pushqi/1       [length = 1]
>        push r12         ;  26  *pushqi/1       [length = 1]
>        push r13         ;  27  *pushqi/1       [length = 1]
>        push r14         ;  28  *pushqi/1       [length = 1]
>        push r15         ;  29  *pushqi/1       [length = 1]
>        push r28         ;  30  *pushqi/1       [length = 1]
>        push r29         ;  31  *pushqi/1       [length = 1]
>        rcall .  ;  35  *addhi3_sp_R_pc2        [length = 2]
>        rcall .
>        in r28,__SP_L__  ;  36  *movhi_sp/2     [length = 2]
>        in r29,__SP_H__
> /* prologue: function */
> /* frame size = 4 */
> /* stack size = 14 */
> .L__stack_usage = 14
>        movw r12,r22     ;  2   *movsi/1        [length = 2]
>        movw r14,r24
>        movw r24,r20     ;  19  *movsi/1        [length = 2]
>        movw r22,r18
>        movw r20,r14     ;  21  *movsi/1        [length = 2]
>        movw r18,r12
>        rcall __mulsi3   ;  7   *mulsi3 [length = 1]
> /* epilogue start */
>        pop __tmp_reg__  ;  41  *addhi3_sp_R_pc2        [length = 4]
>        pop __tmp_reg__
>        pop __tmp_reg__
>        pop __tmp_reg__
>        pop r29  ;  42  popqi   [length = 1]
>        pop r28  ;  43  popqi   [length = 1]
>        pop r15  ;  44  popqi   [length = 1]
>        pop r14  ;  45  popqi   [length = 1]
>        pop r13  ;  46  popqi   [length = 1]
>        pop r12  ;  47  popqi   [length = 1]
>        pop r11  ;  48  popqi   [length = 1]
>        pop r10  ;  49  popqi   [length = 1]
>        pop r9   ;  50  popqi   [length = 1]
>        pop r8   ;  51  popqi   [length = 1]
>        ret      ;  52  return_from_epilogue    [length = 1]
>
>
> With -fsplit-wide-types (which is on per default) the code is even
> worse and the first function inflates to unacceptable code, too.
>
> Using constraints "=RS22,%0,RS18" instead of "=RS22,%RS22,RS18"
> the code of the second function is a bit better:
>
> mul2:
>        push r28         ;  20  *pushqi/1       [length = 1]
>        push r29         ;  21  *pushqi/1       [length = 1]
>        rcall .  ;  25  *addhi3_sp_R_pc2        [length = 2]
>        rcall .
>        in r28,__SP_L__  ;  26  *movhi_sp/2     [length = 2]
>        in r29,__SP_H__
> /* prologue: function */
> /* frame size = 4 */
> /* stack size = 6 */
> .L__stack_usage = 6
>        std Y+1,r22      ;  2   *movsi/4        [length = 4]
>        std Y+2,r23
>        std Y+3,r24
>        std Y+4,r25
>        movw r24,r20     ;  3   *movsi/1        [length = 2]
>        movw r22,r18
>        ldd r18,Y+1      ;  19  *movsi/3        [length = 4]
>        ldd r19,Y+2
>        ldd r20,Y+3
>        ldd r21,Y+4
>        rcall __mulsi3   ;  7   *mulsi3 [length = 1]
> /* epilogue start */
>        pop __tmp_reg__  ;  31  *addhi3_sp_R_pc2        [length = 4]
>        pop __tmp_reg__
>        pop __tmp_reg__
>        pop __tmp_reg__
>        pop r29  ;  32  popqi   [length = 1]
>        pop r28  ;  33  popqi   [length = 1]
>        ret      ;  34  return_from_epilogue    [length = 1]
>
>
> Remember that without register constraints and explicit hard-reg
> move expansion the result was (both with -fsplit-wide-types and
> -fno-split-wide-types):
>
> mul:
>        push r12         ;  35  *pushqi/1       [length = 1]
>        push r13         ;  36  *pushqi/1       [length = 1]
>        push r14         ;  37  *pushqi/1       [length = 1]
>        push r15         ;  38  *pushqi/1       [length = 1]
> /* prologue: function */
> /* frame size = 0 */
> /* stack size = 4 */
> .L__stack_usage = 4
>        movw r12,r22     ;  2   *movsi/1        [length = 2]
>        movw r14,r24
>        movw r24,r20     ;  3   *movsi/1        [length = 2]
>        movw r22,r18
>        movw r20,r14     ;  26  *movsi/1        [length = 2]
>        movw r18,r12
>        rcall __mulsi3   ;  28  *mulsi3_call    [length = 1]
> /* epilogue start */
>        pop r15  ;  41  popqi   [length = 1]
>        pop r14  ;  42  popqi   [length = 1]
>        pop r13  ;  43  popqi   [length = 1]
>        pop r12  ;  44  popqi   [length = 1]
>        ret      ;  45  return_from_epilogue    [length = 1]
>
> So it appears that IRA is not as smart as we thought and not
> prepared for this...

I have tried this method before I have realising the current.
I drop the "right constraints" idea because of "unable to fiand a
register to spill".
Seems that something changed (register allocator changed) but force
move again seems as a best solution.

Denis.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-07-31 14:42 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-25 10:08 [Patch,AVR]: PR49687 (better widening 32-bit mul) Georg-Johann Lay
2011-07-25 14:47 ` Weddington, Eric
2011-07-25 17:06   ` Georg-Johann Lay
2011-07-25 21:10     ` Weddington, Eric
2011-07-25 23:00       ` Georg-Johann Lay
2011-07-26 13:00         ` Georg-Johann Lay
2011-07-27 14:08       ` Georg-Johann Lay
2011-07-27 15:58         ` Richard Henderson
2011-07-27 16:12           ` Georg-Johann Lay
2011-07-27 17:06             ` Richard Henderson
2011-07-27 21:40               ` Georg-Johann Lay
2011-07-28  1:57                 ` Weddington, Eric
2011-07-28  9:40                   ` Georg-Johann Lay
2011-07-29 12:14           ` Georg-Johann Lay
2011-07-28 16:03             ` Richard Henderson
2011-07-31 18:06             ` Denis Chertykov
2011-07-29 11:58         ` [Committed,AVR]: Addendum to fix thinko in PR49687 Georg-Johann Lay

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).