[Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions

public inbox for gcc-bugs@sourceware.org

* [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions
@ 2012-07-13  9:01 olegendo at gcc dot gnu.org
  2012-07-13 10:34 ` [Bug target/53949] " olegendo at gcc dot gnu.org
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-13  9:01 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

             Bug #: 53949
           Summary: [SH] Add support for mac.w / mac.l instructions
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: olegendo@gcc.gnu.org
                CC: chrbr@gcc.gnu.org
            Target: sh*-*-*


So far, GCC does not utilize the integer multiply-add instructions.
On SH1 only the mac.w instruction is supported.
On SH2 and above the mac.w and mac.l instructions are available.

Carry over from PR 39423 comment #20

> On a related thread, for further work, I'm thinking on adding support for the
> MAC instruction, now that was have the multiply and add. But this requires
> exposing the MACLH registers to reload. Had anyone had a thought on this ? I'd
> like to give this a try pretty soon.



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
@ 2012-07-13 10:34 ` olegendo at gcc dot gnu.org
  2012-07-13 11:01 ` chrbr at gcc dot gnu.org
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-13 10:34 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #1 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-13 10:34:20 UTC ---
(In reply to comment #0)
> So far, GCC does not utilize the integer multiply-add instructions.
> On SH1 only the mac.w instruction is supported.
> On SH2 and above the mac.w and mac.l instructions are available.
> 
> Carry over from PR 39423 comment #20
> 
> > On a related thread, for further work, I'm thinking on adding support for the
> > MAC instruction, now that was have the multiply and add. But this requires
> > exposing the MACLH registers to reload. Had anyone had a thought on this ? I'd
> > like to give this a try pretty soon.

I think the biggest problem is that the mac operands have to be in memory.
For example:

long long fun (int a, int long b, long long c)
{
  return (long long)a * (long long)b + c;
}

would need to become something like ... 

mov.l  r4,@-r15
mov    r15,r1
mov.l  r5,@-r15
lds    r6,macl
lds    r7,mach
mac.l  @r15+,@r1+
sts    mach,r1
sts    macl,r0
rts
add    #4,r15

Not using the mac instruction seems a bit simpler in this case:

dmuls.l  r4,r5
sts      mach,r1
clrt
sts      macl,r0
addc   r6,r0
rts
addc   r7,r1

I think the mac instructions can be very useful when they can be used inside of
loops, but for this the whole post-inc memory stuff has to integrate properly
into the surrounding code.
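
To make the memory-operand requirement concrete, here is a minimal C sketch of
what a single mac.l does when SR.S = 0, modelling the MACH:MACL pair as one
64-bit accumulator.  The names sh_mac_state, mac_l and fun_via_mac are
hypothetical and only serve this illustration; saturation (SR.S = 1) is not
modelled.

```c
#include <stdint.h>

/* Hypothetical model: MACH:MACL viewed as one signed 64-bit accumulator. */
typedef struct { int64_t mac; } sh_mac_state;

/* Model of "mac.l @Rm+,@Rn+" with SR.S = 0: load one 32-bit operand
   through each pointer, post-increment both pointers, and add the signed
   64-bit product to the accumulator (wrapping modulo 2^64).  */
void mac_l (sh_mac_state *st, const int32_t **rm, const int32_t **rn)
{
  int32_t tempn = *(*rn)++;
  int32_t tempm = *(*rm)++;
  uint64_t product = (uint64_t)((int64_t)tempn * tempm);
  st->mac = (int64_t)((uint64_t)st->mac + product);
}

/* (long long)a * (long long)b + c, done the mac.l way: both factors must
   first be stored to memory, and c is preloaded into the accumulator.  */
int64_t fun_via_mac (int32_t a, int32_t b, int64_t c)
{
  int32_t slots[2] = { b, a };   /* stand-in for the two stack pushes */
  const int32_t *rm = &slots[0];
  const int32_t *rn = &slots[1];
  sh_mac_state st = { c };       /* the two lds instructions */
  mac_l (&st, &rm, &rn);
  return st.mac;                 /* the two sts instructions */
}
```

Under these assumptions fun_via_mac (3, 4, 5) yields 17, the same value as the
plain C expression.  The extra stores in slots[] are the overhead that makes the
single-use case unattractive.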

Chris, do you have any ideas/plans on how to handle the SR.S bit, for example
to implement the ssmaddhisi4 pattern with mac.w?



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
  2012-07-13 10:34 ` [Bug target/53949] " olegendo at gcc dot gnu.org
@ 2012-07-13 11:01 ` chrbr at gcc dot gnu.org
  2012-07-13 11:24 ` chrbr at gcc dot gnu.org
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: chrbr at gcc dot gnu.org @ 2012-07-13 11:01 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #2 from chrbr at gcc dot gnu.org 2012-07-13 11:00:55 UTC ---
I see the MAC only as a global optimization, since its benefit comes from
spanning several loop BBs, as you said. There is also the problem of clearing
the accumulator.

That should certainly be a new extension in the GIMPLE SSA loop optimizers, based
on the presence of a multiply-and-add pattern. Not sure what the best way to
do this is at this point.



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
  2012-07-13 10:34 ` [Bug target/53949] " olegendo at gcc dot gnu.org
  2012-07-13 11:01 ` chrbr at gcc dot gnu.org
@ 2012-07-13 11:24 ` chrbr at gcc dot gnu.org
  2012-07-15 12:11 ` olegendo at gcc dot gnu.org
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: chrbr at gcc dot gnu.org @ 2012-07-13 11:24 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

chrbr at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2012-07-13 11:24 ` chrbr at gcc dot gnu.org
@ 2012-07-15 12:11 ` olegendo at gcc dot gnu.org
  2012-07-17 19:13 ` olegendo at gcc dot gnu.org
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-15 12:11 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #3 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-15 12:11:20 UTC ---
Created attachment 27799
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27799
Proof of concept patch

This is a proof of concept patch just to probe around.
The idea is to allow the RA to allocate macl and mach registers in DImode, and
have mac insns that use the macl/mach regs as a pair in DImode.

With the patch applied, the following function ...

int64_t test01 (const int16_t* a, const int16_t* b)
{
  int64_t sum = 0;
  for (int i = 0; i < 16; ++i)
    sum += (int64_t)(*a++) * (int64_t)(*b++);
  return sum;
}

compiled with -m4 -O2 results in ...

__Z6test01PKsS0_:
.LFB0:
        .cfi_startproc
        mov    #16,r1          ! 88    movsi_ie/3    [length = 2]
        clrmac                  ! 39    clrmac/1    [length = 2]
        .align 2
.L3:
        dt    r1              ! 89    dect    [length = 2]
        bf/s    .L3             ! 90    branch_false    [length = 2]
        mac.w    @r4+,@r5+       ! 61    *macw    [length = 2]
        sts    macl,r0         ! 82    movsi_ie/8    [length = 2]
        rts                     ! 99    *return_i    [length = 2]
        sts    mach,r1         ! 83    movsi_ie/8    [length = 2]

... which is not that bad already.


Some notes I took while playing around with this:


- When compiling for big endian the RA mistakes mach and macl when
  storing mach:macl to a DImode reg:reg pair.
  This could probably be fixed by providing appropriate move insn patterns.


- Move insns/splits for DImode mach:macl <-> memory have to be added.
  I've seen an ICE when compiling with -O1:
  error: unrecognizable insn:
  (insn 122 14 15 2 (set (mem/c:DI (plus:SI (reg/f:SI 15 r15)
                (const_int 8 [0x8])) [0 %sfp+-8 S8 A32])
        (reg:DI 148 macl)) sh_mac.cpp:38 -1
     (nil))


- In some cases the mach:macl reg pair gets swapped to a general reg pair
  without any obvious need.  Example function:

  int64_t test04 (const int16_t* a, const int16_t* b,
                  const int16_t* c, const int16_t* d)
  {
    int64_t sum0 = 0;
    int64_t sum1 = 0;
    for (int i = 0; i < 16; ++i)
      sum0 += (int64_t)(*a++) * (int64_t)(*b++);

    for (int i = 0; i < 16; ++i)
      sum1 += (int64_t)(*c++) * (int64_t)(*d++);

    return sum0 - sum1;
  }

  The IRA pass first allocates sum0 and sum1 to mach:macl, but then reload
  seems to think that they are conflicting and moves sum0 to a general reg
  pair.  This results in ...

        mov     #0,r2
        mov     #16,r1
        mov     r2,r3
  .L16:
        lds     r2,macl
        lds     r3,mach
        dt      r1
        mac.w   @r4+,@r5+
        sts     macl,r2
        bf/s    .L16
        sts     mach,r3

        mov     #16,r1
        clrmac
        .align 2
  .L18:
        dt      r1
        bf/s    .L18
        mac.w   @r6+,@r7+


  which would be better as:
        mov     #16,r1
        clrmac
  .L16:
        dt      r1
        bf/s    .L16
        mac.w   @r4+,@r5+

        sts     macl,r2
        sts     mach,r3
        clrmac
        mov     #16,r1
  .L18:
        dt      r1
        bf/s    .L18
        mac.w   @r6+,@r7+


- Loops with multiple running sums like
  for (int i = 0; i < 16; ++i)
  {
    sum0 += (int64_t)(*a++) * (int64_t)(*b++);
    sum1 += (int64_t)(*c++) * (int64_t)(*d++);
  }

  result in macl:mach swapping to general reg pairs between subsequent
  mac.w instructions.  Ideally such loops should be split into multiple
  loops like in the previous example.


- When loop unrolling is turned on, the auto-inc address refs are
  converted to displacement addresses.  Because the auto-inc-dec pass
  currently fails to detect a lot of auto-inc-dec possibilities, the
  mac.w pattern will not match.
  The same goes for manually unrolled code like 

  sum += (int64_t)(*a++) * (int64_t)(*b++);
  sum += (int64_t)(*a++) * (int64_t)(*b++);


- Running sum variables should be turned into DImode variables if possible:
  int32_t test00 (const int16_t* a, const int16_t* b)
  {
    int32_t sum = 0;
    for (int i = 0; i < 16; ++i)
      sum += (*a++) * (*b++);
    return sum;
  }
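
  Why widening is safe can be sketched in C: a 64-bit running sum, truncated
  at the end, gives the same low 32 bits as a wrapping 32-bit running sum.
  This is only an illustration; the function names are hypothetical, and the
  32-bit version uses unsigned arithmetic so that wraparound is well defined
  in C.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit running sum with explicit modulo-2^32 wraparound. */
uint32_t dot_sum32 (const int16_t *a, const int16_t *b, size_t n)
{
  uint32_t sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
  return sum;
}

/* Same loop with a 64-bit accumulator (the shape a mac.w loop with
   SR.S = 0 is assumed to have here), truncated to 32 bits at the end.  */
uint32_t dot_sum64_truncated (const int16_t *a, const int16_t *b, size_t n)
{
  int64_t sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += (int64_t)a[i] * (int64_t)b[i];
  return (uint32_t)sum;
}
```

  Both functions return the same value for any input, because truncation
  commutes with addition modulo 2^32.  The sketch does not cover SR.S = 1,
  where the accumulation saturates instead of wrapping.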


- The existing multiplication patterns could be adapted to utilize macl:mach
  reg pair allocation, especially 32x32 -> 64 bit multiplications.


- Normal multiplications that do not need a full MAC operation but use
  memory operands can be done with a clrmac-mac sequence.


Probably there are more subtle issues.  Also, I have not tried expanding
the standard name 'maddmn4' pattern,  maybe it would make some of the
problems mentioned above automagically disappear.



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2012-07-15 12:11 ` olegendo at gcc dot gnu.org
@ 2012-07-17 19:13 ` olegendo at gcc dot gnu.org
  2012-07-17 23:05 ` kkojima at gcc dot gnu.org
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-17 19:13 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

Oleg Endo <olegendo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kkojima at gcc dot gnu.org

--- Comment #4 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-17 19:12:49 UTC ---
To be able to use the mac.w instruction for implementing the ssmaddhisi4
pattern I think the already existing mode-switching facilities can be used, as
it is done for float/double mode switching.

Even without doing SR.S mode switches, before any mac instructions can be used,
we must make sure that the SR.S bit is in a defined state at function entry and
function leave.
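
For reference, the arithmetic that ssmaddhisi4 asks for can be sketched in C as
below.  This is a simplified model of mac.w with SR.S = 1 as I understand the
manuals (16 x 16 + 32 -> 32 bits with signed saturation of MACL); it ignores
the overflow indication in MACH, and the function name is hypothetical.

```c
#include <stdint.h>

/* Signed-saturating multiply-add: macl + a * b, clamped to the int32_t
   range.  The sum is formed in 64 bits so that it cannot overflow
   before the clamp is applied.  */
int32_t mac_w_saturating (int32_t macl, int16_t a, int16_t b)
{
  int64_t wide = (int64_t)macl + (int64_t)a * (int64_t)b;
  if (wide > INT32_MAX)
    return INT32_MAX;
  if (wide < INT32_MIN)
    return INT32_MIN;
  return (int32_t)wide;
}
```

With SR.S = 0 the same inputs would wrap into MACH instead of clamping, which
is why the state of the bit has to be known before a mac insn is emitted.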

Kaz, do you think it is safe to assume that SR.S = 0 at function entry?



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2012-07-17 19:13 ` olegendo at gcc dot gnu.org
@ 2012-07-17 23:05 ` kkojima at gcc dot gnu.org
  2012-07-22 16:47 ` olegendo at gcc dot gnu.org
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: kkojima at gcc dot gnu.org @ 2012-07-17 23:05 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #5 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2012-07-17 23:04:54 UTC ---
(In reply to comment #4)
> Kaz, do you think it is safe to assume that SR.S = 0 at function entry?

I think so.  I can't imagine a practical system that sets
SR.S to one in its start-up code, though I may be wrong about it.



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2012-07-17 23:05 ` kkojima at gcc dot gnu.org
@ 2012-07-22 16:47 ` olegendo at gcc dot gnu.org
  2012-10-11 20:43 ` olegendo at gcc dot gnu.org
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-22 16:47 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #6 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-22 16:47:44 UTC ---
If I understand correctly PR 29961 is somewhat related to this.



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2012-07-22 16:47 ` olegendo at gcc dot gnu.org
@ 2012-10-11 20:43 ` olegendo at gcc dot gnu.org
  2012-11-07 21:31 ` olegendo at gcc dot gnu.org
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-10-11 20:43 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #7 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-10-11 20:43:02 UTC ---
A note regarding the SR.S bit.  The sets and clrs insns are available only on
SH3* and SH4*.  SH1* and SH2* (incl SH2A) do not implement them.



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2012-10-11 20:43 ` olegendo at gcc dot gnu.org
@ 2012-11-07 21:31 ` olegendo at gcc dot gnu.org
  2013-05-04 13:39 ` olegendo at gcc dot gnu.org
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-11-07 21:31 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #8 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-11-07 21:31:39 UTC ---
Christian, I just wanted to check with you whether you've already started doing
something regarding the mac.w / mac.l instructions?



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2012-11-07 21:31 ` olegendo at gcc dot gnu.org
@ 2013-05-04 13:39 ` olegendo at gcc dot gnu.org
  2013-12-17 12:25 ` olegendo at gcc dot gnu.org
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2013-05-04 13:39 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #9 from Oleg Endo <olegendo at gcc dot gnu.org> 2013-05-04 13:39:10 UTC ---
(In reply to comment #3)
> - Loops with multiple running sums like
>   for (int i = 0; i < 16; ++i)
>   {
>     sum0 += (int64_t)(*a++) * (int64_t)(*b++);
>     sum1 += (int64_t)(*c++) * (int64_t)(*d++);
>   }
> 
>   result in macl:mach swapping to general reg pairs between subsequent
>   mac.w instructions.  Ideally such loops should be split into multiple
>   loops like in the previous example.

This is basically what -ftree-loop-distribution does.  The question would be
how to re-use it for this particular case.
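
As a source-level picture of that transformation, here is a small C sketch
with the fused loop and its distributed form side by side.  The struct and
function names are hypothetical; the point is only that each distributed loop
has a single running sum, which is the shape a lone mach:macl pair can hold.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int64_t sum0; int64_t sum1; } running_sums;

/* Original form: two running sums compete for one accumulator. */
running_sums fused_loop (const int16_t *a, const int16_t *b,
                         const int16_t *c, const int16_t *d, size_t n)
{
  running_sums s = { 0, 0 };
  for (size_t i = 0; i < n; ++i)
    {
      s.sum0 += (int64_t)a[i] * (int64_t)b[i];
      s.sum1 += (int64_t)c[i] * (int64_t)d[i];
    }
  return s;
}

/* Distributed form: one running sum per loop. */
running_sums distributed_loops (const int16_t *a, const int16_t *b,
                                const int16_t *c, const int16_t *d, size_t n)
{
  running_sums s = { 0, 0 };
  for (size_t i = 0; i < n; ++i)
    s.sum0 += (int64_t)a[i] * (int64_t)b[i];
  for (size_t i = 0; i < n; ++i)
    s.sum1 += (int64_t)c[i] * (int64_t)d[i];
  return s;
}
```

The two forms compute identical sums because the statements in the loop body
do not depend on each other.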



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2013-05-04 13:39 ` olegendo at gcc dot gnu.org
@ 2013-12-17 12:25 ` olegendo at gcc dot gnu.org
  2013-12-17 12:37 ` olegendo at gcc dot gnu.org
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2013-12-17 12:25 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #10 from Oleg Endo <olegendo at gcc dot gnu.org> ---
I was wondering whether it would make sense to convert sequences such as

                             SH4       SH4A
        mov.l   @r15,r3      LS/2      LS/2  
        mul.l   r2,r3        CO/4      EX/3
        sts     macl,r3      CO/3      LS/2
        add     r1,r3        EX/1      EX/1

into
        mov     r15,r0       MT/0      MT/1
        mov.l   r2,@-r15     LS/1      LS/1
        lds     r1,macl      CO/3      LS/1
        mac.l   @r15+,@r0+   CO/4      CO/5
        sts     macl,r3      CO/3      LS/2

Looking simply at the issue cycles (the numbers above) would suggest that it's
not worth doing it, at least not if the value has to be pulled out from the mac
register immediately after the mac operation.  Probably it's not beneficial to
emit a single mac insn if the data is not already in place so that it can be
reached easily with the post-inc addressing.

On the other hand something like ...

int test33 (int* x, int y, int z)
{
  return x[0] * 40 + z;
}

currently compiles to:
        mov.l   @r4,r2
        mov     #40,r1
        mul.l   r1,r2
        sts     macl,r0
        rts
        add     r6,r0

where this one maybe could be better:
        mova    .L40,r0
        lds     r6,macl
        mac.l   @r4+,@r0+
        rts
        sts     macl,r0

        .align 2
.L40:   .long   40



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2013-12-17 12:25 ` olegendo at gcc dot gnu.org
@ 2013-12-17 12:37 ` olegendo at gcc dot gnu.org
  2015-01-30 19:37 ` olegendo at gcc dot gnu.org
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2013-12-17 12:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #11 from Oleg Endo <olegendo at gcc dot gnu.org> ---
Another question is whether the following is OK to do on all SH
implementations:

int test33 (int x, int y, int z)
{
  return x * y + z;
}

currently compiles to:
        mul.l   r5,r4
        sts     macl,r0
        rts
        add     r6,r0

could also be done as:
        lds     r6,macl
        mov.l   r4,@-r15
        mov.l   r5,@-r15
        mac.l   @r15+,@r15+
        rts
        sts     macl,r0

This assumes that a mac insn with both address operands being the same works
exactly as it's described in the Renesas manuals:

  tempn = Read_32 (R[n]);
  R[n] += 4;
  tempm = Read_32 (R[m]);
  R[m] += 4;

However, I don't know whether this is true for all SH implementations.
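
The assumption can at least be written down as a small C model.  The sketch
below follows the manual's pseudocode literally (read, increment, read,
increment) and runs the push/push/mac.l sequence on a toy register file.  The
sh_model type and the helper names are hypothetical, SR.S = 0 is assumed, and
the model says nothing about what real silicon does.

```c
#include <stdint.h>

/* Toy machine state: r[] holds byte addresses into mem[]. */
typedef struct
{
  uint32_t r[16];
  int64_t mac;        /* MACH:MACL as one accumulator */
  int32_t mem[64];    /* 256 bytes of word-aligned memory */
} sh_model;

int32_t read_32 (const sh_model *m, uint32_t addr)
{
  return m->mem[addr / 4];
}

/* mac.l @Rm+,@Rn+ following the documented order of operations. */
void mac_l (sh_model *m, unsigned rm, unsigned rn)
{
  int32_t tempn = read_32 (m, m->r[rn]);
  m->r[rn] += 4;
  int32_t tempm = read_32 (m, m->r[rm]);
  m->r[rm] += 4;
  m->mac += (int64_t)tempn * (int64_t)tempm;
}

/* x * y + z via two pushes and "mac.l @r15+,@r15+". */
int32_t mul_add_via_stack (int32_t x, int32_t y, int32_t z)
{
  sh_model m = { 0 };
  m.r[15] = sizeof m.mem;                   /* empty descending stack */
  m.mac = z;                                /* lds r6,macl */
  m.r[15] -= 4; m.mem[m.r[15] / 4] = x;     /* mov.l r4,@-r15 */
  m.r[15] -= 4; m.mem[m.r[15] / 4] = y;     /* mov.l r5,@-r15 */
  mac_l (&m, 15, 15);                       /* pops y, then x */
  return (int32_t)(uint32_t)m.mac;          /* sts macl,r0 */
}
```

In this model the second read sees the already incremented r15, so the two
operands are distinct and r15 ends up back at its starting value.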



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2013-12-17 12:37 ` olegendo at gcc dot gnu.org
@ 2015-01-30 19:37 ` olegendo at gcc dot gnu.org
  2015-02-01  0:37 ` olegendo at gcc dot gnu.org
  2015-04-08 20:08 ` olegendo at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2015-01-30 19:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

Oleg Endo <olegendo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-01-30
     Ever confirmed|0                           |1

--- Comment #12 from Oleg Endo <olegendo at gcc dot gnu.org> ---
This is an example from Renesas public material regarding the SH-DSP.  The
example program can utilize either the regular mac.w instruction, or the DSP
ISA (found on SH2-DSP, SH3-DSP, SH4AL-DSP).

typedef short  Fixed;
typedef long Lfixed;
typedef long Laccum;
#define __X
#define __Y

void func(__X Fixed x[5],Fixed *out, __Y Fixed y[128][5] )
{
  int i;
  __Y Fixed *yp=y[0];
  __X Fixed *xp=x;
  Fixed x0,y0;
  Lfixed m0;
  Laccum a0;
  for(i=0;i<128;i++)
  {
    a0=0;
    x0=*xp++; y0=*yp++;
    m0=x0*y0; x0=*xp++; y0=*yp++;
    a0+=m0; m0=x0*y0; x0=*xp++; y0=*yp++;
    a0+=m0; m0=x0*y0; x0=*xp++; y0=*yp++;
    a0+=m0; m0=x0*y0; x0=*xp++; y0=*yp++;
    a0+=m0; m0=x0*y0; 
    a0+=m0;
    *out++=a0;
  }
}

which should compile to something like:
_func:
        mov     #-128,r1
        mov     r5,r3
        extu.b  r1,r1
.L11:
        clrmac
        mac.w   @r6+,@r4+
        dt      r1
        mac.w   @r6+,@r4+
        mac.w   @r6+,@r4+
        mac.w   @r6+,@r4+
        sts     macl,r2
        mov.w   r2,@r3
        bf/s    .L11
        add     #2,r3

        rts
        nop



* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2015-01-30 19:37 ` olegendo at gcc dot gnu.org
@ 2015-02-01  0:37 ` olegendo at gcc dot gnu.org
  2015-04-08 20:08 ` olegendo at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2015-02-01  0:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #13 from Oleg Endo <olegendo at gcc dot gnu.org> ---
A more interesting real-world example from libjpeg would be function
jpeg_idct_ifast (jidctfst.c).

If we take the code as-is, there are few mac opportunities due to sharing of
the terms.  The expressions could be un-CSE'd, which would result in longer mac
chains, but the overall result gets worse because the data is not laid out in a
mac-friendly way.

The first loop in jpeg_idct_ifast can be split into 8 independent loops for the
output value wsptr[8*n+i].
For n = 1,2,3,4,5,6 the loops look a bit complex, but for n = 0 and n = 7 we
get similar looking loops like:

for (int i = 0; i < 8; ++i)
{
  wsptr[8*7+i] = inptr[8*0 + i] * quantptr[8*0 + i]
                 - inptr[8*1 + i] * quantptr[8*1 + i]
                 + inptr[8*2 + i] * quantptr[8*2 + i]
                 - inptr[8*3 + i] * quantptr[8*3 + i]
                 + inptr[8*4 + i] * quantptr[8*4 + i]
                 - inptr[8*5 + i] * quantptr[8*5 + i]
                 + inptr[8*6 + i] * quantptr[8*6 + i]
                 - inptr[8*7 + i] * quantptr[8*7 + i];
}

Still, due to the subtractions and memory access pattern, plain mac insns can't
be used.

The subtractions can be converted into additions by negating the operands. 
Since mac wants both operands in memory, those can be placed on the stack. 
Also, in this case the address registers can be pre-computed outside the loop,
since there are enough registers.
A possible outcome would be something like this:

                                    // r4 = inptr[8*0+i]
                                    // r5 = quantptr[8*0+i]
                                    // r6 = wsptr[8*0+i]

  mov    r4,r3;    add    #32,r3    // r3 = inptr[8*1+i]
  mov    r3,r7;    add    #32,r7    // r7 = inptr[8*2+i]
  mov    r7,r8;    add    #32,r8    // r8 = inptr[8*3+i]
  mov    r8,r9;    add    #32,r9    // r9 = inptr[8*4+i]
  mov    r9,r10;   add    #32,r10   // r10 = inptr[8*5+i]
  mov    r10,r11;  add    #32,r11   // r11 = inptr[8*6+i]
  mov    r11,r12;  add    #32,r12   // r12 = inptr[8*7+i]

  mov    #8,r14
  add    #126,r6;  add    #102,r6   // r6 = wsptr + 8*7*4 + 4
  mov    r5,r0;    sub    r4,r0     // r0 = quantptr - inptr

.Loop:
  mov.l  @(r0,r12),r1   // quantptr[8*7+i]
  mov.l  @(r0,r11),r2   // quantptr[8*6+i]
  mov.l  @(r0,r10),r13  // quantptr[8*5+i]
  neg    r1,r1
  mov.l  r1,@-r15
  mov.l  r2,@-r15
  neg    r13,r13
  mov.l  @(r0,r8),r1    // quantptr[8*3+i]
  mov.l  @(r0,r9),r2    // quantptr[8*4+i]
  mov.l  r13,@-r15
  neg    r1,r1
  mov.l  r2,@-r15
  mov.l  @(r0,r7),r2    // quantptr[8*2+i]
  mov.l  @(r0,r3),r13   // quantptr[8*1+i]
  mov.l  r1,@-r15
  mov.l  r2,@-r15
  neg    r13,r13
  mov.l  r13,@-r15

  clrmac
  mac.l    @r4+,@r5+
  mac.l    @r3+,@r15+
  mac.l    @r7+,@r15+
  mac.l    @r8+,@r15+
  mac.l    @r9+,@r15+
  mac.l    @r10+,@r15+
  mac.l    @r11+,@r15+
  mac.l    @r12+,@r15+
  dt    r14
  sts.l  macl,@-r6
  bf/s    .Loop
  add    #8,r6

which is 31 insns per loop and (almost) no pipeline stalls, vs. 53 insns per
loop + stalls on mul-sts sequences when the mac insn is not used.
The above loop can be optimized even further with partial unrolling to avoid
the latency of the last mac and sts.
Of course it'd be even better, if the application's data was in a mac friendly
layout.
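
The negation trick can be checked in plain C.  The sketch below compares the
alternating-sign expression for one column i with a pure multiply-accumulate
chain over pre-negated quantptr values, as staged on the stack in the assembly
above.  The function names are hypothetical, the accumulator is kept at 64
bits for simplicity (the assembly stores only macl), and the quantptr values
are assumed never to be INT32_MIN so that negation cannot overflow.

```c
#include <stdint.h>

/* Reference: + - + - ... sum over the eight rows of column i. */
int64_t column_alternating (const int32_t *inptr, const int32_t *quantptr,
                            int i)
{
  int64_t acc = 0;
  for (int n = 0; n < 8; ++n)
    {
      int64_t prod = (int64_t)inptr[8 * n + i] * (int64_t)quantptr[8 * n + i];
      acc += (n & 1) ? -prod : prod;
    }
  return acc;
}

/* Same value using additions only: negate every odd-row quantptr operand
   first (the neg insns), then run a plain mac chain.  */
int64_t column_mac_chain (const int32_t *inptr, const int32_t *quantptr,
                          int i)
{
  int32_t staged[8];
  for (int n = 0; n < 8; ++n)
    {
      int32_t q = quantptr[8 * n + i];
      staged[n] = (n & 1) ? -q : q;   /* assumes q != INT32_MIN */
    }

  int64_t acc = 0;                    /* clrmac */
  for (int n = 0; n < 8; ++n)         /* eight mac.l steps */
    acc += (int64_t)inptr[8 * n + i] * (int64_t)staged[n];
  return acc;
}
```

Both functions agree for every column, which is what allows the subtractions
to be folded into the operand staging.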


* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
  2012-07-13  9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2015-02-01  0:37 ` olegendo at gcc dot gnu.org
@ 2015-04-08 20:08 ` olegendo at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2015-04-08 20:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949

--- Comment #14 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Oleg Endo from comment #3)
> 
> - When compiling for big endian the RA mistakes mach and macl when
>   storing mach:macl to a DImode reg:reg pair.
>   This could probably fixed by providing appropriate move insns patterns.

I believe this is PR 29961.

