m68k optimisation for beginners?

public inbox for gcc@gcc.gnu.org

* m68k optimisation for beginners?
@ 2014-02-12  9:37 Fredrik Olsson
  2014-02-12 14:48 ` Jeff Law
  0 siblings, 1 reply; 2+ messages in thread
From: Fredrik Olsson @ 2014-02-12  9:37 UTC
  To: gcc

Hi.

I would like to get started with how to improve code generation for a
backend. Any pointers, especially to good documentation, are welcome.

For this example consider this C function for a reference counted type:
void TCRelease(TCTypeRef tc) {
  if (--tc->retainCount == 0) {
    if (tc->destroy) {
      tc->destroy(tc);
    }
    free((void *)tc);
  }
}
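
The message does not include the type definition, so the function cannot be compiled as posted. The following is a minimal sketch for experimenting, assuming a hypothetical layout in which a 16-bit retainCount sits at offset 0 and the destroy pointer at offset 4 on a 32-bit m68k target. The struct, the reserved padding field, TCCreate and countDestroy are all illustrative names, not the original definitions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: the original message does not show the real type. */
typedef struct TCType TCType;
typedef TCType *TCTypeRef;

struct TCType {
    int16_t retainCount;         /* 16-bit count, hence word-sized accesses */
    int16_t reserved;            /* assumed padding before the pointer */
    void (*destroy)(TCTypeRef);  /* optional destructor, may be NULL */
};

/* The function from the message, unchanged in behaviour. */
void TCRelease(TCTypeRef tc) {
    if (--tc->retainCount == 0) {
        if (tc->destroy) {
            tc->destroy(tc);
        }
        free((void *)tc);
    }
}

/* Hypothetical helper: allocate an object with a given count and destructor. */
TCTypeRef TCCreate(int16_t count, void (*destroy)(TCTypeRef)) {
    TCTypeRef tc = malloc(sizeof *tc);
    if (tc != NULL) {
        tc->retainCount = count;
        tc->reserved = 0;
        tc->destroy = destroy;
    }
    return tc;
}

/* Hypothetical destructor that only counts how often it was called. */
int destroyCalls = 0;

void countDestroy(TCTypeRef tc) {
    (void)tc;
    ++destroyCalls;
}
```

Creating an object with TCCreate(2, countDestroy) and releasing it twice should run the destructor exactly once, on the second release.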

The generated m68k asm is this:
_TCRelease:
    move.l %a2,-(%sp)
    move.l 8(%sp),%a2
    move.w (%a2),%d0  ; Question 1:
    subq.w #1,%d0
    move.w %d0,(%a2)
    jne .L7
    move.l 4(%a2),%a0  ; Question 2:
    cmp.w #0,%a0
    jeq .L9
    move.l %a2,-(%sp)   ; Question 3:
    jsr (%a0)
    addq.l #4,%sp
.L9:
    move.l %a2,8(%sp)
    move.l (%sp)+,%a2
    jra _free
.L7:
    move.l (%sp)+,%a2
    rts

Question 1:
This could be done as one instruction, "subq.w #1,(%a2)": the result
in d0 is never used again, and subtracting directly in memory still
updates the status flags. That would save 4 bytes and 8 cycles on a 68000.
How would I attack this problem? With a peephole optimisation, or is
GCC perhaps not aware that the instruction updates the flags?

Question 2:
Doing this as a "move.l 4(%a2),%d0" to a temporary data register
would update the status register, allowing the branch without the
compare-with-immediate instruction. It obviously requires an extra
"move.l %d0,%a0" if the branch is not taken, to be able to make the
jump, but that still saves 2 bytes, and 8 cycles in the worst case
(12 cycles in the best case).
Is this a peephole optimisation? Or is it about providing accurate
instruction costs?

Question 3:
Storing a2 on the stack is only ever needed if this code path is
taken. Is this even worth bothering with? And is this something that
moving from reload to LRA for the m68k target would solve?

// Fredrik Olsson


* Re: m68k optimisation for beginners?
  2014-02-12  9:37 m68k optimisation for beginners? Fredrik Olsson
@ 2014-02-12 14:48 ` Jeff Law
  0 siblings, 0 replies; 2+ messages in thread
From: Jeff Law @ 2014-02-12 14:48 UTC
  To: Fredrik Olsson, gcc

On 02/12/14 02:37, Fredrik Olsson wrote:
> Question 1:
> This could be done as one instruction, "subq.w #1,(%a2)": the result
> in d0 is never used again, and subtracting directly in memory still
> updates the status flags. That would save 4 bytes and 8 cycles on a 68000.
> How would I attack this problem? With a peephole optimisation, or is
> GCC perhaps not aware that the instruction updates the flags?
Most likely an issue in the combiner.  Prior to conversion to RTL, the
decrement is turned into a three-statement form (load from memory,
decrement, store back to memory).  The decremented value is also used in
the comparison, so I can reasonably guess the combiner is unable to squash
all of that back into a single insn.
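
As a rough illustration of that three-statement form, here is a C sketch written out by hand rather than taken from a GCC dump. The struct Counted and both function names are hypothetical. The point is only that the temporary holding the decremented value has two uses, the store and the comparison, which is what makes merging everything into one insn harder.

```c
#include <stdint.h>

/* Hypothetical stand-in for the reference-counted type. */
struct Counted {
    int16_t retainCount;
};

/* Source form: one expression decrements and tests. */
int releaseCompact(struct Counted *c) {
    return --c->retainCount == 0;
}

/* Hand-expanded form, roughly how the decrement looks once split up. */
int releaseExpanded(struct Counted *c) {
    int16_t tmp = c->retainCount;  /* load from memory */
    tmp = (int16_t)(tmp - 1);      /* decrement */
    c->retainCount = tmp;          /* store back to memory */
    return tmp == 0;               /* comparison reuses the same temporary */
}
```

Both functions should behave identically; only the shape the optimizers see differs.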

Also note that the flags are effectively not exposed on the m68k.  Instead,
a conditional branch is modeled as two insns: one that sets a special
register, cc0, and one that uses the cc0 register.  Those two insns are
kept consecutive throughout the RTL optimizers, and only during final
assembly do we try to eliminate the compare by tracking the state of the
flags register.

There are better ways to do that, but nobody has converted the m68k to 
the newer style.  It's a fair amount of work and not a high priority.
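
The idea of eliminating a compare by tracking the flags state can be pictured with a toy model. The sketch below is not GCC's code and does not use GCC's data structures; the Insn struct, the opcode kinds and dropRedundantTests are all hypothetical. It walks a straight-line instruction list, remembers which register the flags currently describe, and drops a test against zero when the flags already describe that register.

```c
#include <stddef.h>

/* Hypothetical instruction kinds for the toy model. */
enum Kind {
    OP_ARITH,    /* writes reg and sets the flags from the result */
    OP_NOFLAGS,  /* writes reg without touching the flags */
    OP_TST,      /* compares reg against zero, sets the flags */
    OP_CALL      /* leaves the flags in an unknown state */
};

struct Insn {
    enum Kind kind;
    int reg;     /* register written or tested; ignored for OP_CALL */
};

/* Copies in[0..n) to out, skipping tests made redundant by an earlier
   instruction. Returns the number of instructions kept. */
size_t dropRedundantTests(const struct Insn *in, size_t n, struct Insn *out) {
    int flagsReg = -1;  /* register the flags describe, -1 if unknown */
    size_t kept = 0;

    for (size_t i = 0; i < n; ++i) {
        switch (in[i].kind) {
        case OP_TST:
            if (in[i].reg == flagsReg) {
                continue;  /* flags already reflect this register */
            }
            flagsReg = in[i].reg;
            break;
        case OP_ARITH:
            flagsReg = in[i].reg;
            break;
        case OP_NOFLAGS:
            if (in[i].reg == flagsReg) {
                flagsReg = -1;  /* tracked register changed, flags are stale */
            }
            break;
        case OP_CALL:
            flagsReg = -1;
            break;
        }
        out[kept++] = in[i];
    }
    return kept;
}
```

A model this simple only works on straight-line code, which hints at why the compare and the branch have to stay adjacent for this scheme to be safe.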

>
> Question 2:
> Doing this as a "move.l 4(%a2),%d0" to a temporary data register
> would update the status register, allowing the branch without the
> compare-with-immediate instruction. It obviously requires an extra
> "move.l %d0,%a0" if the branch is not taken, to be able to make the
> jump, but that still saves 2 bytes, and 8 cycles in the worst case
> (12 cycles in the best case).
> Is this a peephole optimisation? Or is it about providing accurate
> instruction costs?
Can't be tackled without first fixing how we track the flags register.

>
> Question 3:
> Storing a2 on the stack is only ever needed if this code path is
> taken. Is this even worth bothering with? And is this something that
> moving from reload to LRA for the m68k target would solve?
This is called shrink wrapping.  GCC has some limited support for 
shrink-wrapping these days.  Someone would have to look into why the 
shrink-wrapping optimization did not apply here.
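
The effect of shrink-wrapping can be pictured at the source level. The sketch below is a hand-written analogue, not what the GCC pass does internally: the common path, where the count stays non-zero, returns from a small function that has no need for a callee-saved register, and only the rare path enters a separate function that does. The struct Obj, the function names and the use of GCC's noinline attribute are assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the reference-counted type. */
struct Obj {
    int16_t retainCount;
    void (*destroy)(struct Obj *);
};

/* Hypothetical destructor that only counts how often it was called. */
int destroyed = 0;

void markDestroyed(struct Obj *o) {
    (void)o;
    ++destroyed;
}

/* Slow path: the pointer must survive the destroy call, so a
   callee-saved register and its save/restore are only needed here. */
__attribute__((noinline))
static void releaseSlow(struct Obj *o) {
    if (o->destroy) {
        o->destroy(o);
    }
    free(o);
}

/* Fast path: decrement, test, return. */
void releaseObj(struct Obj *o) {
    if (--o->retainCount != 0) {
        return;
    }
    releaseSlow(o);
}
```

With shrink-wrapping working as intended, the compiler would place the register save so that the early return avoids it, without the source having to be split by hand like this.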

Jeff

