public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug middle-end/110089] New: sub-optimal code for attempting to produce JNA (jump on CF or ZF)
@ 2023-06-02 11:11 rguenth at gcc dot gnu.org
  2023-06-02 11:13 ` [Bug middle-end/110089] " rguenth at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-02 11:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110089

            Bug ID: 110089
           Summary: sub-optimal code for attempting to produce JNA (jump
                    on CF or ZF)
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

void bar ();
void foo (unsigned int n, unsigned s)
{
  do
    {
      bar (MIN (n, s));
    }
  while (!__builtin_usub_overflow (n, s, &n) && n != 0);
}

tries to construct a loop processing N elements S at a time and in the
last iteration N' < S elements.  The x86 sub instruction sets CF and ZF
so a jump on CF or ZF aka JNA (aka ! GT) should be possible but I can't
arrive at that code gen.

We expand from

  # n_4 = PHI <n_6(D)(2), _2(3)>
  _1 = n_4 % s_8(D);
  bar (_1);
  _10 = .SUB_OVERFLOW (n_4, s_8(D));
  _2 = REALPART_EXPR <_10>;
  _3 = IMAGPART_EXPR <_10>;
  _11 = _2 != 0;
  _15 = _3 ^ 1;
  _12 = (_Bool) _15;
  _13 = _11 & _12;
  if (_13 != 0)
    goto <bb 3>; [89.30%]

but that's all too complicated for combine or compare-elim

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/110089] sub-optimal code for attempting to produce JNA (jump on CF or ZF)
  2023-06-02 11:11 [Bug middle-end/110089] New: sub-optimal code for attempting to produce JNA (jump on CF or ZF) rguenth at gcc dot gnu.org
@ 2023-06-02 11:13 ` rguenth at gcc dot gnu.org
  2023-06-02 11:21 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-02 11:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110089

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
What's the idea how we synthesize these kind of conditional jumps?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/110089] sub-optimal code for attempting to produce JNA (jump on CF or ZF)
  2023-06-02 11:11 [Bug middle-end/110089] New: sub-optimal code for attempting to produce JNA (jump on CF or ZF) rguenth at gcc dot gnu.org
  2023-06-02 11:13 ` [Bug middle-end/110089] " rguenth at gcc dot gnu.org
@ 2023-06-02 11:21 ` rguenth at gcc dot gnu.org
  2023-06-02 11:30 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-02 11:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110089

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|x86_64-*-*                  |x86_64-*-*, powerpc*
                 CC|                            |linkw at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
powerpc could use subf. and jgt?  This is for the downward iterating IV for
partial-vector-usage=2 loop vectorization.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/110089] sub-optimal code for attempting to produce JNA (jump on CF or ZF)
  2023-06-02 11:11 [Bug middle-end/110089] New: sub-optimal code for attempting to produce JNA (jump on CF or ZF) rguenth at gcc dot gnu.org
  2023-06-02 11:13 ` [Bug middle-end/110089] " rguenth at gcc dot gnu.org
  2023-06-02 11:21 ` rguenth at gcc dot gnu.org
@ 2023-06-02 11:30 ` rguenth at gcc dot gnu.org
  2023-06-02 11:48 ` amonakov at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-02 11:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110089

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
That said, writing it the way the vectorizer currently emits it is

void bar ();
void foo (unsigned int n, unsigned s)
{
  unsigned oldn;
  do
    {
      bar (n % s);
      oldn = n;
      n = n - s;
    }
  while (oldn > s);
}

which results in (on x86):

.L2:
        movl    %ebx, %eax
        xorl    %edx, %edx
        divl    %ebp
        xorl    %eax, %eax
        movl    %edx, %edi
        call    bar
        movl    %ebx, %eax
        subl    %ebp, %ebx
        cmpl    %eax, %ebp
        jb      .L2

where the sub and the cmp should be combinable.  combine sees the following
but it doesn't try 12 -> 13 (since we set 84 in 13 which is used in 12)
and thus also not 12 -> 13 -> 15.  Also there's no "combine" incentive
for this since the 15 doesn't use 84 (12 -> 15 fails).

(insn 12 11 13 3 (set (reg/v:SI 83 [ n ])
        (reg/v:SI 84 [ n ])) 89 {*movsi_internal}
     (nil))      
(insn 13 12 15 3 (parallel [
            (set (reg/v:SI 84 [ n ])
                (minus:SI (reg/v:SI 84 [ n ])
                    (reg/v:SI 85 [ s ])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":9:9 330 {*subsi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))  
(insn 15 13 16 3 (set (reg:CC 17 flags)
        (compare:CC (reg/v:SI 85 [ s ])
            (reg/v:SI 83 [ n ]))) "t.c":11:15 discrim 1 11 {*cmpsi_1}
     (expr_list:REG_DEAD (reg/v:SI 83 [ n ])
        (nil)))
(jump_insn 16 15 17 3 (set (pc)
        (if_then_else (ltu (reg:CC 17 flags)
                (const_int 0 [0]))
            (label_ref:DI 14)
            (pc))) "t.c":11:15 discrim 1 1002 {*jcc}
     (expr_list:REG_DEAD (reg:CC 17 flags)
        (int_list:REG_BR_PROB 955630228 (nil)))

cmpelim sees basically the same IL:

(insn 12 11 13 3 (set (reg/v:SI 0 ax [orig:83 n ] [83])
        (reg/v:SI 3 bx [orig:84 n ] [84])) 89 {*movsi_internal}
     (nil))
(insn 13 12 15 3 (parallel [
            (set (reg/v:SI 3 bx [orig:84 n ] [84])
                (minus:SI (reg/v:SI 3 bx [orig:84 n ] [84])
                    (reg/v:SI 6 bp [orig:85 s ] [85])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":9:9 330 {*subsi_1}
     (nil))
(insn 15 13 16 3 (set (reg:CC 17 flags)
        (compare:CC (reg/v:SI 6 bp [orig:85 s ] [85])
            (reg/v:SI 0 ax [orig:83 n ] [83]))) "t.c":11:15 discrim 1 11
{*cmpsi_1}
     (nil))
(jump_insn 16 15 25 3 (set (pc)
        (if_then_else (ltu (reg:CC 17 flags)
                (const_int 0 [0]))
            (label_ref:DI 14)
            (pc))) "t.c":11:15 discrim 1 1002 {*jcc}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/110089] sub-optimal code for attempting to produce JNA (jump on CF or ZF)
  2023-06-02 11:11 [Bug middle-end/110089] New: sub-optimal code for attempting to produce JNA (jump on CF or ZF) rguenth at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2023-06-02 11:30 ` rguenth at gcc dot gnu.org
@ 2023-06-02 11:48 ` amonakov at gcc dot gnu.org
  2023-06-02 11:57 ` ubizjak at gmail dot com
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-06-02 11:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110089

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
If 'min' is needed anyway you can use it in subtraction:

void bar ();
void foo (unsigned int n, unsigned s)
{
  do
    {
      np = MIN (n, s);
      bar (np);
    }
  while (n -= np);
}

but getting the sub-jcc trick to work should yield more efficient code.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/110089] sub-optimal code for attempting to produce JNA (jump on CF or ZF)
  2023-06-02 11:11 [Bug middle-end/110089] New: sub-optimal code for attempting to produce JNA (jump on CF or ZF) rguenth at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2023-06-02 11:48 ` amonakov at gcc dot gnu.org
@ 2023-06-02 11:57 ` ubizjak at gmail dot com
  2023-06-02 14:01 ` rguenth at gcc dot gnu.org
  2023-06-05  2:58 ` linkw at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: ubizjak at gmail dot com @ 2023-06-02 11:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110089

--- Comment #5 from Uroš Bizjak <ubizjak at gmail dot com> ---
The important pattern in i386.md is *sub<mode>2, which allows CCGOCmode
compare. This means that garbage in Overlow and Carry flags are allowed.

In ix86_cc_modes_compatible, CCmode is returned for combination of CCCmode and
CCZmode. This is admittely not optimal, but it was OK untill now for testing of
separate bits. Maybe we should add new combined mode for CCZ and CCC flags?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/110089] sub-optimal code for attempting to produce JNA (jump on CF or ZF)
  2023-06-02 11:11 [Bug middle-end/110089] New: sub-optimal code for attempting to produce JNA (jump on CF or ZF) rguenth at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2023-06-02 11:57 ` ubizjak at gmail dot com
@ 2023-06-02 14:01 ` rguenth at gcc dot gnu.org
  2023-06-05  2:58 ` linkw at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-02 14:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110089

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #4)
> If 'min' is needed anyway you can use it in subtraction:
> 
> void bar ();
> void foo (unsigned int n, unsigned s)
> {
>   do
>     {
>       np = MIN (n, s);
>       bar (np);
>     }
>   while (n -= np);
> }
> 
> but getting the sub-jcc trick to work should yield more efficient code.

It is needed, but using it in the subtraction creates an IV that's not
friendly to niter analysis because the step is no longer invariant.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/110089] sub-optimal code for attempting to produce JNA (jump on CF or ZF)
  2023-06-02 11:11 [Bug middle-end/110089] New: sub-optimal code for attempting to produce JNA (jump on CF or ZF) rguenth at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-06-02 14:01 ` rguenth at gcc dot gnu.org
@ 2023-06-05  2:58 ` linkw at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: linkw at gcc dot gnu.org @ 2023-06-05  2:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110089

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at gcc dot gnu.org

--- Comment #7 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> powerpc could use subf. and jgt?  This is for the downward iterating IV for
> partial-vector-usage=2 loop vectorization.

Currently subf. exploitation only happens for CCmode and <MODE>mode == Pmode
(patterns *subf<mode>3_dot and *subf<mode>3_dot2), as the record form on all
fixed point insns perform "the first three bits of CR Field 0 are set by
**signed** comparison of the result to zero", it can't be used for CCUNSmode
and 
<MODE>mode == Pmode. But it looks we can extend it for unsigned int on 64-bit.

Even if I did a hack to extend it with one *subf<mode>3_dot3 for UNSCC + SImode
+ Pmode==DImode, it still failed to match in combine, the pattern looks like:

  (insn 24 23 25 4 (set (reg/v:DI 127 [ n ])
          (reg/v:DI 131 [ n ])) 682 {*movdi_internal64}
       (nil))
  (insn 25 24 26 4 (set (reg:SI 142)
          (minus:SI (subreg/s/v:SI (reg/v:DI 131 [ n ]) 0)
              (subreg/s/v:SI (reg/v:DI 132 [ s ]) 0))) "test1.c":20:9 94
{*subfsi3}
       (expr_list:REG_DEAD (reg/v:DI 131 [ n ])
          (nil)))
  (insn 26 25 28 4 (set (reg/v:DI 131 [ n ])
          (zero_extend:DI (reg:SI 142))) "test1.c":20:9 16 {zero_extendsidi2}
       (expr_list:REG_DEAD (reg:SI 142)
          (nil)))
  (insn 28 26 29 4 (set (reg:CCUNS 143)
          (compare:CCUNS (subreg/s/v:SI (reg/v:DI 127 [ n ]) 0)
              (subreg/s/v:SI (reg/v:DI 132 [ s ]) 0))) "test1.c":22:15 discrim
1 805 {*cmpsi_unsigned}
       (expr_list:REG_DEAD (reg/v:DI 127 [ n ])
          (nil)))

-------

At expand, we have 

    # n_9 = PHI <n_12(D)(2), n_19(3)>
    # sum_10 = PHI <0(2), sum_18(3)>
    _1 = MIN_EXPR <n_9, s_13(D)>;
    len_14 = (int) _1;
    _2 = (long unsigned int) len_14;
    _3 = _2 * 4;
    _4 = a_15(D) + _3;
    _5 = *_4;
    _6 = b_17(D) + _3;
    _7 = *_6;
    _8 = _5 + _7;
    sum_18 = _8 + sum_10;
    n_46 = n_9;
    n_19 = n_9 - s_13(D);     // different operand here and below,
    if (n_46 > s_13(D))       // we don't need this n_46?
      goto <bb 3>; [89.00%]
    else
      goto <bb 4>; [11.00%]
  ;;    succ:       3
  ;;                4

---------

btw, I tried with a constant value for s like 16, it can exploit hardware loop
(bdnz) on Power:

#define TYPE unsigned int
#define MIN(a,b) ((a>b)?b:a)

int
foo (int *__restrict a, int *__restrict b, TYPE n)
{
  if (n <= 0)
    return 0;

  int len;
  int sum = 0;
  TYPE oldn;
  do
    {
      len = MIN (n, 16);
      sum += a[len] + b[len];
      oldn = n;
      n = n - 16;
    }
  while (oldn > 16);

  return sum;
}

.L3:
        cmplwi 0,5,16
        addi 8,5,-16
        isel 9,7,5,1
        rldicl 5,8,0,32
        rldic 9,9,2,30
        lwzx 8,3,9
        lwzx 9,4,9
        add 10,8,10
        add 10,10,9
        bdnz .L3

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-06-05  2:58 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-02 11:11 [Bug middle-end/110089] New: sub-optimal code for attempting to produce JNA (jump on CF or ZF) rguenth at gcc dot gnu.org
2023-06-02 11:13 ` [Bug middle-end/110089] " rguenth at gcc dot gnu.org
2023-06-02 11:21 ` rguenth at gcc dot gnu.org
2023-06-02 11:30 ` rguenth at gcc dot gnu.org
2023-06-02 11:48 ` amonakov at gcc dot gnu.org
2023-06-02 11:57 ` ubizjak at gmail dot com
2023-06-02 14:01 ` rguenth at gcc dot gnu.org
2023-06-05  2:58 ` linkw at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).