public inbox for gcc-patches@gcc.gnu.org
* Re: [PATCH] x86: Improve expansion of __builtin_parity
@ 2020-06-07 16:50 Uros Bizjak
  0 siblings, 0 replies; 2+ messages in thread
From: Uros Bizjak @ 2020-06-07 16:50 UTC (permalink / raw)
  To: gcc-patches; +Cc: Roger Sayle

Hello Roger and nice to hear from you after a loong time!

> Alas there is a mismatch between RTL's definition of PARITY operation
> which has an integer mode result, and the x86's parity flag.  So when
> Uros addressed PR target/44481 in 2010, by introducing UNSPEC PARITY,
> we lost some of the optimization opportunities of his original patch.
> https://gcc.gnu.org/legacy-ml/gcc-patches/2010-06/msg01259.html
> The early splitting approach in this patch safely restores the 2007-2010
> optimization opportunities.

Yes, I agree that in this case it is better to start early with an
expanded sequence.

> It turns out that the popcount instruction on modern x86_64 processors
> has (almost) made the integer parity flag in the x86 ALU completely
> obsolete, especially as POPCOUNT's integer semantics are a much better
> fit to RTL.  The one remaining case where these transistors are useful
> is where __builtin_parity is immediately tested by a conditional branch,
> and therefore the result is wanted in a flags register rather than as
> an integer.  This case is captured by two peephole2 optimizations in
> the attached patch.

[...]

> Alas I'm very out of practice contributing patches (but my paperwork should
> still be valid), so if a friendly i386/x86_64 backend maintainer could take
> care of things from here, that would be very much appreciated.

I'll take care of the patch. There is a small fix needed (see below),
so I'll re-bootstrap the patch and commit it after a successful
regression test.

2020-06-05  Roger Sayle  <roger@nextmovesoftware.com>

        * config/i386/i386.md (paritydi2, paritysi2): Expand reduction
        via shift and xor to an UNSPEC PARITY matching a parityhi2_cmp.
        (paritydi2_cmp, paritysi2_cmp): Delete these define_insn_and_split.
        (parityhi2, parityqi2): New expanders.
        (parityhi2_cmp): Implement set parity flag with xorb insn.
        (parityqi2_cmp): Implement set parity flag with testb insn.
        New peephole2s to use these insns (UNSPEC PARITY) when appropriate.

2020-06-05  Roger Sayle  <roger@nextmovesoftware.com>

        * gcc.target/i386/parity-[3-9].c: New tests.

So, OK for mainline with a small fix.

+(define_insn "parityqi2_cmp"
+  [(set (reg:CC FLAGS_REG)
+    (unspec:CC [(match_operand:QI 0 "register_operand")]
+           UNSPEC_PARITY))]
+  ""
+  "test{b}\t%b0, %b0"
+  [(set_attr "mode" "QI")])

This is an instruction pattern, so it needs a "q" register constraint.
"q" is a superset of "Q", so there should be no problem converting the
HImode parityhi2_cmp to parityqi2_cmp.  Also, there is no need for the
"b" operand modifier, since the operand is already in QImode.
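
For reference, the pattern with both changes applied would look something
like this (an untested sketch of what I mean, not the final commit):

(define_insn "parityqi2_cmp"
  [(set (reg:CC FLAGS_REG)
    (unspec:CC [(match_operand:QI 0 "register_operand" "q")]
           UNSPEC_PARITY))]
  ""
  "test{b}\t%0, %0"
  [(set_attr "mode" "QI")])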

Thanks,
Uros.


* [PATCH] x86: Improve expansion of __builtin_parity
@ 2020-06-06 13:46 Roger Sayle
  0 siblings, 0 replies; 2+ messages in thread
From: Roger Sayle @ 2020-06-06 13:46 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4552 bytes --]

 

This patch changes the way that __builtin_parity is expanded in i386.md.
This changes the code generated for the function:

int test(unsigned char a)
{
   unsigned char x = a + 1;
   return __builtin_parity(x);
}

from this before the patch:

test:   addl    $1, %edi
        movzbl  %dil, %edi
        movl    %edi, %eax
        shrl    $16, %edi
        xorl    %edi, %eax
        xorb    %ah, %al
        setnp   %al
        movzbl  %al, %eax
        ret

to this with the patch:

test:   leal    1(%rdi), %eax
        testb   %al, %al
        setnp   %al
        movzbl  %al, %eax
        ret

GCC currently hides the shift and xor reduction inside a backend
specific UNSPEC PARITY, making it invisible to the RTL optimizers until
very late during compilation.  It is normally reasonable for the
middle-end to maintain wider mode representations for as long as possible
and split them later, but this only helps if the semantics are visible
at the RTL level (to combine and other passes); UNSPECs are black
boxes, so in this case splitting early (during RTL expansion) is a
better strategy.
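
For the curious, the reduction itself is just repeated xor folding, which
preserves parity at every step.  The C sketch below is purely illustrative
(it is not part of the patch, and parity_fold is a made-up name); the final
byte-level fold is what the xorb emitted by parityhi2_cmp performs while
setting the flags:

/* Illustrative sketch: xor folding preserves parity, so the parity of
   the returned byte equals __builtin_parity (x).  A testb/xorb on this
   byte then sets the x86 PF flag exactly when that parity is even.  */
static unsigned char
parity_fold (unsigned int x)
{
  x ^= x >> 16;   /* fold the high 16 bits into the low 16 */
  x ^= x >> 8;    /* fold the high byte into the low byte */
  return (unsigned char) x;
}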

 

I'd originally written this patch after investigating whether GCC could
generate the x86's "jp" and "jnp" conditional branch on parity flag,
and noticed the inefficient reduction.  It was only when I was preparing
this patch for submission that I discovered that this is actually a
regression fix, and that Uros had already implemented/thought about the
above optimization back in 2007.  Indeed the example above is taken from
https://gcc.gnu.org/legacy-ml/gcc-patches/2007-02/msg00972.html

Alas there is a mismatch between RTL's definition of PARITY operation
which has an integer mode result, and the x86's parity flag.  So when
Uros addressed PR target/44481 in 2010, by introducing UNSPEC PARITY,
we lost some of the optimization opportunities of his original patch.
https://gcc.gnu.org/legacy-ml/gcc-patches/2010-06/msg01259.html
The early splitting approach in this patch safely restores the 2007-2010
optimization opportunities.

It turns out that the popcount instruction on modern x86_64 processors
has (almost) made the integer parity flag in the x86 ALU completely
obsolete, especially as POPCOUNT's integer semantics are a much better
fit to RTL.  The one remaining case where these transistors are useful
is where __builtin_parity is immediately tested by a conditional branch,
and therefore the result is wanted in a flags register rather than as
an integer.  This case is captured by two peephole2 optimizations in
the attached patch.

Hence the code:

void foo(unsigned char x)
{
  if (__builtin_parity(x))
    bar();
}

which on modern hardware is currently generated as:

foo:    movzbl  %dil, %edi
        popcntl %edi, %edi
        andl    $1, %edi
        jne     .L4
        ret
.L4:    jmp     bar

with this patch will be generated as:

foo:    testb   %dil, %dil
        jnp     .L4
        ret
.L4:    jmp     bar

[I believe that popcount has a 3-cycle latency, whereas a test followed
by a conditional branch can dispatch in the same cycle, but I'm not an
expert].  The new parityhi2 and parityqi2 expanders are infrastructure
to support (possible) future middle-end changes.

This patch has been tested with a "make bootstrap" and "make -k check" on
x86_64-pc-linux-gnu with no regressions.  I've also added 7 new test cases;
4 tests gcc.target/i386/parity-[3-6].c check the existing behaviour, and
3 tests gcc.target/i386/parity-[7-9].c check the changes from this patch.

Alas I'm very out of practice contributing patches (but my paperwork should
still be valid), so if a friendly i386/x86_64 backend maintainer could take
care of things from here, that would be very much appreciated.

Many thanks in advance,
Roger
--

2020-06-05  Roger Sayle  <roger@nextmovesoftware.com>

        * config/i386/i386.md (paritydi2, paritysi2): Expand reduction
        via shift and xor to an UNSPEC PARITY matching a parityhi2_cmp.
        (paritydi2_cmp, paritysi2_cmp): Delete these define_insn_and_split.
        (parityhi2, parityqi2): New expanders.
        (parityhi2_cmp): Implement set parity flag with xorb insn.
        (parityqi2_cmp): Implement set parity flag with testb insn.
        New peephole2s to use these insns (UNSPEC PARITY) when appropriate.

2020-06-05  Roger Sayle  <roger@nextmovesoftware.com>

        * gcc.target/i386/parity-[3-9].c: New tests.


[-- Attachment #2: patcha.txt --]
[-- Type: text/plain, Size: 8143 bytes --]

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 459cf62..6659657 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -14780,9 +14780,32 @@
   "! TARGET_POPCNT"
 {
   rtx scratch = gen_reg_rtx (QImode);
+  rtx hipart1 = gen_reg_rtx (SImode);
+  rtx lopart1 = gen_reg_rtx (SImode);
+  rtx xor1 = gen_reg_rtx (SImode);
+  rtx shift2 = gen_reg_rtx (SImode);
+  rtx hipart2 = gen_reg_rtx (HImode);
+  rtx lopart2 = gen_reg_rtx (HImode);
+  rtx xor2 = gen_reg_rtx (HImode);
 
-  emit_insn (gen_paritydi2_cmp (NULL_RTX, NULL_RTX,
-				NULL_RTX, operands[1]));
+  if (TARGET_64BIT)
+    {
+      rtx shift1 = gen_reg_rtx (DImode);
+      emit_insn (gen_lshrdi3 (shift1, operands[1], GEN_INT (32)));
+      emit_move_insn (hipart1, gen_lowpart (SImode, shift1));
+    }
+  else
+    emit_move_insn (hipart1, gen_highpart (SImode, operands[1]));
+
+  emit_move_insn (lopart1, gen_lowpart (SImode, operands[1]));
+  emit_insn (gen_xorsi3 (xor1, hipart1, lopart1));
+
+  emit_insn (gen_lshrsi3 (shift2, xor1, GEN_INT (16)));
+  emit_move_insn (hipart2, gen_lowpart (HImode, shift2));
+  emit_move_insn (lopart2, gen_lowpart (HImode, xor1));
+  emit_insn (gen_xorhi3 (xor2, hipart2, lopart2));
+
+  emit_insn (gen_parityhi2_cmp (xor2));
 
   ix86_expand_setcc (scratch, ORDERED,
 		     gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx);
@@ -14805,8 +14828,17 @@
   "! TARGET_POPCNT"
 {
   rtx scratch = gen_reg_rtx (QImode);
+  rtx shift = gen_reg_rtx (SImode);
+  rtx hipart = gen_reg_rtx (HImode);
+  rtx lopart = gen_reg_rtx (HImode);
+  rtx tmp = gen_reg_rtx (HImode);
+
+  emit_insn (gen_lshrsi3 (shift, operands[1], GEN_INT (16)));
+  emit_move_insn (hipart, gen_lowpart (HImode, shift));
+  emit_move_insn (lopart, gen_lowpart (HImode, operands[1]));
+  emit_insn (gen_xorhi3 (tmp, hipart, lopart));
 
-  emit_insn (gen_paritysi2_cmp (NULL_RTX, NULL_RTX, operands[1]));
+  emit_insn (gen_parityhi2_cmp (tmp));
 
   ix86_expand_setcc (scratch, ORDERED,
 		     gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx);
@@ -14815,70 +14847,128 @@
   DONE;
 })
 
-(define_insn_and_split "paritydi2_cmp"
-  [(set (reg:CC FLAGS_REG)
-	(unspec:CC [(match_operand:DI 3 "register_operand" "0")]
-		   UNSPEC_PARITY))
-   (clobber (match_scratch:DI 0 "=r"))
-   (clobber (match_scratch:SI 1 "=&r"))
-   (clobber (match_scratch:HI 2 "=Q"))]
+(define_expand "parityhi2"
+  [(set (match_operand:HI 0 "register_operand")
+	(parity:HI (match_operand:HI 1 "register_operand")))]
   "! TARGET_POPCNT"
-  "#"
-  "&& reload_completed"
-  [(parallel
-     [(set (match_dup 1)
-	   (xor:SI (match_dup 1) (match_dup 4)))
-      (clobber (reg:CC FLAGS_REG))])
-   (parallel
-     [(set (reg:CC FLAGS_REG)
-	   (unspec:CC [(match_dup 1)] UNSPEC_PARITY))
-      (clobber (match_dup 1))
-      (clobber (match_dup 2))])]
 {
-  operands[4] = gen_lowpart (SImode, operands[3]);
+  rtx scratch = gen_reg_rtx (QImode);
 
-  if (TARGET_64BIT)
-    {
-      emit_move_insn (operands[1], gen_lowpart (SImode, operands[3]));
-      emit_insn (gen_lshrdi3 (operands[3], operands[3], GEN_INT (32)));
-    }
-  else
-    operands[1] = gen_highpart (SImode, operands[3]);
+  emit_insn (gen_parityhi2_cmp (operands[1]));
+
+  ix86_expand_setcc (scratch, ORDERED,
+		     gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx);
+
+  emit_insn (gen_zero_extendqihi2 (operands[0], scratch));
+  DONE;
 })
 
-(define_insn_and_split "paritysi2_cmp"
-  [(set (reg:CC FLAGS_REG)
-	(unspec:CC [(match_operand:SI 2 "register_operand" "0")]
-		   UNSPEC_PARITY))
-   (clobber (match_scratch:SI 0 "=r"))
-   (clobber (match_scratch:HI 1 "=&Q"))]
+(define_expand "parityqi2"
+  [(set (match_operand:QI 0 "register_operand")
+	(parity:QI (match_operand:QI 1 "register_operand")))]
   "! TARGET_POPCNT"
-  "#"
-  "&& reload_completed"
-  [(parallel
-     [(set (match_dup 1)
-	   (xor:HI (match_dup 1) (match_dup 3)))
-      (clobber (reg:CC FLAGS_REG))])
-   (parallel
-     [(set (reg:CC FLAGS_REG)
-	   (unspec:CC [(match_dup 1)] UNSPEC_PARITY))
-      (clobber (match_dup 1))])]
 {
-  operands[3] = gen_lowpart (HImode, operands[2]);
+  emit_insn (gen_parityqi2_cmp (operands[1]));
 
-  emit_move_insn (operands[1], gen_lowpart (HImode, operands[2]));
-  emit_insn (gen_lshrsi3 (operands[2], operands[2], GEN_INT (16)));
+  ix86_expand_setcc (operands[0], ORDERED,
+		     gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx);
+  DONE;
 })
 
-(define_insn "*parityhi2_cmp"
+(define_insn "parityhi2_cmp"
   [(set (reg:CC FLAGS_REG)
-	(unspec:CC [(match_operand:HI 1 "register_operand" "0")]
+	(unspec:CC [(match_operand:HI 0 "register_operand" "+Q")]
 		   UNSPEC_PARITY))
-   (clobber (match_scratch:HI 0 "=Q"))]
-  "! TARGET_POPCNT"
+   (clobber (match_dup 0))]
+  ""
   "xor{b}\t{%h0, %b0|%b0, %h0}"
   [(set_attr "length" "2")
-   (set_attr "mode" "HI")])
+   (set_attr "mode" "QI")])
+
+(define_insn "parityqi2_cmp"
+  [(set (reg:CC FLAGS_REG)
+	(unspec:CC [(match_operand:QI 0 "register_operand")]
+		   UNSPEC_PARITY))]
+  ""
+  "test{b}\t%b0, %b0"
+  [(set_attr "mode" "QI")])
+
+;; Replace zero_extend:HI followed by parityhi2_cmp with parityqi2_cmp
+(define_peephole2
+  [(set (match_operand:HI 0 "register_operand")
+	(zero_extend:HI (match_operand:QI 1 "register_operand")))
+   (parallel [(set (reg:CC FLAGS_REG)
+		   (unspec:CC [(match_dup 0)] UNSPEC_PARITY))
+	      (clobber (match_dup 0))])]
+  ""
+  [(set (reg:CC FLAGS_REG)
+	(unspec:CC [(match_dup 1)] UNSPEC_PARITY))])
+
+;; Eliminate QImode popcount&1 using parity flag
+(define_peephole2
+  [(set (match_operand:SI 0 "register_operand")
+	(zero_extend:SI (match_operand:QI 1 "register_operand")))
+   (parallel [(set (match_operand:SI 2 "register_operand")
+		   (popcount:SI (match_dup 0)))
+	      (clobber (reg:CC FLAGS_REG))])
+   (set (reg:CCZ FLAGS_REG)
+	(compare:CCZ (and:QI (match_operand:QI 3 "register_operand")
+			     (const_int 1))
+		     (const_int 0)))
+   (set (pc) (if_then_else (match_operator 4 "bt_comparison_operator"
+			    [(reg:CCZ FLAGS_REG)
+			     (const_int 0)])
+			   (label_ref (match_operand 5))
+			   (pc)))]
+  "REGNO (operands[2]) == REGNO (operands[3])
+   && peep2_reg_dead_p (3, operands[0])
+   && peep2_reg_dead_p (3, operands[2])
+   && peep2_regno_dead_p (4, FLAGS_REG)"
+  [(set (reg:CC FLAGS_REG)
+	(unspec:CC [(match_dup 1)] UNSPEC_PARITY))
+   (set (pc) (if_then_else (match_op_dup 4 [(reg:CC FLAGS_REG)
+					    (const_int 0)])
+			   (label_ref (match_dup 5))
+			   (pc)))]
+{
+  operands[4] = shallow_copy_rtx (operands[4]);
+  PUT_CODE (operands[4], GET_CODE (operands[4]) == EQ ? UNORDERED : ORDERED);
+})
+
+;; Eliminate HImode popcount&1 using parity flag
+(define_peephole2
+  [(match_scratch:HI 0 "Q")
+   (parallel [(set (match_operand:HI 1 "register_operand")
+		   (popcount:HI
+		    (match_operand:HI 2 "nonimmediate_operand")))
+	      (clobber (reg:CC FLAGS_REG))])
+   (set (match_operand 3 "register_operand")
+	(zero_extend (match_dup 1)))
+   (set (reg:CCZ FLAGS_REG)
+	(compare:CCZ (and:QI (match_operand:QI 4 "register_operand")
+			     (const_int 1))
+		     (const_int 0)))
+   (set (pc) (if_then_else (match_operator 5 "bt_comparison_operator"
+			    [(reg:CCZ FLAGS_REG)
+			     (const_int 0)])
+			   (label_ref (match_operand 6))
+			   (pc)))]
+  "REGNO (operands[3]) == REGNO (operands[4])
+   && peep2_reg_dead_p (3, operands[1])
+   && peep2_reg_dead_p (3, operands[3])
+   && peep2_regno_dead_p (4, FLAGS_REG)"
+  [(set (match_dup 0) (match_dup 2))
+   (parallel [(set (reg:CC FLAGS_REG)
+		   (unspec:CC [(match_dup 0)] UNSPEC_PARITY))
+	      (clobber (match_dup 0))])
+   (set (pc) (if_then_else (match_op_dup 5 [(reg:CC FLAGS_REG)
+					    (const_int 0)])
+			   (label_ref (match_dup 6))
+			   (pc)))]
+{
+  operands[5] = shallow_copy_rtx (operands[5]);
+  PUT_CODE (operands[5], GET_CODE (operands[5]) == EQ ? UNORDERED : ORDERED);
+})
 
 \f
 ;; Thread-local storage patterns for ELF.

[-- Attachment #3: parity-3.c --]
[-- Type: text/plain, Size: 458 bytes --]

/* { dg-do compile } */
/* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */
/* { dg-final { scan-assembler "setp" } } */
/* { dg-final { scan-assembler "jnp" } } */
/* { dg-final { scan-assembler "jp" } } */

void dummy(void);

int foo(unsigned int x)
{
  return !__builtin_parity(x);
}

void bar(unsigned int x)
{
  if (__builtin_parity(x))
    dummy();
}

void baz(unsigned int x)
{
  if (!__builtin_parity(x))
    dummy();
}


[-- Attachment #4: parity-4.c --]
[-- Type: text/plain, Size: 482 bytes --]

/* { dg-do compile } */
/* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */
/* { dg-final { scan-assembler "setp" } } */
/* { dg-final { scan-assembler "jnp" } } */
/* { dg-final { scan-assembler "jp" } } */

void dummy(void);

int foo(unsigned long long x)
{
  return !__builtin_parityll(x);
}

void bar(unsigned long long x)
{
  if (__builtin_parityll(x))
    dummy();
}

void baz(unsigned long long x)
{
  if (!__builtin_parityll(x))
    dummy();
}


[-- Attachment #5: parity-5.c --]
[-- Type: text/plain, Size: 229 bytes --]

/* { dg-do compile } */
/* { dg-options "-O2 -march=core-avx2" } */
/* { dg-final { scan-assembler "popcnt" } } */
/* { dg-final { scan-assembler "and" } } */

int foo(unsigned int x)
{
  return __builtin_parity(x);
}


[-- Attachment #6: parity-6.c --]
[-- Type: text/plain, Size: 237 bytes --]

/* { dg-do compile } */
/* { dg-options "-O2 -march=core-avx2" } */
/* { dg-final { scan-assembler "popcnt" } } */
/* { dg-final { scan-assembler "and" } } */

int foo(unsigned long long x)
{
  return __builtin_parityll(x);
}


[-- Attachment #7: parity-7.c --]
[-- Type: text/plain, Size: 319 bytes --]

/* { dg-do compile } */
/* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */
/* { dg-final { scan-assembler-times "test" 2 } } */
/* { dg-final { scan-assembler-not "shr" } } */

int foo(unsigned char x)
{
  return __builtin_parity(x);
}

int bar(unsigned char x)
{
  return __builtin_parityll(x);
}


[-- Attachment #8: parity-8.c --]
[-- Type: text/plain, Size: 267 bytes --]

/* { dg-do compile } */
/* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */
/* { dg-final { scan-assembler-not "shr" } } */

int foo(unsigned short x)
{
  return __builtin_parity(x);
}

int bar(unsigned short x)
{
  return __builtin_parityll(x);
}


[-- Attachment #9: parity-9.c --]
[-- Type: text/plain, Size: 617 bytes --]

/* { dg-do compile } */
/* { dg-options "-O2 -march=core-avx2" } */
/* { dg-final { scan-assembler-not "popcnt" } } */
/* { dg-final { scan-assembler-not "shr" } } */
/* { dg-final { scan-assembler-times "jp" 2 } } */
/* { dg-final { scan-assembler-times "jnp" 2 } } */

void dummy(void);

void pos8(unsigned char x)
{
  if (__builtin_parity(x))
    dummy();
}

void neg8(unsigned char x)
{
  if (!__builtin_parity(x))
    dummy();
}

void pos16(unsigned short x)
{
  if (__builtin_parity(x))
    dummy();
}

void neg16(unsigned short x)
{
  if (!__builtin_parity(x))
    dummy();
}


