public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/114338] New: (x & (-1 << y)) should be optimized to ((x >> y) << y) or vice versa
From: Explorer09 at gmail dot com @ 2024-03-14 15:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114338

            Bug ID: 114338
           Summary: (x & (-1 << y)) should be optimized to ((x >> y) << y)
                    or vice versa
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: Explorer09 at gmail dot com
  Target Milestone: ---

### Test code

```c
#include <stdint.h>

/* Clear the low `count` bits of x using two shifts. */
unsigned int func1(unsigned int x, unsigned char count)
{
  return (x >> count) << count;
}

/* Clear the low `count` bits of x using a computed mask. */
unsigned int func2(unsigned int x, unsigned char count)
{
  return x & (~0U << count);
}

/* Same as func1, but for a 16-bit type. */
uint16_t func3(uint16_t x, unsigned char count)
{
  return (x >> count) << count;
}

/* Same as func2, but for a 16-bit type and a constant 16-bit mask. */
uint16_t func4(uint16_t x, unsigned char count)
{
  return x & (0xFFFF << count);
}
```

### Expected result

func1 and func2 should compile to identical code. The compiler should
pick the pattern that produces the smallest or fastest code for the
target processor.

func3 and func4 should also compile to identical code. If the ABI
doesn't require the upper bits of the registers to be zeroed, the
code for func3 and func4 could be as small as that for func1 and func2.
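
To make the claimed equivalence concrete, here is a minimal self-check
of my own (it assumes a 32-bit unsigned int, so shift counts 0 through
31 are the defined range):

```c
#include <assert.h>

int main(void)
{
  /* Exhaustive check over all defined shift counts: the shift form
     (func1) and the mask form (func2) agree on every sample value. */
  unsigned int samples[] = { 0u, 1u, 0xFFFFu, 0xDEADBEEFu, ~0u };
  for (unsigned int i = 0; i < sizeof samples / sizeof samples[0]; i++)
    for (unsigned char count = 0; count < 32; count++)
      assert(((samples[i] >> count) << count)
             == (samples[i] & (~0U << count)));
  return 0;
}
```

Both expressions clear the low `count` bits, so whichever form is
cheaper on the target could be chosen.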

### Current result (gcc)

With x86-64 gcc 13.2.0 and the "-Os" option:

func2 generates code that is one byte larger than func1.
func3 generates a MOVZX instruction (making it one byte larger than
func1) that could ideally be replaced with a MOV.

For func4, I guess you can put the test code into Compiler Explorer
(godbolt.org) and see the result yourself. (This is the real case I was
facing: I only need to work with a 16-bit input value, and I don't want
to promote it to uint32_t internally just for optimization purposes.)

### Current result (clang)

With clang 18.1.0 and the "-Os" option
(tested in Compiler Explorer, godbolt.org):

func1, func2 and func3 produce identical code on x86-64. It seems
that clang does recognize the pattern and optimizes accordingly.

For the x86-64 target, ((x >> count) << count) is used.
For the ARM and RISC-V targets, ((-1 << count) & x) is used.

However, clang doesn't recognize func4 as identical to func3, so
there is still room for improvement.


* [Bug rtl-optimization/114338] (x & (-1 << y)) should be optimized to ((x >> y) << y) or vice versa
From: rearnsha at gcc dot gnu.org @ 2024-03-14 16:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114338

--- Comment #1 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
Why would that be better?  On a machine that does not lack registers, there's
more instruction-level parallelism in 
 (set (tmp) (-1))
 (set (tmp) (ashift (tmp) (count)))
 (and (x) (x) (tmp))

What's more, on Arm/AArch64 insns 2 and 3 can be merged into a single
instruction:

  (set (tmp) (-1))
  (set (x) (and (ashift (tmp) (count)) (x)))

which is definitely preferable to two register-controlled shifts.
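
Restated in C (a paraphrase of the RTL sketch above, with a made-up
function name):

```c
/* C restatement of the RTL sequence above: the mask computation does
   not depend on x, so the first two steps can overlap with whatever
   produces x, and only one register-controlled shift is needed. */
unsigned int mask_form(unsigned int x, unsigned char count)
{
  unsigned int tmp = ~0u;  /* (set (tmp) (-1)) */
  tmp <<= count;           /* (set (tmp) (ashift (tmp) (count))) */
  return x & tmp;          /* (and (x) (x) (tmp)) */
}
```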


* [Bug rtl-optimization/114338] (x & (-1 << y)) should be optimized to ((x >> y) << y) or vice versa
From: jakub at gcc dot gnu.org @ 2024-03-14 16:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114338

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I bet the point of this PR is to canonicalize in GIMPLE to one of the forms
(both have the same number of GIMPLE statements, which makes it harder to pick
the canonical form) and let expansion try to expand both ways and query rtx
costs for whichever is faster or smaller; we do something like that in other
cases (e.g. for division, where ranges tell us we can use unsigned or signed
division equally).


* [Bug rtl-optimization/114338] (x & (-1 << y)) should be optimized to ((x >> y) << y) or vice versa
From: pinskia at gcc dot gnu.org @ 2024-03-14 16:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114338

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           Keywords|                            |missed-optimization


* [Bug rtl-optimization/114338] (x & (-1 << y)) should be optimized to ((x >> y) << y) or vice versa
From: Explorer09 at gmail dot com @ 2024-03-14 16:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114338

--- Comment #3 from Kang-Che Sung <Explorer09 at gmail dot com> ---
(In reply to Jakub Jelinek from comment #2)
> I bet the point of this PR is to canonicalize in GIMPLE to one of the forms
> (both have the same number of GIMPLE statements, which makes it harder to
> pick the canonical form) and let expansion try to expand both ways and query
> rtx costs for whichever is faster or smaller; we do something like that in
> other cases (e.g. for division, where ranges tell us we can use unsigned or
> signed division equally).

Yes. As someone who writes C code, the ideal outcome for me would be that I
can pick either pattern and it would "just work": gcc would see that the two
forms are semantically the same and generate the optimal code for the target
processor (whichever form that is).

The other issue is that the ((x >> count) << count) pattern might generate an
unnecessary instruction intended to "truncate" the upper bits. GCC can assume
they are already "truncated" and focus on clearing the lower bits instead
(which is what I intended to do).
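
A sketch of my own spelling that out for func3 (it assumes a 32-bit
unsigned int and a shift count below 32):

```c
#include <stdint.h>

/* For a uint16_t input, the value promoted to a wider type already has
   its upper bits zero, and (x >> count) << count can only clear bits,
   so the result still fits in 16 bits.  The trailing zero-extension
   (the MOVZX seen for func3) is therefore redundant in value terms. */
uint16_t clear_low_bits16(uint16_t x, unsigned char count)
{
  unsigned int wide = x;                      /* bits 16..31 are zero */
  unsigned int r = (wide >> count) << count;  /* still <= 0xFFFF */
  return (uint16_t)r;                         /* no value is lost here */
}
```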


* [Bug rtl-optimization/114338] (x & (-1 << y)) should be optimized to ((x >> y) << y) or vice versa
From: pinskia at gcc dot gnu.org @ 2024-03-14 17:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114338

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note I added this to the list of Canonicalization issues in gimple on the wiki:
https://gcc.gnu.org/wiki/GimpleCanonical


* [Bug rtl-optimization/114338] Optimizing (x & (-1 << y)) to ((x >> y) << y) or vice versa
From: rguenth at gcc dot gnu.org @ 2024-03-15  8:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114338

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
For canonicalization the BIT_AND variants might be preferable since they
possibly combine with other logical ops.  Also, when the number of operations
is the same, having more constant operands might be preferable.
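
As a hypothetical C-level illustration of that point (the constant mask
0x00F0 is arbitrary):

```c
/* With the mask form, a neighbouring AND reassociates with the mask:
   (x & (~0U << count)) & 0x00F0  ==  x & ((~0U << count) & 0x00F0),
   so x is masked only once.  The shift form,
   ((x >> count) << count) & 0x00F0, offers no such direct combination. */
unsigned int masked_field(unsigned int x, unsigned char count)
{
  return (x & (~0U << count)) & 0x00F0u;
}
```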


* [Bug rtl-optimization/114338] Optimizing (x & (-1 << y)) to ((x >> y) << y) or vice versa
From: Explorer09 at gmail dot com @ 2024-03-15  8:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114338

--- Comment #6 from Kang-Che Sung <Explorer09 at gmail dot com> ---
(In reply to Richard Biener from comment #5)
> For canonicalization the BIT_AND variants might be preferable since they
> possibly combine with other logical ops.  Also more constant operands
> when the number of operations is the same might be preferable.

I get your point (and Richard Earnshaw suggested the same in comment #1).

When processing an array of numbers, for example, `(array[i] & (-1U << count))`
would be faster than `((array[i] >> count) << count)`, since the
`(-1U << count)` mask can be computed outside the loop, as in the sketch below.
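
A sketch (hypothetical function and names):

```c
#include <stddef.h>

/* The mask depends only on count, so it is computed once before the
   loop and each iteration needs just one AND; the shift form would
   need two register-controlled shifts per element unless the compiler
   rewrites it into this mask form anyway. */
void clear_low_bits_in_array(unsigned int *array, size_t n,
                             unsigned char count)
{
  unsigned int mask = -1U << count;  /* hoisted out of the loop */
  for (size_t i = 0; i < n; i++)
    array[i] &= mask;
}
```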

