[Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division
@ 2021-12-29  3:00 zhaoweiliew at gmail dot com
  2021-12-29  3:16 ` [Bug tree-optimization/103855] " zhaoweiliew at gmail dot com
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: zhaoweiliew at gmail dot com @ 2021-12-29  3:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103855

            Bug ID: 103855
           Summary: Missed optimization: 64bit division used instead of
                    32bit division
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhaoweiliew at gmail dot com
  Target Milestone: ---

Compiler Explorer link: https://gcc.godbolt.org/z/KvW8sMsqz

I compiled the following code with `x86-64 gcc (trunk) -std=c++20 -O3 -Wall
-Wextra -Werror`:

```
unsigned int optimized(unsigned int a, unsigned int b) {
    return (unsigned long long)a / b;
}

unsigned int unoptimized(unsigned int a, unsigned int b) {
    unsigned long long all = a;
    return all / b;
}
```

This is the assembly output:

```
optimized(unsigned int, unsigned int):
        mov     eax, edi
        xor     edx, edx
        div     esi
        ret
unoptimized(unsigned int, unsigned int):
        mov     eax, edi
        mov     esi, esi
        xor     edx, edx
        div     rsi
        ret
```

GCC uses a 64-bit divide for `unoptimized()`, when a 32-bit divide would be
equivalent and faster. GCC uses a 32-bit divide for `optimized()`, which is
fine. Note that LLVM does a 32-bit division in both cases.

I would like to tackle this optimization, but am not sure how to go about doing
it. Could someone tell me/point me to resources that tell me what part of the
compiler and codebase I should be looking to optimize? Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103855] Missed optimization: 64bit division used instead of 32bit division
  2021-12-29  3:00 [Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division zhaoweiliew at gmail dot com
@ 2021-12-29  3:16 ` zhaoweiliew at gmail dot com
  2021-12-29  3:34 ` pinskia at gcc dot gnu.org
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: zhaoweiliew at gmail dot com @ 2021-12-29  3:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103855

--- Comment #1 from Zhao Wei Liew <zhaoweiliew at gmail dot com> ---
After some research, I decided to look into value range propagation (VRP). I
compiled with `-fdump-tree-vrp` and the VRP files contained the following for
`optimized()`:

For the file ending in .vrp1,

```
;; Function unoptimized (_Z11unoptimizedjj, funcdef_no=1, decl_uid=2121,
cgraph_uid=2, symbol_order=1)

;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2
;; 2 succs { 1 }

Value ranges after VRP:

_1: long long unsigned int [0, 4294967295]
_2: long long unsigned int [0, 4294967295]
a_3(D): unsigned int VARYING
all_4: long long unsigned int [0, 4294967295]
b_5(D): unsigned int VARYING
_6: unsigned int VARYING


unsigned int unoptimized (unsigned int a, unsigned int b)
{
  long long unsigned int all;
  long long unsigned int _1;
  long long unsigned int _2;
  unsigned int _6;

  <bb 2> [local count: 1073741824]:
  all_4 = (long long unsigned int) a_3(D);
  _1 = (long long unsigned int) b_5(D);
  _2 = all_4 / _1;
  _6 = (unsigned int) _2;
  return _6;

}
```

and for the file ending in .vrp2,

```
;; Function unoptimized (_Z11unoptimizedjj, funcdef_no=1, decl_uid=2121,
cgraph_uid=2, symbol_order=1)

;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2
;; 2 succs { 1 }
Exported global range table:
============================
_1  : long long unsigned int [0, 4294967295]
_2  : long long unsigned int [0, 4294967295]
all_4  : long long unsigned int [0, 4294967295]
unsigned int unoptimized (unsigned int a, unsigned int b)
{
  long long unsigned int all;
  long long unsigned int _1;
  long long unsigned int _2;
  unsigned int _6;

  <bb 2> [local count: 1073741824]:
  all_4 = (long long unsigned int) a_3(D);
  _1 = (long long unsigned int) b_5(D);
  _2 = all_4 / _1;
  _6 = (unsigned int) _2;
  return _6;

}
```

If I'm interpreting the above correctly, the optimizer seems to think all_4 can
be in the range [0, 4294967295], which is correct.

Hence, I suppose it's safe to conclude that VRP isn't the issue here? Please
correct me if I'm wrong.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103855] Missed optimization: 64bit division used instead of 32bit division
  2021-12-29  3:00 [Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division zhaoweiliew at gmail dot com
  2021-12-29  3:16 ` [Bug tree-optimization/103855] " zhaoweiliew at gmail dot com
@ 2021-12-29  3:34 ` pinskia at gcc dot gnu.org
  2021-12-29  3:46 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-29  3:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103855

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-12-29
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
convert.c (convert_to_integer_1 <case TRUNC_DIV_EXPR>) is where it is done for
the optimized case.

You might be able to something similar in match.pd
Something like:
(simplify
 (convert1 (trunc_div (convert2 @0) (convert2 @1)))
 (if (.....

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103855] Missed optimization: 64bit division used instead of 32bit division
  2021-12-29  3:00 [Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division zhaoweiliew at gmail dot com
  2021-12-29  3:16 ` [Bug tree-optimization/103855] " zhaoweiliew at gmail dot com
  2021-12-29  3:34 ` pinskia at gcc dot gnu.org
@ 2021-12-29  3:46 ` pinskia at gcc dot gnu.org
  2021-12-29  3:53 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-29  3:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103855

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The other way to fix this is during expand, look for the smallest mode which
fits the range of the two operands of TRUNC_DIV_EXPR and there is a div optab
for that mode and use that mode.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103855] Missed optimization: 64bit division used instead of 32bit division
  2021-12-29  3:00 [Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division zhaoweiliew at gmail dot com
                   ` (2 preceding siblings ...)
  2021-12-29  3:46 ` pinskia at gcc dot gnu.org
@ 2021-12-29  3:53 ` pinskia at gcc dot gnu.org
  2021-12-29  8:21 ` zhaoweiliew at gmail dot com
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-29  3:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103855

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---

unsigned int optimized(unsigned int a, unsigned int b) {
    return (unsigned long long)a / b;
}

unsigned int unoptimized(unsigned int a, unsigned int b) {
    unsigned long long all = a;
    return all / b;
}

unsigned long long unoptimized1(unsigned int a, unsigned int b) {
    unsigned long long all = a;
    return all / b;
}

unsigned long long unoptimized2(unsigned long a, unsigned long b) {
    if (((unsigned int)a) != a) __builtin_unreachable();
    if (((unsigned int)b) != b) __builtin_unreachable();
    return a / b;
}

Is the full testcase, clang is able to handle all of them.  GCC only handles
the first one. If you just do the match.pd patch, unoptimized2 would be not
handled.
expmed.c (expand_divmod) is where the expansion happens from gimple to rtl.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103855] Missed optimization: 64bit division used instead of 32bit division
  2021-12-29  3:00 [Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division zhaoweiliew at gmail dot com
                   ` (3 preceding siblings ...)
  2021-12-29  3:53 ` pinskia at gcc dot gnu.org
@ 2021-12-29  8:21 ` zhaoweiliew at gmail dot com
  2021-12-29  8:32 ` zhaoweiliew at gmail dot com
  2021-12-29  9:41 ` jakub at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: zhaoweiliew at gmail dot com @ 2021-12-29  8:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103855

--- Comment #5 from Zhao Wei Liew <zhaoweiliew at gmail dot com> ---
Thanks for your guidance. I'm looking into adding a fix in expmed.c, but I
can't figure out how to get the range of op1 and op0 from within
expand_divmod().

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103855] Missed optimization: 64bit division used instead of 32bit division
  2021-12-29  3:00 [Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division zhaoweiliew at gmail dot com
                   ` (4 preceding siblings ...)
  2021-12-29  8:21 ` zhaoweiliew at gmail dot com
@ 2021-12-29  8:32 ` zhaoweiliew at gmail dot com
  2021-12-29  9:41 ` jakub at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: zhaoweiliew at gmail dot com @ 2021-12-29  8:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103855

--- Comment #6 from Zhao Wei Liew <zhaoweiliew at gmail dot com> ---
I see that the vect_get_range_info(tree, wide_int*, wide_int*) function returns
the range of a tree type, but in expand_divmod(), the operands are of rtx type.
Is there still a way to extract the range information from an rtx type?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103855] Missed optimization: 64bit division used instead of 32bit division
  2021-12-29  3:00 [Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division zhaoweiliew at gmail dot com
                   ` (5 preceding siblings ...)
  2021-12-29  8:32 ` zhaoweiliew at gmail dot com
@ 2021-12-29  9:41 ` jakub at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-12-29  9:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103855

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
vect_get_range_info is meant for the vectorizer only, what you want is I think
get_min_precision which would need to be exported out of internal-fn.c.
In any case, it needs to be called still on the trees (SSA_NAMEs or
INTEGER_CSTs etc.) during expansion.
expand_expr_divmod already performs a different optimization - if both operands
are known to have the most significant bit clear (checked with
get_range_pos_neg), then it expands it as both signed and unsigned division or
modulo and checks if one of them isn't cheaper.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-12-29  9:41 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-29  3:00 [Bug tree-optimization/103855] New: Missed optimization: 64bit division used instead of 32bit division zhaoweiliew at gmail dot com
2021-12-29  3:16 ` [Bug tree-optimization/103855] " zhaoweiliew at gmail dot com
2021-12-29  3:34 ` pinskia at gcc dot gnu.org
2021-12-29  3:46 ` pinskia at gcc dot gnu.org
2021-12-29  3:53 ` pinskia at gcc dot gnu.org
2021-12-29  8:21 ` zhaoweiliew at gmail dot com
2021-12-29  8:32 ` zhaoweiliew at gmail dot com
2021-12-29  9:41 ` jakub at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).