[Bug tree-optimization/65335] New: Potential optimization issue with 'tree-loop-vectorize'


* [Bug tree-optimization/65335] New: Potential optimization issue with 'tree-loop-vectorize'
@ 2015-03-06 16:16 anwilli5 at ncsu dot edu
  2015-03-09 12:06 ` [Bug tree-optimization/65335] " rguenth at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: anwilli5 at ncsu dot edu @ 2015-03-06 16:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65335

            Bug ID: 65335
           Summary: Potential optimization issue with
                    'tree-loop-vectorize'
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: anwilli5 at ncsu dot edu

When I enable the tree-loop-vectorize optimization I'm seeing some behavior
that I don't understand...

Here is a minimized test case that highlights the scenario:


typedef long unsigned int size_t;
extern void *malloc (size_t __size);

int main(){

    unsigned int a = 2;
    unsigned int *buffer = malloc(10000 * sizeof(*buffer));

    for (int i = 0; i < 10000; i++){
        if ((i % 1000) == 0){
            a = a * a * a * a * a;
        }
        buffer[i] = a;
    }

    return buffer[999];
}


When compiled with the option disabled (xgcc -save-temps -m32 -Wall -Wextra
-std=c99 -O3 -fno-tree-loop-vectorize -S -masm=intel test.c) the following code
is produced:

    mov    ebx, 2           ; a = 2
    mov    edi, 274877907
    sub    esp, 20
    push    40000
    call    malloc
    add    esp, 16
    mov    esi, eax         ; buffer = malloc(...)
    xor    ecx, ecx         ; i = 0
    .p2align 4,,10
    .p2align 3
.L3:
    mov    eax, ecx
    imul    edi
    mov    eax, ecx
    sar    eax, 31
    sar    edx, 6
    sub    edx, eax
    imul    edx, edx, 1000 
    cmp    ecx, edx
    jne    .L2              ; if ((i % 1000) == 0) {
    mov    eax, ebx
    imul    eax, ebx
    imul    eax, eax
    imul    ebx, eax         ; a = a * a * a * a * a; }
.L2:
    mov    DWORD PTR [esi+ecx*4], ebx ; buffer[i] = a
    add    ecx, 1           ; i++
    cmp    ecx, 10000
    jne    .L3              ; continue if i < 10000
        ...
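
The constant 274877907 in the listing above is ceil(2^38 / 1000): the compiler
replaces the division inside `i % 1000` with a multiply by a reciprocal and a
shift. A minimal C sketch of the same arithmetic follows; the helper name
`is_multiple_of_1000` is made up for illustration, and the code assumes that
right-shifting a negative signed value is an arithmetic shift, as it is with GCC.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the instruction sequence in the listing.
 * Assumes >> on negative signed values is an arithmetic shift (true for GCC). */
static int is_multiple_of_1000(int32_t i)
{
    /* imul edi: signed 32x32 -> 64-bit product, high half lands in edx */
    int64_t prod = (int64_t)i * 274877907;   /* 274877907 == ceil(2^38 / 1000) */
    int32_t hi = (int32_t)(prod >> 32);      /* edx */

    /* sar edx, 6 ; sar eax, 31 ; sub edx, eax  ->  q = i / 1000 */
    int32_t q = (hi >> 6) - (i >> 31);

    /* imul edx, edx, 1000 ; cmp ecx, edx  ->  i == q * 1000 */
    return i == q * 1000;
}
```

Over the loop's range of 0 to 9999 this agrees with `(i % 1000) == 0`.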

When the tree-loop-vectorize option is enabled (xgcc -save-temps -m32 -Wall
-Wextra -std=c99 -O3 -ftree-loop-vectorize test.c -S -masm=intel), though, the
following code is generated:

    mov    esi, 2           ; a = 2
    sub    esp, 20
    push    40000
    call    malloc
    add    esp, 16
    mov    edi, eax         ; buffer = malloc(...)
    xor    ecx, ecx         ; i = 0
    .p2align 4,,10
    .p2align 3
.L2:
    mov    ebx, esi
    mov    eax, 274877907
    imul    ecx
    mov    eax, ecx
    imul    ebx, esi
    sar    eax, 31
    sar    edx, 6
    imul    ebx, ebx
    sub    edx, eax
    imul    edx, edx, 1000
    imul    ebx, esi          ; a = a * a * a * a * a;
    cmp    ecx, edx
    cmove    esi, ebx          ; move new value if ((i % 1000) == 0)
    mov    DWORD PTR [edi+ecx*4], esi ; buffer[i] = a
    add    ecx, 1           ; i++
    cmp    ecx, 10000
    jne    .L2              ; continue if i < 10000


The main difference here is that the 'a * a * a * a * a' calculation is done
every loop iteration instead of every 1000th, but a is only assigned the new
value every 1000th time via the conditional move instruction.  It seems
inefficient to do this, and from basic testing the code compiled without the
tree-loop-vectorize optimization seems to run faster on my machine.
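
At the source level, the difference between the two listings can be sketched
roughly as below. The function names are made up for illustration, and the
second function is only an approximation of what the compiler does internally,
not its actual intermediate form.

```c
/* Branchy form, as in the test case: the multiplies run only when
 * the condition holds. */
static void fill_branchy(unsigned int *buffer, int n)
{
    unsigned int a = 2;
    for (int i = 0; i < n; i++) {
        if ((i % 1000) == 0)
            a = a * a * a * a * a;
        buffer[i] = a;
    }
}

/* Branch-free form: the multiplies run on every iteration and a
 * select keeps or discards the result (the cmove in the listing). */
static void fill_branch_free(unsigned int *buffer, int n)
{
    unsigned int a = 2;
    for (int i = 0; i < n; i++) {
        unsigned int t = a * a * a * a * a;   /* computed every iteration */
        a = ((i % 1000) == 0) ? t : a;        /* taken every 1000th iteration */
        buffer[i] = a;
    }
}
```

Both functions store the same values; only the amount of work per iteration
differs.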

The "real" code that I derived this from has it worse - it uses 64-bit data
types in a 32-bit binary, so there are several multiply instructions for each
logical multiplication in the code, the stack gets used for storing some of the
intermediate values, and after computing everything into some registers it just
replaces those values with ones stored on the stack from previous iterations in
the case where the modulus condition is not met.  :(
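
To illustrate why each 64-bit multiplication costs several multiply
instructions in a 32-bit binary, here is a sketch of the usual decomposition
into 32-bit pieces. It is not taken from the "real" code or from GCC's output,
and the helper name `mul64_via_32` is made up.

```c
#include <stdint.h>

/* Low 64 bits of x * y using only 32-bit operands: one widening
 * multiply plus two ordinary 32-bit multiplies for the cross terms. */
static uint64_t mul64_via_32(uint64_t x, uint64_t y)
{
    uint32_t xl = (uint32_t)x, xh = (uint32_t)(x >> 32);
    uint32_t yl = (uint32_t)y, yh = (uint32_t)(y >> 32);

    uint64_t low = (uint64_t)xl * yl;      /* widening 32x32 -> 64 */
    uint32_t cross = xl * yh + xh * yl;    /* only the low 32 bits matter */

    /* xh * yh would be shifted left by 64 bits, so it drops out. */
    return low + ((uint64_t)cross << 32);
}
```

With five factors in `a * a * a * a * a`, that is four logical multiplications
and so roughly three times as many multiply instructions per evaluation.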

GCC version:

Using built-in specs.
COLLECT_GCC=xgcc
Target: x86_64-unknown-linux-gnu
Configured with: ./configure
Thread model: posix
gcc version 4.9.2 (GCC)

I've also reproduced the issue on gcc 4.9.2 20141224 (prerelease) from an Arch
Linux distro and gcc 4.5.2-8ubuntu4 from an Ubuntu distro.

I'm happy to provide any other information needed.  Thanks!



* [Bug tree-optimization/65335] Potential optimization issue with 'tree-loop-vectorize'
  2015-03-06 16:16 [Bug tree-optimization/65335] New: Potential optimization issue with 'tree-loop-vectorize' anwilli5 at ncsu dot edu
@ 2015-03-09 12:06 ` rguenth at gcc dot gnu.org
  2021-08-21 18:59 ` pinskia at gcc dot gnu.org
  2021-08-23  6:44 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-03-09 12:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65335

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-03-09
                 CC|                            |law at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
This is if-conversion at work making the code vectorizable.  You can disable it
with -fno-tree-loop-if-convert.

It's not easy to assess profitability here (well, locally it should use
bb frequencies), as always computing a*a*a*a*a could be a win if vectorizing
the rest of the loop compensates for the cost.

In this case the optimal thing to do is to turn the loop into a loop nest.
Not sure what you'd call this kind of loop transform - something like
unswitching on IV-based conditions (it's not exactly splitting in this case).
We want

    for (int j = 0; j < 10000/1000; j++){
       /* i % 1000 == 0 holds at the start of each block of 1000 */
       a = a * a * a * a * a;
       for (int i = j*1000; i < (j+1)*1000; ++i)
         buffer[i] = a;
    }

so we can vectorize the inner loop and don't need to evaluate the conditional
there.
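
A sketch for checking that a loop nest of this shape stores the same values as
the loop in the original test case. The multiply is placed at the start of each
block of 1000 elements, which is where the original loop performs it. The
function names are made up, and this is a hand-written transform, not something
GCC is claimed to produce.

```c
/* The loop from the test case, kept as the reference. */
static void fill_reference(unsigned int *buffer)
{
    unsigned int a = 2;
    for (int i = 0; i < 10000; i++) {
        if ((i % 1000) == 0)
            a = a * a * a * a * a;
        buffer[i] = a;
    }
}

/* Loop-nest form: the condition disappears from the inner loop,
 * leaving a plain fill that is straightforward to vectorize. */
static void fill_loop_nest(unsigned int *buffer)
{
    unsigned int a = 2;
    for (int j = 0; j < 10000 / 1000; j++) {
        a = a * a * a * a * a;                /* once per block */
        for (int i = j * 1000; i < (j + 1) * 1000; i++)
            buffer[i] = a;
    }
}
```

This form only works directly because the trip count (10000) is a multiple of
the period (1000); otherwise a remainder loop would be needed.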



* [Bug tree-optimization/65335] Potential optimization issue with 'tree-loop-vectorize'
  2015-03-06 16:16 [Bug tree-optimization/65335] New: Potential optimization issue with 'tree-loop-vectorize' anwilli5 at ncsu dot edu
  2015-03-09 12:06 ` [Bug tree-optimization/65335] " rguenth at gcc dot gnu.org
@ 2021-08-21 18:59 ` pinskia at gcc dot gnu.org
  2021-08-23  6:44 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-21 18:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65335

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For GCC 7+, -ftree-vectorize and -fno-tree-vectorize no longer produce
different code for this test case, and there is no cmov in the output any more.

Note that ICX/LLVM converts this to cmov while ICC and MSVC do not.


* [Bug tree-optimization/65335] Potential optimization issue with 'tree-loop-vectorize'
  2015-03-06 16:16 [Bug tree-optimization/65335] New: Potential optimization issue with 'tree-loop-vectorize' anwilli5 at ncsu dot edu
  2015-03-09 12:06 ` [Bug tree-optimization/65335] " rguenth at gcc dot gnu.org
  2021-08-21 18:59 ` pinskia at gcc dot gnu.org
@ 2021-08-23  6:44 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-23  6:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65335

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |7.0

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
This has been fixed in GCC 7+ which now performs if-conversion only when the
result is vectorized.
