[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
@ 2015-04-09  6:22 ` evan@coeus-group.com
  2015-04-09  6:35 ` trippels at gcc dot gnu.org
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: evan@coeus-group.com @ 2015-04-09  6:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #1 from Evan Nemerson <evan@coeus-group.com> ---
Created attachment 35267
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35267&action=edit
preprocessed test case


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
@ 2015-04-09  6:22 evan@coeus-group.com
  2015-04-09  6:22 ` [Bug tree-optimization/65709] " evan@coeus-group.com
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: evan@coeus-group.com @ 2015-04-09  6:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

            Bug ID: 65709
           Summary: [5 Regression] Bad code for LZ4 decompression with -O3
                    on x86_64
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: evan@coeus-group.com

Created attachment 35266
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35266&action=edit
Test data

With GCC 5 at -O3 on x86_64 some inputs to LZ4 will generate a segfault when
decompressing.  This doesn't happen with GCC 4.9, or -O2.  It does happen with
-O2 -ftree-loop-vectorize -fvect-cost-model=dynamic.

I have tested gcc-5.0.0-0.21.fc22.x86_64 from Fedora 22, as well as SVN
revision 221940.

To trigger the segfault pass the attached input (sum.lz4) as the only command
line argument to the preprocessed test case.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
  2015-04-09  6:22 ` [Bug tree-optimization/65709] " evan@coeus-group.com
@ 2015-04-09  6:35 ` trippels at gcc dot gnu.org
  2015-04-09  6:41 ` trippels at gcc dot gnu.org
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-04-09  6:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Markus Trippelsdorf <trippels at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |trippels at gcc dot gnu.org

--- Comment #2 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
markus@x4 tmp % gcc -fsanitize=undefined -O3 test.i
markus@x4 tmp % ./a.out sum.lz4
test.c:285:29: runtime error: load of misaligned address 0x7efefb79d001 for
type 'const void', which requires 8 byte alignment
0x7efefb79d001: note: pointer points here
 00 00 00  85 7f 45 4c 46 01 02 01  00 01 00 f1 04 02 00 02  00 00 00 01 00 01
0b 98  00 00 00 34 00
              ^ 
test.c:200:16: runtime error: load of misaligned address 0x7efefb79d009 for
type 'const void', which requires 2 byte alignment
0x7efefb79d009: note: pointer points here
 01 02 01  00 01 00 f1 04 02 00 02  00 00 00 01 00 01 0b 98  00 00 00 34 00 00
91 50  1c 00 f1 00 34
              ^ 
test.c:285:27: runtime error: store to misaligned address 0x000001205c31 for
type 'void', which requires 8 byte alignment
0x000001205c31: note: pointer points here
 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00
00 00  00 00 00 00 00
              ^ 
test.c:285:27: runtime error: store to misaligned address 0x000001205c44 for
type 'void', which requires 8 byte alignment
0x000001205c44: note: pointer points here
  00 00 91 50 1c 00 f1 00  34 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00
00 00 00 00 00 00 00
              ^ 
test.c:285:27: runtime error: store to misaligned address 0x000001205c4c for
type 'void', which requires 8 byte alignment
0x000001205c4c: note: pointer points here
  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00
00 00 00 00 00 00 00
              ^ 
test.c:285:29: runtime error: load of misaligned address 0x000001205c3c for
type 'const void', which requires 8 byte alignment
0x000001205c3c: note: pointer points here
  00 01 0b 98 00 00 00 34  00 00 91 50 00 00 00 00  00 34 00 20 00 05 00 28  00
1a 00 17 00 00 00 06
              ^ 
test.c:285:29: runtime error: load of misaligned address 0x000001205c44 for
type 'const void', which requires 8 byte alignment
0x000001205c44: note: pointer points here
  00 00 91 50 00 00 00 00  00 34 00 20 00 05 00 28  00 1a 00 17 00 00 00 06  00
00 00 34 00 00 91 50
              ^ 
test.c:273:25: runtime error: load of misaligned address 0x000001205c62 for
type 'const void', which requires 4 byte alignment
0x000001205c62: note: pointer points here
 00 34  00 00 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00
00  00 00 00 00 00 00
              ^ 
test.c:273:23: runtime error: store to misaligned address 0x000001205c66 for
type 'void', which requires 4 byte alignment
0x000001205c66: note: pointer points here
 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00 00  00 00 00
00 00 00 00 00  00 00
             ^ 
test.c:285:27: runtime error: store to misaligned address 0x00000120f177 for
type 'void', which requires 8 byte alignment
0x00000120f177: note: pointer points here
 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  81 6e 01 00
00 00 00 00  00 00 00
             ^


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
  2015-04-09  6:22 ` [Bug tree-optimization/65709] " evan@coeus-group.com
  2015-04-09  6:35 ` trippels at gcc dot gnu.org
@ 2015-04-09  6:41 ` trippels at gcc dot gnu.org
  2015-04-09  7:29 ` trippels at gcc dot gnu.org
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-04-09  6:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #3 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
BTW clang's output is more informative:

test.c:285:29: runtime error: load of misaligned address 0x7fe669eee001 for
type 'U64' (aka 'unsigned long'), which requires 8 byte alignment
0x7fe669eee001: note: pointer points here
<memory cannot be printed>
test.c:200:16: runtime error: load of misaligned address 0x7fe669eee009 for
type 'U16' (aka 'unsigned short'), which requires 2 byte alignment
0x7fe669eee009: note: pointer points here
 01 02 01  00 01 00 f1 04 02 00 02  00 00 00 01 00 01 0b 98  00 00 00 34 00 00
91 50  1c 00 f1 00 34
              ^ 
test.c:285:13: runtime error: store to misaligned address 0x000002020021 for
type 'U64' (aka 'unsigned long'), which requires 8 byte alignment
0x000002020021: note: pointer points here
 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00
00 00  00 00 00 00 00
              ^ 
test.c:273:25: runtime error: load of misaligned address 0x000002020052 for
type 'U32' (aka 'unsigned int'), which requires 4 byte alignment
0x000002020052: note: pointer points here
 00 34  00 00 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00
00  00 00 00 00 00 00
              ^ 
test.c:273:9: runtime error: store to misaligned address 0x000002020056 for
type 'U32' (aka 'unsigned int'), which requires 4 byte alignment
0x000002020056: note: pointer points here
 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00 00  00 00 00
00 00 00 00 00  00 00
             ^ 

e.g. "type 'const void'" vs. "type 'U32' (aka 'unsigned int')"


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (2 preceding siblings ...)
  2015-04-09  6:41 ` trippels at gcc dot gnu.org
@ 2015-04-09  7:29 ` trippels at gcc dot gnu.org
  2015-04-09  8:02 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-04-09  7:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #4 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
If you change:

4309 static void LZ4_copy8(void* dstPtr, const void* srcPtr)                    
4310 {                                                                          
4311                                                                            
4312     if (1)                                                                 
4313     {                                                                      
4314         if (LZ4_64bits())                                                  
4315             *(U64*)dstPtr = *(U64*)srcPtr;                                 
4316         else                                                               
4317             ((U32*)dstPtr)[0] = ((U32*)srcPtr)[0],                         
4318             ((U32*)dstPtr)[1] = ((U32*)srcPtr)[1];                         
4319         return;                                                            
4320     }                                                                      
4321                                                                            
4322     memcpy(dstPtr, srcPtr, 8);                                             
4323 } 

to if (0) it will work.

And looking at the lz4 git repository I see:

 269 static void LZ4_copy8(void* dstPtr, const void* srcPtr)                    
 270 {                                                                          
 271 #if GCC_VERSION!=409  /* disabled on GCC 4.9, as it generates invalid
opcode (crash) */                                                               
 272     if (LZ4_UNALIGNED_ACCESS)                                              
 273     {                                                                      
 274         if (LZ4_64bits())                                                  
 275             *(U64*)dstPtr = *(U64*)srcPtr;                                 
 276         else                                                               
 277             ((U32*)dstPtr)[0] = ((U32*)srcPtr)[0],                         
 278             ((U32*)dstPtr)[1] = ((U32*)srcPtr)[1];                         
 279         return;                                                            
 280     }                                                                      
 281 #endif                                                                     
 282     memcpy(dstPtr, srcPtr, 8);                                             
 283 }


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (3 preceding siblings ...)
  2015-04-09  7:29 ` trippels at gcc dot gnu.org
@ 2015-04-09  8:02 ` rguenth at gcc dot gnu.org
  2015-04-09  8:27 ` trippels at gcc dot gnu.org
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-04-09  8:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
   Target Milestone|---                         |5.0

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
*(U64*)dstPtr = *(U64*)srcPtr; 

makes GCC assume that dstPtr and srcPtr are suitably aligned for U64, if they
are not then you invoke undefined behavior.  As x86 doesn't trap on unaligned
accesses unless they are from vectorized code this shows up only when the
vectorizer exploits that alignment info.

Thus I'd say this bug is invalid.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (4 preceding siblings ...)
  2015-04-09  8:02 ` rguenth at gcc dot gnu.org
@ 2015-04-09  8:27 ` trippels at gcc dot gnu.org
  2015-04-09  8:29 ` jakub at gcc dot gnu.org
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-04-09  8:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Markus Trippelsdorf <trippels at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID

--- Comment #6 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
> *(U64*)dstPtr = *(U64*)srcPtr; 
> 
> makes GCC assume that dstPtr and srcPtr are suitably aligned for U64, if they
> are not then you invoke undefined behavior.  As x86 doesn't trap on
> unaligned accesses unless they are from vectorized code this shows up only
> when the
> vectorizer exploits that alignment info.
> 
> Thus I'd say this bug is invalid.

Agreed, closing.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (5 preceding siblings ...)
  2015-04-09  8:27 ` trippels at gcc dot gnu.org
@ 2015-04-09  8:29 ` jakub at gcc dot gnu.org
  2015-04-09 21:04 ` yann.collet.73 at gmail dot com
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-04-09  8:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 35271
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35271&action=edit
gcc5-pr65709.patch

For the bad wording in the -fsanitize=alignment diagnostics, here is a fix. 
The type of MEM_REF's argument isn't relevant, it can be anything, what really
matters is the type of the MEM_REF, which is the access type.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (6 preceding siblings ...)
  2015-04-09  8:29 ` jakub at gcc dot gnu.org
@ 2015-04-09 21:04 ` yann.collet.73 at gmail dot com
  2015-04-09 21:30 ` jakub at gcc dot gnu.org
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-04-09 21:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Yann Collet <yann.collet.73 at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yann.collet.73 at gmail dot com

--- Comment #9 from Yann Collet <yann.collet.73 at gmail dot com> ---
While the issue can be easily fixed from an LZ4 perspective, 
the main topic here is to analyze a GCC 4.9+ vectorizer choice.

The piece of code that it tried to optimize can be summarized as follows (once
removed all the garbage) :

static void LZ4_copy8(void* dstPtr, const void* srcPtr)
{
   *(U64*)dstPtr = *(U64*)srcPtr;
}

Pretty simple.
Let's assume for the rest of the post that both pointers are correctly aligned,
so it's not a problem anymore.

Looking at the assembler generated, we see that GCC generates a MOVDQA
instruction for it.
> movdqa (%rdi,%rax,1),%xmm0
> $rdi=0x7fffea4b53e6
> $rax=0x0

This seems wrong on 2 levels :

- The function only wants to copy 8 bytes. MOVDQA works on a full SSE register,
which is 16 bytes. This spell troubles, if only for buffer boundaries checks :
the algorithm uses 8 bytes because it knows it can safely read/write that size
without crossing buffer limits. With 16 bytes, no such guarantee.

- MOVDQA requires both positions to be aligned.
I read it as being SSE size aligned, which means 16-bytes aligned.
But they are not, these pointers are supposed to be 8-bytes aligned only.

(A bit off topic, but from a general perspective, I don't understand the use of
MOVDQA, which requires such a strong alignment condition, while there is also
MOVDQU available, which works fine at any memory address, while suffering no
performance penalty on aligned memory addresses. MOVDQU looks like a better
choice in every circumstances.)

Anyway, the core of the issue is rather above :
this is just an 8-bytes copy operation, replacing by a 16-bytes one looks
suspicious. Maybe it would deserve a look.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (7 preceding siblings ...)
  2015-04-09 21:04 ` yann.collet.73 at gmail dot com
@ 2015-04-09 21:30 ` jakub at gcc dot gnu.org
  2015-04-10 12:02 ` yann.collet.73 at gmail dot com
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-04-09 21:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Yann Collet from comment #9)
> Looking at the assembler generated, we see that GCC generates a MOVDQA
> instruction for it.
> > movdqa (%rdi,%rax,1),%xmm0
> > $rdi=0x7fffea4b53e6
> > $rax=0x0
> 
> This seems wrong on 2 levels :
> 
> - The function only wants to copy 8 bytes. MOVDQA works on a full SSE
> register, which is 16 bytes. This spell troubles, if only for buffer
> boundaries checks : the algorithm uses 8 bytes because it knows it can
> safely read/write that size without crossing buffer limits. With 16 bytes,
> no such guarantee.

The function has been inlined into the callers, like:
      do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);
and this loop is then vectorized.  The vectorization prologue of course has to
adjust if s is not 16 byte aligned, but it can assume that both s and d are 8
byte aligned (otherwise it is undefined behavior).  So, if they aren't 8 byte
aligned, you could get crashes etc.  The load is then performed as aligned,
because the vectorization prologue ensured it is aligned (unless the program
has undefined behavior), while the stores as done using movups because it is
possible the pointers aren't both aligned the same.

> - MOVDQA requires both positions to be aligned.
> I read it as being SSE size aligned, which means 16-bytes aligned.
> But they are not, these pointers are supposed to be 8-bytes aligned only.
> 
> (A bit off topic, but from a general perspective, I don't understand the use
> of MOVDQA, which requires such a strong alignment condition, while there is
> also MOVDQU available, which works fine at any memory address, while
> suffering no performance penalty on aligned memory addresses. MOVDQU looks
> like a better choice in every circumstances.)

On most CPUs there is a significant performance difference between the two,
even if you use MOVDQU with aligned addresses.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (8 preceding siblings ...)
  2015-04-09 21:30 ` jakub at gcc dot gnu.org
@ 2015-04-10 12:02 ` yann.collet.73 at gmail dot com
  2015-04-10 12:05 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-04-10 12:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #11 from Yann Collet <yann.collet.73 at gmail dot com> ---
Does that answer also takes into consideration "buffer boundaries check" ?

The algorithm is supposed to pay attention to this property when moving 8-bytes
at a time. For 16-bytes at a time, it seems this core assumption is broken, and
can result in buffer overflow.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (9 preceding siblings ...)
  2015-04-10 12:02 ` yann.collet.73 at gmail dot com
@ 2015-04-10 12:05 ` jakub at gcc dot gnu.org
  2015-04-10 14:06 ` yann.collet.73 at gmail dot com
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-04-10 12:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Of course, if a vectorized loop doesn't have compile time known number of
iterations divisible by vectorization factor, or if some early iterations are
performed as scalar to align some pointer, the compiler emits an epilogue
(loop, possibly unrolled) that performs the last few iterations if needed.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (10 preceding siblings ...)
  2015-04-10 12:05 ` jakub at gcc dot gnu.org
@ 2015-04-10 14:06 ` yann.collet.73 at gmail dot com
  2015-07-13 11:02 ` noloader at gmail dot com
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-04-10 14:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #13 from Yann Collet <yann.collet.73 at gmail dot com> ---
Clear enough. Thanks for feedback.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (11 preceding siblings ...)
  2015-04-10 14:06 ` yann.collet.73 at gmail dot com
@ 2015-07-13 11:02 ` noloader at gmail dot com
  2015-07-13 11:38 ` noloader at gmail dot com
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: noloader at gmail dot com @ 2015-07-13 11:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Jeffrey Walton <noloader at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |noloader at gmail dot com

--- Comment #14 from Jeffrey Walton <noloader at gmail dot com> ---
(In reply to Jakub Jelinek from comment #10)
> (In reply to Yann Collet from comment #9)
> > Looking at the assembler generated, we see that GCC generates a MOVDQA
> > instruction for it.
> > > movdqa (%rdi,%rax,1),%xmm0
> > > $rdi=0x7fffea4b53e6
> > > $rax=0x0
> > 
> > This seems wrong on 2 levels :
> > 
> > - The function only wants to copy 8 bytes. MOVDQA works on a full SSE
> > register, which is 16 bytes. This spell troubles, if only for buffer
> > boundaries checks : the algorithm uses 8 bytes because it knows it can
> > safely read/write that size without crossing buffer limits. With 16 bytes,
> > no such guarantee.
> 
> The function has been inlined into the callers, like:
>       do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);
> and this loop is then vectorized.  The vectorization prologue of course has
> to adjust if s is not 16 byte aligned, but it can assume that both s and d
> are 8 byte aligned (otherwise it is undefined behavior)...
Forgive my barging in Jakub. I was referred to this issue and comment from
another issue.

Its not clear to me where the leap is made that its OK use vmovdqa. Are you
stating (unequivocally, for folks like me) that is does *not* matter what the
alignment is in the C-sources; and that the prologue ensures both 's' and 's'
are eventually 16-byte aligned when vmovdqa is invoked. That is, when we see
vmovdqa used, we know the alignment is correct (and at 16-bytes).

Sorry to have to ask.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (12 preceding siblings ...)
  2015-07-13 11:02 ` noloader at gmail dot com
@ 2015-07-13 11:38 ` noloader at gmail dot com
  2015-07-13 12:06 ` jakub at gcc dot gnu.org
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: noloader at gmail dot com @ 2015-07-13 11:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #16 from Jeffrey Walton <noloader at gmail dot com> ---
(In reply to Jakub Jelinek from comment #15)
> I'm saying that if the program does not trigger undefined behavior (e.g. by
> accessing misaligned integers without telling the compiler about it (by
> using memcpy, or packed attribute or pragma), then it would be a compiler
> bug to use an instruction requiring higher alignment than guaranteed in the
> source, without ensuring such alignment (through realigning arrays,
> introducing a loop for aligning pointers before the vectorized loop, peeling
> a few iterations needed to align the pointer(s), or using instructions that
> don't require such high alignment).
> No testcase has been provided here without having undefined behavior in them
> that would show a bug on the compiler side.

OK, so you'll have to forgive my ignorance again.

So you are saying that it may be a bug to use vmovdqa if the source and/or
destination are not 16-byte aligned; but all the user code you have seen has
undefined behavior so you're not going to answer. Is that correct?

(My apologies if its too sharp a point. I'm just trying to figure out what the
preconditions are for vmovdqa, and if it should be used when source or
destination is 8-byte aligned).


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (13 preceding siblings ...)
  2015-07-13 11:38 ` noloader at gmail dot com
@ 2015-07-13 12:06 ` jakub at gcc dot gnu.org
  2015-08-12 18:24 ` yann.collet.73 at gmail dot com
  2015-08-12 18:37 ` noloader at gmail dot com
  16 siblings, 0 replies; 18+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-07-13 12:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #17 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Jeffrey Walton from comment #16)
> OK, so you'll have to forgive my ignorance again.
> 
> So you are saying that it may be a bug to use vmovdqa if the source and/or
> destination are not 16-byte aligned; but all the user code you have seen has
> undefined behavior so you're not going to answer. Is that correct?
> 
> (My apologies if its too sharp a point. I'm just trying to figure out what
> the preconditions are for vmovdqa, and if it should be used when source or
> destination is 8-byte aligned).

I'm saying we as the compiler writers know what we are doing, and the various
cases like using unaligned accesses or peeling for alignment or versioning for
alignment, or realigning arrays are handled in the compiler.
They do assume that the source is valid and does not trigger undefined
behavior.
If you e.g. compile on x86_64 with -O3 -mavx2
void
foo (int *a, int *b)
{
  int i;
  for (i = 0; i < 64; i++)
    a[i] = 2 * b[i];
}
you'll see that compiler decided to peel for alignment of b pointer and you can
see an (unrolled) scalar loop first that handles first few iterations to align
b if it is not already aligned, and then the main vector loop uses
vmovdqa for loads and vmovups for stores (because the a pointer modulo 32 might
not be the same as b pointer modulo 32).  If you compile with -O2
-ftree-vectorize -mavx2, you'll see that peeling for alignment isn't performed,
as it enlarges the code, and vmovdqu is used for the loads instead.
The peeling for alignment assumes that there is no undefined behavior
originally, so if you call this with (uintptr) b % sizeof (int) != 0, it will
not work properly, but that is a bug in the code, not in the compiler.
So, if you have some testcase where there is no undefined behavior triggered
(try various sanitizers, inspect the code yourself, read the C standard), and
you are convinced the compiler introduces a bug where there isn't originally
(i.e. miscompiles the code), feel free to file a new bugreport.
Nothing like that has been presented in this PR.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (14 preceding siblings ...)
  2015-07-13 12:06 ` jakub at gcc dot gnu.org
@ 2015-08-12 18:24 ` yann.collet.73 at gmail dot com
  2015-08-12 18:37 ` noloader at gmail dot com
  16 siblings, 0 replies; 18+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-08-12 18:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #18 from Yann Collet <yann.collet.73 at gmail dot com> ---
This issue makes me wonder : how to efficiently access unaligned memory ?

The case in point is ARM cpu.
They don't support SSE/AVX, so they seem unaffected by this specific issue,
but this issue force writing the source code in a certain way, to remain
compatible with vectorizer assumtion.
Therefore, for portable code, it becomes an issue :
how to write a code which is both portable and efficient on both targets ?

Since apparently, writing : u32 = *(U32*)ptr;
is forbidden if ptr is not guaranteed to be aligned on 4-bytes boundaries
as the compiler will then be authorized to assume ptr is properly aligned,
how to efficiently load 4-bytes from memory at unaligned position ?

I know 3 ways :

1) byte by byte : secure, but slow ==> not efficient

2) using memcpy : memcpy(&u32, ptr, sizeof(u32));
It works. It's safe, and on x86/x64 it's correctly translated into a single mov
instruction, so it's also fast.
Alas, on ARM target, this get translated into much more complex /cautious
sequence, depending on optimization settings.
This is not a small difference :
at -O3 settings, we get a x2 performance difference.
at -O2 settings, it becomes x5 (unaligned code is slower).

3) using __packed instruction : Basically, feature the same benefits and
problems than memcpy() method above

The problem is therefore for newer ARM CPU, which efficiently support unaligned
memory.
Accessing this performance is not possible using memcpy() nor __packed.
And it seems the only way to get it is to do : u32 = *(U32*)ptr;

The difference in performance is really huge, in fact it totally changes the
application, so it can't be ignored.

The question is :
Is there a way to access this performance without violating the principle which
has been stated into this thread, 
that is : it's not authorized to write : u32 = *(U32*)ptr; if ptr is not
guaranteed to be properly aligned on 4-bytes boundaries.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
  2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
                   ` (15 preceding siblings ...)
  2015-08-12 18:24 ` yann.collet.73 at gmail dot com
@ 2015-08-12 18:37 ` noloader at gmail dot com
  16 siblings, 0 replies; 18+ messages in thread
From: noloader at gmail dot com @ 2015-08-12 18:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #19 from Jeffrey Walton <noloader at gmail dot com> ---
(In reply to Yann Collet from comment #18)
> This issue makes me wonder : how to efficiently access unaligned memory ?
> 
> 
> The case in point is ARM cpu.
> They don't support SSE/AVX, so they seem unaffected by this specific issue,
> but this issue force writing the source code in a certain way, to remain
> compatible with vectorizer assumtion.

Just one comment here (sorry for speaking out of turn)....

Modern ARM has __ARM_FEATURE_UNALIGNED, which means the processor tolerates
unaligned access. However, I believe it runs afoul of the C/C++ standard and
GCC aliasing rules.

> Therefore, for portable code, it becomes an issue :
> how to write a code which is both portable and efficient on both targets ?

I've been relying on intrinsics to side step C/C++ requirements. In the absence
of intrinsics, I use inline assembler to avoid the C/C++ language rules.

Now, I could be totally wrong, but I don't feel I'm violating the C/C++
language rules until a write a C/C++ statement that performs the violation.
Hence the reason I use intrinsics or drop into assembler.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-08-12 18:37 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-09  6:22 [Bug tree-optimization/65709] New: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64 evan@coeus-group.com
2015-04-09  6:22 ` [Bug tree-optimization/65709] " evan@coeus-group.com
2015-04-09  6:35 ` trippels at gcc dot gnu.org
2015-04-09  6:41 ` trippels at gcc dot gnu.org
2015-04-09  7:29 ` trippels at gcc dot gnu.org
2015-04-09  8:02 ` rguenth at gcc dot gnu.org
2015-04-09  8:27 ` trippels at gcc dot gnu.org
2015-04-09  8:29 ` jakub at gcc dot gnu.org
2015-04-09 21:04 ` yann.collet.73 at gmail dot com
2015-04-09 21:30 ` jakub at gcc dot gnu.org
2015-04-10 12:02 ` yann.collet.73 at gmail dot com
2015-04-10 12:05 ` jakub at gcc dot gnu.org
2015-04-10 14:06 ` yann.collet.73 at gmail dot com
2015-07-13 11:02 ` noloader at gmail dot com
2015-07-13 11:38 ` noloader at gmail dot com
2015-07-13 12:06 ` jakub at gcc dot gnu.org
2015-08-12 18:24 ` yann.collet.73 at gmail dot com
2015-08-12 18:37 ` noloader at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).