public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug c/100363] New: gcc generating wider load/store than warranted at -O3
@ 2021-04-30 20:09 vgupta at synopsys dot com
  2021-04-30 21:43 ` [Bug middle-end/100363] " pinskia at gcc dot gnu.org
                   ` (20 more replies)
  0 siblings, 21 replies; 22+ messages in thread
From: vgupta at synopsys dot com @ 2021-04-30 20:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

            Bug ID: 100363
           Summary: gcc generating wider load/store than warranted at -O3
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vgupta at synopsys dot com
  Target Milestone: ---

Created attachment 50722
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50722&action=edit
test case with an additional nop to annotate codegen

In Linux kernel's initramfs gzip inflate code, an inner copy loop using
unsigned short pointers (src/dst) is generated with wider 8 or 16-byte at a
time (vs. 2 bytes at a time) causing extra/unintended bytes to be copied -
leading to corruption of inflated files on target.

The showed up on upstream v5.6 Linux kernel built for ARC (defaults to -O3).
Issue doesn't happen at -O2.

Full test case attached, but the gist of it is:

    lib/zlib_inflate/inffast.c

    if (dist > 2) {
        unsigned short *sfrom;

        sfrom = (unsigned short *)(from);
        loops = len >> 1;
        do
            *sout++ = *sfrom++;

        while (--loops);
        out = (unsigned char *)sout;
        from = (unsigned char *)sfrom;
    }
    ...

@sfrom and @sout are unsigned short pointers and thus expected to work on 2
bytes. However at -O3 gcc is generating wide loads (8-byte LDD/STD on ARCv2,
16-byte LDR q0 on aarch64.

For aarch64, it seems there's code generated for 16-byte access as well as
2-byte, and I haven't verified if it elides the 16-byte code based on size etc
- but the code is generated nonetheless. For ARC 8-byte loop is certainly
executed causing bad things as described

The issue was originally seen with mainline gcc 10.2 (again both ARC and
aarch64) at -O3 and I can confirm it exists in gcc 9.3 as well.

Attaching preprocessed source file is from ARC linux build (but builds for
aarch64 too since non of arch specific functions are used here.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug middle-end/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
@ 2021-04-30 21:43 ` pinskia at gcc dot gnu.org
  2021-04-30 21:44 ` pinskia at gcc dot gnu.org
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-04-30 21:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-30
             Status|UNCONFIRMED                 |WAITING

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The loop gets vectorized, I don't see the problem really.

Also I don't see the preprocessed source.  Can you attach that?

Is the problem that the loads have to be done in 2 bytes always from the
hardware?
If so then you need to mark the pointer as volatile.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug middle-end/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
  2021-04-30 21:43 ` [Bug middle-end/100363] " pinskia at gcc dot gnu.org
@ 2021-04-30 21:44 ` pinskia at gcc dot gnu.org
  2021-04-30 21:56 ` vgupta at synopsys dot com
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-04-30 21:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note in the tar file there is only:
inffast2.s  inffast2.s.aarch64.gcc10.O3  inffast2.s.aarch64.gcc9.O3 
inffast2.s.arc.gcc10.O3

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug middle-end/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
  2021-04-30 21:43 ` [Bug middle-end/100363] " pinskia at gcc dot gnu.org
  2021-04-30 21:44 ` pinskia at gcc dot gnu.org
@ 2021-04-30 21:56 ` vgupta at synopsys dot com
  2021-04-30 22:02 ` torvalds@linux-foundation.org
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: vgupta at synopsys dot com @ 2021-04-30 21:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #3 from Vineet Gupta <vgupta at synopsys dot com> ---
Created attachment 50723
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50723&action=edit
preprocessed source file (with extra nop annotation)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug middle-end/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (2 preceding siblings ...)
  2021-04-30 21:56 ` vgupta at synopsys dot com
@ 2021-04-30 22:02 ` torvalds@linux-foundation.org
  2021-04-30 22:19 ` [Bug tree-optimization/100363] " pinskia at gcc dot gnu.org
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: torvalds@linux-foundation.org @ 2021-04-30 22:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Linus Torvalds <torvalds@linux-foundation.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |torvalds@linux-foundation.o
                   |                            |rg

--- Comment #4 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Andrew Pinski from comment #1)
> The loop gets vectorized, I don't see the problem really.


See

   
https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/issues/372

and in particular the comment

   "In the first 8-byte copy, src and dst overlap"

so apparently gcc has decided that they can't overlap, despite the two pointers
being literally generated from the same base pointer.

But I don't real arc assembly, so I'll have to take Vineet's word for it.

Vineet, have you been able to generate a smaller test-case?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (3 preceding siblings ...)
  2021-04-30 22:02 ` torvalds@linux-foundation.org
@ 2021-04-30 22:19 ` pinskia at gcc dot gnu.org
  2021-04-30 22:20 ` pinskia at gcc dot gnu.org
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-04-30 22:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |tree-optimization
           Keywords|                            |wrong-code

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
On the trunk, on aarch64:
There should be an aliasing check 

  sfrom_289 = from_176 + 18446744073709551615;
  _871 = _843 + 18446744073709551615;
  _872 = _871 > 6;
  _873 = prephitmp_803 + 2;
  _874 = from_176 + 3;
  _875 = _873 - _874;
  _876 = (sizetype) _875;
  _877 = _876 > 12;
  _878 = _872 & _877;
  if (_878 != 0)
    goto <bb 116>; [80.00%]
  else
    goto <bb 126>; [20.00%]


_873 is the sout


In GCC 10 branch we get something similar:
  sfrom_289 = from_176 + 18446744073709551615;
  _859 = _823 + 18446744073709551615;
  _860 = _859 > 8;
  _861 = prephitmp_783 + 2;
  _862 = from_176 + 3;
  _863 = _861 - _862;
  _864 = (sizetype) _863;
  _865 = _864 > 12;
  _866 = _860 & _865;
  if (_866 != 0)
    goto <bb 116>; [80.00%]
  else
    goto <bb 126>; [20.00%]

But I Notice 8 vs 6 here.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (4 preceding siblings ...)
  2021-04-30 22:19 ` [Bug tree-optimization/100363] " pinskia at gcc dot gnu.org
@ 2021-04-30 22:20 ` pinskia at gcc dot gnu.org
  2021-04-30 22:33 ` vgupta at synopsys dot com
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-04-30 22:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |UNCONFIRMED
     Ever confirmed|1                           |0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (5 preceding siblings ...)
  2021-04-30 22:20 ` pinskia at gcc dot gnu.org
@ 2021-04-30 22:33 ` vgupta at synopsys dot com
  2021-05-01 22:54 ` amonakov at gcc dot gnu.org
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: vgupta at synopsys dot com @ 2021-04-30 22:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #6 from Vineet Gupta <vgupta at synopsys dot com> ---
(In reply to Linus Torvalds from comment #4)
> (In reply to Andrew Pinski from comment #1)
> > The loop gets vectorized, I don't see the problem really.
> 
> 
> See
> 
>    
> https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/issues/372
> 
> and in particular the comment
> 
>    "In the first 8-byte copy, src and dst overlap"
> 
> so apparently gcc has decided that they can't overlap, despite the two
> pointers being literally generated from the same base pointer.

Exactly:

> But I don't real arc assembly, so I'll have to take Vineet's word for it.

fwiw:
LDD.a [base, off] is 8-byte load with pre-incr : eff addr = base + offset
STD.ab [base, off] is 8-byte store with post-incr: eff addr = base


> Vineet, have you been able to generate a smaller test-case?

No I'm afraid not.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (6 preceding siblings ...)
  2021-04-30 22:33 ` vgupta at synopsys dot com
@ 2021-05-01 22:54 ` amonakov at gcc dot gnu.org
  2021-05-01 23:09 ` torvalds@linux-foundation.org
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: amonakov at gcc dot gnu.org @ 2021-05-01 22:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
The github issue has a more relevant code quote:

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS     <-- this is enabled for ARCv2
279:                        PUP(sout) = PUP(sfrom);
#else
                            PUP(sout) = UP_UNALIGNED(sfrom);
#endif


Most likely the issue is that sout/sfrom are misaligned at runtime, while the
vectorized code somewhere relies on them being sufficiently aligned for a
'short'.

It is unsafe to dereference a misaligned pointer. The pointed-to-type must have
reduced alignment:

typedef unsigned short u16_u __attribute__((aligned(1)));

u16_u *sout = ...

u16_u *sfrom = (void *)(from - OFF);

(without -ffreestanding, memcpy/memmove is a portable way to express a
misaligned access)

https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointer-accesses/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (7 preceding siblings ...)
  2021-05-01 22:54 ` amonakov at gcc dot gnu.org
@ 2021-05-01 23:09 ` torvalds@linux-foundation.org
  2021-05-03  7:41 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: torvalds@linux-foundation.org @ 2021-05-01 23:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #8 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Alexander Monakov from comment #7)
> 
> Most likely the issue is that sout/sfrom are misaligned at runtime, while
> the vectorized code somewhere relies on them being sufficiently aligned for
> a 'short'.

They absolutely are.

And we build the kernel with -Wno-strict-aliasing exactly to make sure the
compiler doesn't think that "oh, I can make aliasing decisions based on type
information".

Because we have those kinds of issues all over, and we know which architectures
support unaligned loads etc, and all the tricks with "memcpy()" and unions make
for entirely unreadable code.

So please fix the aliasing logic to not be type-based when people explicitly
tell you not to do that.

Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (8 preceding siblings ...)
  2021-05-01 23:09 ` torvalds@linux-foundation.org
@ 2021-05-03  7:41 ` rguenth at gcc dot gnu.org
  2021-05-03 16:03 ` torvalds@linux-foundation.org
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-05-03  7:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Linus Torvalds from comment #8)
> (In reply to Alexander Monakov from comment #7)
> > 
> > Most likely the issue is that sout/sfrom are misaligned at runtime, while
> > the vectorized code somewhere relies on them being sufficiently aligned for
> > a 'short'.
> 
> They absolutely are.
> 
> And we build the kernel with -Wno-strict-aliasing exactly to make sure the
> compiler doesn't think that "oh, I can make aliasing decisions based on type
> information".
> 
> Because we have those kinds of issues all over, and we know which
> architectures support unaligned loads etc, and all the tricks with
> "memcpy()" and unions make for entirely unreadable code.
> 
> So please fix the aliasing logic to not be type-based when people explicitly
> tell you not to do that.
> 
> Linus

Note alignment has nothing to do with strict-aliasing (-fno-strict-aliasing you
mean btw).

One thing we do is (I'm not 50% sure this explains the observed issue) assume
that if you have two accesses with type 'short' and they are aligned
according to this type then they will not partly overlap.  Note this has
nothing to do with C strict aliasing rules but is basic pointer math when
you know lower zero bits.

I suggest to try the fix suggested in comment#7 and report back if that
fixes the observed issue.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (9 preceding siblings ...)
  2021-05-03  7:41 ` rguenth at gcc dot gnu.org
@ 2021-05-03 16:03 ` torvalds@linux-foundation.org
  2021-05-03 16:18 ` torvalds@linux-foundation.org
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: torvalds@linux-foundation.org @ 2021-05-03 16:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #10 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Richard Biener from comment #9)
> 
> Note alignment has nothing to do with strict-aliasing (-fno-strict-aliasing
> you mean btw).

I obviously meant -fno-strict-aliasing, yes.

But I think it's actually essentially the same issue, just in a different
guise:

> One thing we do is (I'm not 50% sure this explains the observed issue) assume
> that if you have two accesses with type 'short' and they are aligned
> according to this type then they will not partly overlap.  Note this has
> nothing to do with C strict aliasing rules but is basic pointer math when
> you know lower zero bits.

Well, the thing is, you have two situations:

 (a) you can statically see that the two do not alias, because the offset
arithmetic is either constant or you have some range logic that can tell that
they are sufficiently far apart.

 (b) you can't.

Now, everybody is ok with the static aliasing situation in (a). If you can tell
that two addresses don't alias, your'e done, they are independent, there's no
question  about it.

But that's not the situation here. So we're in (b). And what I find personally
so annoying is that gcc has actually *done* that distance check, but apparently
intentionally done it badly based on type information.

And the reason I think this is similar to -fno-strict-aliasing is that it's
that same (b) case, and it looks like a very similar "do a bad job of doing
actual run-time alias analysis based on type information".

It seems to be literally an off-by-one error, not because it generates better
code, but because the compiler has decided to pointlessly make a bad range
comparison based on type.

But I've never worked with the gcc IR dumps, so Andrew Pinski's debug output in
#c5 doesn't actually make me go "ahh, there". Maybe it's that 8 vs 6 that he
pointed out. Did somebody notice that "offset > 8" was off-by-one, and should
have been "offset >= 8"? And then changed it to "offset > 6" which is
off-by-one in the other direction instead?

> I suggest to try the fix suggested in comment#7 and report back if that
> fixes the observed issue.

Vineet?

I still think gcc is doing the wrong thing, exactly because of that
"pointlessly using the wrong range check" issue. This particular code comes
from some old version of zlib, and I can't test because I don't have the ARC
background to make any sense of the generated code.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (10 preceding siblings ...)
  2021-05-03 16:03 ` torvalds@linux-foundation.org
@ 2021-05-03 16:18 ` torvalds@linux-foundation.org
  2021-05-03 17:25 ` vgupta at synopsys dot com
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: torvalds@linux-foundation.org @ 2021-05-03 16:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #11 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Linus Torvalds from comment #10)
> 
>       This particular code comes
> from some old version of zlib, and I can't test because I don't have the ARC
> background to make any sense of the generated code.

Heh. We upgraded to a "recent version" of zlib back in 2006: 

   "Upgrade the zlib_inflate implementation in the kernel from a patched
    version 1.1.3/4 to a patched 1.2.3"

but it turns out that the "do things a 16-bit word at a time" was a
kernel-local optimization for some very slow old PowerPC microcontroller.

The code in upstream zlib actually looks rather better (which is not saying
much, admittedly), doesn't have any 16-bit accesses, and we probably should
just try to synchronize with that instead.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (11 preceding siblings ...)
  2021-05-03 16:18 ` torvalds@linux-foundation.org
@ 2021-05-03 17:25 ` vgupta at synopsys dot com
  2021-05-03 17:28 ` vgupta at synopsys dot com
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: vgupta at synopsys dot com @ 2021-05-03 17:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #12 from Vineet Gupta <vgupta at synopsys dot com> ---
Created attachment 50742
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50742&action=edit
kernel patch as proposed on comment #7

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (12 preceding siblings ...)
  2021-05-03 17:25 ` vgupta at synopsys dot com
@ 2021-05-03 17:28 ` vgupta at synopsys dot com
  2021-05-03 17:47 ` torvalds@linux-foundation.org
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: vgupta at synopsys dot com @ 2021-05-03 17:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #13 from Vineet Gupta <vgupta at synopsys dot com> ---
Sorry the workaround proposed by Alexander doesn't seem to cure it (patch
attached), outcome is the same

        mov     lp_count,r13    ;5      #, bnd.65
        lp      @.L201  ; lp_count:@.L50->@.L201        #,
        .align 2
.L50:
# ../lib/zlib_inflate/inffast.c:288: PUP(sout) = PUP(sfrom);
  ldd.a r18,[r21,8] # MEM[base: _496, offset: 0B], MEM[base: _496, offset: 0B]

# ../lib/zlib_inflate/inffast.c:288:  PUP(sout) = PUP(sfrom);
  std.ab r18,[r22,8] # MEM[base: vectp_prephitmp.73_741, offset: 0B], MEM[base:
_496, offset: 0B]

        .align 2
.L201:
        ; ZOL_END, begins @.L50 #

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (13 preceding siblings ...)
  2021-05-03 17:28 ` vgupta at synopsys dot com
@ 2021-05-03 17:47 ` torvalds@linux-foundation.org
  2021-05-03 18:45 ` vgupta at synopsys dot com
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: torvalds@linux-foundation.org @ 2021-05-03 17:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #14 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Vineet Gupta from comment #13)
> Sorry the workaround proposed by Alexander doesn't seem to cure it (patch
> attached), outcome is the same

Vineet - it's not the ldd/std that is necessarily buggy, it's the earlier tests
of the address that guard that vectorized path. 

So your quoted parts of the code generation aren't necessarily the problematic
ones.

Did you actually test the code and check whether it has the same issue? Maybe
it changed the address limit guards before that ldd/std?

I also sent you a separate patch to test if just upgrading to a newer version
of the zlib code helps. Although that may be buggy for other reasons, it's not
like I actually tested the end result.. But it would be interesting to hear if
that one works for you (again, ldd/std might be a valid end result of trying to
vectorize that code assuming the aliasing tests are done correctly in the
vectorized loop headers).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (14 preceding siblings ...)
  2021-05-03 17:47 ` torvalds@linux-foundation.org
@ 2021-05-03 18:45 ` vgupta at synopsys dot com
  2021-05-03 19:43 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: vgupta at synopsys dot com @ 2021-05-03 18:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #15 from Vineet Gupta <vgupta at synopsys dot com> ---
(In reply to Linus Torvalds from comment #14)
> (In reply to Vineet Gupta from comment #13)
> > Sorry the workaround proposed by Alexander doesn't seem to cure it (patch
> > attached), outcome is the same
> 
> Vineet - it's not the ldd/std that is necessarily buggy, it's the earlier
> tests of the address that guard that vectorized path. 
> 
> So your quoted parts of the code generation aren't necessarily the
> problematic ones.

/me slaps myself. How can I be so stupid.

> Did you actually test the code and check whether it has the same issue?
> Maybe it changed the address limit guards before that ldd/std?

The problem is is indeed gone. I need to analyze the assembly fully how it
prevents the bad case. e.g. I'm still not comfortable seeing the loop entered
with following and it doing 8 byte ldd/std when we know it should only do 2 at
a time.

r21 = 0xbf178036  (pre-increment so 0x3e will be first src)
r22 = 0xbf1780b2
LPC = 4

80d9a360:       lp      12      ;80d9a36c <inflate_fast+0x2c0>
80d9a364:       ldd.a   r18r19,[r21,8]
80d9a368:       std.ab  r18r19,[r22,8]

> I also sent you a separate patch to test if just upgrading to a newer
> version of the zlib code helps. Although that may be buggy for other
> reasons, it's not like I actually tested the end result.. But it would be
> interesting to hear if that one works for you (again, ldd/std might be a
> valid end result of trying to vectorize that code assuming the aliasing
> tests are done correctly in the vectorized loop headers).

Thx for that. And this seems to boot as well.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (15 preceding siblings ...)
  2021-05-03 18:45 ` vgupta at synopsys dot com
@ 2021-05-03 19:43 ` pinskia at gcc dot gnu.org
  2021-05-04  6:24 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-05-03 19:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #16 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Vineet Gupta from comment #15)
> The problem is is indeed gone. I need to analyze the assembly fully how it
> prevents the bad case. e.g. I'm still not comfortable seeing the loop
> entered with following and it doing 8 byte ldd/std when we know it should
> only do 2 at a time.

Why?  It is called a "vectorization" optimization. Where we are vectorizing the
2 byte load/stores into a 4x2 vector load/stores.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (16 preceding siblings ...)
  2021-05-03 19:43 ` pinskia at gcc dot gnu.org
@ 2021-05-04  6:24 ` rguenth at gcc dot gnu.org
  2021-05-04 19:33 ` vgupta at synopsys dot com
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-05-04  6:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
Not a bug.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (17 preceding siblings ...)
  2021-05-04  6:24 ` rguenth at gcc dot gnu.org
@ 2021-05-04 19:33 ` vgupta at synopsys dot com
  2021-05-05  6:32 ` rguenther at suse dot de
  2021-05-05 19:59 ` ndesaulniers at google dot com
  20 siblings, 0 replies; 22+ messages in thread
From: vgupta at synopsys dot com @ 2021-05-04 19:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #18 from Vineet Gupta <vgupta at synopsys dot com> ---
(In reply to Richard Biener from comment #9)
> (In reply to Linus Torvalds from comment #8)
> > (In reply to Alexander Monakov from comment #7)
> > > 
> > > Most likely the issue is that sout/sfrom are misaligned at runtime, while
> > > the vectorized code somewhere relies on them being sufficiently aligned for
> > > a 'short'.
> > 
> > They absolutely are.
> > 
> > And we build the kernel with -Wno-strict-aliasing exactly to make sure the
> > compiler doesn't think that "oh, I can make aliasing decisions based on type
> > information".
> > 
> > Because we have those kinds of issues all over, and we know which
> > architectures support unaligned loads etc, and all the tricks with
> > "memcpy()" and unions make for entirely unreadable code.
> > 
> > So please fix the aliasing logic to not be type-based when people explicitly
> > tell you not to do that.
> > 
> > Linus
> 
> Note alignment has nothing to do with strict-aliasing (-fno-strict-aliasing
> you mean btw).
> 
> One thing we do is (I'm not 50% sure this explains the observed issue) assume
> that if you have two accesses with type 'short' and they are aligned
> according to this type then they will not partly overlap.  Note this has
> nothing to do with C strict aliasing rules but is basic pointer math when
> you know lower zero bits.

OK, given that source code has type short, they will assume these things are
short aligned and thus won't overlap for short accesses. But then the code
actually generated by loop vectorizer assumes they are 8 bytes apart - since
that is what it is generating.


> 
> I suggest to try the fix suggested in comment#7 and report back if that
> fixes the observed issue.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (18 preceding siblings ...)
  2021-05-04 19:33 ` vgupta at synopsys dot com
@ 2021-05-05  6:32 ` rguenther at suse dot de
  2021-05-05 19:59 ` ndesaulniers at google dot com
  20 siblings, 0 replies; 22+ messages in thread
From: rguenther at suse dot de @ 2021-05-05  6:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #19 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 4 May 2021, vgupta at synopsys dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363
> 
> --- Comment #18 from Vineet Gupta <vgupta at synopsys dot com> ---
> (In reply to Richard Biener from comment #9)
> > (In reply to Linus Torvalds from comment #8)
> > > (In reply to Alexander Monakov from comment #7)
> > > > 
> > > > Most likely the issue is that sout/sfrom are misaligned at runtime, while
> > > > the vectorized code somewhere relies on them being sufficiently aligned for
> > > > a 'short'.
> > > 
> > > They absolutely are.
> > > 
> > > And we build the kernel with -Wno-strict-aliasing exactly to make sure the
> > > compiler doesn't think that "oh, I can make aliasing decisions based on type
> > > information".
> > > 
> > > Because we have those kinds of issues all over, and we know which
> > > architectures support unaligned loads etc, and all the tricks with
> > > "memcpy()" and unions make for entirely unreadable code.
> > > 
> > > So please fix the aliasing logic to not be type-based when people explicitly
> > > tell you not to do that.
> > > 
> > > Linus
> > 
> > Note alignment has nothing to do with strict-aliasing (-fno-strict-aliasing
> > you mean btw).
> > 
> > One thing we do is (I'm not 50% sure this explains the observed issue) assume
> > that if you have two accesses with type 'short' and they are aligned
> > according to this type then they will not partly overlap.  Note this has
> > nothing to do with C strict aliasing rules but is basic pointer math when
> > you know lower zero bits.
> 
> OK, given that source code has type short, they will assume these things are
> short aligned and thus won't overlap for short accesses. But then the code
> actually generated by loop vectorizer assumes they are 8 bytes apart - since
> that is what it is generating.

That's guarded by a runtime check but this check again assumes the
accesses are aligned as short and thus will fail if not

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
  2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
                   ` (19 preceding siblings ...)
  2021-05-05  6:32 ` rguenther at suse dot de
@ 2021-05-05 19:59 ` ndesaulniers at google dot com
  20 siblings, 0 replies; 22+ messages in thread
From: ndesaulniers at google dot com @ 2021-05-05 19:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Nick Desaulniers <ndesaulniers at google dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ndesaulniers at google dot com

--- Comment #20 from Nick Desaulniers <ndesaulniers at google dot com> ---
(In reply to Alexander Monakov from comment #7)
>  Most likely the issue is that sout/sfrom are misaligned at runtime, while the > vectorized code somewhere relies on them being sufficiently aligned for a 'short'.
> It is unsafe to dereference a misaligned pointer. The pointed-to-type must
> have reduced alignment:

C 6.3.2.3p7 (N1548) says:

A pointer to an object type may be converted to a pointer to a
different object type. If the resulting pointer is not correctly
aligned) for the referenced type, the behavior is undefined.


===

We're working on adding diagnostics and UBSAN checks for these.  Perhaps with
those in place, we'd be able to spot such a case in the kernel's initramfs
decompression code.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-05-05 19:59 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-30 20:09 [Bug c/100363] New: gcc generating wider load/store than warranted at -O3 vgupta at synopsys dot com
2021-04-30 21:43 ` [Bug middle-end/100363] " pinskia at gcc dot gnu.org
2021-04-30 21:44 ` pinskia at gcc dot gnu.org
2021-04-30 21:56 ` vgupta at synopsys dot com
2021-04-30 22:02 ` torvalds@linux-foundation.org
2021-04-30 22:19 ` [Bug tree-optimization/100363] " pinskia at gcc dot gnu.org
2021-04-30 22:20 ` pinskia at gcc dot gnu.org
2021-04-30 22:33 ` vgupta at synopsys dot com
2021-05-01 22:54 ` amonakov at gcc dot gnu.org
2021-05-01 23:09 ` torvalds@linux-foundation.org
2021-05-03  7:41 ` rguenth at gcc dot gnu.org
2021-05-03 16:03 ` torvalds@linux-foundation.org
2021-05-03 16:18 ` torvalds@linux-foundation.org
2021-05-03 17:25 ` vgupta at synopsys dot com
2021-05-03 17:28 ` vgupta at synopsys dot com
2021-05-03 17:47 ` torvalds@linux-foundation.org
2021-05-03 18:45 ` vgupta at synopsys dot com
2021-05-03 19:43 ` pinskia at gcc dot gnu.org
2021-05-04  6:24 ` rguenth at gcc dot gnu.org
2021-05-04 19:33 ` vgupta at synopsys dot com
2021-05-05  6:32 ` rguenther at suse dot de
2021-05-05 19:59 ` ndesaulniers at google dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).