public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug c/105930] New: Excessive stack spill generation on 32-bit x86
@ 2022-06-11 20:51 torvalds@linux-foundation.org
  2022-06-11 23:20 ` [Bug target/105930] " torvalds@linux-foundation.org
                   ` (28 more replies)
  0 siblings, 29 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-11 20:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

            Bug ID: 105930
           Summary: Excessive stack spill generation on 32-bit x86
           Product: gcc
           Version: 12.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: torvalds@linux-foundation.org
  Target Milestone: ---

Created attachment 53121
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53121&action=edit
Test-case extracted from the generic blake2b kernel code

Gcc-12 seems to generate a huge number of stack spills on this blake2b
test-case, to the point where it overflows the allowable kernel stack on 32-bit
x86.

This crypto thing has two 128-byte buffers, so a stack frame a bit larger than
256 is expected when the dataset doesn't fit in the register set.

Just as an example, on this code, clang-.14.0.0 generates a stack frame that is
296 bytes. 

In contrast, gcc-12.1.1 generates a stack frame that is almost an order of
magnitude(!) larger, at 2620 bytes.

The trivial Makefile I used for this test-case is

   # The kernel cannot just randomly use FP/MMX/AVX
    CFLAGS := -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx
    CFLAGS += -m32
    CFLAGS += -O2

    test:
        gcc $(CFLAGS) -Wall -S blake2b.c
        grep "sub.*%[er]sp" blake2b.s

to easily test different flags and the end result, but as can be seen from
above, it really doesn't need any special flags except the ones that disable
MMX/AVX code generation.

And the generated code looks perfectly regular, except for the fact that it
uses almost 3kB of stack space.

Note that "-m32" is required to trigger this - the 64-bit case does much
better, presumably because it has more registers and this needs fewer spills.
It gets worse with some added debug flags we use in the kernel, but not that
kind of "order of magnitude" worse.

Using -O1 or -Os makes no real difference.

This is presumably due to some newly triggered optimization in gcc-12, but I
can't even begin to guess at what we'd need to disable (or enable) to avoid
this horrendous stack growth. Some very aggressive instruction scheduling thing
that spreads out all the calculations and always wants to spill-and-reload the
subepxressions that it CSE'd? I dunno. 

Pls advice. The excessive stack literally causes build failures due to us using
-Werror-frame-larger-than= to make sure stack use remains sanely bounded. The
kernel stack is a rather limited resource.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
@ 2022-06-11 23:20 ` torvalds@linux-foundation.org
  2022-06-11 23:54 ` [Bug target/105930] [12/13 Regression] " pinskia at gcc dot gnu.org
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-11 23:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #1 from Linus Torvalds <torvalds@linux-foundation.org> ---
Side note: it might be best to clarify that this is a regression specific to
gcc-12.

Gcc 11.3 doesn't have the problem, and generates code for this same test-case
with a stack frame of only 428 bytes. That's still bigger than clang, but not
"crazy bigger".

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
  2022-06-11 23:20 ` [Bug target/105930] " torvalds@linux-foundation.org
@ 2022-06-11 23:54 ` pinskia at gcc dot gnu.org
  2022-06-11 23:59 ` pinskia at gcc dot gnu.org
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-06-11 23:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |needs-bisection
   Target Milestone|---                         |12.2
            Summary|Excessive stack spill       |[12/13 Regression]
                   |generation on 32-bit x86    |Excessive stack spill
                   |                            |generation on 32-bit x86
             Target|                            |i?86-*-*

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
  2022-06-11 23:20 ` [Bug target/105930] " torvalds@linux-foundation.org
  2022-06-11 23:54 ` [Bug target/105930] [12/13 Regression] " pinskia at gcc dot gnu.org
@ 2022-06-11 23:59 ` pinskia at gcc dot gnu.org
  2022-06-12  0:20 ` torvalds@linux-foundation.org
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-06-11 23:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
thumb1 (which has 16 registers but really only 8 are GPRs) does not have this
issue in GCC 12, so I suspect a target specific change caused this.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (2 preceding siblings ...)
  2022-06-11 23:59 ` pinskia at gcc dot gnu.org
@ 2022-06-12  0:20 ` torvalds@linux-foundation.org
  2022-06-12  0:20 ` torvalds@linux-foundation.org
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-12  0:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #3 from Linus Torvalds <torvalds@linux-foundation.org> ---
Created attachment 53123
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53123&action=edit
Mindless revert that fixes things for me

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (3 preceding siblings ...)
  2022-06-12  0:20 ` torvalds@linux-foundation.org
@ 2022-06-12  0:20 ` torvalds@linux-foundation.org
  2022-06-12  0:27 ` torvalds@linux-foundation.org
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-12  0:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #4 from Linus Torvalds <torvalds@linux-foundation.org> ---
So hey, since you guys use git now, I thought I might as well just bisect this.

Now, I have no idea what the best and most efficient way is to generate only
"cc1", so my bisection run was this unholy mess of "just run configure, and
then run 'make -j128' for long enough that 'host-x86_64-pc-linux-gnu/gcc/cc1'
gets generated, and test that".

I'm  not proud of that hacky thing, but since gcc documentation is written in
sanskrit, and mere mortals can't figure it out, it's the best I could do.

And the problem bisects down to

  8ea4a34bd0b0a46277b5e077c89cbd86dfb09c48 is the first bad commit
  commit 8ea4a34bd0b0a46277b5e077c89cbd86dfb09c48
  Author: Roger Sayle <roger@nextmovesoftware.com>
  Date:   Sat Mar 5 08:50:45 2022 +0000

      PR 104732: Simplify/fix DI mode logic expansion/splitting on -m32.

so yes, this seems to be very much specific to the i386 target.

And yes, I also verified that reverting that commit on the current master
branch solves it for me.

Again: this was just a completely mindless bisection, with a "revert to verify"
on top of the current trunk, which for me happened to be commit cd02f15f1ae
("xtensa: Improve constant synthesis for both integer and floating-point").

I'm attaching the revert patch I used just so that you can see exactly what I
did. I probably shouldn't have actually removed the testsuite entry, but again:
ENTIRELY MINDLESS BISECTION RESULT.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (4 preceding siblings ...)
  2022-06-12  0:20 ` torvalds@linux-foundation.org
@ 2022-06-12  0:27 ` torvalds@linux-foundation.org
  2022-06-12  2:26 ` sneves at dei dot uc.pt
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-12  0:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #5 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Linus Torvalds from comment #4)
> 
> I'm  not proud of that hacky thing, but since gcc documentation is written
> in sanskrit, and mere mortals can't figure it out, it's the best I could do.

And bu 'sanskrit' I mean 'texi'. 

I understand that it's how GNU projects are supposed to work, but markdown (or
rst) is just *so* much more legible and you really can read it like text.

Anyway, that's my excuse for not knowing how to "just generate cc1" for a saner
git bisect run. What I did worked, but was just incredibly ugly. There must be
some better way gcc developers have when they want to bisect cc1 behavior.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (5 preceding siblings ...)
  2022-06-12  0:27 ` torvalds@linux-foundation.org
@ 2022-06-12  2:26 ` sneves at dei dot uc.pt
  2022-06-12  6:28 ` roger at nextmovesoftware dot com
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: sneves at dei dot uc.pt @ 2022-06-12  2:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

Samuel Neves <sneves at dei dot uc.pt> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sneves at dei dot uc.pt

--- Comment #6 from Samuel Neves <sneves at dei dot uc.pt> ---
Based on that bisect commit, it is also possible to repro this issue in earlier
GCCs (11, 10, seems fine on <= 9) purely by taking away the -mno-sseX, which
triggers the same splitting as now on gcc-12: https://godbolt.org/z/KEcWGT9Yc

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (6 preceding siblings ...)
  2022-06-12  2:26 ` sneves at dei dot uc.pt
@ 2022-06-12  6:28 ` roger at nextmovesoftware dot com
  2022-06-12 16:18 ` torvalds@linux-foundation.org
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: roger at nextmovesoftware dot com @ 2022-06-12  6:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-06-12

--- Comment #7 from Roger Sayle <roger at nextmovesoftware dot com> ---
Investigating.  Adding -mno-stv the stack size reduces from 2612 to 428 (and on
godbolt the number of assembler lines reduces from 6952 to 6203).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (7 preceding siblings ...)
  2022-06-12  6:28 ` roger at nextmovesoftware dot com
@ 2022-06-12 16:18 ` torvalds@linux-foundation.org
  2022-06-12 16:35 ` torvalds@linux-foundation.org
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-12 16:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #8 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Roger Sayle from comment #7)
> Investigating.  Adding -mno-stv the stack size reduces from 2612 to 428 (and
> on godbolt the number of assembler lines reduces from 6952 to 6203).

Thanks. Using -mno-stv may well be a good workaround for the kernel for now,
except I don't have a clue what it means. I will google it to figure out
whether it's appropriate for our use.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (8 preceding siblings ...)
  2022-06-12 16:18 ` torvalds@linux-foundation.org
@ 2022-06-12 16:35 ` torvalds@linux-foundation.org
  2022-06-12 17:30 ` torvalds@linux-foundation.org
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-12 16:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #9 from Linus Torvalds <torvalds@linux-foundation.org> ---
Looks like STV is "scalar to vector" and it should have been disabled
automatically by the -mno-avx flag anyway.

And the excessive stack usage was perhaps due to GCC preparing all those stack
slots for integer -> vector movement that then never happens exactly because
the kernel uses -mno-avx?

So if I understand this right, the kernel can indeed just add -mno-stv to work
around this.

?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (9 preceding siblings ...)
  2022-06-12 16:35 ` torvalds@linux-foundation.org
@ 2022-06-12 17:30 ` torvalds@linux-foundation.org
  2022-06-13  9:29 ` jakub at gcc dot gnu.org
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-12 17:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #10 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Roger Sayle from comment #7)
> Investigating.  Adding -mno-stv the stack size reduces from 2612 to 428 (and
> on godbolt the number of assembler lines reduces from 6952 to 6203).

So now that I actually tried that, '-mno-stv' does nothing for me. I still see
a frame size of 2620.

I wonder what the difference in our setups is. I tested with plain gcc-12.1.1
from Fedora 36, maybe you tested with current trunk or something?

Anyway, it seems -mno-stv isn't the solution at least for the released version
;(

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (10 preceding siblings ...)
  2022-06-12 17:30 ` torvalds@linux-foundation.org
@ 2022-06-13  9:29 ` jakub at gcc dot gnu.org
  2022-06-13 17:39 ` torvalds@linux-foundation.org
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-13  9:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org

--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Confirmed it is r12-7502-g8ea4a34bd0b0a462 on our bisect seed.
Perhaps for !TARGET_STV || !TARGET_SSE2 we could keep the optabs, but split
right away during expansion?
Anyway, I think we need to understand what makes it spill that much more, and
unfortunately the testcase is too large to find that out easily, I think we
should investigate commenting out some rounds.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (11 preceding siblings ...)
  2022-06-13  9:29 ` jakub at gcc dot gnu.org
@ 2022-06-13 17:39 ` torvalds@linux-foundation.org
  2022-06-13 18:23 ` sneves at dei dot uc.pt
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-13 17:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #12 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Jakub Jelinek from comment #11)
> Anyway, I think we need to understand what makes it spill that much more,
> and unfortunately the testcase is too large to find that out easily, I think
> we should investigate commenting out some rounds.

I think you can comment out all the rounds except for one, and still see the
effects of this.

At least when I try it on godbolt with just ROUND(0) in place and rounds 1-12
deleted, for gcc-11.3 I get a stack frame of 388.

And with gcc-12.1 I get a stack frame of 516. Obviously that's not nearly as
huge, but it's still a fairly significant expansion and is presumably due to
the same basic issue.

I used godbolt just because I no longer have 11.3 installed to compare against,
so it was the easiest way to verify that the stack expansion is still there.

Just to clarify: that's still with

   -O2 -m32 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx

I'm sure it depends on the flags.

The code is still not exactly trivial with just one round in place. It's a
cryptographic hash, after all. But when compiling for 64-bit mode it all looks
fairly straightforward, it really is that DImode stuff that makes it really
ungainly when using -m32.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (12 preceding siblings ...)
  2022-06-13 17:39 ` torvalds@linux-foundation.org
@ 2022-06-13 18:23 ` sneves at dei dot uc.pt
  2022-06-13 18:31 ` torvalds@linux-foundation.org
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: sneves at dei dot uc.pt @ 2022-06-13 18:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #13 from Samuel Neves <sneves at dei dot uc.pt> ---
Something simple like this -- https://godbolt.org/z/61orYdjK7 -- already
exhibits the effect. 

Furthermore, and this also applies to the full BLAKE2b compression function, if
you replace all the xors in the rounds by adds, the stack size goes back to
normal. 

On the other hand, replacing the adds by xors, making the whole round function
bitwise ops, does not have any effect on stack size.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (13 preceding siblings ...)
  2022-06-13 18:23 ` sneves at dei dot uc.pt
@ 2022-06-13 18:31 ` torvalds@linux-foundation.org
  2022-06-14  7:57 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-13 18:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #14 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Samuel Neves from comment #13)
> Something simple like this -- https://godbolt.org/z/61orYdjK7 -- already
> exhibits the effect. 

Yup.

That's a much better test-case. I think you should attach it to the gcc
bugzilla, I don't know how long godbolt remembers those links.

But godbolt here is good for another thing: select clang-14 as the compiler,
and you realize how even gcc-11 is actually doing a really really bad job.

And I'm not saying that to shame anybody: I think it's more a sign that maybe
the issue is actually much deeper, and that the problem already exists in
gcc-11, but some small perturbation then just makes an already existing problem
much worse in gcc-12.

So that commit I bisected to is probably almost entirely unrelated, and it is
just the thing that turns up the bad behavior to 11.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (14 preceding siblings ...)
  2022-06-13 18:31 ` torvalds@linux-foundation.org
@ 2022-06-14  7:57 ` rguenth at gcc dot gnu.org
  2022-06-14  8:50 ` jakub at gcc dot gnu.org
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-06-14  7:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we feed DImode rotates into RA which constrains register allocation
enough to require spills (all 4 DImode vals are live across the kernel,
not even -fschedule-insn can do anything here).  I wonder if it ever makes
sense to not split wide ops before reload.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (15 preceding siblings ...)
  2022-06-14  7:57 ` rguenth at gcc dot gnu.org
@ 2022-06-14  8:50 ` jakub at gcc dot gnu.org
  2022-06-14  9:42 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-14  8:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Though, ix86_rot{l,r}di3_doubleword define_insn_and_split patterns were split
only after reload both before and after Roger's change, so somehow whether we
emit it as SImode from the beginning or only split it before reload affects the
RA decisions.
unsigned long long foo (unsigned long long x, int y, unsigned long long z) { x
^= z; return (x << 24) | (x >> (-24 & 63)); }
is too simplified, the difference with that is just that we used to emit
setting of the DImode pseudo to 0 before setting its halves with xor, while now
we don't, so it must be something else.

I believe as post-reload splitters the doubleword rotates have been introduced
already in PR17886.
Rewriting those into pre-reload splitters from post-reload splitters would be
certainly possible, I will try that, the question is whether it would cure this
and what effects it would have on other code.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (16 preceding siblings ...)
  2022-06-14  8:50 ` jakub at gcc dot gnu.org
@ 2022-06-14  9:42 ` jakub at gcc dot gnu.org
  2022-06-14  9:50 ` jakub at gcc dot gnu.org
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-14  9:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|needs-bisection             |

--- Comment #17 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, I've tried:
--- gcc/config/i386/i386.md.jj  2022-06-13 10:53:26.739290704 +0200
+++ gcc/config/i386/i386.md     2022-06-14 11:09:24.467024047 +0200
@@ -13734,14 +13734,13 @@
 ;; shift instructions and a scratch register.

 (define_insn_and_split "ix86_rotl<dwi>3_doubleword"
- [(set (match_operand:<DWI> 0 "register_operand" "=r")
-       (rotate:<DWI> (match_operand:<DWI> 1 "register_operand" "0")
-                    (match_operand:QI 2 "<shift_immediate_operand>" "<S>")))
-  (clobber (reg:CC FLAGS_REG))
-  (clobber (match_scratch:DWIH 3 "=&r"))]
- ""
+ [(set (match_operand:<DWI> 0 "register_operand")
+       (rotate:<DWI> (match_operand:<DWI> 1 "register_operand")
+                    (match_operand:QI 2 "<shift_immediate_operand>")))
+  (clobber (reg:CC FLAGS_REG))]
+ "ix86_pre_reload_split ()"
  "#"
- "reload_completed"
+ "&& 1"
  [(set (match_dup 3) (match_dup 4))
   (parallel
    [(set (match_dup 4)
@@ -13764,6 +13763,7 @@
                                                       (match_dup 6)))) 0)))
     (clobber (reg:CC FLAGS_REG))])]
 {
+  operands[3] = gen_reg_rtx (<MODE>mode);
   operands[6] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode) - 1);
   operands[7] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode));

@@ -13771,14 +13771,13 @@
 })

 (define_insn_and_split "ix86_rotr<dwi>3_doubleword"
- [(set (match_operand:<DWI> 0 "register_operand" "=r")
-       (rotatert:<DWI> (match_operand:<DWI> 1 "register_operand" "0")
-                      (match_operand:QI 2 "<shift_immediate_operand>" "<S>")))
-  (clobber (reg:CC FLAGS_REG))
-  (clobber (match_scratch:DWIH 3 "=&r"))]
- ""
+ [(set (match_operand:<DWI> 0 "register_operand")
+       (rotatert:<DWI> (match_operand:<DWI> 1 "register_operand")
+                      (match_operand:QI 2 "<shift_immediate_operand>")))
+  (clobber (reg:CC FLAGS_REG))]
+ "ix86_pre_reload_split ()"
  "#"
- "reload_completed"
+ "&& 1"
  [(set (match_dup 3) (match_dup 4))
   (parallel
    [(set (match_dup 4)
@@ -13801,6 +13800,7 @@
                                                     (match_dup 6)))) 0)))
     (clobber (reg:CC FLAGS_REG))])]
 {
+  operands[3] = gen_reg_rtx (<MODE>mode);
   operands[6] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode) - 1);
   operands[7] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode));


On the #c0 test with -O2 -m32 -mno-mmx -mno-sse it makes some difference, but
not as much as one would hope for:
Numbers from gcc 11.3.1 20220614, 11.3.1 20220614 with the patch, 13.0.0
20220610, 13.0.0 20220614 with the patch:
sub on %esp    428      2556      2620      2556
fn size in B 21657     23186     28413     23534
.s lines      6199      3942      7260      4198
So, trunk patched with the above patch results in significantly fewer
instructions, but larger (more of them use 32-bit immediates, mostly in form of
whatever(%esp) memory source operand).
And the stack usage is high.

I think the patch is still a good idea, it gives the RA more options, but we
should investigate why it consumes so much more stack and results in larger
code.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (17 preceding siblings ...)
  2022-06-14  9:42 ` jakub at gcc dot gnu.org
@ 2022-06-14  9:50 ` jakub at gcc dot gnu.org
  2022-06-16 10:06 ` arnd at linaro dot org
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-14  9:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Of course, size comparisons of -O2 code aren't the most important, for -O2 it
is more important how fast the code is.
When comparing -Os -m32 -mno-mmx -mno-sse, the numbers are
sub on %esp    412      2564      2620      2564
fn size in B 27535     20508     35036     20416
.s lines      5816      3590      7251      3544
So in the -Os case, the patched functions are both smaller and fewer
instructions (significantly so), but compared to gcc 11 still significantly
higher stack usage).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (18 preceding siblings ...)
  2022-06-14  9:50 ` jakub at gcc dot gnu.org
@ 2022-06-16 10:06 ` arnd at linaro dot org
  2022-06-24  6:17 ` cvs-commit at gcc dot gnu.org
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: arnd at linaro dot org @ 2022-06-16 10:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

Arnd Bergmann <arnd at linaro dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arnd at linaro dot org

--- Comment #19 from Arnd Bergmann <arnd at linaro dot org> ---
I checked the other target architectures that are supported by the kernel to
see if anything else is affected. Apparently only sparc32 has a similar issue
with a frame size of 2280 bytes using gcc-10 or higher, compared to 600 to 800
bytes for gcc-4 through gcc-9.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (19 preceding siblings ...)
  2022-06-16 10:06 ` arnd at linaro dot org
@ 2022-06-24  6:17 ` cvs-commit at gcc dot gnu.org
  2022-06-24 16:45 ` torvalds@linux-foundation.org
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-06-24  6:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #20 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:3b8794302b52a819ca3ea78238e9b5025d1c56dd

commit r13-1239-g3b8794302b52a819ca3ea78238e9b5025d1c56dd
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Jun 24 07:15:08 2022 +0100

    PR target/105930: Split *xordi3_doubleword after reload on x86.

    This patch addresses PR target/105930 which is an ia32 stack frame size
    regression in high-register pressure XOR-rich cryptography functions
    reported by Linus Torvalds.  The underlying problem is once the limited
    number of registers on the x86 are exhausted, the register allocator
    has to decide which to spill, where some eviction choices lead to much
    poorer code, but these consequences are difficult to predict in advance.

    The patch below, which splits xordi3_doubleword and iordi3_doubleword
    after reload (instead of before), significantly reduces the amount of
    spill code and stack frame size, in what might appear to be an arbitrary
    choice.

    My explanation of this behaviour is that the mixing of pre-reload split
    SImode instructions and post-reload split DImode instructions is
    confusing some of the heuristics used by reload.  One might think
    that splitting early gives the register allocator more freedom to
    use available registers, but in practice the constraint that double
    word values occupy consecutive registers (when ultimately used as a
    DImode value) is the greater constraint.  Instead, I believe in this
    case, the pseudo registers used in memory addressing, appear to be
    double counted for split SImode instructions compared to unsplit
    DImode instructions.  For the reduced test case in comment #13, this
    leads to %eax being used to hold the long-lived argument pointer "v",
    blocking the use of the ax:dx pair for processing double word values.
    The important lines are at the very top of the assembly output:

    GCC 11  [use %ecx to address memory, require a 24-byte stack frame]
            sub     esp, 24
            mov     ecx, DWORD PTR [esp+40]

    GCC 12 [use %eax to address memory, require a 44-byte stack frame]
            sub     esp, 44
            mov     eax, DWORD PTR [esp+64]

    2022-06-24  Roger Sayle  <roger@nextmovesoftware.com>
                Uroš Bizjak  <ubizjak@gmail.com>

    gcc/ChangeLog
            PR target/105930
            * config/i386/i386.md (*<any_or>di3_doubleword): Split after
            reload.  Use rtx_equal_p to avoid creating memory-to-memory moves,
            and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (20 preceding siblings ...)
  2022-06-24  6:17 ` cvs-commit at gcc dot gnu.org
@ 2022-06-24 16:45 ` torvalds@linux-foundation.org
  2022-06-24 16:55 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-24 16:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #21 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to CVS Commits from comment #20)
>
>         One might think
>     that splitting early gives the register allocator more freedom to
>     use available registers, but in practice the constraint that double
>     word values occupy consecutive registers (when ultimately used as a
>     DImode value) is the greater constraint. 

Whee.

Why does gcc have that constraint, btw? I tried to look at the clang code
generation once more, and I don't *think* clang has the same constraint, and
maybe that is why it does so much better?

Yes, x86 itself inherently has a couple of forced register pairings (notably
%edx:%eax for 64-bit multiplication and division), and obviously the whole
calling convention requires well-defined pairings, but in the general case it
seems to be a mistake to keep DImode values as DImode values and force them to
be consecutive registers when used.

Maybe I misunderstand. But now that this comes up I have this dim memory of
actually having had a discussion like this before on bugzilla, where gcc
generated horrible DImode code.

>     GCC 11  [use %ecx to address memory, require a 24-byte stack frame]
>             sub     esp, 24
>             mov     ecx, DWORD PTR [esp+40]
>     
>     GCC 12 [use %eax to address memory, require a 44-byte stack frame]
>             sub     esp, 44
>             mov     eax, DWORD PTR [esp+64]

I just checked the current git -tip, and this does seem to fix the original
case too, with the old horrid 2620 bytes of stack frame now being a *much*
improved 404 bytes!

So your patch - or other changes - does fix it for me, unless I did something
wrong in my testing (which is possible).

Thanks. I'm not sure what the gcc policy on closing the bug is (and I don't
even know if I am allowed), so I'm not marking this closed, but it seems to be
fixed as far as I am concerned, and I hope it gets released as a dot-release
for the gcc-12 series.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (21 preceding siblings ...)
  2022-06-24 16:45 ` torvalds@linux-foundation.org
@ 2022-06-24 16:55 ` jakub at gcc dot gnu.org
  2022-06-24 17:06 ` torvalds@linux-foundation.org
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-24 16:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #22 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Linus Torvalds from comment #21)
> Whee.
> 
> Why does gcc have that constraint, btw? I tried to look at the clang code
> generation once more, and I don't *think* clang has the same constraint, and
> maybe that is why it does so much better?

Registers in RTL have just a single register number and mode (ok, it has some
extra info, but not a set of registers).  When it is a pseudo register, that
doesn't constrain anything, it is just
(reg:DI 175).
But when it is a hard register, it still has just a single register number,
so there is no way to express through that non-consecutive set of registers,
so
(reg:DI 4)
needs to be di:si pair etc.
If the wider registers are narrowed before register allocation, it is just
a pair like (reg:SI 123) (reg:SI 256) and it can be allowed anywhere.
If we wanted RA to allocate non-consecutive registers, we'd need to represent
that differently (say as concatenation of SImode registers), but then it
wouldn't be accepted by constraints and predicates of most of the
define_insn_and_split patterns.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (22 preceding siblings ...)
  2022-06-24 16:55 ` jakub at gcc dot gnu.org
@ 2022-06-24 17:06 ` torvalds@linux-foundation.org
  2022-06-24 17:19 ` torvalds@linux-foundation.org
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-24 17:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #23 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Jakub Jelinek from comment #22)
> 
> If the wider registers are narrowed before register allocation, it is just
> a pair like (reg:SI 123) (reg:SI 256) and it can be allowed anywhere.

That was more what I was thinking - why is the DImode information being kept so
long?

I realize that you want to do a lot of the early CSE etc operations at that
higher level, but by the time you are actually allocating registers and
thinking about spilling them, why is it still a DImode thing?

And this now brings back my memory of the earlier similar discussion - it
wasn't about DImode code generation, it was about bitfield code generation
being horrendous, where gcc was keeping the whole "this is a bitfield"
information around for a long time and as a result generating truly horrendous
code. When it looked like it instead should just have turned it into a load and
shift early, and then doing all the sane optimizations at that level (ie
rewriting simple bitfield code to just do loads and shifts generated *much*
better code than using bitfields).

But this is just my personal curiosity at this point - it looks like Roger
Sayle's patch has fixed the immediate problem, so the big issue is solved. And
maybe the fact that clang is doing so much better is due to something else
entirely - it just _looks_ like it might be this artificial constraint by gcc
that makes it do bad register and spill choices.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (23 preceding siblings ...)
  2022-06-24 17:06 ` torvalds@linux-foundation.org
@ 2022-06-24 17:19 ` torvalds@linux-foundation.org
  2022-06-24 17:33 ` jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-06-24 17:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #24 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Linus Torvalds from comment #23)
> 
> And this now brings back my memory of the earlier similar discussion - it
> wasn't about DImode code generation, it was about bitfield code generation
> being horrendous,

Searching around, it's this one from 2011:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

not really related to this issue, apart from the superficially similar issue
with oddly bad code generation.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (24 preceding siblings ...)
  2022-06-24 17:19 ` torvalds@linux-foundation.org
@ 2022-06-24 17:33 ` jakub at gcc dot gnu.org
  2022-06-30 10:03 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-24 17:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #25 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Linus Torvalds from comment #23)
> (In reply to Jakub Jelinek from comment #22)
> > 
> > If the wider registers are narrowed before register allocation, it is just
> > a pair like (reg:SI 123) (reg:SI 256) and it can be allowed anywhere.
> 
> That was more what I was thinking - why is the DImode information being kept
> so long?

This is what is being discussed here.
Some possibilities are lower these multi-word operations during expansion from
GIMPLE to RTL (after all, the generic code usually does that without anything
special needed on the backend side unless one declares the backend can do that
better), one counter-argument to that is the x86 STV pass which uses vector
operations for 2 word operations when possible and it won't really work when it
is lowered during expansion.

Another is splitting those before register allocation, which is what some
patterns did and what other patterns didn't.

Or it can be split after register allocation.

My understanding was that Roger tried to change some patterns from splitting
after RA to before RA and it didn't improve this testcase, so in the end
changed some other patterns from splitting before RA to after RA.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (25 preceding siblings ...)
  2022-06-24 17:33 ` jakub at gcc dot gnu.org
@ 2022-06-30 10:03 ` cvs-commit at gcc dot gnu.org
  2022-07-10  8:01 ` roger at nextmovesoftware dot com
  2022-07-10 16:41 ` torvalds@linux-foundation.org
  28 siblings, 0 replies; 30+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-06-30 10:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #26 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:00193676a5a3e7e50e1fa6646bb5abb5a7b2acbb

commit r13-1362-g00193676a5a3e7e50e1fa6646bb5abb5a7b2acbb
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Thu Jun 30 11:00:03 2022 +0100

    Use xchg for DImode double word rotate by 32 bits with -m32 on x86.

    This patch was motivated by the investigation of Linus Torvalds' spill
    heavy cryptography kernels in PR 105930.  The <any_rotate>di3 expander
    handles all rotations by an immediate constant for 1..63 bits with the
    exception of 32 bits, which FAILs and is then split by the middle-end.
    This patch makes these 32-bit doubleword rotations consistent with the
    other DImode rotations during reload, which results in reduced register
    pressure, fewer instructions and the use of x86's xchg instruction
    when appropriate.  In theory, xchg can be handled by register renaming,
    but even on micro-architectures where it's implemented by 3 uops (no
    worse than a three instruction shuffle), avoiding nominating a
    "temporary" register, reduces user-visible register pressure (and
    has obvious code size benefits).

    The effects are best shown with the new testcase:

    unsigned long long bar();
    unsigned long long foo()
    {
      unsigned long long x = bar();
      return (x>>32) | (x<<32);
    }

    for which GCC with -m32 -O2 currently generates:

            subl    $12, %esp
            call    bar
            addl    $12, %esp
            movl    %eax, %ecx
            movl    %edx, %eax
            movl    %ecx, %edx
            ret

    but with this patch now generates:

            subl    $12, %esp
            call    bar
            addl    $12, %esp
            xchgl   %edx, %eax
            ret

    With this patch, the number of lines of assembly language generated
    for the blake2b kernel (from the attachment to PR105930) decreases
    from 5626 to 5404. Although there's an impressive reduction in
    instruction count, there's no change/reduction in stack frame size.

    2022-06-30  Roger Sayle  <roger@nextmovesoftware.com>
                Uroš Bizjak  <ubizjak@gmail.com>

    gcc/ChangeLog
            * config/i386/i386.md (swap_mode): Rename from *swap<mode> to
            provide gen_swapsi.
            (<any_rotate>di3): Handle !TARGET_64BIT rotations by 32 bits
            via new gen_<insn>32di2_doubleword below.
            (<anyrotate>32di2_doubleword): New define_insn_and_split
            that splits after reload as either a pair of move instructions
            or an xchgl (using gen_swapsi).

    gcc/testsuite/ChangeLog
            * gcc.target/i386/xchg-3.c: New test case.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (26 preceding siblings ...)
  2022-06-30 10:03 ` cvs-commit at gcc dot gnu.org
@ 2022-07-10  8:01 ` roger at nextmovesoftware dot com
  2022-07-10 16:41 ` torvalds@linux-foundation.org
  28 siblings, 0 replies; 30+ messages in thread
From: roger at nextmovesoftware dot com @ 2022-07-10  8:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #27 from Roger Sayle <roger at nextmovesoftware dot com> ---
This should now be fixed on both mainline and the GCC 12 release branch.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
  2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
                   ` (27 preceding siblings ...)
  2022-07-10  8:01 ` roger at nextmovesoftware dot com
@ 2022-07-10 16:41 ` torvalds@linux-foundation.org
  28 siblings, 0 replies; 30+ messages in thread
From: torvalds@linux-foundation.org @ 2022-07-10 16:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #28 from Linus Torvalds <torvalds@linux-foundation.org> ---
(In reply to Roger Sayle from comment #27)
> This should now be fixed on both mainline and the GCC 12 release branch.

Thanks everybody.

Looks like the xchg optimization isn't in the gcc-12 release branch, but the
stack use looks reasonable from my (very limited) testing.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2022-07-10 16:41 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-11 20:51 [Bug c/105930] New: Excessive stack spill generation on 32-bit x86 torvalds@linux-foundation.org
2022-06-11 23:20 ` [Bug target/105930] " torvalds@linux-foundation.org
2022-06-11 23:54 ` [Bug target/105930] [12/13 Regression] " pinskia at gcc dot gnu.org
2022-06-11 23:59 ` pinskia at gcc dot gnu.org
2022-06-12  0:20 ` torvalds@linux-foundation.org
2022-06-12  0:20 ` torvalds@linux-foundation.org
2022-06-12  0:27 ` torvalds@linux-foundation.org
2022-06-12  2:26 ` sneves at dei dot uc.pt
2022-06-12  6:28 ` roger at nextmovesoftware dot com
2022-06-12 16:18 ` torvalds@linux-foundation.org
2022-06-12 16:35 ` torvalds@linux-foundation.org
2022-06-12 17:30 ` torvalds@linux-foundation.org
2022-06-13  9:29 ` jakub at gcc dot gnu.org
2022-06-13 17:39 ` torvalds@linux-foundation.org
2022-06-13 18:23 ` sneves at dei dot uc.pt
2022-06-13 18:31 ` torvalds@linux-foundation.org
2022-06-14  7:57 ` rguenth at gcc dot gnu.org
2022-06-14  8:50 ` jakub at gcc dot gnu.org
2022-06-14  9:42 ` jakub at gcc dot gnu.org
2022-06-14  9:50 ` jakub at gcc dot gnu.org
2022-06-16 10:06 ` arnd at linaro dot org
2022-06-24  6:17 ` cvs-commit at gcc dot gnu.org
2022-06-24 16:45 ` torvalds@linux-foundation.org
2022-06-24 16:55 ` jakub at gcc dot gnu.org
2022-06-24 17:06 ` torvalds@linux-foundation.org
2022-06-24 17:19 ` torvalds@linux-foundation.org
2022-06-24 17:33 ` jakub at gcc dot gnu.org
2022-06-30 10:03 ` cvs-commit at gcc dot gnu.org
2022-07-10  8:01 ` roger at nextmovesoftware dot com
2022-07-10 16:41 ` torvalds@linux-foundation.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).