This patch fixes two issues with the handling of 128-bit TImode integer
constants in the x86_64 backend.  The main issue is that GCC always
tries to load 128-bit integer constants via broadcasts to vector SSE
registers, even when the result is required in general registers.  This
is seen in the two closely related functions below:

__int128 m;
#define CONST (((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL)
void foo() { m &= CONST; }
void bar() { m = CONST; }

When compiled with -O2 -mavx, we currently generate:

foo:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq %xmm0, %xmm0, %xmm0
        vmovq   %xmm0, %rax
        vpextrq $1, %xmm0, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret

bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm1
        vpunpcklqdq %xmm1, %xmm1, %xmm0
        vpextrq $1, %xmm0, %rdx
        vmovq   %xmm0, m(%rip)
        movq    %rdx, m+8(%rip)
        ret

With this patch, we defer the decision to use a vector broadcast for
TImode until we know we actually want an SSE register result, by moving
the call to ix86_convert_const_wide_int_to_broadcast from the RTL
expansion pass to the scalar-to-vector (STV) pass.  With this change
(and a minor tweak described below) we now generate:

foo:    movabsq $81985529216486895, %rax
        andq    %rax, m(%rip)
        andq    %rax, m+8(%rip)
        ret

bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq %xmm0, %xmm0, %xmm0
        vmovdqa %xmm0, m(%rip)
        ret

showing that we now use vector broadcasts only where appropriate.

The one minor tweak mentioned above is to enable the un-cprop hi/lo
optimization, which I originally contributed back in September 2004
(https://gcc.gnu.org/pipermail/gcc-patches/2004-September/148756.html),
even when not optimizing for size.  Without this (and currently with
just -O2) the function foo above generates:

foo:    movabsq $81985529216486895, %rax
        movabsq $81985529216486895, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret

I'm not sure why (back in 2004) I thought that a second constant load,
avoiding the implicit "movq %rax, %rdx", was faster; perhaps it avoided
a dependency and allowed better scheduling.  But nowadays a
"movq %rax, %rdx" is either eliminated by GCC's hardreg cprop pass or
special-cased by modern hardware, making the first foo above
preferable: not only shorter but also faster.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
and with/without -march=cascadelake, with no new failures.
Ok for mainline?

2023-12-18  Roger Sayle

gcc/ChangeLog
        * config/i386/i386-expand.cc
        (ix86_convert_const_wide_int_to_broadcast): Remove static.
        (ix86_expand_move): Don't attempt to convert wide constants
        to SSE using ix86_convert_const_wide_int_to_broadcast here.
        (ix86_split_long_move): Always un-cprop multi-word constants.
        * config/i386/i386-expand.h
        (ix86_convert_const_wide_int_to_broadcast): Prototype here.
        * config/i386/i386-features.cc: Include i386-expand.h.
        (timode_scalar_chain::convert_insn): When converting TImode to
        V1TImode, try ix86_convert_const_wide_int_to_broadcast.

gcc/testsuite/ChangeLog
        * gcc.target/i386/movti-2.c: New test case.
        * gcc.target/i386/movti-3.c: Likewise.

Thanks in advance,
Roger
--
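
P.S. For readers less familiar with the idiom, the "broadcast" at issue
is the same one the stand-alone intrinsics snippet below produces.  This
is purely an illustration of what broadcasting a 64-bit value into a
128-bit store means; bar_intrinsics is a name invented for this example
and it is not part of the patch:

#include <immintrin.h>

__int128 m;

/* Build a 128-bit value whose two 64-bit halves are identical by
   broadcasting one half into an SSE register, then store it to m
   (m is 16-byte aligned, so the aligned store is valid).  With
   -O2 -mavx this should compile to essentially the movabsq/vmovq/
   vpunpcklqdq/vmovdqa sequence shown for bar above.  */
void bar_intrinsics (void)
{
  __m128i v = _mm_set1_epi64x (0x0123456789abcdefULL);
  _mm_store_si128 ((__m128i *) &m, v);
}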
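
P.P.S. The broadcast is only viable because the two 64-bit halves of
CONST happen to be identical, which is (roughly) the condition that
ix86_convert_const_wide_int_to_broadcast tests at the RTL level; when
the halves differ, no broadcast is possible and the constant has to be
loaded piecewise.  Expressed in C purely for intuition (again, not code
from the patch):

#include <stdio.h>

#define CONST (((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL)

int main (void)
{
  unsigned long long lo = (unsigned long long) CONST;
  unsigned long long hi = (unsigned long long) ((unsigned __int128) CONST >> 64);
  /* Prints lo=0123456789abcdef hi=0123456789abcdef equal=1.  */
  printf ("lo=%016llx hi=%016llx equal=%d\n", lo, hi, lo == hi);
  return lo == hi ? 0 : 1;
}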