This patch fixes two issues with the handling of 128-bit TImode integer
constants in the x86_64 backend.  The main issue is that GCC always
tries to load 128-bit integer constants via broadcasts to vector SSE
registers, even when the result is required in general registers.  This
is seen in the two closely related functions below:

__int128 m;
#define CONST (((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL)
void foo() { m &= CONST; }
void bar() { m = CONST; }

When compiled with -O2 -mavx, we currently generate:

foo:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq %xmm0, %xmm0, %xmm0
        vmovq   %xmm0, %rax
        vpextrq $1, %xmm0, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret

bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm1
        vpunpcklqdq %xmm1, %xmm1, %xmm0
        vpextrq $1, %xmm0, %rdx
        vmovq   %xmm0, m(%rip)
        movq    %rdx, m+8(%rip)
        ret

With this patch, we defer the decision to use a vector broadcast for
TImode until we know we actually want an SSE register result, by moving
the call to ix86_convert_const_wide_int_to_broadcast from the RTL
expansion pass to the scalar-to-vector (STV) pass.  With this change
(and a minor tweak described below) we now generate:

foo:    movabsq $81985529216486895, %rax
        andq    %rax, m(%rip)
        andq    %rax, m+8(%rip)
        ret

bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq %xmm0, %xmm0, %xmm0
        vmovdqa %xmm0, m(%rip)
        ret

showing that we now use vector broadcasts only where appropriate.

The one minor tweak mentioned above is to enable the un-cprop hi/lo
optimization, which I originally contributed back in September 2004
(https://gcc.gnu.org/pipermail/gcc-patches/2004-September/148756.html),
even when not optimizing for size.  Without this (and currently with
just -O2) the function foo above generates:

foo:    movabsq $81985529216486895, %rax
        movabsq $81985529216486895, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret

I'm not sure why (back in 2004) I thought that a second constant load,
avoiding the implicit "movq %rax, %rdx", was faster; perhaps it avoided
a dependency and allowed better scheduling.  But nowadays a
"movq %rax, %rdx" is either eliminated by GCC's hardreg cprop pass or
special-cased by modern hardware, making the first foo above
preferable: not only shorter but also faster.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
and with/without -march=cascadelake, with no new failures.
Ok for mainline?

2023-12-18  Roger Sayle

gcc/ChangeLog
        * config/i386/i386-expand.cc
        (ix86_convert_const_wide_int_to_broadcast): Remove static.
        (ix86_expand_move): Don't attempt to convert wide constants
        to SSE using ix86_convert_const_wide_int_to_broadcast here.
        (ix86_split_long_move): Always un-cprop multi-word constants.
        * config/i386/i386-expand.h
        (ix86_convert_const_wide_int_to_broadcast): Prototype here.
        * config/i386/i386-features.cc: Include i386-expand.h.
        (timode_scalar_chain::convert_insn): When converting TImode to
        V1TImode, try ix86_convert_const_wide_int_to_broadcast.

gcc/testsuite/ChangeLog
        * gcc.target/i386/movti-2.c: New test case.
        * gcc.target/i386/movti-3.c: Likewise.

Thanks in advance,
Roger
--
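
P.S. For readers less familiar with the idiom, the "broadcast" at issue
is the same one the stand-alone intrinsics snippet below produces.  This
is purely an illustration of what broadcasting a 64-bit value into a
128-bit store means; bar_intrinsics is a name invented for this example
and it is not part of the patch:

#include <immintrin.h>

__int128 m;

/* Build a 128-bit value whose two 64-bit halves are identical by
   broadcasting one half into an SSE register, then store it to m
   (m is 16-byte aligned, so the aligned store is valid).  With
   -O2 -mavx this should compile to essentially the movabsq/vmovq/
   vpunpcklqdq/vmovdqa sequence shown for bar above.  */
void bar_intrinsics (void)
{
  __m128i v = _mm_set1_epi64x (0x0123456789abcdefULL);
  _mm_store_si128 ((__m128i *) &m, v);
}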
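
P.P.S. The broadcast is only viable because the two 64-bit halves of
CONST happen to be identical, which is (roughly) the condition that
ix86_convert_const_wide_int_to_broadcast tests at the RTL level; when
the halves differ, no broadcast is possible and the constant has to be
loaded piecewise.  Expressed in C purely for intuition (again, not code
from the patch):

#include <stdio.h>

#define CONST (((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL)

int main (void)
{
  unsigned long long lo = (unsigned long long) CONST;
  unsigned long long hi = (unsigned long long) ((unsigned __int128) CONST >> 64);
  /* Prints lo=0123456789abcdef hi=0123456789abcdef equal=1.  */
  printf ("lo=%016llx hi=%016llx equal=%d\n", lo, hi, lo == hi);
  return lo == hi ? 0 : 1;
}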