This patch is a follow-up to Hongtao's fix for PR target/105854.  That
fix is perfectly correct, but what caught my eye was why the compiler
was generating a shift by zero at all.  Digging deeper, it turns out
that we can easily optimize __builtin_ia32_palignr for alignments of
0 and 64, which may be simplified to moves from the lowpart and
highpart respectively.
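
To make the two special cases concrete, here is a minimal scalar sketch
in C of what I understand the 64-bit palignr to compute.  The helper
name palignr64_model is purely illustrative (it is not part of the
patch), and it assumes the shift count is given in bits and is a
multiple of 8, which is how the alignments of 0 and 64 above are
expressed.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative scalar model of 64-bit palignr (hypothetical helper,
   not GCC code).  Conceptually the 128-bit value hi:lo is shifted
   right by BITS and the low 64 bits of the result are returned.  */
static uint64_t
palignr64_model (uint64_t hi, uint64_t lo, unsigned bits)
{
  /* Assumption: the count is a whole number of bytes.  */
  assert (bits % 8 == 0);

  if (bits == 0)
    return lo;                      /* Shift by 0: just the lowpart.  */
  if (bits < 64)
    return (lo >> bits) | (hi << (64 - bits));
  if (bits == 64)
    return hi;                      /* Shift by 64: just the highpart.  */
  if (bits < 128)
    return hi >> (bits - 64);
  return 0;                         /* Everything shifted out.  */
}
```

In this model the 0 and 64 cases never touch the other operand, which
is why a single move (or nothing at all, if the value is already in the
destination) should suffice.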

After adding optimizations to simplify the 64-bit DImode palignr,
I started to add the corresponding optimizations for vpalignr (i.e.
128-bit).  The first oddity is that sse.md uses TImode and a special
SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
above SSESCALARMODE hints that this should be "dropped in favor of
VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
<ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
using V1TImode instead of TImode for 128-bit palignr.

But it was only after I'd implemented this clean-up that I stumbled
across the strange semantics of 128-bit [v]palignr.  According to
https://www.felixcloutier.com/x86/palignr, the semantics are subtly
different based upon how the instruction is encoded.  PALIGNR leaves
the highpart (the destination register's bits above 127) unmodified,
whilst VEX.128 encoded VPALIGNR clears it, and (unless I'm mistaken)
it looks like GCC currently uses the exact same RTL/templates for
both, treating one as an alternative for the other.
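
The difference can be sketched with a small byte-level model in C.
This is only my reading of the reference above, not GCC code: the type
ymm_model and the three helper functions are hypothetical names, the
register is modelled as 32 bytes (a 256-bit YMM register), and the
count is in bytes as in the instruction's immediate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 256-bit register; b[0] is the least
   significant byte.  */
typedef struct { uint8_t b[32]; } ymm_model;

/* Concatenate hi:lo (16 bytes each), shift right by N bytes and
   keep the low 16 bytes.  Bytes shifted in from above are zero.  */
static void
alignr128 (uint8_t out[16], const uint8_t hi[16],
           const uint8_t lo[16], unsigned n)
{
  uint8_t cat[32];
  assert (out && hi && lo);
  memcpy (cat, lo, 16);
  memcpy (cat + 16, hi, 16);
  for (unsigned i = 0; i < 16; i++)
    out[i] = (n + i < 32) ? cat[n + i] : 0;
}

/* Legacy SSE encoding: the destination is also the first source,
   and bytes 16..31 of the register are left untouched.  */
static void
palignr_legacy (ymm_model *dst, const uint8_t src[16], unsigned n)
{
  uint8_t tmp[16];
  alignr128 (tmp, dst->b, src, n);
  memcpy (dst->b, tmp, 16);
}

/* VEX.128 encoding: three operands, and bytes 16..31 of the
   destination register are zeroed.  */
static void
vpalignr_vex128 (ymm_model *dst, const uint8_t src1[16],
                 const uint8_t src2[16], unsigned n)
{
  uint8_t tmp[16];
  alignr128 (tmp, src1, src2, n);
  memcpy (dst->b, tmp, 16);
  memset (dst->b + 16, 0, 16);
}
```

The low 128 bits agree between the two helpers; only the treatment of
the upper bytes differs, which is the part I'm unsure the current
shared RTL captures.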

Hence I thought I'd post what I have so far (part optimization and
part clean-up), to then ask the x86 experts for their opinions.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-06-30  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
        CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
        * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
        and gen_ssse3_palignrv1ti instead of TImode.
        * config/i386/sse.md (SSESCALARMODE): Delete.
        (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
        (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
        iterator instead of SSESCALARMODE.
        (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
        using a single move instruction (if required).
        (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
        (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.

gcc/testsuite/ChangeLog
        * gcc.target/i386/ssse3-palignr-2.c: New test case.


Thanks in advance,
Roger
--