Rewrite ix86_expand_vecop_qihi2 to expand to 2x-wider (e.g. V16QI ->
V16HImode) instructions when available.

Currently, the compiler generates the following assembly for V16QImode
multiplication (-mavx2):

	vpunpcklbw	%xmm0, %xmm0, %xmm3
	vpunpcklbw	%xmm1, %xmm1, %xmm2
	vpunpckhbw	%xmm0, %xmm0, %xmm0
	movl	$255, %eax
	vpunpckhbw	%xmm1, %xmm1, %xmm1
	vpmullw	%xmm3, %xmm2, %xmm2
	vmovd	%eax, %xmm3
	vpmullw	%xmm0, %xmm1, %xmm1
	vpbroadcastw	%xmm3, %xmm3
	vpand	%xmm2, %xmm3, %xmm0
	vpand	%xmm1, %xmm3, %xmm3
	vpackuswb	%xmm3, %xmm0, %xmm0

and only with -mavx512bw -mavx512vl generates:

	vpmovzxbw	%xmm1, %ymm1
	vpmovzxbw	%xmm0, %ymm0
	vpmullw	%ymm1, %ymm0, %ymm0
	vpmovwb	%ymm0, %xmm0

The patched compiler generates more optimized code that multiplies in
the 2x-wider mode even in cases where the missing truncate instruction
has to be emulated with a permutation (-mavx2):

	vpmovzxbw	%xmm0, %ymm0
	vpmovzxbw	%xmm1, %ymm1
	movl	$255, %eax
	vpmullw	%ymm1, %ymm0, %ymm1
	vmovd	%eax, %xmm0
	vpbroadcastw	%xmm0, %ymm0
	vpand	%ymm1, %ymm0, %ymm0
	vpackuswb	%ymm0, %ymm0, %ymm0
	vpermq	$216, %ymm0, %ymm0

The patch also adjusts the cost calculation of V*QImode emulations to
account for the generation of 2x-wider mode instructions.

gcc/ChangeLog:

	* config/i386/i386-expand.cc (ix86_expand_vecop_qihi2):
	Rewrite to expand to 2x-wider (e.g. V16QI -> V16HImode)
	instructions when available.  Emulate truncation via
	ix86_expand_vec_perm_const_1 when native truncate insn
	is not available.
	(ix86_expand_vecop_qihi_partial): Use pmovzx when available.
	Trivially rename some variables.
	(ix86_expand_vecop_qihi): Unconditionally call
	ix86_expand_vecop_qihi2.
	* config/i386/i386.cc (ix86_multiplication_cost): Rewrite cost
	calculation of V*QImode emulations to account for generation of
	2x-wider mode instructions.
	(ix86_shift_rotate_cost): Update cost calculation of V*QImode
	emulations to account for generation of 2x-wider mode
	instructions.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/avx512vl-pr95488-1.c: Revert 2023-05-18 change.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
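
For illustration only, the patched -mavx2 sequence above can be modelled in scalar C roughly as follows. This is a sketch of what the emitted instructions compute, not code from the patch; the function names mul_v16qi_widened and mul_v16qi_reference are hypothetical, and the model assumes the usual per-128-bit-lane behaviour of vpackuswb and the qword order 0, 2, 1, 3 selected by vpermq $216.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the patched -mavx2 sequence.  */
static void
mul_v16qi_widened (uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
  uint16_t wide[16];
  uint8_t packed[32];

  /* vpmovzxbw + vpmullw + vpand: zero-extend each byte to 16 bits,
     multiply in 16 bits, then keep only the low byte of each product.  */
  for (size_t i = 0; i < 16; i++)
    wide[i] = (uint16_t) (a[i] * b[i]) & 0xff;

  /* vpackuswb %ymm0, %ymm0, %ymm0: within each 128-bit lane, the eight
     words of that lane are narrowed twice (both sources are the same
     register).  The values are already <= 255, so no saturation occurs.  */
  for (size_t lane = 0; lane < 2; lane++)
    for (size_t j = 0; j < 8; j++)
      {
	packed[lane * 16 + j] = (uint8_t) wide[lane * 8 + j];
	packed[lane * 16 + 8 + j] = (uint8_t) wide[lane * 8 + j];
      }

  /* vpermq $216: reorder qwords as 0, 2, 1, 3, so the low 128 bits
     hold qword 0 (bytes 0..7) followed by qword 2 (bytes 8..15).  */
  for (size_t j = 0; j < 8; j++)
    {
      dst[j] = packed[j];
      dst[8 + j] = packed[16 + j];
    }
}

/* Reference: plain element-wise byte multiplication (low 8 bits).  */
static void
mul_v16qi_reference (uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
  for (size_t i = 0; i < 16; i++)
    dst[i] = (uint8_t) (a[i] * b[i]);
}
```

Both functions should produce identical results for any input, which is the property the permutation-based truncation relies on.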