From: "kvr000 at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/104188] New: gcc omitting AVX-512 broadcast instruction
Date: Sat, 22 Jan 2022 17:03:20 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104188

            Bug ID: 104188
           Summary: gcc omitting AVX-512 broadcast instruction
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kvr000 at gmail dot com
  Target Milestone: ---

Hi,

there is a bug when generating AVX-512 instructions from intrinsics. The code
is generated correctly in gcc-10 but gcc-11 completely omits the
vbroadcastf32x4 instruction.
gcc version: 11.2.0-7ubuntu2 - 11.2.0

Source code of minimal working example:

// Matrix 4*4 multiplication:

#ifndef NO_VECTORIZE
#ifdef __x86_64__
/* Header names were stripped by the archive; immintrin.h is assumed here
   because it declares the AVX-512 types and intrinsics used below. */
#include <immintrin.h>
#endif
#ifdef __aarch64__
/* Assumed: arm_neon.h declares float32x4_t. */
#include <arm_neon.h>
#endif
#endif

union Mat44 {
        float m[4][4];
#ifndef NO_VECTORIZE
#ifdef __x86_64__
        __m128 row[4];
        __m256 rowDuet[2];
        __m512 rowQuad;
#endif
#ifdef __aarch64__
        float32x4_t row[4];
#endif
#endif
};

void matmult_avx512(union Mat44 *out, union Mat44 *a, union Mat44 *b)
{
        __m512 a0123 = _mm512_loadu_ps(a->m[0]);
        __m512 b0000 = _mm512_broadcast_f32x4(b->row[0]);
        __m512 b1111 = _mm512_broadcast_f32x4(b->row[1]);
        __m512 b2222 = _mm512_broadcast_f32x4(b->row[2]);
        __m512 b3333 = _mm512_broadcast_f32x4(b->row[3]);
        __m512 result = _mm512_mul_ps(_mm512_permute_ps(a0123, 0x00), b0000);
        result = _mm512_fmadd_ps(_mm512_permute_ps(a0123, 0x55), b1111, result);
        result = _mm512_fmadd_ps(_mm512_permute_ps(a0123, 0xaa), b2222, result);
        result = _mm512_fmadd_ps(_mm512_permute_ps(a0123, 0xff), b3333, result);
        _mm512_storeu_ps(out->m[0], result);
}

gcc-10 (correct):

        endbr64
        vmovups (%rsi), %zmm0
        vbroadcastf32x4 (%rdx), %zmm6           // note here
        vpermilps       $0, %zmm0, %zmm1
        vmulps  %zmm6, %zmm1, %zmm1
        vbroadcastf32x4 16(%rdx), %zmm5         // note here
        vpermilps       $85, %zmm0, %zmm2
        vbroadcastf32x4 32(%rdx), %zmm4         // note here
        vbroadcastf32x4 48(%rdx), %zmm3         // note here
        vfmadd132ps     %zmm5, %zmm1, %zmm2
        vpermilps       $170, %zmm0, %zmm1
        vpermilps       $255, %zmm0, %zmm0
        vfmadd132ps     %zmm4, %zmm2, %zmm1
        vfmadd132ps     %zmm3, %zmm1, %zmm0
        vmovups %zmm0, (%rdi)
        vzeroupper
        ret

gcc-11 (missing vbroadcastf32x4; the loads are folded into {1to16} memory
operands instead):

        endbr64
        vmovups (%rsi), %zmm0
        vpermilps       $0, %zmm0, %zmm1
        vmulps  (%rdx){1to16}, %zmm1, %zmm1
        vpermilps       $85, %zmm0, %zmm2
        vfmadd132ps     16(%rdx){1to16}, %zmm1, %zmm2
        vpermilps       $170, %zmm0, %zmm1
        vpermilps       $255, %zmm0, %zmm0
        vfmadd132ps     32(%rdx){1to16}, %zmm2, %zmm1
        vfmadd132ps     48(%rdx){1to16}, %zmm1, %zmm0
        vmovups %zmm0, (%rdi)
        vzeroupper
        ret