From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 90B5F3858C56; Tue, 27 Feb 2024 06:02:56 +0000 (GMT)
From: "liuhongt at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/112325] Missed vectorization of reduction
 after unrolling
Date: Tue, 27 Feb 2024 06:02:56 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: liuhongt at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-112325-4-L1LMIcsL4c@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-112325-4@http.gcc.gnu.org/bugzilla/>
References: <bug-112325-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #9 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
The original case is a little different from the one in the PR.
It comes from ggml:

#include <stdint.h>
#include <string.h>

typedef uint16_t ggml_fp16_t;
static float table_f32_f16[1 << 16];

inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t));
    return table_f32_f16[s];
}

typedef struct {
    ggml_fp16_t d;
    ggml_fp16_t m;
    uint8_t qh[4];
    uint8_t qs[32 / 2];
} block_q5_1;

typedef struct {
    float d;
    float s;
    int8_t qs[32];
} block_q8_1;

void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int qk = 32;
    const int nb = n / qk;

    const block_q5_1 * restrict x = vx;
    const block_q8_1 * restrict y = vy;

    float sumf = 0.0;

    for (int i = 0; i < nb; i++) {
        uint32_t qh;
        memcpy(&qh, x[i].qh, sizeof(qh));

        int sumi = 0;

        for (int j = 0; j < qk/2; ++j) {
            const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
            const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10;

            const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
            const int32_t x1 = (x[i].qs[j] >> 4) | xh_1;

            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
        }

        sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi + ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
    }

    *s = sumf;
}
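
For anyone reducing this further, the inner j loop (the integer reduction into sumi) can be looked at in isolation. The sketch below is not taken from ggml or from the PR. q5_dot_fused copies the inner loop of the test case unchanged, while QK, q5_unpack, q5_dot_split and q5_check_agree are hypothetical names introduced here for illustration. It only assumes what the shifts in the test case already compute: bit j of qh is the fifth bit of the low-nibble element j, and bit j + 16 is the fifth bit of the high-nibble element j.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { QK = 32 }; /* block size, same value as qk in the test case */

/* Inner reduction exactly as written in the test case, for one block. */
static int q5_dot_fused(const uint8_t qh_bytes[4], const uint8_t qs[QK / 2],
                        const int8_t y[QK]) {
    uint32_t qh;
    memcpy(&qh, qh_bytes, sizeof(qh));

    int sumi = 0;
    for (int j = 0; j < QK / 2; ++j) {
        const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
        const uint8_t xh_1 = ((qh >> (j + 12))) & 0x10;

        const int32_t x0 = (qs[j] & 0xF) | xh_0;
        const int32_t x1 = (qs[j] >> 4) | xh_1;

        sumi += (x0 * y[j]) + (x1 * y[j + QK / 2]);
    }
    return sumi;
}

/* Hypothetical helper: expand one block into 32 five-bit values. */
static void q5_unpack(const uint8_t qh_bytes[4], const uint8_t qs[QK / 2],
                      uint8_t out[QK]) {
    uint32_t qh;
    memcpy(&qh, qh_bytes, sizeof(qh));

    for (int j = 0; j < QK / 2; ++j) {
        out[j]          = (uint8_t)((qs[j] & 0xF) | (((qh >> j) & 1u) << 4));
        out[j + QK / 2] = (uint8_t)((qs[j] >> 4) | (((qh >> (j + 16)) & 1u) << 4));
    }
}

/* Hypothetical two-step form: unpack first, then a plain dot product. */
static int q5_dot_split(const uint8_t qh_bytes[4], const uint8_t qs[QK / 2],
                        const int8_t y[QK]) {
    uint8_t x[QK];
    q5_unpack(qh_bytes, qs, x);

    int sumi = 0;
    for (int k = 0; k < QK; ++k)
        sumi += x[k] * y[k];
    return sumi;
}

/* Both formulations are expected to produce the same sum for any input. */
static void q5_check_agree(const uint8_t qh_bytes[4],
                           const uint8_t qs[QK / 2], const int8_t y[QK]) {
    assert(q5_dot_fused(qh_bytes, qs, y) == q5_dot_split(qh_bytes, qs, y));
}
```

Calling q5_check_agree on a few blocks is a cheap way to confirm that a reduced variant still computes the same sumi as the original loop before comparing what the vectorizer does with each form. No claim is made here about which form GCC vectorizes.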