From: Richard Biener
To: Richard Sandiford
Cc: Hongtao Liu, Hongtao Liu, GCC Patches
Date: Fri, 16 Jul 2021 11:11:53 +0200 (CEST)
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86

On Thu, 15 Jul 2021, Richard Biener wrote:

> On Thu, 15 Jul 2021, Richard Biener wrote:
>
> > OK, I guess I was looking more at
> >
> > #define N 32
> > int foo (unsigned long *a, unsigned long * __restrict b,
> >          unsigned int *c, unsigned int * __restrict d,
> >          int n)
> > {
> >   unsigned sum = 1;
> >   for (int i = 0; i < n; ++i)
> >     {
> >       b[i] += a[i];
> >       d[i] += c[i];
> >     }
> >   return sum;
> > }
> >
> > where on x86 with AVX512 we vectorize with V8DI and V16SI and
> > generate two masks for the two copies of V8DI (the VF is 16) and
> > one mask for V16SI.  With SVE I see
> >
> >         punpklo p1.h, p0.b
> >         punpkhi p2.h, p0.b
> >
> > which is something I expected to see for AVX512 as well, using the
> > V16SI mask and unpacking it into two V8DI ones.
> > But I see
> >
> >         vpbroadcastd    %eax, %ymm0
> >         vpaddd  %ymm12, %ymm0, %ymm0
> >         vpcmpud $6, %ymm0, %ymm11, %k3
> >         vpbroadcastd    %eax, %xmm0
> >         vpaddd  %xmm10, %xmm0, %xmm0
> >         vpcmpud $1, %xmm7, %xmm0, %k1
> >         vpcmpud $6, %xmm0, %xmm8, %k2
> >         kortestb        %k1, %k1
> >         jne     .L3
> >
> > so three %k masks generated by vpcmpud.  I'll have to look at what
> > the magic is for SVE and why it doesn't trigger for x86 here.
>
> So, answering myself: vect_maybe_permute_loop_masks looks for
> vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> QImode, so that doesn't play well here.  I'm not sure whether there
> are proper mask instructions to use (I guess there's a shift, and
> the low part is free).  This is a QI:8 to two QI:4 (bits) mask
> conversion.  Not sure how to better ask the target here - again,
> VnBImode might have been easier.

So I've managed to "emulate" the unpack_lo/hi for the case of
!VECTOR_MODE_P masks by using a sub-vector select (we're asking to
turn a vector(8) <signed-boolean:1> into two
vector(4) <signed-boolean:1>) via BIT_FIELD_REF.  That then produces
the desired single mask producer and

  loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
  loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;

Note that for the low part we can just view-convert away the excess
bits, fully re-using the mask.  We generate surprisingly "good" code:

        kmovb   %k1, %edi
        shrb    $4, %dil
        kmovb   %edi, %k2

apart from not using kshiftrb.  I guess we're just lacking a mask
register alternative for

(insn 22 20 25 4 (parallel [
            (set (reg:QI 94 [ loop_mask_37 ])
                (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
                    (const_int 4 [0x4])))
            (clobber (reg:CC 17 flags))
        ]) 724 {*lshrqi3_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

and so we reload.  For the loop cited above, the AVX512 vectorization
with --param vect-partial-vector-usage=1 does look quite sensible to
me.  Instead of an SSE-vectorized epilogue plus a scalar epilogue we
get a single fully masked AVX512 "iteration" covering both.  I suppose
it's still mostly a code-size optimization (384 bytes with the masked
epilogue vs. 474 bytes with trunk), since it will likely be slower for
very low iteration counts, but it's good for icache usage and puts
less pressure on the branch predictor.

That said, I still have to set up SPEC on an AVX512 machine to do any
meaningful measurements (I suspect that with just AVX2 we're not going
to see any benefit from masking).  Hints/help on how to fix the
missing kshiftrb are appreciated.

Oh, and if there's only V4DImode and V16HImode data then we don't go
down the vect_maybe_permute_loop_masks path - that is, we don't
generate the (unused) intermediate mask but end up generating two
while_ult parts.

Thanks,
Richard.
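
P.S. For illustration only, not part of the patch: a minimal
hand-written intrinsics sketch of the QI:8 to two QI:4 mask split
discussed above, assuming AVX512VL and AVX512DQ are available; the
function and variable names are made up.  The intent is that both
derived masks stay in %k registers, with kshiftrb producing the high
half instead of the kmovb/shrb/kmovb round trip shown above.

#include <immintrin.h>

/* Process 8 unsigned ints as one masked V8SI operation and 8 unsigned
   longs as two masked V4DI operations, deriving the V4DI masks from
   the single 8-bit mask M.  The high half uses _kshiftri_mask8, which
   maps to kshiftrb.  */
void
masked_tail (unsigned long *b, const unsigned long *a,
             unsigned int *d, const unsigned int *c, __mmask8 m)
{
  __mmask8 m_lo = m;                       /* only the low 4 bits are consulted */
  __mmask8 m_hi = _kshiftri_mask8 (m, 4);  /* kshiftrb $4 for the high half */

  /* b[i] += a[i] for the two V4DI halves.  */
  _mm256_mask_storeu_epi64 (b, m_lo,
      _mm256_add_epi64 (_mm256_maskz_loadu_epi64 (m_lo, b),
                        _mm256_maskz_loadu_epi64 (m_lo, a)));
  _mm256_mask_storeu_epi64 (b + 4, m_hi,
      _mm256_add_epi64 (_mm256_maskz_loadu_epi64 (m_hi, b + 4),
                        _mm256_maskz_loadu_epi64 (m_hi, a + 4)));

  /* d[i] += c[i] for the single V8SI copy, using M unchanged.  */
  _mm256_mask_storeu_epi32 (d, m,
      _mm256_add_epi32 (_mm256_maskz_loadu_epi32 (m, d),
                        _mm256_maskz_loadu_epi32 (m, c)));
}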