From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 15 Jul 2021 17:31:20 +0200 (CEST)
From: Richard Biener
To: Richard Sandiford
cc: Hongtao Liu, Hongtao Liu, GCC Patches
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
References: <73rrp0p-859r-oq2n-pss7-6744807s3qr5@fhfr.qr>
User-Agent: Alpine 2.21 (LSU 202 2017-01-01)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
List-Id: Gcc-patches mailing list

On Thu, 15 Jul 2021, Richard Biener wrote:

> On Thu, 15 Jul 2021, Richard Sandiford wrote:
> 
> > Richard Biener writes:
> > > On Thu, 15 Jul 2021, Hongtao Liu wrote:
> > > 
> > >> On Thu, Jul 15, 2021 at 6:45 PM Richard Biener via Gcc-patches
> > >> wrote:
> > >> >
> > >> > On Thu, Jul 15, 2021 at 12:30 PM Richard Biener wrote:
> > >> > >
> > >> > > The following extends the existing loop masking support using
> > >> > > SVE WHILE_ULT to x86 by providing an alternate way to produce the
> > >> > > mask using VEC_COND_EXPRs.  So with --param vect-partial-vector-usage
> > >> > > you can now enable masked vectorized epilogues (=1) or fully
> > >> > > masked vector loops (=2).
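
For illustration, here is a rough scalar C model of what the two --param
values mean for the "b[i] += a[i]" loop used in the examples below (this
is not the patch itself; VF, the function names and the per-lane inner
loops are made up for exposition):

/* Illustrative model only.  VF stands for the vectorization factor;
   each inner loop over l models one (possibly predicated) vector
   statement.  */
enum { VF = 8 };

/* --param vect-partial-vector-usage=2: fully masked loop, every
   vector iteration runs under a lane mask, no scalar epilogue.  */
void fully_masked (unsigned *b, const unsigned *a, int n)
{
  for (int i = 0; i < n; i += VF)
    for (int l = 0; l < VF; ++l)
      if (i + l < n)              /* the per-lane loop mask */
        b[i + l] += a[i + l];
}

/* --param vect-partial-vector-usage=1: unmasked main loop plus one
   masked vector epilogue iteration instead of a scalar tail.  */
void masked_epilogue (unsigned *b, const unsigned *a, int n)
{
  int i = 0;
  for (; i + VF <= n; i += VF)    /* unmasked vector body */
    for (int l = 0; l < VF; ++l)
      b[i + l] += a[i + l];
  for (int l = 0; l < VF; ++l)    /* single masked tail iteration */
    if (i + l < n)
      b[i + l] += a[i + l];
}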
> > >> > > 
> > >> > > What's missing is using a scalar IV for the loop control
> > >> > > (but in principle AVX512 can use the mask here - just the patch
> > >> > > doesn't seem to work for AVX512 yet for some reason - likely
> > >> > > expand_vec_cond_expr_p doesn't work there).  What's also missing
> > >> > > is providing more support for predicated operations in the case
> > >> > > of reductions either via VEC_COND_EXPRs or via implementing
> > >> > > some of the .COND_{ADD,SUB,MUL...} internal functions as mapping
> > >> > > to masked AVX512 operations.
> > >> > > 
> > >> > > For AVX2 and
> > >> > > 
> > >> > > int foo (unsigned *a, unsigned * __restrict b, int n)
> > >> > > {
> > >> > >   unsigned sum = 1;
> > >> > >   for (int i = 0; i < n; ++i)
> > >> > >     b[i] += a[i];
> > >> > >   return sum;
> > >> > > }
> > >> > > 
> > >> > > we get
> > >> > > 
> > >> > > .L3:
> > >> > >         vpmaskmovd (%rsi,%rax), %ymm0, %ymm3
> > >> > >         vpmaskmovd (%rdi,%rax), %ymm0, %ymm1
> > >> > >         addl $8, %edx
> > >> > >         vpaddd %ymm3, %ymm1, %ymm1
> > >> > >         vpmaskmovd %ymm1, %ymm0, (%rsi,%rax)
> > >> > >         vmovd %edx, %xmm1
> > >> > >         vpsubd %ymm15, %ymm2, %ymm0
> > >> > >         addq $32, %rax
> > >> > >         vpbroadcastd %xmm1, %ymm1
> > >> > >         vpaddd %ymm4, %ymm1, %ymm1
> > >> > >         vpsubd %ymm15, %ymm1, %ymm1
> > >> > >         vpcmpgtd %ymm1, %ymm0, %ymm0
> > >> > >         vptest %ymm0, %ymm0
> > >> > >         jne .L3
> > >> > > 
> > >> > > for the fully masked loop body and for the masked epilogue
> > >> > > we see
> > >> > > 
> > >> > > .L4:
> > >> > >         vmovdqu (%rsi,%rax), %ymm3
> > >> > >         vpaddd (%rdi,%rax), %ymm3, %ymm0
> > >> > >         vmovdqu %ymm0, (%rsi,%rax)
> > >> > >         addq $32, %rax
> > >> > >         cmpq %rax, %rcx
> > >> > >         jne .L4
> > >> > >         movl %edx, %eax
> > >> > >         andl $-8, %eax
> > >> > >         testb $7, %dl
> > >> > >         je .L11
> > >> > > .L3:
> > >> > >         subl %eax, %edx
> > >> > >         vmovdqa .LC0(%rip), %ymm1
> > >> > >         salq $2, %rax
> > >> > >         vmovd %edx, %xmm0
> > >> > >         movl $-2147483648, %edx
> > >> > >         addq %rax, %rsi
> > >> > >         vmovd %edx, %xmm15
> > >> > >         vpbroadcastd %xmm0, %ymm0
> > >> > >         vpbroadcastd %xmm15, %ymm15
> > >> > >         vpsubd %ymm15, %ymm1, %ymm1
> > >> > >         vpsubd %ymm15, %ymm0, %ymm0
> > >> > >         vpcmpgtd %ymm1, %ymm0, %ymm0
> > >> > >         vpmaskmovd (%rsi), %ymm0, %ymm1
> > >> > >         vpmaskmovd (%rdi,%rax), %ymm0, %ymm2
> > >> > >         vpaddd %ymm2, %ymm1, %ymm1
> > >> > >         vpmaskmovd %ymm1, %ymm0, (%rsi)
> > >> > > .L11:
> > >> > >         vzeroupper
> > >> > > 
> > >> > > compared to
> > >> > > 
> > >> > > .L3:
> > >> > >         movl %edx, %r8d
> > >> > >         subl %eax, %r8d
> > >> > >         leal -1(%r8), %r9d
> > >> > >         cmpl $2, %r9d
> > >> > >         jbe .L6
> > >> > >         leaq (%rcx,%rax,4), %r9
> > >> > >         vmovdqu (%rdi,%rax,4), %xmm2
> > >> > >         movl %r8d, %eax
> > >> > >         andl $-4, %eax
> > >> > >         vpaddd (%r9), %xmm2, %xmm0
> > >> > >         addl %eax, %esi
> > >> > >         andl $3, %r8d
> > >> > >         vmovdqu %xmm0, (%r9)
> > >> > >         je .L2
> > >> > > .L6:
> > >> > >         movslq %esi, %r8
> > >> > >         leaq 0(,%r8,4), %rax
> > >> > >         movl (%rdi,%r8,4), %r8d
> > >> > >         addl %r8d, (%rcx,%rax)
> > >> > >         leal 1(%rsi), %r8d
> > >> > >         cmpl %r8d, %edx
> > >> > >         jle .L2
> > >> > >         addl $2, %esi
> > >> > >         movl 4(%rdi,%rax), %r8d
> > >> > >         addl %r8d, 4(%rcx,%rax)
> > >> > >         cmpl %esi, %edx
> > >> > >         jle .L2
> > >> > >         movl 8(%rdi,%rax), %edx
> > >> > >         addl %edx, 8(%rcx,%rax)
> > >> > > .L2:
> > >> > > 
> > >> > > I'm giving this a little testing right now but will dig on why
> > >> > > I don't get masked loops when AVX512 is enabled.
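
To make the AVX2 mask generation above easier to follow, here is a
rough intrinsics sketch of the same idea (illustrative only, not the
vectorizer's output; the function name and the plain counted loop are
made up).  AVX2 has no unsigned vector compare, so both the lane
indices and the bound are biased by INT_MIN and compared with a signed
vpcmpgtd, and the resulting vector doubles as the control operand of
vpmaskmovd:

#include <immintrin.h>
#include <limits.h>

void add_loop_avx2_masked (unsigned *b, const unsigned *a, int n)
{
  const __m256i lanes = _mm256_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7);
  const __m256i bias  = _mm256_set1_epi32 (INT_MIN);
  const __m256i bound = _mm256_sub_epi32 (_mm256_set1_epi32 (n), bias);

  for (int i = 0; i < n; i += 8)
    {
      /* Lane l is active iff (unsigned)(i + l) < (unsigned)n; the
         unsigned compare is emulated by biasing both sides and doing
         a signed greater-than (the vpsubd + vpcmpgtd pair above).  */
      __m256i iv   = _mm256_add_epi32 (_mm256_set1_epi32 (i), lanes);
      __m256i mask = _mm256_cmpgt_epi32 (bound,
                                         _mm256_sub_epi32 (iv, bias));

      /* The same mask guards the loads and the store; the real fully
         masked loop also feeds it to vptest for the back edge.  */
      __m256i va = _mm256_maskload_epi32 ((const int *) (a + i), mask);
      __m256i vb = _mm256_maskload_epi32 ((const int *) (b + i), mask);
      _mm256_maskstore_epi32 ((int *) (b + i), mask,
                              _mm256_add_epi32 (va, vb));
    }
}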
> > >> > 
> > >> > Ah, a simple thinko - rgroup_controls vectypes seem to be
> > >> > always VECTOR_BOOLEAN_TYPE_P and thus we can
> > >> > use expand_vec_cmp_expr_p.  The AVX512 fully masked
> > >> > loop then looks like
> > >> > 
> > >> > .L3:
> > >> >         vmovdqu32 (%rsi,%rax,4), %ymm2{%k1}
> > >> >         vmovdqu32 (%rdi,%rax,4), %ymm1{%k1}
> > >> >         vpaddd %ymm2, %ymm1, %ymm0
> > >> >         vmovdqu32 %ymm0, (%rsi,%rax,4){%k1}
> > >> >         addq $8, %rax
> > >> >         vpbroadcastd %eax, %ymm0
> > >> >         vpaddd %ymm4, %ymm0, %ymm0
> > >> >         vpcmpud $6, %ymm0, %ymm3, %k1
> > >> >         kortestb %k1, %k1
> > >> >         jne .L3
> > >> > 
> > >> > I guess for x86 it's not worth preserving the VEC_COND_EXPR
> > >> > mask generation but other archs may not provide all required vec_cmp
> > >> > expanders.
> > >> 
> > >> For the main loop, the full-masked loop's codegen seems much worse.
> > >> Basically, we need at least 4 instructions to do what while_ult in arm does.
> > >> 
> > >>         vpbroadcastd %eax, %ymm0
> > >>         vpaddd %ymm4, %ymm0, %ymm0
> > >>         vpcmpud $6, %ymm0, %ymm3, %k1
> > >>         kortestb %k1, %k1
> > >> vs
> > >> whilelo (or some other while)
> > >> 
> > >> more instructions are needed for avx2 since there's no direct
> > >> instruction for .COND_{ADD,SUB..}
> > >> 
> > >> original
> > >> .L4:
> > >>         vmovdqu (%rcx,%rax), %ymm1
> > >>         vpaddd (%rdi,%rax), %ymm1, %ymm0
> > >>         vmovdqu %ymm0, (%rcx,%rax)
> > >>         addq $32, %rax
> > >>         cmpq %rax, %rsi
> > >>         jne .L4
> > >> 
> > >> vs
> > >> avx512 full-masked loop
> > >> .L3:
> > >>         vmovdqu32 (%rsi,%rax,4), %ymm2{%k1}
> > >>         vmovdqu32 (%rdi,%rax,4), %ymm1{%k1}
> > >>         vpaddd %ymm2, %ymm1, %ymm0
> > >>         vmovdqu32 %ymm0, (%rsi,%rax,4){%k1}
> > >>         addq $8, %rax
> > >>         vpbroadcastd %eax, %ymm0
> > >>         vpaddd %ymm4, %ymm0, %ymm0
> > >>         vpcmpud $6, %ymm0, %ymm3, %k1
> > >>         kortestb %k1, %k1
> > >>         jne .L3
> > >> 
> > >> vs
> > >> avx2 full-masked loop
> > >> .L3:
> > >>         vpmaskmovd (%rsi,%rax), %ymm0, %ymm3
> > >>         vpmaskmovd (%rdi,%rax), %ymm0, %ymm1
> > >>         addl $8, %edx
> > >>         vpaddd %ymm3, %ymm1, %ymm1
> > >>         vpmaskmovd %ymm1, %ymm0, (%rsi,%rax)
> > >>         vmovd %edx, %xmm1
> > >>         vpsubd %ymm15, %ymm2, %ymm0
> > >>         addq $32, %rax
> > >>         vpbroadcastd %xmm1, %ymm1
> > >>         vpaddd %ymm4, %ymm1, %ymm1
> > >>         vpsubd %ymm15, %ymm1, %ymm1
> > >>         vpcmpgtd %ymm1, %ymm0, %ymm0
> > >>         vptest %ymm0, %ymm0
> > >>         jne .L3
> > >> 
> > >> vs arm64's code
> > >> 
> > >> .L3:
> > >>         ld1w z1.s, p0/z, [x1, x3, lsl 2]
> > >>         ld1w z0.s, p0/z, [x0, x3, lsl 2]
> > >>         add z0.s, z0.s, z1.s
> > >>         st1w z0.s, p0, [x1, x3, lsl 2]
> > >>         add x3, x3, x4
> > >>         whilelo p0.s, w3, w2
> > >>         b.any .L3
> > > 
> > > Yes, that's true - it might still be OK for vectorizing epilogues
> > > and thus --param vect-partial-vector-usage=1
> > > 
> > > Can AVX512 do any better than this?
> > > 
> > >         vpbroadcastd %eax, %ymm0
> > >         vpaddd %ymm4, %ymm0, %ymm0
> > >         vpcmpud $6, %ymm0, %ymm3, %k1
> > > 
> > > Note with multiple types involved things get even worse
> > > since you need masks for each vector mode.  But as far as
> > > I can see that's the same for SVE (but there we have the
> > > single-instruction whilelo).  I guess we'll also generate
> > > wrong code at the moment for the case where we need
> > > multiple vectors to hold the full mask - vect_gen_while
> > > doesn't seem to be prepared for this?
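
For reference, the mask-producing part of the four-instruction AVX512
sequence quoted above (vpbroadcastd / vpaddd / vpcmpud; kortestb only
feeds the branch) corresponds roughly to the following intrinsics
sketch of a WHILE_ULT-style predicate (illustrative only; it assumes
AVX512F plus AVX512VL, and the helper name is made up):

#include <immintrin.h>

/* Mask for lanes i, i+1, ..., i+7 of a V8SI loop with n iterations,
   i.e. an "index < niter" test producing a %k register.  */
static inline __mmask8
while_ult_v8si (unsigned i, unsigned n)
{
  const __m256i lanes = _mm256_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7);
  __m256i iv = _mm256_add_epi32 (_mm256_set1_epi32 ((int) i), lanes);
  /* The code above uses vpcmpud with predicate 6 ("not less or
     equal", i.e. unsigned >) and the operands swapped, which is the
     same condition as this unsigned less-than.  */
  return _mm256_cmplt_epu32_mask (iv, _mm256_set1_epi32 ((int) n));
}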
> > > 
> > > So with
> > > 
> > > int foo (unsigned long *a, unsigned * __restrict b, int n)
> > > {
> > >   unsigned sum = 1;
> > >   for (int i = 0; i < n; ++i)
> > >     {
> > >       b[i] += a[i];
> > >     }
> > >   return sum;
> > > }
> > > 
> > > SVE uses
> > > 
> > > .L3:
> > >         ld1d z0.d, p0/z, [x1, x3, lsl 3]
> > >         ld1d z1.d, p0/z, [x0, x3, lsl 3]
> > >         adr z0.d, [z0.d, z1.d, lsl 2]
> > >         st1d z0.d, p0, [x1, x3, lsl 3]
> > >         add x3, x3, x4
> > >         whilelo p0.d, w3, w2
> > >         b.any .L3
> > > 
> > > so p0 vs. p0/z, whatever that means, and it looks like
> > > the vector add can somehow concatenate z0.d and z1.d.
> > > Truly fascinating ;)
> > 
> > It looks like this is from a version in which “b” was also long.
> > For the loop above I get:
> > 
> > .L3:
> >         ld1d z0.d, p0/z, [x0, x3, lsl 3]
> >         ld1w z1.d, p0/z, [x1, x3, lsl 2]
> >         add z0.s, z0.s, z1.s
> >         st1w z0.d, p0, [x1, x3, lsl 2]
> >         add x3, x3, x4
> >         whilelo p0.d, w3, w2
> >         b.any .L3
> > 
> > where the ld1w is an extending load and the st1w is a truncating store.
> > 
> > But yeah, there's a hard-coded assumption that a mask for 2 SIs
> > can also control 1 DI, etc.  For example:
> > 
> > void foo (unsigned long *a, unsigned * __restrict b, int n)
> > {
> >   for (int i = 0; i < n; ++i)
> >     {
> >       a[i * 2] += 1;
> >       a[i * 2 + 1] += 2;
> >       b[i * 4] += 1;
> >       b[i * 4 + 1] += 2;
> >       b[i * 4 + 2] += 3;
> >       b[i * 4 + 3] += 4;
> >     }
> > }
> > 
> > becomes:
> > 
> > .L3:
> >         ld1d z0.d, p0/z, [x0]
> >         add z0.d, z0.d, z2.d
> >         st1d z0.d, p0, [x0]
> >         ld1w z0.s, p0/z, [x1, x3, lsl 2]
> >         add z0.s, z0.s, z1.s
> >         st1w z0.s, p0, [x1, x3, lsl 2]
> >         add x0, x0, x4
> >         add x3, x3, x5
> >         whilelo p0.s, x3, x2
> >         b.any .L3
> > 
> > This is explained more in the big comment above rgroup_controls
> > (guess you've already seen it).
> 
> OK, guess I was more looking at
> 
> #define N 32
> int foo (unsigned long *a, unsigned long * __restrict b,
>          unsigned int *c, unsigned int * __restrict d,
>          int n)
> {
>   unsigned sum = 1;
>   for (int i = 0; i < n; ++i)
>     {
>       b[i] += a[i];
>       d[i] += c[i];
>     }
>   return sum;
> }
> 
> where on x86 with AVX512 we vectorize with V8DI and V16SI and we
> generate two masks for the two copies of V8DI (VF is 16) and one
> mask for V16SI.  With SVE I see
> 
>         punpklo p1.h, p0.b
>         punpkhi p2.h, p0.b
> 
> that's something I expected to see for AVX512 as well, using the V16SI
> mask and unpacking that to two V8DI ones.  But I see
> 
>         vpbroadcastd %eax, %ymm0
>         vpaddd %ymm12, %ymm0, %ymm0
>         vpcmpud $6, %ymm0, %ymm11, %k3
>         vpbroadcastd %eax, %xmm0
>         vpaddd %xmm10, %xmm0, %xmm0
>         vpcmpud $1, %xmm7, %xmm0, %k1
>         vpcmpud $6, %xmm0, %xmm8, %k2
>         kortestb %k1, %k1
>         jne .L3
> 
> so three %k masks are generated by vpcmpud.  I'll have to look at what
> the magic for SVE is and why that doesn't trigger for x86 here.

So, answering myself: vect_maybe_permute_loop_masks looks for
vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have QImode,
so that doesn't play well here.  Not sure if there are proper mask
instructions to use (I guess there's a shift, and the low part is
free) - this is a QI:8 to two QI:4 (bits) mask conversion.  Not sure
how to better ask the target here - again, VnBImode might have been
easier here.

Something makes the loop not be handled (masked) at all with AVX2.

Richard.

> Richard.
> 
> > > It looks like --param vect-partial-vector-usage defaults to 2,
> > > power forces it to 1 (power10) or 0 otherwise.
> > > 
> > > I think we'd need a target hook that toggles this per mode
> > > so we could tune this dependent on AVX512 vectorization vs. AVX2.
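
As a rough illustration of the "shift, and the low part is free" remark
above: AVX512 loop masks are just small integers in %k registers
(__mmask8, __mmask16, ...), so getting the two V8DI masks out of a
V16SI mask needs no punpklo/punpkhi-style unpack at all.  The sketch
below shows that case with plain integer operations on the mask types
(the helper name is made up, and the concrete problem mentioned above
is the analogous QI:8 to two QI:4 split, for which no 4-bit mask type
exists):

#include <immintrin.h>

/* Split the mask of a V16SI rgroup into the two masks needed for the
   corresponding pair of V8DI vectors.  */
static inline void
split_v16si_mask (__mmask16 m, __mmask8 *lo, __mmask8 *hi)
{
  *lo = (__mmask8) m;         /* low 8 mask bits: free, just reuse them */
  *hi = (__mmask8) (m >> 8);  /* high 8 mask bits: one shift (kshiftrw) */
}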
> > 
> > Yeah, keying it off the architecture made sense for Power because the
> > only purpose of len_load/store is to support partial vectorisation.
> > But I agree a hook is needed now that we have MASK_LOAD/STORE for
> > general use and are emulating WHILE_ULT.
> > 
> > Thanks,
> > Richard

-- 
Richard Biener
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)