From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by sourceware.org (Postfix) with ESMTPS id 187BE384843E for ; Tue, 13 Jul 2021 14:59:19 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 187BE384843E Received: from pps.filterd (m0098399.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 16DEZLS8159231; Tue, 13 Jul 2021 10:59:15 -0400 Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com with ESMTP id 39qrf89dck-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 13 Jul 2021 10:59:15 -0400 Received: from m0098399.ppops.net (m0098399.ppops.net [127.0.0.1]) by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 16DEZOIk159542; Tue, 13 Jul 2021 10:59:14 -0400 Received: from ppma06ams.nl.ibm.com (66.31.33a9.ip4.static.sl-reverse.com [169.51.49.102]) by mx0a-001b2d01.pphosted.com with ESMTP id 39qrf89dbg-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 13 Jul 2021 10:59:14 -0400 Received: from pps.filterd (ppma06ams.nl.ibm.com [127.0.0.1]) by ppma06ams.nl.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 16DEwuGL026584; Tue, 13 Jul 2021 14:59:12 GMT Received: from b06cxnps4075.portsmouth.uk.ibm.com (d06relay12.portsmouth.uk.ibm.com [9.149.109.197]) by ppma06ams.nl.ibm.com with ESMTP id 39q2th9bsk-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 13 Jul 2021 14:59:12 +0000 Received: from d06av25.portsmouth.uk.ibm.com (d06av25.portsmouth.uk.ibm.com [9.149.105.61]) by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 16DEx99622348264 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 13 Jul 2021 14:59:10 GMT Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 9DAA011C064; Tue, 13 Jul 2021 14:59:09 +0000 (GMT) Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 6DDB411C054; Tue, 13 Jul 2021 14:59:07 +0000 (GMT) Received: from KewenLins-MacBook-Pro.local (unknown [9.197.234.122]) by d06av25.portsmouth.uk.ibm.com (Postfix) with ESMTP; Tue, 13 Jul 2021 14:59:06 +0000 (GMT) Subject: Re: [RFC/PATCH] vect: Recog mul_highpart pattern To: Richard Biener Cc: GCC Patches , Richard Sandiford , Segher Boessenkool , Bill Schmidt References: From: "Kewen.Lin" Message-ID: Date: Tue, 13 Jul 2021 22:59:03 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: Qbmm3IPmLZTuo5DYuLcZIhznEnnW0u7y X-Proofpoint-GUID: ShLcrO7QhJQu_hxvT2sLRipYBSXqyAqG X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790 definitions=2021-07-13_07:2021-07-13, 2021-07-13 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 mlxscore=0 adultscore=0 mlxlogscore=999 phishscore=0 lowpriorityscore=0 spamscore=0 impostorscore=0 bulkscore=0 suspectscore=0 priorityscore=1501 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2107130093 X-Spam-Status: No, score=-3.8 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, NICE_REPLY_A, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SCC_5_SHORT_WORD_LINES, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=no autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 13 Jul 2021 14:59:21 -0000 on 2021/7/13 下午8:42, Richard Biener wrote: > On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin wrote: >> >> Hi Richi, >> >> Thanks for the comments! >> >> on 2021/7/13 下午5:35, Richard Biener wrote: >>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin wrote: >>>> >>>> Hi, >>>> >>>> When I added the support for Power10 newly introduced multiply >>>> highpart instrutions, I noticed that currently vectorizer >>>> doesn't try to vectorize multiply highpart pattern, I hope >>>> this isn't intentional? >>>> >>>> This patch is to extend the existing pattern mulhs handlings >>>> to cover multiply highpart. Another alternative seems to >>>> recog mul_highpart operation in a general place applied for >>>> scalar code when the target supports the optab for the scalar >>>> operation, it's based on the assumption that one target which >>>> supports vector version of multiply highpart should have the >>>> scalar version. I noticed that the function can_mult_highpart_p >>>> can check/handle mult_highpart well even without mul_highpart >>>> optab support, I think to recog this pattern in vectorizer >>>> is better. Is it on the right track? >>> >>> I think it's on the right track, using IFN_LAST is a bit awkward >>> in case yet another case pops up so maybe you can use >>> a code_helper instance instead which unifies tree_code, >>> builtin_code and internal_fn? >>> >> >> If there is one new requirement which doesn't have/introduce IFN >> stuffs but have one existing tree_code, can we add one more field >> with type tree_code, then for the IFN_LAST path we can check the >> different requirements under the guard with that tree_code variable? >> >>> I also notice that can_mult_highpart_p will return true if >>> only vec_widen_[us]mult_{even,odd,hi,lo} are available, >>> but then the result might be less optimal (or even not >>> handled later)? >>> >> >> I think it will be handled always? The expander calls >> >> rtx >> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1, >> rtx target, bool uns_p) >> >> which will further check with can_mult_highpart_p. >> >> For the below case, >> >> #define SHT_CNT 16 >> >> __attribute__ ((noipa)) void >> test () >> { >> for (int i = 0; i < N; i++) >> sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16; >> } >> >> Without this patch, it use widen_mult like below: >> >> vect__1.5_19 = MEM [(short int *)&sh_a + ivtmp.18_24 * 1]; >> vect__3.8_14 = MEM [(short int *)&sh_b + ivtmp.18_24 * 1]; >> vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR ; >> vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR ; >> vect__6.10_25 = vect_patt_22.9_13 >> 16; >> vect__6.10_26 = vect_patt_22.9_9 >> 16; >> vect__7.11_27 = VEC_PACK_TRUNC_EXPR ; >> MEM [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27; >> >> .L2: >> lxvx 33,7,9 >> lxvx 32,8,9 >> vmulosh 13,0,1 // widen mult >> vmulesh 0,0,1 >> xxmrglw 33,32,45 // merge >> xxmrghw 32,32,45 >> vsraw 1,1,12 // shift >> vsraw 0,0,12 >> vpkuwum 0,0,1 // pack >> stxvx 32,10,9 >> addi 9,9,16 >> bdnz .L2 >> >> >> With this patch, it ends up with: >> >> vect__1.5_14 = MEM [(short int *)&sh_a + ivtmp.17_24 * 1]; >> vect__3.8_8 = MEM [(short int *)&sh_b + ivtmp.17_24 * 1]; >> vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14; >> MEM [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25; > > Yes, so I'm curious what it ends up with/without the patch on x86_64 which > can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart. > For test case: ``` #define N 32 typedef signed int bigType; typedef signed short smallType; #define SH_CNT 16 extern smallType small_a[N], small_b[N], small_c[N]; __attribute__((noipa)) void test_si(int n) { for (int i = 0; i < n; i++) small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT; } ``` on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize 1) without this patch, the pattern isn't recognized, the IR looks like: [local count: 94607391]: bnd.5_34 = niters.4_25 >> 3; _13 = (sizetype) bnd.5_34; _29 = _13 * 16; [local count: 378429566]: # ivtmp.34_4 = PHI vect__1.10_40 = MEM [(short int *)&small_a + ivtmp.34_4 * 1]; vect__3.13_43 = MEM [(short int *)&small_b + ivtmp.34_4 * 1]; vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR ; vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR ; vect__6.15_46 = vect_patt_18.14_44 >> 16; vect__6.15_47 = vect_patt_18.14_45 >> 16; vect__7.16_48 = VEC_PACK_TRUNC_EXPR ; MEM [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48; ivtmp.34_5 = ivtmp.34_4 + 16; if (ivtmp.34_5 != _29) goto ; [75.00%] else goto ; [25.00%] ... *asm*: .L4: movdqu small_b(%rax), %xmm3 movdqu small_a(%rax), %xmm1 addq $16, %rax movdqu small_a-16(%rax), %xmm2 pmullw %xmm3, %xmm1 pmulhw %xmm3, %xmm2 movdqa %xmm1, %xmm0 punpcklwd %xmm2, %xmm0 punpckhwd %xmm2, %xmm1 psrad $16, %xmm1 psrad $16, %xmm0 movdqa %xmm0, %xmm2 punpcklwd %xmm1, %xmm0 punpckhwd %xmm1, %xmm2 movdqa %xmm0, %xmm1 punpckhwd %xmm2, %xmm1 punpcklwd %xmm2, %xmm0 punpcklwd %xmm1, %xmm0 movups %xmm0, small_c-16(%rax) cmpq %rdx, %rax jne .L4 movl %edi, %eax andl $-8, %eax testb $7, %dil je .L14 *insn dist in loop* 1 addq 3 movdqa 3 movdqu 1 movups 1 pmulhw 1 pmullw 2 psrad 3 punpckhwd 4 punpcklwd 2) with this patch but make the mul_highpart optab query return false, the IR looks like: [local count: 94607391]: bnd.5_37 = niters.4_22 >> 3; _13 = (sizetype) bnd.5_37; _32 = _13 * 16; [local count: 378429566]: # ivtmp.33_4 = PHI vect__1.10_43 = MEM [(short int *)&small_a + ivtmp.33_4 * 1]; vect__3.13_46 = MEM [(short int *)&small_b + ivtmp.33_4 * 1]; vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46; MEM [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47; ivtmp.33_5 = ivtmp.33_4 + 16; if (ivtmp.33_5 != _32) goto ; [75.00%] else goto ; [25.00%] *asm*: .L4: movdqu small_b(%rax), %xmm3 movdqu small_a(%rax), %xmm1 addq $16, %rax movdqu small_a-16(%rax), %xmm2 pmullw %xmm3, %xmm1 pmulhw %xmm3, %xmm2 movdqa %xmm1, %xmm0 punpcklwd %xmm2, %xmm0 punpckhwd %xmm2, %xmm1 movdqa %xmm0, %xmm2 punpcklwd %xmm1, %xmm0 punpckhwd %xmm1, %xmm2 movdqa %xmm0, %xmm1 punpckhwd %xmm2, %xmm1 punpcklwd %xmm2, %xmm0 punpckhwd %xmm1, %xmm0 movups %xmm0, small_c-16(%rax) cmpq %rdx, %rax jne .L4 movl %edi, %eax andl $-8, %eax testb $7, %dil je .L14 *insn dist*: 1 addq 3 movdqa 3 movdqu 1 movups 1 pmulhw 1 pmullw 4 punpckhwd 3 punpcklwd I know little on i386 asm, but this seems better from insn distribution, the one without this patch uses two more psrad (s). 3) FWIW with this patch as well as existing optab supports, the IR looks like: [local count: 94607391]: bnd.5_40 = niters.4_19 >> 3; _75 = (sizetype) bnd.5_40; _91 = _75 * 16; [local count: 378429566]: # ivtmp.48_53 = PHI vect__1.10_48 = MEM [(short int *)&small_a + ivtmp.48_53 * 1]; vect__3.13_51 = MEM [(short int *)&small_b + ivtmp.48_53 * 1]; vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51; MEM [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52; ivtmp.48_45 = ivtmp.48_53 + 16; if (ivtmp.48_45 != _91) goto ; [75.00%] else goto ; [25.00%] [local count: 94607391]: niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288; tmp.7_42 = (int) niters_vector_mult_vf.6_41; _61 = niters.4_19 & 7; if (_61 == 0) goto ; [12.50%] else goto ; [87.50%] [local count: 93293400]: # i_38 = PHI # _44 = PHI niters.18_60 = niters.4_19 - _44; _76 = niters.18_60 + 4294967295; if (_76 <= 2) goto ; [10.00%] else goto ; [90.00%] [local count: 167928121]: _85 = (sizetype) _44; _86 = _85 * 2; vectp_small_a.23_84 = &small_a + _86; vectp_small_b.26_90 = &small_b + _86; vectp_small_c.31_98 = &small_c + _86; vect__9.24_89 = MEM [(short int *)vectp_small_a.23_84]; vect__28.27_95 = MEM [(short int *)vectp_small_b.26_90]; vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95; MEM [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96; niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292; _82 = (int) niters_vector_mult_vf.20_80; tmp.21_81 = i_38 + _82; _74 = niters.18_60 & 3; if (_74 == 0) goto ; [25.00%] else goto ; [75.00%] ... *asm*: .L4: movdqu small_a(%rax), %xmm0 movdqu small_b(%rax), %xmm2 addq $16, %rax pmulhw %xmm2, %xmm0 movups %xmm0, small_c-16(%rax) cmpq %rdx, %rax jne .L4 movl %ecx, %edx andl $-8, %edx movl %edx, %eax testb $7, %cl je .L19 .L3: movl %ecx, %esi subl %edx, %esi leal -1(%rsi), %edi cmpl $2, %edi jbe .L6 movq small_a(%rdx,%rdx), %xmm0 movq small_b(%rdx,%rdx), %xmm1 pmulhw %xmm1, %xmm0 movq %xmm0, small_c(%rdx,%rdx) movl %esi, %edx andl $-4, %edx addl %edx, %eax andl $3, %esi je .L1 I guess the proposed IFN would be directly mapped for [us]mul_highpart? or do you expect it will also cover what we support in can_mult_highpart_p? BR, Kewen