From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id 187BE384843E
 for <gcc-patches@gcc.gnu.org>; Tue, 13 Jul 2021 14:59:19 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 187BE384843E
Received: from pps.filterd (m0098399.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id
 16DEZLS8159231; Tue, 13 Jul 2021 10:59:15 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com with ESMTP id 39qrf89dck-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 13 Jul 2021 10:59:15 -0400
Received: from m0098399.ppops.net (m0098399.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 16DEZOIk159542;
 Tue, 13 Jul 2021 10:59:14 -0400
Received: from ppma06ams.nl.ibm.com (66.31.33a9.ip4.static.sl-reverse.com
 [169.51.49.102])
 by mx0a-001b2d01.pphosted.com with ESMTP id 39qrf89dbg-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 13 Jul 2021 10:59:14 -0400
Received: from pps.filterd (ppma06ams.nl.ibm.com [127.0.0.1])
 by ppma06ams.nl.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 16DEwuGL026584;
 Tue, 13 Jul 2021 14:59:12 GMT
Received: from b06cxnps4075.portsmouth.uk.ibm.com
 (d06relay12.portsmouth.uk.ibm.com [9.149.109.197])
 by ppma06ams.nl.ibm.com with ESMTP id 39q2th9bsk-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 13 Jul 2021 14:59:12 +0000
Received: from d06av25.portsmouth.uk.ibm.com (d06av25.portsmouth.uk.ibm.com
 [9.149.105.61])
 by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 16DEx99622348264
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Tue, 13 Jul 2021 14:59:10 GMT
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 9DAA011C064;
 Tue, 13 Jul 2021 14:59:09 +0000 (GMT)
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 6DDB411C054;
 Tue, 13 Jul 2021 14:59:07 +0000 (GMT)
Received: from KewenLins-MacBook-Pro.local (unknown [9.197.234.122])
 by d06av25.portsmouth.uk.ibm.com (Postfix) with ESMTP;
 Tue, 13 Jul 2021 14:59:06 +0000 (GMT)
Subject: Re: [RFC/PATCH] vect: Recog mul_highpart pattern
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
 Richard Sandiford <richard.sandiford@arm.com>,
 Segher Boessenkool <segher@kernel.crashing.org>,
 Bill Schmidt <wschmidt@linux.ibm.com>
References: <da469973-d874-1eb6-739f-048831e8a898@linux.ibm.com>
 <CAFiYyc11OKHBLoWG=Z3dbNY8cc=jvxochq1OgHn7DJyASvQPqg@mail.gmail.com>
 <ff874f39-b97f-652e-3d4f-a2b43b4d4892@linux.ibm.com>
 <CAFiYyc0_UsOt+_xSdKCe1hgW3M4pD=PQrRSu6wJQZ9f8c4zY2A@mail.gmail.com>
From: "Kewen.Lin" <linkw@linux.ibm.com>
Message-ID: <b9af40e6-f65c-4f40-3c05-0536e28be097@linux.ibm.com>
Date: Tue, 13 Jul 2021 22:59:03 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0)
 Gecko/20100101 Thunderbird/78.10.0
MIME-Version: 1.0
In-Reply-To: <CAFiYyc0_UsOt+_xSdKCe1hgW3M4pD=PQrRSu6wJQZ9f8c4zY2A@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-TM-AS-GCONF: 00
X-Proofpoint-ORIG-GUID: Qbmm3IPmLZTuo5DYuLcZIhznEnnW0u7y
X-Proofpoint-GUID: ShLcrO7QhJQu_hxvT2sLRipYBSXqyAqG
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790
 definitions=2021-07-13_07:2021-07-13,
 2021-07-13 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 clxscore=1015 mlxscore=0
 adultscore=0 mlxlogscore=999 phishscore=0 lowpriorityscore=0 spamscore=0
 impostorscore=0 bulkscore=0 suspectscore=0 priorityscore=1501
 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2104190000 definitions=main-2107130093
X-Spam-Status: No, score=-3.8 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, NICE_REPLY_A, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL,
 SCC_5_SHORT_WORD_LINES, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=no autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 13 Jul 2021 14:59:21 -0000

on 2021/7/13 下午8:42, Richard Biener wrote:
> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richi,
>>
>> Thanks for the comments!
>>
>> on 2021/7/13 下午5:35, Richard Biener wrote:
>>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> When I added the support for Power10 newly introduced multiply
>>>> highpart instrutions, I noticed that currently vectorizer
>>>> doesn't try to vectorize multiply highpart pattern, I hope
>>>> this isn't intentional?
>>>>
>>>> This patch is to extend the existing pattern mulhs handlings
>>>> to cover multiply highpart.  Another alternative seems to
>>>> recog mul_highpart operation in a general place applied for
>>>> scalar code when the target supports the optab for the scalar
>>>> operation, it's based on the assumption that one target which
>>>> supports vector version of multiply highpart should have the
>>>> scalar version.  I noticed that the function can_mult_highpart_p
>>>> can check/handle mult_highpart well even without mul_highpart
>>>> optab support, I think to recog this pattern in vectorizer
>>>> is better.  Is it on the right track?
>>>
>>> I think it's on the right track, using IFN_LAST is a bit awkward
>>> in case yet another case pops up so maybe you can use
>>> a code_helper instance instead which unifies tree_code,
>>> builtin_code and internal_fn?
>>>
>>
>> If there is one new requirement which doesn't have/introduce IFN
>> stuffs but have one existing tree_code, can we add one more field
>> with type tree_code, then for the IFN_LAST path we can check the
>> different requirements under the guard with that tree_code variable?
>>
>>> I also notice that can_mult_highpart_p will return true if
>>> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
>>> but then the result might be less optimal (or even not
>>> handled later)?
>>>
>>
>> I think it will be handled always?  The expander calls
>>
>> rtx
>> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
>>                       rtx target, bool uns_p)
>>
>> which will further check with can_mult_highpart_p.
>>
>> For the below case,
>>
>> #define SHT_CNT 16
>>
>> __attribute__ ((noipa)) void
>> test ()
>> {
>>   for (int i = 0; i < N; i++)
>>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
>> }
>>
>> Without this patch, it use widen_mult like below:
>>
>>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
>>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
>>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
>>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
>>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
>>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
>>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
>>
>> .L2:
>>         lxvx 33,7,9
>>         lxvx 32,8,9
>>         vmulosh 13,0,1    // widen mult
>>         vmulesh 0,0,1
>>         xxmrglw 33,32,45  // merge
>>         xxmrghw 32,32,45
>>         vsraw 1,1,12      // shift
>>         vsraw 0,0,12
>>         vpkuwum 0,0,1     // pack
>>         stxvx 32,10,9
>>         addi 9,9,16
>>         bdnz .L2
>>
>>
>> With this patch, it ends up with:
>>
>>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
>>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
>>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;
> 
> Yes, so I'm curious what it ends up with/without the patch on x86_64 which
> can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.
> 

For test case:

```
#define N 32
typedef signed int bigType;
typedef signed short smallType;
#define SH_CNT 16

extern smallType small_a[N], small_b[N], small_c[N];

__attribute__((noipa)) void test_si(int n) {
  for (int i = 0; i < n; i++)
    small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT;
}

```

on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize

1) without this patch, the pattern isn't recognized, the IR looks like:

  <bb 4> [local count: 94607391]:
  bnd.5_34 = niters.4_25 >> 3;
  _13 = (sizetype) bnd.5_34;
  _29 = _13 * 16;

  <bb 5> [local count: 378429566]:
  # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)>
  vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1];
  vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1];
  vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>;
  vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>;
  vect__6.15_46 = vect_patt_18.14_44 >> 16;
  vect__6.15_47 = vect_patt_18.14_45 >> 16;
  vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>;
  MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48;
  ivtmp.34_5 = ivtmp.34_4 + 16;
  if (ivtmp.34_5 != _29)
    goto <bb 5>; [75.00%]
  else
    goto <bb 6>; [25.00%]

...

*asm*:

.L4:
        movdqu  small_b(%rax), %xmm3
        movdqu  small_a(%rax), %xmm1
        addq    $16, %rax
        movdqu  small_a-16(%rax), %xmm2
        pmullw  %xmm3, %xmm1
        pmulhw  %xmm3, %xmm2
        movdqa  %xmm1, %xmm0
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm2, %xmm1
        psrad   $16, %xmm1
        psrad   $16, %xmm0
        movdqa  %xmm0, %xmm2
        punpcklwd       %xmm1, %xmm0
        punpckhwd       %xmm1, %xmm2
        movdqa  %xmm0, %xmm1
        punpckhwd       %xmm2, %xmm1
        punpcklwd       %xmm2, %xmm0
        punpcklwd       %xmm1, %xmm0
        movups  %xmm0, small_c-16(%rax)
        cmpq    %rdx, %rax
        jne     .L4
        movl    %edi, %eax
        andl    $-8, %eax
        testb   $7, %dil
        je      .L14

*insn dist in loop*

      1 addq
      3 movdqa
      3 movdqu
      1 movups
      1 pmulhw
      1 pmullw
      2 psrad
      3 punpckhwd
      4 punpcklwd


2) with this patch but make the mul_highpart optab query return false, the IR looks
like:

  <bb 4> [local count: 94607391]:
  bnd.5_37 = niters.4_22 >> 3;
  _13 = (sizetype) bnd.5_37;
  _32 = _13 * 16;

  <bb 5> [local count: 378429566]:
  # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)>
  vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1];
  vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1];
  vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46;
  MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47;
  ivtmp.33_5 = ivtmp.33_4 + 16;
  if (ivtmp.33_5 != _32)
    goto <bb 5>; [75.00%]
  else
    goto <bb 6>; [25.00%]

*asm*:

.L4:
        movdqu  small_b(%rax), %xmm3
        movdqu  small_a(%rax), %xmm1
        addq    $16, %rax
        movdqu  small_a-16(%rax), %xmm2
        pmullw  %xmm3, %xmm1
        pmulhw  %xmm3, %xmm2
        movdqa  %xmm1, %xmm0
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm2, %xmm1
        movdqa  %xmm0, %xmm2
        punpcklwd       %xmm1, %xmm0
        punpckhwd       %xmm1, %xmm2
        movdqa  %xmm0, %xmm1
        punpckhwd       %xmm2, %xmm1
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm1, %xmm0
        movups  %xmm0, small_c-16(%rax)
        cmpq    %rdx, %rax
        jne     .L4
        movl    %edi, %eax
        andl    $-8, %eax
        testb   $7, %dil
        je      .L14

*insn dist*:

      1 addq
      3 movdqa
      3 movdqu
      1 movups
      1 pmulhw
      1 pmullw
      4 punpckhwd
      3 punpcklwd

I know little on i386 asm, but this seems better from insn distribution,
the one without this patch uses two more psrad (s).

3) FWIW with this patch as well as existing optab supports, the IR looks like:

  <bb 4> [local count: 94607391]:
  bnd.5_40 = niters.4_19 >> 3;
  _75 = (sizetype) bnd.5_40;
  _91 = _75 * 16;

  <bb 5> [local count: 378429566]:
  # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)>
  vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1];
  vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1];
  vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51;
  MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52;
  ivtmp.48_45 = ivtmp.48_53 + 16;
  if (ivtmp.48_45 != _91)
    goto <bb 5>; [75.00%]
  else
    goto <bb 6>; [25.00%]

  <bb 6> [local count: 94607391]:
  niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288;
  tmp.7_42 = (int) niters_vector_mult_vf.6_41;
  _61 = niters.4_19 & 7;
  if (_61 == 0)
    goto <bb 12>; [12.50%]
  else
    goto <bb 7>; [87.50%]

  <bb 7> [local count: 93293400]:
  # i_38 = PHI <tmp.7_42(6), 0(3)>
  # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)>
  niters.18_60 = niters.4_19 - _44;
  _76 = niters.18_60 + 4294967295;
  if (_76 <= 2)
    goto <bb 9>; [10.00%]
  else
    goto <bb 8>; [90.00%]

  <bb 8> [local count: 167928121]:
  _85 = (sizetype) _44;
  _86 = _85 * 2;
  vectp_small_a.23_84 = &small_a + _86;
  vectp_small_b.26_90 = &small_b + _86;
  vectp_small_c.31_98 = &small_c + _86;
  vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84];
  vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90];
  vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95;
  MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96;
  niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292;
  _82 = (int) niters_vector_mult_vf.20_80;
  tmp.21_81 = i_38 + _82;
  _74 = niters.18_60 & 3;
  if (_74 == 0)
    goto <bb 11>; [25.00%]
  else
    goto <bb 9>; [75.00%]

...

*asm*:

.L4:
        movdqu  small_a(%rax), %xmm0
        movdqu  small_b(%rax), %xmm2
        addq    $16, %rax
        pmulhw  %xmm2, %xmm0
        movups  %xmm0, small_c-16(%rax)
        cmpq    %rdx, %rax
        jne     .L4
        movl    %ecx, %edx
        andl    $-8, %edx
        movl    %edx, %eax
        testb   $7, %cl
        je      .L19
.L3:
        movl    %ecx, %esi
        subl    %edx, %esi
        leal    -1(%rsi), %edi
        cmpl    $2, %edi
        jbe     .L6
        movq    small_a(%rdx,%rdx), %xmm0
        movq    small_b(%rdx,%rdx), %xmm1
        pmulhw  %xmm1, %xmm0
        movq    %xmm0, small_c(%rdx,%rdx)
        movl    %esi, %edx
        andl    $-4, %edx
        addl    %edx, %eax
        andl    $3, %esi
        je      .L1

I guess the proposed IFN would be directly mapped for [us]mul_highpart?
or do you expect it will also cover what we support in can_mult_highpart_p?

BR,
Kewen