From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-381323-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 31073 invoked by alias); 21 Oct 2014 14:56:55 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 31057 invoked by uid 89); 21 Oct 2014 14:56:55 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2
X-HELO: mail-vc0-f181.google.com
Received: from mail-vc0-f181.google.com (HELO mail-vc0-f181.google.com) (209.85.220.181) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Tue, 21 Oct 2014 14:56:53 +0000
Received: by mail-vc0-f181.google.com with SMTP id le20so564721vcb.26        for <gcc-patches@gcc.gnu.org>; Tue, 21 Oct 2014 07:56:50 -0700 (PDT)
MIME-Version: 1.0
X-Received: by 10.52.189.106 with SMTP id gh10mr2077535vdc.33.1413903410374; Tue, 21 Oct 2014 07:56:50 -0700 (PDT)
Received: by 10.31.179.66 with HTTP; Tue, 21 Oct 2014 07:56:50 -0700 (PDT)
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820E555490E3@szxema507-mbx.china.huawei.com>
References: <CAFc0fxw_SrT8c5y=Ly+X98UhN5_P1bAjMg3n0K06fxAv0ioebg@mail.gmail.com>	<CAGSvup-V3h1EoZ3D31BEc7Gq0F=YEYJD1GmH+_UipVjNCe=HyQ@mail.gmail.com>	<CAFc0fxzHg4OktbyTy9341Joubf+A6E3XRj-oCS8WzgVp4pw9mw@mail.gmail.com>	<CAFc0fxwh7Ey2XFRpbSVNQpQ7RUwcQgGCYHMFB=Yo=6ci261prQ@mail.gmail.com>	<DA41BE1DDCA941489001C7FBD7A8820E54A354E4@szxema507-mbx.china.huawei.com>	<CAGSvup9qUfTfYOm6FyUcDFeKp4dJtFMjWaC+XnjT6Ca+OqYguA@mail.gmail.com>	<CAFc0fxxXWa603e0r29WjkSGN7T5JAQ5WZq6iKW46zfvWhou7Dw@mail.gmail.com>	<CAFc0fxwXVkSf0---P42kdXWR7ibr02uOaiiz8L9Kc=fHTxWY1w@mail.gmail.com>	<CAGSvup9EP3Vy5f+G_jCKMHPG7t5k6+JZZxE+=HAetXYEusYOeg@mail.gmail.com>	<CAFc0fxyWfZdiLA7NxK2Zhj5bkdbSg9UOsVVHs5yOp5zSj3Kx0g@mail.gmail.com>	<CAFc0fxxtvgu39njZ76HAZwgExn7Q+60ycqR+6+toS=+JwaMe+g@mail.gmail.com>	<CAGSvup-Zd462JYy5kWPn46w28zYajzuaF7YcMS0-7ddyLHx+LQ@mail.gmail.com>	<DA41BE1DDCA941489001C7FBD7A8820E555490E3@szxema507-mbx.china.huawei.com>
Date: Tue, 21 Oct 2014 14:57:00 -0000
Message-ID: <CAGSvup9g0Wu1JNRVF-GOVJPYiM9wj3s5xsHcug=Bt4-_+uFv8g@mail.gmail.com>
Subject: Re: [PATCH] Add zero-overhead looping for xtensa backend
From: "augustine.sterling@gmail.com" <augustine.sterling@gmail.com>
To: "Yangfei (Felix)" <felix.yang@huawei.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, Felix Yang <fei.yang0953@gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2014-10/txt/msg02098.txt.bz2

On Wed, Oct 15, 2014 at 7:10 PM, Yangfei (Felix) <felix.yang@huawei.com> wr=
ote:
> Hi Sterling,
>
>     Since the patch is delayed for a long time, I'm kind of pushing it. S=
orry for that.
>     Yeah, you are right. We have some performance issue here as GCC may u=
se one more general register in some cases with this patch.
>     Take the following arraysum testcase for example. In doloop optimizat=
ion, GCC figures out that the number of iterations is 1024 and creates a ne=
w pseudo 79 as the new trip count register.
>     The pseudo 79 is live throughout the loop, this makes the register pr=
essure in the loop higher. And it's possible that this new pseudo is spille=
d by reload when the register pressure is very high.
>     I know that the xtensa loop instruction copies the trip count registe=
r into the LCOUNT special register. And we need describe this hardware feat=
ure in GCC in order to free the trip count register.
>     But I find it difficult to do. Do you have any good suggestions on th=
is?

There are two issues related to the trip count, one I would like you
to solve now, one later.

1. Later: The trip count doesn't need to be updated at all inside
these loops, once the loop instruction executes. The code below
relates to this case.

2. Now: You should be able to use a loop instruction regardless of
whether the trip count is spilled. If you have an example where that
wouldn't work, I would love to see it.

>
> arraysum.c:
> int g[1024];
> int g_sum;
>
> void test_entry ()
> {
>         int i, Sum =3D 0;
>
>         for (i =3D 0; i < 1024; i++)
>           Sum =3D Sum + g[i];
>
>         g_sum =3D Sum;
> }
>
>
> 1. RTL before the doloop optimization pass(arraysum.c.193r.loop2_invarian=
t):
> (note 34 0 32 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
> (note 32 34 36 2 NOTE_INSN_FUNCTION_BEG)
> (insn 36 32 37 2 (set (reg:SI 72 [ ivtmp$8 ])
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC2") [flags 0x2]) [2  S4 A32]))=
 29 {movsi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:SI ("g")  <var_decl 0x7f6eef5d62d0 =
g>)
>         (nil)))
> (insn 37 36 33 2 (set (reg/f:SI 76 [ D.1393 ])
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC3") [flags 0x2]) [2  S4 A32]))=
 29 {movsi_internal}
>      (expr_list:REG_EQUAL (const:SI (plus:SI (symbol_ref:SI ("g")  <var_d=
ecl 0x7f6eef5d62d0 g>)
>                 (const_int 4096 [0x1000])))
>         (nil)))
> (insn 33 37 42 2 (set (reg/v:SI 74 [ Sum ])
>         (const_int 0 [0])) arraysum.c:6 29 {movsi_internal}
>      (nil))
> (code_label 42 33 38 3 2 "" [0 uses])
> (note 38 42 39 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
> (insn 39 38 40 3 (set (reg:SI 77 [ MEM[base: _14, offset: 0B] ])
>         (mem:SI (reg:SI 72 [ ivtmp$8 ]) [2 MEM[base: _14, offset: 0B]+0 S=
4 A32])) arraysum.c:9 29 {movsi_internal}
>      (nil))
> (insn 40 39 41 3 (set (reg/v:SI 74 [ Sum ])
>         (plus:SI (reg/v:SI 74 [ Sum ])
>             (reg:SI 77 [ MEM[base: _14, offset: 0B] ]))) arraysum.c:9 1 {=
addsi3}
>      (expr_list:REG_DEAD (reg:SI 77 [ MEM[base: _14, offset: 0B] ])
>         (nil)))
> (insn 41 40 43 3 (set (reg:SI 72 [ ivtmp$8 ])
>         (plus:SI (reg:SI 72 [ ivtmp$8 ])
>             (const_int 4 [0x4]))) 1 {addsi3}
>      (nil))
> (jump_insn 43 41 52 3 (set (pc)
>         (if_then_else (ne (reg:SI 72 [ ivtmp$8 ])
>                 (reg/f:SI 76 [ D.1393 ]))
>             (label_ref:SI 52)
>             (pc))) arraysum.c:8 39 {*btrue}
>      (int_list:REG_BR_PROB 9899 (nil))
>  -> 52)
> (code_label 52 43 51 5 3 "" [1 uses])
> (note 51 52 44 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
> (note 44 51 45 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 45 44 46 4 (set (reg/f:SI 78)
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC4") [flags 0x2]) [2  S4 A32]))=
 arraysum.c:11 29 {movsi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:SI ("g_sum")  <var_decl 0x7f6eef5d6=
360 g_sum>)
>         (nil)))
> (insn 46 45 0 4 (set (mem/c:SI (reg/f:SI 78) [2 g_sum+0 S4 A32])
>         (reg/v:SI 74 [ Sum ])) arraysum.c:11 29 {movsi_internal}
>      (expr_list:REG_DEAD (reg/f:SI 78)
>         (expr_list:REG_DEAD (reg/v:SI 74 [ Sum ])
>             (nil))))
>
>
> 2. RTL after the doloop optimization pass(arraysum.c.195r.loop2_doloop):
> (note 34 0 32 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
> (note 32 34 36 2 NOTE_INSN_FUNCTION_BEG)
> (insn 36 32 37 2 (set (reg:SI 72 [ ivtmp$8 ])
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC2") [flags 0x2]) [2  S4 A32]))=
 29 {movsi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:SI ("g")  <var_decl 0x7f6eef5d62d0 =
g>)
>         (nil)))
> (insn 37 36 33 2 (set (reg/f:SI 76 [ D.1393 ])
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC3") [flags 0x2]) [2  S4 A32]))=
 29 {movsi_internal}
>      (expr_list:REG_EQUAL (const:SI (plus:SI (symbol_ref:SI ("g")  <var_d=
ecl 0x7f6eef5d62d0 g>)
>                 (const_int 4096 [0x1000])))
>         (nil)))
> (insn 33 37 54 2 (set (reg/v:SI 74 [ Sum ])
>         (const_int 0 [0])) arraysum.c:6 29 {movsi_internal}
>      (nil))
> (insn 54 33 42 2 (set (reg:SI 79)
>         (const_int 1024 [0x400])) arraysum.c:6 -1
>      (nil))
> (code_label 42 54 38 3 2 "" [0 uses])
> (note 38 42 39 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
> (insn 39 38 40 3 (set (reg:SI 77 [ MEM[base: _14, offset: 0B] ])
>         (mem:SI (reg:SI 72 [ ivtmp$8 ]) [2 MEM[base: _14, offset: 0B]+0 S=
4 A32])) arraysum.c:9 29 {movsi_internal}
>      (nil))
> (insn 40 39 41 3 (set (reg/v:SI 74 [ Sum ])
>         (plus:SI (reg/v:SI 74 [ Sum ])
>             (reg:SI 77 [ MEM[base: _14, offset: 0B] ]))) arraysum.c:9 1 {=
addsi3}
>      (expr_list:REG_DEAD (reg:SI 77 [ MEM[base: _14, offset: 0B] ])
>         (nil)))
> (insn 41 40 53 3 (set (reg:SI 72 [ ivtmp$8 ])
>         (plus:SI (reg:SI 72 [ ivtmp$8 ])
>             (const_int 4 [0x4]))) 1 {addsi3}
>      (nil))
> (jump_insn 53 41 52 3 (parallel [
>             (set (pc)
>                 (if_then_else (ne (reg:SI 79)
>                         (const_int 1 [0x1]))
>                     (label_ref 52)
>                     (pc)))
>             (set (reg:SI 79)
>                 (plus:SI (reg:SI 79)
>                     (const_int -1 [0xffffffffffffffff])))
>             (unspec [
>                     (const_int 0 [0])
>                 ] 13)
>             (clobber (scratch:SI))
>         ]) -1
>      (int_list:REG_BR_PROB 9899 (nil))
>  -> 52)
> (code_label 52 53 51 5 3 "" [1 uses])
> (note 51 52 44 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
> (note 44 51 45 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 45 44 46 4 (set (reg/f:SI 78)
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC4") [flags 0x2]) [2  S4 A32]))=
 arraysum.c:11 29 {movsi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:SI ("g_sum")  <var_decl 0x7f6eef5d6=
360 g_sum>)
>         (nil)))
> (insn 46 45 0 4 (set (mem/c:SI (reg/f:SI 78) [2 g_sum+0 S4 A32])
>         (reg/v:SI 74 [ Sum ])) arraysum.c:11 29 {movsi_internal}
>      (expr_list:REG_DEAD (reg/f:SI 78)
>         (expr_list:REG_DEAD (reg/v:SI 74 [ Sum ])
>             (nil))))
>
>
>>
>> On Tue, Oct 14, 2014 at 8:39 AM, Felix Yang <fei.yang0953@gmail.com> wro=
te:
>> > PING=EF=BC=9F
>> > Cheers,
>> > Felix
>>
>> Felix,
>>
>> This isn't my day job, 24-hour pings are unproductive.
>>
>> You shouldn't need to worry about the trip count register getting spille=
d. It
>> makes no difference whatsoever to how the loop operates--the trip count =
is
>> dead with regards to the loop once the instruction executes. You don't n=
eed to
>> describe LCOUNT to gcc in order for this not to matter. It should be eno=
ugh to
>> describe the zcl as consuming the value in the same way a branch instruc=
tion
>> consumes a value.
>>
>> If you have a case where spilling it is causing a problem, then there is=
 a bug in
>> your code, papered over by dropping case when it is spilled. Similarly w=
ith
>> iter_reg_used_outside--it shouldn't affect whether or not a zcl is valid=
 here. If
>> you have a case where it does, there is likely a bug in your code.
>>
>> If the code is easier to write by maintaining trip_count up, then fine (=
for now);
>> you give up some performance (in fact, a lot of performance), but that d=
oesn't
>> matter as to the correctness.
>>
>>
>> >
>> >
>> > On Tue, Oct 14, 2014 at 12:30 AM, Felix Yang <fei.yang0953@gmail.com>
>> wrote:
>> >> Thanks for the comments.
>> >>
>> >> The patch checked the usage of teh trip count register, making sure
>> >> that it is not used in the loop body other than the doloop_end or
>> >> lives past the doloop_end instruction, as the following code snippet
>> >> shows:
>> >>
>> >> +  /* Scan all the blocks to make sure they don't use iter_reg.  */
>> >> + if (loop->iter_reg_used || loop->iter_reg_used_outside)
>> >> +    {
>> >> +      if (dump_file)
>> >> +        fprintf (dump_file, ";; loop %d uses iterator\n",
>> >> +                 loop->loop_no);
>> >> +      return false;
>> >> +    }
>> >>
>> >>     For the spill issue, I think we need to handle it. The reason is
>> >> that currently we are not telling GCC about the existence of the
>> >> LCOUNT register. Instead, we keep the trip count in a general
>> >> register and it's possible that this register can be spilled when
>> >> register pressure is high.
>> >>     It's a good idea to post another patch to describe the LCOUNT
>> >> register in GCC in order to free this general register. But I want
>> >> this patch applied as a first step, OK?
>> >>
>> >> Cheers,
>> >> Felix
>> >>
>> >>
>> >> On Tue, Oct 14, 2014 at 12:09 AM, augustine.sterling@gmail.com
>> >> <augustine.sterling@gmail.com> wrote:
>> >>> On Fri, Oct 10, 2014 at 6:59 AM, Felix Yang <fei.yang0953@gmail.com>
>> wrote:
>> >>>> Hi Sterling,
>> >>>>
>> >>>>     I made some improvement to the patch. Two changes:
>> >>>>     1. TARGET_LOOPS is now used as a condition of the doloop
>> >>>> related patterns, which is more elegant.
>> >>>
>> >>> Fine.
>> >>>
>> >>>>     2. As the trip count register of the zero-cost loop maybe
>> >>>> potentially spilled, we need to change the patterns in order to
>> >>>> handle this issue.
>> >>>
>> >>> Actually, for xtensa you don't. The trip count is copied into LCOUNT
>> >>> at the execution of the loop instruction, and therefore a spill or
>> >>> whatever doesn't matter--it won't affect the result. So as long as
>> >>> you have the trip count at the start of the loop, you are fine.
>> >>>
>> >>> This does bring up an issue of whether or not the trip count can be
>> >>> modified during the loop. (note that this is different than early
>> >>> exit.) If it can, you can't use a zero-overhead loop. Does your
>> >>> patch address this case.
>> >>>
>> >>> The solution is similar to that adapted by c6x backend.
>> >>>> Just turn the zero-cost loop into a regular loop when that happens
>> >>>> when reload is completed.
>> >>>>     Attached please find version 4 of the patch. Make check
>> >>>> regression tested with xtensa-elf-gcc/simulator.
>> >>>>     OK for trunk?