From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 31073 invoked by alias); 21 Oct 2014 14:56:55 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 31057 invoked by uid 89); 21 Oct 2014 14:56:55 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2 X-HELO: mail-vc0-f181.google.com Received: from mail-vc0-f181.google.com (HELO mail-vc0-f181.google.com) (209.85.220.181) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Tue, 21 Oct 2014 14:56:53 +0000 Received: by mail-vc0-f181.google.com with SMTP id le20so564721vcb.26 for ; Tue, 21 Oct 2014 07:56:50 -0700 (PDT) MIME-Version: 1.0 X-Received: by 10.52.189.106 with SMTP id gh10mr2077535vdc.33.1413903410374; Tue, 21 Oct 2014 07:56:50 -0700 (PDT) Received: by 10.31.179.66 with HTTP; Tue, 21 Oct 2014 07:56:50 -0700 (PDT) In-Reply-To: References: Date: Tue, 21 Oct 2014 14:57:00 -0000 Message-ID: Subject: Re: [PATCH] Add zero-overhead looping for xtensa backend From: "augustine.sterling@gmail.com" To: "Yangfei (Felix)" Cc: "gcc-patches@gcc.gnu.org" , Felix Yang Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-SW-Source: 2014-10/txt/msg02098.txt.bz2 On Wed, Oct 15, 2014 at 7:10 PM, Yangfei (Felix) wr= ote: > Hi Sterling, > > Since the patch is delayed for a long time, I'm kind of pushing it. S= orry for that. > Yeah, you are right. We have some performance issue here as GCC may u= se one more general register in some cases with this patch. > Take the following arraysum testcase for example. In doloop optimizat= ion, GCC figures out that the number of iterations is 1024 and creates a ne= w pseudo 79 as the new trip count register. > The pseudo 79 is live throughout the loop, this makes the register pr= essure in the loop higher. And it's possible that this new pseudo is spille= d by reload when the register pressure is very high. > I know that the xtensa loop instruction copies the trip count registe= r into the LCOUNT special register. And we need describe this hardware feat= ure in GCC in order to free the trip count register. > But I find it difficult to do. Do you have any good suggestions on th= is? There are two issues related to the trip count, one I would like you to solve now, one later. 1. Later: The trip count doesn't need to be updated at all inside these loops, once the loop instruction executes. The code below relates to this case. 2. Now: You should be able to use a loop instruction regardless of whether the trip count is spilled. If you have an example where that wouldn't work, I would love to see it. > > arraysum.c: > int g[1024]; > int g_sum; > > void test_entry () > { > int i, Sum =3D 0; > > for (i =3D 0; i < 1024; i++) > Sum =3D Sum + g[i]; > > g_sum =3D Sum; > } > > > 1. RTL before the doloop optimization pass(arraysum.c.193r.loop2_invarian= t): > (note 34 0 32 2 [bb 2] NOTE_INSN_BASIC_BLOCK) > (note 32 34 36 2 NOTE_INSN_FUNCTION_BEG) > (insn 36 32 37 2 (set (reg:SI 72 [ ivtmp$8 ]) > (mem/u/c:SI (symbol_ref/u:SI ("*.LC2") [flags 0x2]) [2 S4 A32]))= 29 {movsi_internal} > (expr_list:REG_EQUAL (symbol_ref:SI ("g") ) > (nil))) > (insn 37 36 33 2 (set (reg/f:SI 76 [ D.1393 ]) > (mem/u/c:SI (symbol_ref/u:SI ("*.LC3") [flags 0x2]) [2 S4 A32]))= 29 {movsi_internal} > (expr_list:REG_EQUAL (const:SI (plus:SI (symbol_ref:SI ("g") ) > (const_int 4096 [0x1000]))) > (nil))) > (insn 33 37 42 2 (set (reg/v:SI 74 [ Sum ]) > (const_int 0 [0])) arraysum.c:6 29 {movsi_internal} > (nil)) > (code_label 42 33 38 3 2 "" [0 uses]) > (note 38 42 39 3 [bb 3] NOTE_INSN_BASIC_BLOCK) > (insn 39 38 40 3 (set (reg:SI 77 [ MEM[base: _14, offset: 0B] ]) > (mem:SI (reg:SI 72 [ ivtmp$8 ]) [2 MEM[base: _14, offset: 0B]+0 S= 4 A32])) arraysum.c:9 29 {movsi_internal} > (nil)) > (insn 40 39 41 3 (set (reg/v:SI 74 [ Sum ]) > (plus:SI (reg/v:SI 74 [ Sum ]) > (reg:SI 77 [ MEM[base: _14, offset: 0B] ]))) arraysum.c:9 1 {= addsi3} > (expr_list:REG_DEAD (reg:SI 77 [ MEM[base: _14, offset: 0B] ]) > (nil))) > (insn 41 40 43 3 (set (reg:SI 72 [ ivtmp$8 ]) > (plus:SI (reg:SI 72 [ ivtmp$8 ]) > (const_int 4 [0x4]))) 1 {addsi3} > (nil)) > (jump_insn 43 41 52 3 (set (pc) > (if_then_else (ne (reg:SI 72 [ ivtmp$8 ]) > (reg/f:SI 76 [ D.1393 ])) > (label_ref:SI 52) > (pc))) arraysum.c:8 39 {*btrue} > (int_list:REG_BR_PROB 9899 (nil)) > -> 52) > (code_label 52 43 51 5 3 "" [1 uses]) > (note 51 52 44 5 [bb 5] NOTE_INSN_BASIC_BLOCK) > (note 44 51 45 4 [bb 4] NOTE_INSN_BASIC_BLOCK) > (insn 45 44 46 4 (set (reg/f:SI 78) > (mem/u/c:SI (symbol_ref/u:SI ("*.LC4") [flags 0x2]) [2 S4 A32]))= arraysum.c:11 29 {movsi_internal} > (expr_list:REG_EQUAL (symbol_ref:SI ("g_sum") ) > (nil))) > (insn 46 45 0 4 (set (mem/c:SI (reg/f:SI 78) [2 g_sum+0 S4 A32]) > (reg/v:SI 74 [ Sum ])) arraysum.c:11 29 {movsi_internal} > (expr_list:REG_DEAD (reg/f:SI 78) > (expr_list:REG_DEAD (reg/v:SI 74 [ Sum ]) > (nil)))) > > > 2. RTL after the doloop optimization pass(arraysum.c.195r.loop2_doloop): > (note 34 0 32 2 [bb 2] NOTE_INSN_BASIC_BLOCK) > (note 32 34 36 2 NOTE_INSN_FUNCTION_BEG) > (insn 36 32 37 2 (set (reg:SI 72 [ ivtmp$8 ]) > (mem/u/c:SI (symbol_ref/u:SI ("*.LC2") [flags 0x2]) [2 S4 A32]))= 29 {movsi_internal} > (expr_list:REG_EQUAL (symbol_ref:SI ("g") ) > (nil))) > (insn 37 36 33 2 (set (reg/f:SI 76 [ D.1393 ]) > (mem/u/c:SI (symbol_ref/u:SI ("*.LC3") [flags 0x2]) [2 S4 A32]))= 29 {movsi_internal} > (expr_list:REG_EQUAL (const:SI (plus:SI (symbol_ref:SI ("g") ) > (const_int 4096 [0x1000]))) > (nil))) > (insn 33 37 54 2 (set (reg/v:SI 74 [ Sum ]) > (const_int 0 [0])) arraysum.c:6 29 {movsi_internal} > (nil)) > (insn 54 33 42 2 (set (reg:SI 79) > (const_int 1024 [0x400])) arraysum.c:6 -1 > (nil)) > (code_label 42 54 38 3 2 "" [0 uses]) > (note 38 42 39 3 [bb 3] NOTE_INSN_BASIC_BLOCK) > (insn 39 38 40 3 (set (reg:SI 77 [ MEM[base: _14, offset: 0B] ]) > (mem:SI (reg:SI 72 [ ivtmp$8 ]) [2 MEM[base: _14, offset: 0B]+0 S= 4 A32])) arraysum.c:9 29 {movsi_internal} > (nil)) > (insn 40 39 41 3 (set (reg/v:SI 74 [ Sum ]) > (plus:SI (reg/v:SI 74 [ Sum ]) > (reg:SI 77 [ MEM[base: _14, offset: 0B] ]))) arraysum.c:9 1 {= addsi3} > (expr_list:REG_DEAD (reg:SI 77 [ MEM[base: _14, offset: 0B] ]) > (nil))) > (insn 41 40 53 3 (set (reg:SI 72 [ ivtmp$8 ]) > (plus:SI (reg:SI 72 [ ivtmp$8 ]) > (const_int 4 [0x4]))) 1 {addsi3} > (nil)) > (jump_insn 53 41 52 3 (parallel [ > (set (pc) > (if_then_else (ne (reg:SI 79) > (const_int 1 [0x1])) > (label_ref 52) > (pc))) > (set (reg:SI 79) > (plus:SI (reg:SI 79) > (const_int -1 [0xffffffffffffffff]))) > (unspec [ > (const_int 0 [0]) > ] 13) > (clobber (scratch:SI)) > ]) -1 > (int_list:REG_BR_PROB 9899 (nil)) > -> 52) > (code_label 52 53 51 5 3 "" [1 uses]) > (note 51 52 44 5 [bb 5] NOTE_INSN_BASIC_BLOCK) > (note 44 51 45 4 [bb 4] NOTE_INSN_BASIC_BLOCK) > (insn 45 44 46 4 (set (reg/f:SI 78) > (mem/u/c:SI (symbol_ref/u:SI ("*.LC4") [flags 0x2]) [2 S4 A32]))= arraysum.c:11 29 {movsi_internal} > (expr_list:REG_EQUAL (symbol_ref:SI ("g_sum") ) > (nil))) > (insn 46 45 0 4 (set (mem/c:SI (reg/f:SI 78) [2 g_sum+0 S4 A32]) > (reg/v:SI 74 [ Sum ])) arraysum.c:11 29 {movsi_internal} > (expr_list:REG_DEAD (reg/f:SI 78) > (expr_list:REG_DEAD (reg/v:SI 74 [ Sum ]) > (nil)))) > > >> >> On Tue, Oct 14, 2014 at 8:39 AM, Felix Yang wro= te: >> > PING=EF=BC=9F >> > Cheers, >> > Felix >> >> Felix, >> >> This isn't my day job, 24-hour pings are unproductive. >> >> You shouldn't need to worry about the trip count register getting spille= d. It >> makes no difference whatsoever to how the loop operates--the trip count = is >> dead with regards to the loop once the instruction executes. You don't n= eed to >> describe LCOUNT to gcc in order for this not to matter. It should be eno= ugh to >> describe the zcl as consuming the value in the same way a branch instruc= tion >> consumes a value. >> >> If you have a case where spilling it is causing a problem, then there is= a bug in >> your code, papered over by dropping case when it is spilled. Similarly w= ith >> iter_reg_used_outside--it shouldn't affect whether or not a zcl is valid= here. If >> you have a case where it does, there is likely a bug in your code. >> >> If the code is easier to write by maintaining trip_count up, then fine (= for now); >> you give up some performance (in fact, a lot of performance), but that d= oesn't >> matter as to the correctness. >> >> >> > >> > >> > On Tue, Oct 14, 2014 at 12:30 AM, Felix Yang >> wrote: >> >> Thanks for the comments. >> >> >> >> The patch checked the usage of teh trip count register, making sure >> >> that it is not used in the loop body other than the doloop_end or >> >> lives past the doloop_end instruction, as the following code snippet >> >> shows: >> >> >> >> + /* Scan all the blocks to make sure they don't use iter_reg. */ >> >> + if (loop->iter_reg_used || loop->iter_reg_used_outside) >> >> + { >> >> + if (dump_file) >> >> + fprintf (dump_file, ";; loop %d uses iterator\n", >> >> + loop->loop_no); >> >> + return false; >> >> + } >> >> >> >> For the spill issue, I think we need to handle it. The reason is >> >> that currently we are not telling GCC about the existence of the >> >> LCOUNT register. Instead, we keep the trip count in a general >> >> register and it's possible that this register can be spilled when >> >> register pressure is high. >> >> It's a good idea to post another patch to describe the LCOUNT >> >> register in GCC in order to free this general register. But I want >> >> this patch applied as a first step, OK? >> >> >> >> Cheers, >> >> Felix >> >> >> >> >> >> On Tue, Oct 14, 2014 at 12:09 AM, augustine.sterling@gmail.com >> >> wrote: >> >>> On Fri, Oct 10, 2014 at 6:59 AM, Felix Yang >> wrote: >> >>>> Hi Sterling, >> >>>> >> >>>> I made some improvement to the patch. Two changes: >> >>>> 1. TARGET_LOOPS is now used as a condition of the doloop >> >>>> related patterns, which is more elegant. >> >>> >> >>> Fine. >> >>> >> >>>> 2. As the trip count register of the zero-cost loop maybe >> >>>> potentially spilled, we need to change the patterns in order to >> >>>> handle this issue. >> >>> >> >>> Actually, for xtensa you don't. The trip count is copied into LCOUNT >> >>> at the execution of the loop instruction, and therefore a spill or >> >>> whatever doesn't matter--it won't affect the result. So as long as >> >>> you have the trip count at the start of the loop, you are fine. >> >>> >> >>> This does bring up an issue of whether or not the trip count can be >> >>> modified during the loop. (note that this is different than early >> >>> exit.) If it can, you can't use a zero-overhead loop. Does your >> >>> patch address this case. >> >>> >> >>> The solution is similar to that adapted by c6x backend. >> >>>> Just turn the zero-cost loop into a regular loop when that happens >> >>>> when reload is completed. >> >>>> Attached please find version 4 of the patch. Make check >> >>>> regression tested with xtensa-elf-gcc/simulator. >> >>>> OK for trunk?