* [PATCH] Atom: Scheduler improvements for better imul placement @ 2012-04-11 8:40 Igor Zamyatin 2012-04-11 13:38 ` Andi Kleen 0 siblings, 1 reply; 26+ messages in thread From: Igor Zamyatin @ 2012-04-11 8:40 UTC (permalink / raw) To: gcc-patches [-- Attachment #1: Type: text/plain, Size: 470 bytes --] Hi All! It is known that imul placement is rather critical for Atom processors and changes try to improve imul scheduling for Atom. This gives +5% performance on several tests from new OA 2.0 testsuite from EEMBC. Tested for i386 and x86-64, ok for trunk? ChangeLog: 2012-04-10 Yuri Rumyantsev <ysrumyan@gmail.com> * config/i386/i386.c (x86_sched_reorder): New function. Add new function x86_sched_reorder to take advantage of Atom pipelened IMUL execution. [-- Attachment #2: imul_sched.patch --] [-- Type: application/octet-stream, Size: 4088 bytes --] diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 8974ddc..32d031f 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -23721,6 +23721,115 @@ ix86_agi_dependent (rtx set_insn, rtx use_insn) return false; } +/* Try to reorder ready list to take advantage of Atom pipelined IMUL + execution. It is applied if + (1) IMUL instrction is on the top of list; + (2) There no others IMUL instructions in ready list; + (3) There exists the only producer of independent IMUL instruction in + ready list; + (4) Put found producer on the top of ready list. + Returns issue rate. */ + +static int +ix86_sched_reorder(FILE *dump, int sched_verbose, rtx *ready, int *pn_ready, + int clock_var ATTRIBUTE_UNUSED) +{ + static int issue_rate = -1; + int n_ready = *pn_ready; + int n_imul = 0; + rtx insn; + rtx insn1; + int i; + sd_iterator_def sd_it; + dep_t dep; + int index = -1; + + /* set up issue rate */ + if (issue_rate < 0) + issue_rate = ix86_issue_rate(); + + /* do reodering for Atom only */ + if (ix86_tune != PROCESSOR_ATOM) + return issue_rate; + /* nothing to do if ready list contains only 1 instruction */ + if (n_ready <= 1) + return issue_rate; + + /* check that IMUL instruction is on the top of ready list. */ + insn = PATTERN(ready[n_ready -1]); + if (GET_CODE (insn) == PARALLEL) + insn = XVECEXP (insn, 0, 0); + if (GET_CODE (insn) != SET) + return issue_rate; + if (!(GET_CODE (SET_SRC (insn)) == MULT + && GET_MODE (SET_SRC (insn)) == SImode)) + return issue_rate; + + /* count for number of IMUL instructions. */ + for (i = 0; i < n_ready; i++) + { + insn = PATTERN (ready[i]); + if (GET_CODE (insn) == PARALLEL) + insn = XVECEXP (insn, 0, 0); + + if (GET_CODE (insn) != SET) + continue; + if (!(GET_CODE (SET_SRC (insn)) == MULT + && GET_MODE (SET_SRC (insn)) == SImode)) + n_imul++; + } + /* do nothing if we have > 1 IMUL instrusction in ready list. */ + if (n_imul > 1) + return issue_rate; + /* search for producer of independent IMUL instruction. */ + for (i = n_ready -2; i>= 0; i--) + { + insn = ready[i]; + FOR_EACH_DEP (insn, SD_LIST_FORW, sd_it, dep) + { + rtx con; + con = DEP_CON (dep); + insn1 = PATTERN (con); + if (GET_CODE (insn1) == PARALLEL) + insn1 = XVECEXP (insn1, 0, 0); + if (GET_CODE (insn1) == SET + && GET_CODE (SET_SRC (insn1)) == MULT + && GET_MODE (SET_SRC (insn1)) == SImode) + { + sd_iterator_def sd_it1; + dep_t dep1; + /* check if there is no other dependee for IMUL. */ + index = i; + FOR_EACH_DEP (con, SD_LIST_BACK, sd_it1, dep1) + { + rtx pro; + pro = DEP_PRO (dep1); + if (pro != insn) + index = -1; + } + if (index >= 0) + break; + } + } + if (index >= 0) + break; + } + if (index < 0) + return issue_rate; /* didn't find IMUL producer. */ + + if (sched_verbose) + fprintf(dump, ";;\tatom sched_reorder: swap %d and %d insns\n", + INSN_UID (ready[index]), INSN_UID (ready[n_ready - 1])); + + /* put IMUL producer (ready[index]) at the top of ready list */ + insn1= ready[index]; + for (i = index; i < n_ready - 1; i++) + ready[i] = ready[i + 1]; + ready[n_ready - 1] = insn1; + + return issue_rate; +} + static int ix86_adjust_cost (rtx insn, rtx link, rtx dep_insn, int cost) { @@ -38072,6 +38181,9 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree) #undef TARGET_SCHED_REASSOCIATION_WIDTH #define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width +#undef TARGET_SCHED_REORDER +#define TARGET_SCHED_REORDER ix86_sched_reorder + /* The size of the dispatch window is the total number of bytes of object code allowed in a window. */ #define DISPATCH_WINDOW_SIZE 16 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-11 8:40 [PATCH] Atom: Scheduler improvements for better imul placement Igor Zamyatin @ 2012-04-11 13:38 ` Andi Kleen 2012-04-11 14:10 ` Richard Guenther 2012-04-12 10:18 ` Igor Zamyatin 0 siblings, 2 replies; 26+ messages in thread From: Andi Kleen @ 2012-04-11 13:38 UTC (permalink / raw) To: Igor Zamyatin; +Cc: gcc-patches Igor Zamyatin <izamyatin@gmail.com> writes: > Hi All! > > It is known that imul placement is rather critical for Atom processors > and changes try to improve imul scheduling for Atom. > > This gives +5% performance on several tests from new OA 2.0 testsuite > from EEMBC. > > Tested for i386 and x86-64, ok for trunk? Did you measure how much this slows down the compiler when compiling for Atom? The new pass looks rather slow. -Andi -- ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-11 13:38 ` Andi Kleen @ 2012-04-11 14:10 ` Richard Guenther 2012-04-12 10:20 ` Igor Zamyatin 2012-04-12 10:18 ` Igor Zamyatin 1 sibling, 1 reply; 26+ messages in thread From: Richard Guenther @ 2012-04-11 14:10 UTC (permalink / raw) To: Andi Kleen; +Cc: Igor Zamyatin, gcc-patches On Wed, Apr 11, 2012 at 3:38 PM, Andi Kleen <andi@firstfloor.org> wrote: > Igor Zamyatin <izamyatin@gmail.com> writes: > >> Hi All! >> >> It is known that imul placement is rather critical for Atom processors >> and changes try to improve imul scheduling for Atom. >> >> This gives +5% performance on several tests from new OA 2.0 testsuite >> from EEMBC. >> >> Tested for i386 and x86-64, ok for trunk? > > Did you measure how much this slows down the compiler when compiling > for Atom? The new pass looks rather slow. Also please explain why adjusting the automaton for Atom is not a way to attack this issue. Richard. > -Andi > > -- > ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-11 14:10 ` Richard Guenther @ 2012-04-12 10:20 ` Igor Zamyatin 2012-04-12 10:45 ` Richard Guenther 0 siblings, 1 reply; 26+ messages in thread From: Igor Zamyatin @ 2012-04-12 10:20 UTC (permalink / raw) To: Richard Guenther; +Cc: Andi Kleen, gcc-patches On Wed, Apr 11, 2012 at 6:10 PM, Richard Guenther <richard.guenther@gmail.com> wrote: > On Wed, Apr 11, 2012 at 3:38 PM, Andi Kleen <andi@firstfloor.org> wrote: >> Igor Zamyatin <izamyatin@gmail.com> writes: >> >>> Hi All! >>> >>> It is known that imul placement is rather critical for Atom processors >>> and changes try to improve imul scheduling for Atom. >>> >>> This gives +5% performance on several tests from new OA 2.0 testsuite >>> from EEMBC. >>> >>> Tested for i386 and x86-64, ok for trunk? >> >> Did you measure how much this slows down the compiler when compiling >> for Atom? The new pass looks rather slow. > > Also please explain why adjusting the automaton for Atom is not a way to > attack this issue. If I understand the question correctly - it's a dynamic check and it's not clear how to describe this adjusting statically in machine description > > Richard. > >> -Andi >> >> -- >> ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 10:20 ` Igor Zamyatin @ 2012-04-12 10:45 ` Richard Guenther 2012-04-12 12:00 ` Alexander Monakov 0 siblings, 1 reply; 26+ messages in thread From: Richard Guenther @ 2012-04-12 10:45 UTC (permalink / raw) To: Igor Zamyatin; +Cc: Andi Kleen, gcc-patches On Thu, Apr 12, 2012 at 12:20 PM, Igor Zamyatin <izamyatin@gmail.com> wrote: > On Wed, Apr 11, 2012 at 6:10 PM, Richard Guenther > <richard.guenther@gmail.com> wrote: >> On Wed, Apr 11, 2012 at 3:38 PM, Andi Kleen <andi@firstfloor.org> wrote: >>> Igor Zamyatin <izamyatin@gmail.com> writes: >>> >>>> Hi All! >>>> >>>> It is known that imul placement is rather critical for Atom processors >>>> and changes try to improve imul scheduling for Atom. >>>> >>>> This gives +5% performance on several tests from new OA 2.0 testsuite >>>> from EEMBC. >>>> >>>> Tested for i386 and x86-64, ok for trunk? >>> >>> Did you measure how much this slows down the compiler when compiling >>> for Atom? The new pass looks rather slow. >> >> Also please explain why adjusting the automaton for Atom is not a way to >> attack this issue. > > If I understand the question correctly - it's a dynamic check and it's > not clear how to describe this adjusting statically in machine > description From reading the code (the comments are not clear to me) you seem to produce two adjacent IMUL instructions from within the ready list - but first check that there is only a single one. So I fail to see how it can work. Can atom execute two IMUL in parallel? Or what exactly is the pipeline behavior? You miss a testcase that would show the effect of your patch. Richard. >> >> Richard. >> >>> -Andi >>> >>> -- >>> ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 10:45 ` Richard Guenther @ 2012-04-12 12:00 ` Alexander Monakov 2012-04-12 12:24 ` Richard Guenther 0 siblings, 1 reply; 26+ messages in thread From: Alexander Monakov @ 2012-04-12 12:00 UTC (permalink / raw) To: Richard Guenther; +Cc: Igor Zamyatin, Andi Kleen, gcc-patches > Can atom execute two IMUL in parallel? Or what exactly is the pipeline > behavior? As I understand from Intel's optimization reference manual, the behavior is as follows: if the instruction immediately following IMUL has shorter latency, execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a 4-or-more cycles latency instruction can be issued after IMUL without a stall. In other words, IMUL is pipelined with respect to other long-latency instructions, but not to short-latency instructions. From reading the patch, I could not understand the link between pipeline behavior and what the patch appears to do. Hope that helps. Alexander ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 12:00 ` Alexander Monakov @ 2012-04-12 12:24 ` Richard Guenther 2012-04-12 12:36 ` Igor Zamyatin 0 siblings, 1 reply; 26+ messages in thread From: Richard Guenther @ 2012-04-12 12:24 UTC (permalink / raw) To: Alexander Monakov; +Cc: Igor Zamyatin, Andi Kleen, gcc-patches On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov <amonakov@ispras.ru> wrote: > >> Can atom execute two IMUL in parallel? Or what exactly is the pipeline >> behavior? > > As I understand from Intel's optimization reference manual, the behavior is as > follows: if the instruction immediately following IMUL has shorter latency, > execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a > 4-or-more cycles latency instruction can be issued after IMUL without a stall. > In other words, IMUL is pipelined with respect to other long-latency > instructions, but not to short-latency instructions. It seems to be modeled in the pipeline description though: ;;; imul insn has 5 cycles latency (define_reservation "atom-imul-32" "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, atom-port-0") ;;; imul instruction excludes other non-FP instructions. (exclusion_set "atom-eu-0, atom-eu-1" "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") at least from my very limited guessing of what the above does. So, did you analyze why the scheduler does not properly schedule things for you? Richard. > > From reading the patch, I could not understand the link between pipeline > behavior and what the patch appears to do. > > Hope that helps. > > Alexander ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 12:24 ` Richard Guenther @ 2012-04-12 12:36 ` Igor Zamyatin 2012-04-12 12:38 ` Richard Guenther 0 siblings, 1 reply; 26+ messages in thread From: Igor Zamyatin @ 2012-04-12 12:36 UTC (permalink / raw) To: Richard Guenther; +Cc: Alexander Monakov, Andi Kleen, gcc-patches On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther <richard.guenther@gmail.com> wrote: > On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov <amonakov@ispras.ru> wrote: >> >>> Can atom execute two IMUL in parallel? Or what exactly is the pipeline >>> behavior? >> >> As I understand from Intel's optimization reference manual, the behavior is as >> follows: if the instruction immediately following IMUL has shorter latency, >> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a >> 4-or-more cycles latency instruction can be issued after IMUL without a stall. >> In other words, IMUL is pipelined with respect to other long-latency >> instructions, but not to short-latency instructions. > > It seems to be modeled in the pipeline description though: > > ;;; imul insn has 5 cycles latency > (define_reservation "atom-imul-32" > "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, > atom-port-0") > > ;;; imul instruction excludes other non-FP instructions. > (exclusion_set "atom-eu-0, atom-eu-1" > "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") > The main idea is quite simple: If we are going to schedule IMUL instruction (it is on the top of ready list) we try to find out producer of other (independent) IMUL instruction that is in ready list too. The goal is try to schedule such a producer to get another IMUL in ready list and get scheduling of 2 successive IMUL instructions. And MD allows us only prefer scheduling of successive IMUL instructions, i.e. If IMUL was just scheduled and ready list contains another IMUL instruction then it will be chosen as the best candidate for scheduling. > at least from my very limited guessing of what the above does. So, did you > analyze why the scheduler does not properly schedule things for you? > > Richard. > >> >> From reading the patch, I could not understand the link between pipeline >> behavior and what the patch appears to do. >> >> Hope that helps. >> >> Alexander ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 12:36 ` Igor Zamyatin @ 2012-04-12 12:38 ` Richard Guenther 2012-04-12 13:02 ` Andrey Belevantsev 0 siblings, 1 reply; 26+ messages in thread From: Richard Guenther @ 2012-04-12 12:38 UTC (permalink / raw) To: Igor Zamyatin; +Cc: Alexander Monakov, Andi Kleen, gcc-patches On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin <izamyatin@gmail.com> wrote: > On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther > <richard.guenther@gmail.com> wrote: >> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov <amonakov@ispras.ru> wrote: >>> >>>> Can atom execute two IMUL in parallel? Or what exactly is the pipeline >>>> behavior? >>> >>> As I understand from Intel's optimization reference manual, the behavior is as >>> follows: if the instruction immediately following IMUL has shorter latency, >>> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a >>> 4-or-more cycles latency instruction can be issued after IMUL without a stall. >>> In other words, IMUL is pipelined with respect to other long-latency >>> instructions, but not to short-latency instructions. >> >> It seems to be modeled in the pipeline description though: >> >> ;;; imul insn has 5 cycles latency >> (define_reservation "atom-imul-32" >> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, >> atom-port-0") >> >> ;;; imul instruction excludes other non-FP instructions. >> (exclusion_set "atom-eu-0, atom-eu-1" >> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >> > > The main idea is quite simple: > > If we are going to schedule IMUL instruction (it is on the top of > ready list) we try to find out producer of other (independent) IMUL > instruction that is in ready list too. The goal is try to schedule > such a producer to get another IMUL in ready list and get scheduling > of 2 successive IMUL instructions. Why does that not happen without your patch? Does it never happen without your patch or does it merely not happen for one EEMBC benchmark (can you provide a testcase?)? > And MD allows us only prefer scheduling of successive IMUL instructions, i.e. > If IMUL was just scheduled and ready list contains another IMUL > instruction then it will be chosen as the best candidate for > scheduling. > > >> at least from my very limited guessing of what the above does. So, did you >> analyze why the scheduler does not properly schedule things for you? >> >> Richard. >> >>> >>> From reading the patch, I could not understand the link between pipeline >>> behavior and what the patch appears to do. >>> >>> Hope that helps. >>> >>> Alexander ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 12:38 ` Richard Guenther @ 2012-04-12 13:02 ` Andrey Belevantsev 2012-04-12 13:55 ` Richard Guenther 2012-04-13 10:18 ` Igor Zamyatin 0 siblings, 2 replies; 26+ messages in thread From: Andrey Belevantsev @ 2012-04-12 13:02 UTC (permalink / raw) To: Richard Guenther Cc: Igor Zamyatin, Alexander Monakov, Andi Kleen, gcc-patches On 12.04.2012 16:38, Richard Guenther wrote: > On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> wrote: >> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >> <richard.guenther@gmail.com> wrote: >>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> wrote: >>>> >>>>> Can atom execute two IMUL in parallel? Or what exactly is the pipeline >>>>> behavior? >>>> >>>> As I understand from Intel's optimization reference manual, the behavior is as >>>> follows: if the instruction immediately following IMUL has shorter latency, >>>> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a >>>> 4-or-more cycles latency instruction can be issued after IMUL without a stall. >>>> In other words, IMUL is pipelined with respect to other long-latency >>>> instructions, but not to short-latency instructions. >>> >>> It seems to be modeled in the pipeline description though: >>> >>> ;;; imul insn has 5 cycles latency >>> (define_reservation "atom-imul-32" >>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, >>> atom-port-0") >>> >>> ;;; imul instruction excludes other non-FP instructions. >>> (exclusion_set "atom-eu-0, atom-eu-1" >>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>> >> >> The main idea is quite simple: >> >> If we are going to schedule IMUL instruction (it is on the top of >> ready list) we try to find out producer of other (independent) IMUL >> instruction that is in ready list too. The goal is try to schedule >> such a producer to get another IMUL in ready list and get scheduling >> of 2 successive IMUL instructions. > > Why does that not happen without your patch? Does it never happen without > your patch or does it merely not happen for one EEMBC benchmark (can > you provide a testcase?)? It does not happen because the scheduler by itself does not do such specific reordering. That said, it is easy to imagine the cases where this patch will make things worse rather than better. Igor, why not try different subtler mechanisms like adjust_priority, which is get called when an insn is added to the ready list? E.g. increase the producer's priority. The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out when you have more than one imul in the ready list? Don't you want to bump the priority of the other imul found? Andrey > >> And MD allows us only prefer scheduling of successive IMUL instructions, i.e. >> If IMUL was just scheduled and ready list contains another IMUL >> instruction then it will be chosen as the best candidate for >> scheduling. >> >> >>> at least from my very limited guessing of what the above does. So, did you >>> analyze why the scheduler does not properly schedule things for you? >>> >>> Richard. >>> >>>> >>>> From reading the patch, I could not understand the link between pipeline >>>> behavior and what the patch appears to do. >>>> >>>> Hope that helps. >>>> >>>> Alexander ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 13:02 ` Andrey Belevantsev @ 2012-04-12 13:55 ` Richard Guenther 2012-04-12 14:03 ` Andrey Belevantsev 2012-04-13 10:18 ` Igor Zamyatin 1 sibling, 1 reply; 26+ messages in thread From: Richard Guenther @ 2012-04-12 13:55 UTC (permalink / raw) To: Andrey Belevantsev Cc: Igor Zamyatin, Alexander Monakov, Andi Kleen, gcc-patches 2012/4/12 Andrey Belevantsev <abel@ispras.ru>: > On 12.04.2012 16:38, Richard Guenther wrote: >> >> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >> wrote: >>> >>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>> <richard.guenther@gmail.com> wrote: >>>> >>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>> wrote: >>>>> >>>>> >>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>> pipeline >>>>>> behavior? >>>>> >>>>> >>>>> As I understand from Intel's optimization reference manual, the >>>>> behavior is as >>>>> follows: if the instruction immediately following IMUL has shorter >>>>> latency, >>>>> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, >>>>> a >>>>> 4-or-more cycles latency instruction can be issued after IMUL without a >>>>> stall. >>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>> instructions, but not to short-latency instructions. >>>> >>>> >>>> It seems to be modeled in the pipeline description though: >>>> >>>> ;;; imul insn has 5 cycles latency >>>> (define_reservation "atom-imul-32" >>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, >>>> atom-port-0") >>>> >>>> ;;; imul instruction excludes other non-FP instructions. >>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>> >>> >>> The main idea is quite simple: >>> >>> If we are going to schedule IMUL instruction (it is on the top of >>> ready list) we try to find out producer of other (independent) IMUL >>> instruction that is in ready list too. The goal is try to schedule >>> such a producer to get another IMUL in ready list and get scheduling >>> of 2 successive IMUL instructions. >> >> >> Why does that not happen without your patch? Does it never happen without >> your patch or does it merely not happen for one EEMBC benchmark (can >> you provide a testcase?)? > > > It does not happen because the scheduler by itself does not do such specific > reordering. That said, it is easy to imagine the cases where this patch > will make things worse rather than better. That surprises me. What is so specific about this reordering? > Igor, why not try different subtler mechanisms like adjust_priority, which > is get called when an insn is added to the ready list? E.g. increase the > producer's priority. > > The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out when > you have more than one imul in the ready list? Don't you want to bump the > priority of the other imul found? > > Andrey > > >> >>> And MD allows us only prefer scheduling of successive IMUL instructions, >>> i.e. >>> If IMUL was just scheduled and ready list contains another IMUL >>> instruction then it will be chosen as the best candidate for >>> scheduling. >>> >>> >>>> at least from my very limited guessing of what the above does. So, did >>>> you >>>> analyze why the scheduler does not properly schedule things for you? >>>> >>>> Richard. >>>> >>>>> >>>>> From reading the patch, I could not understand the link between >>>>> pipeline >>>>> behavior and what the patch appears to do. >>>>> >>>>> Hope that helps. >>>>> >>>>> Alexander > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 13:55 ` Richard Guenther @ 2012-04-12 14:03 ` Andrey Belevantsev 2012-04-12 14:22 ` Richard Guenther 0 siblings, 1 reply; 26+ messages in thread From: Andrey Belevantsev @ 2012-04-12 14:03 UTC (permalink / raw) To: Richard Guenther Cc: Igor Zamyatin, Alexander Monakov, Andi Kleen, gcc-patches On 12.04.2012 17:54, Richard Guenther wrote: > 2012/4/12 Andrey Belevantsev<abel@ispras.ru>: >> On 12.04.2012 16:38, Richard Guenther wrote: >>> >>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >>> wrote: >>>> >>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>> <richard.guenther@gmail.com> wrote: >>>>> >>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>>> wrote: >>>>>> >>>>>> >>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>> pipeline >>>>>>> behavior? >>>>>> >>>>>> >>>>>> As I understand from Intel's optimization reference manual, the >>>>>> behavior is as >>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>> latency, >>>>>> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, >>>>>> a >>>>>> 4-or-more cycles latency instruction can be issued after IMUL without a >>>>>> stall. >>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>> instructions, but not to short-latency instructions. >>>>> >>>>> >>>>> It seems to be modeled in the pipeline description though: >>>>> >>>>> ;;; imul insn has 5 cycles latency >>>>> (define_reservation "atom-imul-32" >>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, >>>>> atom-port-0") >>>>> >>>>> ;;; imul instruction excludes other non-FP instructions. >>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>> >>>> >>>> The main idea is quite simple: >>>> >>>> If we are going to schedule IMUL instruction (it is on the top of >>>> ready list) we try to find out producer of other (independent) IMUL >>>> instruction that is in ready list too. The goal is try to schedule >>>> such a producer to get another IMUL in ready list and get scheduling >>>> of 2 successive IMUL instructions. >>> >>> >>> Why does that not happen without your patch? Does it never happen without >>> your patch or does it merely not happen for one EEMBC benchmark (can >>> you provide a testcase?)? >> >> >> It does not happen because the scheduler by itself does not do such specific >> reordering. That said, it is easy to imagine the cases where this patch >> will make things worse rather than better. > > That surprises me. What is so specific about this reordering? I mean that the scheduler does things like "sort the ready list according to a number of heuristics and to the insn priority, then choose the insn that would allow the maximum of ready insns to be issued on the current cycle". The heuristics are rather general. The scheduler does not do things like "if some random insn is ready, then choose some other random insn from the ready list and schedule it" (which is what the patch does). This is what scheduler hooks are for, to account for some target-specific heuristic. The problem is that this particular implementation looks somewhat like an overkill and also motivated by a single benchmark. Testing on a wider set of benchmarks and checking compile-time hit would make the motivation more convincing. Andrey > >> Igor, why not try different subtler mechanisms like adjust_priority, which >> is get called when an insn is added to the ready list? E.g. increase the >> producer's priority. >> >> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out when >> you have more than one imul in the ready list? Don't you want to bump the >> priority of the other imul found? >> >> Andrey >> >> >>> >>>> And MD allows us only prefer scheduling of successive IMUL instructions, >>>> i.e. >>>> If IMUL was just scheduled and ready list contains another IMUL >>>> instruction then it will be chosen as the best candidate for >>>> scheduling. >>>> >>>> >>>>> at least from my very limited guessing of what the above does. So, did >>>>> you >>>>> analyze why the scheduler does not properly schedule things for you? >>>>> >>>>> Richard. >>>>> >>>>>> >>>>>> From reading the patch, I could not understand the link between >>>>>> pipeline >>>>>> behavior and what the patch appears to do. >>>>>> >>>>>> Hope that helps. >>>>>> >>>>>> Alexander >> >> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 14:03 ` Andrey Belevantsev @ 2012-04-12 14:22 ` Richard Guenther 2012-04-13 7:16 ` Andrey Belevantsev 0 siblings, 1 reply; 26+ messages in thread From: Richard Guenther @ 2012-04-12 14:22 UTC (permalink / raw) To: Andrey Belevantsev Cc: Igor Zamyatin, Alexander Monakov, Andi Kleen, gcc-patches 2012/4/12 Andrey Belevantsev <abel@ispras.ru>: > On 12.04.2012 17:54, Richard Guenther wrote: >> >> 2012/4/12 Andrey Belevantsev<abel@ispras.ru>: >>> >>> On 12.04.2012 16:38, Richard Guenther wrote: >>>> >>>> >>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >>>> wrote: >>>>> >>>>> >>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>> <richard.guenther@gmail.com> wrote: >>>>>> >>>>>> >>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>> pipeline >>>>>>>> behavior? >>>>>>> >>>>>>> >>>>>>> >>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>> behavior is as >>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>> latency, >>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>> otherwise, >>>>>>> a >>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>> a >>>>>>> stall. >>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>> instructions, but not to short-latency instructions. >>>>>> >>>>>> >>>>>> >>>>>> It seems to be modeled in the pipeline description though: >>>>>> >>>>>> ;;; imul insn has 5 cycles latency >>>>>> (define_reservation "atom-imul-32" >>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>> atom-imul-4, >>>>>> atom-port-0") >>>>>> >>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>> >>>>> >>>>> The main idea is quite simple: >>>>> >>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>> ready list) we try to find out producer of other (independent) IMUL >>>>> instruction that is in ready list too. The goal is try to schedule >>>>> such a producer to get another IMUL in ready list and get scheduling >>>>> of 2 successive IMUL instructions. >>>> >>>> >>>> >>>> Why does that not happen without your patch? Does it never happen >>>> without >>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>> you provide a testcase?)? >>> >>> >>> >>> It does not happen because the scheduler by itself does not do such >>> specific >>> reordering. That said, it is easy to imagine the cases where this patch >>> will make things worse rather than better. >> >> >> That surprises me. What is so specific about this reordering? > > > I mean that the scheduler does things like "sort the ready list according to > a number of heuristics and to the insn priority, then choose the insn that > would allow the maximum of ready insns to be issued on the current cycle". > The heuristics are rather general. The scheduler does not do things like > "if some random insn is ready, then choose some other random insn from the > ready list and schedule it" (which is what the patch does). This is what > scheduler hooks are for, to account for some target-specific heuristic. > > The problem is that this particular implementation looks somewhat like an > overkill and also motivated by a single benchmark. Testing on a wider set > of benchmarks and checking compile-time hit would make the motivation more > convincing. Yeah, and especially looking _why_ the generic heuristics are not working and if they could be improved. After all the issue seems to be properly modeled in the DFA. Richard. > Andrey > > >> >>> Igor, why not try different subtler mechanisms like adjust_priority, >>> which >>> is get called when an insn is added to the ready list? E.g. increase the >>> producer's priority. >>> >>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out >>> when >>> you have more than one imul in the ready list? Don't you want to bump >>> the >>> priority of the other imul found? >>> >>> Andrey >>> >>> >>>> >>>>> And MD allows us only prefer scheduling of successive IMUL >>>>> instructions, >>>>> i.e. >>>>> If IMUL was just scheduled and ready list contains another IMUL >>>>> instruction then it will be chosen as the best candidate for >>>>> scheduling. >>>>> >>>>> >>>>>> at least from my very limited guessing of what the above does. So, >>>>>> did >>>>>> you >>>>>> analyze why the scheduler does not properly schedule things for you? >>>>>> >>>>>> Richard. >>>>>> >>>>>>> >>>>>>> From reading the patch, I could not understand the link between >>>>>>> pipeline >>>>>>> behavior and what the patch appears to do. >>>>>>> >>>>>>> Hope that helps. >>>>>>> >>>>>>> Alexander >>> >>> >>> > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 14:22 ` Richard Guenther @ 2012-04-13 7:16 ` Andrey Belevantsev 0 siblings, 0 replies; 26+ messages in thread From: Andrey Belevantsev @ 2012-04-13 7:16 UTC (permalink / raw) To: Richard Guenther Cc: Igor Zamyatin, Alexander Monakov, Andi Kleen, gcc-patches On 12.04.2012 18:22, Richard Guenther wrote: > 2012/4/12 Andrey Belevantsev<abel@ispras.ru>: >> On 12.04.2012 17:54, Richard Guenther wrote: >>> >>> 2012/4/12 Andrey Belevantsev<abel@ispras.ru>: >>>> >>>> On 12.04.2012 16:38, Richard Guenther wrote: >>>>> >>>>> >>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >>>>> wrote: >>>>>> >>>>>> >>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>>> <richard.guenther@gmail.com> wrote: >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>>> pipeline >>>>>>>>> behavior? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>>> behavior is as >>>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>>> latency, >>>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>>> otherwise, >>>>>>>> a >>>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>>> a >>>>>>>> stall. >>>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>>> instructions, but not to short-latency instructions. >>>>>>> >>>>>>> >>>>>>> >>>>>>> It seems to be modeled in the pipeline description though: >>>>>>> >>>>>>> ;;; imul insn has 5 cycles latency >>>>>>> (define_reservation "atom-imul-32" >>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>>> atom-imul-4, >>>>>>> atom-port-0") >>>>>>> >>>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>>> >>>>>> >>>>>> The main idea is quite simple: >>>>>> >>>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>>> ready list) we try to find out producer of other (independent) IMUL >>>>>> instruction that is in ready list too. The goal is try to schedule >>>>>> such a producer to get another IMUL in ready list and get scheduling >>>>>> of 2 successive IMUL instructions. >>>>> >>>>> >>>>> >>>>> Why does that not happen without your patch? Does it never happen >>>>> without >>>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>>> you provide a testcase?)? >>>> >>>> >>>> >>>> It does not happen because the scheduler by itself does not do such >>>> specific >>>> reordering. That said, it is easy to imagine the cases where this patch >>>> will make things worse rather than better. >>> >>> >>> That surprises me. What is so specific about this reordering? >> >> >> I mean that the scheduler does things like "sort the ready list according to >> a number of heuristics and to the insn priority, then choose the insn that >> would allow the maximum of ready insns to be issued on the current cycle". >> The heuristics are rather general. The scheduler does not do things like >> "if some random insn is ready, then choose some other random insn from the >> ready list and schedule it" (which is what the patch does). This is what >> scheduler hooks are for, to account for some target-specific heuristic. >> >> The problem is that this particular implementation looks somewhat like an >> overkill and also motivated by a single benchmark. Testing on a wider set >> of benchmarks and checking compile-time hit would make the motivation more >> convincing. > > Yeah, and especially looking _why_ the generic heuristics are not working > and if they could be improved. After all the issue seems to be properly > modeled in the DFA. It is, but the DFA only models the actual stalls that you get when an imul is scheduled. What you need here is to increase priority for ready insns that have imuls in their critical paths, and those imuls should be close enough to fill in the gap (actual consumers in the case of the patch). The lookahead done in max_issue can help with this "dynamic" priority change in general, but it considers just the ready insns, not their immediate consumers. You need to make the deeper lookahead if you want to do this in a target independent way -- no-no from compile time POV. A number of already existing hooks can help: - sched_reorder used in this patch just sorts the ready list in any way the target wishes; - adjust_priority/adjust_cost helps to boost or to lower priority in such subtle cases; - is_costly_dependence (now rs6000 only) can guide the early queue insn removal and its addition to the ready list; - dfa_lookahead_guard (now ia64 only) can ban some insns from being issued on the current try. Using sched_reorder is existing practice, like in s390, spu, mips, etc. Usually though an implementation again only considers the actual ready insns and not the successors. I'd say that trying more tests would help, or other hooks can be tried, too, but after all this is Atom specific, so it's the decision of i386 maintainers. Andrey > > Richard. > >> Andrey >> >> >>> >>>> Igor, why not try different subtler mechanisms like adjust_priority, >>>> which >>>> is get called when an insn is added to the ready list? E.g. increase the >>>> producer's priority. >>>> >>>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out >>>> when >>>> you have more than one imul in the ready list? Don't you want to bump >>>> the >>>> priority of the other imul found? >>>> >>>> Andrey >>>> >>>> >>>>> >>>>>> And MD allows us only prefer scheduling of successive IMUL >>>>>> instructions, >>>>>> i.e. >>>>>> If IMUL was just scheduled and ready list contains another IMUL >>>>>> instruction then it will be chosen as the best candidate for >>>>>> scheduling. >>>>>> >>>>>> >>>>>>> at least from my very limited guessing of what the above does. So, >>>>>>> did >>>>>>> you >>>>>>> analyze why the scheduler does not properly schedule things for you? >>>>>>> >>>>>>> Richard. >>>>>>> >>>>>>>> >>>>>>>> From reading the patch, I could not understand the link between >>>>>>>> pipeline >>>>>>>> behavior and what the patch appears to do. >>>>>>>> >>>>>>>> Hope that helps. >>>>>>>> >>>>>>>> Alexander >>>> >>>> >>>> >> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-12 13:02 ` Andrey Belevantsev 2012-04-12 13:55 ` Richard Guenther @ 2012-04-13 10:18 ` Igor Zamyatin 2012-04-13 11:00 ` Igor Zamyatin 2012-04-13 12:21 ` Andrey Belevantsev 1 sibling, 2 replies; 26+ messages in thread From: Igor Zamyatin @ 2012-04-13 10:18 UTC (permalink / raw) To: Andrey Belevantsev Cc: Richard Guenther, Alexander Monakov, Andi Kleen, gcc-patches On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev <abel@ispras.ru> wrote: > On 12.04.2012 16:38, Richard Guenther wrote: >> >> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >> wrote: >>> >>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>> <richard.guenther@gmail.com> wrote: >>>> >>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>> wrote: >>>>> >>>>> >>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>> pipeline >>>>>> behavior? >>>>> >>>>> >>>>> As I understand from Intel's optimization reference manual, the >>>>> behavior is as >>>>> follows: if the instruction immediately following IMUL has shorter >>>>> latency, >>>>> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, >>>>> a >>>>> 4-or-more cycles latency instruction can be issued after IMUL without a >>>>> stall. >>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>> instructions, but not to short-latency instructions. >>>> >>>> >>>> It seems to be modeled in the pipeline description though: >>>> >>>> ;;; imul insn has 5 cycles latency >>>> (define_reservation "atom-imul-32" >>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, >>>> atom-port-0") >>>> >>>> ;;; imul instruction excludes other non-FP instructions. >>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>> >>> >>> The main idea is quite simple: >>> >>> If we are going to schedule IMUL instruction (it is on the top of >>> ready list) we try to find out producer of other (independent) IMUL >>> instruction that is in ready list too. The goal is try to schedule >>> such a producer to get another IMUL in ready list and get scheduling >>> of 2 successive IMUL instructions. >> >> >> Why does that not happen without your patch? Does it never happen without >> your patch or does it merely not happen for one EEMBC benchmark (can >> you provide a testcase?)? > > > It does not happen because the scheduler by itself does not do such specific > reordering. That said, it is easy to imagine the cases where this patch > will make things worse rather than better. > > Igor, why not try different subtler mechanisms like adjust_priority, which > is get called when an insn is added to the ready list? E.g. increase the > producer's priority. > > The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out when > you have more than one imul in the ready list? Don't you want to bump the > priority of the other imul found? Could you provide some examples when this patch would harm the performance? Sched_reorder was chosen since it is used in other ports and looks most suitable for such case, e.g. it provides access to the whole ready list. BTW, just increasing producer's priority seems to be more risky in performance sense - we can incorrectly start delaying some instructions. Thought ready list doesn't contain DEBUG_INSN... Is it so? If it contains them - this could be added easily The case with more than one imul in the ready list wasn't considered just because the goal was to catch the particular case when there is a risk to get the following picture: "imul-producer-imul" which is less effective than "producer-imul-imul" for Atom > > Andrey > > >> >>> And MD allows us only prefer scheduling of successive IMUL instructions, >>> i.e. >>> If IMUL was just scheduled and ready list contains another IMUL >>> instruction then it will be chosen as the best candidate for >>> scheduling. >>> >>> >>>> at least from my very limited guessing of what the above does. So, did >>>> you >>>> analyze why the scheduler does not properly schedule things for you? >>>> >>>> Richard. >>>> >>>>> >>>>> From reading the patch, I could not understand the link between >>>>> pipeline >>>>> behavior and what the patch appears to do. >>>>> >>>>> Hope that helps. >>>>> >>>>> Alexander > > Thanks, Igor ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-13 10:18 ` Igor Zamyatin @ 2012-04-13 11:00 ` Igor Zamyatin 2012-04-13 12:21 ` Andrey Belevantsev 1 sibling, 0 replies; 26+ messages in thread From: Igor Zamyatin @ 2012-04-13 11:00 UTC (permalink / raw) To: Andrey Belevantsev Cc: Richard Guenther, Alexander Monakov, Andi Kleen, gcc-patches On Fri, Apr 13, 2012 at 2:18 PM, Igor Zamyatin <izamyatin@gmail.com> wrote: > On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev <abel@ispras.ru> wrote: >> On 12.04.2012 16:38, Richard Guenther wrote: >>> >>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >>> wrote: >>>> >>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>> <richard.guenther@gmail.com> wrote: >>>>> >>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>>> wrote: >>>>>> >>>>>> >>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>> pipeline >>>>>>> behavior? >>>>>> >>>>>> >>>>>> As I understand from Intel's optimization reference manual, the >>>>>> behavior is as >>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>> latency, >>>>>> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, >>>>>> a >>>>>> 4-or-more cycles latency instruction can be issued after IMUL without a >>>>>> stall. >>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>> instructions, but not to short-latency instructions. >>>>> >>>>> >>>>> It seems to be modeled in the pipeline description though: >>>>> >>>>> ;;; imul insn has 5 cycles latency >>>>> (define_reservation "atom-imul-32" >>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, >>>>> atom-port-0") >>>>> >>>>> ;;; imul instruction excludes other non-FP instructions. >>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>> >>>> >>>> The main idea is quite simple: >>>> >>>> If we are going to schedule IMUL instruction (it is on the top of >>>> ready list) we try to find out producer of other (independent) IMUL >>>> instruction that is in ready list too. The goal is try to schedule >>>> such a producer to get another IMUL in ready list and get scheduling >>>> of 2 successive IMUL instructions. >>> >>> >>> Why does that not happen without your patch? Does it never happen without >>> your patch or does it merely not happen for one EEMBC benchmark (can >>> you provide a testcase?)? >> >> >> It does not happen because the scheduler by itself does not do such specific >> reordering. That said, it is easy to imagine the cases where this patch >> will make things worse rather than better. >> >> Igor, why not try different subtler mechanisms like adjust_priority, which >> is get called when an insn is added to the ready list? E.g. increase the >> producer's priority. >> >> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out when >> you have more than one imul in the ready list? Don't you want to bump the >> priority of the other imul found? > > Could you provide some examples when this patch would harm the performance? > > Sched_reorder was chosen since it is used in other ports and looks > most suitable for such case, e.g. it provides access to the whole > ready list. > BTW, just increasing producer's priority seems to be more risky in > performance sense - we can incorrectly start delaying some > instructions. > > Thought ready list doesn't contain DEBUG_INSN... Is it so? If it > contains them - this could be added easily > > The case with more than one imul in the ready list wasn't considered > just because the goal was to catch the particular case when there is a > risk to get the following picture: "imul-producer-imul" which is less > effective than "producer-imul-imul" for Atom Actually, this constraint of one imul could be omitted. I'm going to try this change > >> >> Andrey >> >> >>> >>>> And MD allows us only prefer scheduling of successive IMUL instructions, >>>> i.e. >>>> If IMUL was just scheduled and ready list contains another IMUL >>>> instruction then it will be chosen as the best candidate for >>>> scheduling. >>>> >>>> >>>>> at least from my very limited guessing of what the above does. So, did >>>>> you >>>>> analyze why the scheduler does not properly schedule things for you? >>>>> >>>>> Richard. >>>>> >>>>>> >>>>>> From reading the patch, I could not understand the link between >>>>>> pipeline >>>>>> behavior and what the patch appears to do. >>>>>> >>>>>> Hope that helps. >>>>>> >>>>>> Alexander >> >> > > Thanks, > Igor ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-13 10:18 ` Igor Zamyatin 2012-04-13 11:00 ` Igor Zamyatin @ 2012-04-13 12:21 ` Andrey Belevantsev 2012-04-16 20:27 ` Igor Zamyatin 1 sibling, 1 reply; 26+ messages in thread From: Andrey Belevantsev @ 2012-04-13 12:21 UTC (permalink / raw) To: Igor Zamyatin Cc: Richard Guenther, Alexander Monakov, Andi Kleen, gcc-patches On 13.04.2012 14:18, Igor Zamyatin wrote: > On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev<abel@ispras.ru> wrote: >> On 12.04.2012 16:38, Richard Guenther wrote: >>> >>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >>> wrote: >>>> >>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>> <richard.guenther@gmail.com> wrote: >>>>> >>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>>> wrote: >>>>>> >>>>>> >>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>> pipeline >>>>>>> behavior? >>>>>> >>>>>> >>>>>> As I understand from Intel's optimization reference manual, the >>>>>> behavior is as >>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>> latency, >>>>>> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, >>>>>> a >>>>>> 4-or-more cycles latency instruction can be issued after IMUL without a >>>>>> stall. >>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>> instructions, but not to short-latency instructions. >>>>> >>>>> >>>>> It seems to be modeled in the pipeline description though: >>>>> >>>>> ;;; imul insn has 5 cycles latency >>>>> (define_reservation "atom-imul-32" >>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, >>>>> atom-port-0") >>>>> >>>>> ;;; imul instruction excludes other non-FP instructions. >>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>> >>>> >>>> The main idea is quite simple: >>>> >>>> If we are going to schedule IMUL instruction (it is on the top of >>>> ready list) we try to find out producer of other (independent) IMUL >>>> instruction that is in ready list too. The goal is try to schedule >>>> such a producer to get another IMUL in ready list and get scheduling >>>> of 2 successive IMUL instructions. >>> >>> >>> Why does that not happen without your patch? Does it never happen without >>> your patch or does it merely not happen for one EEMBC benchmark (can >>> you provide a testcase?)? >> >> >> It does not happen because the scheduler by itself does not do such specific >> reordering. That said, it is easy to imagine the cases where this patch >> will make things worse rather than better. >> >> Igor, why not try different subtler mechanisms like adjust_priority, which >> is get called when an insn is added to the ready list? E.g. increase the >> producer's priority. >> >> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out when >> you have more than one imul in the ready list? Don't you want to bump the >> priority of the other imul found? > > Could you provide some examples when this patch would harm the performance? I thought of the cases when the other ready insns can fill up the hole and that would be more beneficial because e.g. they would be on more critical paths than the producer of your second imul. I don't know enough of Atom to give an example -- maybe some long divisions? > > Sched_reorder was chosen since it is used in other ports and looks > most suitable for such case, e.g. it provides access to the whole > ready list. > BTW, just increasing producer's priority seems to be more risky in > performance sense - we can incorrectly start delaying some > instructions. Yes, but exactly because of the above example you can start incorrectly delaying other insns, too, as you force the insn to be the first in the list. While bumping priority still leaves the scheduler sorting heuristics in place and actually lowers that risk. > > Thought ready list doesn't contain DEBUG_INSN... Is it so? If it > contains them - this could be added easily It does, but I'm not sure the sched_reorder hook gets them or they are immediately removed -- I saw similar checks in one of the targets' hooks. Anyways, my main thought was that it is better to test on more benchmarks to alleviate the above concerns, so as long as the i386 maintainers are happy, I don't see major problems here. A good idea could be to generalize the patch to handle other long latency insns as second consumers, not only imuls, if this is relevant for Atom. Andrey > > The case with more than one imul in the ready list wasn't considered > just because the goal was to catch the particular case when there is a > risk to get the following picture: "imul-producer-imul" which is less > effective than "producer-imul-imul" for Atom > >> >> Andrey >> >> >>> >>>> And MD allows us only prefer scheduling of successive IMUL instructions, >>>> i.e. >>>> If IMUL was just scheduled and ready list contains another IMUL >>>> instruction then it will be chosen as the best candidate for >>>> scheduling. >>>> >>>> >>>>> at least from my very limited guessing of what the above does. So, did >>>>> you >>>>> analyze why the scheduler does not properly schedule things for you? >>>>> >>>>> Richard. >>>>> >>>>>> >>>>>> From reading the patch, I could not understand the link between >>>>>> pipeline >>>>>> behavior and what the patch appears to do. >>>>>> >>>>>> Hope that helps. >>>>>> >>>>>> Alexander >> >> > > Thanks, > Igor ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-13 12:21 ` Andrey Belevantsev @ 2012-04-16 20:27 ` Igor Zamyatin 2012-04-20 12:05 ` Igor Zamyatin 0 siblings, 1 reply; 26+ messages in thread From: Igor Zamyatin @ 2012-04-16 20:27 UTC (permalink / raw) To: Andrey Belevantsev Cc: Richard Guenther, Alexander Monakov, Andi Kleen, gcc-patches [-- Attachment #1: Type: text/plain, Size: 6248 bytes --] On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev <abel@ispras.ru> wrote: > On 13.04.2012 14:18, Igor Zamyatin wrote: >> >> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev<abel@ispras.ru> >> wrote: >>> >>> On 12.04.2012 16:38, Richard Guenther wrote: >>>> >>>> >>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >>>> wrote: >>>>> >>>>> >>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>> <richard.guenther@gmail.com> wrote: >>>>>> >>>>>> >>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>> pipeline >>>>>>>> behavior? >>>>>>> >>>>>>> >>>>>>> >>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>> behavior is as >>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>> latency, >>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>> otherwise, >>>>>>> a >>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>> a >>>>>>> stall. >>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>> instructions, but not to short-latency instructions. >>>>>> >>>>>> >>>>>> >>>>>> It seems to be modeled in the pipeline description though: >>>>>> >>>>>> ;;; imul insn has 5 cycles latency >>>>>> (define_reservation "atom-imul-32" >>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>> atom-imul-4, >>>>>> atom-port-0") >>>>>> >>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>> >>>>> >>>>> The main idea is quite simple: >>>>> >>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>> ready list) we try to find out producer of other (independent) IMUL >>>>> instruction that is in ready list too. The goal is try to schedule >>>>> such a producer to get another IMUL in ready list and get scheduling >>>>> of 2 successive IMUL instructions. >>>> >>>> >>>> >>>> Why does that not happen without your patch? Does it never happen >>>> without >>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>> you provide a testcase?)? >>> >>> >>> >>> It does not happen because the scheduler by itself does not do such >>> specific >>> reordering. That said, it is easy to imagine the cases where this patch >>> will make things worse rather than better. >>> >>> Igor, why not try different subtler mechanisms like adjust_priority, >>> which >>> is get called when an insn is added to the ready list? E.g. increase the >>> producer's priority. >>> >>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out >>> when >>> you have more than one imul in the ready list? Don't you want to bump >>> the >>> priority of the other imul found? >> >> >> Could you provide some examples when this patch would harm the >> performance? > > > I thought of the cases when the other ready insns can fill up the hole and > that would be more beneficial because e.g. they would be on more critical > paths than the producer of your second imul. I don't know enough of Atom to > give an example -- maybe some long divisions? > > >> >> Sched_reorder was chosen since it is used in other ports and looks >> most suitable for such case, e.g. it provides access to the whole >> ready list. >> BTW, just increasing producer's priority seems to be more risky in >> performance sense - we can incorrectly start delaying some >> instructions. > > > Yes, but exactly because of the above example you can start incorrectly > delaying other insns, too, as you force the insn to be the first in the > list. While bumping priority still leaves the scheduler sorting heuristics > in place and actually lowers that risk. > >> >> Thought ready list doesn't contain DEBUG_INSN... Is it so? If it >> contains them - this could be added easily > > > It does, but I'm not sure the sched_reorder hook gets them or they are > immediately removed -- I saw similar checks in one of the targets' hooks. Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached > > Anyways, my main thought was that it is better to test on more benchmarks to > alleviate the above concerns, so as long as the i386 maintainers are happy, > I don't see major problems here. A good idea could be to generalize the > patch to handle other long latency insns as second consumers, not only > imuls, if this is relevant for Atom. Yes, generalization of this approach is in plans. According to Atom Software optimization guide there are several headrooms left here. As for trying on more benchmarks - the particular case is indeed quite rare. I attached the example where patch helps to group imuls in pairs which is profitable for Atom. Such and similar codes are not very common. But hopefully this approach could help avoid this and other glassjaws. > > Andrey > > >> >> The case with more than one imul in the ready list wasn't considered >> just because the goal was to catch the particular case when there is a >> risk to get the following picture: "imul-producer-imul" which is less >> effective than "producer-imul-imul" for Atom >> >>> >>> Andrey >>> >>> >>>> >>>>> And MD allows us only prefer scheduling of successive IMUL >>>>> instructions, >>>>> i.e. >>>>> If IMUL was just scheduled and ready list contains another IMUL >>>>> instruction then it will be chosen as the best candidate for >>>>> scheduling. >>>>> >>>>> >>>>>> at least from my very limited guessing of what the above does. So, >>>>>> did >>>>>> you >>>>>> analyze why the scheduler does not properly schedule things for you? >>>>>> >>>>>> Richard. >>>>>> >>>>>>> >>>>>>> From reading the patch, I could not understand the link between >>>>>>> pipeline >>>>>>> behavior and what the patch appears to do. >>>>>>> >>>>>>> Hope that helps. >>>>>>> >>>>>>> Alexander >>> >>> >>> >> >> Thanks, >> Igor > > [-- Attachment #2: patch_imul.patch --] [-- Type: application/octet-stream, Size: 3956 bytes --] diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index abe3f1b..8ab25c9 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -23908,6 +23908,114 @@ ia32_multipass_dfa_lookahead (void) } } +/* Try to reorder ready list to take advantage of Atom pipelined IMUL + execution. It is applied if + (1) IMUL instrction is on the top of list; + (2) There exists the only producer of independent IMUL instruction in + ready list; + (3) Put found producer on the top of ready list. + Returns issue rate. */ + +static int +ix86_sched_reorder(FILE *dump, int sched_verbose, rtx *ready, int *pn_ready, + int clock_var ATTRIBUTE_UNUSED) +{ + static int issue_rate = -1; + int n_ready = *pn_ready; + rtx insn; + rtx insn1; + rtx insn2; + int i; + sd_iterator_def sd_it; + dep_t dep; + int index = -1; + + /* set up issue rate */ + if (issue_rate < 0) + issue_rate = ix86_issue_rate(); + + /* do reodering for Atom only */ + if (ix86_tune != PROCESSOR_ATOM) + return issue_rate; + /* nothing to do if ready list contains only 1 instruction */ + if (n_ready <= 1) + return issue_rate; + + /* check that IMUL instruction is on the top of ready list. */ + insn = ready[n_ready - 1]; + if (!NONDEBUG_INSN_P (insn)) + return issue_rate; + insn = PATTERN (insn); + if (GET_CODE (insn) == PARALLEL) + insn = XVECEXP (insn, 0, 0); + if (GET_CODE (insn) != SET) + return issue_rate; + if (!(GET_CODE (SET_SRC (insn)) == MULT + && GET_MODE (SET_SRC (insn)) == SImode)) + return issue_rate; + + /* search for producer of independent IMUL instruction. */ + for (i = n_ready - 2; i>= 0; i--) + { + insn = ready[i]; + if (!NONDEBUG_INSN_P (insn)) + continue; + /* skip IMUL instruction */ + insn2 = PATTERN (insn); + if (GET_CODE (insn2) == PARALLEL) + insn2 = XVECEXP (insn2, 0, 0); + if (GET_CODE (insn2) == SET + && GET_CODE (SET_SRC (insn2)) == MULT + && GET_MODE (SET_SRC (insn2)) == SImode) + continue; + + FOR_EACH_DEP (insn, SD_LIST_FORW, sd_it, dep) + { + rtx con; + con = DEP_CON (dep); + insn1 = PATTERN (con); + if (GET_CODE (insn1) == PARALLEL) + insn1 = XVECEXP (insn1, 0, 0); + + if (GET_CODE (insn1) == SET + && GET_CODE (SET_SRC (insn1)) == MULT + && GET_MODE (SET_SRC (insn1)) == SImode) + { + sd_iterator_def sd_it1; + dep_t dep1; + /* check if there is no other dependee for IMUL. */ + index = i; + FOR_EACH_DEP (con, SD_LIST_BACK, sd_it1, dep1) + { + rtx pro; + pro = DEP_PRO (dep1); + if (pro != insn) + index = -1; + } + + if (index >= 0) + break; + } + } + if (index >= 0) + break; + } + if (index < 0) + return issue_rate; /* didn't find IMUL producer. */ + + if (sched_verbose > 1) + fprintf(dump, ";;\tatom sched_reorder: swap %d and %d insns\n", + INSN_UID (ready[index]), INSN_UID (ready[n_ready - 1])); + + /* put IMUL producer (ready[index]) at the top of ready list */ + insn1 = ready[index]; + for (i = index; i < n_ready - 1; i++) + ready[i] = ready[i + 1]; + ready[n_ready - 1] = insn1; + + return issue_rate; +} + \f /* Model decoder of Core 2/i7. @@ -38071,6 +38179,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree) #define TARGET_SCHED_DISPATCH_DO do_dispatch #undef TARGET_SCHED_REASSOCIATION_WIDTH #define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width +#undef TARGET_SCHED_REORDER +#define TARGET_SCHED_REORDER ix86_sched_reorder /* The size of the dispatch window is the total number of bytes of object code allowed in a window. */ [-- Attachment #3: imul_atom_test.c --] [-- Type: text/x-csrc, Size: 291 bytes --] void foo(int i, int k, short *p) { int c; c = (short)i * (short)k; p[0] = c; c = (short)i * (short) (k>>16) + (c>>16); p[2] = c>>16; c = (short)(i>>16) * (short)k + p[1]; c = (short)(i>>16) * (short)(k>>16) + p[2] + (c>>16); p[3] = c>>16; } ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-16 20:27 ` Igor Zamyatin @ 2012-04-20 12:05 ` Igor Zamyatin 2012-05-06 7:27 ` Igor Zamyatin 0 siblings, 1 reply; 26+ messages in thread From: Igor Zamyatin @ 2012-04-20 12:05 UTC (permalink / raw) To: gcc-patches, Andrey Belevantsev Cc: Richard Guenther, Alexander Monakov, Andi Kleen On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin <izamyatin@gmail.com> wrote: > On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev <abel@ispras.ru> wrote: >> On 13.04.2012 14:18, Igor Zamyatin wrote: >>> >>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev<abel@ispras.ru> >>> wrote: >>>> >>>> On 12.04.2012 16:38, Richard Guenther wrote: >>>>> >>>>> >>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >>>>> wrote: >>>>>> >>>>>> >>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>>> <richard.guenther@gmail.com> wrote: >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>>> pipeline >>>>>>>>> behavior? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>>> behavior is as >>>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>>> latency, >>>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>>> otherwise, >>>>>>>> a >>>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>>> a >>>>>>>> stall. >>>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>>> instructions, but not to short-latency instructions. >>>>>>> >>>>>>> >>>>>>> >>>>>>> It seems to be modeled in the pipeline description though: >>>>>>> >>>>>>> ;;; imul insn has 5 cycles latency >>>>>>> (define_reservation "atom-imul-32" >>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>>> atom-imul-4, >>>>>>> atom-port-0") >>>>>>> >>>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>>> >>>>>> >>>>>> The main idea is quite simple: >>>>>> >>>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>>> ready list) we try to find out producer of other (independent) IMUL >>>>>> instruction that is in ready list too. The goal is try to schedule >>>>>> such a producer to get another IMUL in ready list and get scheduling >>>>>> of 2 successive IMUL instructions. >>>>> >>>>> >>>>> >>>>> Why does that not happen without your patch? Does it never happen >>>>> without >>>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>>> you provide a testcase?)? >>>> >>>> >>>> >>>> It does not happen because the scheduler by itself does not do such >>>> specific >>>> reordering. That said, it is easy to imagine the cases where this patch >>>> will make things worse rather than better. >>>> >>>> Igor, why not try different subtler mechanisms like adjust_priority, >>>> which >>>> is get called when an insn is added to the ready list? E.g. increase the >>>> producer's priority. >>>> >>>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out >>>> when >>>> you have more than one imul in the ready list? Don't you want to bump >>>> the >>>> priority of the other imul found? >>> >>> >>> Could you provide some examples when this patch would harm the >>> performance? >> >> >> I thought of the cases when the other ready insns can fill up the hole and >> that would be more beneficial because e.g. they would be on more critical >> paths than the producer of your second imul. I don't know enough of Atom to >> give an example -- maybe some long divisions? >> >> >>> >>> Sched_reorder was chosen since it is used in other ports and looks >>> most suitable for such case, e.g. it provides access to the whole >>> ready list. >>> BTW, just increasing producer's priority seems to be more risky in >>> performance sense - we can incorrectly start delaying some >>> instructions. >> >> >> Yes, but exactly because of the above example you can start incorrectly >> delaying other insns, too, as you force the insn to be the first in the >> list. While bumping priority still leaves the scheduler sorting heuristics >> in place and actually lowers that risk. >> >>> >>> Thought ready list doesn't contain DEBUG_INSN... Is it so? If it >>> contains them - this could be added easily >> >> >> It does, but I'm not sure the sched_reorder hook gets them or they are >> immediately removed -- I saw similar checks in one of the targets' hooks. > > Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached > >> >> Anyways, my main thought was that it is better to test on more benchmarks to >> alleviate the above concerns, so as long as the i386 maintainers are happy, >> I don't see major problems here. A good idea could be to generalize the >> patch to handle other long latency insns as second consumers, not only >> imuls, if this is relevant for Atom. > > Yes, generalization of this approach is in plans. According to Atom > Software optimization guide there are several headrooms left here. > As for trying on more benchmarks - the particular case is indeed quite > rare. I attached the example > where patch helps to group imuls in pairs which is profitable for > Atom. Such and similar codes are not very common. > But hopefully this approach could help avoid this and other glassjaws. BTW, this patch also helps some EEMBC tests when funroll-loops specified. So, any feedback from i386 maintainers about this? :) Changelog slightly changed 2012-04-10 Yuri Rumyantsev <yuri.s.rumyantsev@intel.com> * config/i386/i386.c (x86_sched_reorder): New function. Added new function x86_sched_reorder to take advantage of Atom pipelened IMUL execution. > >> >> Andrey >> >> >>> >>> The case with more than one imul in the ready list wasn't considered >>> just because the goal was to catch the particular case when there is a >>> risk to get the following picture: "imul-producer-imul" which is less >>> effective than "producer-imul-imul" for Atom >>> >>>> >>>> Andrey >>>> >>>> >>>>> >>>>>> And MD allows us only prefer scheduling of successive IMUL >>>>>> instructions, >>>>>> i.e. >>>>>> If IMUL was just scheduled and ready list contains another IMUL >>>>>> instruction then it will be chosen as the best candidate for >>>>>> scheduling. >>>>>> >>>>>> >>>>>>> at least from my very limited guessing of what the above does. So, >>>>>>> did >>>>>>> you >>>>>>> analyze why the scheduler does not properly schedule things for you? >>>>>>> >>>>>>> Richard. >>>>>>> >>>>>>>> >>>>>>>> From reading the patch, I could not understand the link between >>>>>>>> pipeline >>>>>>>> behavior and what the patch appears to do. >>>>>>>> >>>>>>>> Hope that helps. >>>>>>>> >>>>>>>> Alexander >>>> >>>> >>>> >>> >>> Thanks, >>> Igor >> >> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-20 12:05 ` Igor Zamyatin @ 2012-05-06 7:27 ` Igor Zamyatin 2012-05-28 20:20 ` Igor Zamyatin 0 siblings, 1 reply; 26+ messages in thread From: Igor Zamyatin @ 2012-05-06 7:27 UTC (permalink / raw) To: gcc-patches, Andrey Belevantsev Cc: Richard Guenther, Alexander Monakov, Andi Kleen Ping. Could x86 maintainer(s) look at these changes? Thanks, Igor On Fri, Apr 20, 2012 at 4:04 PM, Igor Zamyatin <izamyatin@gmail.com> wrote: > On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin <izamyatin@gmail.com> wrote: >> On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev <abel@ispras.ru> wrote: >>> On 13.04.2012 14:18, Igor Zamyatin wrote: >>>> >>>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev<abel@ispras.ru> >>>> wrote: >>>>> >>>>> On 12.04.2012 16:38, Richard Guenther wrote: >>>>>> >>>>>> >>>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>>>> <richard.guenther@gmail.com> wrote: >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>>>> pipeline >>>>>>>>>> behavior? >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>>>> behavior is as >>>>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>>>> latency, >>>>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>>>> otherwise, >>>>>>>>> a >>>>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>>>> a >>>>>>>>> stall. >>>>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>>>> instructions, but not to short-latency instructions. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> It seems to be modeled in the pipeline description though: >>>>>>>> >>>>>>>> ;;; imul insn has 5 cycles latency >>>>>>>> (define_reservation "atom-imul-32" >>>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>>>> atom-imul-4, >>>>>>>> atom-port-0") >>>>>>>> >>>>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>>>> >>>>>>> >>>>>>> The main idea is quite simple: >>>>>>> >>>>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>>>> ready list) we try to find out producer of other (independent) IMUL >>>>>>> instruction that is in ready list too. The goal is try to schedule >>>>>>> such a producer to get another IMUL in ready list and get scheduling >>>>>>> of 2 successive IMUL instructions. >>>>>> >>>>>> >>>>>> >>>>>> Why does that not happen without your patch? Does it never happen >>>>>> without >>>>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>>>> you provide a testcase?)? >>>>> >>>>> >>>>> >>>>> It does not happen because the scheduler by itself does not do such >>>>> specific >>>>> reordering. That said, it is easy to imagine the cases where this patch >>>>> will make things worse rather than better. >>>>> >>>>> Igor, why not try different subtler mechanisms like adjust_priority, >>>>> which >>>>> is get called when an insn is added to the ready list? E.g. increase the >>>>> producer's priority. >>>>> >>>>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out >>>>> when >>>>> you have more than one imul in the ready list? Don't you want to bump >>>>> the >>>>> priority of the other imul found? >>>> >>>> >>>> Could you provide some examples when this patch would harm the >>>> performance? >>> >>> >>> I thought of the cases when the other ready insns can fill up the hole and >>> that would be more beneficial because e.g. they would be on more critical >>> paths than the producer of your second imul. I don't know enough of Atom to >>> give an example -- maybe some long divisions? >>> >>> >>>> >>>> Sched_reorder was chosen since it is used in other ports and looks >>>> most suitable for such case, e.g. it provides access to the whole >>>> ready list. >>>> BTW, just increasing producer's priority seems to be more risky in >>>> performance sense - we can incorrectly start delaying some >>>> instructions. >>> >>> >>> Yes, but exactly because of the above example you can start incorrectly >>> delaying other insns, too, as you force the insn to be the first in the >>> list. While bumping priority still leaves the scheduler sorting heuristics >>> in place and actually lowers that risk. >>> >>>> >>>> Thought ready list doesn't contain DEBUG_INSN... Is it so? If it >>>> contains them - this could be added easily >>> >>> >>> It does, but I'm not sure the sched_reorder hook gets them or they are >>> immediately removed -- I saw similar checks in one of the targets' hooks. >> >> Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached >> >>> >>> Anyways, my main thought was that it is better to test on more benchmarks to >>> alleviate the above concerns, so as long as the i386 maintainers are happy, >>> I don't see major problems here. A good idea could be to generalize the >>> patch to handle other long latency insns as second consumers, not only >>> imuls, if this is relevant for Atom. >> >> Yes, generalization of this approach is in plans. According to Atom >> Software optimization guide there are several headrooms left here. >> As for trying on more benchmarks - the particular case is indeed quite >> rare. I attached the example >> where patch helps to group imuls in pairs which is profitable for >> Atom. Such and similar codes are not very common. >> But hopefully this approach could help avoid this and other glassjaws. > > BTW, this patch also helps some EEMBC tests when funroll-loops specified. > So, any feedback from i386 maintainers about this? :) > > Changelog slightly changed > > > 2012-04-10 Yuri Rumyantsev <yuri.s.rumyantsev@intel.com> > > * config/i386/i386.c (x86_sched_reorder): New function. > Added new function x86_sched_reorder > to take advantage of Atom pipelened IMUL execution. > > >> >>> >>> Andrey >>> >>> >>>> >>>> The case with more than one imul in the ready list wasn't considered >>>> just because the goal was to catch the particular case when there is a >>>> risk to get the following picture: "imul-producer-imul" which is less >>>> effective than "producer-imul-imul" for Atom >>>> >>>>> >>>>> Andrey >>>>> >>>>> >>>>>> >>>>>>> And MD allows us only prefer scheduling of successive IMUL >>>>>>> instructions, >>>>>>> i.e. >>>>>>> If IMUL was just scheduled and ready list contains another IMUL >>>>>>> instruction then it will be chosen as the best candidate for >>>>>>> scheduling. >>>>>>> >>>>>>> >>>>>>>> at least from my very limited guessing of what the above does. So, >>>>>>>> did >>>>>>>> you >>>>>>>> analyze why the scheduler does not properly schedule things for you? >>>>>>>> >>>>>>>> Richard. >>>>>>>> >>>>>>>>> >>>>>>>>> From reading the patch, I could not understand the link between >>>>>>>>> pipeline >>>>>>>>> behavior and what the patch appears to do. >>>>>>>>> >>>>>>>>> Hope that helps. >>>>>>>>> >>>>>>>>> Alexander >>>>> >>>>> >>>>> >>>> >>>> Thanks, >>>> Igor >>> >>> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-05-06 7:27 ` Igor Zamyatin @ 2012-05-28 20:20 ` Igor Zamyatin 0 siblings, 0 replies; 26+ messages in thread From: Igor Zamyatin @ 2012-05-28 20:20 UTC (permalink / raw) To: gcc-patches, Andrey Belevantsev Cc: Richard Guenther, Alexander Monakov, Andi Kleen Ping? On Sun, May 6, 2012 at 11:27 AM, Igor Zamyatin <izamyatin@gmail.com> wrote: > Ping. Could x86 maintainer(s) look at these changes? > > Thanks, > Igor > > On Fri, Apr 20, 2012 at 4:04 PM, Igor Zamyatin <izamyatin@gmail.com> wrote: >> On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin <izamyatin@gmail.com> wrote: >>> On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev <abel@ispras.ru> wrote: >>>> On 13.04.2012 14:18, Igor Zamyatin wrote: >>>>> >>>>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev<abel@ispras.ru> >>>>> wrote: >>>>>> >>>>>> On 12.04.2012 16:38, Richard Guenther wrote: >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamyatin@gmail.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>>>>> <richard.guenther@gmail.com> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amonakov@ispras.ru> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>>>>> pipeline >>>>>>>>>>> behavior? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>>>>> behavior is as >>>>>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>>>>> latency, >>>>>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>>>>> otherwise, >>>>>>>>>> a >>>>>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>>>>> a >>>>>>>>>> stall. >>>>>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>>>>> instructions, but not to short-latency instructions. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> It seems to be modeled in the pipeline description though: >>>>>>>>> >>>>>>>>> ;;; imul insn has 5 cycles latency >>>>>>>>> (define_reservation "atom-imul-32" >>>>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>>>>> atom-imul-4, >>>>>>>>> atom-port-0") >>>>>>>>> >>>>>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>>>>> >>>>>>>> >>>>>>>> The main idea is quite simple: >>>>>>>> >>>>>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>>>>> ready list) we try to find out producer of other (independent) IMUL >>>>>>>> instruction that is in ready list too. The goal is try to schedule >>>>>>>> such a producer to get another IMUL in ready list and get scheduling >>>>>>>> of 2 successive IMUL instructions. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Why does that not happen without your patch? Does it never happen >>>>>>> without >>>>>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>>>>> you provide a testcase?)? >>>>>> >>>>>> >>>>>> >>>>>> It does not happen because the scheduler by itself does not do such >>>>>> specific >>>>>> reordering. That said, it is easy to imagine the cases where this patch >>>>>> will make things worse rather than better. >>>>>> >>>>>> Igor, why not try different subtler mechanisms like adjust_priority, >>>>>> which >>>>>> is get called when an insn is added to the ready list? E.g. increase the >>>>>> producer's priority. >>>>>> >>>>>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out >>>>>> when >>>>>> you have more than one imul in the ready list? Don't you want to bump >>>>>> the >>>>>> priority of the other imul found? >>>>> >>>>> >>>>> Could you provide some examples when this patch would harm the >>>>> performance? >>>> >>>> >>>> I thought of the cases when the other ready insns can fill up the hole and >>>> that would be more beneficial because e.g. they would be on more critical >>>> paths than the producer of your second imul. I don't know enough of Atom to >>>> give an example -- maybe some long divisions? >>>> >>>> >>>>> >>>>> Sched_reorder was chosen since it is used in other ports and looks >>>>> most suitable for such case, e.g. it provides access to the whole >>>>> ready list. >>>>> BTW, just increasing producer's priority seems to be more risky in >>>>> performance sense - we can incorrectly start delaying some >>>>> instructions. >>>> >>>> >>>> Yes, but exactly because of the above example you can start incorrectly >>>> delaying other insns, too, as you force the insn to be the first in the >>>> list. While bumping priority still leaves the scheduler sorting heuristics >>>> in place and actually lowers that risk. >>>> >>>>> >>>>> Thought ready list doesn't contain DEBUG_INSN... Is it so? If it >>>>> contains them - this could be added easily >>>> >>>> >>>> It does, but I'm not sure the sched_reorder hook gets them or they are >>>> immediately removed -- I saw similar checks in one of the targets' hooks. >>> >>> Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached >>> >>>> >>>> Anyways, my main thought was that it is better to test on more benchmarks to >>>> alleviate the above concerns, so as long as the i386 maintainers are happy, >>>> I don't see major problems here. A good idea could be to generalize the >>>> patch to handle other long latency insns as second consumers, not only >>>> imuls, if this is relevant for Atom. >>> >>> Yes, generalization of this approach is in plans. According to Atom >>> Software optimization guide there are several headrooms left here. >>> As for trying on more benchmarks - the particular case is indeed quite >>> rare. I attached the example >>> where patch helps to group imuls in pairs which is profitable for >>> Atom. Such and similar codes are not very common. >>> But hopefully this approach could help avoid this and other glassjaws. >> >> BTW, this patch also helps some EEMBC tests when funroll-loops specified. >> So, any feedback from i386 maintainers about this? :) >> >> Changelog slightly changed >> >> >> 2012-04-10 Yuri Rumyantsev <yuri.s.rumyantsev@intel.com> >> >> * config/i386/i386.c (x86_sched_reorder): New function. >> Added new function x86_sched_reorder >> to take advantage of Atom pipelened IMUL execution. >> >> >>> >>>> >>>> Andrey >>>> >>>> >>>>> >>>>> The case with more than one imul in the ready list wasn't considered >>>>> just because the goal was to catch the particular case when there is a >>>>> risk to get the following picture: "imul-producer-imul" which is less >>>>> effective than "producer-imul-imul" for Atom >>>>> >>>>>> >>>>>> Andrey >>>>>> >>>>>> >>>>>>> >>>>>>>> And MD allows us only prefer scheduling of successive IMUL >>>>>>>> instructions, >>>>>>>> i.e. >>>>>>>> If IMUL was just scheduled and ready list contains another IMUL >>>>>>>> instruction then it will be chosen as the best candidate for >>>>>>>> scheduling. >>>>>>>> >>>>>>>> >>>>>>>>> at least from my very limited guessing of what the above does. So, >>>>>>>>> did >>>>>>>>> you >>>>>>>>> analyze why the scheduler does not properly schedule things for you? >>>>>>>>> >>>>>>>>> Richard. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> From reading the patch, I could not understand the link between >>>>>>>>>> pipeline >>>>>>>>>> behavior and what the patch appears to do. >>>>>>>>>> >>>>>>>>>> Hope that helps. >>>>>>>>>> >>>>>>>>>> Alexander >>>>>> >>>>>> >>>>>> >>>>> >>>>> Thanks, >>>>> Igor >>>> >>>> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-04-11 13:38 ` Andi Kleen 2012-04-11 14:10 ` Richard Guenther @ 2012-04-12 10:18 ` Igor Zamyatin 1 sibling, 0 replies; 26+ messages in thread From: Igor Zamyatin @ 2012-04-12 10:18 UTC (permalink / raw) To: Andi Kleen; +Cc: gcc-patches On Wed, Apr 11, 2012 at 5:38 PM, Andi Kleen <andi@firstfloor.org> wrote: > Igor Zamyatin <izamyatin@gmail.com> writes: > >> Hi All! >> >> It is known that imul placement is rather critical for Atom processors >> and changes try to improve imul scheduling for Atom. >> >> This gives +5% performance on several tests from new OA 2.0 testsuite >> from EEMBC. >> >> Tested for i386 and x86-64, ok for trunk? > > Did you measure how much this slows down the compiler when compiling > for Atom? The new pass looks rather slow. No, since this "pass" is applied rarely. Conditions are mentioned in the comments to ix86_sched_reorder > > -Andi > > -- > ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement
@ 2012-05-28 21:41 Uros Bizjak
[not found] ` <0EFAB2BDD0F67E4FB6CCC8B9F87D756915E05C45@IRSMSX101.ger.corp.intel.com>
0 siblings, 1 reply; 26+ messages in thread
From: Uros Bizjak @ 2012-05-28 21:41 UTC (permalink / raw)
To: gcc-patches
Cc: Andrey Belevantsev, Richard Guenther, Alexander Monakov, Andi Kleen
Hello!
> Ping?
Please at least add and URL to the patch, it took me some time to
found the latest version [1], I'm not even sure if it is the latest
version...
I assume that you cleared all issues with middle-end and scheduler
maintainers, it is not clear from the message.
+ (1) IMUL instrction is on the top of list;
Typo above.
+ static int issue_rate = -1;
+ int n_ready = *pn_ready;
+ rtx insn;
+ rtx insn1;
+ rtx insn2;
Please put three definitions above on the same line.
+ int i;
+ sd_iterator_def sd_it;
+ dep_t dep;
+ int index = -1;
+
+ /* set up issue rate */
+ if (issue_rate < 0)
+ issue_rate = ix86_issue_rate();
Please set issue_rate unconditionally here. Also, please follow the
GNU style of comments (Full sentence with two spaces after the dot)
everywhere, e.g:
/* Set up issue rate. */
+ if (!(GET_CODE (SET_SRC (insn)) == MULT
+ && GET_MODE (SET_SRC (insn)) == SImode))
+ return issue_rate;
Is it correct that only SImode multiplies are checked against SImode
multiplies? Can't we use DImode or HImode multiply (or other
long-latency insns) to put them into the shadow of the first multiply
insn?
As proposed in [2], there are many other fine-tuning approaches
proposed by the scheduler maintainer. OTOH, even the "big hammer"
approach in the proposed patch makes things better, so it is the step
in the right direction - and it is existing practice anyway.
Under this rationale, I think that the patch should be committed to
mainline. But please also consider proposed fine-tunings to refine the
scheduling accuracy.
So, OK for mainline, if there are no objections from other maintainers
in next two days.
[1] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00964.html
[2] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00806.html
Thanks,
Uros.
^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <0EFAB2BDD0F67E4FB6CCC8B9F87D756915E05C45@IRSMSX101.ger.corp.intel.com>]
* Re: [PATCH] Atom: Scheduler improvements for better imul placement [not found] ` <0EFAB2BDD0F67E4FB6CCC8B9F87D756915E05C45@IRSMSX101.ger.corp.intel.com> @ 2012-05-29 19:01 ` Igor Zamyatin 2012-06-01 13:46 ` H.J. Lu 2012-06-01 17:48 ` Eric Botcazou 0 siblings, 2 replies; 26+ messages in thread From: Igor Zamyatin @ 2012-05-29 19:01 UTC (permalink / raw) To: gcc-patches; +Cc: Uros Bizjak [-- Attachment #1: Type: text/plain, Size: 2354 bytes --] Hi, Uros! Sorry, I didn't realize that patch was missed. I attached new version. Changelog: 2012-05-29 Yuri Rumyantsev <yuri.s.rumyantsev@intel.com> * config/i386/i386.c (x86_sched_reorder): New function. Added new function x86_sched_reorder. As for multiply modes, currently we handled most frequent case for Atom and in the future this could be enhanced. Thanks, Igor > > Hello! > >> Ping? > > Please at least add and URL to the patch, it took me some time to found the latest version [1], I'm not even sure if it is the latest version... > > I assume that you cleared all issues with middle-end and scheduler maintainers, it is not clear from the message. > > + (1) IMUL instrction is on the top of list; > > Typo above. > > + static int issue_rate = -1; > + int n_ready = *pn_ready; > + rtx insn; > + rtx insn1; > + rtx insn2; > > Please put three definitions above on the same line. > > + int i; > + sd_iterator_def sd_it; > + dep_t dep; > + int index = -1; > + > + /* set up issue rate */ > + if (issue_rate < 0) > + issue_rate = ix86_issue_rate(); > > Please set issue_rate unconditionally here. Also, please follow the GNU style of comments (Full sentence with two spaces after the dot) everywhere, e.g: > > /* Set up issue rate. */ > > + if (!(GET_CODE (SET_SRC (insn)) == MULT > + && GET_MODE (SET_SRC (insn)) == SImode)) > + return issue_rate; > > Is it correct that only SImode multiplies are checked against SImode multiplies? Can't we use DImode or HImode multiply (or other long-latency insns) to put them into the shadow of the first multiply insn? > > As proposed in [2], there are many other fine-tuning approaches proposed by the scheduler maintainer. OTOH, even the "big hammer" > approach in the proposed patch makes things better, so it is the step in the right direction - and it is existing practice anyway. > > Under this rationale, I think that the patch should be committed to mainline. But please also consider proposed fine-tunings to refine the scheduling accuracy. > > So, OK for mainline, if there are no objections from other maintainers in next two days. > > [1] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00964.html > [2] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00806.html > > Thanks, > Uros. [-- Attachment #2: imul_reordering.patch --] [-- Type: application/octet-stream, Size: 3893 bytes --] diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 9e4ada0..5e4dd02 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -23857,6 +23857,110 @@ ia32_multipass_dfa_lookahead (void) } } +/* Try to reorder ready list to take advantage of Atom pipelined IMUL + execution. It is applied if + (1) IMUL instruction is on the top of list; + (2) There exists the only producer of independent IMUL instruction in + ready list; + (3) Put found producer on the top of ready list. + Returns issue rate. */ + +static int +ix86_sched_reorder(FILE *dump, int sched_verbose, rtx *ready, int *pn_ready, + int clock_var ATTRIBUTE_UNUSED) +{ + static int issue_rate = -1; + int n_ready = *pn_ready; + rtx insn, insn1, insn2; + int i; + sd_iterator_def sd_it; + dep_t dep; + int index = -1; + + /* Set up issue rate. */ + issue_rate = ix86_issue_rate(); + + /* Do reodering for Atom only. */ + if (ix86_tune != PROCESSOR_ATOM) + return issue_rate; + /* Nothing to do if ready list contains only 1 instruction. */ + if (n_ready <= 1) + return issue_rate; + + /* Check that IMUL instruction is on the top of ready list. */ + insn = ready[n_ready - 1]; + if (!NONDEBUG_INSN_P (insn)) + return issue_rate; + insn = PATTERN (insn); + if (GET_CODE (insn) == PARALLEL) + insn = XVECEXP (insn, 0, 0); + if (GET_CODE (insn) != SET) + return issue_rate; + if (!(GET_CODE (SET_SRC (insn)) == MULT + && GET_MODE (SET_SRC (insn)) == SImode)) + return issue_rate; + + /* Search for producer of independent IMUL instruction. */ + for (i = n_ready - 2; i>= 0; i--) + { + insn = ready[i]; + if (!NONDEBUG_INSN_P (insn)) + continue; + /* Skip IMUL instruction. */ + insn2 = PATTERN (insn); + if (GET_CODE (insn2) == PARALLEL) + insn2 = XVECEXP (insn2, 0, 0); + if (GET_CODE (insn2) == SET + && GET_CODE (SET_SRC (insn2)) == MULT + && GET_MODE (SET_SRC (insn2)) == SImode) + continue; + + FOR_EACH_DEP (insn, SD_LIST_FORW, sd_it, dep) + { + rtx con; + con = DEP_CON (dep); + insn1 = PATTERN (con); + if (GET_CODE (insn1) == PARALLEL) + insn1 = XVECEXP (insn1, 0, 0); + + if (GET_CODE (insn1) == SET + && GET_CODE (SET_SRC (insn1)) == MULT + && GET_MODE (SET_SRC (insn1)) == SImode) + { + sd_iterator_def sd_it1; + dep_t dep1; + /* Check if there is no other dependee for IMUL. */ + index = i; + FOR_EACH_DEP (con, SD_LIST_BACK, sd_it1, dep1) + { + rtx pro; + pro = DEP_PRO (dep1); + if (pro != insn) + index = -1; + } + if (index >= 0) + break; + } + } + if (index >= 0) + break; + } + if (index < 0) + return issue_rate; /* Didn't find IMUL producer. */ + + if (sched_verbose > 1) + fprintf(dump, ";;\tatom sched_reorder: swap %d and %d insns\n", + INSN_UID (ready[index]), INSN_UID (ready[n_ready - 1])); + + /* Put IMUL producer (ready[index]) at the top of ready list. */ + insn1= ready[index]; + for (i = index; i < n_ready - 1; i++) + ready[i] = ready[i + 1]; + ready[n_ready - 1] = insn1; + + return issue_rate; +} + \f /* Model decoder of Core 2/i7. @@ -38500,6 +38604,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree) #define TARGET_SCHED_DISPATCH_DO do_dispatch #undef TARGET_SCHED_REASSOCIATION_WIDTH #define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width +#undef TARGET_SCHED_REORDER +#define TARGET_SCHED_REORDER ix86_sched_reorder /* The size of the dispatch window is the total number of bytes of object code allowed in a window. */ ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-05-29 19:01 ` Igor Zamyatin @ 2012-06-01 13:46 ` H.J. Lu 2012-06-01 17:48 ` Eric Botcazou 1 sibling, 0 replies; 26+ messages in thread From: H.J. Lu @ 2012-06-01 13:46 UTC (permalink / raw) To: Igor Zamyatin, Yuri Rumyantsev; +Cc: gcc-patches, Uros Bizjak On Tue, May 29, 2012 at 12:01 PM, Igor Zamyatin <izamyatin@gmail.com> wrote: > Hi, Uros! > > Sorry, I didn't realize that patch was missed. I attached new version. > > Changelog: > > 2012-05-29 Yuri Rumyantsev <yuri.s.rumyantsev@intel.com> > > * config/i386/i386.c (x86_sched_reorder): New function. > Added new function x86_sched_reorder. > > As for multiply modes, currently we handled most frequent case for > Atom and in the future this could be enhanced. > This may have caused: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53555 -- H.J. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] Atom: Scheduler improvements for better imul placement 2012-05-29 19:01 ` Igor Zamyatin 2012-06-01 13:46 ` H.J. Lu @ 2012-06-01 17:48 ` Eric Botcazou 1 sibling, 0 replies; 26+ messages in thread From: Eric Botcazou @ 2012-06-01 17:48 UTC (permalink / raw) To: Igor Zamyatin; +Cc: gcc-patches, Uros Bizjak > Sorry, I didn't realize that patch was missed. I attached new version. > > Changelog: > > 2012-05-29 Yuri Rumyantsev <yuri.s.rumyantsev@intel.com> > > * config/i386/i386.c (x86_sched_reorder): New function. > Added new function x86_sched_reorder. Reading it, you get the impression that the new function is unused. Replace the second line with the description of the second hunk. -- Eric Botcazou ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2012-06-01 17:48 UTC | newest] Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-04-11 8:40 [PATCH] Atom: Scheduler improvements for better imul placement Igor Zamyatin 2012-04-11 13:38 ` Andi Kleen 2012-04-11 14:10 ` Richard Guenther 2012-04-12 10:20 ` Igor Zamyatin 2012-04-12 10:45 ` Richard Guenther 2012-04-12 12:00 ` Alexander Monakov 2012-04-12 12:24 ` Richard Guenther 2012-04-12 12:36 ` Igor Zamyatin 2012-04-12 12:38 ` Richard Guenther 2012-04-12 13:02 ` Andrey Belevantsev 2012-04-12 13:55 ` Richard Guenther 2012-04-12 14:03 ` Andrey Belevantsev 2012-04-12 14:22 ` Richard Guenther 2012-04-13 7:16 ` Andrey Belevantsev 2012-04-13 10:18 ` Igor Zamyatin 2012-04-13 11:00 ` Igor Zamyatin 2012-04-13 12:21 ` Andrey Belevantsev 2012-04-16 20:27 ` Igor Zamyatin 2012-04-20 12:05 ` Igor Zamyatin 2012-05-06 7:27 ` Igor Zamyatin 2012-05-28 20:20 ` Igor Zamyatin 2012-04-12 10:18 ` Igor Zamyatin 2012-05-28 21:41 Uros Bizjak [not found] ` <0EFAB2BDD0F67E4FB6CCC8B9F87D756915E05C45@IRSMSX101.ger.corp.intel.com> 2012-05-29 19:01 ` Igor Zamyatin 2012-06-01 13:46 ` H.J. Lu 2012-06-01 17:48 ` Eric Botcazou
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).