Re: c/3917: IA-64 assembler output shows erroneous cycle counting

public inbox for gcc-prs@sourceware.org

* Re: c/3917: IA-64 assembler output shows erroneous cycle counting
@ 2001-09-20 16:56 wilson
From: wilson @ 2001-09-20 16:56 UTC (permalink / raw)
  To: gbv, gcc-bugs, gcc-prs, wilson

Synopsis: IA-64 assembler output shows erroneous cycle counting

State-Changed-From-To: analyzed->closed
State-Changed-By: wilson
State-Changed-When: Thu Sep 20 16:56:14 2001
State-Changed-Why:
    I looked at the issue of making better choices about
    putting padding nops into partially full bundles.  The
    info we need to make a better choice is not easily
    available at the place where we need to make the choice.
    I tried adding a quick hack to pad with a nop if a bundle
    has the first two slots full and the split point is
    after the third slot.  I would expect this to show some
    performance improvement, but not as much as we could get
    if we had better info available.
    Index: ia64.c
    ===================================================================
    RCS file: /cvs/gcc/gcc/gcc/config/ia64/ia64.c,v
    retrieving revision 1.120
    diff -p -r1.120 ia64.c
    *** ia64.c	2001/08/23 19:27:54	1.120
    --- ia64.c	2001/09/19 18:44:24
    *************** cycle_end_fill_slots (dump)
    *** 5508,5513 ****
    --- 5523,5537 ----
            slot++;
          }
      #endif
    + 
    +   if ((slot == 2 && sched_data.split >= 3)
    +       || (slot == 5 && sched_data.split == 6))
    +     {
    +       sched_data.types[slot] = packet->t[slot];
    +       sched_data.insns[slot] = 0;
    +       sched_data.stopbit[slot] = 0;
    +       slot++;
    +     }
      
        sched_data.first_slot = sched_data.cur = slot;
      }
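
The condition added in the hunk above can be read as a standalone predicate.
What follows is a minimal sketch, not code from ia64.c: the function name
should_pad_with_nop is hypothetical, and it assumes that slots are numbered
0 to 5 across the two bundles being filled and that the split value is the
slot index at which the current cycle ends.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of ia64.c.  It mirrors the condition the
   patch adds to cycle_end_fill_slots.  SLOT is the next free slot (assumed
   0..6 over two bundles); SPLIT is the slot index where the cycle ends
   (assumed 0..6).  */
bool
should_pad_with_nop (int slot, int split)
{
  assert (slot >= 0 && slot <= 6);
  assert (split >= 0 && split <= 6);

  /* First bundle has slots 0 and 1 filled and the cycle ends at or past
     the end of that bundle: slot 2 gets a nop.  */
  if (slot == 2 && split >= 3)
    return true;

  /* Same situation in the second bundle: slots 3 and 4 are filled and
     the cycle ends at the end of the second bundle.  */
  if (slot == 5 && split == 6)
    return true;

  return false;
}
```

Under these assumptions, should_pad_with_nop (2, 3) is true, while
should_pad_with_nop (1, 3) is false because only one slot of the first
bundle has been filled.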
    
    I tried running spec95 int and fp benchmarks, with no
    patch, the first patch, and both patches.  I also added
    a hack to disable Jan Hubicka's July 15 loop.c patch,
    which is causing performance regressions.  I get no
    observable performance increase from the first patch,
    and I get a small performance decrease from the second
    patch.  From this I conclude that my first patch is
    desirable, but my second one is not, and that no further
    investigation is worthwhile at the moment for this
    problem.
    
    We will still have the problem that the scheduler will
    occasionally put two FP insns in different cycles when
    they could go in the same cycle, but in most cases the
    scheduler should now get this right, and I see no
    performance increase from trying to fix the remaining
    cases.
    
    I will be checking in my first patch.

http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view&pr=3917&database=gcc



* Re: c/3917: IA-64 assembler output shows erroneous cycle counting
@ 2001-09-17 22:24 wilson
From: wilson @ 2001-09-17 22:24 UTC (permalink / raw)
  To: gbv, gcc-bugs, gcc-prs, nobody, wilson

Synopsis: IA-64 assembler output shows erroneous cycle counting

Responsible-Changed-From-To: unassigned->wilson
Responsible-Changed-By: wilson
Responsible-Changed-When: Mon Sep 17 22:23:58 2001
Responsible-Changed-Why:
    IA-64 maintainer
State-Changed-From-To: open->analyzed
State-Changed-By: wilson
State-Changed-When: Mon Sep 17 22:23:58 2001
State-Changed-Why:
    The cycle counts indicate what the scheduler thinks the
    hardware will do.  They will always be wrong to some
    extent, because perfect emulation of the hardware
    pipeline is difficult.  Also, current gcc infrastructure
    does not have any easy way to describe pipelines as
    complicated as the Itanium.  Major discrepancies should
    be fixed though.
    
    I need a testcase to make sure that we are talking about
    the same thing.  I have provided one of my own.
    
    double sub2 (double, double, double, double);
    
    double
    sub (double w, double x, double y, double z, double a, double b, double c, double d)
    {
      return sub2 (a + b, c + d, a + b, c + d);
    }
    
    With this testcase, I see that the 4 add instructions
    get scheduled in 4 different cycles for no apparent
    reason.
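
To link and run the testcase on its own, sub2 needs a definition.  A
minimal sketch follows.  The body of sub2 here is a hypothetical stand-in
that simply sums its arguments; it is not part of the original report and
is only there so the file builds into a complete program.

```c
#include <assert.h>

double sub2 (double, double, double, double);

/* The testcase from the report, unchanged apart from line wrapping.  */
double
sub (double w, double x, double y, double z,
     double a, double b, double c, double d)
{
  return sub2 (a + b, c + d, a + b, c + d);
}

/* Hypothetical stand-in for sub2 (an assumption, not from the report):
   returns the sum of its four arguments.  */
double
sub2 (double p, double q, double r, double s)
{
  return p + q + r + s;
}
```

For scheduling experiments it is probably better to keep sub2 in a
separate file, so that the compiler cannot inline it into sub.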
    
    You are correct that there is a bug in itanium_split_issue.
    It should allow 2 FP instructions per cycle.  When I tested
    this fix, I ran into a number of further bugs that also
    needed fixing.  There was a problem where scheduling M M F I0
    instructions caused selection of MLX MFI bundles, requiring
    emitting a nop.x instruction, which we did not have
    support for.  Bundles MFI MFI would have been better.
    This is a problem with insn_matches_slot not knowing that
    an instruction filling LX slots uses FI issue slots;
    thus it thought that with MLX MFI the I would issue to
    the I0 unit, which is not true.  With this fixed, I ran
    into another problem where scheduling an L instruction
    (mov.l) in one cycle and then an I instruction requiring
    unit I0 in the next cycle gave an abort.  This is because
    the scheduler doesn't know that L instructions take two
    slots, so it thought that the MLX bundle was not full yet,
    and tried to schedule the I0 instruction without rotating
    out the MLX bundle, which is impossible.  To fix this, I
    hacked in code to make L instructions take two slots,
    which forces the bundle rotation.  This works, but did
    not seem to be the right solution.  With this patch, the
    scheduler now puts the first two add instructions in
    the same cycle, but not the last two.  This is an
    improvement, but we can do better.
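
The two-slot behaviour of L instructions described above can be illustrated
with a small model.  This is a sketch under stated assumptions, not the code
from the patch: the enum and the functions slots_used, advance_slot and
at_bundle_boundary are hypothetical names, and a bundle is modeled as three
consecutive slots.

```c
#include <assert.h>

/* Hypothetical instruction classes, for illustration only.  */
enum insn_class { CLASS_M, CLASS_I, CLASS_F, CLASS_B, CLASS_L };

#define SLOTS_PER_BUNDLE 3

/* An L instruction occupies both the L and the X slot of an MLX bundle;
   every other class is modeled as taking a single slot.  */
int
slots_used (enum insn_class c)
{
  return c == CLASS_L ? 2 : 1;
}

/* Return the next free slot after placing an insn of class C at SLOT.  */
int
advance_slot (int slot, enum insn_class c)
{
  assert (slot >= 0);
  return slot + slots_used (c);
}

/* Nonzero if SLOT sits on a bundle boundary, meaning the previous bundle
   is full and has to be rotated out before more insns are placed.  */
int
at_bundle_boundary (int slot)
{
  return slot > 0 && slot % SLOTS_PER_BUNDLE == 0;
}
```

In this model, placing M at slot 0 and then L at slot 1 moves the next free
slot to 3, so the MLX bundle is seen as full.  If L were counted as one
slot, the next free slot would be 2 and a following I0 instruction would be
placed into a bundle that has no room for it.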
    
    The remaining problem, which I have not fixed yet, is
    more involved.  It has to do with how the scheduler tries
    to schedule into two bundles at a time, matching the
    hardware issue rate.  When a bundle is partially filled,
    we need to decide whether to pad it with nops, or to
    try to continue filling it with instructions for the next
    cycle.  The current code here is suboptimal.  It always
    tries to fill with instructions from the next cycle.
    We would get better code in many cases if we padded with
    nops.  This will have to be done carefully to avoid
    slowing down the core with too many nops.  I am planning
    to continue working on this patch.
    
    The current version of the patch is included below.
    I get a slight performance increase on specint95.  I
    have not yet had time to test it on specfp, which would
    be more interesting.
    
    Your patch, by the way, is backwards.  You should always
    do "diff oldfile newfile".

http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view&pr=3917&database=gcc


