public inbox for gcc@gcc.gnu.org
* IA64 control speculation of loads
       [not found] <1094575209.18861549.1606140937179.JavaMail.zimbra@kalray.eu>
@ 2021-02-08 12:25 ` Benoît De Dinechin
  2021-02-08 14:44   ` Alexander Monakov
  0 siblings, 1 reply; 4+ messages in thread
From: Benoît De Dinechin @ 2021-02-08 12:25 UTC
  To: gcc

[-- Attachment #1: Type: text/plain, Size: 1874 bytes --]

Hello, 

Is there a way to activate control speculation of loads in GCC, starting with the ia64 target? For a loop as simple as the one below, I could not get any with GCC 7.5:

double
list_sum(list_cell list)
{
  double result = 0.0;
  while (list->next) {
    list = list->next;
    result += list->payload;
    if (!list->next) break;
    list = list->next;
    result += list->payload;
  }
  return result;
}

Kalray has developed a 64-bit Fisher-style VLIW architecture ('KVX') for use in a manycore processor it produces. These VLIW cores run Linux, and Kalray develops GCC and LLVM code generators for them (see kvx compilers on https://godbolt.org/z/ZJGzje ). VLIW performance on non-numerical code is critically dependent on the control speculation of loads. Being a Fisher-style VLIW, the kvx architecture has dismissable loads instead of control speculative loads, so there is no need to create speculation checks with recovery code.

I first tried in prepass scheduling with SCHED_RGN, hoping from various comments in the source file that it could move loads across blocks (sched-rgn.c:26 The first run performs interblock scheduling, moving insns between different blocks in the same "region"). SCHED_EBB is not available in prepass and SEL_SCHED does not work with control speculation: not only from experience with the kvx retargeting where it breaks dataflow invariants, but also as hinted by logic in ia64.c:ia64_set_sched_flags(). 

My question is whether or not GCC can do any control speculation of loads during prepass scheduling. From what I observed, enabling control speculation in region scheduling only lets the load instructions become ready earlier in their home basic block; they are not scheduled in a dominator basic block, which is what would be needed to improve performance in the above example.
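
As a source-level illustration of the code motion being asked for, here is a hand-written sketch (not compiler output). It assumes a simplified, non-unrolled version of the loop, and it models a dismissable load with two hypothetical helpers, load_next_dismissable() and load_payload_dismissable(), that never fault and yield a dummy value for a NULL address. The name list_sum_speculated() is likewise invented for this sketch.

```c
#include <stddef.h>

typedef struct list_cell_ {
  struct list_cell_ *next;
  double payload;
} list_cell_, *list_cell;

/* Hypothetical stand-ins for dismissable loads: they never fault and
   return a dummy value when the address is invalid (modelled as NULL).  */
static list_cell load_next_dismissable(list_cell p)
{
  return p ? p->next : NULL;
}

static double load_payload_dismissable(list_cell p)
{
  return p ? p->payload : 0.0;
}

/* Reference version: every load stays in its home basic block.  */
static double list_sum(list_cell list)
{
  double result = 0.0;
  while (list->next) {
    list = list->next;
    result += list->payload;
  }
  return result;
}

/* Hand-speculated version: the two loads of the loop body are issued
   before the exit test that guards them, i.e. in the dominating block.
   No check or recovery code is needed because the loads cannot fault.  */
static double list_sum_speculated(list_cell list)
{
  double result = 0.0;
  list_cell next = list->next;
  for (;;) {
    double payload = load_payload_dismissable(next);   /* speculative */
    list_cell next_next = load_next_dismissable(next); /* speculative */
    if (!next)
      break;               /* speculated values are simply discarded */
    result += payload;
    next = next_next;
  }
  return result;
}
```

Both functions accumulate the same payloads in the same order; the only difference is that the speculated version starts the loads one branch earlier, which is where the latency would be hidden on a VLIW.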

Thanks for any advice. 

Benoît Dinechin 


[-- Attachment #2: list_sum-ia64.s --]
[-- Type: application/octet-stream, Size: 944 bytes --]

	.file	"list_sum.c"
	.pred.safe_across_calls p1-p5,p16-p63
	.text
	.align 16
	.align 64
	.global list_sum#
	.type	list_sum#, @function
	.proc list_sum#
list_sum:
	.prologue
	.body
	.mmf
	ld8 r14 = [r32]
	nop 0
	mov f6 = f0
	;;
	.mmi
	cmp.eq p6, p7 = 0, r14
	nop 0
	adds r15 = 8, r14
	;;
	.mfb
	nop 0
	(p6) mov f8 = f0
	(p6) br.cond.dpnt .L1
	;;
	.mmi
	ldfd f8 = [r15]
	ld8 r14 = [r14]
	nop 0
	;;
	.mmi
	nop 0
	cmp.eq p6, p7 = 0, r14
	nop 0
	;;
	.mfb
	nop 0
	fadd.d f8 = f8, f6
	(p6) br.cond.dpnt .L1
	.align 32
.L3:
	.mmi
	adds r15 = 8, r14
	ld8 r14 = [r14]
	nop 0
	;;
	.mmi
	ldfd f6 = [r15]
	adds r16 = 8, r14
	cmp.ne p6, p7 = 0, r14
	;;
	.mfb
	nop 0
	fadd.d f8 = f8, f6
	(p7) br.cond.dpnt .L1
	;;
	.mmi
	ldfd f6 = [r16]
	ld8 r14 = [r14]
	nop 0
	;;
	.mmi
	nop 0
	cmp.eq p6, p7 = 0, r14
	nop 0
	;;
	.mfb
	nop 0
	fadd.d f8 = f8, f6
	(p7) br.cond.dptk .L3
.L1:
	.mib
	nop 0
	nop 0
	br.ret.sptk.many b0
	.endp list_sum#
	.ident	"GCC: (GNU) 7.5.0"

[-- Attachment #3: list_sum-kvx.s --]
[-- Type: application/octet-stream, Size: 759 bytes --]

	.file	"list_sum.c"
	.text

	.align 8
	.global list_sum
	.type	list_sum, @function
list_sum:
	ld $r0 = 0[$r0]
	make $r2 = 0x0000000000000000
	;;	(end cycle 0)
	cb.deqz $r0? .L5
	;;	(end cycle 4)
	ld $r1 = 0[$r0]
	;;	(end cycle 5)
	ld $r0 = 8[$r0]
	;;	(end cycle 6)
	faddd $r0 = $r0, $r2
	cb.dnez $r1? .L3
	;;	(end cycle 9)
	goto .L1
	;;	(end cycle 0)
.L4:
	ld $r1 = 0[$r2]
	;;	(end cycle 0)
	ld $r2 = 8[$r2]
	;;	(end cycle 1)
	faddd $r0 = $r0, $r2
	cb.deqz $r1? .L1
	;;	(end cycle 4)
.L3:
	ld $r2 = 0[$r1]
	;;	(end cycle 0)
	ld $r1 = 8[$r1]
	;;	(end cycle 1)
	faddd $r0 = $r0, $r1
	cb.dnez $r2? .L4
	;;	(end cycle 4)
.L1:
	ret
	;;	(end cycle 0)
.L5:
	make $r0 = 0x0000000000000000
	ret
	;;	(end cycle 0)
	.size	list_sum, .-list_sum
	.ident	"GCC: (GNU) 7.5.0"


* Re: IA64 control speculation of loads
  2021-02-08 12:25 ` IA64 control speculation of loads Benoît De Dinechin
@ 2021-02-08 14:44   ` Alexander Monakov
  2021-02-08 21:12     ` Benoît De Dinechin
  0 siblings, 1 reply; 4+ messages in thread
From: Alexander Monakov @ 2021-02-08 14:44 UTC
  To: Benoît De Dinechin; +Cc: gcc, Andrey Belevantsev


On Mon, 8 Feb 2021, Benoît De Dinechin wrote:

> Hello, 
> 
> Is there a way to activate control speculation of loads in GCC, starting with
> the ia64 target? For a loop as simple as on GCC 7.5, I could not get any: 

I think in that loop cost modeling in sel-sched estimates that load speculation
would not be profitable. With a long-latency operation after the load, I do get
a speculative load at -O3 (for the 'payload' field, but not 'next'):

struct list {
  struct list *next;
  double payload;
};

double f(struct list *l)
{
  double result = 0;
  for (; l; l = l->next)
    result += 1 / l->payload;
  return result;
}

> Kalray has developed a 64-bit Fisher-style VLIW architecture ('KVX') for use
> in a manycore processor it produces. These VLIW cores run Linux, and Kalray
> develops GCC and LLVM code generators for them (see kvx compilers on
> https://godbolt.org/z/ZJGzje ). VLIW performance on non-numerical code is
> critically dependent on the control speculation of loads. Being a
> Fischer-style VLIW, the kvx architecture has dismissable loads instead of
> control speculative loads, so there is no need to create speculation check
> with recovery code. 
> 
> I first tried in prepass scheduling with SCHED_RGN, hoping from various
> comments in the source file that it could move loads across blocks
> (sched-rgn.c:26 The first run performs interblock scheduling, moving insns
> between different blocks in the same "region"). SCHED_EBB is not available in
> prepass and SEL_SCHED does not work with control speculation: not only from
> experience with the kvx retargeting where it breaks dataflow invariants, but
> also as hinted by logic in ia64.c:ia64_set_sched_flags(). 

Can you elaborate on the dataflow issues you've encountered? I don't recall the
specific reason why control speculation before register allocation cannot be
enabled with sel-sched, but I'd expect it has to do with the interval between
the speculative load and the check, in which the register may not be stored to
memory normally (needs dedicated spill/fill instructions), and interaction with
uninitialized variables assigned the same register.

If on KVX you don't need speculation checks, those concerns would not apply.

Why are you looking for pre-RA (prepass) scheduling specifically? To avoid
anti-dependencies created by register allocation?

> My question is whether GCC can or cannot do any control speculation of loads
> during prepass scheduling. From what I observed, enabling control speculation
> in region scheduling only enables the load instructions to get ready earlier
> in their home basic block, not being scheduled in a dominator basic block like
> expected to happen for improving performance in the above example. 

But there's no control flow inside a basic block, so the load can appear earlier
due to data speculation (or normal scheduling), not control speculation.

I think GCC may have correctness issues with ia64-style control speculation
before register allocation, but I can't think of a reason why check-free loads
would pose a problem.

Alexander


* Re: IA64 control speculation of loads
  2021-02-08 14:44   ` Alexander Monakov
@ 2021-02-08 21:12     ` Benoît De Dinechin
  2021-02-09 10:33       ` Benoît De Dinechin
  0 siblings, 1 reply; 4+ messages in thread
From: Benoît De Dinechin @ 2021-02-08 21:12 UTC
  To: Alexander Monakov; +Cc: gcc, Andrey Belevantsev

Hi Alexander,

Indeed there is control speculation on this example, but it only happens in sched2, which is delayed to mach and uses SEL_SCHED:

  ../gcc/ia64/gcc/cc1 -fpreprocessed ../list_sum.i  -quiet -dumpbase list_sum.c -auxbase list_sum -O3  -o list_sum.s -da -dp -dA -fsched-verbose=6

  grep movdf_speculative list_sum.c.*
  list_sum.c.298r.mach:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.298r.mach:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.298r.mach:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.298r.mach:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.299r.barriers:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.299r.barriers:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.303r.shorten:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.303r.shorten:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.304r.nothrow:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.304r.nothrow:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.306r.final:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.306r.final:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.307r.dfinish:            ] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.307r.dfinish:            ] UNSPEC_LDS)) 25 {movdf_speculative}

The kvx back-end uses SCHED_EBB in sched2, which is also delayed to mach. On kvx, SCHED_EBB performs better than SEL_SCHED in most cases.

I suspect that control speculation likely does not work in prepass scheduling with SEL_SCHED because of the following in ia64_set_sched_flags():

  if (mflag_sched_control_spec
      && (!sel_sched_p ()
          || reload_completed))
    {
      mask |= BEGIN_CONTROL;
  
      if (!sel_sched_p () && mflag_sched_in_control_spec)
        mask |= BE_IN_CONTROL;
    }
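
To make the effect of that condition easier to see, here is a small standalone model of it. This is only a sketch: the struct sched_state type and the control_spec_mask() function are invented for illustration, and the bit values given to BEGIN_CONTROL and BE_IN_CONTROL are placeholders, not the ones GCC defines.

```c
#include <stdbool.h>

/* Placeholder bit values, for illustration only; the real definitions
   live in the GCC scheduler headers.  */
enum { BEGIN_CONTROL = 1, BE_IN_CONTROL = 2 };

/* Hypothetical snapshot of the state the ia64 condition depends on.  */
struct sched_state {
  bool control_spec;      /* models mflag_sched_control_spec */
  bool in_control_spec;   /* models mflag_sched_in_control_spec */
  bool sel_sched;         /* models sel_sched_p () */
  bool reload_completed;  /* models reload_completed != 0 */
};

/* Mirrors the structure of the quoted condition.  */
static unsigned control_spec_mask(struct sched_state s)
{
  unsigned mask = 0;
  if (s.control_spec && (!s.sel_sched || s.reload_completed)) {
    mask |= BEGIN_CONTROL;
    if (!s.sel_sched && s.in_control_spec)
      mask |= BE_IN_CONTROL;
  }
  return mask;
}
```

With sel_sched set and reload_completed clear (SEL_SCHED in prepass), the model returns 0, so no control speculation is requested; with reload_completed set it returns BEGIN_CONTROL only; without sel-sched it can return both bits.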

On the kvx, I tried the following kvx_sched_set_sched_flags():



* Re: IA64 control speculation of loads
  2021-02-08 21:12     ` Benoît De Dinechin
@ 2021-02-09 10:33       ` Benoît De Dinechin
  0 siblings, 0 replies; 4+ messages in thread
From: Benoît De Dinechin @ 2021-02-09 10:33 UTC
  To: Alexander Monakov; +Cc: gcc

Sorry, my previous answer was cut off.

The motivation for prepass global code motion is indeed that after register allocation, inter-block scheduling is even more restricted due to anti-dependencies, including those due to live-out on side exit branches. Global code motion is a key performance enabler especially for the non-temporal loads (i.e. L1 cache bypass loads), which have an exposed latency close to 20 cycles on the current kvx cores.

The dataflow issue encountered with SEL_SCHED in prepass with control speculation enabled was inconsistent liveness information reported by the compiler. I am running a test suite to reproduce it (I saw it 3 months ago).

Here again is a motivating example where I expect the scheduler to speculate loads from the second block of the loop into the first block, which dominates it, so in principle SCHED_RGN should do it:

  typedef struct list_cell_ {
    struct list_cell_ *next;
    float payload;
  } list_cell_, *list_cell;

  float
  list_sum(list_cell_ *list)
  {
    float result = 0.0;
    while (list->next) {
      list = list->next;
      result += 1.0f/list->payload;
      if (!list->next) break;
      list = list->next;
      result += 1.0f/list->payload;
    }
    return result;
  }

Here is the TARGET_SCHED_SET_SCHED_FLAGS implementation, with comments that reflect my understanding of what to do. The commented-out line would prevent SEL_SCHED from using control speculation unless in postpass (as in ia64):

  static void
  kvx_sched_set_sched_flags (struct spec_info_def *spec_info)
  {
    unsigned int *flags = &(current_sched_info->flags);
    // Speculative scheduling is enabled by non-zero spec_info->mask.
    spec_info->mask = 0;
    if (*flags & (SEL_SCHED | SCHED_RGN))
      {
        //if (!sel_sched_p () || reload_completed)
          {
            // Must do this in case of speculation.
            *flags |= USE_DEPS_LIST | DO_SPECULATION;
            // Do control speculation only.
            spec_info->mask = BEGIN_CONTROL;
            // Speculative scheduling without CHECK.
            spec_info->flags = SEL_SCHED_SPEC_DONT_CHECK_CONTROL;
            // Dump into the sched_dump.
            spec_info->dump = sched_dump;
          }
      }
  }

The TARGET_SCHED_SPECULATE_INSN hook is implemented by the following (it should memoize and return 0 if the insn was already speculated with the same ts, which I assume is not relevant here):

  static int
  kvx_sched_speculate_insn (rtx_insn *insn, ds_t ts, rtx *new_pat)
  {
    rtx pattern = PATTERN (insn);
    if (GET_CODE (pattern) == SET)
      {
        rtx src = SET_SRC (pattern);
        if (GET_CODE (src) == MEM)
          {
            *new_pat = pattern;
            return 1;
          }
      }
    return -1;
  }

And TARGET_SCHED_NEEDS_BLOCK_P always returns false.

When I compile the motivating example above for the KVX, kvx_sched_speculate_insn() is indeed called with reload_completed==0 (prepass) for the two loads of the second block, but no code motion to the first block happens. Generated code is the same for SCHED_RGN (default) or SEL_SCHED (-fselective-scheduling), up to a renaming of the registers, although SEL_SCHED calls kvx_sched_speculate_insn() several times for each load.

For the ia64 on the motivating example, it seems there is no prepass control speculation either:

  ./gcc/ia64/gcc/cc1 -fpreprocessed list_sum2.i -quiet -dumpbase list_sum2.c -dp -auxbase list_sum2 -O3 -version -ffast-math -o list_sum2.s -da -dp -msched-control-spec -msched-in-control-spec
  grep _speculative list_sum2.c.*
  list_sum2.c.298r.mach:            ] UNSPEC_LDS)) 24 {movsf_speculative}
  ...

I noticed that the ia64 target uses the undocumented target hooks TARGET_SCHED_GET_INSN_SPEC_DS and TARGET_SCHED_GET_INSN_CHECKED_DS whose code is actually executed on this example.

Any recommendation on how to get load control speculation in prepass for any of the GCC 7.5 targets?

Best,

Benoît



