option -mprfchw on 2 different Opteron cpus

public inbox for gcc@gcc.gnu.org

* option -mprfchw on 2 different Opteron cpus
@ 2016-05-01 20:25 NightStrike
  2016-05-02  9:55 ` Kumar, Venkataramanan
  0 siblings, 1 reply; 5+ messages in thread
From: NightStrike @ 2016-05-01 20:25 UTC (permalink / raw)
  To: gcc; +Cc: Jan Hubicka, Jakub Jelinek

Reposting from here:
https://gcc.gnu.org/ml/gcc-help/2016-05/msg00003.html

Not sure if this applies:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54210

If I compile on a k8 Opteron 248 with -march=native, I do not see
-mprfchw listed in the options in -fverbose-asm.  In the assembly, I
see this:

prefetcht0      (%rax)  # ivtmp.1160
prefetcht0      304(%rcx)       #
prefetcht0      (%rax)  # ivtmp.1160

If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying
to target the older system), I do see it listed in the options in
-fverbose-asm.  In the assembly, I see this:

prefetcht0      (%rax)  # ivtmp.1160
prefetcht0      304(%rcx)       #
prefetchw       (%rax)  # ivtmp.1160

(The third line is the only difference)

In both cases, I'm using gcc 4.9.3.  Which is correct for a k8 Opteron 248?

Also, FWIW:

1) The march=native version that uses prefetcht0 is very repeatably
faster by about 15% in the particular test case I'm looking at.

2) The compilers in both instances are not just the same version, they
are the same compiler binary installed on an NFS mount and shared to both
computers.


* RE: option -mprfchw on 2 different Opteron cpus
  2016-05-01 20:25 option -mprfchw on 2 different Opteron cpus NightStrike
@ 2016-05-02  9:55 ` Kumar, Venkataramanan
  2016-05-02 17:01   ` NightStrike
  0 siblings, 1 reply; 5+ messages in thread
From: Kumar, Venkataramanan @ 2016-05-02  9:55 UTC (permalink / raw)
  To: NightStrike, Uros Bizjak (ubizjak@gmail.com), lopezibanez
  Cc: Jan Hubicka, Jakub Jelinek, gcc

Hi,

> -----Original Message-----
> From: gcc-owner@gcc.gnu.org [mailto:gcc-owner@gcc.gnu.org] On Behalf Of
> NightStrike
> Sent: Monday, May 2, 2016 1:55 AM
> To: gcc@gcc.gnu.org
> Cc: Jan Hubicka <hubicka@ucw.cz>; Jakub Jelinek <jakub@redhat.com>
> Subject: option -mprfchw on 2 different Opteron cpus	
> 
> Reposting from here:
> https://gcc.gnu.org/ml/gcc-help/2016-05/msg00003.html
> 
> Not sure if this applies:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54210
> 
> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
> listed in the options in -fverbose-asm.  In the assembly, I see this:
> 
> prefetcht0      (%rax)  # ivtmp.1160
> prefetcht0      304(%rcx)       #
> prefetcht0      (%rax)  # ivtmp.1160

On AMD processors, the -mprfchw flag is used to enable "3dnowprefetch" ISA support.

(Snip)
CPUID Fn8000_0001_ECX Feature Identifiers
Bit 8 
3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and
“PREFETCHW” in APM3
Ref: http://support.amd.com/TechDocs/25481.pdf
(Snip)

Can you please confirm what this CPUID flag returns on your k8 machine?
I believe this ISA is not available on the k8 machine, so when -march=native is added you don't see -mprfchw in the verbose output.

> 
> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to
> target the older system), I do see it listed in the options in -fverbose-asm.  In
> the assembly, I see this:

K8 has 3dnow support, and there is a patch that replaced 3dnow with prefetchw (3DNowPrefetch):
https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
So when you add -march=k8, you see -mprfchw listed in the verbose output.

> 
> prefetcht0      (%rax)  # ivtmp.1160
> prefetcht0      304(%rcx)       #
> prefetchw       (%rax)  # ivtmp.1160
> 
> (The third line is the only difference)
> 

This is my guess without seeing the test case: when write prefetching is requested, "prefetchw" is generated.
The 3dnow (TARGET_3DNOW) ISA has support for it.

(Snip)
Support for the PREFETCH and PREFETCHW instructions is indicated by CPUID
Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
Fn8000_0001_EDX[3DNow] = 1.
(Snip)
Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf

> In both cases, I'm using gcc 4.9.3.  Which is correct for a k8 Opteron 248?
> 
> Also, FWIW:
> 
> 1) The march=native version that uses prefetcht0 is very repeatably faster by
> about 15% in the particular test case I'm looking at.
> 
> 2) The compilers in both instances are not just the same version, they are the
> same compiler binary installed on an NFS mount and shared to both
> computers.

As per the GCC 4.9.3 source:

(Snip)
(define_expand "prefetch"
  [(prefetch (match_operand 0 "address_operand")
             (match_operand:SI 1 "const_int_operand")
             (match_operand:SI 2 "const_int_operand"))]
  "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
{
  bool write = INTVAL (operands[1]) != 0;
  int locality = INTVAL (operands[2]);

  gcc_assert (IN_RANGE (locality, 0, 3));

  /* Use 3dNOW prefetch in case we are asking for write prefetch not
     supported by SSE counterpart or the SSE prefetch is not available
     (K6 machines).  Otherwise use SSE prefetch as it allows specifying
     of locality.  */
  if (TARGET_PREFETCHWT1 && write && locality <= 2)
    operands[2] = const2_rtx;
  else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
    operands[2] = GEN_INT (3);
  else
    operands[1] = const0_rtx;
})
(Snip)

A write prefetch may be requested (either by the auto prefetcher or by builtins), but with -march=native the check below could have become false:
   else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
TARGET_PRFCHW is off with -march=native.
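
As a rough illustration of the expander quoted above, here is a minimal C sketch of its operand rewriting. The struct and function names are hypothetical stand-ins (GCC itself uses the TARGET_* macros and RTL operands), and the sketch does not model the expander's enabling condition or the later instruction patterns.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for GCC's TARGET_* macros. */
struct target_flags {
    bool prefetch_sse;  /* TARGET_PREFETCH_SSE */
    bool prfchw;        /* TARGET_PRFCHW */
    bool prefetchwt1;   /* TARGET_PREFETCHWT1 */
};

/* The two integer operands of the prefetch pattern. */
struct prefetch_ops {
    int write;     /* operands[1]: nonzero means write prefetch */
    int locality;  /* operands[2]: 0..3 */
};

/* Mirrors the body of the quoted define_expand: returns the operands
   as they would look after the expander has rewritten them. */
static struct prefetch_ops
expand_prefetch(struct target_flags t, struct prefetch_ops in)
{
    struct prefetch_ops out = in;
    bool write = in.write != 0;

    /* Same range check as gcc_assert (IN_RANGE (locality, 0, 3)). */
    assert(in.locality >= 0 && in.locality <= 3);

    if (t.prefetchwt1 && write && in.locality <= 2)
        out.locality = 2;   /* operands[2] = const2_rtx */
    else if (t.prfchw && (write || !t.prefetch_sse))
        out.locality = 3;   /* operands[2] = GEN_INT (3) */
    else
        out.write = 0;      /* operands[1] = const0_rtx */
    return out;
}
```

With prefetch_sse and prfchw both set (the -march=k8 case described above), a request with write=1 keeps its write bit, which is consistent with the prefetchw in the reported assembly. With prfchw clear (the -march=native case), the write bit is dropped and the SSE form is used, consistent with the reported prefetcht0.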

So there are two issues here.

(1) The ISA flags enabled with -march=k8 are different from those enabled with -march=native on a k8 machine.
(2) We need to check why the GCC middle end requested a write prefetch for the test case with -march=k8.

Regards,
Venkat.


* Re: option -mprfchw on 2 different Opteron cpus
  2016-05-02  9:55 ` Kumar, Venkataramanan
@ 2016-05-02 17:01   ` NightStrike
  2016-05-03  4:40     ` Kumar, Venkataramanan
  0 siblings, 1 reply; 5+ messages in thread
From: NightStrike @ 2016-05-02 17:01 UTC (permalink / raw)
  To: Kumar, Venkataramanan
  Cc: Uros Bizjak (ubizjak@gmail.com),
	lopezibanez, Jan Hubicka, Jakub Jelinek, gcc

On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
>> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
>> listed in the options in -fverbose-asm.  In the assembly, I see this:
>>
>> prefetcht0      (%rax)  # ivtmp.1160
>> prefetcht0      304(%rcx)       #
>> prefetcht0      (%rax)  # ivtmp.1160
>
> In AMD processors -mprfchw flag  is used to enable "3dnowprefetch" ISA support.
>
> (Snip)
> CPUID Fn8000_0001_ECX Feature Identifiers
> Bit 8
> 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and
> “PREFETCHW” in APM3
> Ref: http://support.amd.com/TechDocs/25481.pdf
> (Snip)
>
> Can you please confirm what this CPUID flag returns on your k8 machine ?.
> I believe this ISA is not available on k8 machine so when -march=native is added you don’t see  -mprfchw in verbose.

Looks like zero?  This was generated with the cpuid program from
http://www.etallen.com/cpuid.html

CPU 0:
   0x00000000 0x00: eax=0x00000001 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x00000001 0x00: eax=0x00000f58 ebx=0x00000800 ecx=0x00000000 edx=0x078bfbff
   0x80000000 0x00: eax=0x80000018 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x80000001 0x00: eax=0x00000f58 ebx=0x00000405 ecx=0x00000000 edx=0xe1d3fbff
   0x80000002 0x00: eax=0x20444d41 ebx=0x6574704f ecx=0x286e6f72 edx=0x20296d74
   0x80000003 0x00: eax=0x636f7250 ebx=0x6f737365 ecx=0x34322072 edx=0x00000038
   0x80000004 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000005 0x00: eax=0xff08ff08 ebx=0xff20ff20 ecx=0x40020140 edx=0x40020140
   0x80000006 0x00: eax=0x00000000 ebx=0x42004200 ecx=0x04008140 edx=0x00000000
   0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000009
   0x80000008 0x00: eax=0x00003028 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000009 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000b 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000c 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000d 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000e 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000f 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000010 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000011 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000012 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000013 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000014 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000015 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000016 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000017 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000018 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80860000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0xc0000000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000

CPU:
   vendor_id = "AuthenticAMD"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model           = 0x5 (5)
      stepping id     = 0x8 (8)
      extended family = 0x0 (0)
      extended model  = 0x0 (0)
      (simple synth)  = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x0 (0)
      cpu count                      = 0x0 (0)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0x0 (0)
   brand id = 0x00 (0): unknown
   feature information (1/edx):
      x87 FPU on chip                        = true
      virtual-8086 mode enhancement          = true
      debugging extensions                   = true
      page size extensions                   = true
      time stamp counter                     = true
      RDMSR and WRMSR support                = true
      physical address extensions            = true
      machine check exception                = true
      CMPXCHG8B inst.                        = true
      APIC on chip                           = true
      SYSENTER and SYSEXIT                   = true
      memory type range registers            = true
      PTE global bit                         = true
      machine check architecture             = true
      conditional move/compare instruction   = true
      page attribute table                   = true
      page size extension                    = true
      processor serial number                = false
      CLFLUSH instruction                    = true
      debug store                            = false
      thermal monitor and clock ctrl         = false
      MMX Technology                         = true
      FXSAVE/FXRSTOR                         = true
      SSE extensions                         = true
      SSE2 extensions                        = true
      self snoop                             = false
      hyper-threading / multi-core supported = false
      therm. monitor                         = false
      IA64                                   = false
      pending break event                    = false
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions     = false
      PCLMULDQ instruction                    = false
      64-bit debug store                      = false
      MONITOR/MWAIT                           = false
      CPL-qualified debug store               = false
      VMX: virtual machine extensions         = false
      SMX: safer mode extensions              = false
      Enhanced Intel SpeedStep Technology     = false
      thermal monitor 2                       = false
      SSSE3 extensions                        = false
      context ID: adaptive or shared L1 data  = false
      FMA instruction                         = false
      CMPXCHG16B instruction                  = false
      xTPR disable                            = false
      perfmon and debug                       = false
      process context identifiers             = false
      direct cache access                     = false
      SSE4.1 extensions                       = false
      SSE4.2 extensions                       = false
      extended xAPIC support                  = false
      MOVBE instruction                       = false
      POPCNT instruction                      = false
      time stamp counter deadline             = false
      AES instruction                         = false
      XSAVE/XSTOR states                      = false
      OS-enabled XSAVE/XSTOR                  = false
      AVX: advanced vector extensions         = false
      F16C half-precision convert instruction = false
      RDRAND instruction                      = false
      hypervisor guest status                 = false
   extended processor signature (0x80000001/eax):
      family/generation = AMD Athlon 64/Opteron/Sempron/Turion (15)
      model             = 0x5 (5)
      stepping id       = 0x8 (8)
      extended family   = 0x0 (0)
      extended model    = 0x0 (0)
      (simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   extended feature flags (0x80000001/edx):
      x87 FPU on chip                       = true
      virtual-8086 mode enhancement         = true
      debugging extensions                  = true
      page size extensions                  = true
      time stamp counter                    = true
      RDMSR and WRMSR support               = true
      physical address extensions           = true
      machine check exception               = true
      CMPXCHG8B inst.                       = true
      APIC on chip                          = true
      SYSCALL and SYSRET instructions       = true
      memory type range registers           = true
      global paging extension               = true
      machine check architecture            = true
      conditional move/compare instruction  = true
      page attribute table                  = true
      page size extension                   = true
      multiprocessing capable               = false
      no-execute page protection            = true
      AMD multimedia instruction extensions = true
      MMX Technology                        = true
      FXSAVE/FXRSTOR                        = true
      SSE extensions                        = false
      1-GB large page support               = false
      RDTSCP                                = false
      long mode (AA-64)                     = true
      3DNow! instruction extensions         = true
      3DNow! instructions                   = true
   extended brand id (0x80000001/ebx):
      raw             = 0x405 (1029)
      BrandId         = 0x405 (1029)
      BrandTableIndex = 0x10 (16)
      NN              = 0x5 (5)
   AMD feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode     = false
      CMP Legacy                             = false
      SVM: secure virtual machine            = false
      extended APIC space                    = false
      AltMovCr8                              = false
      LZCNT advanced bit manipulation        = false
      SSE4A support                          = false
      misaligned SSE mode                    = false
      3DNow! PREFETCH/PREFETCHW instructions = false
      OS visible workaround                  = false
      instruction based sampling             = false
      XOP support                            = false
      SKINIT/STGI support                    = false
      watchdog timer support                 = false
      lightweight profiling support          = false
      4-operand FMA instruction              = false
      NodeId MSR C001100C                    = false
      TBM support                            = false
      topology extensions                    = false
   brand = "AMD Opteron(tm) Processor 248"
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
      instruction # entries     = 0x8 (8)
      instruction associativity = 0xff (255)
      data # entries            = 0x8 (8)
      data associativity        = 0xff (255)
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
      instruction # entries     = 0x20 (32)
      instruction associativity = 0xff (255)
      data # entries            = 0x20 (32)
      data associativity        = 0xff (255)
   L1 data cache information (0x80000005/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x1 (1)
      associativity     = 0x2 (2)
      size (Kb)         = 0x40 (64)
   L1 instruction cache information (0x80000005/edx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x1 (1)
      associativity     = 0x2 (2)
      size (Kb)         = 0x40 (64)
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
      instruction # entries     = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries            = 0x0 (0)
      data associativity        = L2 off (0)
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
      instruction # entries     = 0x200 (512)
      instruction associativity = 4-way (4)
      data # entries            = 0x200 (512)
      data associativity        = 4-way (4)
   L2 unified cache information (0x80000006/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x1 (1)
      associativity     = 16-way (8)
      size (Kb)         = 0x400 (1024)
   L3 cache information (0x80000006/edx):
      line size (bytes)     = 0x0 (0)
      lines per tag         = 0x0 (0)
      associativity         = L2 off (0)
      size (in 512Kb units) = 0x0 (0)
   Advanced Power Management Features (0x80000007/edx):
      temperature sensing diode      = true
      frequency ID (FID) control     = false
      voltage ID (VID) control       = false
      thermal trip (TTP)             = true
      thermal monitor (TM)           = false
      software thermal control (STC) = false
      100 MHz multiplier control     = false
      hardware P-State control       = false
      TscInvariant                   = false
   Physical Address and Linear Address Size (0x80000008/eax):
      maximum physical address bits         = 0x28 (40)
      maximum linear (virtual) address bits = 0x30 (48)
      maximum guest physical address bits   = 0x0 (0)
   Logical CPU cores (0x80000008/ecx):
      number of CPU cores - 1 = 0x0 (0)
      ApicIdCoreIdSize        = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/eax):
      SvmRev: SVM revision = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/edx):
      nested paging                 = false
      LBR virtualization            = false
      SVM lock                      = false
      NRIP save                     = false
      MSR based TSC rate control    = false
      VMCB clean bits support       = false
      flush by ASID                 = false
      decode assists                = false
      SSSE3/SSE5 opcode set disable = false
      pause intercept filter        = false
      pause filter threshold        = false
   NASID: number of address space identifiers = 0x0 (0):
   (instruction supported synth):
      CMPXCHG8B                = true
      conditional move/compare = true
      PREFETCH/PREFETCHW       = true
   (multi-processing synth): none
   (multi-processing method): AMD
   (synth) = AMD Opteron (DP SledgeHammer SH7-C0), 940-pin, .13um Processor 248
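
To make the raw 0x80000001 line of the dump easier to read, here is a minimal C sketch that decodes the three bits named in the AMD excerpts quoted in this thread. The macro and function names are hypothetical. ECX bit 8 comes from the quoted text; the EDX positions for LM (bit 29) and 3DNow (bit 31) are taken from the AMD CPUID layout and agree with the decoded dump above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in CPUID Fn8000_0001 (hypothetical macro names). */
#define ECX_3DNOWPREFETCH (UINT32_C(1) << 8)   /* ECX bit 8 */
#define EDX_LM            (UINT32_C(1) << 29)  /* EDX bit 29: long mode */
#define EDX_3DNOW         (UINT32_C(1) << 31)  /* EDX bit 31: 3DNow! */

/* True if the dedicated 3DNowPrefetch feature flag is set. */
static bool has_3dnowprefetch_flag(uint32_t ecx)
{
    return (ecx & ECX_3DNOWPREFETCH) != 0;
}

/* True if PREFETCH/PREFETCHW are indicated as supported by the rule
   "3DNowPrefetch OR LM OR 3DNow" quoted from the AMD manual. */
static bool prefetchw_indicated(uint32_t ecx, uint32_t edx)
{
    return (ecx & ECX_3DNOWPREFETCH) != 0
        || (edx & (EDX_LM | EDX_3DNOW)) != 0;
}
```

For the values in the dump (ecx=0x00000000, edx=0xe1d3fbff), has_3dnowprefetch_flag returns false while prefetchw_indicated returns true. That matches the two lines of the decoded output: "3DNow! PREFETCH/PREFETCHW instructions = false" and the synthesized "PREFETCH/PREFETCHW = true".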



* RE: option -mprfchw on 2 different Opteron cpus
  2016-05-02 17:01   ` NightStrike
@ 2016-05-03  4:40     ` Kumar, Venkataramanan
  2016-08-16 16:43       ` NightStrike
  0 siblings, 1 reply; 5+ messages in thread
From: Kumar, Venkataramanan @ 2016-05-03  4:40 UTC (permalink / raw)
  To: NightStrike
  Cc: Uros Bizjak (ubizjak@gmail.com),
	lopezibanez, Jan Hubicka, Jakub Jelinek, gcc

Hi 

> -----Original Message-----
> From: NightStrike [mailto:nightstrike@gmail.com]
> Sent: Monday, May 2, 2016 10:31 PM
> To: Kumar, Venkataramanan <Venkataramanan.Kumar@amd.com>
> Cc: Uros Bizjak (ubizjak@gmail.com) <ubizjak@gmail.com>;
> lopezibanez@gmail.com; Jan Hubicka <hubicka@ucw.cz>; Jakub Jelinek
> <jakub@redhat.com>; gcc@gcc.gnu.org
> Subject: Re: option -mprfchw on 2 different Opteron cpus
> 
> On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
> <Venkataramanan.Kumar@amd.com> wrote:
> >> If I compile on a k8 Opteron 248 with -march=native, I do not see
> >> -mprfchw listed in the options in -fverbose-asm.  In the assembly, I see
> this:
> >>
> >> prefetcht0      (%rax)  # ivtmp.1160
> >> prefetcht0      304(%rcx)       #
> >> prefetcht0      (%rax)  # ivtmp.1160
> >
> > In AMD processors -mprfchw flag  is used to enable "3dnowprefetch" ISA
> support.
> >
> > (Snip)
> > CPUID Fn8000_0001_ECX Feature Identifiers Bit 8
> > 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See
> > “PREFETCH” and “PREFETCHW” in APM3
> > Ref: http://support.amd.com/TechDocs/25481.pdf
> > (Snip)
> >
> > Can you please confirm what this CPUID flag returns on your k8 machine ?.
> > I believe this ISA is not available on k8 machine so when -march=native is
> added you don’t see  -mprfchw in verbose.
> 
> Looks like zero?  This was generated with the cpuid program from
> http://www.etallen.com/cpuid.html
> 
>       3DNow! instruction extensions         = true
>       3DNow! instructions                   = true

It has 3DNow support. "prefetchw" is available with 3dnow.
 
>       misaligned SSE mode                    = false
>       3DNow! PREFETCH/PREFETCHW instructions = false

It does not have 3DNowPrefetch, so enabling the ISA flag -mprfchw is not correct for -march=k8.

>       OS visible workaround                  = false
>       instruction based sampling             = false
> >> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying
> >> to target the older system), I do see it listed in the options in
> >> -fverbose-asm.  In the assembly, I see this:
> >
> > K8 has 3dnow support and there is a patch that replaced 3dnow with
> prefetchw (3DNowPrefetch).
> > https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
> > So when you add -march=k8 you see -mprfchw  getting listed in verbose.
> >
> >>
> >> prefetcht0      (%rax)  # ivtmp.1160
> >> prefetcht0      304(%rcx)       #
> >> prefetchw       (%rax)  # ivtmp.1160
> >>
> >> (The third line is the only difference)
> >>
> >
> > This is my guess without seeing the test case, when write  prefetching is
> requested "prefetchw" is generated.
> > 3dnow (TARGET_3DNOW) ISA has support for it.
> >
> > (Snip)
> > Support for the PREFETCH and PREFETCHW instructions is indicated by
> > CPUID Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
> > Fn8000_0001_EDX[3DNow] = 1.
> > (Snip)
> > Ref:
> http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
> >
> >> In both cases, I'm using gcc 4.9.3.  Which is correct for a k8 Opteron 248?
> >>
> >> Also, FWIW:
> >>
> >> 1) The march=native version that uses prefetcht0 is very repeatably
> >> faster by about 15% in the particular test case I'm looking at.
> >>
> >> 2) The compilers in both instances are not just the same version,
> >> they are the same compiler binary installed on an NFS mount and
> >> shared to both computers.
> >
> > As per GCC4.9.3 source.
> >
> > (Snip)
> > (define_expand "prefetch"
> >   [(prefetch (match_operand 0 "address_operand")
> >              (match_operand:SI 1 "const_int_operand")
> >              (match_operand:SI 2 "const_int_operand"))]
> >   "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
> > {
> >   bool write = INTVAL (operands[1]) != 0;
> >   int locality = INTVAL (operands[2]);
> >
> >   gcc_assert (IN_RANGE (locality, 0, 3));
> >
> >   /* Use 3dNOW prefetch in case we are asking for write prefetch not
> >      supported by SSE counterpart or the SSE prefetch is not available
> >      (K6 machines).  Otherwise use SSE prefetch as it allows specifying
> >      of locality.  */
> >   if (TARGET_PREFETCHWT1 && write && locality <= 2)
> >     operands[2] = const2_rtx;
> >   else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> >     operands[2] = GEN_INT (3);
> >   else
> >     operands[1] = const0_rtx;
> > })
> > (Snip)
> >
> > Write prefetch may be requested (either by auto prefetcher or builtins) but
> on -march=native, the below check could have become false.
> >    else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> > TARGET_PRFCHW is off on native.
> >
> > So there are two issues here.
> >
> > (1) ISA flags enabled with -march=k8 is different from -march=native on k8
> machine.

I think we need to file a bug for this. We need to check with Uros why the flag -mprfchw is shared with 3dnow.
To work around this issue, you can use -mno-prfchw when building with -march=k8.
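
Since no test case was posted in this thread, the following is only a hypothetical reproducer for trying the workaround, not the code under discussion. It issues an explicit write prefetch through GCC's __builtin_prefetch (rw=1, locality=3). The function name and the prefetch distance are arbitrary assumptions.

```c
#include <stddef.h>

/* Arbitrary prefetch distance in elements (an assumption, not tuned). */
#define PREFETCH_AHEAD 16

/* Scales an array in place, requesting a write prefetch a fixed
   distance ahead of the element currently being updated. */
static void scale_in_place(float *a, size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 1, 3); /* rw=1: write */
        a[i] *= k;
    }
}

/* To compare the generated prefetch instructions, one could build
   the file twice and inspect the assembly:
     gcc -O2 -march=k8 -S -fverbose-asm file.c
     gcc -O2 -march=k8 -mno-prfchw -S -fverbose-asm file.c  */
```

Going by the expander quoted earlier in the thread, the first build would be expected to use prefetchw for this request and the second an SSE prefetch. This has not been verified here.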

> > (2) Need to check why GCC middle end requested write prefetch for the
> test case with -march=k8 .
As for the "prefetchw" generation, it may be that the GCC auto prefetcher requests write prefetches.
AFAIK a write prefetch brings data from memory, marks the cache line modified, and expects a write to happen next.
If a read happens to that cache line instead, the data will be written back to memory before the read, which is unnecessary.
It is hard to answer without a test case, and I don't have a k8 machine at hand.

> >
> > Regards,
> > Venkat.


* Re: option -mprfchw on 2 different Opteron cpus
  2016-05-03  4:40     ` Kumar, Venkataramanan
@ 2016-08-16 16:43       ` NightStrike
  0 siblings, 0 replies; 5+ messages in thread
From: NightStrike @ 2016-08-16 16:43 UTC (permalink / raw)
  To: Kumar, Venkataramanan
  Cc: Uros Bizjak (ubizjak@gmail.com),
	lopezibanez, Jan Hubicka, Jakub Jelinek, gcc

On Tue, May 3, 2016 at 12:40 AM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
> Hi
>
>> -----Original Message-----
>> From: NightStrike [mailto:nightstrike@gmail.com]
>> Sent: Monday, May 2, 2016 10:31 PM
>> To: Kumar, Venkataramanan <Venkataramanan.Kumar@amd.com>
>> Cc: Uros Bizjak (ubizjak@gmail.com) <ubizjak@gmail.com>;
>> lopezibanez@gmail.com; Jan Hubicka <hubicka@ucw.cz>; Jakub Jelinek
>> <jakub@redhat.com>; gcc@gcc.gnu.org
>> Subject: Re: option -mprfchw on 2 different Opteron cpus
>>
>> On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
>> <Venkataramanan.Kumar@amd.com> wrote:
>> >> If I compile on a k8 Opteron 248 with -march=native, I do not see
>> >> -mprfchw listed in the options in -fverbose-asm.  In the assembly, I see
>> this:
>> >>
>> >> prefetcht0      (%rax)  # ivtmp.1160
>> >> prefetcht0      304(%rcx)       #
>> >> prefetcht0      (%rax)  # ivtmp.1160
>> >
>> > In AMD processors -mprfchw flag  is used to enable "3dnowprefetch" ISA
>> support.
>> >
>> > (Snip)
>> > CPUID Fn8000_0001_ECX Feature Identifiers Bit 8
>> > 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See
>> > “PREFETCH” and “PREFETCHW” in APM3
>> > Ref: http://support.amd.com/TechDocs/25481.pdf
>> > (Snip)
>> >
>> > Can you please confirm what this CPUID flag returns on your k8 machine ?.
>> > I believe this ISA is not available on k8 machine so when -march=native is
>> added you don’t see  -mprfchw in verbose.
>>
>> Looks like zero?  This was generated with the cpuid program from
>> http://www.etallen.com/cpuid.html
>>
>>       3DNow! instruction extensions         = true
>>       3DNow! instructions                   = true
>
> It has 3Dnow support.  "prefetchw" is available with 3dnow.
>
>>       misaligned SSE mode                    = false
>>       3DNow! PREFETCH/PREFETCHW instructions = false
>
> It does not have 3DNowprefetch enabling ISA flag -mprftchw is not correct for -march=k8.
>
>>       OS visible workaround                  = false
>>       instruction based sampling             = false
>> >> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying
>> >> to target the older system), I do see it listed in the options in
>> >> -fverbose-asm.  In the assembly, I see this:
>> >
>> > K8 has 3dnow support and there is a patch that replaced 3dnow with
>> prefetchw (3DNowPrefetch).
>> > https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
>> > So when you add -march=k8 you see -mprfchw  getting listed in verbose.
>> >
>> >>
>> >> prefetcht0      (%rax)  # ivtmp.1160
>> >> prefetcht0      304(%rcx)       #
>> >> prefetchw       (%rax)  # ivtmp.1160
>> >>
>> >> (The third line is the only difference)
>> >>
>> >
>> > This is my guess without seeing the test case: when write prefetching is
>> > requested, "prefetchw" is generated.
>> > The 3dnow (TARGET_3DNOW) ISA has support for it.
>> >
>> > (Snip)
>> > Support for the PREFETCH and PREFETCHW instructions is indicated by
>> > CPUID Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
>> > Fn8000_0001_EDX[3DNow] = 1.
>> > (Snip)
>> > Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
>> >
>> >> In both cases, I'm using gcc 4.9.3.  Which is correct for a k8 Opteron 248?
>> >>
>> >> Also, FWIW:
>> >>
>> >> 1) The march=native version that uses prefetcht0 is very repeatably
>> >> faster by about 15% in the particular test case I'm looking at.
>> >>
>> >> 2) The compilers in both instances are not just the same version,
>> >> they are the same compiler binary installed on an NFS mount and
>> >> shared to both computers.
>> >
>> > As per the GCC 4.9.3 source:
>> >
>> > (Snip)
>> > (define_expand "prefetch"
>> >   [(prefetch (match_operand 0 "address_operand")
>> >              (match_operand:SI 1 "const_int_operand")
>> >              (match_operand:SI 2 "const_int_operand"))]
>> >   "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
>> > {
>> >   bool write = INTVAL (operands[1]) != 0;
>> >   int locality = INTVAL (operands[2]);
>> >
>> >   gcc_assert (IN_RANGE (locality, 0, 3));
>> >
>> >   /* Use 3dNOW prefetch in case we are asking for write prefetch not
>> >      supported by SSE counterpart or the SSE prefetch is not available
>> >      (K6 machines).  Otherwise use SSE prefetch as it allows specifying
>> >      of locality.  */
>> >   if (TARGET_PREFETCHWT1 && write && locality <= 2)
>> >     operands[2] = const2_rtx;
>> >   else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
>> >     operands[2] = GEN_INT (3);
>> >   else
>> >     operands[1] = const0_rtx;
>> > })
>> > (Snip)
>> >
>> > A write prefetch may be requested (either by the auto prefetcher or by
>> > builtins), but with -march=native the check below could have become false:
>> >    else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
>> > TARGET_PRFCHW is off with -march=native.
>> >
>> > So there are two issues here.
>> >
>> > (1) The ISA flags enabled with -march=k8 are different from those enabled
>> > with -march=native on a k8 machine.
>
> I think we need to file a bug for this.  We need to check with Uros why the -mprfchw flag is shared with 3dnow.
> To work around this issue, you can use -mno-prfchw when building with -march=k8.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77270

>> > (2) Need to check why the GCC middle end requested a write prefetch for
>> > the test case with -march=k8.
> On "prefetchw" generation, it may be the case that the GCC auto prefetcher requests write prefetches.
> AFAIK a write prefetch brings the data from memory, marks the cache line modified, and expects a write to happen next.
> If a read happens to that cache line instead, the data will be written back to memory before the read, which is unnecessary.
> It is hard to answer without a test case, and I don’t have a k8 machine at hand.

Should this be another bug filed if I can get a reduced test case, or
is PR77270 enough, or is this not a bug?

Thread overview: 5+ messages
2016-05-01 20:25 option -mprfchw on 2 different Opteron cpus NightStrike
2016-05-02  9:55 ` Kumar, Venkataramanan
2016-05-02 17:01   ` NightStrike
2016-05-03  4:40     ` Kumar, Venkataramanan
2016-08-16 16:43       ` NightStrike
