* option -mprfchw on 2 different Opteron cpus
From: NightStrike @ 2016-05-01 20:25 UTC
To: gcc; +Cc: Jan Hubicka, Jakub Jelinek

Reposting from here:
https://gcc.gnu.org/ml/gcc-help/2016-05/msg00003.html

Not sure if this applies:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54210

If I compile on a k8 Opteron 248 with -march=native, I do not see
-mprfchw listed in the options in -fverbose-asm. In the assembly, I
see this:

    prefetcht0 (%rax) # ivtmp.1160
    prefetcht0 304(%rcx) #
    prefetcht0 (%rax) # ivtmp.1160

If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to
target the older system), I do see it listed in the options in
-fverbose-asm. In the assembly, I see this:

    prefetcht0 (%rax) # ivtmp.1160
    prefetcht0 304(%rcx) #
    prefetchw (%rax) # ivtmp.1160

(The third line is the only difference)

In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248?

Also, FWIW:

1) The -march=native version that uses prefetcht0 is very repeatably
faster by about 15% in the particular test case I'm looking at.

2) The compilers in both instances are not just the same version, they
are the same compiler binary installed on an NFS mount and shared to
both computers.
* RE: option -mprfchw on 2 different Opteron cpus
From: Kumar, Venkataramanan @ 2016-05-02 9:55 UTC
To: NightStrike, Uros Bizjak (ubizjak@gmail.com), lopezibanez
Cc: Jan Hubicka, Jakub Jelinek, gcc

Hi,

> -----Original Message-----
> From: gcc-owner@gcc.gnu.org [mailto:gcc-owner@gcc.gnu.org] On Behalf Of
> NightStrike
> Sent: Monday, May 2, 2016 1:55 AM
> To: gcc@gcc.gnu.org
> Cc: Jan Hubicka <hubicka@ucw.cz>; Jakub Jelinek <jakub@redhat.com>
> Subject: option -mprfchw on 2 different Opteron cpus
>
> Reposting from here:
> https://gcc.gnu.org/ml/gcc-help/2016-05/msg00003.html
>
> Not sure if this applies:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54210
>
> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
> listed in the options in -fverbose-asm. In the assembly, I see this:
>
>     prefetcht0 (%rax) # ivtmp.1160
>     prefetcht0 304(%rcx) #
>     prefetcht0 (%rax) # ivtmp.1160

On AMD processors, the -mprfchw flag is used to enable "3dnowprefetch" ISA support.

(Snip)
CPUID Fn8000_0001_ECX Feature Identifiers
Bit 8
3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and “PREFETCHW” in APM3
Ref: http://support.amd.com/TechDocs/25481.pdf
(Snip)

Can you please confirm what this CPUID flag returns on your k8 machine?
I believe this ISA is not available on k8 machines, so when -march=native is added you don't see -mprfchw in the verbose output.

> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to
> target the older system), I do see it listed in the options in -fverbose-asm. In
> the assembly, I see this:

K8 has 3dnow support, and there is a patch that replaced 3dnow with prefetchw (3DNowPrefetch).
https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
So when you add -march=k8, you see -mprfchw listed in the verbose output.

>     prefetcht0 (%rax) # ivtmp.1160
>     prefetcht0 304(%rcx) #
>     prefetchw (%rax) # ivtmp.1160
>
> (The third line is the only difference)

This is my guess without seeing the test case: when write prefetching is requested, "prefetchw" is generated.
The 3dnow (TARGET_3DNOW) ISA has support for it.

(Snip)
Support for the PREFETCH and PREFETCHW instructions is indicated by CPUID
Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
Fn8000_0001_EDX[3DNow] = 1.
(Snip)
Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf

> In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248?
>
> Also, FWIW:
>
> 1) The march=native version that uses prefetcht0 is very repeatably faster by
> about 15% in the particular test case I'm looking at.
>
> 2) The compilers in both instances are not just the same version, they are the
> same compiler binary installed on an NFS mount and shared to both
> computers.

As per the GCC 4.9.3 source:

(Snip)
    (define_expand "prefetch"
      [(prefetch (match_operand 0 "address_operand")
                 (match_operand:SI 1 "const_int_operand")
                 (match_operand:SI 2 "const_int_operand"))]
      "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
    {
      bool write = INTVAL (operands[1]) != 0;
      int locality = INTVAL (operands[2]);

      gcc_assert (IN_RANGE (locality, 0, 3));

      /* Use 3dNOW prefetch in case we are asking for write prefetch not
         supported by SSE counterpart or the SSE prefetch is not available
         (K6 machines).  Otherwise use SSE prefetch as it allows specifying
         of locality.  */
      if (TARGET_PREFETCHWT1 && write && locality <= 2)
        operands[2] = const2_rtx;
      else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
        operands[2] = GEN_INT (3);
      else
        operands[1] = const0_rtx;
    })
(Snip)

A write prefetch may be requested (either by the auto prefetcher or by builtins), but with -march=native the check below could have become false:

    else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))

TARGET_PRFCHW is off on native.

So there are two issues here.

(1) The ISA flags enabled with -march=k8 differ from those enabled with -march=native on a k8 machine.
(2) We need to check why the GCC middle end requested a write prefetch for the test case with -march=k8.

Regards,
Venkat.
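The selection logic in the quoted expander can be modelled outside the compiler. The following is only a sketch in plain C, not GCC code: `struct isa_flags` and `select_prefetch` are hypothetical names, the three booleans stand in for the `TARGET_*` macros, and the locality-to-mnemonic table for the SSE forms is an assumption about the usual mapping rather than something shown in this thread.

```c
#include <stdbool.h>

/* Illustrative stand-ins for GCC's TARGET_* macros; these names are
   hypothetical and are not part of GCC.  */
struct isa_flags {
    bool prefetch_sse;  /* models TARGET_PREFETCH_SSE */
    bool prfchw;        /* models TARGET_PRFCHW */
    bool prefetchwt1;   /* models TARGET_PREFETCHWT1 */
};

/* Return the mnemonic that the quoted if/else chain would lead to for a
   prefetch request with the given write flag and locality (0..3).  */
const char *
select_prefetch (struct isa_flags isa, bool write, int locality)
{
    /* Assumed SSE mapping from locality 0..3 to a mnemonic.  */
    static const char *const sse_by_locality[4] = {
        "prefetchnta", "prefetcht2", "prefetcht1", "prefetcht0"
    };

    /* The expander is only enabled when at least one ISA is available.  */
    if (!isa.prefetch_sse && !isa.prfchw && !isa.prefetchwt1)
        return "none";

    /* The expander asserts on this range; the model just reports it.  */
    if (locality < 0 || locality > 3)
        return "invalid";

    if (isa.prefetchwt1 && write && locality <= 2)
        return "prefetchwt1";
    if (isa.prfchw && (write || !isa.prefetch_sse))
        return write ? "prefetchw" : "prefetch";  /* 3DNow! forms */

    /* Fallback: the write request is dropped and an SSE prefetch is used.  */
    return sse_by_locality[locality];
}
```

Under these assumptions, a write request with locality 3 yields "prefetcht0" when only `prefetch_sse` is set (the -march=native case described above) and "prefetchw" when `prfchw` is also set (the -march=k8 case), which matches the one-line difference in the two assembly listings.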
* Re: option -mprfchw on 2 different Opteron cpus
From: NightStrike @ 2016-05-02 17:01 UTC
To: Kumar, Venkataramanan
Cc: Uros Bizjak (ubizjak@gmail.com), lopezibanez, Jan Hubicka, Jakub Jelinek, gcc

On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
>> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
>> listed in the options in -fverbose-asm. In the assembly, I see this:
>>
>>     prefetcht0 (%rax) # ivtmp.1160
>>     prefetcht0 304(%rcx) #
>>     prefetcht0 (%rax) # ivtmp.1160
>
> In AMD processors -mprfchw flag is used to enable "3dnowprefetch" ISA support.
>
> (Snip)
> CPUID Fn8000_0001_ECX Feature Identifiers
> Bit 8
> 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and
> “PREFETCHW” in APM3
> Ref: http://support.amd.com/TechDocs/25481.pdf
> (Snip)
>
> Can you please confirm what this CPUID flag returns on your k8 machine ?.
> I believe this ISA is not available on k8 machine so when -march=native is added you don’t see -mprfchw in verbose.

Looks like zero? This was generated with the cpuid program from
http://www.etallen.com/cpuid.html

CPU 0:
   0x00000000 0x00: eax=0x00000001 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x00000001 0x00: eax=0x00000f58 ebx=0x00000800 ecx=0x00000000 edx=0x078bfbff
   0x80000000 0x00: eax=0x80000018 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x80000001 0x00: eax=0x00000f58 ebx=0x00000405 ecx=0x00000000 edx=0xe1d3fbff
   0x80000002 0x00: eax=0x20444d41 ebx=0x6574704f ecx=0x286e6f72 edx=0x20296d74
   0x80000003 0x00: eax=0x636f7250 ebx=0x6f737365 ecx=0x34322072 edx=0x00000038
   0x80000004 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000005 0x00: eax=0xff08ff08 ebx=0xff20ff20 ecx=0x40020140 edx=0x40020140
   0x80000006 0x00: eax=0x00000000 ebx=0x42004200 ecx=0x04008140 edx=0x00000000
   0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000009
   0x80000008 0x00: eax=0x00003028 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000009 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000b 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000c 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000d 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000e 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000f 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000010 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000011 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000012 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000013 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000014 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000015 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000016 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000017 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000018 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80860000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0xc0000000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
CPU:
   vendor_id = "AuthenticAMD"
   version information (1/eax):
      processor type = primary processor (0)
      family = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model = 0x5 (5)
      stepping id = 0x8 (8)
      extended family = 0x0 (0)
      extended model = 0x0 (0)
      (simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x0 (0)
      cpu count = 0x0 (0)
      CLFLUSH line size = 0x8 (8)
      brand index = 0x0 (0)
   brand id = 0x00 (0): unknown
   feature information (1/edx):
      x87 FPU on chip = true
      virtual-8086 mode enhancement = true
      debugging extensions = true
      page size extensions = true
      time stamp counter = true
      RDMSR and WRMSR support = true
      physical address extensions = true
      machine check exception = true
      CMPXCHG8B inst. = true
      APIC on chip = true
      SYSENTER and SYSEXIT = true
      memory type range registers = true
      PTE global bit = true
      machine check architecture = true
      conditional move/compare instruction = true
      page attribute table = true
      page size extension = true
      processor serial number = false
      CLFLUSH instruction = true
      debug store = false
      thermal monitor and clock ctrl = false
      MMX Technology = true
      FXSAVE/FXRSTOR = true
      SSE extensions = true
      SSE2 extensions = true
      self snoop = false
      hyper-threading / multi-core supported = false
      therm. monitor = false
      IA64 = false
      pending break event = false
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions = false
      PCLMULDQ instruction = false
      64-bit debug store = false
      MONITOR/MWAIT = false
      CPL-qualified debug store = false
      VMX: virtual machine extensions = false
      SMX: safer mode extensions = false
      Enhanced Intel SpeedStep Technology = false
      thermal monitor 2 = false
      SSSE3 extensions = false
      context ID: adaptive or shared L1 data = false
      FMA instruction = false
      CMPXCHG16B instruction = false
      xTPR disable = false
      perfmon and debug = false
      process context identifiers = false
      direct cache access = false
      SSE4.1 extensions = false
      SSE4.2 extensions = false
      extended xAPIC support = false
      MOVBE instruction = false
      POPCNT instruction = false
      time stamp counter deadline = false
      AES instruction = false
      XSAVE/XSTOR states = false
      OS-enabled XSAVE/XSTOR = false
      AVX: advanced vector extensions = false
      F16C half-precision convert instruction = false
      RDRAND instruction = false
      hypervisor guest status = false
   extended processor signature (0x80000001/eax):
      family/generation = AMD Athlon 64/Opteron/Sempron/Turion (15)
      model = 0x5 (5)
      stepping id = 0x8 (8)
      extended family = 0x0 (0)
      extended model = 0x0 (0)
      (simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   extended feature flags (0x80000001/edx):
      x87 FPU on chip = true
      virtual-8086 mode enhancement = true
      debugging extensions = true
      page size extensions = true
      time stamp counter = true
      RDMSR and WRMSR support = true
      physical address extensions = true
      machine check exception = true
      CMPXCHG8B inst. = true
      APIC on chip = true
      SYSCALL and SYSRET instructions = true
      memory type range registers = true
      global paging extension = true
      machine check architecture = true
      conditional move/compare instruction = true
      page attribute table = true
      page size extension = true
      multiprocessing capable = false
      no-execute page protection = true
      AMD multimedia instruction extensions = true
      MMX Technology = true
      FXSAVE/FXRSTOR = true
      SSE extensions = false
      1-GB large page support = false
      RDTSCP = false
      long mode (AA-64) = true
      3DNow! instruction extensions = true
      3DNow! instructions = true
   extended brand id (0x80000001/ebx):
      raw = 0x405 (1029)
      BrandId = 0x405 (1029)
      BrandTableIndex = 0x10 (16)
      NN = 0x5 (5)
   AMD feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode = false
      CMP Legacy = false
      SVM: secure virtual machine = false
      extended APIC space = false
      AltMovCr8 = false
      LZCNT advanced bit manipulation = false
      SSE4A support = false
      misaligned SSE mode = false
      3DNow! PREFETCH/PREFETCHW instructions = false
      OS visible workaround = false
      instruction based sampling = false
      XOP support = false
      SKINIT/STGI support = false
      watchdog timer support = false
      lightweight profiling support = false
      4-operand FMA instruction = false
      NodeId MSR C001100C = false
      TBM support = false
      topology extensions = false
   brand = "AMD Opteron(tm) Processor 248"
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
      instruction # entries = 0x8 (8)
      instruction associativity = 0xff (255)
      data # entries = 0x8 (8)
      data associativity = 0xff (255)
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
      instruction # entries = 0x20 (32)
      instruction associativity = 0xff (255)
      data # entries = 0x20 (32)
      data associativity = 0xff (255)
   L1 data cache information (0x80000005/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag = 0x1 (1)
      associativity = 0x2 (2)
      size (Kb) = 0x40 (64)
   L1 instruction cache information (0x80000005/edx):
      line size (bytes) = 0x40 (64)
      lines per tag = 0x1 (1)
      associativity = 0x2 (2)
      size (Kb) = 0x40 (64)
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
      instruction # entries = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries = 0x0 (0)
      data associativity = L2 off (0)
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
      instruction # entries = 0x200 (512)
      instruction associativity = 4-way (4)
      data # entries = 0x200 (512)
      data associativity = 4-way (4)
   L2 unified cache information (0x80000006/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag = 0x1 (1)
      associativity = 16-way (8)
      size (Kb) = 0x400 (1024)
   L3 cache information (0x80000006/edx):
      line size (bytes) = 0x0 (0)
      lines per tag = 0x0 (0)
      associativity = L2 off (0)
      size (in 512Kb units) = 0x0 (0)
   Advanced Power Management Features (0x80000007/edx):
      temperature sensing diode = true
      frequency ID (FID) control = false
      voltage ID (VID) control = false
      thermal trip (TTP) = true
      thermal monitor (TM) = false
      software thermal control (STC) = false
      100 MHz multiplier control = false
      hardware P-State control = false
      TscInvariant = false
   Physical Address and Linear Address Size (0x80000008/eax):
      maximum physical address bits = 0x28 (40)
      maximum linear (virtual) address bits = 0x30 (48)
      maximum guest physical address bits = 0x0 (0)
   Logical CPU cores (0x80000008/ecx):
      number of CPU cores - 1 = 0x0 (0)
      ApicIdCoreIdSize = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/eax):
      SvmRev: SVM revision = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/edx):
      nested paging = false
      LBR virtualization = false
      SVM lock = false
      NRIP save = false
      MSR based TSC rate control = false
      VMCB clean bits support = false
      flush by ASID = false
      decode assists = false
      SSSE3/SSE5 opcode set disable = false
      pause intercept filter = false
      pause filter threshold = false
   NASID: number of address space identifiers = 0x0 (0):
   (instruction supported synth):
      CMPXCHG8B = true
      conditional move/compare = true
      PREFETCH/PREFETCHW = true
   (multi-processing synth): none
   (multi-processing method): AMD
   (synth) = AMD Opteron (DP SledgeHammer SH7-C0), 940-pin, .13um Processor 248
* RE: option -mprfchw on 2 different Opteron cpus
From: Kumar, Venkataramanan @ 2016-05-03 4:40 UTC
To: NightStrike
Cc: Uros Bizjak (ubizjak@gmail.com), lopezibanez, Jan Hubicka, Jakub Jelinek, gcc

Hi,

> On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
> <Venkataramanan.Kumar@amd.com> wrote:
> > Can you please confirm what this CPUID flag returns on your k8 machine ?.
> > I believe this ISA is not available on k8 machine so when -march=native is
> > added you don’t see -mprfchw in verbose.
>
> Looks like zero? This was generated with the cpuid program from
> http://www.etallen.com/cpuid.html
>
> 3DNow! instruction extensions = true
> 3DNow! instructions = true

It has 3DNow! support. "prefetchw" is available with 3dnow.

> misaligned SSE mode = false
> 3DNow! PREFETCH/PREFETCHW instructions = false

It does not have 3DNowPrefetch, so enabling the ISA flag -mprfchw is not correct for -march=k8.

> > Write prefetch may be requested (either by auto prefetcher or builtins) but
> > on -march=native, the below check could have become false.
> > else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> > TARGET_PRFCHW is off on native.
> >
> > So there are two issues here.
> >
> > (1) ISA flags enabled with -march=k8 is different from -march=native on k8
> > machine.

I think we need to file a bug for this. We need to check with Uros why the flag -mprfchw is shared with 3dnow.
To work around this issue, you can use -mno-prfchw when building with -march=k8.

> > (2) Need to check why GCC middle end requested write prefetch for the
> > test case with -march=k8 .

On "prefetchw" generation: it may be that the GCC auto prefetcher requests write prefetches.
AFAIK, a write prefetch brings data from memory, marks the cache line modified, and expects a write to happen next.
If a read happens to that cache line instead, the data will be written back to memory before the read, which is unnecessary.

It is hard to answer without a test case, and I don't have a k8 machine at hand.

> > Regards,
> > Venkat.
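The distinction drawn above between the 3DNowPrefetch feature bit and the ability to execute PREFETCHW can be checked against the raw register values from the cpuid dump. This is a minimal sketch, not GCC's or cpuid's actual detection code: the macro and function names are made up for illustration, the bit positions for LM (EDX bit 29) and 3DNow! (EDX bit 31) are taken from the usual layout of CPUID Fn8000_0001 and are an assumption here, and only ECX bit 8 is stated explicitly in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID Fn8000_0001 feature bits.  ECX bit 8 is quoted in the thread;
   the EDX positions are assumed from the usual AMD layout.  */
#define FN8000_0001_ECX_3DNOWPREFETCH (UINT32_C(1) << 8)
#define FN8000_0001_EDX_LM            (UINT32_C(1) << 29)
#define FN8000_0001_EDX_3DNOW         (UINT32_C(1) << 31)

/* True when the dedicated 3DNowPrefetch feature bit is set.  */
bool
has_3dnowprefetch_bit (uint32_t ecx)
{
    return (ecx & FN8000_0001_ECX_3DNOWPREFETCH) != 0;
}

/* True when PREFETCH/PREFETCHW are supported according to the APM rule
   quoted earlier: 3DNowPrefetch OR LM OR 3DNow.  */
bool
prefetchw_supported (uint32_t ecx, uint32_t edx)
{
    return has_3dnowprefetch_bit (ecx)
           || (edx & FN8000_0001_EDX_LM) != 0
           || (edx & FN8000_0001_EDX_3DNOW) != 0;
}
```

With the values reported for the Opteron 248 (ecx=0x00000000, edx=0xe1d3fbff), the first function returns false and the second returns true, which is consistent with the cpuid tool printing "3DNow! PREFETCH/PREFETCHW instructions = false" alongside "PREFETCH/PREFETCHW = true" in its synthesized summary. On a live system the two registers could be read with `__get_cpuid(0x80000001, ...)` from GCC's `<cpuid.h>`.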
* Re: option -mprfchw on 2 different Opteron cpus
From: NightStrike @ 2016-08-16 16:43 UTC
To: Kumar, Venkataramanan
Cc: Uros Bizjak (ubizjak@gmail.com), lopezibanez, Jan Hubicka, Jakub Jelinek, gcc

On Tue, May 3, 2016 at 12:40 AM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
>> > (1) ISA flags enabled with -march=k8 is different from -march=native on k8
>> > machine.
>
> I think we need to file bug for this. Need to check with Uros why the flag -mprfchw is shared with 3dnow.
> To work around this issue you can use -mno-prfchw when building with -march=k8.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77270

>> > (2) Need to check why GCC middle end requested write prefetch for the
>> > test case with -march=k8 .
>
> On "prefetchw" generation it may be the case that GCC auto prefetcher requests write prefetches.
> AFAIK generating write prefetches brings data from memory and marks the cache line modified and expects a write to happen next.
> If read happens to that cache line instead then data will be written back to memory before read which will be unnecessary.
> Hard to answer without test case and I don’t have a ready k8 machine with me.

Should another bug be filed if I can get a reduced test case, or is PR77270 enough, or is this not a bug?
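For readers who want to see how a write prefetch request can reach the expander from source code, here is a small sketch using GCC's `__builtin_prefetch`. It is not the test case from this thread (which was never posted) and is not claimed to reproduce the reported slowdown; the function names and the prefetch distance are hypothetical. The second argument of the builtin selects read (0) or write (1), and the third is the locality hint (0 to 3).

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in elements; chosen only for illustration.  */
#define PREFETCH_AHEAD 16

/* Multiply every element by k, hinting that upcoming lines will be written.  */
void
scale_in_place (float *a, size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch (&a[i + PREFETCH_AHEAD], 1 /* write */, 3);
        a[i] *= k;
    }
}

/* Sum the elements, hinting that upcoming lines will only be read.  */
float
sum_read_only (const float *a, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch (&a[i + PREFETCH_AHEAD], 0 /* read */, 3);
        total += a[i];
    }
    return total;
}
```

Going by the expander quoted in this thread, the write hint in `scale_in_place` would be expected to become `prefetchw` when TARGET_PRFCHW is on (for example with -march=k8 on gcc 4.9.3) and an SSE prefetch when it is off (for example with -mno-prfchw, the workaround suggested above). Compiling with -S -fverbose-asm and comparing the output is the way to confirm this on a given compiler version.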