* option -mprfchw on 2 different Opteron cpus
From: NightStrike @ 2016-05-01 20:25 UTC (permalink / raw)
To: gcc; +Cc: Jan Hubicka, Jakub Jelinek
Reposting from here:
https://gcc.gnu.org/ml/gcc-help/2016-05/msg00003.html
Not sure if this applies:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54210
If I compile on a k8 Opteron 248 with -march=native, I do not see
-mprfchw listed in the options in -fverbose-asm. In the assembly, I
see this:
prefetcht0 (%rax) # ivtmp.1160
prefetcht0 304(%rcx) #
prefetcht0 (%rax) # ivtmp.1160
If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying
to target the older system), I do see it listed in the options in
-fverbose-asm. In the assembly, I see this:
prefetcht0 (%rax) # ivtmp.1160
prefetcht0 304(%rcx) #
prefetchw (%rax) # ivtmp.1160
(The third line is the only difference)
In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248?
Also, FWIW:
1) The march=native version that uses prefetcht0 is very repeatably
faster by about 15% in the particular test case I'm looking at.
2) The compilers in both instances are not just the same version, they
are the same compiler binary installed on an NFS mount and shared to both
computers.
* RE: option -mprfchw on 2 different Opteron cpus
From: Kumar, Venkataramanan @ 2016-05-02 9:55 UTC (permalink / raw)
To: NightStrike, Uros Bizjak (ubizjak@gmail.com), lopezibanez
Cc: Jan Hubicka, Jakub Jelinek, gcc
Hi,
> -----Original Message-----
> From: gcc-owner@gcc.gnu.org [mailto:gcc-owner@gcc.gnu.org] On Behalf Of
> NightStrike
> Sent: Monday, May 2, 2016 1:55 AM
> To: gcc@gcc.gnu.org
> Cc: Jan Hubicka <hubicka@ucw.cz>; Jakub Jelinek <jakub@redhat.com>
> Subject: option -mprfchw on 2 different Opteron cpus
>
> Reposting from here:
> https://gcc.gnu.org/ml/gcc-help/2016-05/msg00003.html
>
> Not sure if this applies:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54210
>
> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
> listed in the options in -fverbose-asm. In the assembly, I see this:
>
> prefetcht0 (%rax) # ivtmp.1160
> prefetcht0 304(%rcx) #
> prefetcht0 (%rax) # ivtmp.1160
On AMD processors, the -mprfchw flag enables "3dnowprefetch" ISA support.
(Snip)
CPUID Fn8000_0001_ECX Feature Identifiers
Bit 8
3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and
“PREFETCHW” in APM3
Ref: http://support.amd.com/TechDocs/25481.pdf
(Snip)
Can you please confirm what this CPUID flag returns on your k8 machine?
I believe this ISA is not available on the k8 machine, so with -march=native you don’t see -mprfchw in the verbose output.
>
> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to
> target the older system), I do see it listed in the options in -fverbose-asm. In
> the assembly, I see this:
K8 has 3dnow support, and there is a patch that replaced 3dnow with prefetchw (3DNowPrefetch):
https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
So when you add -march=k8, you see -mprfchw listed in the verbose output.
>
> prefetcht0 (%rax) # ivtmp.1160
> prefetcht0 304(%rcx) #
> prefetchw (%rax) # ivtmp.1160
>
> (The third line is the only difference)
>
This is my guess without seeing the test case: when write prefetching is requested, "prefetchw" is generated.
The 3dnow (TARGET_3DNOW) ISA has support for it.
(Snip)
Support for the PREFETCH and PREFETCHW instructions is indicated by CPUID
Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
Fn8000_0001_EDX[3DNow] = 1.
(Snip)
Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
> In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248?
>
> Also, FWIW:
>
> 1) The march=native version that uses prefetcht0 is very repeatably faster by
> about 15% in the particular test case I'm looking at.
>
> 2) The compilers in both instances are not just the same version, they are the
> same compiler binary installed on an NFS mount and shared to both
> computers.
From the GCC 4.9.3 source:
(Snip)
(define_expand "prefetch"
  [(prefetch (match_operand 0 "address_operand")
             (match_operand:SI 1 "const_int_operand")
             (match_operand:SI 2 "const_int_operand"))]
  "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
{
  bool write = INTVAL (operands[1]) != 0;
  int locality = INTVAL (operands[2]);

  gcc_assert (IN_RANGE (locality, 0, 3));

  /* Use 3dNOW prefetch in case we are asking for write prefetch not
     supported by SSE counterpart or the SSE prefetch is not available
     (K6 machines).  Otherwise use SSE prefetch as it allows specifying
     of locality.  */
  if (TARGET_PREFETCHWT1 && write && locality <= 2)
    operands[2] = const2_rtx;
  else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
    operands[2] = GEN_INT (3);
  else
    operands[1] = const0_rtx;
})
(Snip)
A write prefetch may be requested (either by the auto prefetcher or by builtins), but with -march=native the check below could have become false:
else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
TARGET_PRFCHW is off with -march=native.
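To make the effect of that check concrete, here is a minimal sketch in plain C that mirrors only the operand rewriting done by the expander quoted above. The struct names, the function name and the idea of passing the TARGET_* macros as booleans are hypothetical stand-ins for illustration; this is not GCC code. It also assumes that TARGET_PREFETCH_SSE is on for both builds, which seems consistent with the prefetcht0 instructions in both listings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for GCC's TARGET_* macros.  */
struct isa_flags {
    bool prefetch_sse;  /* TARGET_PREFETCH_SSE */
    bool prfchw;        /* TARGET_PRFCHW */
    bool prefetchwt1;   /* TARGET_PREFETCHWT1 */
};

/* The two integer operands of the prefetch pattern after expansion.  */
struct prefetch_ops {
    int write;     /* operands[1]: nonzero means write prefetch */
    int locality;  /* operands[2]: 0..3 */
};

/* Mirrors the if/else chain of the quoted define_expand "prefetch".  */
static struct prefetch_ops
expand_prefetch(struct isa_flags t, int write_req, int locality)
{
    struct prefetch_ops ops = { write_req, locality };
    bool write = write_req != 0;

    assert(locality >= 0 && locality <= 3);

    if (t.prefetchwt1 && write && locality <= 2)
        ops.locality = 2;   /* operands[2] = const2_rtx */
    else if (t.prfchw && (write || !t.prefetch_sse))
        ops.locality = 3;   /* operands[2] = GEN_INT (3) */
    else
        ops.write = 0;      /* operands[1] = const0_rtx: demoted to a read prefetch */

    return ops;
}
```

With prfchw set (the -march=k8 case described in this thread), a write request with locality 3 stays a write prefetch. With prfchw clear (the -march=native case), the same request falls through to the last branch and becomes a read prefetch with the locality unchanged.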
So there are two issues here.
(1) The ISA flags enabled with -march=k8 are different from those enabled with -march=native on a k8 machine.
(2) We need to check why the GCC middle end requested a write prefetch for the test case with -march=k8.
Regards,
Venkat.
* Re: option -mprfchw on 2 different Opteron cpus
From: NightStrike @ 2016-05-02 17:01 UTC (permalink / raw)
To: Kumar, Venkataramanan
Cc: Uros Bizjak (ubizjak@gmail.com),
lopezibanez, Jan Hubicka, Jakub Jelinek, gcc
On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
>> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
>> listed in the options in -fverbose-asm. In the assembly, I see this:
>>
>> prefetcht0 (%rax) # ivtmp.1160
>> prefetcht0 304(%rcx) #
>> prefetcht0 (%rax) # ivtmp.1160
>
> In AMD processors -mprfchw flag is used to enable "3dnowprefetch" ISA support.
>
> (Snip)
> CPUID Fn8000_0001_ECX Feature Identifiers
> Bit 8
> 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and
> “PREFETCHW” in APM3
> Ref: http://support.amd.com/TechDocs/25481.pdf
> (Snip)
>
> Can you please confirm what this CPUID flag returns on your k8 machine ?.
> I believe this ISA is not available on k8 machine so when -march=native is added you don’t see -mprfchw in verbose.
Looks like zero? This was generated with the cpuid program from
http://www.etallen.com/cpuid.html
CPU 0:
0x00000000 0x00: eax=0x00000001 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
0x00000001 0x00: eax=0x00000f58 ebx=0x00000800 ecx=0x00000000 edx=0x078bfbff
0x80000000 0x00: eax=0x80000018 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
0x80000001 0x00: eax=0x00000f58 ebx=0x00000405 ecx=0x00000000 edx=0xe1d3fbff
0x80000002 0x00: eax=0x20444d41 ebx=0x6574704f ecx=0x286e6f72 edx=0x20296d74
0x80000003 0x00: eax=0x636f7250 ebx=0x6f737365 ecx=0x34322072 edx=0x00000038
0x80000004 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000005 0x00: eax=0xff08ff08 ebx=0xff20ff20 ecx=0x40020140 edx=0x40020140
0x80000006 0x00: eax=0x00000000 ebx=0x42004200 ecx=0x04008140 edx=0x00000000
0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000009
0x80000008 0x00: eax=0x00003028 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000009 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000b 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000c 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000d 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000e 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000f 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000010 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000011 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000012 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000013 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000014 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000015 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000016 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000017 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000018 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80860000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0xc0000000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
CPU:
vendor_id = "AuthenticAMD"
version information (1/eax):
processor type = primary processor (0)
family = Intel Pentium 4/Pentium D/Pentium Extreme
Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon
XP-M/Opteron/Sempron/Turion (15)
model = 0x5 (5)
stepping id = 0x8 (8)
extended family = 0x0 (0)
extended model = 0x0 (0)
(simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon
64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
miscellaneous (1/ebx):
process local APIC physical ID = 0x0 (0)
cpu count = 0x0 (0)
CLFLUSH line size = 0x8 (8)
brand index = 0x0 (0)
brand id = 0x00 (0): unknown
feature information (1/edx):
x87 FPU on chip = true
virtual-8086 mode enhancement = true
debugging extensions = true
page size extensions = true
time stamp counter = true
RDMSR and WRMSR support = true
physical address extensions = true
machine check exception = true
CMPXCHG8B inst. = true
APIC on chip = true
SYSENTER and SYSEXIT = true
memory type range registers = true
PTE global bit = true
machine check architecture = true
conditional move/compare instruction = true
page attribute table = true
page size extension = true
processor serial number = false
CLFLUSH instruction = true
debug store = false
thermal monitor and clock ctrl = false
MMX Technology = true
FXSAVE/FXRSTOR = true
SSE extensions = true
SSE2 extensions = true
self snoop = false
hyper-threading / multi-core supported = false
therm. monitor = false
IA64 = false
pending break event = false
feature information (1/ecx):
PNI/SSE3: Prescott New Instructions = false
PCLMULDQ instruction = false
64-bit debug store = false
MONITOR/MWAIT = false
CPL-qualified debug store = false
VMX: virtual machine extensions = false
SMX: safer mode extensions = false
Enhanced Intel SpeedStep Technology = false
thermal monitor 2 = false
SSSE3 extensions = false
context ID: adaptive or shared L1 data = false
FMA instruction = false
CMPXCHG16B instruction = false
xTPR disable = false
perfmon and debug = false
process context identifiers = false
direct cache access = false
SSE4.1 extensions = false
SSE4.2 extensions = false
extended xAPIC support = false
MOVBE instruction = false
POPCNT instruction = false
time stamp counter deadline = false
AES instruction = false
XSAVE/XSTOR states = false
OS-enabled XSAVE/XSTOR = false
AVX: advanced vector extensions = false
F16C half-precision convert instruction = false
RDRAND instruction = false
hypervisor guest status = false
extended processor signature (0x80000001/eax):
family/generation = AMD Athlon 64/Opteron/Sempron/Turion (15)
model = 0x5 (5)
stepping id = 0x8 (8)
extended family = 0x0 (0)
extended model = 0x0 (0)
(simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon
64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
extended feature flags (0x80000001/edx):
x87 FPU on chip = true
virtual-8086 mode enhancement = true
debugging extensions = true
page size extensions = true
time stamp counter = true
RDMSR and WRMSR support = true
physical address extensions = true
machine check exception = true
CMPXCHG8B inst. = true
APIC on chip = true
SYSCALL and SYSRET instructions = true
memory type range registers = true
global paging extension = true
machine check architecture = true
conditional move/compare instruction = true
page attribute table = true
page size extension = true
multiprocessing capable = false
no-execute page protection = true
AMD multimedia instruction extensions = true
MMX Technology = true
FXSAVE/FXRSTOR = true
SSE extensions = false
1-GB large page support = false
RDTSCP = false
long mode (AA-64) = true
3DNow! instruction extensions = true
3DNow! instructions = true
extended brand id (0x80000001/ebx):
raw = 0x405 (1029)
BrandId = 0x405 (1029)
BrandTableIndex = 0x10 (16)
NN = 0x5 (5)
AMD feature flags (0x80000001/ecx):
LAHF/SAHF supported in 64-bit mode = false
CMP Legacy = false
SVM: secure virtual machine = false
extended APIC space = false
AltMovCr8 = false
LZCNT advanced bit manipulation = false
SSE4A support = false
misaligned SSE mode = false
3DNow! PREFETCH/PREFETCHW instructions = false
OS visible workaround = false
instruction based sampling = false
XOP support = false
SKINIT/STGI support = false
watchdog timer support = false
lightweight profiling support = false
4-operand FMA instruction = false
NodeId MSR C001100C = false
TBM support = false
topology extensions = false
brand = "AMD Opteron(tm) Processor 248"
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
instruction # entries = 0x8 (8)
instruction associativity = 0xff (255)
data # entries = 0x8 (8)
data associativity = 0xff (255)
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
instruction # entries = 0x20 (32)
instruction associativity = 0xff (255)
data # entries = 0x20 (32)
data associativity = 0xff (255)
L1 data cache information (0x80000005/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x1 (1)
associativity = 0x2 (2)
size (Kb) = 0x40 (64)
L1 instruction cache information (0x80000005/edx):
line size (bytes) = 0x40 (64)
lines per tag = 0x1 (1)
associativity = 0x2 (2)
size (Kb) = 0x40 (64)
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
instruction # entries = 0x0 (0)
instruction associativity = L2 off (0)
data # entries = 0x0 (0)
data associativity = L2 off (0)
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
instruction # entries = 0x200 (512)
instruction associativity = 4-way (4)
data # entries = 0x200 (512)
data associativity = 4-way (4)
L2 unified cache information (0x80000006/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x1 (1)
associativity = 16-way (8)
size (Kb) = 0x400 (1024)
L3 cache information (0x80000006/edx):
line size (bytes) = 0x0 (0)
lines per tag = 0x0 (0)
associativity = L2 off (0)
size (in 512Kb units) = 0x0 (0)
Advanced Power Management Features (0x80000007/edx):
temperature sensing diode = true
frequency ID (FID) control = false
voltage ID (VID) control = false
thermal trip (TTP) = true
thermal monitor (TM) = false
software thermal control (STC) = false
100 MHz multiplier control = false
hardware P-State control = false
TscInvariant = false
Physical Address and Linear Address Size (0x80000008/eax):
maximum physical address bits = 0x28 (40)
maximum linear (virtual) address bits = 0x30 (48)
maximum guest physical address bits = 0x0 (0)
Logical CPU cores (0x80000008/ecx):
number of CPU cores - 1 = 0x0 (0)
ApicIdCoreIdSize = 0x0 (0)
SVM Secure Virtual Machine (0x8000000a/eax):
SvmRev: SVM revision = 0x0 (0)
SVM Secure Virtual Machine (0x8000000a/edx):
nested paging = false
LBR virtualization = false
SVM lock = false
NRIP save = false
MSR based TSC rate control = false
VMCB clean bits support = false
flush by ASID = false
decode assists = false
SSSE3/SSE5 opcode set disable = false
pause intercept filter = false
pause filter threshold = false
NASID: number of address space identifiers = 0x0 (0):
(instruction supported synth):
CMPXCHG8B = true
conditional move/compare = true
PREFETCH/PREFETCHW = true
(multi-processing synth): none
(multi-processing method): AMD
(synth) = AMD Opteron (DP SledgeHammer SH7-C0), 940-pin, .13um Processor 248
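For reference, the raw 0x80000001 values in the dump above can be decoded by hand. The following is a rough sketch in C, not part of any tool mentioned here; the macro and function names are made up for illustration. Bit 8 of ECX is the 3DNowPrefetch bit from the quoted CPUID specification, and bits 29 and 31 of EDX are the LM and 3DNow! bits that the quoted APM rule refers to.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bits of CPUID Fn8000_0001 (names are illustrative).  */
#define ECX_3DNOWPREFETCH (UINT32_C(1) << 8)   /* ECX bit 8: 3DNowPrefetch */
#define EDX_LM            (UINT32_C(1) << 29)  /* EDX bit 29: long mode */
#define EDX_3DNOW         (UINT32_C(1) << 31)  /* EDX bit 31: 3DNow! */

/* True if the dedicated 3DNowPrefetch feature bit is set.  */
static bool has_3dnowprefetch_bit(uint32_t ecx)
{
    return (ecx & ECX_3DNOWPREFETCH) != 0;
}

/* The APM rule quoted earlier: PREFETCH/PREFETCHW are supported if
   ECX[3DNowPrefetch] OR EDX[LM] OR EDX[3DNow] is 1.  */
static bool prefetchw_supported(uint32_t ecx, uint32_t edx)
{
    return has_3dnowprefetch_bit(ecx)
        || (edx & EDX_LM) != 0
        || (edx & EDX_3DNOW) != 0;
}
```

Feeding in ecx=0x00000000 and edx=0xe1d3fbff from the 0x80000001 line gives a clear 3DNowPrefetch bit but a positive answer from the APM rule, which matches the "3DNow! PREFETCH/PREFETCHW instructions = false" and "PREFETCH/PREFETCHW = true" lines in the decoded output.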
>>
>> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to
>> target the older system), I do see it listed in the options in -fverbose-asm. In
>> the assembly, I see this:
>
> K8 has 3dnow support and there is a patch that replaced 3dnow with prefetchw (3DNowPrefetch).
> https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
> So when you add -march=k8 you see -mprfchw getting listed in verbose.
>
>>
>> prefetcht0 (%rax) # ivtmp.1160
>> prefetcht0 304(%rcx) #
>> prefetchw (%rax) # ivtmp.1160
>>
>> (The third line is the only difference)
>>
>
> This is my guess without seeing the test case, when write prefetching is requested "prefetchw" is generated.
> 3dnow (TARGET_3DNOW) ISA has support for it.
>
> (Snip)
> Support for the PREFETCH and PREFETCHW instructions is indicated by CPUID
> Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
> Fn8000_0001_EDX[3DNow] = 1.
> (Snip)
> Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
>
>> In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248?
>>
>> Also, FWIW:
>>
>> 1) The march=native version that uses prefetcht0 is very repeatably faster by
>> about 15% in the particular test case I'm looking at.
>>
>> 2) The compilers in both instances are not just the same version, they are the
>> same compiler binary installed on an NFS mount and shared to both
>> computers.
>
> As per GCC4.9.3 source.
>
> (Snip)
> (define_expand "prefetch"
> [(prefetch (match_operand 0 "address_operand")
> (match_operand:SI 1 "const_int_operand")
> (match_operand:SI 2 "const_int_operand"))]
> "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
> {
> bool write = INTVAL (operands[1]) != 0;
> int locality = INTVAL (operands[2]);
>
> gcc_assert (IN_RANGE (locality, 0, 3));
>
> /* Use 3dNOW prefetch in case we are asking for write prefetch not
> supported by SSE counterpart or the SSE prefetch is not available
> (K6 machines). Otherwise use SSE prefetch as it allows specifying
> of locality. */
> if (TARGET_PREFETCHWT1 && write && locality <= 2)
> operands[2] = const2_rtx;
> else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> operands[2] = GEN_INT (3);
> else
> operands[1] = const0_rtx;
> })
> (Snip)
>
> Write prefetch may be requested (either by auto prefetcher or builtins) but on -march=native, the below check could have become false.
> else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> TARGET_PRFCHW is off on native.
>
> So there are two issues here.
>
> (1) ISA flags enabled with -march=k8 is different from -march=native on k8 machine.
> (2) Need to check why GCC middle end requested write prefetch for the test case with -march=k8 .
>
> Regards,
> Venkat.
* RE: option -mprfchw on 2 different Opteron cpus
From: Kumar, Venkataramanan @ 2016-05-03 4:40 UTC (permalink / raw)
To: NightStrike
Cc: Uros Bizjak (ubizjak@gmail.com),
lopezibanez, Jan Hubicka, Jakub Jelinek, gcc
Hi
> -----Original Message-----
> From: NightStrike [mailto:nightstrike@gmail.com]
> Sent: Monday, May 2, 2016 10:31 PM
> To: Kumar, Venkataramanan <Venkataramanan.Kumar@amd.com>
> Cc: Uros Bizjak (ubizjak@gmail.com) <ubizjak@gmail.com>;
> lopezibanez@gmail.com; Jan Hubicka <hubicka@ucw.cz>; Jakub Jelinek
> <jakub@redhat.com>; gcc@gcc.gnu.org
> Subject: Re: option -mprfchw on 2 different Opteron cpus
>
> On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
> <Venkataramanan.Kumar@amd.com> wrote:
> >> If I compile on a k8 Opteron 248 with -march=native, I do not see
> >> -mprfchw listed in the options in -fverbose-asm. In the assembly, I see
> this:
> >>
> >> prefetcht0 (%rax) # ivtmp.1160
> >> prefetcht0 304(%rcx) #
> >> prefetcht0 (%rax) # ivtmp.1160
> >
> > In AMD processors -mprfchw flag is used to enable "3dnowprefetch" ISA
> support.
> >
> > (Snip)
> > CPUID Fn8000_0001_ECX Feature Identifiers Bit 8
> > 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See
> > “PREFETCH” and “PREFETCHW” in APM3
> > Ref: http://support.amd.com/TechDocs/25481.pdf
> > (Snip)
> >
> > Can you please confirm what this CPUID flag returns on your k8 machine ?.
> > I believe this ISA is not available on k8 machine so when -march=native is
> added you don’t see -mprfchw in verbose.
>
> Looks like zero? This was generated with the cpuid program from
> http://www.etallen.com/cpuid.html
>
> 3DNow! instruction extensions = true
> 3DNow! instructions = true
It has 3Dnow support. "prefetchw" is available with 3dnow.
> misaligned SSE mode = false
> 3DNow! PREFETCH/PREFETCHW instructions = false
It does not have 3DNowPrefetch, so enabling the ISA flag -mprfchw is not correct for -march=k8.
> OS visible workaround = false
> instruction based sampling = false
> >> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying
> >> to target the older system), I do see it listed in the options in
> >> -fverbose-asm. In the assembly, I see this:
> >
> > K8 has 3dnow support and there is a patch that replaced 3dnow with
> prefetchw (3DNowPrefetch).
> > https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
> > So when you add -march=k8 you see -mprfchw getting listed in verbose.
> >
> >>
> >> prefetcht0 (%rax) # ivtmp.1160
> >> prefetcht0 304(%rcx) #
> >> prefetchw (%rax) # ivtmp.1160
> >>
> >> (The third line is the only difference)
> >>
> >
> > This is my guess without seeing the test case, when write prefetching is
> requested "prefetchw" is generated.
> > 3dnow (TARGET_3DNOW) ISA has support for it.
> >
> > (Snip)
> > Support for the PREFETCH and PREFETCHW instructions is indicated by
> > CPUID Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
> > Fn8000_0001_EDX[3DNow] = 1.
> > (Snip)
> > Ref:
> http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
> >
> >> In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248?
> >>
> >> Also, FWIW:
> >>
> >> 1) The march=native version that uses prefetcht0 is very repeatably
> >> faster by about 15% in the particular test case I'm looking at.
> >>
> >> 2) The compilers in both instances are not just the same version,
> >> they are the same compiler binary installed on an NFS mount and
> >> shared to both computers.
> >
> > As per GCC4.9.3 source.
> >
> > (Snip)
> > (define_expand "prefetch"
> > [(prefetch (match_operand 0 "address_operand")
> > (match_operand:SI 1 "const_int_operand")
> > (match_operand:SI 2 "const_int_operand"))]
> > "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
> > {
> > bool write = INTVAL (operands[1]) != 0;
> > int locality = INTVAL (operands[2]);
> >
> > gcc_assert (IN_RANGE (locality, 0, 3));
> >
> > /* Use 3dNOW prefetch in case we are asking for write prefetch not
> > supported by SSE counterpart or the SSE prefetch is not available
> > (K6 machines). Otherwise use SSE prefetch as it allows specifying
> > of locality. */
> > if (TARGET_PREFETCHWT1 && write && locality <= 2)
> > operands[2] = const2_rtx;
> > else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> > operands[2] = GEN_INT (3);
> > else
> > operands[1] = const0_rtx;
> > })
> > (Snip)
> >
> > Write prefetch may be requested (either by auto prefetcher or builtins) but
> on -march=native, the below check could have become false.
> > else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> > TARGET_PRFCHW is off on native.
> >
> > So there are two issues here.
> >
> > (1) ISA flags enabled with -march=k8 is different from -march=native on k8
> machine.
I think we need to file a bug for this. We need to check with Uros why the flag -mprfchw is shared with 3dnow.
To work around this issue, you can use -mno-prfchw when building with -march=k8.
> > (2) Need to check why GCC middle end requested write prefetch for the
> test case with -march=k8 .
Regarding the "prefetchw" generation, it may be that the GCC auto prefetcher requests write prefetches.
AFAIK a write prefetch brings the data from memory, marks the cache line as modified, and expects a write to happen next.
If a read happens to that cache line instead, the data will be written back to memory before the read, which is unnecessary.
It is hard to answer without a test case, and I don’t have a k8 machine at hand.
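In the absence of the real test case, something along the following lines could serve as a starting point for a reduced one. This is only a sketch: the function name, the prefetch distance and the file name in the comment are invented for illustration, and it requests the write prefetch explicitly through __builtin_prefetch rather than relying on the auto prefetcher, so it may not reproduce what the original code triggers.

```c
#include <stddef.h>

/* Arbitrary illustrative look-ahead, in elements.  */
#define PREFETCH_DISTANCE 64

/* Scales an array in place, issuing an explicit write prefetch
   (rw = 1) with maximum locality (3) a fixed distance ahead.

   One way to compare the two configurations from this thread
   (prefetch_test.c is a hypothetical file name):
     gcc -O2 -march=k8 -fverbose-asm -S prefetch_test.c
     gcc -O2 -march=k8 -mno-prfchw -fverbose-asm -S prefetch_test.c
   and then look at which prefetch instruction appears in the .s file.  */
void scale_in_place(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 1, 3);
        a[i] *= k;
    }
}
```

The prefetch is only a hint, so the function computes the same result whichever instruction the compiler picks; only the generated assembly and the timing should differ.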
> >
> > Regards,
> > Venkat.
* Re: option -mprfchw on 2 different Opteron cpus
From: NightStrike @ 2016-08-16 16:43 UTC (permalink / raw)
To: Kumar, Venkataramanan
Cc: Uros Bizjak (ubizjak@gmail.com),
lopezibanez, Jan Hubicka, Jakub Jelinek, gcc
On Tue, May 3, 2016 at 12:40 AM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
> Hi
>
>> -----Original Message-----
>> From: NightStrike [mailto:nightstrike@gmail.com]
>> Sent: Monday, May 2, 2016 10:31 PM
>> To: Kumar, Venkataramanan <Venkataramanan.Kumar@amd.com>
>> Cc: Uros Bizjak (ubizjak@gmail.com) <ubizjak@gmail.com>;
>> lopezibanez@gmail.com; Jan Hubicka <hubicka@ucw.cz>; Jakub Jelinek
>> <jakub@redhat.com>; gcc@gcc.gnu.org
>> Subject: Re: option -mprfchw on 2 different Opteron cpus
>>
>> On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
>> <Venkataramanan.Kumar@amd.com> wrote:
>> >> If I compile on a k8 Opteron 248 with -march=native, I do not see
>> >> -mprfchw listed in the options in -fverbose-asm. In the assembly, I see
>> this:
>> >>
>> >> prefetcht0 (%rax) # ivtmp.1160
>> >> prefetcht0 304(%rcx) #
>> >> prefetcht0 (%rax) # ivtmp.1160
>> >
>> > In AMD processors -mprfchw flag is used to enable "3dnowprefetch" ISA
>> support.
>> >
>> > (Snip)
>> > CPUID Fn8000_0001_ECX Feature Identifiers Bit 8
>> > 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See
>> > “PREFETCH” and “PREFETCHW” in APM3
>> > Ref: http://support.amd.com/TechDocs/25481.pdf
>> > (Snip)
>> >
>> > Can you please confirm what this CPUID flag returns on your k8 machine ?.
>> > I believe this ISA is not available on k8 machine so when -march=native is
>> added you don’t see -mprfchw in verbose.
>>
>> Looks like zero? This was generated with the cpuid program from
>> http://www.etallen.com/cpuid.html
>>
>> 3DNow! instruction extensions = true
>> 3DNow! instructions = true
>
> It has 3Dnow support. "prefetchw" is available with 3dnow.
>
>> misaligned SSE mode = false
>> 3DNow! PREFETCH/PREFETCHW instructions = false
>
> It does not have 3DNowPrefetch, so enabling the ISA flag -mprfchw is not correct for -march=k8.
>
>> OS visible workaround = false
>> instruction based sampling = false
>> >> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying
>> >> to target the older system), I do see it listed in the options in
>> >> -fverbose-asm. In the assembly, I see this:
>> >
>> > K8 has 3dnow support and there is a patch that replaced 3dnow with
>> prefetchw (3DNowPrefetch).
>> > https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
>> > So when you add -march=k8 you see -mprfchw getting listed in verbose.
>> >
>> >>
>> >> prefetcht0 (%rax) # ivtmp.1160
>> >> prefetcht0 304(%rcx) #
>> >> prefetchw (%rax) # ivtmp.1160
>> >>
>> >> (The third line is the only difference)
>> >>
>> >
>> > This is my guess without seeing the test case, when write prefetching is
>> requested "prefetchw" is generated.
>> > 3dnow (TARGET_3DNOW) ISA has support for it.
>> >
>> > (Snip)
>> > Support for the PREFETCH and PREFETCHW instructions is indicated by
>> > CPUID Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
>> > Fn8000_0001_EDX[3DNow] = 1.
>> > (Snip)
>> > Ref:
>> http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
>> >
>> >> In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248?
>> >>
>> >> Also, FWIW:
>> >>
>> >> 1) The march=native version that uses prefetcht0 is very repeatably
>> >> faster by about 15% in the particular test case I'm looking at.
>> >>
>> >> 2) The compilers in both instances are not just the same version,
>> >> they are the same compiler binary installed on an NFS mount and
>> >> shared to both computers.
>> >
>> > As per GCC4.9.3 source.
>> >
>> > (Snip)
>> > (define_expand "prefetch"
>> > [(prefetch (match_operand 0 "address_operand")
>> > (match_operand:SI 1 "const_int_operand")
>> > (match_operand:SI 2 "const_int_operand"))]
>> > "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
>> > {
>> > bool write = INTVAL (operands[1]) != 0;
>> > int locality = INTVAL (operands[2]);
>> >
>> > gcc_assert (IN_RANGE (locality, 0, 3));
>> >
>> > /* Use 3dNOW prefetch in case we are asking for write prefetch not
>> > supported by SSE counterpart or the SSE prefetch is not available
>> > (K6 machines). Otherwise use SSE prefetch as it allows specifying
>> > of locality. */
>> > if (TARGET_PREFETCHWT1 && write && locality <= 2)
>> > operands[2] = const2_rtx;
>> > else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
>> > operands[2] = GEN_INT (3);
>> > else
>> > operands[1] = const0_rtx;
>> > })
>> > (Snip)
>> >
>> > Write prefetch may be requested (either by auto prefetcher or builtins) but
>> on -march=native, the below check could have become false.
>> > else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
>> > TARGET_PRFCHW is off on native.
>> >
>> > So there are two issues here.
>> >
>> > (1) ISA flags enabled with -march=k8 is different from -march=native on k8
>> machine.
>
> I think we need to file bug for this. Need to check with Uros why the flag -mprfchw is shared with 3dnow.
> To work around this issue you can use -mno-prfchw when building with -march=k8.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77270
>> > (2) Need to check why GCC middle end requested write prefetch for the
>> test case with -march=k8 .
> On "prefetchw" generation it may be the case that GCC auto prefetcher requests write prefetches.
> AFAIK generating write prefetches brings data from memory and marks the cache line modified and expects a write to happen next.
> If read happens to that cache line instead then data will be written back to memory before read which will be unnecessary.
> Hard to answer without test case and I don’t have a ready k8 machine with me.
Should this be another bug filed if I can get a reduced test case, or
is PR77270 enough, or is this not a bug?