From: NightStrike
Date: Mon, 02 May 2016 17:01:00 -0000
Subject: Re: option -mprfchw on 2 different Opteron cpus
To: "Kumar, Venkataramanan"
Cc: "Uros Bizjak (ubizjak@gmail.com)", "lopezibanez@gmail.com", Jan Hubicka, Jakub Jelinek, "gcc@gcc.gnu.org"

On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan wrote:
>> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
>> listed in the options in -fverbose-asm.  In the assembly, I see this:
>>
>>         prefetcht0      (%rax)  # ivtmp.1160
>>         prefetcht0      304(%rcx)       #
>>         prefetcht0      (%rax)  # ivtmp.1160
>
> On AMD processors, the -mprfchw flag is used to enable "3dnowprefetch" ISA support.
>
> (Snip)
> CPUID Fn8000_0001_ECX Feature Identifiers
> Bit 8
> 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and
> “PREFETCHW” in APM3
> Ref: http://support.amd.com/TechDocs/25481.pdf
> (Snip)
>
> Can you please confirm what this CPUID flag returns on your k8 machine?
> I believe this ISA is not available on the k8 machine, so when -march=native is added you don’t see -mprfchw in the verbose output.

Looks like zero?
This was generated with the cpuid program from
http://www.etallen.com/cpuid.html

CPU 0:
   0x00000000 0x00: eax=0x00000001 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x00000001 0x00: eax=0x00000f58 ebx=0x00000800 ecx=0x00000000 edx=0x078bfbff
   0x80000000 0x00: eax=0x80000018 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x80000001 0x00: eax=0x00000f58 ebx=0x00000405 ecx=0x00000000 edx=0xe1d3fbff
   0x80000002 0x00: eax=0x20444d41 ebx=0x6574704f ecx=0x286e6f72 edx=0x20296d74
   0x80000003 0x00: eax=0x636f7250 ebx=0x6f737365 ecx=0x34322072 edx=0x00000038
   0x80000004 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000005 0x00: eax=0xff08ff08 ebx=0xff20ff20 ecx=0x40020140 edx=0x40020140
   0x80000006 0x00: eax=0x00000000 ebx=0x42004200 ecx=0x04008140 edx=0x00000000
   0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000009
   0x80000008 0x00: eax=0x00003028 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000009 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000b 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000c 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000d 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000e 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000f 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000010 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000011 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000012 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000013 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000014 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000015 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000016 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000017 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000018 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80860000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0xc0000000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000

CPU:
   vendor_id = "AuthenticAMD"
   version information (1/eax):
      processor type = primary processor (0)
      family = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model = 0x5 (5)
      stepping id = 0x8 (8)
      extended family = 0x0 (0)
      extended model = 0x0 (0)
      (simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x0 (0)
      cpu count = 0x0 (0)
      CLFLUSH line size = 0x8 (8)
      brand index = 0x0 (0)
   brand id = 0x00 (0): unknown
   feature information (1/edx):
      x87 FPU on chip = true
      virtual-8086 mode enhancement = true
      debugging extensions = true
      page size extensions = true
      time stamp counter = true
      RDMSR and WRMSR support = true
      physical address extensions = true
      machine check exception = true
      CMPXCHG8B inst. = true
      APIC on chip = true
      SYSENTER and SYSEXIT = true
      memory type range registers = true
      PTE global bit = true
      machine check architecture = true
      conditional move/compare instruction = true
      page attribute table = true
      page size extension = true
      processor serial number = false
      CLFLUSH instruction = true
      debug store = false
      thermal monitor and clock ctrl = false
      MMX Technology = true
      FXSAVE/FXRSTOR = true
      SSE extensions = true
      SSE2 extensions = true
      self snoop = false
      hyper-threading / multi-core supported = false
      therm. monitor = false
      IA64 = false
      pending break event = false
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions = false
      PCLMULDQ instruction = false
      64-bit debug store = false
      MONITOR/MWAIT = false
      CPL-qualified debug store = false
      VMX: virtual machine extensions = false
      SMX: safer mode extensions = false
      Enhanced Intel SpeedStep Technology = false
      thermal monitor 2 = false
      SSSE3 extensions = false
      context ID: adaptive or shared L1 data = false
      FMA instruction = false
      CMPXCHG16B instruction = false
      xTPR disable = false
      perfmon and debug = false
      process context identifiers = false
      direct cache access = false
      SSE4.1 extensions = false
      SSE4.2 extensions = false
      extended xAPIC support = false
      MOVBE instruction = false
      POPCNT instruction = false
      time stamp counter deadline = false
      AES instruction = false
      XSAVE/XSTOR states = false
      OS-enabled XSAVE/XSTOR = false
      AVX: advanced vector extensions = false
      F16C half-precision convert instruction = false
      RDRAND instruction = false
      hypervisor guest status = false
   extended processor signature (0x80000001/eax):
      family/generation = AMD Athlon 64/Opteron/Sempron/Turion (15)
      model = 0x5 (5)
      stepping id = 0x8 (8)
      extended family = 0x0 (0)
      extended model = 0x0 (0)
      (simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   extended feature flags (0x80000001/edx):
      x87 FPU on chip = true
      virtual-8086 mode enhancement = true
      debugging extensions = true
      page size extensions = true
      time stamp counter = true
      RDMSR and WRMSR support = true
      physical address extensions = true
      machine check exception = true
      CMPXCHG8B inst. = true
      APIC on chip = true
      SYSCALL and SYSRET instructions = true
      memory type range registers = true
      global paging extension = true
      machine check architecture = true
      conditional move/compare instruction = true
      page attribute table = true
      page size extension = true
      multiprocessing capable = false
      no-execute page protection = true
      AMD multimedia instruction extensions = true
      MMX Technology = true
      FXSAVE/FXRSTOR = true
      SSE extensions = false
      1-GB large page support = false
      RDTSCP = false
      long mode (AA-64) = true
      3DNow! instruction extensions = true
      3DNow! instructions = true
   extended brand id (0x80000001/ebx):
      raw = 0x405 (1029)
      BrandId = 0x405 (1029)
      BrandTableIndex = 0x10 (16)
      NN = 0x5 (5)
   AMD feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode = false
      CMP Legacy = false
      SVM: secure virtual machine = false
      extended APIC space = false
      AltMovCr8 = false
      LZCNT advanced bit manipulation = false
      SSE4A support = false
      misaligned SSE mode = false
      3DNow! PREFETCH/PREFETCHW instructions = false
      OS visible workaround = false
      instruction based sampling = false
      XOP support = false
      SKINIT/STGI support = false
      watchdog timer support = false
      lightweight profiling support = false
      4-operand FMA instruction = false
      NodeId MSR C001100C = false
      TBM support = false
      topology extensions = false
   brand = "AMD Opteron(tm) Processor 248"
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
      instruction # entries = 0x8 (8)
      instruction associativity = 0xff (255)
      data # entries = 0x8 (8)
      data associativity = 0xff (255)
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
      instruction # entries = 0x20 (32)
      instruction associativity = 0xff (255)
      data # entries = 0x20 (32)
      data associativity = 0xff (255)
   L1 data cache information (0x80000005/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag = 0x1 (1)
      associativity = 0x2 (2)
      size (Kb) = 0x40 (64)
   L1 instruction cache information (0x80000005/edx):
      line size (bytes) = 0x40 (64)
      lines per tag = 0x1 (1)
      associativity = 0x2 (2)
      size (Kb) = 0x40 (64)
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
      instruction # entries = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries = 0x0 (0)
      data associativity = L2 off (0)
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
      instruction # entries = 0x200 (512)
      instruction associativity = 4-way (4)
      data # entries = 0x200 (512)
      data associativity = 4-way (4)
   L2 unified cache information (0x80000006/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag = 0x1 (1)
      associativity = 16-way (8)
      size (Kb) = 0x400 (1024)
   L3 cache information (0x80000006/edx):
      line size (bytes) = 0x0 (0)
      lines per tag = 0x0 (0)
      associativity = L2 off (0)
      size (in 512Kb units) = 0x0 (0)
   Advanced Power Management Features (0x80000007/edx):
      temperature sensing diode = true
      frequency ID (FID) control = false
      voltage ID (VID) control = false
      thermal trip (TTP) = true
      thermal monitor (TM) = false
      software thermal control (STC) = false
      100 MHz multiplier control = false
      hardware P-State control = false
      TscInvariant = false
   Physical Address and Linear Address Size (0x80000008/eax):
      maximum physical address bits = 0x28 (40)
      maximum linear (virtual) address bits = 0x30 (48)
      maximum guest physical address bits = 0x0 (0)
   Logical CPU cores (0x80000008/ecx):
      number of CPU cores - 1 = 0x0 (0)
      ApicIdCoreIdSize = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/eax):
      SvmRev: SVM revision = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/edx):
      nested paging = false
      LBR virtualization = false
      SVM lock = false
      NRIP save = false
      MSR based TSC rate control = false
      VMCB clean bits support = false
      flush by ASID = false
      decode assists = false
      SSSE3/SSE5 opcode set disable = false
      pause intercept filter = false
      pause filter threshold = false
   NASID: number of address space identifiers = 0x0 (0):
   (instruction supported synth):
      CMPXCHG8B = true
      conditional move/compare = true
      PREFETCH/PREFETCHW = true
   (multi-processing synth): none
   (multi-processing method): AMD
   (synth) = AMD Opteron (DP SledgeHammer SH7-C0), 940-pin, .13um Processor 248

>>
>> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to
>> target the older system), I do see it listed in the options in -fverbose-asm.  In
>> the assembly, I see this:
>
> K8 has 3dnow support, and there is a patch that replaced 3dnow with prefetchw (3DNowPrefetch).
> https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
> So when you add -march=k8 you see -mprfchw listed in the verbose output.
>
>>
>>         prefetcht0      (%rax)  # ivtmp.1160
>>         prefetcht0      304(%rcx)       #
>>         prefetchw       (%rax)  # ivtmp.1160
>>
>> (The third line is the only difference)
>>
>
> This is my guess without seeing the test case: when write prefetching is requested, "prefetchw" is generated.
> The 3dnow (TARGET_3DNOW) ISA has support for it.
>
> (Snip)
> Support for the PREFETCH and PREFETCHW instructions is indicated by CPUID
> Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
> Fn8000_0001_EDX[3DNow] = 1.
> (Snip)
> Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
>
>> In both cases, I'm using gcc 4.9.3.  Which is correct for a k8 Opteron 248?
>>
>> Also, FWIW:
>>
>> 1) The march=native version that uses prefetcht0 is very repeatably faster by
>> about 15% in the particular test case I'm looking at.
>>
>> 2) The compilers in both instances are not just the same version, they are the
>> same compiler binary installed on an NFS mount and shared to both
>> computers.
>
> As per the GCC 4.9.3 source:
>
> (Snip)
> (define_expand "prefetch"
>   [(prefetch (match_operand 0 "address_operand")
>              (match_operand:SI 1 "const_int_operand")
>              (match_operand:SI 2 "const_int_operand"))]
>   "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
> {
>   bool write = INTVAL (operands[1]) != 0;
>   int locality = INTVAL (operands[2]);
>
>   gcc_assert (IN_RANGE (locality, 0, 3));
>
>   /* Use 3dNOW prefetch in case we are asking for write prefetch not
>      supported by SSE counterpart or the SSE prefetch is not available
>      (K6 machines).  Otherwise use SSE prefetch as it allows specifying
>      of locality.  */
>   if (TARGET_PREFETCHWT1 && write && locality <= 2)
>     operands[2] = const2_rtx;
>   else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
>     operands[2] = GEN_INT (3);
>   else
>     operands[1] = const0_rtx;
> })
> (Snip)
>
> A write prefetch may be requested (either by the auto prefetcher or by builtins), but with -march=native the check below could have become false:
> else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> TARGET_PRFCHW is off on native.
>
> So there are two issues here.
>
> (1) The ISA flags enabled with -march=k8 are different from those enabled with -march=native on a k8 machine.
> (2) We need to check why the GCC middle end requested a write prefetch for the test case with -march=k8.
>
> Regards,
> Venkat.