From: "anlauf at gmx dot de"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/57037] New: GCC does not generate non-temporal stores on i386 with SSE2+
Date: Mon, 22 Apr 2013 20:29:00 -0000

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57037

             Bug #: 57037
           Summary: GCC does not generate non-temporal stores on i386 with SSE2+
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: anlauf@gmx.de

Hello,

it appears that gcc does not generate the non-temporal stores that are available on i386, at least with SSE2. This is an important optimization for some memory-bandwidth-limited codes.

Example: for the stream triad kernel,

      subroutine stream_kernel_triad (a, b, c, n, s)
        integer         , intent(in) :: n
        double precision             :: a(*), b(*), c(*)
        double precision, intent(in) :: s
        integer :: j
        do j = 1,n
           a(j) = b(j) + s*c(j)
        end do
      end subroutine stream_kernel_triad

the Intel compiler generates vectorized code with a throughput that is 25% higher on my Core2 than when disabling the generation of non-temporal stores (i.e. compiling with "-opt-streaming-stores never").

gfortran (using -Ofast -fprefetch-loop-arrays) exactly reproduces the performance of the Intel compiler without non-temporal stores.

It appears that this is an important optimization.

Harald
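
For reference, a minimal sketch of the triad kernel with the non-temporal stores written by hand, assuming C with SSE2 intrinsics. The function name triad_stream is hypothetical, and this is not claimed to match what the Intel compiler emits. It only illustrates the kind of store (movntpd via _mm_stream_pd) the report asks GCC to generate automatically. Without SSE2 it falls back to a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_pd, _mm_loadu_pd, ... */
#endif

/* Hypothetical helper: a[j] = b[j] + s*c[j] for j in [0, n).
 * With SSE2, the results are written with non-temporal (streaming)
 * stores, which bypass the cache on the way to memory. */
void triad_stream(double *a, const double *b, const double *c,
                  size_t n, double s)
{
    size_t j = 0;
#if defined(__SSE2__)
    /* _mm_stream_pd requires a 16-byte aligned destination, so peel
     * scalar iterations until a + j is aligned. */
    while (j < n && ((uintptr_t)(a + j) & 15u) != 0) {
        a[j] = b[j] + s * c[j];
        ++j;
    }

    const __m128d vs = _mm_set1_pd(s);
    for (; j + 2 <= n; j += 2) {
        assert(((uintptr_t)(a + j) & 15u) == 0);
        __m128d vb = _mm_loadu_pd(b + j);   /* unaligned loads are fine */
        __m128d vc = _mm_loadu_pd(c + j);
        _mm_stream_pd(a + j, _mm_add_pd(vb, _mm_mul_pd(vs, vc)));
    }

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#endif
    /* Scalar remainder (or the whole loop when SSE2 is unavailable). */
    for (; j < n; ++j)
        a[j] = b[j] + s * c[j];
}
```

Streaming stores usually pay off only when the destination array is much larger than the cache and is not read again soon afterwards, which is the situation in the stream benchmark.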