Subject: Re: Altivec strangeness?
From: Daniel Egger
To: Aldy Hernandez
Cc: GCC Developer Mailinglist
Date: Sun, 24 Feb 2002 16:41:00 -0000
Message-Id: <1014596591.3287.19.camel@sonja>

On Mon, 2002-02-25 at 00:26, Aldy Hernandez wrote:

> > a) nasty because it requires a lot of typing.

> declare a macro:

> #define VSHORT_1S ((vector short int){1,1,1,1,1,1,1,1})

That's not much shorter than

const vector short shortones = (vector short int){1,1,1,1,1,1,1,1};

globally defined.

> as i have mentioned before, the vector initializers generate pretty
> bad code, but that will be remedied when, in 3.2, i rewrite them
> to use the vector constant infrastructure. right now, they just
> get initialized as arrays, which is less than optimal.

Indeed.

> in the code's defense, how many times do you initialize a given
> vector in a function? once! it's not like it's going to drag
> performance down.

No, not in my case. I have small functions which have a generic
implementation but can be replaced by vectorised code. A profile of
a short run of the application looks like this:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
11.83      0.68     0.68    99864     6.81     9.81  synth_filter
11.48      1.34     0.66  2105726     0.31     0.31  put_pixels_altivec
11.48      2.00     0.66  1436416     0.46     0.46  j_rev_dct_altivec

For a function which is called a few million times per second of
runtime, it makes a big difference whether a constant vector is
loaded from memory, which requires extra code to set up the base
address for the vector load, or is simply splatted into a vector
register, which uses less memory and fewer opcodes and is likely to
take the same number of CPU cycles.

This is an example of assembly output produced by gcc 3.1:

        .align 2
        .globl put_pixels_clamped_altivec
        .type put_pixels_clamped_altivec,@function
put_pixels_clamped_altivec:
        lis %r0,0x108
        lis %r9,zeros@ha
        ori %r0,%r0,16
        la %r9,zeros@l(%r9)
        dst %r3,%r0,0
        lvx %v13,0,%r9
        li %r0,8
        li %r11,0
        mtctr %r0
        li %r9,4
.L53:
        lvx %v0,0,%r3
        addi %r3,%r3,16
        vpkshus %v0,%v0,%v13
        vspltw %v1,%v0,1
        vspltw %v0,%v0,0
        stvewx %v0,%r11,%r4
        stvewx %v1,%r9,%r4
        add %r4,%r4,%r5
        bdnz .L53
        blr

As you can see, it takes an additional lis/la pair to get the address
for the vector load. The loop is executed 8 times, BTW.

> and if you have it in a loop, it's probably invariant, so move it out of it.

You can bet on it. :)

> let's concentrate on getting the bugs ironed out of the current
> implementation, and then we can tackle code quality issues.

I hope you don't mind if I fool around a bit with code generation
now. :)

--
Servus,
       Daniel