Date: Fri, 13 Aug 2010 07:26:00 -0000
From: Andi Kleen
To: Richard Henderson
Cc: Andi Kleen, gcc-patches@gcc.gnu.org
Subject: Re: Vectorized _cpp_clean_line
Message-ID: <20100813070300.GA12885@gargoyle.fritz.box>
In-Reply-To: <4C647448.6080707@redhat.com>

On Thu, Aug 12, 2010 at 03:23:04PM -0700, Richard Henderson wrote:
> On 08/12/2010 03:07 PM, Andi Kleen wrote:
> > At least for SSE 4.2 I'm not sure the table lookup
> > for alignment is worth it.  The unaligned loads are quite
> > cheap on current microarchitectures with SSE 4.2
> > and the page end test is also not that expensive.
>
> Perhaps.  That's something else that will want testing, as
> it's all of a dozen instructions.
>
> At minimum the page end test should not be performed inside
> the loop.  We can adjust END before beginning the loop so
> that we never cross a page.

The test runs in parallel with the match on an out-of-order CPU.  It
would only be a problem if you were decoder limited.

Moving it out of the loop would require special-case tail code.  glibc
uses a lot of switch statements for that in its code, and I didn't
like that approach.

The best option would probably be to ensure the caller always provides
a tail pad after the buffer, but that is presumably difficult if you
mmap() the input file.
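To make the cost argument concrete: the per-iteration test only has to
decide whether the next 16-byte unaligned load would run off the end of
the current page.  A minimal sketch, assuming 4 KiB pages and a 16-byte
vector width (illustrative only, not the exact code from the patch):

#include <stdbool.h>
#include <stdint.h>

/* Return true if a 16-byte load starting at P would cross into the
   following 4 KiB page.  The page size is assumed, not queried.  */
static inline bool
load16_crosses_page (const unsigned char *p)
{
  return ((uintptr_t) p & 0xfff) > 0x1000 - 16;
}

Inside the loop that is an AND, a compare and a normally-not-taken
branch, which is what overlaps with the vector match on an out-of-order
machine; only when it fires do you need a slower byte-at-a-time tail.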
> > I originally avoided the indirect call because I was worried
> > about the effect on CPUs with indirect branch predictor.
>
> WithOUT the indirect branch predictor, you mean?  Which ones

Yes, without.

> don't have that?  Surely we have to be going back pretty far...

Nope.  Indirect branch predictors are a relatively recent invention;
a lot of x86 CPUs still in use don't have one.

> Since the call is the same destination every time, that matches
> up well with the indirect branch predictor, AFAIK.  If we're
> worried about the indirect branch predictor, we could write

Yes, if you have an indirect branch predictor you're fine, assuming the
rest of the compiler hasn't thrashed its buffers.

Or maybe profile feedback will fix it and do the necessary inlining
(but you have to fix PR45227 first :-).  Also, when I tested profile
feedback last time it didn't seem to work very well, and it would only
help if you train on the same type of system as the end host.  Or maybe
it's all in the wash because this is only called once per line.

>   static inline bool
>   search_line_fast (s, end, out)
>   {
>     if (fast_impl == 0)
>       return search_line_sse42 (s, end, out);
>     else if (fast_impl == 1)
>       return search_line_sse2 (s, end, out);
>     else
>       return search_line_acc_char (s, end, out);
>   }
>
> where FAST_IMPL is set up appropriately by init_vectorized_lexer.
>
> The question being, are three predicted jumps faster than one
> indirect jump on a processor without the proper predictor?

Yes, usually, especially if you don't have to go through all three on
average.

-Andi

P.S.: I wonder if there's more to be gained from larger changes in
cpplib.  The clang preprocessor doesn't use vectorization, and it still
seems to be faster?
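P.P.S.: For comparison, the indirect-call variant I was talking about
would look roughly like the sketch below.  The pointer name, the
signature (which just follows your pseudocode above) and the cpuid
test are illustrative; this is not code from either patch.

#include <cpuid.h>
#include <stdbool.h>

/* The three implementations discussed above, assumed to exist with
   this signature.  */
extern bool search_line_sse42 (const unsigned char *, const unsigned char *,
                               const unsigned char **);
extern bool search_line_sse2 (const unsigned char *, const unsigned char *,
                              const unsigned char **);
extern bool search_line_acc_char (const unsigned char *, const unsigned char *,
                                  const unsigned char **);

/* Hypothetical fast-path pointer, chosen once at startup.  */
static bool (*search_line_impl) (const unsigned char *, const unsigned char *,
                                 const unsigned char **);

static void
init_vectorized_lexer (void)
{
  unsigned int ax, bx, cx, dx;

  search_line_impl = search_line_acc_char;   /* portable fallback */
  if (__get_cpuid (1, &ax, &bx, &cx, &dx))
    {
      if (cx & bit_SSE4_2)
        search_line_impl = search_line_sse42;
      else if (dx & bit_SSE2)
        search_line_impl = search_line_sse2;
    }
}

Every caller then does search_line_impl (s, end, &out): one indirect
call per line instead of the two or three predicted branches above.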