From: Kito Cheng
Date: Mon, 16 Aug 2021 18:02:42 +0800
Subject: Re: [PATCH] RISC-V: Enable overlap-by-pieces in case of fast unaligned access
To: Christoph Müllner
Cc: Palmer Dabbelt, GCC Patches, Andrew Waterman, Jojo R

Hi Christoph:

Could you submit a v3 patch, which is v1 with the overlap_op_by_pieces
field, the testcase from v2, and a few more comments to describe the
field? And could you add an -mtune=ultra-size so this can be tested
without changing other behavior?

Hi Palmer:

Are you OK with that?
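
For reference, a minimal sketch of what such a dedicated tuning knob could
look like: an overlap_op_by_pieces field next to slow_unaligned_access, so
that slow_unaligned_access == false no longer implies overlapping by-pieces
operations. This is an illustration only, not the v3 patch; the struct,
variable and function names below are simplified stand-ins rather than
copies of the ones in gcc/config/riscv/riscv.c.

#include <stdbool.h>

/* Simplified stand-in for riscv_tune_param; only the two knobs relevant
   to this discussion are shown.  */
struct riscv_tune_param_sketch
{
  bool slow_unaligned_access;   /* existing knob: unaligned accesses are slow  */
  bool overlap_op_by_pieces;    /* proposed knob: allow overlapping by-pieces ops  */
};

/* C906: unaligned accesses are fast; whether overlapping stores are also
   fast still needs benchmarking, so the 'true' here is an assumption.  */
static const struct riscv_tune_param_sketch thead_c906_tune_sketch =
{
  false,  /* slow_unaligned_access  */
  true,   /* overlap_op_by_pieces  */
};

static const struct riscv_tune_param_sketch *tune_sketch =
  &thead_c906_tune_sketch;

/* Under this scheme, the TARGET_OVERLAP_OP_BY_PIECES_P hook returns the
   dedicated field instead of negating riscv_slow_unaligned_access_p.  */
static bool
riscv_overlap_op_by_pieces_sketch (void)
{
  return tune_sketch->overlap_op_by_pieces;
}

The proposed -mtune=ultra-size would then just be one more tuning entry
with overlap_op_by_pieces set, added so the testsuite can exercise the
knob without changing the behavior of any existing tuning.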

On Sat, Aug 14, 2021 at 1:54 AM Christoph Müllner via Gcc-patches wrote:
>
> Ping.
>
> On Thu, Aug 5, 2021 at 11:11 AM Christoph Müllner wrote:
> >
> > Ping.
> >
> > On Thu, Jul 29, 2021 at 9:36 PM Christoph Müllner wrote:
> > >
> > > On Thu, Jul 29, 2021 at 8:54 PM Palmer Dabbelt wrote:
> > > >
> > > > On Tue, 27 Jul 2021 02:32:12 PDT (-0700), cmuellner@gcc.gnu.org wrote:
> > > > > Ok, so if I understand correctly, Palmer and Andrew prefer
> > > > > overlap_op_by_pieces to be controlled by its own field in the
> > > > > riscv_tune_param struct and not by the field slow_unaligned_access
> > > > > in that struct (i.e. slow_unaligned_access==false is not enough to
> > > > > imply overlap_op_by_pieces==true).
> > > >
> > > > I guess, but I'm not really worried about this at that level of detail
> > > > right now. It's not like the tune structures form any sort of external
> > > > interface we have to keep stable; we can do whatever we want with those
> > > > fields, so I'd just aim for encoding the desired behavior as simply as
> > > > possible rather than trying to build something extensible.
> > > >
> > > > There are really two questions we need to answer: is this code actually
> > > > faster for the C906, and is this what the average user wants under -Os.
> > >
> > > I never mentioned -Os.
> > > My main goal is code compiled with -O2, -O3 or even -Ofast.
> > > And I want to execute code as fast as possible.
> > >
> > > Loading hot data from the cache is faster with a single load-word
> > > instruction than with four load-byte instructions.
> > > Fewer instructions mean less pressure on the instruction cache.
> > > Fewer instructions mean less work for the CPU pipeline.
> > > Architectures that don't have a penalty for unaligned accesses
> > > therefore observe a performance benefit.
> > >
> > > What I understand from Andrew's email is that it is not that simple:
> > > an implementation might have a penalty for overlapping accesses that
> > > is high enough to make avoiding them worthwhile. I don't have the
> > > details for the C906, so I can't say whether that's the case.
> > >
> > > > That first one is pretty easy: just running those simple code sequences
> > > > under a sweep of page offsets should be sufficient to determine if this
> > > > is always faster (in which case it's an easy yes), if it's always slower
> > > > (an easy no), or if there are some slow cases like page/cache-line
> > > > crossing (in which case we'd need to think a bit).
> > > >
> > > > The second one is a bit trickier. In the past we'd said this sort of
> > > > "actively misalign accesses to generate smaller code" thing isn't
> > > > suitable for -Os (as most machines still have very slow unaligned
> > > > accesses) but is suitable for -Oz (I don't remember if that ever ended
> > > > up in GCC, though). That still seems like a reasonable decision, but if
> > > > it turns out that implementations with fast unaligned accesses become
> > > > the norm then it'd probably be worth revisiting. Not sure exactly how
> > > > to determine that tipping point, but I think we're a long way away from
> > > > it right now.
> > > >
> > > > IMO it's really just premature to try and design an encoding of the
> > > > tuning parameters until we have an idea of what they are, as we'll just
> > > > end up devolving down the path of trying to encode all possible hardware
> > > > and that's generally a huge waste of time. Since there's no ABI here we
> > > > can refactor this however we want as new tunings show up.
> > >
> > > I guess you mean that there needs to be a clear benefit for a supported
> > > machine in GCC: either obviously (see below), by measurement results, or
> > > by a decision of the machine's maintainer (especially if the decision is
> > > a trade-off).
> > >
> > > > >
> > > > > I don't have access to pipeline details that give proof that there
> > > > > are cases where this patch causes a performance penalty.
> > > > >
> > > > > So, I leave this here as a summary for someone who has enough
> > > > > information and interest to move this forward:
> > > > > * the original patch should be sufficient, but does not have tests:
> > > > >   https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575791.html
> > > > > * the tests can be taken from this patch:
> > > > >   https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575864.html
> > > > >   Note that there is a duplicated "sw" in builtins-overlap-6.c, which
> > > > >   should be an "sd".
> > > > >
> > > > > Thanks for the feedback!
> > > >
> > > > Cool. Looks like the C906 is starting to show up in the real world, so
> > > > we should be able to find someone who has access to one and cares enough
> > > > to at least run some simple benchmarks of these code sequences. IMO
> > > > that's a pretty low interest bar, so I don't see any harm in waiting --
> > > > when the hardware is common then I'm sure someone will care enough to
> > > > give this a shot, and until then it's not really impacting anyone either
> > > > way.
> > > >
> > > > The -Os thing is a bigger discussion, and while I'm happy to have it I
> > > > don't really think we're even close to these being common enough yet. I
> > > > saw your memmove patch and think the same rationale might apply there,
> > > > but I haven't looked closely and won't have time to for a bit as I've
> > > > got to get around to the other projects.
> > >
> > > The cpymemsi patch is also targeting -O2 or higher for fast code
> > > execution. And it is one of the cases where there is an obvious
> > > performance benefit for all machines that have
> > > slow_unaligned_access==false.
> > >
> > > At the moment the cpymemsi expansion for RISC-V is implemented as if
> > > there were no machine with slow_unaligned_access==false. And in fact
> > > there is already a machine in GCC mainline with this property: the C906.
> > >
> > > Machines that can do fast unaligned accesses should not be wasting their
> > > cycles on loads and stores of bytes if they can do loads and stores of
> > > words.
> > >
> > > >
> > > >
> > > > > On Tue, Jul 27, 2021 at 3:48 AM Palmer Dabbelt wrote:
> > > > >>
> > > > >> On Mon, 26 Jul 2021 03:05:21 PDT (-0700), Andrew Waterman wrote:
> > > > >> > On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt wrote:
> > > > >> >>
> > > > >> >> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
> > > > >> >> > Could you add a testcase? Otherwise LGTM.
> > > > >> >> >
> > > > >> >> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> > > > >> >> > void foo(char *dst){
> > > > >> >> >    __builtin_memset(dst, 0, 15);
> > > > >> >> > }
> > > > >> >>
> > > > >> >> I'd like to see:
> > > > >> >>
> > > > >> >> * Test results. This is only on for one target right now, so relying on
> > > > >> >>   it to just work on others isn't a good idea.
> > > > >> >> * Something to demonstrate this doesn't break -mstrict-align.
> > > > >> >> * Some sort of performance analysis. Most machines that support
> > > > >> >>   unaligned access do so with some performance degradation,
> > > > >> >
> > > > >> > Also, some machines that gracefully support misaligned accesses under
> > > > >> > most circumstances nevertheless experience a perf degradation when the
> > > > >> > load depends on two stores that overlap partially but not fully. This
> > > > >> > transformation will obviously trigger such behavior from time to time.
> > > > >>
> > > > >> Ya, I thought I wrote a response to this but I guess it's just in a
> > > > >> buffer somewhere. The code sequences this is generating are really the
> > > > >> worst case for unaligned stores: one of them is always guaranteed to be
> > > > >> misaligned, and it partially overlaps with a store one cycle away.
> > > > >>
> > > > >> We're really only saving a handful of instructions at best here, so
> > > > >> there's not much room for error when it comes to these sorts of things.
> > > > >> Even if this difficult case is handled fast, we're only talking about
> > > > >> saving 2 cycles, which is pretty borderline as the single-issue in-order
> > > > >> machines I've worked with that do support misaligned accesses tend to
> > > > >> take at least a few cycles of performance hit on misaligned accesses.
> > > > >> Even if misaligned accesses are single cycle, some back-of-the-envelope
> > > > >> calculations say that a pipeline flush when crossing a cache line would
> > > > >> definitely push this negative, and one per page crossing would be in the
> > > > >> margins (depending on how we assume the original accesses are aligned).
> > > > >>
> > > > >> This is way too subtle a thing to analyze without at least some
> > > > >> knowledge of how this pipeline works, whether that's from a benchmark or
> > > > >> just a pipeline description.
> > > > >>
> > > > >> > Note, I'm not objecting to this patch; I'm only responding to Palmer's
> > > > >> > comment. It makes sense to enable this kind of optimization for
> > > > >> > -mtune=native etc., just not for standard software distributions.
> > > > >>
> > > > >> IMO there are really two cases here: -mtune=c906 and -Os (-mtune=native
> > > > >> is sort of a red herring, it's just syntax). For -mtune=c906 I'm happy
> > > > >> enabling this as long as it's actually correct and improves performance,
> > > > >> but that'll need at least some hardware-oriented analysis -- whether
> > > > >> it's a benchmark or just a confirmation that these sequences are
> > > > >> actually expected to run fast.
> > > > >>
> > > > >> -Os is a different case, though. Last time this came up we decided that
> > > > >> -Os shouldn't purposefully misalign accesses, even when that saves code
> > > > >> size, as that's too likely to hit pathological performance cases. I
> > > > >> don't know if the weights have changed here: the C906 is currently by
> > > > >> far the cheapest chip (which likely means it's going to be the most
> > > > >> popular), but it's unclear whether it's even RISC-V compliant and I
> > > > >> don't know of many people who've actually gotten one. IMO it's too
> > > > >> early to flip -Os over to this behavior; we at least need to know that
> > > > >> this chip is going to be sufficiently RISC-V-ey that standard software
> > > > >> will even run on it, much less be popular.
> > > > >>
> > > > >> >
> > > > >> >
> > > > >> >> it's unclear
> > > > >> >> that this conversion will actually manifest a performance improvement.
> > > > >> >> I don't have a C906 and don't know what workloads people care about
> > > > >> >> running on one, but I'm certainly not convinced this is a win --
> > > > >> >> what's listed here looks to be the best case, and that's only saving
> > > > >> >> two instructions to generate a pretty pathological sequence
> > > > >> >> (misaligned access that conflicts with a prior store).
> > > > >>
> > > > >> Ah, I guess there it was ;)
> > > > >>
> > > > >> >>
> > > > >> >> Jojo: do you have any description of the C906 pipeline? Specifically in
> > > > >> >> this case it'd be good to know how it handles unaligned accesses.
> > > > >> >>
> > > > >> >> >
> > > > >> >> > On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
> > > > >> >> > wrote:
> > > > >> >> >>
> > > > >> >> >> This patch enables the overlap-by-pieces feature of the by-pieces
> > > > >> >> >> infrastructure for inlining builtins in case the target has set
> > > > >> >> >> riscv_slow_unaligned_access_p to false.
> > > > >> >> >>
> > > > >> >> >> To demonstrate the effect for targets with fast unaligned access,
> > > > >> >> >> the following code sequences are generated for a 15-byte memset-zero.
> > > > >> >> >>
> > > > >> >> >> Without overlap_op_by_pieces we get:
> > > > >> >> >>   8e:   00053023    sd    zero,0(a0)
> > > > >> >> >>   92:   00052423    sw    zero,8(a0)
> > > > >> >> >>   96:   00051623    sh    zero,12(a0)
> > > > >> >> >>   9a:   00050723    sb    zero,14(a0)
> > > > >> >> >>
> > > > >> >> >> With overlap_op_by_pieces we get:
> > > > >> >> >>   7e:   00053023    sd    zero,0(a0)
> > > > >> >> >>   82:   000533a3    sd    zero,7(a0)
> > > > >> >> >>
> > > > >> >> >> gcc/ChangeLog:
> > > > >> >> >>
> > > > >> >> >>         * config/riscv/riscv.c (riscv_overlap_op_by_pieces): New function.
> > > > >> >> >>         (TARGET_OVERLAP_OP_BY_PIECES_P): Connect to
> > > > >> >> >>         riscv_overlap_op_by_pieces.
> > > > >> >> >>
> > > > >> >> >> Signed-off-by: Christoph Muellner
> > > > >> >> >> ---
> > > > >> >> >>  gcc/config/riscv/riscv.c | 11 +++++++++++
> > > > >> >> >>  1 file changed, 11 insertions(+)
> > > > >> >> >>
> > > > >> >> >> diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
> > > > >> >> >> index 576960bb37c..98c76ba657a 100644
> > > > >> >> >> --- a/gcc/config/riscv/riscv.c
> > > > >> >> >> +++ b/gcc/config/riscv/riscv.c
> > > > >> >> >> @@ -5201,6 +5201,14 @@ riscv_slow_unaligned_access (machine_mode, unsigned int)
> > > > >> >> >>    return riscv_slow_unaligned_access_p;
> > > > >> >> >>  }
> > > > >> >> >>
> > > > >> >> >> +/* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */
> > > > >> >> >> +
> > > > >> >> >> +static bool
> > > > >> >> >> +riscv_overlap_op_by_pieces (void)
> > > > >> >> >> +{
> > > > >> >> >> +  return !riscv_slow_unaligned_access_p;
> > > > >> >> >> +}
> > > > >> >> >> +
> > > > >> >> >>  /* Implement TARGET_CAN_CHANGE_MODE_CLASS.  */
> > > > >> >> >>
> > > > >> >> >>  static bool
> > > > >> >> >> @@ -5525,6 +5533,9 @@ riscv_asan_shadow_offset (void)
> > > > >> >> >>  #undef TARGET_SLOW_UNALIGNED_ACCESS
> > > > >> >> >>  #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
> > > > >> >> >>
> > > > >> >> >> +#undef TARGET_OVERLAP_OP_BY_PIECES_P
> > > > >> >> >> +#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
> > > > >> >> >> +
> > > > >> >> >>  #undef TARGET_SECONDARY_MEMORY_NEEDED
> > > > >> >> >>  #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed
> > > > >> >> >>
> > > > >> >> >> --
> > > > >> >> >> 2.31.1
> > > > >> >> >>
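
For completeness, here is a minimal sketch of the kind of testcase discussed
above. The actual builtins-overlap-*.c files live in the v2 patch linked
earlier; the scan patterns and counts below are illustrative assumptions,
not copies of those tests.

/* Illustrative sketch only -- not one of the builtins-overlap-*.c tests.  */
/* { dg-do compile } */
/* { dg-options "-O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64" } */

void foo (char *dst)
{
  __builtin_memset (dst, 0, 15);
}

/* With overlap_op_by_pieces enabled, the 15-byte clear should become two
   (overlapping) doubleword stores and no byte or halfword stores.  */
/* { dg-final { scan-assembler-times "sd\tzero" 2 } } */
/* { dg-final { scan-assembler-not "sb\tzero" } } */
/* { dg-final { scan-assembler-not "sh\tzero" } } */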