Date: Wed, 8 Feb 2023 16:48:56 -0300
Subject: Re: [RFC PATCH 18/19] riscv: Add an optimized strncmp routine
From: Adhemerval Zanella Netto
Organization: Linaro
To: Palmer Dabbelt, philipp.tomsich@vrull.eu
Cc: goldstein.w.n@gmail.com, christoph.muellner@vrull.eu, libc-alpha@sourceware.org, Darius Rad, Andrew Waterman, DJ Delorie, Vineet Gupta, kito.cheng@sifive.com, jeffreyalaw@gmail.com, heiko.stuebner@vrull.eu

On 08/02/23 14:55, Palmer Dabbelt wrote:
> On Wed, 08 Feb 2023 07:13:44 PST (-0800), philipp.tomsich@vrull.eu wrote:
>> On Tue, 7 Feb 2023 at 02:20, Noah Goldstein wrote:
>>>
>>> On Mon, Feb 6, 2023 at 6:23 PM Christoph Muellner wrote:
>>> >
>>> > From: Christoph Müllner
>>> >
>>> > The
>>> > implementation of strncmp() can be accelerated using Zbb's orc.b
>>> > instruction.  Let's add an optimized implementation that makes use
>>> > of this instruction.
>>> >
>>> > Signed-off-by: Christoph Müllner
>>>
>>> Not necessary, but imo performance patches should have at least some
>>> reference to the expected speedup versus the existing alternatives.
>>
>> Given that this is effectively a SWAR-like optimization (orc.b allows
>> us to test 8 bytes in parallel for a NUL byte), we should be able to
>> show the benefit through a reduction in dynamic instructions.  Would
>> this be considered reasonable reference data?
>
> Generally for performance improvements the only metrics that count come
> from real hardware.  Processor implementation is complex and it's not
> generally true that reducing dynamic instructions results in better
> performance (particularly when more complex flavors of instructions
> replace simpler ones).

I agree with Noah here that we need to have some baseline performance
numbers, even though we are comparing against naive implementations
(what glibc used to have as its generic implementations).

> We've not been so good about this on the RISC-V side of things, though.
> I think that's largely because we didn't have all that much complexity
> around this, but there's a ton of stuff showing up right now.  The
> general theory has been that Zbb instructions will execute faster than
> their corresponding I sequences, but nobody has proved that.  I believe
> the new JH7110 has Zba and Zbb, so maybe the right answer there is to
> just benchmark things before merging them?  That way we can get back to
> doing things sanely before we go too far down the premature
> optimization rabbit hole.
>
> FWIW: we had a pretty similar discussion in Linux land around these and
> nobody could get the JH7110 to boot, but given that we have ~6 months
> until glibc releases again hopefully that will be sorted out.
> There's a bunch of ongoing work looking at the more core issues like
> probing, so maybe it's best to focus on getting that all sorted out
> first?  It's kind of awkward to have a bunch of routines posted in a
> whole new framework that's not sorting out all the probing
> dependencies.

Just a heads up that with the latest generic string routine
optimizations, all str* routines should now use the new Zbb extensions
(if the compiler is instructed to do so).  I think you might squeeze
some cycles out of a hand-crafted assembly routine, but I would rather
focus on trying to optimize code generation instead.

The generic routines still assume that the hardware cannot issue
unaligned memory accesses, or that doing so is prohibitively expensive.
However, I think we should move in the direction of adding unaligned
variants where it makes sense.  Another usual tuning is loop unrolling,
which depends on the underlying hardware.  Unfortunately we need to
explicitly force gcc to unroll some loop constructs (for instance,
check sysdeps/powerpc/powerpc64/power4/Makefile), so this might be
another approach you can use to tune the RISC-V routines.

The memcpy, memmove, memset, and memcmp routines are a slightly
different subject.  Although the current generic mem routines do use
some explicit unrolling, they do not take into consideration unaligned
access, vector instructions, or special instructions (such as a
cache-line clear one).  And these usually make a lot of difference.
What I would expect is that maybe we can use a strategy similar to what
Google is doing with LLVM libc, which bases its work on the automemcpy
paper [1].  It means that for the unaligned cases, each architecture
reimplements the memory routine building blocks.  Although that project
focuses on static compilation, I think using such hooks instead of
assembly routines might be a better approach (you can reuse code blocks
or try different strategies more easily).

[1] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/4f7c3da72d557ed418828823a8e59942859d677f.pdf