From: Evan Green
To: libc-alpha@sourceware.org
Cc: palmer@rivosinc.com, slewis@rivosinc.com, vineetg@rivosinc.com, Florian Weimer, Evan Green
Subject: [PATCH v4 0/3] RISC-V: ifunced memcpy using new kernel hwprobe interface
Date: Thu, 6 Jul 2023 12:29:43 -0700
Message-Id: <20230706192947.1566767-1-evan@rivosinc.com>

This series illustrates the use of a recently accepted Linux syscall that
enumerates architectural information about the RISC-V cores the system
is running on. In this series we expose a small wrapper function around
the syscall. An ifunc selector for memcpy queries it to see if unaligned
access is "fast" on this hardware. If it is, it selects a newly provided
implementation of memcpy that doesn't work hard at aligning the src and
destination buffers.

I opted to spin the whole series, though it's perfectly safe to take
just the first two patches for the hwprobe interface and abandon the
third patch as a separate issue.

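To make the selection step concrete, here is a minimal, portable sketch of the
decision the ifunc selector makes. It is not the code from this series: the
names SKETCH_MISALIGNED_MASK, SKETCH_MISALIGNED_FAST, copy_bytewise,
copy_words_unaligned, select_copy and copy_with_selected are hypothetical, the
constant values are assumed to mirror the kernel's misaligned-access field,
and the probed value is passed in as a plain integer rather than obtained from
the hwprobe call, so the sketch builds on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local names.  The values are assumed to mirror the kernel's
   misaligned-access performance field; check the installed headers before
   relying on them.  */
#define SKETCH_MISALIGNED_MASK 7u
#define SKETCH_MISALIGNED_FAST 3u

typedef void *(*copy_fn) (void *, const void *, size_t);

/* Stand-in for the generic routine: a plain byte-wise copy.  */
static void *
copy_bytewise (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n-- > 0)
    *d++ = *s++;
  return dst;
}

/* Stand-in for the alignment-ignorant routine: moves 8-byte words without
   aligning either pointer first, then finishes the tail byte-wise.  The
   fixed-size memcpy calls express unaligned word loads and stores in
   portable C.  */
static void *
copy_words_unaligned (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  for (; n >= sizeof (uint64_t); n -= sizeof (uint64_t))
    {
      uint64_t w;
      memcpy (&w, s, sizeof w);
      memcpy (d, &w, sizeof w);
      s += sizeof w;
      d += sizeof w;
    }
  while (n-- > 0)
    *d++ = *s++;
  return dst;
}

/* Compare the masked field, not the raw value, so unrelated bits in the
   probed value cannot change the outcome.  */
static copy_fn
select_copy (uint64_t perf_value)
{
  if ((perf_value & SKETCH_MISALIGNED_MASK) == SKETCH_MISALIGNED_FAST)
    return copy_words_unaligned;
  return copy_bytewise;
}

/* Convenience wrapper: pick a routine for PERF_VALUE and run it.  */
static void *
copy_with_selected (void *dst, const void *src, size_t n, uint64_t perf_value)
{
  copy_fn fn = select_copy (perf_value);
  assert (fn != NULL);
  return fn (dst, src, n);
}
```

In a real ifunc the selection runs once at relocation time and the chosen
pointer is bound to the memcpy symbol; the wrapper above re-selects on every
call only to keep the sketch short.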
Performance numbers were compared using a small test program [1], run on
a D1 Nezha board, which supports fast unaligned access. "Fast" here
means copying unaligned words is faster than copying byte-wise, but
still slower than copying aligned words. Sizes and offsets below are in
hexadecimal. Here's the speed of various memcpy()s with the generic
implementation:

memcpy size 1 count 1000000 offset 0 took 109564 us
memcpy size 3 count 1000000 offset 0 took 138425 us
memcpy size 4 count 1000000 offset 0 took 148374 us
memcpy size 7 count 1000000 offset 0 took 178433 us
memcpy size 8 count 1000000 offset 0 took 188430 us
memcpy size f count 1000000 offset 0 took 266118 us
memcpy size f count 1000000 offset 1 took 265940 us
memcpy size f count 1000000 offset 3 took 265934 us
memcpy size f count 1000000 offset 7 took 266215 us
memcpy size f count 1000000 offset 8 took 265954 us
memcpy size f count 1000000 offset 9 took 265886 us
memcpy size 10 count 1000000 offset 0 took 195308 us
memcpy size 11 count 1000000 offset 0 took 205161 us
memcpy size 17 count 1000000 offset 0 took 274376 us
memcpy size 18 count 1000000 offset 0 took 199188 us
memcpy size 19 count 1000000 offset 0 took 209258 us
memcpy size 1f count 1000000 offset 0 took 278263 us
memcpy size 20 count 1000000 offset 0 took 207364 us
memcpy size 21 count 1000000 offset 0 took 217143 us
memcpy size 3f count 1000000 offset 0 took 300023 us
memcpy size 40 count 1000000 offset 0 took 231063 us
memcpy size 41 count 1000000 offset 0 took 241259 us
memcpy size 7c count 100000 offset 0 took 32807 us
memcpy size 7f count 100000 offset 0 took 36274 us
memcpy size ff count 100000 offset 0 took 47818 us
memcpy size ff count 100000 offset 0 took 47932 us
memcpy size 100 count 100000 offset 0 took 40468 us
memcpy size 200 count 100000 offset 0 took 64245 us
memcpy size 27f count 100000 offset 0 took 82549 us
memcpy size 400 count 100000 offset 0 took 111254 us
memcpy size 407 count 100000 offset 0 took 119364 us
memcpy size 800 count 100000 offset 0 took 203899 us
memcpy size 87f count 100000 offset 0 took 222465 us
memcpy size 87f count 100000 offset 3 took 222289 us
memcpy size 1000 count 100000 offset 0 took 388846 us
memcpy size 1000 count 100000 offset 1 took 468827 us
memcpy size 1000 count 100000 offset 3 took 397098 us
memcpy size 1000 count 100000 offset 4 took 397379 us
memcpy size 1000 count 100000 offset 5 took 397368 us
memcpy size 1000 count 100000 offset 7 took 396867 us
memcpy size 1000 count 100000 offset 8 took 389227 us
memcpy size 1000 count 100000 offset 9 took 395949 us
memcpy size 3000 count 50000 offset 0 took 674837 us
memcpy size 3000 count 50000 offset 1 took 676944 us
memcpy size 3000 count 50000 offset 3 took 679709 us
memcpy size 3000 count 50000 offset 4 took 680829 us
memcpy size 3000 count 50000 offset 5 took 678024 us
memcpy size 3000 count 50000 offset 7 took 681097 us
memcpy size 3000 count 50000 offset 8 took 670004 us
memcpy size 3000 count 50000 offset 9 took 674553 us

Here is that same test run with the assembly memcpy() in this series:

memcpy size 1 count 1000000 offset 0 took 92703 us
memcpy size 3 count 1000000 offset 0 took 112527 us
memcpy size 4 count 1000000 offset 0 took 120481 us
memcpy size 7 count 1000000 offset 0 took 149558 us
memcpy size 8 count 1000000 offset 0 took 90617 us
memcpy size f count 1000000 offset 0 took 174373 us
memcpy size f count 1000000 offset 1 took 178615 us
memcpy size f count 1000000 offset 3 took 178845 us
memcpy size f count 1000000 offset 7 took 178636 us
memcpy size f count 1000000 offset 8 took 174442 us
memcpy size f count 1000000 offset 9 took 178660 us
memcpy size 10 count 1000000 offset 0 took 99845 us
memcpy size 11 count 1000000 offset 0 took 112522 us
memcpy size 17 count 1000000 offset 0 took 179735 us
memcpy size 18 count 1000000 offset 0 took 110870 us
memcpy size 19 count 1000000 offset 0 took 121472 us
memcpy size 1f count 1000000 offset 0 took 188231 us
memcpy size 20 count 1000000 offset 0 took 119571 us
memcpy size 21 count 1000000 offset 0 took 132429 us
memcpy size 3f count 1000000 offset 0 took 227021 us
memcpy size 40 count 1000000 offset 0 took 166416 us
memcpy size 41 count 1000000 offset 0 took 180206 us
memcpy size 7c count 100000 offset 0 took 28602 us
memcpy size 7f count 100000 offset 0 took 31676 us
memcpy size ff count 100000 offset 0 took 39257 us
memcpy size ff count 100000 offset 0 took 39176 us
memcpy size 100 count 100000 offset 0 took 21928 us
memcpy size 200 count 100000 offset 0 took 35814 us
memcpy size 27f count 100000 offset 0 took 60315 us
memcpy size 400 count 100000 offset 0 took 63652 us
memcpy size 407 count 100000 offset 0 took 73160 us
memcpy size 800 count 100000 offset 0 took 121532 us
memcpy size 87f count 100000 offset 0 took 147269 us
memcpy size 87f count 100000 offset 3 took 144744 us
memcpy size 1000 count 100000 offset 0 took 232057 us
memcpy size 1000 count 100000 offset 1 took 254319 us
memcpy size 1000 count 100000 offset 3 took 256973 us
memcpy size 1000 count 100000 offset 4 took 257655 us
memcpy size 1000 count 100000 offset 5 took 259456 us
memcpy size 1000 count 100000 offset 7 took 260849 us
memcpy size 1000 count 100000 offset 8 took 232347 us
memcpy size 1000 count 100000 offset 9 took 254330 us
memcpy size 3000 count 50000 offset 0 took 382376 us
memcpy size 3000 count 50000 offset 1 took 389872 us
memcpy size 3000 count 50000 offset 3 took 385310 us
memcpy size 3000 count 50000 offset 4 took 389748 us
memcpy size 3000 count 50000 offset 5 took 391707 us
memcpy size 3000 count 50000 offset 7 took 386778 us
memcpy size 3000 count 50000 offset 8 took 385691 us
memcpy size 3000 count 50000 offset 9 took 392030 us

The assembly routine is measurably better. For example, the aligned
0x1000-byte copy drops from 388846 us to 232057 us, and the aligned
0x3000-byte copy from 674837 us to 382376 us, roughly 40% less time in
both cases.

[1] https://pastebin.com/DRyECNQW

Changes in v4:
 - Remove __USE_GNU (Florian)
 - __nonnull, __wur, __THROW, and __fortified_attr_access decorations
   (Florian)
 - change long to long int (Florian)
 - Fix comment formatting (Florian)
 - Update backup kernel header content copy.
 - Fix function declaration formatting (Florian)
 - Changed export versions to 2.38
 - Fixed comment style (Florian)

Changes in v3:
 - Update argument types to match v4 kernel interface
 - Add the "return" to the vsyscall
 - Fix up vdso arg types to match kernel v4 version
 - Remove ifdef around INLINE_VSYSCALL (Adhemerval)
 - Word align dest for large memcpy()s.
 - Add tags
 - Remove spurious blank line from sysdeps/riscv/memcpy.c

Changes in v2:
 - hwprobe.h: Use __has_include and duplicate Linux content to make
   compilation work when Linux headers are absent (Adhemerval)
 - hwprobe.h: Put declaration under __USE_GNU (Adhemerval)
 - Use INLINE_SYSCALL_CALL (Adhemerval)
 - Update versions
 - Update UNALIGNED_MASK to match kernel v3 series.
 - Add vDSO interface
 - Used _MASK instead of _FAST value itself.

Evan Green (3):
  riscv: Add Linux hwprobe syscall support
  riscv: Add hwprobe vdso call support
  riscv: Add and use alignment-ignorant memcpy

 sysdeps/riscv/memcopy.h                       |  26 ++++
 sysdeps/riscv/memcpy.c                        |  64 +++++++++
 sysdeps/riscv/memcpy_noalignment.S            | 121 ++++++++++++++++++
 sysdeps/unix/sysv/linux/dl-vdso-setup.c       |  10 ++
 sysdeps/unix/sysv/linux/dl-vdso-setup.h       |   3 +
 sysdeps/unix/sysv/linux/riscv/Makefile        |   8 +-
 sysdeps/unix/sysv/linux/riscv/Versions        |   3 +
 sysdeps/unix/sysv/linux/riscv/hwprobe.c       |  31 +++++
 .../unix/sysv/linux/riscv/memcpy-generic.c    |  24 ++++
 .../unix/sysv/linux/riscv/rv32/libc.abilist   |   1 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist   |   1 +
 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h   |  72 +++++++++++
 sysdeps/unix/sysv/linux/riscv/sysdep.h        |   1 +
 13 files changed, 363 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/riscv/memcopy.h
 create mode 100644 sysdeps/riscv/memcpy.c
 create mode 100644 sysdeps/riscv/memcpy_noalignment.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hwprobe.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/memcpy-generic.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h

-- 
2.34.1