From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf1-x131.google.com (mail-lf1-x131.google.com [IPv6:2a00:1450:4864:20::131]) by sourceware.org (Postfix) with ESMTPS id A6B283858D35 for ; Thu, 17 Aug 2023 16:27:43 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org A6B283858D35 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=rivosinc.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=rivosinc.com Received: by mail-lf1-x131.google.com with SMTP id 2adb3069b0e04-4fe457ec6e7so12780083e87.3 for ; Thu, 17 Aug 2023 09:27:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=rivosinc-com.20221208.gappssmtp.com; s=20221208; t=1692289662; x=1692894462; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=O90dGCwcfsc4RS4ReJTeD5XjVs8bDM45pQOToP+jmZo=; b=YhTKvaX0lIUldiNzB3/FdKGdPBqr2LMYX5UKiOCCW+447l6vBCSePGWCAnmAmvbEws K6OhHUQCNEzBehnLiA4DVXwYgCGU0kOSn7doCk3FEphtpZJdMOL8Hn9oLB/AhckRGJk+ ZEcOYBi3S6eJ1Em6IKaY19/zRN0u/7OTTrH1a9F/gvf/CAm+DQU7tqD97utiTaax4Mpl RbzhYGowGRHxyT+o0t3qYE7bExRCyIYz1c6qOyMNhFfwY/YNseP2bo6k3DtXl2wQZ1Mh Xi6FEpvnSwbU6SbcJSzDi2E4HbAZOxKmmnfVdBItLsiuwZYKovH9aIonQLwnPIjGZPfH T+Ug== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1692289662; x=1692894462; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=O90dGCwcfsc4RS4ReJTeD5XjVs8bDM45pQOToP+jmZo=; b=CxyaG3vgVe0TPb8IvvO6fgZm1RcLt3qjxgP4rWlqBtkdO4srkF8Hpxy5IrYlRObONR F0QJQ+EjvIs9NJN5O8/WoCu0m1P6qjBq95iqpzJ56UQMwOElOG0zBuf8d3yolXPx+bVd Ts0vqR1YwVGDf0b98UqR437k68vFRlNeodllvuEAQEbkcKK4OOOOKbXM2g+1bIgnTmwi AyCypchxdssDuK8OyfweEV5EW1Y5LRhs+SCazA/0Q7Sdu6vxrRUQifUo2FUyJRG/L8jf HAyZ2q4lPb3pMfVHFFE2O3hgqWEGB1j/LCwyJfBfl83FEic9PwxhoRZ8/yb01l6lMljZ jxCQ== X-Gm-Message-State: AOJu0Yxu6VCU/7laH1Psdu1PuuYjd3TsZ2qQtxCZkac4cC3oPZuVcSi8 dBsUJOwSnJkjiWsWVWVLkCS73leAVTJBrfK8YTMchw== X-Google-Smtp-Source: AGHT+IFSTYaFBNyu+JGEKlAsi3OewMqC7hCXqYFmLUW0snQAS55qODth/37cPzydkhH7ulRBv/rWkzyzxpn28GZsowI= X-Received: by 2002:a05:6512:3f9:b0:4ff:80d4:e132 with SMTP id n25-20020a05651203f900b004ff80d4e132mr3719118lfq.29.1692289661925; Thu, 17 Aug 2023 09:27:41 -0700 (PDT) MIME-Version: 1.0 References: <20230802155903.2552780-1-evan@rivosinc.com> <20230802155903.2552780-6-evan@rivosinc.com> <87il9w37vi.fsf@oldenburg.str.redhat.com> <6f0911c6-b24b-444c-4b4b-a62e49a51734@linaro.org> <548fc7d5-6225-69e7-f4a7-47669d2fdbd5@linaro.org> In-Reply-To: From: Evan Green Date: Thu, 17 Aug 2023 09:27:05 -0700 Message-ID: Subject: Re: [PATCH v6 5/5] riscv: Add and use alignment-ignorant memcpy To: enh Cc: Richard Henderson , Florian Weimer , libc-alpha@sourceware.org, slewis@rivosinc.com, palmer@rivosinc.com, vineetg@rivosinc.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-2.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Wed, Aug 16, 2023 at 4:18=E2=80=AFPM enh wrote: > > On Tue, Aug 15, 2023 at 4:02=E2=80=AFPM Evan Green wr= ote: > > > > On Tue, Aug 15, 2023 at 2:54=E2=80=AFPM enh wrote: > > > > > > On Tue, Aug 15, 2023 at 9:41=E2=80=AFAM Evan Green wrote: > > > > > > > > On Fri, Aug 11, 2023 at 5:01=E2=80=AFPM enh wrote: > > > > > > > > > > On Mon, Aug 7, 2023 at 5:01=E2=80=AFPM Evan Green wrote: > > > > > > > > > > > > On Mon, Aug 7, 2023 at 3:48=E2=80=AFPM enh wro= te: > > > > > > > > > > > > > > On Mon, Aug 7, 2023 at 3:11=E2=80=AFPM Evan Green wrote: > > > > > > > > > > > > > > > > On Thu, Aug 3, 2023 at 3:30=E2=80=AFPM Richard Henderson > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On 8/3/23 11:42, Evan Green wrote: > > > > > > > > > > On Thu, Aug 3, 2023 at 10:50=E2=80=AFAM Richard Henders= on > > > > > > > > > > wrote: > > > > > > > > > >> Outside libc something is required. > > > > > > > > > >> > > > > > > > > > >> An extra parameter to ifunc is surprising though, and = clearly not ideal per the extra > > > > > > > > > >> hoops above. I would hope for something with hidden v= isibility in libc_nonshared.a that > > > > > > > > > >> could always be called directly. > > > > > > > > > > > > > > > > > > > > My previous spin took that approach, defining a > > > > > > > > > > __riscv_hwprobe_early() in libc_nonshared that could ro= ute to the real > > > > > > > > > > function if available, or make the syscall directly if = not. But that > > > > > > > > > > approach had the drawback that ifunc users couldn't tak= e advantage of > > > > > > > > > > the vDSO, and then all users had to comprehend the diff= erence between > > > > > > > > > > __riscv_hwprobe() and __riscv_hwprobe_early(). > > > > > > > > > > > > > > > > > > I would define __riscv_hwprobe such that it could take ad= vantage of the vDSO once > > > > > > > > > initialization reaches a certain point, but cope with bei= ng run earlier than that point by > > > > > > > > > falling back to the syscall. > > > > > > > > > > > > > > > > > > That constrains the implementation, I guess, in that it c= an't set errno, but just > > > > > > > > > returning the negative errno from the syscall seems fine. > > > > > > > > > > > > > > > > > > It might be tricky to get a reference to GLRO(dl_vdso_ris= cv_hwprobe) very early, but I > > > > > > > > > would hope that some application of __attribute__((weak))= might correctly get you a NULL > > > > > > > > > prior to full relocations being complete. > > > > > > > > > > > > > > > > Right, this is what we had in the previous iteration of thi= s series, > > > > > > > > and it did work ok. But it wasn't as good since it meant if= unc > > > > > > > > selectors always got stuck in the null/fallback case and we= re forced > > > > > > > > to make the syscall. With this mechanism they get to take a= dvantage of > > > > > > > > the vDSO. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In contrast, IMO this approach is much nicer. Ifunc wri= ters are > > > > > > > > > > already used to getting hwcap info via a parameter. Add= ing this second > > > > > > > > > > parameter, which also provides hwcap-like things, seems= like a natural > > > > > > > > > > extension. I didn't quite follow what you meant by the = "extra hoops > > > > > > > > > > above". > > > > > > > > > > > > > > > > > > The check for null function pointer, for sure. But also = consider how __riscv_hwprobe is > > > > > > > > > going to be used. > > > > > > > > > > > > > > > > > > It might be worth defining some helper functions for prob= ing a single key or a single > > > > > > > > > field. E.g. > > > > > > > > > > > > > > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned in= t flags) > > > > > > > > > { > > > > > > > > > struct riscv_hwprobe pair =3D { .key =3D key }; > > > > > > > > > int err =3D __riscv_hwprobe(&pair, 1, 0, NULL, flags); > > > > > > > > > if (err) > > > > > > > > > return err; > > > > > > > > > if (pair.key =3D=3D -1) > > > > > > > > > return -ENOENT; > > > > > > > > > return pair.value; > > > > > > > > > } > > > > > > > > > > > > > > > > > > This implementation requires that no future hwprobe key d= efine a value which as a valid > > > > > > > > > value in the errno range (or better, bit 63 unused). Alt= ernately, or additionally: > > > > > > > > > > > > > > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask,= uint64_t val, int flags) > > > > > > > > > { > > > > > > > > > struct riscv_hwprobe pair =3D { .key =3D key }; > > > > > > > > > return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) =3D= =3D 0 > > > > > > > > > && pair.key !=3D -1 > > > > > > > > > && (pair.value & mask) =3D=3D val); > > > > > > > > > } > > > > > > > > > > > > > > > > > > These yield either > > > > > > > > > > > > > > > > > > int64_t v =3D __riscv_hwprobe_one_key(CPUPERF_0, 0); > > > > > > > > > if (v >=3D 0 && (v & MISALIGNED_MASK) =3D=3D MISALIG= NED_FAST) > > > > > > > > > return __memcpy_noalignment; > > > > > > > > > return __memcpy_generic; > > > > > > > > > > > > > > > > > > or > > > > > > > > > > > > > > > > > > if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_M= ASK, MISALIGNED_FAST, 0)) > > > > > > > > > return __memcpy_noalignment; > > > > > > > > > return __memcpy_generic; > > > > > > > > > > > > > > > > > > which to my mind looks much better for a pattern you'll b= e replicating so very many times > > > > > > > > > across all of the ifunc implementations in the system. > > > > > > > > > > > > > > > > Ah, I see. I could make a static inline function in the hea= der that > > > > > > > > looks something like this (mangled by gmail, sorry): > > > > > > > > > > > > > > > > /* Helper function usable from ifunc selectors that probes = a single key. */ > > > > > > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwp= robe_func, > > > > > > > > signed long long int key, > > > > > > > > unsigned long long int *value) > > > > > > > > { > > > > > > > > struct riscv_hwprobe pair; > > > > > > > > int rc; > > > > > > > > > > > > > > > > if (!hwprobe_func) > > > > > > > > return -ENOSYS; > > > > > > > > > > > > > > > > pair.key =3D key; > > > > > > > > rc =3D hwprobe_func(&pair, 1, 0, NULL, 0); > > > > > > > > if (rc) { > > > > > > > > return rc; > > > > > > > > } > > > > > > > > > > > > > > > > if (pair.key < 0) { > > > > > > > > return -ENOENT; > > > > > > > > } > > > > > > > > > > > > > > > > *value =3D pair.value; > > > > > > > > return 0; > > > > > > > > } > > > > > > > > > > > > > > > > The ifunc selector would then be significantly cleaned up, = looking > > > > > > > > something like: > > > > > > > > > > > > > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPU= PERF_0, &value)) > > > > > > > > return __memcpy_generic; > > > > > > > > > > > > > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) =3D=3D RISCV_HWP= ROBE_MISALIGNED_FAST) > > > > > > > > return __memcpy_noalignment; > > > > > > > > > > > > > > (Android's libc maintainer here, having joined the list just = to talk > > > > > > > about risc-v ifuncs :-) ) > > > > > > > > > > > > > > has anyone thought about calling ifunc resolvers more like th= is... > > > > > > > > > > > > > > --same part of the dynamic loader that caches the two getauxv= al()s for arm64-- > > > > > > > static struct riscv_hwprobe probes[] =3D { > > > > > > > {.value =3D RISCV_HWPROBE_KEY_MVENDORID}, > > > > > > > {.value =3D RISCV_HWPROBE_KEY_MARCHID}, > > > > > > > {.value =3D RISCV_HWPROBE_KEY_MIMPID}, > > > > > > > {.value =3D RISCV_HWPROBE_KEY_BASE_BEHAVIOR}, > > > > > > > {.value =3D RISCV_HWPROBE_KEY_IMA_EXT}, > > > > > > > {.value =3D RISCV_HWPROBE_KEY_CPUPERF_0}, > > > > > > > ... // every time a new key is added to the kernel, we add it= here > > > > > > > }; > > > > > > > __riscv_hwprobe(...); // called once > > > > > > > > > > > > > > --part of the dynamic loader that calls ifunc resolvers-- > > > > > > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes); > > > > > > > > > > > > > > this is similar to what we already have for arm64 (where ther= e's a > > > > > > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 a= nd > > > > > > > potentially others), but more uniform, and avoiding the sourc= e > > > > > > > (in)compatibility issues of adding new fields to a struct [ev= en if it > > > > > > > does have a size_t to "version" it like the arm64 ifunc struc= t]. > > > > > > > > > > > > > > yes, it means everyone pays to get all the hwprobes, but that= gets > > > > > > > amortized. and lookup in the ifunc resolver is simple and qui= ck. if we > > > > > > > know that the keys will be kept dense, we can even have code = in ifunc > > > > > > > resolvers like > > > > > > > > > > > > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWP= ROBE_IMA_V) ... > > > > > > > > > > > > > > though personally for the "big ticket items" that get a lette= r to > > > > > > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCA= P), > > > > > > > probe_count, probes_ptr)` to the resolver, but i hear that's > > > > > > > controversial :-) > > > > > > > > > > > > Hello, welcome to the fun! :) > > > > > > > > > > (sorry for the delay. i've been thinking :-) ) > > > > > > > > > > > What you're describing here is almost exactly what we did insid= e the > > > > > > vDSO function. The vDSO function acts as a front for a handful = of > > > > > > probe values that we've already completed and cached in userspa= ce. We > > > > > > opted to make it a function, rather than exposing the data itse= lf via > > > > > > vDSO, so that we had future flexibility in what elements we cac= hed in > > > > > > userspace and their storage format. We can update the kernel as= needed > > > > > > to cache the hottest things in userspace, even if that means > > > > > > rearranging the data format, passing through some extra informa= tion, > > > > > > or adding an extra snip of code. My hope is callers can directl= y > > > > > > interact with the vDSO function (though maybe as Richard sugges= ted > > > > > > maybe with the help of a tidy inline helper), rather than tryin= g to > > > > > > add a second layer of userspace caching. > > > > > > > > > > on reflection i think i might be too focused on the FMV use case,= in > > > > > part because we're looking at those compiler-generated ifuncs for > > > > > arm64 on Android atm. i think i'm imagining a world where there's= a > > > > > lot of that, and worrying about having to pay for the setup, call= , and > > > > > loop for each ifunc, and wondering why we don't just pay once ins= tead. > > > > > (as a bit of background context, Android "app" start is actually = a > > > > > dlopen() in a clone of an existing zygote process, and in general= app > > > > > launch time is one of the key metrics anyone who's serious is > > > > > optimizing for. you'd be surprised how much of my life i spend > > > > > explaining to people that if they want dlopen() to be faster, may= be > > > > > they shouldn't ask us to run thousands of ELF constructors.) > > > > > > > > > > but... the more time i spend looking at what we actually need in > > > > > third-party open source libraries right now i realize that libc a= nd > > > > > FMV (which is still a future thing for us anyway) are really the = only > > > > > _actual_ ifunc users. perhaps in part because macOS/iOS don't hav= e > > > > > ifuncs, all the libraries that are part of the OS itself, for exa= mple, > > > > > are just doing their own thing with function pointers and > > > > > pthread_once() or whatever. > > > > > > > > > > (i have yet to try to get any data on actual apps. i have no reas= on to > > > > > think they'll be very different, but that could easily be skewed = by > > > > > popular middleware or a popular game engine using ifuncs, so i do= plan > > > > > on following up on that.) > > > > > > > > > > "how do they decide what to set that function pointer to?". well,= it > > > > > looks like in most cases cpuid on x86 and calls to getauxval() > > > > > everywhere else. in some cases that's actually via some other lib= rary: > > > > > https://github.com/pytorch/cpuinfo or > > > > > https://github.com/google/cpu_features for example. so they have = a > > > > > layer of caching there, even in cases where they don't have a sin= gle > > > > > function that sets all the function pointers. > > > > > > > > Right, function multi-versioning is just the sort of spot where we'= d > > > > imagine hwprobe gets used, since it's providing similar/equivalent > > > > information to what cpuid does on x86. It may not be quite as fast = as > > > > cpuid (I don't know how fast cpuid actually is). But with the vDSO > > > > function+data in userspace it should be able to match getauxval() i= n > > > > performance, as they're both a function pointer plus a loop. We're > > > > sort of planning for a world in which RISC-V has a wider set of the= se > > > > values to fetch, such that a ifunc selector may need a more complex > > > > set of information. Hwprobe and the vDSO gives us the ability both = to > > > > answer multiple queries fast, and freely allocate more keys that ma= y > > > > represent versioned features or even compound features. > > > > > > yeah, my incorrect mental model was that -- primarily because of > > > x86-64 and cpuid -- every function would get its own ifunc resolver > > > that would have to make a query. but the [in progress] arm64 > > > implementation shows that that's not really the case anyway, and we > > > can just cache __riscv_hwprobe() in the same [one] place that > > > getauxval() is already being cached for arm64. > > > > Sounds good. > > > > > > > > > > so assuming i don't find that apps look very different from the O= S > > > > > (that is: that apps use lots of ifuncs), i probably don't care at= all > > > > > until we get to FMV. and i probably don't care for FMV, because > > > > > compiler-rt (or gcc's equivalent) will be the "caching layer" the= re. > > > > > (and on Android it'll be a while before i have to worry about lib= c's > > > > > ifuncs because we'll require V and not use ifuncs there for the > > > > > foreseeable future.) > > > > > > > > > > so, yeah, given that i've adopted the "pass a null pointer rather= than > > > > > no arguments" convention you have, we have room for expansion if/= when > > > > > FMV is a big thing, and until then -- unless i'm shocked by what = i > > > > > find looking at actual apps -- i don't think i have any reason to > > > > > believe that ifuncs matter that much, and if compiler-rt makes on= e > > > > > __riscv_hwprobe() call per .so, that's probably fine. (i already = spend > > > > > a big chunk of my life advising people to just have one .so file, > > > > > exporting nothing but a JNI_OnLoad symbol, so this will just make= that > > > > > advice even better advice :-) ) > > > > > > > > Just to confirm, by "pass a null pointer", you're saying that the > > > > Android libc also passes NULL as the second ifunc selector argument > > > > (or first)? > > > > > > #elif defined(__riscv) > > > // This argument and its value is just a placeholder for now, > > > // but it means that if we do pass something in future (such as > > > // getauxval() and/or hwprobe key/value pairs), callees will be abl= e to > > > // recognize what they're being given. > > > typedef ElfW(Addr) (*ifunc_resolver_t)(void*); > > > return reinterpret_cast(resolver_addr)(nullptr); > > > > > > it's arm64 that has the initial getauxval() argument: > > > > > > #if defined(__aarch64__) > > > typedef ElfW(Addr) (*ifunc_resolver_t)(uint64_t, __ifunc_arg_t*); > > > static __ifunc_arg_t arg; > > > static bool initialized =3D false; > > > if (!initialized) { > > > initialized =3D true; > > > arg._size =3D sizeof(__ifunc_arg_t); > > > arg._hwcap =3D getauxval(AT_HWCAP); > > > arg._hwcap2 =3D getauxval(AT_HWCAP2); > > > } > > > return reinterpret_cast(resolver_addr)(arg._hwcap > > > | _IFUNC_ARG_HWCAP, &arg); > > > > > > https://android.googlesource.com/platform/bionic/+/main/libc/bionic/b= ionic_call_ifunc_resolver.cpp > > > > > > > That's good. It sounds like you're planning to just > > > > continue passing NULL for now, and wait for people to start clamori= ng > > > > for this in android libc? > > > > > > yeah, and i'm assuming there will never be any clamor ... yesterday > > > and today i actually checked a bunch of popular apks, and didn't find > > > any that were currently using ifuncs. > > > > > > the only change i'm thinking of making right now is that "there's a > > > single argument, and it's null" should probably be the default. > > > obviously since Android doesn't add new architectures very often, thi= s > > > is only likely to affect x86/x86-64 for the foreseeable future, but > > > being able to recognize at a glance "am i running under a libc new > > > enough to pass me arguments?" would certainly have helped for arm64. > > > even if x86/x86-64 never benefit, it seems like the right default for > > > the #else clause... > > > > Sounds good, thanks for the pointers. The paranoid person in me would > > also add a comment in the risc-v section that if a pointer to hwprobe > > is added, it should be added as the second argument, behind hwcap as > > the first (assuming this change lands). > > > > Come to think of it, the static inline helper I'm proposing in my > > discussion with Richard needs to take both arguments, since callers > > need to check both ((arg1 !=3D 0) && (arg2 !=3D NULL)) to safely know t= hat > > arg2 is a pointer to __riscv_hwprobe(). > > presumably not `(arg1 !=3D 0)` but `(arg1 & _IFUNC_ARG_HWCAP)` to match a= rm64? It looks like we didn't do that _IFUNC_ARG_HWCAP bit on riscv. Actually, looking at the history of sysdeps/riscv/dl-irel.h, hwcap has always been passed as the first argument. So I think I don't need to check it in the (glibc-specific) inline helper function, I can safely assume it's there and go straight to checking the second argument. If you were coding this directly in a library or application, you would need to check the first arg to be compatible with other libcs like Android's. I think checking against zero should be ok since bits like I, M, A should always be set. (I didn't dig through the kernel history, so maybe not on really old kernels? But you're not going to get hwprobe on those anyway either so the false bailout is correct). -Evan