From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qv1-xf33.google.com (mail-qv1-xf33.google.com [IPv6:2607:f8b0:4864:20::f33]) by sourceware.org (Postfix) with ESMTPS id 3DBD63858D1E for ; Sat, 12 Aug 2023 00:01:46 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 3DBD63858D1E Authentication-Results: sourceware.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=google.com Received: by mail-qv1-xf33.google.com with SMTP id 6a1803df08f44-6418cc115cdso8164676d6.0 for ; Fri, 11 Aug 2023 17:01:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1691798505; x=1692403305; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=rjitX5z36ai5y1F4/6Wvi0e1ziitkTYVA5Rhi5ODPKA=; b=jnPJevBLMLvXNCaVhZ+oAcg1wT1h490jSSilj1imrbWQaWZb92tkonOtF4MQJJNnWx eBs+AmHJ+60NW6h0YEebh70Uxh3HwZv/mfIooNLJ6MCsndSkNQhGQRtSjGS83oOBEJ+H rvZV2eXkifu+i3on8ngfkbIdK7EtCIh0gk1WJi1r+iI2HbkH/EkFWpGvMDp+O4p3hPlq ZT0prgVzAgiOJm1oQPu9LG45WXt7WgZHkFqRoyg3qPq3tRkkuPn6z0JH7pz0DKhHgDK6 Si9zBKazCPr6Q3G9WEaB0exBw7mYjiR0S30DQtpEAvBRfU4IvxnDiNNaIEgoavAyxxqI PYTg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691798505; x=1692403305; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=rjitX5z36ai5y1F4/6Wvi0e1ziitkTYVA5Rhi5ODPKA=; b=lLXZKuP1m2Ew56sF5WfDbjuEcSE0Gjp/m5sgcQaqEo4SsDktjIw4ZuNclcbKcUdNc7 JqaxDGbKIeTycpyM0fLnwX+bcvqomyPVQreh+aSlEKYaraY3VhH9ev0W8+qtrwRowi0H 5xitgYWlqPrBfNzJFCs6z/ZiJRWN3AZf7a4kFoLu7KyP14s9JNx1JrwWKz9HYl2swKab BDhCcw9NzHTtCYF6+TXt2SpTremYjKtmrauAXifnT0usAXWm4OuXS9ferqLpb4J05ZTT 6rEzg0ULthlS+Uxppg5iwzuoyNtkc/ZIVHib7xgaG42zgagbv2sbCQ5oXEGiWf8PX02P tfcw== X-Gm-Message-State: AOJu0YzbruOFjGAB8NncB81QXvUGFL20r0TpxZRp+QwrzIQUuG/9fO9r caxjObreR9Vt35o1ED2a8RA6tRk3QM0Mh2OcndY1GA== X-Google-Smtp-Source: AGHT+IHrKIwD+b4gpf+VjDw+FrT3ViyUeNNWKVc+x62W/qSpLqbKCRkCvxOOud10KU7+/3BQcAyiqB9lMjzG5jwDOl8= X-Received: by 2002:ad4:5dca:0:b0:63f:50c8:3fa2 with SMTP id m10-20020ad45dca000000b0063f50c83fa2mr3489026qvh.53.1691798505451; Fri, 11 Aug 2023 17:01:45 -0700 (PDT) MIME-Version: 1.0 References: <20230802155903.2552780-1-evan@rivosinc.com> <20230802155903.2552780-6-evan@rivosinc.com> <87il9w37vi.fsf@oldenburg.str.redhat.com> <6f0911c6-b24b-444c-4b4b-a62e49a51734@linaro.org> <548fc7d5-6225-69e7-f4a7-47669d2fdbd5@linaro.org> In-Reply-To: From: enh Date: Fri, 11 Aug 2023 17:01:33 -0700 Message-ID: Subject: Re: [PATCH v6 5/5] riscv: Add and use alignment-ignorant memcpy To: Evan Green Cc: Richard Henderson , Florian Weimer , libc-alpha@sourceware.org, slewis@rivosinc.com, palmer@rivosinc.com, vineetg@rivosinc.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-18.4 required=5.0 tests=BAYES_00,DKIMWL_WL_MED,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Mon, Aug 7, 2023 at 5:01=E2=80=AFPM Evan Green wrote= : > > On Mon, Aug 7, 2023 at 3:48=E2=80=AFPM enh wrote: > > > > On Mon, Aug 7, 2023 at 3:11=E2=80=AFPM Evan Green w= rote: > > > > > > On Thu, Aug 3, 2023 at 3:30=E2=80=AFPM Richard Henderson > > > wrote: > > > > > > > > On 8/3/23 11:42, Evan Green wrote: > > > > > On Thu, Aug 3, 2023 at 10:50=E2=80=AFAM Richard Henderson > > > > > wrote: > > > > >> Outside libc something is required. > > > > >> > > > > >> An extra parameter to ifunc is surprising though, and clearly no= t ideal per the extra > > > > >> hoops above. I would hope for something with hidden visibility = in libc_nonshared.a that > > > > >> could always be called directly. > > > > > > > > > > My previous spin took that approach, defining a > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the= real > > > > > function if available, or make the syscall directly if not. But t= hat > > > > > approach had the drawback that ifunc users couldn't take advantag= e of > > > > > the vDSO, and then all users had to comprehend the difference bet= ween > > > > > __riscv_hwprobe() and __riscv_hwprobe_early(). > > > > > > > > I would define __riscv_hwprobe such that it could take advantage of= the vDSO once > > > > initialization reaches a certain point, but cope with being run ear= lier than that point by > > > > falling back to the syscall. > > > > > > > > That constrains the implementation, I guess, in that it can't set e= rrno, but just > > > > returning the negative errno from the syscall seems fine. > > > > > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe= ) very early, but I > > > > would hope that some application of __attribute__((weak)) might cor= rectly get you a NULL > > > > prior to full relocations being complete. > > > > > > Right, this is what we had in the previous iteration of this series, > > > and it did work ok. But it wasn't as good since it meant ifunc > > > selectors always got stuck in the null/fallback case and were forced > > > to make the syscall. With this mechanism they get to take advantage o= f > > > the vDSO. > > > > > > > > > > > > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are > > > > > already used to getting hwcap info via a parameter. Adding this s= econd > > > > > parameter, which also provides hwcap-like things, seems like a na= tural > > > > > extension. I didn't quite follow what you meant by the "extra hoo= ps > > > > > above". > > > > > > > > The check for null function pointer, for sure. But also consider h= ow __riscv_hwprobe is > > > > going to be used. > > > > > > > > It might be worth defining some helper functions for probing a sing= le key or a single > > > > field. E.g. > > > > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags) > > > > { > > > > struct riscv_hwprobe pair =3D { .key =3D key }; > > > > int err =3D __riscv_hwprobe(&pair, 1, 0, NULL, flags); > > > > if (err) > > > > return err; > > > > if (pair.key =3D=3D -1) > > > > return -ENOENT; > > > > return pair.value; > > > > } > > > > > > > > This implementation requires that no future hwprobe key define a va= lue which as a valid > > > > value in the errno range (or better, bit 63 unused). Alternately, = or additionally: > > > > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t = val, int flags) > > > > { > > > > struct riscv_hwprobe pair =3D { .key =3D key }; > > > > return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) =3D=3D 0 > > > > && pair.key !=3D -1 > > > > && (pair.value & mask) =3D=3D val); > > > > } > > > > > > > > These yield either > > > > > > > > int64_t v =3D __riscv_hwprobe_one_key(CPUPERF_0, 0); > > > > if (v >=3D 0 && (v & MISALIGNED_MASK) =3D=3D MISALIGNED_FAST) > > > > return __memcpy_noalignment; > > > > return __memcpy_generic; > > > > > > > > or > > > > > > > > if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISAL= IGNED_FAST, 0)) > > > > return __memcpy_noalignment; > > > > return __memcpy_generic; > > > > > > > > which to my mind looks much better for a pattern you'll be replicat= ing so very many times > > > > across all of the ifunc implementations in the system. > > > > > > Ah, I see. I could make a static inline function in the header that > > > looks something like this (mangled by gmail, sorry): > > > > > > /* Helper function usable from ifunc selectors that probes a single k= ey. */ > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func, > > > signed long long int key, > > > unsigned long long int *value) > > > { > > > struct riscv_hwprobe pair; > > > int rc; > > > > > > if (!hwprobe_func) > > > return -ENOSYS; > > > > > > pair.key =3D key; > > > rc =3D hwprobe_func(&pair, 1, 0, NULL, 0); > > > if (rc) { > > > return rc; > > > } > > > > > > if (pair.key < 0) { > > > return -ENOENT; > > > } > > > > > > *value =3D pair.value; > > > return 0; > > > } > > > > > > The ifunc selector would then be significantly cleaned up, looking > > > something like: > > > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &v= alue)) > > > return __memcpy_generic; > > > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) =3D=3D RISCV_HWPROBE_MISAL= IGNED_FAST) > > > return __memcpy_noalignment; > > > > (Android's libc maintainer here, having joined the list just to talk > > about risc-v ifuncs :-) ) > > > > has anyone thought about calling ifunc resolvers more like this... > > > > --same part of the dynamic loader that caches the two getauxval()s for = arm64-- > > static struct riscv_hwprobe probes[] =3D { > > {.value =3D RISCV_HWPROBE_KEY_MVENDORID}, > > {.value =3D RISCV_HWPROBE_KEY_MARCHID}, > > {.value =3D RISCV_HWPROBE_KEY_MIMPID}, > > {.value =3D RISCV_HWPROBE_KEY_BASE_BEHAVIOR}, > > {.value =3D RISCV_HWPROBE_KEY_IMA_EXT}, > > {.value =3D RISCV_HWPROBE_KEY_CPUPERF_0}, > > ... // every time a new key is added to the kernel, we add it here > > }; > > __riscv_hwprobe(...); // called once > > > > --part of the dynamic loader that calls ifunc resolvers-- > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes); > > > > this is similar to what we already have for arm64 (where there's a > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and > > potentially others), but more uniform, and avoiding the source > > (in)compatibility issues of adding new fields to a struct [even if it > > does have a size_t to "version" it like the arm64 ifunc struct]. > > > > yes, it means everyone pays to get all the hwprobes, but that gets > > amortized. and lookup in the ifunc resolver is simple and quick. if we > > know that the keys will be kept dense, we can even have code in ifunc > > resolvers like > > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V= ) ... > > > > though personally for the "big ticket items" that get a letter to > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP), > > probe_count, probes_ptr)` to the resolver, but i hear that's > > controversial :-) > > Hello, welcome to the fun! :) (sorry for the delay. i've been thinking :-) ) > What you're describing here is almost exactly what we did inside the > vDSO function. The vDSO function acts as a front for a handful of > probe values that we've already completed and cached in userspace. We > opted to make it a function, rather than exposing the data itself via > vDSO, so that we had future flexibility in what elements we cached in > userspace and their storage format. We can update the kernel as needed > to cache the hottest things in userspace, even if that means > rearranging the data format, passing through some extra information, > or adding an extra snip of code. My hope is callers can directly > interact with the vDSO function (though maybe as Richard suggested > maybe with the help of a tidy inline helper), rather than trying to > add a second layer of userspace caching. on reflection i think i might be too focused on the FMV use case, in part because we're looking at those compiler-generated ifuncs for arm64 on Android atm. i think i'm imagining a world where there's a lot of that, and worrying about having to pay for the setup, call, and loop for each ifunc, and wondering why we don't just pay once instead. (as a bit of background context, Android "app" start is actually a dlopen() in a clone of an existing zygote process, and in general app launch time is one of the key metrics anyone who's serious is optimizing for. you'd be surprised how much of my life i spend explaining to people that if they want dlopen() to be faster, maybe they shouldn't ask us to run thousands of ELF constructors.) but... the more time i spend looking at what we actually need in third-party open source libraries right now i realize that libc and FMV (which is still a future thing for us anyway) are really the only _actual_ ifunc users. perhaps in part because macOS/iOS don't have ifuncs, all the libraries that are part of the OS itself, for example, are just doing their own thing with function pointers and pthread_once() or whatever. (i have yet to try to get any data on actual apps. i have no reason to think they'll be very different, but that could easily be skewed by popular middleware or a popular game engine using ifuncs, so i do plan on following up on that.) "how do they decide what to set that function pointer to?". well, it looks like in most cases cpuid on x86 and calls to getauxval() everywhere else. in some cases that's actually via some other library: https://github.com/pytorch/cpuinfo or https://github.com/google/cpu_features for example. so they have a layer of caching there, even in cases where they don't have a single function that sets all the function pointers. so assuming i don't find that apps look very different from the OS (that is: that apps use lots of ifuncs), i probably don't care at all until we get to FMV. and i probably don't care for FMV, because compiler-rt (or gcc's equivalent) will be the "caching layer" there. (and on Android it'll be a while before i have to worry about libc's ifuncs because we'll require V and not use ifuncs there for the foreseeable future.) so, yeah, given that i've adopted the "pass a null pointer rather than no arguments" convention you have, we have room for expansion if/when FMV is a big thing, and until then -- unless i'm shocked by what i find looking at actual apps -- i don't think i have any reason to believe that ifuncs matter that much, and if compiler-rt makes one __riscv_hwprobe() call per .so, that's probably fine. (i already spend a big chunk of my life advising people to just have one .so file, exporting nothing but a JNI_OnLoad symbol, so this will just make that advice even better advice :-) ) > -Evan