From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp001-out.apm-internet.net (smtp001-out.apm-internet.net [85.119.248.222]) by sourceware.org (Postfix) with ESMTPS id 3C2CB3858C2C for ; Sun, 14 May 2023 18:51:20 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 3C2CB3858C2C Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=sandoe.co.uk Authentication-Results: sourceware.org; spf=none smtp.mailfrom=sandoe.co.uk Received: (qmail 69346 invoked from network); 14 May 2023 18:51:18 -0000 X-APM-Out-ID: 16840902786934 X-APM-Authkey: 257869/1(257869/1) 16 Received: from unknown (HELO smtpclient.apple) (81.138.1.83) by smtp001.apm-internet.net with SMTP; 14 May 2023 18:51:18 -0000 Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3696.120.41.1.3\)) Subject: Re: [PATCH] c++: coroutines - Overlap variables in frame [PR105989] From: Iain Sandoe In-Reply-To: Date: Sun, 14 May 2023 19:51:18 +0100 Cc: GCC Patches Content-Transfer-Encoding: quoted-printable Message-Id: <0C0E1653-A64F-4F9F-8540-C53671DDB175@sandoe.co.uk> References: <4229A144-BA40-4653-A37D-171E28EAD6FF@sandoe.co.uk> To: =?utf-8?Q?Michal_Jankovi=C4=8D?= X-Mailer: Apple Mail (2.3696.120.41.1.3) X-Spam-Status: No, score=-8.0 required=5.0 tests=BAYES_00,KAM_COUK,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_LOW,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Hi Michal, > On 14 May 2023, at 17:31, Michal Jankovi=C4=8D via Gcc-patches = wrote: > I do not currently have metrics for this, but I can look into = generating > them, To be clear, this is not in any way a precondition for patch acceptance but I am curious as to some idea of the improvements seen. > however > I currently do not know of any large open-source projects > using coroutines that I could use for this; I was thinking about using > cppcoro unit tests, but they mostly contain very simple coroutines. Even cppcoro would be some idea. Larger OSS projects I know of are folly (which I have out of tree = patches to enable the coroutines tests) and seastar (which I have no idea how to build at present). If you are interested to try building folly (note: it has a lot of = prereqs) then I can push my WIP branch somewhere. > I > have source for ~20k LoC proprietary project relying heavily on > coroutines (using boost::asio::awaitable), but here I cannot show the > source along with the numbers - would this be enough or should I look > for more open source projects? The internal project figures are also useful (IMO) even tho we cannot obviously repeat them. thanks Iain >=20 > thanks, > Michal >=20 > On May 14 2023, at 6:07 pm, Iain Sandoe wrote: >=20 >> Hi Michal, >>=20 >>> On 14 May 2023, at 16:36, Michal Jankovi=C4=8D >>> wrote: >>>=20 >>> Rebased the patch to GCC 14 trunk. Bootstrapped and regression = tested >>> again on x86_64-pc-linux-gnu, only difference is the new test = failing >>> without the patch. >>=20 >> (as previously noted, I am much in favour of this optimisation) >>=20 >> Do you have any metrics on the reductions in frame size for realistic = coroutines? >>=20 >> thanks >> Iain >>=20 >>>=20 >>> On Jul 13 2022, at 2:54 pm, Michal Jankovic >>> wrote: >>>=20 >>>> Hi Iain, >>>>=20 >>>> thanks for the info. I have some follow-up questions. >>>>=20 >>>> On Jul 12 2022, at 7:11 pm, Iain Sandoe wrote: >>>>=20 >>>>> Hi Michal, >>>>>=20 >>>>>> On 12 Jul 2022, at 16:14, Michal Jankovi=C4=8D >>>>>> wrote: >>>>>=20 >>>>>> One other related thing I would like to investigate is reducing = the >>>>>> number of compiler generated variables in the frame, particularly >>>>>> _Coro_destroy_fn and _Coro_self_handle. =20 >>>>>>=20 >>>>>> As I understand it, _Coro_destroy_fn just sets a flag in >>>>>> _Coro_resume_index and calls _Coro_resume_fn; it should be >>>>>> possible to >>>>>> move this logic to __builtin_coro_destroy, so that only = _Coro_resume_fn >>>>>> is stored in the frame; >>>>>=20 >>>>> That is a particular point about GCC=E2=80=99s implementation =E2=80= =A6 (it is not >>>>> neccesarily, or even >>>>> likely to be the same for other implementations) - see below. >>>>>=20 >>>>> I was intending to do experiment with making the = ramp/resume/destroy >>>>> value a parameter >>>>> to the actor function so that we would have something like - >>>>>=20 >>>>> ramp calls actor(frame, 0) >>>>> resume calls actor(frame, 1) >>>>> destroy calls actor(frame, 2) =20 >>>>> - the token values are illustrative, not intended to be a final = version. >>>>>=20 >>>>> I think that should allow for more inlining opportunites and = possibly >>>>> a way forward to >>>>> frame elision (a.k.a halo). >>>>>=20 >>>>>> this would however change the coroutine ABI - I don't know if = that's >>>>>> a problem. >>>>>=20 >>>>> The external ABI for the coroutine is the =20 >>>>> resume, >>>>> destroy pointers =20 >>>>> and the promise =20 >>>>> and that one can find each of these from the frame pointer. >>>>>=20 >>>>> This was agreed between the interested =E2=80=9Cvendors=E2=80=9D = so that one compiler >>>>> could invoke >>>>> coroutines built by another. So I do not think this is so much a >>>>> useful area to explore. >>>>>=20 >>>>=20 >>>> I understand. I still want to try to implement a more light-weight = frame >>>> layout with just one function pointer; would it be possible to = merge >>>> such a change if it was made opt-in via a compiler flag, eg >>>> `-fsmall-coroutine-frame`? My use-case for this is embedded = environments >>>> with very limited memory, and I do not care about interoperability = with >>>> other compilers there. =20 >>>>=20 >>>>> Also the intent is that an indirect call through the frame pointer = is >>>>> the most frequent >>>>> operation so should be the most efficient. =20 >>>>> resume() might be called many times, =20 >>>>> destroy() just once thus it is a cold code path =20 >>>>> - space can be important too - but interoperability was the goal = here. >>>>>=20 >>>>>> The _Coro_self_handle should be constructible on-demand from the >>>>>> frame address. >>>>>=20 >>>>> Yes, and in the header the relevant items are all constexpr - so = that >>>>> should happen in the >>>>> user=E2=80=99s code. I elected to have that value in the frame to = avoid >>>>> recreating it each time - I >>>>> suppose that is a trade-off of one oiptimisation c.f. another =E2=80= =A6 =20 >>>>=20 >>>> If the handle construction cannot be optimized out, and its thus =20= >>>> a tradeoff between frame size and number of instructions, then this >>>> could also be enabled by a hypothetical `-fsmall-coroutine-frame`. >>>>=20 >>>> Coming back to this: >>>>=20 >>>>>>> (the other related optimisation is to eliminate frame entries = for >>>>>>> scopes without any suspend >>>>>>> points - which has the potential to save even more space for = code with >>>>>>> sparse use of co_xxxx) >>>>=20 >>>> This would be nice; although it could encompassed by a more general = =20 >>>>=20 >>>> optimization - eliminate frame entries for all variables which are = not >>>>=20 >>>> accessed (directly or via pointer / reference) beyond a suspend = point. >>>> To be fair, I do not know how to get started on such an = optimization, >>>> or if it is even possible to do on the frontend. This would however = be >>>> immensely useful for reducing the frame size taken-up by = complicated >>>> co_await expressions (among other things), for example, if I have a >>>> composed operation: >>>>=20 >>>> co_await when_either(get_leaf_awaitable_1(), = get_leaf_awaitable_2()); >>>>=20 >>>> Right now, this creates space in the frame for the temporary 'leaf' = =20 >>>>=20 >>>> awaitables, which were already moved into the composed awaitable. >>>> If the awaitable has an operator co_await that returns the real = awaiter, >>>> the original awaitable is also stored in the frame, even if it =20= >>>> is not referenced by the awaiter; another unused object gets stored = if >>>>=20 >>>> the .await_transform() customization point was used. >>>>=20 >>>> What are your thoughts on the feasibility / difficulty of = implementing >>>> such an optimization? >>>>=20 >>>> Michal >>>>=20 >>>>>>=20 >>>>>> Do you have any advice / opinions on this before I try to >>>>>> implement it? >>>>>=20 >>>>> Hopefully, the notes above help. >>>>>=20 >>>>> I will rebase my latest code changes as soon as I have a chance = and >>>>> put them somewhere >>>>> for you to look at - basically, these are to try and address the >>>>> correctness issues we face, >>>>>=20 >>>>> Iain >>>>>=20 >>>>>=20 >>>>>>=20 >>>>>> Michal >>>>>>=20 >>>>>> On Jul 12 2022, at 4:08 pm, Iain Sandoe = wrote: >>>>>>=20 >>>>>>> Hi Michal, >>>>>>>=20 >>>>>>>> On 12 Jul 2022, at 14:35, Michal Jankovi=C4=8D via Gcc-patches >>>>>>>> wrote: >>>>>>>>=20 >>>>>>>> Currently, coroutine frames store all variables of a coroutine = separately, >>>>>>>> even if their lifetime does not overlap (they are in distinct >>>>>>>> scopes). This >>>>>>>> patch implements overlapping distinct variable scopes in the >>>>>>>> coroutine frame, >>>>>>>> by storing the frame fields in nested unions of structs. This = lowers >>>>>>>> the size >>>>>>>> of the frame for larger coroutines significantly, and makes = them >>>>>>>> more usable >>>>>>>> on systems with limited memory. >>>>>>>=20 >>>>>>> not a review (I will try to take a look at the weekend). >>>>>>>=20 >>>>>>> but =E2=80=A6 this is one of the two main optimisations on my = TODO - so cool >>>>>>> for doing it. >>>>>>>=20 >>>>>>> (the other related optimisation is to eliminate frame entries = for >>>>>>> scopes without any suspend >>>>>>> points - which has the potential to save even more space for = code with >>>>>>> sparse use of co_xxxx) >>>>>>>=20 >>>>>>> Iain >>>>>>>=20 >>>>>>>> Bootstrapped and regression tested on x86_64-pc-linux-gnu; new >>>>>>>> test fails >>>>>>>> before the patch and succeeds after with no regressions. >>>>>>>>=20 >>>>>>>> PR c++/105989 >>>>>>>>=20 >>>>>>>> gcc/cp/ChangeLog: >>>>>>>>=20 >>>>>>>> * coroutines.cc (struct local_var_info): Add = field_access_path. >>>>>>>> (build_local_var_frame_access_expr): New. >>>>>>>> (transform_local_var_uses): Use = build_local_var_frame_access_expr. >>>>>>>> (coro_make_frame_entry_id): New. >>>>>>>> (coro_make_frame_entry): Delegate to = coro_make_frame_entry_id. >>>>>>>> (struct local_vars_frame_data): Add orig, = field_access_path. >>>>>>>> (register_local_var_uses): Generate new frame layout. = Create access >>>>>>>> paths to vars. >>>>>>>> (morph_fn_to_coro): Set new fields in = local_vars_frame_data. =20 >>>>>>>>=20 >>>>>>>> gcc/testsuite/ChangeLog: >>>>>>>>=20 >>>>>>>> * g++.dg/coroutines/pr105989.C: New test. >>>>>>>>=20 >>>>>>>> >>>>>>>=20 >>>>>>>=20 >>>>>=20 >>>>>=20 >>> >>=20 >>=20