From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp002.apm-internet.net (smtp002.apm-internet.net [85.119.248.221]) by sourceware.org (Postfix) with ESMTPS id D67063858C53 for ; Sun, 14 May 2023 16:07:33 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org D67063858C53 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=sandoe.co.uk Authentication-Results: sourceware.org; spf=none smtp.mailfrom=sandoe.co.uk Received: (qmail 7514 invoked from network); 14 May 2023 16:07:32 -0000 X-APM-Out-ID: 16840804520751 X-APM-Authkey: 257869/1(257869/1) 14 Received: from unknown (HELO smtpclient.apple) (81.138.1.83) by smtp002.apm-internet.net with SMTP; 14 May 2023 16:07:32 -0000 Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3696.120.41.1.3\)) Subject: Re: [PATCH] c++: coroutines - Overlap variables in frame [PR105989] From: Iain Sandoe In-Reply-To: <412A866E-BE4E-4DA8-A10E-4410C01FB9FF@getmailspring.com> Date: Sun, 14 May 2023 17:07:32 +0100 Cc: GCC Patches Content-Transfer-Encoding: quoted-printable Message-Id: <4229A144-BA40-4653-A37D-171E28EAD6FF@sandoe.co.uk> References: <412A866E-BE4E-4DA8-A10E-4410C01FB9FF@getmailspring.com> To: =?utf-8?Q?Michal_Jankovi=C4=8D?= X-Mailer: Apple Mail (2.3696.120.41.1.3) X-Spam-Status: No, score=-8.0 required=5.0 tests=BAYES_00,KAM_COUK,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_LOW,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Hi Michal, > On 14 May 2023, at 16:36, Michal Jankovi=C4=8D = wrote: >=20 > Rebased the patch to GCC 14 trunk. Bootstrapped and regression tested > again on x86_64-pc-linux-gnu, only difference is the new test failing > without the patch. (as previously noted, I am much in favour of this optimisation) Do you have any metrics on the reductions in frame size for realistic = coroutines? thanks Iain >=20 > On Jul 13 2022, at 2:54 pm, Michal Jankovic > wrote: >=20 >> Hi Iain, >>=20 >> thanks for the info. I have some follow-up questions. >>=20 >> On Jul 12 2022, at 7:11 pm, Iain Sandoe wrote: >>=20 >>> Hi Michal, >>>=20 >>>> On 12 Jul 2022, at 16:14, Michal Jankovi=C4=8D >>>> wrote: >>>=20 >>>> One other related thing I would like to investigate is reducing the >>>> number of compiler generated variables in the frame, particularly >>>> _Coro_destroy_fn and _Coro_self_handle. =20 >>>>=20 >>>> As I understand it, _Coro_destroy_fn just sets a flag in >>>> _Coro_resume_index and calls _Coro_resume_fn; it should be possible = to >>>> move this logic to __builtin_coro_destroy, so that only = _Coro_resume_fn >>>> is stored in the frame; >>>=20 >>> That is a particular point about GCC=E2=80=99s implementation =E2=80=A6= (it is not >>> neccesarily, or even >>> likely to be the same for other implementations) - see below. >>>=20 >>> I was intending to do experiment with making the ramp/resume/destroy >>> value a parameter >>> to the actor function so that we would have something like - >>>=20 >>> ramp calls actor(frame, 0) >>> resume calls actor(frame, 1) >>> destroy calls actor(frame, 2) =20 >>> - the token values are illustrative, not intended to be a final = version. >>>=20 >>> I think that should allow for more inlining opportunites and = possibly >>> a way forward to >>> frame elision (a.k.a halo). >>>=20 >>>> this would however change the coroutine ABI - I don't know if = that's >>>> a problem. >>>=20 >>> The external ABI for the coroutine is the =20 >>> resume, >>> destroy pointers =20 >>> and the promise =20 >>> and that one can find each of these from the frame pointer. >>>=20 >>> This was agreed between the interested =E2=80=9Cvendors=E2=80=9D so = that one compiler >>> could invoke >>> coroutines built by another. So I do not think this is so much a >>> useful area to explore. >>>=20 >>=20 >> I understand. I still want to try to implement a more light-weight = frame >> layout with just one function pointer; would it be possible to merge >> such a change if it was made opt-in via a compiler flag, eg >> `-fsmall-coroutine-frame`? My use-case for this is embedded = environments >> with very limited memory, and I do not care about interoperability = with >> other compilers there. =20 >>=20 >>> Also the intent is that an indirect call through the frame pointer = is >>> the most frequent >>> operation so should be the most efficient. =20 >>> resume() might be called many times, =20 >>> destroy() just once thus it is a cold code path =20 >>> - space can be important too - but interoperability was the goal = here. >>>=20 >>>> The _Coro_self_handle should be constructible on-demand from the >>>> frame address. >>>=20 >>> Yes, and in the header the relevant items are all constexpr - so = that >>> should happen in the >>> user=E2=80=99s code. I elected to have that value in the frame to = avoid >>> recreating it each time - I >>> suppose that is a trade-off of one oiptimisation c.f. another =E2=80=A6= =20 >>=20 >> If the handle construction cannot be optimized out, and its thus =20 >> a tradeoff between frame size and number of instructions, then this >> could also be enabled by a hypothetical `-fsmall-coroutine-frame`. >>=20 >> Coming back to this: >>=20 >>>>> (the other related optimisation is to eliminate frame entries for >>>>> scopes without any suspend >>>>> points - which has the potential to save even more space for code = with >>>>> sparse use of co_xxxx) >>=20 >> This would be nice; although it could encompassed by a more general =20= >> optimization - eliminate frame entries for all variables which are = not >>=20 >> accessed (directly or via pointer / reference) beyond a suspend = point. >> To be fair, I do not know how to get started on such an optimization, >> or if it is even possible to do on the frontend. This would however = be >> immensely useful for reducing the frame size taken-up by complicated >> co_await expressions (among other things), for example, if I have a >> composed operation: >>=20 >> co_await when_either(get_leaf_awaitable_1(), get_leaf_awaitable_2()); >>=20 >> Right now, this creates space in the frame for the temporary 'leaf' =20= >> awaitables, which were already moved into the composed awaitable. >> If the awaitable has an operator co_await that returns the real = awaiter, >> the original awaitable is also stored in the frame, even if it =20 >> is not referenced by the awaiter; another unused object gets stored = if >>=20 >> the .await_transform() customization point was used. >>=20 >> What are your thoughts on the feasibility / difficulty of = implementing >> such an optimization? >>=20 >> Michal >>=20 >>>>=20 >>>> Do you have any advice / opinions on this before I try to implement = it? >>>=20 >>> Hopefully, the notes above help. >>>=20 >>> I will rebase my latest code changes as soon as I have a chance and >>> put them somewhere >>> for you to look at - basically, these are to try and address the >>> correctness issues we face, >>>=20 >>> Iain >>>=20 >>>=20 >>>>=20 >>>> Michal >>>>=20 >>>> On Jul 12 2022, at 4:08 pm, Iain Sandoe wrote: >>>>=20 >>>>> Hi Michal, >>>>>=20 >>>>>> On 12 Jul 2022, at 14:35, Michal Jankovi=C4=8D via Gcc-patches >>>>>> wrote: >>>>>>=20 >>>>>> Currently, coroutine frames store all variables of a coroutine = separately, >>>>>> even if their lifetime does not overlap (they are in distinct >>>>>> scopes). This >>>>>> patch implements overlapping distinct variable scopes in the >>>>>> coroutine frame, >>>>>> by storing the frame fields in nested unions of structs. This = lowers >>>>>> the size >>>>>> of the frame for larger coroutines significantly, and makes them >>>>>> more usable >>>>>> on systems with limited memory. >>>>>=20 >>>>> not a review (I will try to take a look at the weekend). >>>>>=20 >>>>> but =E2=80=A6 this is one of the two main optimisations on my TODO = - so cool >>>>> for doing it. >>>>>=20 >>>>> (the other related optimisation is to eliminate frame entries for >>>>> scopes without any suspend >>>>> points - which has the potential to save even more space for code = with >>>>> sparse use of co_xxxx) >>>>>=20 >>>>> Iain >>>>>=20 >>>>>> Bootstrapped and regression tested on x86_64-pc-linux-gnu; new >>>>>> test fails >>>>>> before the patch and succeeds after with no regressions. >>>>>>=20 >>>>>> PR c++/105989 >>>>>>=20 >>>>>> gcc/cp/ChangeLog: >>>>>>=20 >>>>>> * coroutines.cc (struct local_var_info): Add field_access_path. >>>>>> (build_local_var_frame_access_expr): New. >>>>>> (transform_local_var_uses): Use = build_local_var_frame_access_expr. >>>>>> (coro_make_frame_entry_id): New. >>>>>> (coro_make_frame_entry): Delegate to coro_make_frame_entry_id. >>>>>> (struct local_vars_frame_data): Add orig, field_access_path. >>>>>> (register_local_var_uses): Generate new frame layout. Create = access >>>>>> paths to vars. >>>>>> (morph_fn_to_coro): Set new fields in local_vars_frame_data. =20= >>>>>>=20 >>>>>> gcc/testsuite/ChangeLog: >>>>>>=20 >>>>>> * g++.dg/coroutines/pr105989.C: New test. >>>>>>=20 >>>>>> >>>>>=20 >>>>>=20 >>>=20 >>>=20 >