Date: Wed, 13 Jul 2022 14:54:56 +0200
From: Michal Jankovic
To: Iain Sandoe
Cc: GCC Patches
Subject: Re: [PATCH] c++: coroutines - Overlap variables in frame [PR105989]

Hi Iain,

thanks for the info. I have some follow-up questions.

On Jul 12 2022, at 7:11 pm, Iain Sandoe wrote:

> Hi Michal,
>
>> On 12 Jul 2022, at 16:14, Michal Jankovič wrote:
>
>> One other related thing I would like to investigate is reducing the
>> number of compiler generated variables in the frame, particularly
>> _Coro_destroy_fn and _Coro_self_handle.
>>
>> As I understand it, _Coro_destroy_fn just sets a flag in
>> _Coro_resume_index and calls _Coro_resume_fn; it should be possible to
>> move this logic to __builtin_coro_destroy, so that only _Coro_resume_fn
>> is stored in the frame;
>
> That is a particular point about GCC’s implementation … (it is not
> necessarily, or even likely to be the same for other implementations)
> - see below.
>
> I was intending to experiment with making the ramp/resume/destroy
> value a parameter to the actor function so that we would have
> something like -
>
> ramp calls actor(frame, 0)
> resume calls actor(frame, 1)
> destroy calls actor(frame, 2)
> - the token values are illustrative, not intended to be a final version.
>
> I think that should allow for more inlining opportunities and possibly
> a way forward to frame elision (a.k.a. halo).
>
>> this would however change the coroutine ABI - I don't know if that's
>> a problem.
>
> The external ABI for the coroutine is the
> resume,
> destroy pointers
> and the promise
> and that one can find each of these from the frame pointer.
>
> This was agreed between the interested “vendors” so that one compiler
> could invoke coroutines built by another. So I do not think this is
> so much a useful area to explore.
>

I understand. I still want to try to implement a more light-weight frame
layout with just one function pointer; would it be possible to merge
such a change if it was made opt-in via a compiler flag, e.g.
`-fsmall-coroutine-frame`? My use-case for this is embedded environments
with very limited memory, and I do not care about interoperability with
other compilers there.

> Also the intent is that an indirect call through the frame pointer is
> the most frequent operation so should be the most efficient.
> resume() might be called many times,
> destroy() just once thus it is a cold code path
> - space can be important too - but interoperability was the goal here.
>
>> The _Coro_self_handle should be constructible on-demand from the
>> frame address.
>
> Yes, and in the header the relevant items are all constexpr - so that
> should happen in the user’s code. I elected to have that value in the
> frame to avoid recreating it each time - I suppose that is a trade-off
> of one optimisation c.f. another …

If the handle construction cannot be optimized out, and it is thus
a tradeoff between frame size and number of instructions, then this
could also be enabled by a hypothetical `-fsmall-coroutine-frame`.

Coming back to this:

>>> (the other related optimisation is to eliminate frame entries for
>>> scopes without any suspend points - which has the potential to save
>>> even more space for code with sparse use of co_xxxx)

This would be nice; although it could be encompassed by a more general
optimization - eliminate frame entries for all variables which are not
accessed (directly or via pointer / reference) beyond a suspend point.
To be fair, I do not know how to get started on such an optimization,
or if it is even possible to do on the frontend. This would however be
immensely useful for reducing the frame size taken up by complicated
co_await expressions (among other things), for example, if I have a
composed operation:

co_await when_either(get_leaf_awaitable_1(), get_leaf_awaitable_2());

Right now, this creates space in the frame for the temporary 'leaf'
awaitables, which were already moved into the composed awaitable.
If the awaitable has an operator co_await that returns the real awaiter,
the original awaitable is also stored in the frame, even if it
is not referenced by the awaiter; another unused object gets stored if
the .await_transform() customization point was used.

What are your thoughts on the feasibility / difficulty of implementing
such an optimization?

Michal

>>
>> Do you have any advice / opinions on this before I try to implement it?
>
> Hopefully, the notes above help.
>
> I will rebase my latest code changes as soon as I have a chance and
> put them somewhere for you to look at - basically, these are to try
> and address the correctness issues we face,
>
> Iain
>
>>
>> Michal
>>
>> On Jul 12 2022, at 4:08 pm, Iain Sandoe wrote:
>>
>>> Hi Michal,
>>>
>>>> On 12 Jul 2022, at 14:35, Michal Jankovič via Gcc-patches wrote:
>>>>
>>>> Currently, coroutine frames store all variables of a coroutine
>>>> separately, even if their lifetime does not overlap (they are in
>>>> distinct scopes). This patch implements overlapping distinct
>>>> variable scopes in the coroutine frame, by storing the frame fields
>>>> in nested unions of structs. This lowers the size of the frame for
>>>> larger coroutines significantly, and makes them more usable on
>>>> systems with limited memory.
>>>
>>> not a review (I will try to take a look at the weekend).
>>>
>>> but … this is one of the two main optimisations on my TODO - so cool
>>> for doing it.
>>>
>>> (the other related optimisation is to eliminate frame entries for
>>> scopes without any suspend points - which has the potential to save
>>> even more space for code with sparse use of co_xxxx)
>>>
>>> Iain
>>>
>>>> Bootstrapped and regression tested on x86_64-pc-linux-gnu; new test fails
>>>> before the patch and succeeds after with no regressions.
>>>>
>>>> PR c++/105989
>>>>
>>>> gcc/cp/ChangeLog:
>>>>
>>>> * coroutines.cc (struct local_var_info): Add field_access_path.
>>>> (build_local_var_frame_access_expr): New.
>>>> (transform_local_var_uses): Use build_local_var_frame_access_expr.
>>>> (coro_make_frame_entry_id): New.
>>>> (coro_make_frame_entry): Delegate to coro_make_frame_entry_id.
>>>> (struct local_vars_frame_data): Add orig, field_access_path.
>>>> (register_local_var_uses): Generate new frame layout. Create access
>>>> paths to vars.
>>>> (morph_fn_to_coro): Set new fields in local_vars_frame_data.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> * g++.dg/coroutines/pr105989.C: New test.
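
For readers following the thread, the frame-overlap idea in the quoted
patch description can be illustrated with a minimal sketch. This is not
the layout the patch generates: the type and field names (ScopeA, ScopeB,
FlatFrame, OverlappedFrame, saved_bytes) are hypothetical, and the two
function pointers merely stand in for the resume/destroy entries
mentioned above. The point is only that variables from sibling scopes
are never live at the same time, so a union of per-scope structs lets
them share storage:

```cpp
#include <cstddef>

// Hypothetical locals of two sibling scopes in one coroutine body.
// The scopes do not nest, so their variables are never live together.
struct ScopeA { char buf_a[64]; int n; };
struct ScopeB { char buf_b[48]; double d; };

// Flat layout (assumed model of the behaviour before the patch):
// every local variable gets its own slot in the frame.
struct FlatFrame {
  void (*resume_fn)(void*);
  void (*destroy_fn)(void*);
  ScopeA a;
  ScopeB b;
};

// Overlapped layout (assumed model of the idea in the patch):
// sibling scopes share storage through a union of structs.
struct OverlappedFrame {
  void (*resume_fn)(void*);
  void (*destroy_fn)(void*);
  union {
    ScopeA a;
    ScopeB b;
  } scopes;
};

// Bytes saved by overlapping the two scopes in this particular example.
constexpr std::size_t saved_bytes() {
  return sizeof(FlatFrame) - sizeof(OverlappedFrame);
}

// The union is as large as its largest member, so the frame shrinks.
static_assert(sizeof(OverlappedFrame) < sizeof(FlatFrame),
              "overlapping sibling scopes should reduce the frame size");
```

The saving grows with the number of sibling scopes, since the union only
has to be as large as the biggest one; the exact byte counts depend on
the target's pointer size and alignment rules.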