From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by sourceware.org (Postfix) with ESMTPS id C4FAE385C331 for ; Tue, 14 Jun 2022 16:09:45 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C4FAE385C331 Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id 088F321B8C; Tue, 14 Jun 2022 16:09:44 +0000 (UTC) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id F1A28139EC; Tue, 14 Jun 2022 16:09:43 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id 3x31OseyqGLjcAAAMHmgww (envelope-from ); Tue, 14 Jun 2022 16:09:43 +0000 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable From: Richard Biener Mime-Version: 1.0 (1.0) Subject: Re: [PATCH 1/2]middle-end Support optimized division by pow2 bitmask Date: Tue, 14 Jun 2022 18:09:43 +0200 Message-Id: <540D4924-1B10-457C-B50D-18C76E81BC13@suse.de> References: Cc: Richard Sandiford , nd In-Reply-To: To: Tamar Christina via Gcc-patches X-Mailer: iPhone Mail (19F77) X-Spam-Status: No, score=-5.7 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Jun 2022 16:09:50 -0000 > Am 14.06.2022 um 17:58 schrieb Tamar Christina via Gcc-patches : >=20 > =EF=BB=BF >=20 >> -----Original Message----- >> From: Richard Sandiford >> Sent: Tuesday, June 14, 2022 2:43 PM >> To: Richard Biener >> Cc: Tamar Christina ; gcc-patches@gcc.gnu.org; >> nd >> Subject: Re: [PATCH 1/2]middle-end Support optimized division by pow2 >> bitmask >>=20 >> Richard Biener writes: >>>> On Mon, 13 Jun 2022, Tamar Christina wrote: >>>=20 >>>>> -----Original Message----- >>>>> From: Richard Biener >>>>> Sent: Monday, June 13, 2022 12:48 PM >>>>> To: Tamar Christina >>>>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford >>>>> >>>>> Subject: RE: [PATCH 1/2]middle-end Support optimized division by >>>>> pow2 bitmask >>>>>=20 >>>>> On Mon, 13 Jun 2022, Tamar Christina wrote: >>>>>=20 >>>>>>> -----Original Message----- >>>>>>> From: Richard Biener >>>>>>> Sent: Monday, June 13, 2022 10:39 AM >>>>>>> To: Tamar Christina >>>>>>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford >>>>>>> >>>>>>> Subject: Re: [PATCH 1/2]middle-end Support optimized division >>>>>>> by >>>>>>> pow2 bitmask >>>>>>>=20 >>>>>>> On Mon, 13 Jun 2022, Richard Biener wrote: >>>>>>>=20 >>>>>>>> On Thu, 9 Jun 2022, Tamar Christina wrote: >>>>>>>>=20 >>>>>>>>> Hi All, >>>>>>>>>=20 >>>>>>>>> In plenty of image and video processing code it's common to >>>>>>>>> modify pixel values by a widening operation and then scale >>>>>>>>> them back into range >>>>>>> by dividing by 255. >>>>>>>>>=20 >>>>>>>>> This patch adds an optab to allow us to emit an optimized >>>>>>>>> sequence when doing an unsigned division that is equivalent to: >>>>>>>>>=20 >>>>>>>>> x =3D y / (2 ^ (bitsize (y)/2)-1 >>>>>>>>>=20 >>>>>>>>> Bootstrapped Regtested on aarch64-none-linux-gnu, >>>>>>>>> x86_64-pc-linux-gnu and no issues. >>>>>>>>>=20 >>>>>>>>> Ok for master? >>>>>>>>=20 >>>>>>>> Looking at 2/2 it seems that this is the wrong way to attack >>>>>>>> the problem. The ISA doesn't have such instruction so adding >>>>>>>> an optab looks premature. I suppose that there's no unsigned >>>>>>>> vector integer division and thus we open-code that in a different >> way? >>>>>>>> Isn't the correct thing then to fixup that open-coding if it >>>>>>>> is more >>>>> efficient? >>>>>>>=20 >>>>>>=20 >>>>>> The problem is that even if you fixup the open-coding it would >>>>>> need to be something target specific? The sequence of >>>>>> instructions we generate don't have a GIMPLE representation. So >>>>>> whatever is generated I'd have to fixup in RTL then. >>>>>=20 >>>>> What's the operation that doesn't have a GIMPLE representation? >>>>=20 >>>> For NEON use two operations: >>>> 1. Add High narrowing lowpart, essentially doing (a +w b) >>.n bitsize(= a)/2 >>>> Where the + widens and the >> narrows. So you give it two >>>> shorts, get a byte 2. Add widening add of lowpart so basically >>>> lowpart (a +w b) >>>>=20 >>>> For SVE2 we use a different sequence, we use two back-to-back >> sequences of: >>>> 1. Add narrow high part (bottom). In SVE the Top and Bottom instructio= ns >> select >>>> Even and odd elements of the vector rather than "top half" and "botto= m >> half". >>>>=20 >>>> So this instruction does : Add each vector element of the first sourc= e >> vector to the >>>> corresponding vector element of the second source vector, and place >> the most >>>> significant half of the result in the even-numbered half-width >> destination elements, >>>> while setting the odd-numbered elements to zero. >>>>=20 >>>> So there's an explicit permute in there. The instructions are >>>> sufficiently different that there wouldn't be a single GIMPLE >> representation. >>>=20 >>> I see. Are these also useful to express scalar integer division? >>>=20 >>> I'll defer to others to ack the special udiv_pow2_bitmask optab or >>> suggest some piecemail things other targets might be able to do as >>> well. It does look very special. I'd also bikeshed it to >>> udiv_pow2m1 since 'bitmask' is less obvious than 2^n-1 (assuming I >>> interpreted 'bitmask' correctly ;)). It seems to be even less general >>> since it is an unary op and the actual divisor is constrained by the >>> mode itself? >>=20 >> Yeah, those were my concerns as well. For n-bit numbers, the same kind o= f >> arithmetic transformation can be used for any 2^m-1 for m in [n/2, n), so= >> from a target-independent point of view, m=3D=3Dn/2 isn't particularly sp= ecial. >> Hard-coding one value of m would make sense if there was an underlying >> instruction that did exactly this, but like you say, there isn't. >>=20 >> Would a compromise be to define an optab for ADDHN and then add a vector >> pattern for this division that (at least initially) prefers ADDHN over th= e >> current approach whenever ADDHN is available? We could then adapt the >> conditions on the pattern if other targets also provide ADDHN but don't w= ant >> this transform. (I think the other instructions in the pattern already h= ave >> optabs.) >>=20 >> That still leaves open the question about what to do about SVE2, but the >> underlying problem there is that the vectoriser doesn't know about the B/= T >> layout. >=20 > Wouldn't it be better to just generalize the optab and to pass on the mask= ? You could implement udivvhiN3 as well, but we=E2=80=99d need to make sure to= test predicates which should make sure only supported constants are let thr= ough. > I'd prefer to do that than teach the vectorizer about ADDHN (which can't b= e > easily done now) let alone teaching it about B/T. It also seems somewhat= > unnecessary to diverge the implementation here in the mid-end. After all, > you can generate better SSE code here as well, so focusing on generating I= SA > specific code from here for each ISA seems like the wrong approach to me. >=20 > Thanks, > Tamar >=20 >>=20 >> Thanks, >> Richard