Ping...

I attached a rebased version since there was a merge conflict in the
xordi3 pattern, otherwise the patch is still identical.
It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 early
when the target has no neon or iwmmxt.


Thanks
Bernd.

On 11/28/16 20:42, Bernd Edlinger wrote:
> On 11/25/16 12:30, Ramana Radhakrishnan wrote:
>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger wrote:
>>> Hi!
>>>
>>> This improves the stack usage on the sha512 test case for the case
>>> without hardware fpu and without iwmmxt by splitting all di-mode
>>> patterns right while expanding which is similar to what the
>>> shift-pattern does. It does nothing in the case iwmmxt and fpu=neon
>>> or vfp as well as thumb1.
>>>
>>
>> I would go further and do this in the absence of Neon, the VFP unit
>> being there doesn't help with DImode operations i.e. we do not have 64
>> bit integer arithmetic instructions without Neon. The main reason why
>> we have the DImode patterns split so late is to give a chance for
>> folks who want to do 64 bit arithmetic in Neon a chance to make this
>> work as well as support some of the 64 bit Neon intrinsics which IIRC
>> map down to these instructions. Doing this just for soft-float doesn't
>> improve the default case only. I don't usually test iwmmxt and I'm not
>> sure who has the ability to do so, thus keeping this restriction for
>> iwMMX is fine.
>>
>>
>
> Yes I understand, thanks for pointing that out.
>
> I was not aware that iwmmxt exists at all, but I noticed that most
> 64bit expansions work completely different, and would break if we split
> the pattern early.
>
> I can however only look at the assembler output for iwmmxt, and make
> sure that the stack usage does not get worse.
>
> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as
> it is: around 1570 (thumb1), 2300 (neon) and 2200 (iwmmxt) bytes stack
> for the test cases, and vfp and soft-float at around 270 bytes stack
> usage.
>
>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).
>>>
>>> Note this also splits many ldrd/strd instructions and therefore I will
>>> post a followup-patch that mitigates this effect by enabling the
>>> ldrd/strd peephole optimization after the necessary reg-testing.
>>>
>>>
>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>
>> What do you mean by arm-linux-gnueabihf - when folks say that I
>> interpret it as --with-arch=armv7-a --with-float=hard
>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).
>>
>> If you've really bootstrapped and regtested it on armhf, doesn't this
>> patch as it stands have no effect there i.e. no change ?
>> arm-linux-gnueabihf usually means to me someone has configured with
>> --with-float=hard, so there are no regressions in the hard float ABI
>> case,
>>
>
> I know it proves little. When I say arm-linux-gnueabihf
> I do in fact mean --enable-languages=all,ada,go,obj-c++
> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
> --with-float=hard.
>
> My main interest in the stack usage is of course not because of linux,
> but because of eCos where we have very small task stacks and in fact
> no fpu support by the O/S at all, so that patch is exactly what we need.
>
>
> Bootstrapped and reg-tested on arm-linux-gnueabihf
> Is it OK for trunk?
>
>
> Thanks
> Bernd.
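
For readers unfamiliar with what "splitting a di-mode pattern" amounts to,
here is a rough arithmetic sketch. It is not taken from the patch and does
not model GCC's RTL; the helper names (split_di, add_di_as_si_pair,
xor_di_as_si_pair) are made up for illustration. It only shows that a
64-bit add or xor can be expressed as two 32-bit operations on the low and
high halves, with the add passing a carry from low to high, similar in
spirit to an adds/adc pair on ARM.

```python
MASK32 = 0xFFFFFFFF  # one 32-bit (SImode-sized) word


def split_di(value):
    """Split a 64-bit value into its (low, high) 32-bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def add_di_as_si_pair(a, b):
    """64-bit add expressed as two 32-bit adds with a carry in between."""
    a_lo, a_hi = split_di(a)
    b_lo, b_hi = split_di(b)
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32                      # 0 or 1, carried out of the low word
    lo = lo_sum & MASK32
    hi = (a_hi + b_hi + carry) & MASK32       # wraps at 64 bits overall
    return (hi << 32) | lo


def xor_di_as_si_pair(a, b):
    """64-bit xor expressed as two independent 32-bit xors (no carry needed)."""
    a_lo, a_hi = split_di(a)
    b_lo, b_hi = split_di(b)
    return ((a_hi ^ b_hi) << 32) | (a_lo ^ b_lo)
```

The logical operations need no communication between the halves at all,
so once they are split, each half can be treated as an ordinary 32-bit
value in its own right.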