public inbox for libc-stable@sourceware.org
From: Sunil Pandey <skpgkp2@gmail.com>
To: Noah Goldstein <goldstein.w.n@gmail.com>, libc-stable@sourceware.org
Cc: "H.J. Lu" <hjl.tools@gmail.com>,
	GNU C Library <libc-alpha@sourceware.org>
Subject: Re: [PATCH v8 1/2] x86: Update large memcpy case in memmove-vec-unaligned-erms.S
Date: Wed, 27 Apr 2022 16:46:24 -0700
Message-ID: <CAMAf5_fvZViZ0ea2fFY=jJiniMB0Ec27akL2+sh4f0DnNyGE7A@mail.gmail.com>
In-Reply-To: <CAFUsyfJ-ygcG0m_cvujb0wiuruYt0MyT06eqvZ4H_CN7SG1NOw@mail.gmail.com>

On Fri, Apr 16, 2021 at 12:25 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Fri, Apr 16, 2021 at 1:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Fri, Apr 16, 2021 at 9:35 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > > LGTM.  Please commit it.
> > >
> > > > Are you saying that to me or someone else? If it's to me, what do you
> > > > mean? Is the patch not enough?
> >
> > I will commit it for you.
>
> Thanks! Are you planning on accepting the bench / testing changes as well?
>
> >
> > > > Thanks.
> > >
> > > On Fri, Apr 16, 2021 at 8:59 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > On Sat, Apr 03, 2021 at 04:12:15AM -0400, Noah Goldstein wrote:
> > > > > From: noah <goldstein.w.n@gmail.com>
> > > > >
> > > > > > No Bug. This commit updates the large memcpy case (no overlap). The
> > > > > > update is to perform memcpy on either 2 or 4 contiguous pages at
> > > > > > once. This 1) helps to alleviate the effects of false memory aliasing
> > > > > > when destination and source have a close 4k alignment and 2) in most
> > > > > > cases and for most DRAM units is a modestly more efficient access
> > > > > > pattern. These changes are a clear performance improvement for
> > > > > > VEC_SIZE=16/32, though more ambiguous for VEC_SIZE=64. test-memcpy,
> > > > > > test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all
> > > > > > pass.
> > > > >
> > > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
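
The access order described in the commit message (copying from 2 or 4 contiguous pages at once) can be modelled roughly as below. This is only a sketch of the visiting order, not the assembly from the patch: the function name `interleaved_copy`, the 4 KiB `PAGE_SIZE`, and the 128-byte chunk (four 32-byte vectors) are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def interleaved_copy(dst, src, pages_per_block=2, chunk=128):
    """Copy src into dst, walking `pages_per_block` pages in lockstep.

    Hypothetical model: at each in-page offset, one chunk is copied from
    every page of the current block before the offset advances.
    Returns the list of start offsets in the order they were copied.
    """
    if len(dst) < len(src):
        raise ValueError("dst is smaller than src")
    if PAGE_SIZE % chunk:
        raise ValueError("chunk must divide PAGE_SIZE")

    n = len(src)
    block = pages_per_block * PAGE_SIZE
    full = n - n % block  # bytes covered by whole multi-page blocks
    order = []

    for base in range(0, full, block):
        for off in range(0, PAGE_SIZE, chunk):
            for page in range(pages_per_block):
                start = base + page * PAGE_SIZE + off
                dst[start:start + chunk] = src[start:start + chunk]
                order.append(start)

    # Tail that does not fill a whole block: plain linear copy.
    dst[full:n] = src[full:n]
    return order


src = bytes(range(256)) * 40          # 10240 bytes, i.e. 2.5 pages
dst = bytearray(len(src))
order = interleaved_copy(dst, src)
print(order[:4])                      # first few start offsets visited
```

With two pages per block, consecutive accesses alternate between the two pages instead of walking one page linearly, which is the pattern the commit message credits with reducing 4k aliasing.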
> > > > > ---
> > > > > > Issue was alignment related AFAICT. Added `.p2align 4` in front of the
> > > > > > loops and no longer see any meaningful regression.
> > > > >
> > > > > Also added back the temporal stores for the tail. Saw a regression
> > > > > when doing these tests.
> > > > >
> > > > > > The two tables below give Skylake and Icelake numbers for the areas
> > > > > > around where you saw the regression. Below those is all data from the tests.
> > > > >
> > > > > N = 10.
> > > > >
> > > > > > Skylake:
> > > > > Len         ,align1      ,align2      ,new mean    ,old mean
> > > > > 4103        ,0           ,64          ,84.5        ,88.6
> > > > > 4111        ,0           ,3           ,99.0        ,99.9
> > > > > 4127        ,3           ,0           ,102.1       ,102.3
> > > > > 4159        ,3           ,7           ,88.7        ,90.9
> > > > > 4223        ,9           ,5           ,88.1        ,87.4
> > > > > 8199        ,0           ,64          ,146.7       ,150.2
> > > > > 8207        ,0           ,3           ,167.9       ,168.5
> > > > > 8223        ,3           ,0           ,168.5       ,168.1
> > > > > 8255        ,3           ,7           ,157.0       ,159.2
> > > > > 8319        ,9           ,5           ,155.5       ,155.7
> > > > > 16391       ,0           ,64          ,286.2       ,288.8
> > > > > 16399       ,0           ,3           ,307.0       ,308.7
> > > > > 16415       ,3           ,0           ,307.4       ,307.6
> > > > > 16447       ,3           ,7           ,294.6       ,295.5
> > > > > 16511       ,9           ,5           ,291.5       ,462.1
> > > > > 32775       ,0           ,64          ,603.4       ,601.5
> > > > > 32783       ,0           ,3           ,604.8       ,606.4
> > > > > 32799       ,3           ,0           ,603.0       ,604.1
> > > > > 32831       ,3           ,7           ,600.2       ,737.3
> > > > > 32895       ,9           ,5           ,604.4       ,599.5
> > > > > 65543       ,0           ,64          ,1873.5      ,1854.3
> > > > > 65551       ,0           ,3           ,1862.9      ,1846.6
> > > > > 65567       ,3           ,0           ,1885.5      ,1966.0
> > > > > 65599       ,3           ,7           ,1833.2      ,1833.1
> > > > > 65663       ,9           ,5           ,1884.9      ,1887.4
> > > > > 131079      ,0           ,64          ,3944.3      ,3949.4
> > > > > 131087      ,0           ,3           ,3927.3      ,3913.3
> > > > > 131103      ,3           ,0           ,4415.8      ,4169.4
> > > > > 131135      ,3           ,7           ,4224.5      ,4157.6
> > > > > 131199      ,9           ,5           ,5974.0      ,4983.8
> > > > > 262151      ,0           ,64          ,11050.2     ,10620.6
> > > > > 262159      ,0           ,3           ,9932.8      ,10037.3
> > > > > 262175      ,3           ,0           ,10188.8     ,9206.6
> > > > > 262207      ,3           ,7           ,9633.3      ,9216.7
> > > > > 262271      ,9           ,5           ,9732.7      ,9345.3
> > > > > 524295      ,0           ,64          ,24823.9     ,24880.7
> > > > > 524303      ,0           ,3           ,24514.0     ,24556.7
> > > > > 524319      ,3           ,0           ,23974.4     ,24219.9
> > > > > 524351      ,3           ,7           ,24159.7     ,24207.0
> > > > > 524415      ,9           ,5           ,23946.5     ,24142.8
> > > > >
> > > > > Icelake:
> > > > > Len         ,align1      ,align2      ,new mean    ,old mean
> > > > > 4103        ,0           ,64          ,50.2        ,63.7
> > > > > 4111        ,0           ,3           ,63.7        ,65.1
> > > > > 4127        ,3           ,0           ,68.2        ,69.4
> > > > > 4159        ,3           ,7           ,59.6        ,68.0
> > > > > 4223        ,9           ,5           ,68.2        ,66.8
> > > > > 8199        ,0           ,64          ,92.1        ,89.9
> > > > > 8207        ,0           ,3           ,119.7       ,118.3
> > > > > 8223        ,3           ,0           ,119.1       ,120.9
> > > > > 8255        ,3           ,7           ,122.9       ,123.7
> > > > > 8319        ,9           ,5           ,122.1       ,121.8
> > > > > 16391       ,0           ,64          ,162.7       ,158.0
> > > > > 16399       ,0           ,3           ,227.6       ,234.1
> > > > > 16415       ,3           ,0           ,230.8       ,232.7
> > > > > 16447       ,3           ,7           ,226.8       ,232.6
> > > > > 16511       ,9           ,5           ,233.4       ,233.8
> > > > > 32775       ,0           ,64          ,312.2       ,301.8
> > > > > 32783       ,0           ,3           ,449.7       ,450.0
> > > > > 32799       ,3           ,0           ,452.7       ,455.9
> > > > > 32831       ,3           ,7           ,449.8       ,458.0
> > > > > 32895       ,9           ,5           ,456.3       ,459.4
> > > > > 65543       ,0           ,64          ,1460.6      ,1463.9
> > > > > 65551       ,0           ,3           ,1462.0      ,1465.4
> > > > > 65567       ,3           ,0           ,1466.6      ,1480.4
> > > > > 65599       ,3           ,7           ,1488.0      ,1488.9
> > > > > 65663       ,9           ,5           ,1680.8      ,1499.5
> > > > > 131079      ,0           ,64          ,2988.5      ,3010.1
> > > > > 131087      ,0           ,3           ,2995.5      ,2996.4
> > > > > 131103      ,3           ,0           ,3006.2      ,3000.5
> > > > > 131135      ,3           ,7           ,3032.4      ,3073.7
> > > > > 131199      ,9           ,5           ,3010.4      ,3027.4
> > > > > 262151      ,0           ,64          ,6143.2      ,6079.1
> > > > > 262159      ,0           ,3           ,6085.1      ,6075.8
> > > > > 262175      ,3           ,0           ,6088.0      ,6064.9
> > > > > 262207      ,3           ,7           ,6018.7      ,6023.5
> > > > > 262271      ,9           ,5           ,6019.8      ,5959.2
> > > > > 524295      ,0           ,64          ,14464.2     ,14095.1
> > > > > 524303      ,0           ,3           ,14761.6     ,14050.2
> > > > > 524319      ,3           ,0           ,14534.1     ,14087.5
> > > > > 524351      ,3           ,7           ,14147.7     ,13903.8
> > > > > 524415      ,9           ,5           ,14157.0     ,13982.9
> > > > >
> > > > >
> > > > >
> > > > > cpu         ,version     ,Len         ,align1      ,align2      ,new mean    ,old mean
> > > > > skylake     ,avx         ,4103        ,0           ,64          ,84.5        ,88.6
> > > > > skylake     ,avx         ,4111        ,0           ,3           ,99.0        ,99.9
> > > > > skylake     ,avx         ,4127        ,3           ,0           ,102.1       ,102.3
> > > > > skylake     ,avx         ,4159        ,3           ,7           ,88.7        ,90.9
> > > > > skylake     ,avx         ,4223        ,9           ,5           ,88.1        ,87.4
> > > > > skylake     ,avx         ,8199        ,0           ,64          ,146.7       ,150.2
> > > > > skylake     ,avx         ,8207        ,0           ,3           ,167.9       ,168.5
> > > > > skylake     ,avx         ,8223        ,3           ,0           ,168.5       ,168.1
> > > > > skylake     ,avx         ,8255        ,3           ,7           ,157.0       ,159.2
> > > > > skylake     ,avx         ,8319        ,9           ,5           ,155.5       ,155.7
> > > > > skylake     ,avx         ,16391       ,0           ,64          ,286.2       ,288.8
> > > > > skylake     ,avx         ,16399       ,0           ,3           ,307.0       ,308.7
> > > > > skylake     ,avx         ,16415       ,3           ,0           ,307.4       ,307.6
> > > > > skylake     ,avx         ,16447       ,3           ,7           ,294.6       ,295.5
> > > > > skylake     ,avx         ,16511       ,9           ,5           ,291.5       ,462.1
> > > > > skylake     ,avx         ,32775       ,0           ,64          ,603.4       ,601.5
> > > > > skylake     ,avx         ,32783       ,0           ,3           ,604.8       ,606.4
> > > > > skylake     ,avx         ,32799       ,3           ,0           ,603.0       ,604.1
> > > > > skylake     ,avx         ,32831       ,3           ,7           ,600.2       ,737.3
> > > > > skylake     ,avx         ,32895       ,9           ,5           ,604.4       ,599.5
> > > > > skylake     ,avx         ,65543       ,0           ,64          ,1873.5      ,1854.3
> > > > > skylake     ,avx         ,65551       ,0           ,3           ,1862.9      ,1846.6
> > > > > skylake     ,avx         ,65567       ,3           ,0           ,1885.5      ,1966.0
> > > > > skylake     ,avx         ,65599       ,3           ,7           ,1833.2      ,1833.1
> > > > > skylake     ,avx         ,65663       ,9           ,5           ,1884.9      ,1887.4
> > > > > skylake     ,avx         ,131079      ,0           ,64          ,3944.3      ,3949.4
> > > > > skylake     ,avx         ,131087      ,0           ,3           ,3927.3      ,3913.3
> > > > > skylake     ,avx         ,131103      ,3           ,0           ,4415.8      ,4169.4
> > > > > skylake     ,avx         ,131135      ,3           ,7           ,4224.5      ,4157.6
> > > > > skylake     ,avx         ,131199      ,9           ,5           ,5974.0      ,4983.8
> > > > > skylake     ,avx         ,262151      ,0           ,64          ,11050.2     ,10620.6
> > > > > skylake     ,avx         ,262159      ,0           ,3           ,9932.8      ,10037.3
> > > > > skylake     ,avx         ,262175      ,3           ,0           ,10188.8     ,9206.6
> > > > > skylake     ,avx         ,262207      ,3           ,7           ,9633.3      ,9216.7
> > > > > skylake     ,avx         ,262271      ,9           ,5           ,9732.7      ,9345.3
> > > > > skylake     ,avx         ,524295      ,0           ,64          ,24823.9     ,24880.7
> > > > > skylake     ,avx         ,524303      ,0           ,3           ,24514.0     ,24556.7
> > > > > skylake     ,avx         ,524319      ,3           ,0           ,23974.4     ,24219.9
> > > > > skylake     ,avx         ,524351      ,3           ,7           ,24159.7     ,24207.0
> > > > > skylake     ,avx         ,524415      ,9           ,5           ,23946.5     ,24142.8
> > > > > skylake     ,avx         ,1048583     ,0           ,64          ,49163.9     ,49454.6
> > > > > skylake     ,avx         ,1048591     ,0           ,3           ,49879.3     ,49400.8
> > > > > skylake     ,avx         ,1048607     ,3           ,0           ,49738.0     ,48864.6
> > > > > skylake     ,avx         ,1048639     ,3           ,7           ,48804.0     ,47588.5
> > > > > skylake     ,avx         ,1048703     ,9           ,5           ,49629.4     ,49796.3
> > > > > skylake     ,avx         ,2097159     ,0           ,64          ,98271.7     ,96330.6
> > > > > skylake     ,avx         ,2097167     ,0           ,3           ,97801.8     ,98638.1
> > > > > skylake     ,avx         ,2097183     ,3           ,0           ,98041.1     ,99287.6
> > > > > skylake     ,avx         ,2097215     ,3           ,7           ,96629.5     ,96521.9
> > > > > skylake     ,avx         ,2097279     ,9           ,5           ,98961.8     ,98909.8
> > > > > skylake     ,avx         ,4194311     ,0           ,64          ,194667.7    ,195377.1
> > > > > skylake     ,avx         ,4194319     ,0           ,3           ,194919.5    ,198576.2
> > > > > skylake     ,avx         ,4194335     ,3           ,0           ,192949.8    ,194584.7
> > > > > skylake     ,avx         ,4194367     ,3           ,7           ,189943.5    ,189177.9
> > > > > skylake     ,avx         ,4194431     ,9           ,5           ,192479.1    ,196494.2
> > > > > skylake     ,avx         ,8388615     ,0           ,64          ,588671.6    ,587215.4
> > > > > skylake     ,avx         ,8388623     ,0           ,3           ,581640.7    ,582812.5
> > > > > skylake     ,avx         ,8388639     ,3           ,0           ,549811.9    ,544697.6
> > > > > skylake     ,avx         ,8388671     ,3           ,7           ,591155.0    ,577951.8
> > > > > skylake     ,avx         ,8388735     ,9           ,5           ,547583.2    ,545133.3
> > > > > skylake     ,avx         ,16777223    ,0           ,64          ,1787503.0   ,1811146.0
> > > > > skylake     ,avx         ,16777231    ,0           ,3           ,1758671.0   ,1756343.0
> > > > > skylake     ,avx         ,16777247    ,3           ,0           ,1691781.0   ,1694661.0
> > > > > skylake     ,avx         ,16777279    ,3           ,7           ,1768150.0   ,1754785.0
> > > > > skylake     ,avx         ,16777343    ,9           ,5           ,1695179.0   ,1710794.0
> > > > > skylake     ,sse2        ,4103        ,0           ,64          ,150.8       ,150.5
> > > > > skylake     ,sse2        ,4111        ,0           ,3           ,156.8       ,158.4
> > > > > skylake     ,sse2        ,4127        ,3           ,0           ,99.7        ,99.4
> > > > > skylake     ,sse2        ,4159        ,3           ,7           ,154.8       ,154.5
> > > > > skylake     ,sse2        ,4223        ,9           ,5           ,137.3       ,137.2
> > > > > skylake     ,sse2        ,8199        ,0           ,64          ,284.8       ,285.5
> > > > > skylake     ,sse2        ,8207        ,0           ,3           ,296.0       ,296.1
> > > > > skylake     ,sse2        ,8223        ,3           ,0           ,168.0       ,168.2
> > > > > skylake     ,sse2        ,8255        ,3           ,7           ,293.0       ,292.4
> > > > > skylake     ,sse2        ,8319        ,9           ,5           ,251.3       ,250.7
> > > > > skylake     ,sse2        ,16391       ,0           ,64          ,561.3       ,608.3
> > > > > skylake     ,sse2        ,16399       ,0           ,3           ,571.0       ,574.8
> > > > > skylake     ,sse2        ,16415       ,3           ,0           ,305.4       ,305.0
> > > > > skylake     ,sse2        ,16447       ,3           ,7           ,563.2       ,565.0
> > > > > skylake     ,sse2        ,16511       ,9           ,5           ,477.1       ,475.1
> > > > > skylake     ,sse2        ,32775       ,0           ,64          ,1128.2      ,1131.7
> > > > > skylake     ,sse2        ,32783       ,0           ,3           ,1126.6      ,1131.0
> > > > > skylake     ,sse2        ,32799       ,3           ,0           ,587.6       ,590.8
> > > > > skylake     ,sse2        ,32831       ,3           ,7           ,1130.6      ,1126.2
> > > > > skylake     ,sse2        ,32895       ,9           ,5           ,957.6       ,953.0
> > > > > skylake     ,sse2        ,65543       ,0           ,64          ,2718.9      ,2704.2
> > > > > skylake     ,sse2        ,65551       ,0           ,3           ,2724.1      ,2725.0
> > > > > skylake     ,sse2        ,65567       ,3           ,0           ,1888.4      ,1914.3
> > > > > skylake     ,sse2        ,65599       ,3           ,7           ,2787.6      ,2748.7
> > > > > skylake     ,sse2        ,65663       ,9           ,5           ,2400.5      ,2369.4
> > > > > skylake     ,sse2        ,131079      ,0           ,64          ,5603.3      ,5654.9
> > > > > skylake     ,sse2        ,131087      ,0           ,3           ,5939.3      ,5871.4
> > > > > skylake     ,sse2        ,131103      ,3           ,0           ,4272.4      ,4190.0
> > > > > skylake     ,sse2        ,131135      ,3           ,7           ,7601.4      ,7524.6
> > > > > skylake     ,sse2        ,131199      ,9           ,5           ,7022.1      ,6864.7
> > > > > skylake     ,sse2        ,262151      ,0           ,64          ,13736.2     ,14030.0
> > > > > skylake     ,sse2        ,262159      ,0           ,3           ,12407.3     ,12334.1
> > > > > skylake     ,sse2        ,262175      ,3           ,0           ,9661.1      ,9249.4
> > > > > skylake     ,sse2        ,262207      ,3           ,7           ,12850.2     ,12351.6
> > > > > skylake     ,sse2        ,262271      ,9           ,5           ,10792.6     ,10435.8
> > > > > skylake     ,sse2        ,524295      ,0           ,64          ,27754.5     ,28177.7
> > > > > skylake     ,sse2        ,524303      ,0           ,3           ,27766.2     ,28152.0
> > > > > skylake     ,sse2        ,524319      ,3           ,0           ,24030.9     ,24438.3
> > > > > skylake     ,sse2        ,524351      ,3           ,7           ,27787.5     ,27933.0
> > > > > skylake     ,sse2        ,524415      ,9           ,5           ,24263.2     ,25249.1
> > > > > skylake     ,sse2        ,1048583     ,0           ,64          ,56199.9     ,56039.8
> > > > > skylake     ,sse2        ,1048591     ,0           ,3           ,56750.2     ,58889.7
> > > > > skylake     ,sse2        ,1048607     ,3           ,0           ,56394.0     ,55115.3
> > > > > skylake     ,sse2        ,1048639     ,3           ,7           ,57233.1     ,57473.8
> > > > > skylake     ,sse2        ,1048703     ,9           ,5           ,56324.3     ,55917.9
> > > > > skylake     ,sse2        ,2097159     ,0           ,64          ,113234.8    ,114346.4
> > > > > skylake     ,sse2        ,2097167     ,0           ,3           ,114373.1    ,115522.5
> > > > > skylake     ,sse2        ,2097183     ,3           ,0           ,108113.3    ,108513.3
> > > > > skylake     ,sse2        ,2097215     ,3           ,7           ,116863.6    ,116549.9
> > > > > skylake     ,sse2        ,2097279     ,9           ,5           ,108945.1    ,108843.7
> > > > > skylake     ,sse2        ,4194311     ,0           ,64          ,230250.1    ,232350.0
> > > > > skylake     ,sse2        ,4194319     ,0           ,3           ,231895.3    ,235055.6
> > > > > skylake     ,sse2        ,4194335     ,3           ,0           ,218442.8    ,219199.8
> > > > > skylake     ,sse2        ,4194367     ,3           ,7           ,242564.2    ,235587.7
> > > > > skylake     ,sse2        ,4194431     ,9           ,5           ,224167.4    ,215261.8
> > > > > skylake     ,sse2        ,8388615     ,0           ,64          ,679801.8    ,674832.0
> > > > > skylake     ,sse2        ,8388623     ,0           ,3           ,684913.2    ,685238.7
> > > > > skylake     ,sse2        ,8388639     ,3           ,0           ,644865.4    ,631388.6
> > > > > skylake     ,sse2        ,8388671     ,3           ,7           ,698700.9    ,689316.1
> > > > > skylake     ,sse2        ,8388735     ,9           ,5           ,644820.2    ,631366.8
> > > > > skylake     ,sse2        ,16777223    ,0           ,64          ,1877984.0   ,1876437.0
> > > > > skylake     ,sse2        ,16777231    ,0           ,3           ,1898086.0   ,1913053.0
> > > > > skylake     ,sse2        ,16777247    ,3           ,0           ,1857018.0   ,1866949.0
> > > > > skylake     ,sse2        ,16777279    ,3           ,7           ,1914905.0   ,1897134.0
> > > > > skylake     ,sse2        ,16777343    ,9           ,5           ,1859937.0   ,1881939.0
> > > > > icelake     ,avx512      ,4103        ,0           ,64          ,75.2        ,75.8
> > > > > icelake     ,avx512      ,4111        ,0           ,3           ,56.9        ,56.4
> > > > > icelake     ,avx512      ,4127        ,3           ,0           ,59.1        ,59.6
> > > > > icelake     ,avx512      ,4159        ,3           ,7           ,50.7        ,51.3
> > > > > icelake     ,avx512      ,4223        ,9           ,5           ,59.2        ,58.9
> > > > > icelake     ,avx512      ,8199        ,0           ,64          ,67.8        ,63.9
> > > > > icelake     ,avx512      ,8207        ,0           ,3           ,89.0        ,89.9
> > > > > icelake     ,avx512      ,8223        ,3           ,0           ,90.2        ,90.1
> > > > > icelake     ,avx512      ,8255        ,3           ,7           ,82.6        ,84.9
> > > > > icelake     ,avx512      ,8319        ,9           ,5           ,91.5        ,92.8
> > > > > icelake     ,avx512      ,16391       ,0           ,64          ,118.0       ,117.6
> > > > > icelake     ,avx512      ,16399       ,0           ,3           ,156.5       ,157.0
> > > > > icelake     ,avx512      ,16415       ,3           ,0           ,157.4       ,157.3
> > > > > icelake     ,avx512      ,16447       ,3           ,7           ,151.0       ,151.6
> > > > > icelake     ,avx512      ,16511       ,9           ,5           ,159.1       ,159.6
> > > > > icelake     ,avx512      ,32775       ,0           ,64          ,231.8       ,230.8
> > > > > icelake     ,avx512      ,32783       ,0           ,3           ,297.8       ,299.3
> > > > > icelake     ,avx512      ,32799       ,3           ,0           ,299.1       ,299.0
> > > > > icelake     ,avx512      ,32831       ,3           ,7           ,293.5       ,295.4
> > > > > icelake     ,avx512      ,32895       ,9           ,5           ,300.3       ,302.5
> > > > > icelake     ,avx512      ,65543       ,0           ,64          ,1473.4      ,1479.2
> > > > > icelake     ,avx512      ,65551       ,0           ,3           ,1438.2      ,1445.3
> > > > > icelake     ,avx512      ,65567       ,3           ,0           ,1450.3      ,1463.8
> > > > > icelake     ,avx512      ,65599       ,3           ,7           ,1469.0      ,1473.8
> > > > > icelake     ,avx512      ,65663       ,9           ,5           ,1480.0      ,1483.5
> > > > > icelake     ,avx512      ,131079      ,0           ,64          ,3015.1      ,3037.5
> > > > > icelake     ,avx512      ,131087      ,0           ,3           ,2952.3      ,2960.4
> > > > > icelake     ,avx512      ,131103      ,3           ,0           ,2966.2      ,2964.4
> > > > > icelake     ,avx512      ,131135      ,3           ,7           ,2961.6      ,3047.9
> > > > > icelake     ,avx512      ,131199      ,9           ,5           ,2967.4      ,3183.8
> > > > > icelake     ,avx512      ,262151      ,0           ,64          ,6206.0      ,6141.5
> > > > > icelake     ,avx512      ,262159      ,0           ,3           ,5990.8      ,5959.2
> > > > > icelake     ,avx512      ,262175      ,3           ,0           ,5976.7      ,5963.8
> > > > > icelake     ,avx512      ,262207      ,3           ,7           ,5939.5      ,5924.3
> > > > > icelake     ,avx512      ,262271      ,9           ,5           ,5944.6      ,5990.3
> > > > > icelake     ,avx512      ,524295      ,0           ,64          ,14726.7     ,14307.0
> > > > > icelake     ,avx512      ,524303      ,0           ,3           ,14344.2     ,14040.5
> > > > > icelake     ,avx512      ,524319      ,3           ,0           ,14175.0     ,13862.2
> > > > > icelake     ,avx512      ,524351      ,3           ,7           ,14261.4     ,13821.5
> > > > > icelake     ,avx512      ,524415      ,9           ,5           ,14266.5     ,14064.7
> > > > > icelake     ,avx512      ,1048583     ,0           ,64          ,35211.4     ,35414.6
> > > > > icelake     ,avx512      ,1048591     ,0           ,3           ,35156.8     ,35591.2
> > > > > icelake     ,avx512      ,1048607     ,3           ,0           ,35273.1     ,35503.3
> > > > > icelake     ,avx512      ,1048639     ,3           ,7           ,35255.8     ,35725.0
> > > > > icelake     ,avx512      ,1048703     ,9           ,5           ,35703.6     ,36289.9
> > > > > icelake     ,avx512      ,2097159     ,0           ,64          ,72613.9     ,72063.2
> > > > > icelake     ,avx512      ,2097167     ,0           ,3           ,72301.6     ,73504.2
> > > > > icelake     ,avx512      ,2097183     ,3           ,0           ,73448.8     ,72133.6
> > > > > icelake     ,avx512      ,2097215     ,3           ,7           ,73762.9     ,72825.8
> > > > > icelake     ,avx512      ,2097279     ,9           ,5           ,72097.3     ,72914.6
> > > > > icelake     ,avx512      ,4194311     ,0           ,64          ,144793.4    ,144182.1
> > > > > icelake     ,avx512      ,4194319     ,0           ,3           ,143710.3    ,145063.3
> > > > > icelake     ,avx512      ,4194335     ,3           ,0           ,146722.1    ,144046.4
> > > > > icelake     ,avx512      ,4194367     ,3           ,7           ,144267.0    ,144874.6
> > > > > icelake     ,avx512      ,4194431     ,9           ,5           ,143808.2    ,144560.0
> > > > > icelake     ,avx512      ,8388615     ,0           ,64          ,427993.4    ,424521.5
> > > > > icelake     ,avx512      ,8388623     ,0           ,3           ,470267.1    ,473290.8
> > > > > icelake     ,avx512      ,8388639     ,3           ,0           ,457179.7    ,461797.7
> > > > > icelake     ,avx512      ,8388671     ,3           ,7           ,472507.9    ,481561.4
> > > > > icelake     ,avx512      ,8388735     ,9           ,5           ,463611.9    ,467388.7
> > > > > icelake     ,avx512      ,16777223    ,0           ,64          ,1490426.0   ,1526996.0
> > > > > icelake     ,avx512      ,16777231    ,0           ,3           ,1516687.0   ,1517095.0
> > > > > icelake     ,avx512      ,16777247    ,3           ,0           ,1497688.0   ,1512766.0
> > > > > icelake     ,avx512      ,16777279    ,3           ,7           ,1512331.0   ,1524317.0
> > > > > icelake     ,avx512      ,16777343    ,9           ,5           ,1498908.0   ,1500526.0
> > > > > icelake     ,avx         ,4103        ,0           ,64          ,50.2        ,63.7
> > > > > icelake     ,avx         ,4111        ,0           ,3           ,63.7        ,65.1
> > > > > icelake     ,avx         ,4127        ,3           ,0           ,68.2        ,69.4
> > > > > icelake     ,avx         ,4159        ,3           ,7           ,59.6        ,68.0
> > > > > icelake     ,avx         ,4223        ,9           ,5           ,68.2        ,66.8
> > > > > icelake     ,avx         ,8199        ,0           ,64          ,92.1        ,89.9
> > > > > icelake     ,avx         ,8207        ,0           ,3           ,119.7       ,118.3
> > > > > icelake     ,avx         ,8223        ,3           ,0           ,119.1       ,120.9
> > > > > icelake     ,avx         ,8255        ,3           ,7           ,122.9       ,123.7
> > > > > icelake     ,avx         ,8319        ,9           ,5           ,122.1       ,121.8
> > > > > icelake     ,avx         ,16391       ,0           ,64          ,162.7       ,158.0
> > > > > icelake     ,avx         ,16399       ,0           ,3           ,227.6       ,234.1
> > > > > icelake     ,avx         ,16415       ,3           ,0           ,230.8       ,232.7
> > > > > icelake     ,avx         ,16447       ,3           ,7           ,226.8       ,232.6
> > > > > icelake     ,avx         ,16511       ,9           ,5           ,233.4       ,233.8
> > > > > icelake     ,avx         ,32775       ,0           ,64          ,312.2       ,301.8
> > > > > icelake     ,avx         ,32783       ,0           ,3           ,449.7       ,450.0
> > > > > icelake     ,avx         ,32799       ,3           ,0           ,452.7       ,455.9
> > > > > icelake     ,avx         ,32831       ,3           ,7           ,449.8       ,458.0
> > > > > icelake     ,avx         ,32895       ,9           ,5           ,456.3       ,459.4
> > > > > icelake     ,avx         ,65543       ,0           ,64          ,1460.6      ,1463.9
> > > > > icelake     ,avx         ,65551       ,0           ,3           ,1462.0      ,1465.4
> > > > > icelake     ,avx         ,65567       ,3           ,0           ,1466.6      ,1480.4
> > > > > icelake     ,avx         ,65599       ,3           ,7           ,1488.0      ,1488.9
> > > > > icelake     ,avx         ,65663       ,9           ,5           ,1680.8      ,1499.5
> > > > > icelake     ,avx         ,131079      ,0           ,64          ,2988.5      ,3010.1
> > > > > icelake     ,avx         ,131087      ,0           ,3           ,2995.5      ,2996.4
> > > > > icelake     ,avx         ,131103      ,3           ,0           ,3006.2      ,3000.5
> > > > > icelake     ,avx         ,131135      ,3           ,7           ,3032.4      ,3073.7
> > > > > icelake     ,avx         ,131199      ,9           ,5           ,3010.4      ,3027.4
> > > > > icelake     ,avx         ,262151      ,0           ,64          ,6143.2      ,6079.1
> > > > > icelake     ,avx         ,262159      ,0           ,3           ,6085.1      ,6075.8
> > > > > icelake     ,avx         ,262175      ,3           ,0           ,6088.0      ,6064.9
> > > > > icelake     ,avx         ,262207      ,3           ,7           ,6018.7      ,6023.5
> > > > > icelake     ,avx         ,262271      ,9           ,5           ,6019.8      ,5959.2
> > > > > icelake     ,avx         ,524295      ,0           ,64          ,14464.2     ,14095.1
> > > > > icelake     ,avx         ,524303      ,0           ,3           ,14761.6     ,14050.2
> > > > > icelake     ,avx         ,524319      ,3           ,0           ,14534.1     ,14087.5
> > > > > icelake     ,avx         ,524351      ,3           ,7           ,14147.7     ,13903.8
> > > > > icelake     ,avx         ,524415      ,9           ,5           ,14157.0     ,13982.9
> > > > > icelake     ,avx         ,1048583     ,0           ,64          ,36599.0     ,37461.4
> > > > > icelake     ,avx         ,1048591     ,0           ,3           ,36717.8     ,37454.9
> > > > > icelake     ,avx         ,1048607     ,3           ,0           ,36821.2     ,37343.3
> > > > > icelake     ,avx         ,1048639     ,3           ,7           ,36958.0     ,37507.2
> > > > > icelake     ,avx         ,1048703     ,9           ,5           ,36869.2     ,37413.1
> > > > > icelake     ,avx         ,2097159     ,0           ,64          ,74765.8     ,75330.9
> > > > > icelake     ,avx         ,2097167     ,0           ,3           ,75175.4     ,74891.9
> > > > > icelake     ,avx         ,2097183     ,3           ,0           ,75451.4     ,74787.7
> > > > > icelake     ,avx         ,2097215     ,3           ,7           ,75394.8     ,75839.1
> > > > > icelake     ,avx         ,2097279     ,9           ,5           ,75099.2     ,75421.2
> > > > > icelake     ,avx         ,4194311     ,0           ,64          ,146809.6    ,146619.4
> > > > > icelake     ,avx         ,4194319     ,0           ,3           ,148866.4    ,149898.2
> > > > > icelake     ,avx         ,4194335     ,3           ,0           ,148719.7    ,150165.4
> > > > > icelake     ,avx         ,4194367     ,3           ,7           ,150600.1    ,150925.9
> > > > > icelake     ,avx         ,4194431     ,9           ,5           ,149457.3    ,150519.2
> > > > > icelake     ,avx         ,8388615     ,0           ,64          ,412709.8    ,423666.1
> > > > > icelake     ,avx         ,8388623     ,0           ,3           ,423717.4    ,424418.2
> > > > > icelake     ,avx         ,8388639     ,3           ,0           ,414387.5    ,413445.6
> > > > > icelake     ,avx         ,8388671     ,3           ,7           ,449010.7    ,417553.5
> > > > > icelake     ,avx         ,8388735     ,9           ,5           ,414128.6    ,411815.3
> > > > > icelake     ,avx         ,16777223    ,0           ,64          ,1490032.0   ,1510004.0
> > > > > icelake     ,avx         ,16777231    ,0           ,3           ,1379638.0   ,1422097.0
> > > > > icelake     ,avx         ,16777247    ,3           ,0           ,1418930.0   ,1367557.0
> > > > > icelake     ,avx         ,16777279    ,3           ,7           ,1515152.0   ,1500176.0
> > > > > icelake     ,avx         ,16777343    ,9           ,5           ,1344117.0   ,1411795.0
> > > > > icelake     ,sse2        ,4103        ,0           ,64          ,113.2       ,114.6
> > > > > icelake     ,sse2        ,4111        ,0           ,3           ,121.5       ,120.4
> > > > > icelake     ,sse2        ,4127        ,3           ,0           ,1700.5      ,1771.5
> > > > > icelake     ,sse2        ,4159        ,3           ,7           ,119.3       ,118.8
> > > > > icelake     ,sse2        ,4223        ,9           ,5           ,1739.7      ,1735.2
> > > > > icelake     ,sse2        ,8199        ,0           ,64          ,207.0       ,203.9
> > > > > icelake     ,sse2        ,8207        ,0           ,3           ,225.5       ,220.8
> > > > > icelake     ,sse2        ,8223        ,3           ,0           ,3444.3      ,3743.5
> > > > > icelake     ,sse2        ,8255        ,3           ,7           ,219.9       ,216.8
> > > > > icelake     ,sse2        ,8319        ,9           ,5           ,4117.1      ,3487.3
> > > > > icelake     ,sse2        ,16391       ,0           ,64          ,397.1       ,394.3
> > > > > icelake     ,sse2        ,16399       ,0           ,3           ,439.6       ,428.6
> > > > > icelake     ,sse2        ,16415       ,3           ,0           ,6997.0      ,7031.2
> > > > > icelake     ,sse2        ,16447       ,3           ,7           ,426.8       ,421.8
> > > > > icelake     ,sse2        ,16511       ,9           ,5           ,7037.6      ,7038.3
> > > > > icelake     ,sse2        ,32775       ,0           ,64          ,790.9       ,779.0
> > > > > icelake     ,sse2        ,32783       ,0           ,3           ,863.1       ,849.6
> > > > > icelake     ,sse2        ,32799       ,3           ,0           ,14043.0     ,14390.9
> > > > > icelake     ,sse2        ,32831       ,3           ,7           ,841.6       ,833.1
> > > > > icelake     ,sse2        ,32895       ,9           ,5           ,14277.6     ,14344.2
> > > > > icelake     ,sse2        ,65543       ,0           ,64          ,1897.0      ,1897.3
> > > > > icelake     ,sse2        ,65551       ,0           ,3           ,1927.1      ,1955.4
> > > > > icelake     ,sse2        ,65567       ,3           ,0           ,28834.7     ,28727.8
> > > > > icelake     ,sse2        ,65599       ,3           ,7           ,1961.4      ,1969.7
> > > > > icelake     ,sse2        ,65663       ,9           ,5           ,28867.6     ,29019.8
> > > > > icelake     ,sse2        ,131079      ,0           ,64          ,3879.3      ,3872.6
> > > > > icelake     ,sse2        ,131087      ,0           ,3           ,3955.3      ,3990.7
> > > > > icelake     ,sse2        ,131103      ,3           ,0           ,58001.8     ,60567.9
> > > > > icelake     ,sse2        ,131135      ,3           ,7           ,3951.5      ,4002.6
> > > > > icelake     ,sse2        ,131199      ,9           ,5           ,57886.7     ,58391.4
> > > > > icelake     ,sse2        ,262151      ,0           ,64          ,7851.4      ,7894.7
> > > > > icelake     ,sse2        ,262159      ,0           ,3           ,7947.5      ,8016.2
> > > > > icelake     ,sse2        ,262175      ,3           ,0           ,115036.2    ,115968.6
> > > > > icelake     ,sse2        ,262207      ,3           ,7           ,7883.9      ,7814.1
> > > > > icelake     ,sse2        ,262271      ,9           ,5           ,113776.4    ,119733.6
> > > > > icelake     ,sse2        ,524295      ,0           ,64          ,17198.1     ,16974.9
> > > > > icelake     ,sse2        ,524303      ,0           ,3           ,17402.2     ,17096.3
> > > > > icelake     ,sse2        ,524319      ,3           ,0           ,223980.4    ,225889.9
> > > > > icelake     ,sse2        ,524351      ,3           ,7           ,17034.9     ,16910.3
> > > > > icelake     ,sse2        ,524415      ,9           ,5           ,224027.7    ,224962.5
> > > > > icelake     ,sse2        ,1048583     ,0           ,64          ,38822.3     ,39178.6
> > > > > icelake     ,sse2        ,1048591     ,0           ,3           ,41686.7     ,40247.4
> > > > > icelake     ,sse2        ,1048607     ,3           ,0           ,38814.8     ,39323.3
> > > > > icelake     ,sse2        ,1048639     ,3           ,7           ,39568.3     ,41325.7
> > > > > icelake     ,sse2        ,1048703     ,9           ,5           ,39354.2     ,39637.9
> > > > > icelake     ,sse2        ,2097159     ,0           ,64          ,84074.7     ,84543.1
> > > > > icelake     ,sse2        ,2097167     ,0           ,3           ,83665.7     ,82358.2
> > > > > icelake     ,sse2        ,2097183     ,3           ,0           ,81817.8     ,79638.9
> > > > > icelake     ,sse2        ,2097215     ,3           ,7           ,83649.1     ,83497.6
> > > > > icelake     ,sse2        ,2097279     ,9           ,5           ,80287.6     ,79980.9
> > > > > icelake     ,sse2        ,4194311     ,0           ,64          ,165409.8    ,168343.1
> > > > > icelake     ,sse2        ,4194319     ,0           ,3           ,165216.7    ,177632.0
> > > > > icelake     ,sse2        ,4194335     ,3           ,0           ,158718.7    ,160342.2
> > > > > icelake     ,sse2        ,4194367     ,3           ,7           ,167944.9    ,167204.4
> > > > > icelake     ,sse2        ,4194431     ,9           ,5           ,161530.1    ,164839.7
> > > > > icelake     ,sse2        ,8388615     ,0           ,64          ,626504.3    ,629858.5
> > > > > icelake     ,sse2        ,8388623     ,0           ,3           ,623969.5    ,631509.1
> > > > > icelake     ,sse2        ,8388639     ,3           ,0           ,599366.7    ,600016.0
> > > > > icelake     ,sse2        ,8388671     ,3           ,7           ,619964.2    ,619113.2
> > > > > icelake     ,sse2        ,8388735     ,9           ,5           ,595338.1    ,604172.4
> > > > > icelake     ,sse2        ,16777223    ,0           ,64          ,1709597.0   ,1725184.0
> > > > > icelake     ,sse2        ,16777231    ,0           ,3           ,1725452.0   ,1719746.0
> > > > > icelake     ,sse2        ,16777247    ,3           ,0           ,1614269.0   ,1607164.0
> > > > > icelake     ,sse2        ,16777279    ,3           ,7           ,1705295.0   ,1733018.0
> > > > > icelake     ,sse2        ,16777343    ,9           ,5           ,1604197.0   ,1595690.0
> > > > >
> > > > >
> > > > >  .../multiarch/memmove-vec-unaligned-erms.S    | 338 ++++++++++++++----
> > > > >  1 file changed, 265 insertions(+), 73 deletions(-)
> > > > >
> > > > > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > > > > index 897a3d9762..5e4a071f16 100644
> > > > > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > > > > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > > > > @@ -35,7 +35,16 @@
> > > > >        __x86_rep_movsb_stop_threshold, then REP MOVSB will be used.
> > > > >     7. If size >= __x86_shared_non_temporal_threshold and there is no
> > > > >        overlap between destination and source, use non-temporal store
> > > > > -      instead of aligned store.  */
> > > > > +      instead of aligned store copying from either 2 or 4 pages at
> > > > > +      once.
> > > > > +   8. For point 7) if size < 16 * __x86_shared_non_temporal_threshold
> > > > > +      and source and destination do not page alias, copy from 2 pages
> > > > > +      at once using non-temporal stores. Page aliasing in this case is
> > > > > +      considered true if destination's page alignment - sources' page
> > > > > +      alignment is less than 8 * VEC_SIZE.
> > > > > +   9. If size >= 16 * __x86_shared_non_temporal_threshold or source
> > > > > +      and destination do page alias copy from 4 pages at once using
> > > > > +      non-temporal stores.  */
> > > > >
> > > > >  #include <sysdep.h>
> > > > >
> > > > > @@ -67,6 +76,34 @@
> > > > >  # endif
> > > > >  #endif
> > > > >
> > > > > +#ifndef PAGE_SIZE
> > > > > +# define PAGE_SIZE 4096
> > > > > +#endif
> > > > > +
> > > > > +#if PAGE_SIZE != 4096
> > > > > +# error Unsupported PAGE_SIZE
> > > > > +#endif
> > > > > +
> > > > > +#ifndef LOG_PAGE_SIZE
> > > > > +# define LOG_PAGE_SIZE 12
> > > > > +#endif
> > > > > +
> > > > > +#if PAGE_SIZE != (1 << LOG_PAGE_SIZE)
> > > > > +# error Invalid LOG_PAGE_SIZE
> > > > > +#endif
> > > > > +
> > > > > > +/* Bytes per page for large_memcpy inner loop.  */
> > > > > +#if VEC_SIZE == 64
> > > > > +# define LARGE_LOAD_SIZE (VEC_SIZE * 2)
> > > > > +#else
> > > > > +# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> > > > > +#endif
> > > > > +
> > > > > +/* Amount to shift rdx by to compare for memcpy_large_4x.  */
> > > > > +#ifndef LOG_4X_MEMCPY_THRESH
> > > > > +# define LOG_4X_MEMCPY_THRESH 4
> > > > > +#endif
> > > > > +
> > > > >  /* Avoid short distance rep movsb only with non-SSE vector.  */
> > > > >  #ifndef AVOID_SHORT_DISTANCE_REP_MOVSB
> > > > >  # define AVOID_SHORT_DISTANCE_REP_MOVSB (VEC_SIZE > 16)
> > > > > @@ -106,6 +143,28 @@
> > > > >  # error Unsupported PREFETCH_SIZE!
> > > > >  #endif
> > > > >
> > > > > +#if LARGE_LOAD_SIZE == (VEC_SIZE * 2)
> > > > > +# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \
> > > > > +     VMOVU   (offset)base, vec0; \
> > > > > +     VMOVU   ((offset) + VEC_SIZE)base, vec1;
> > > > > +# define STORE_ONE_SET(base, offset, vec0, vec1, ...) \
> > > > > +     VMOVNT  vec0, (offset)base; \
> > > > > +     VMOVNT  vec1, ((offset) + VEC_SIZE)base;
> > > > > +#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4)
> > > > > +# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
> > > > > +     VMOVU   (offset)base, vec0; \
> > > > > +     VMOVU   ((offset) + VEC_SIZE)base, vec1; \
> > > > > +     VMOVU   ((offset) + VEC_SIZE * 2)base, vec2; \
> > > > > +     VMOVU   ((offset) + VEC_SIZE * 3)base, vec3;
> > > > > +# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
> > > > > +     VMOVNT  vec0, (offset)base; \
> > > > > +     VMOVNT  vec1, ((offset) + VEC_SIZE)base; \
> > > > > +     VMOVNT  vec2, ((offset) + VEC_SIZE * 2)base; \
> > > > > +     VMOVNT  vec3, ((offset) + VEC_SIZE * 3)base;
> > > > > +#else
> > > > > +# error Invalid LARGE_LOAD_SIZE
> > > > > +#endif
> > > > > +
> > > > >  #ifndef SECTION
> > > > >  # error SECTION is not defined!
> > > > >  #endif
> > > > > @@ -393,6 +452,15 @@ L(last_4x_vec):
> > > > >       VZEROUPPER_RETURN
> > > > >
> > > > >  L(more_8x_vec):
> > > > > +     /* Check if non-temporal move candidate.  */
> > > > > +#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> > > > > +     /* Check non-temporal store threshold.  */
> > > > > +     cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > > > +     ja      L(large_memcpy_2x)
> > > > > +#endif
> > > > > +     /* Entry if rdx is greater than non-temporal threshold but there
> > > > > +       is overlap.  */
> > > > > +L(more_8x_vec_check):
> > > > >       cmpq    %rsi, %rdi
> > > > >       ja      L(more_8x_vec_backward)
> > > > >       /* Source == destination is less common.  */
> > > > > @@ -419,24 +487,21 @@ L(more_8x_vec):
> > > > >       subq    %r8, %rdi
> > > > >       /* Adjust length.  */
> > > > >       addq    %r8, %rdx
> > > > > -#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> > > > > -     /* Check non-temporal store threshold.  */
> > > > > -     cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > > > -     ja      L(large_forward)
> > > > > -#endif
> > > > > +
> > > > > +     .p2align 4
> > > > >  L(loop_4x_vec_forward):
> > > > >       /* Copy 4 * VEC a time forward.  */
> > > > >       VMOVU   (%rsi), %VEC(0)
> > > > >       VMOVU   VEC_SIZE(%rsi), %VEC(1)
> > > > >       VMOVU   (VEC_SIZE * 2)(%rsi), %VEC(2)
> > > > >       VMOVU   (VEC_SIZE * 3)(%rsi), %VEC(3)
> > > > > -     addq    $(VEC_SIZE * 4), %rsi
> > > > > -     subq    $(VEC_SIZE * 4), %rdx
> > > > > +     subq    $-(VEC_SIZE * 4), %rsi
> > > > > +     addq    $-(VEC_SIZE * 4), %rdx
> > > > >       VMOVA   %VEC(0), (%rdi)
> > > > >       VMOVA   %VEC(1), VEC_SIZE(%rdi)
> > > > >       VMOVA   %VEC(2), (VEC_SIZE * 2)(%rdi)
> > > > >       VMOVA   %VEC(3), (VEC_SIZE * 3)(%rdi)
> > > > > -     addq    $(VEC_SIZE * 4), %rdi
> > > > > +     subq    $-(VEC_SIZE * 4), %rdi
> > > > >       cmpq    $(VEC_SIZE * 4), %rdx
> > > > >       ja      L(loop_4x_vec_forward)
> > > > >       /* Store the last 4 * VEC.  */
> > > > > @@ -470,24 +535,21 @@ L(more_8x_vec_backward):
> > > > >       subq    %r8, %r9
> > > > >       /* Adjust length.  */
> > > > >       subq    %r8, %rdx
> > > > > -#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> > > > > -     /* Check non-temporal store threshold.  */
> > > > > -     cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > > > -     ja      L(large_backward)
> > > > > -#endif
> > > > > +
> > > > > +     .p2align 4
> > > > >  L(loop_4x_vec_backward):
> > > > >       /* Copy 4 * VEC a time backward.  */
> > > > >       VMOVU   (%rcx), %VEC(0)
> > > > >       VMOVU   -VEC_SIZE(%rcx), %VEC(1)
> > > > >       VMOVU   -(VEC_SIZE * 2)(%rcx), %VEC(2)
> > > > >       VMOVU   -(VEC_SIZE * 3)(%rcx), %VEC(3)
> > > > > -     subq    $(VEC_SIZE * 4), %rcx
> > > > > -     subq    $(VEC_SIZE * 4), %rdx
> > > > > +     addq    $-(VEC_SIZE * 4), %rcx
> > > > > +     addq    $-(VEC_SIZE * 4), %rdx
> > > > >       VMOVA   %VEC(0), (%r9)
> > > > >       VMOVA   %VEC(1), -VEC_SIZE(%r9)
> > > > >       VMOVA   %VEC(2), -(VEC_SIZE * 2)(%r9)
> > > > >       VMOVA   %VEC(3), -(VEC_SIZE * 3)(%r9)
> > > > > -     subq    $(VEC_SIZE * 4), %r9
> > > > > +     addq    $-(VEC_SIZE * 4), %r9
> > > > >       cmpq    $(VEC_SIZE * 4), %rdx
> > > > >       ja      L(loop_4x_vec_backward)
> > > > >       /* Store the first 4 * VEC.  */
> > > > > @@ -500,72 +562,202 @@ L(loop_4x_vec_backward):
> > > > >       VZEROUPPER_RETURN
> > > > >
> > > > >  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> > > > > -L(large_forward):
> > > > > +     .p2align 4
> > > > > +L(large_memcpy_2x):
> > > > > +     /* Compute absolute value of difference between source and
> > > > > +        destination.  */
> > > > > +     movq    %rdi, %r9
> > > > > +     subq    %rsi, %r9
> > > > > +     movq    %r9, %r8
> > > > > +     leaq    -1(%r9), %rcx
> > > > > +     sarq    $63, %r8
> > > > > +     xorq    %r8, %r9
> > > > > +     subq    %r8, %r9
> > > > >       /* Don't use non-temporal store if there is overlap between
> > > > > -        destination and source since destination may be in cache
> > > > > -        when source is loaded.  */
> > > > > -     leaq    (%rdi, %rdx), %r10
> > > > > -     cmpq    %r10, %rsi
> > > > > -     jb      L(loop_4x_vec_forward)
> > > > > -L(loop_large_forward):
> > > > > +        destination and source since destination may be in cache when
> > > > > +        source is loaded.  */
> > > > > +     cmpq    %r9, %rdx
> > > > > +     ja      L(more_8x_vec_check)
> > > > > +
> > > > > +     /* Cache align destination. First store the first 64 bytes then
> > > > > +        adjust alignments.  */
> > > > > +     VMOVU   (%rsi), %VEC(8)
> > > > > +#if VEC_SIZE < 64
> > > > > +     VMOVU   VEC_SIZE(%rsi), %VEC(9)
> > > > > +#if VEC_SIZE < 32
> > > > > +     VMOVU   (VEC_SIZE * 2)(%rsi), %VEC(10)
> > > > > +     VMOVU   (VEC_SIZE * 3)(%rsi), %VEC(11)
> > > > > +#endif
> > > > > +#endif
> > > > > +     VMOVU   %VEC(8), (%rdi)
> > > > > +#if VEC_SIZE < 64
> > > > > +     VMOVU   %VEC(9), VEC_SIZE(%rdi)
> > > > > +#if VEC_SIZE < 32
> > > > > +     VMOVU   %VEC(10), (VEC_SIZE * 2)(%rdi)
> > > > > +     VMOVU   %VEC(11), (VEC_SIZE * 3)(%rdi)
> > > > > +#endif
> > > > > +#endif
> > > > > +     /* Adjust source, destination, and size.  */
> > > > > +     movq    %rdi, %r8
> > > > > +     andq    $63, %r8
> > > > > +     /* Get the negative of offset for alignment.  */
> > > > > +     subq    $64, %r8
> > > > > +     /* Adjust source.  */
> > > > > +     subq    %r8, %rsi
> > > > > +     /* Adjust destination which should be aligned now.  */
> > > > > +     subq    %r8, %rdi
> > > > > +     /* Adjust length.  */
> > > > > +     addq    %r8, %rdx
> > > > > +
> > > > > +     /* Test if source and destination addresses will alias. If they do
> > > > > +        the larger pipeline in large_memcpy_4x alleviated the
> > > > > +        performance drop.  */
> > > > > +     testl   $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> > > > > +     jz      L(large_memcpy_4x)
> > > > > +
> > > > > +     movq    %rdx, %r10
> > > > > +     shrq    $LOG_4X_MEMCPY_THRESH, %r10
> > > > > +     cmp     __x86_shared_non_temporal_threshold(%rip), %r10
> > > > > +     jae     L(large_memcpy_4x)
> > > > > +
> > > > > +     /* edx will store remainder size for copying tail.  */
> > > > > +     andl    $(PAGE_SIZE * 2 - 1), %edx
> > > > > +     /* r10 stores outer loop counter.  */
> > > > > +     shrq    $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> > > > > +     /* Copy 4x VEC at a time from 2 pages.  */
> > > > > +     .p2align 4
> > > > > +L(loop_large_memcpy_2x_outer):
> > > > > +     /* ecx stores inner loop counter.  */
> > > > > +     movl    $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
> > > > > +L(loop_large_memcpy_2x_inner):
> > > > > +     PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
> > > > > +     PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
> > > > > +     PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
> > > > > +     PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2)
> > > > > +     /* Load vectors from rsi.  */
> > > > > +     LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
> > > > > +     LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
> > > > > +     subq    $-LARGE_LOAD_SIZE, %rsi
> > > > > +     /* Non-temporal store vectors to rdi.  */
> > > > > +     STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
> > > > > +     STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
> > > > > +     subq    $-LARGE_LOAD_SIZE, %rdi
> > > > > +     decl    %ecx
> > > > > +     jnz     L(loop_large_memcpy_2x_inner)
> > > > > +     addq    $PAGE_SIZE, %rdi
> > > > > +     addq    $PAGE_SIZE, %rsi
> > > > > +     decq    %r10
> > > > > +     jne     L(loop_large_memcpy_2x_outer)
> > > > > +     sfence
> > > > > +
> > > > > +     /* Check if only last 4 loads are needed.  */
> > > > > +     cmpl    $(VEC_SIZE * 4), %edx
> > > > > +     jbe     L(large_memcpy_2x_end)
> > > > > +
> > > > > +     /* Handle the last 2 * PAGE_SIZE bytes.  */
> > > > > +L(loop_large_memcpy_2x_tail):
> > > > >       /* Copy 4 * VEC a time forward with non-temporal stores.  */
> > > > > -     PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
> > > > > -     PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 3)
> > > > > +     PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
> > > > > +     PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
> > > > >       VMOVU   (%rsi), %VEC(0)
> > > > >       VMOVU   VEC_SIZE(%rsi), %VEC(1)
> > > > >       VMOVU   (VEC_SIZE * 2)(%rsi), %VEC(2)
> > > > >       VMOVU   (VEC_SIZE * 3)(%rsi), %VEC(3)
> > > > > -     addq    $PREFETCHED_LOAD_SIZE, %rsi
> > > > > -     subq    $PREFETCHED_LOAD_SIZE, %rdx
> > > > > -     VMOVNT  %VEC(0), (%rdi)
> > > > > -     VMOVNT  %VEC(1), VEC_SIZE(%rdi)
> > > > > -     VMOVNT  %VEC(2), (VEC_SIZE * 2)(%rdi)
> > > > > -     VMOVNT  %VEC(3), (VEC_SIZE * 3)(%rdi)
> > > > > -     addq    $PREFETCHED_LOAD_SIZE, %rdi
> > > > > -     cmpq    $PREFETCHED_LOAD_SIZE, %rdx
> > > > > -     ja      L(loop_large_forward)
> > > > > -     sfence
> > > > > +     subq    $-(VEC_SIZE * 4), %rsi
> > > > > +     addl    $-(VEC_SIZE * 4), %edx
> > > > > +     VMOVA   %VEC(0), (%rdi)
> > > > > +     VMOVA   %VEC(1), VEC_SIZE(%rdi)
> > > > > +     VMOVA   %VEC(2), (VEC_SIZE * 2)(%rdi)
> > > > > +     VMOVA   %VEC(3), (VEC_SIZE * 3)(%rdi)
> > > > > +     subq    $-(VEC_SIZE * 4), %rdi
> > > > > +     cmpl    $(VEC_SIZE * 4), %edx
> > > > > +     ja      L(loop_large_memcpy_2x_tail)
> > > > > +
> > > > > +L(large_memcpy_2x_end):
> > > > >       /* Store the last 4 * VEC.  */
> > > > > -     VMOVU   %VEC(5), (%rcx)
> > > > > -     VMOVU   %VEC(6), -VEC_SIZE(%rcx)
> > > > > -     VMOVU   %VEC(7), -(VEC_SIZE * 2)(%rcx)
> > > > > -     VMOVU   %VEC(8), -(VEC_SIZE * 3)(%rcx)
> > > > > -     /* Store the first VEC.  */
> > > > > -     VMOVU   %VEC(4), (%r11)
> > > > > +     VMOVU   -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
> > > > > +     VMOVU   -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
> > > > > +     VMOVU   -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
> > > > > +     VMOVU   -VEC_SIZE(%rsi, %rdx), %VEC(3)
> > > > > +
> > > > > +     VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
> > > > > +     VMOVU   %VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
> > > > > +     VMOVU   %VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
> > > > > +     VMOVU   %VEC(3), -VEC_SIZE(%rdi, %rdx)
> > > > >       VZEROUPPER_RETURN
> > > > >
> > > > > -L(large_backward):
> > > > > -     /* Don't use non-temporal store if there is overlap between
> > > > > -        destination and source since destination may be in cache
> > > > > -        when source is loaded.  */
> > > > > -     leaq    (%rcx, %rdx), %r10
> > > > > -     cmpq    %r10, %r9
> > > > > -     jb      L(loop_4x_vec_backward)
> > > > > -L(loop_large_backward):
> > > > > -     /* Copy 4 * VEC a time backward with non-temporal stores.  */
> > > > > -     PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 2)
> > > > > -     PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 3)
> > > > > -     VMOVU   (%rcx), %VEC(0)
> > > > > -     VMOVU   -VEC_SIZE(%rcx), %VEC(1)
> > > > > -     VMOVU   -(VEC_SIZE * 2)(%rcx), %VEC(2)
> > > > > -     VMOVU   -(VEC_SIZE * 3)(%rcx), %VEC(3)
> > > > > -     subq    $PREFETCHED_LOAD_SIZE, %rcx
> > > > > -     subq    $PREFETCHED_LOAD_SIZE, %rdx
> > > > > -     VMOVNT  %VEC(0), (%r9)
> > > > > -     VMOVNT  %VEC(1), -VEC_SIZE(%r9)
> > > > > -     VMOVNT  %VEC(2), -(VEC_SIZE * 2)(%r9)
> > > > > -     VMOVNT  %VEC(3), -(VEC_SIZE * 3)(%r9)
> > > > > -     subq    $PREFETCHED_LOAD_SIZE, %r9
> > > > > -     cmpq    $PREFETCHED_LOAD_SIZE, %rdx
> > > > > -     ja      L(loop_large_backward)
> > > > > +     .p2align 4
> > > > > +L(large_memcpy_4x):
> > > > > +     movq    %rdx, %r10
> > > > > +     /* edx will store remainder size for copying tail.  */
> > > > > +     andl    $(PAGE_SIZE * 4 - 1), %edx
> > > > > +     /* r10 stores outer loop counter.  */
> > > > > +     shrq    $(LOG_PAGE_SIZE + 2), %r10
> > > > > +     /* Copy 4x VEC at a time from 4 pages.  */
> > > > > +     .p2align 4
> > > > > +L(loop_large_memcpy_4x_outer):
> > > > > +     /* ecx stores inner loop counter.  */
> > > > > +     movl    $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
> > > > > +L(loop_large_memcpy_4x_inner):
> > > > > +     /* Only one prefetch set per page as doing 4 pages give more time
> > > > > +        for prefetcher to keep up.  */
> > > > > +     PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
> > > > > +     PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
> > > > > +     PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE)
> > > > > +     PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE)
> > > > > +     /* Load vectors from rsi.  */
> > > > > +     LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
> > > > > +     LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
> > > > > +     LOAD_ONE_SET((%rsi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
> > > > > +     LOAD_ONE_SET((%rsi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
> > > > > +     subq    $-LARGE_LOAD_SIZE, %rsi
> > > > > +     /* Non-temporal store vectors to rdi.  */
> > > > > +     STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
> > > > > +     STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
> > > > > +     STORE_ONE_SET((%rdi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
> > > > > +     STORE_ONE_SET((%rdi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
> > > > > +     subq    $-LARGE_LOAD_SIZE, %rdi
> > > > > +     decl    %ecx
> > > > > +     jnz     L(loop_large_memcpy_4x_inner)
> > > > > +     addq    $(PAGE_SIZE * 3), %rdi
> > > > > +     addq    $(PAGE_SIZE * 3), %rsi
> > > > > +     decq    %r10
> > > > > +     jne     L(loop_large_memcpy_4x_outer)
> > > > >       sfence
> > > > > -     /* Store the first 4 * VEC.  */
> > > > > -     VMOVU   %VEC(4), (%rdi)
> > > > > -     VMOVU   %VEC(5), VEC_SIZE(%rdi)
> > > > > -     VMOVU   %VEC(6), (VEC_SIZE * 2)(%rdi)
> > > > > -     VMOVU   %VEC(7), (VEC_SIZE * 3)(%rdi)
> > > > > -     /* Store the last VEC.  */
> > > > > -     VMOVU   %VEC(8), (%r11)
> > > > > +     /* Check if only last 4 loads are needed.  */
> > > > > +     cmpl    $(VEC_SIZE * 4), %edx
> > > > > +     jbe     L(large_memcpy_4x_end)
> > > > > +
> > > > > +     /* Handle the last 4  * PAGE_SIZE bytes.  */
> > > > > +L(loop_large_memcpy_4x_tail):
> > > > > +     /* Copy 4 * VEC a time forward with non-temporal stores.  */
> > > > > +     PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
> > > > > +     PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
> > > > > +     VMOVU   (%rsi), %VEC(0)
> > > > > +     VMOVU   VEC_SIZE(%rsi), %VEC(1)
> > > > > +     VMOVU   (VEC_SIZE * 2)(%rsi), %VEC(2)
> > > > > +     VMOVU   (VEC_SIZE * 3)(%rsi), %VEC(3)
> > > > > +     subq    $-(VEC_SIZE * 4), %rsi
> > > > > +     addl    $-(VEC_SIZE * 4), %edx
> > > > > +     VMOVA   %VEC(0), (%rdi)
> > > > > +     VMOVA   %VEC(1), VEC_SIZE(%rdi)
> > > > > +     VMOVA   %VEC(2), (VEC_SIZE * 2)(%rdi)
> > > > > +     VMOVA   %VEC(3), (VEC_SIZE * 3)(%rdi)
> > > > > +     subq    $-(VEC_SIZE * 4), %rdi
> > > > > +     cmpl    $(VEC_SIZE * 4), %edx
> > > > > +     ja      L(loop_large_memcpy_4x_tail)
> > > > > +
> > > > > +L(large_memcpy_4x_end):
> > > > > +     /* Store the last 4 * VEC.  */
> > > > > +     VMOVU   -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
> > > > > +     VMOVU   -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
> > > > > +     VMOVU   -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
> > > > > +     VMOVU   -VEC_SIZE(%rsi, %rdx), %VEC(3)
> > > > > +
> > > > > +     VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
> > > > > +     VMOVU   %VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
> > > > > +     VMOVU   %VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
> > > > > +     VMOVU   %VEC(3), -VEC_SIZE(%rdi, %rdx)
> > > > >       VZEROUPPER_RETURN
> > > > >  #endif
> > > > >  END (MEMMOVE_SYMBOL (__memmove, unaligned_erms))
> > > > > --
> > > > > 2.29.2
> > > > >
> > > >
> > > > LGTM.  Please commit it.
> > > >
> > > > Thanks.
> > > >
> > > >
> > > > H.J.
> >
> >
> >
> > --
> > H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil
