public inbox for libc-alpha@sourceware.org
* [PATCH 0/2] Enable EVEX strcmp
@ 2021-11-01 12:54 H.J. Lu
  2021-11-01 12:54 ` [PATCH 1/2] x86-64: Improve EVEX strcmp with masked load H.J. Lu
  2021-11-01 12:54 ` [PATCH 2/2] x86-64: Remove Prefer_AVX2_STRCMP H.J. Lu
  0 siblings, 2 replies; 5+ messages in thread
From: H.J. Lu @ 2021-11-01 12:54 UTC (permalink / raw)
  To: libc-alpha

Remove Prefer_AVX2_STRCMP to enable EVEX strcmp.  When comparing two
32-byte strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM,
1 VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL, while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL.  EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
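
For reference, here is a scalar C model of the new check (a sketch only,
not part of the patch; the function name and test strings are made up and
the wcscmp variant is ignored):

/* Model of the new sequence: VPTESTM builds a mask of non-null bytes
   (k2); the masked VPCMP sets a k1 bit only where the byte is non-null
   AND equal, so a cleared bit marks a mismatch or a null terminator.  */
#include <stdint.h>
#include <stdio.h>

static int
first_diff_or_null (const unsigned char *a, const unsigned char *b)
{
  uint32_t k1 = 0;
  for (int i = 0; i < 32; i++)
    {
      uint32_t nonnull = a[i] != 0;          /* VPTESTM: k2 bit */
      uint32_t equal = a[i] == b[i];         /* VPCMP $0 against (%rsi) */
      k1 |= (nonnull & equal) << i;          /* masked result in k1{k2} */
    }
  uint32_t ecx = k1 + 1;        /* incl %ecx: wraps to 0 iff all 32 bits set */
  if (ecx == 0)
    return -1;                  /* je: no mismatch and no null in 32 bytes */
  return __builtin_ctz (ecx);   /* tzcntl: index of first mismatch or null */
}

int
main (void)
{
  unsigned char x[32] = "abcdef", y[32] = "abcxef";
  printf ("%d\n", first_diff_or_null (x, y));   /* prints 3 */
  return 0;
}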

bench-strcmp data on Tiger Lake:

Function: strcmp
Variant: default
                                    __strcmp_avx2	__strcmp_evex
=======================================================================
        length=1, align1=1, align2=1:        23.69	       25.56	
        length=1, align1=1, align2=1:        24.62	       23.43	
        length=1, align1=1, align2=1:        23.87	       23.43	
        length=2, align1=2, align2=2:         6.82	        6.61	
        length=2, align1=2, align2=2:         5.38	        5.98	
        length=2, align1=2, align2=2:         6.86	        6.85	
        length=3, align1=3, align2=3:         6.85	        6.86	
        length=3, align1=3, align2=3:         5.98	        5.98	
        length=3, align1=3, align2=3:         5.98	        6.10	
        length=4, align1=4, align2=4:         6.58	        5.98	
        length=4, align1=4, align2=4:         6.37	        5.98	
        length=4, align1=4, align2=4:         6.58	        5.98	
        length=5, align1=5, align2=5:         5.98	        5.98	
        length=5, align1=5, align2=5:         6.06	        6.82	
        length=5, align1=5, align2=5:         5.98	        5.98	
        length=6, align1=6, align2=6:         6.58	        5.98	
        length=6, align1=6, align2=6:         6.58	        6.06	
        length=6, align1=6, align2=6:         5.98	        5.98	
        length=7, align1=7, align2=7:         5.98	        5.98	
        length=7, align1=7, align2=7:         5.98	        6.05	
        length=7, align1=7, align2=7:         5.98	        5.98	
        length=8, align1=8, align2=8:         5.38	        5.38	
        length=8, align1=8, align2=8:         5.98	        5.38	
        length=8, align1=8, align2=8:         5.98	        5.38	
        length=9, align1=9, align2=9:         5.38	        5.38	
        length=9, align1=9, align2=9:         5.38	        5.38	
        length=9, align1=9, align2=9:         4.78	        5.38	
     length=10, align1=10, align2=10:         6.05	        5.40	
     length=10, align1=10, align2=10:         5.38	        5.38	
     length=10, align1=10, align2=10:         5.38	        5.38	
     length=11, align1=11, align2=11:         4.78	        5.38	
     length=11, align1=11, align2=11:         4.78	        5.38	
     length=11, align1=11, align2=11:         4.78	        5.38	
     length=12, align1=12, align2=12:         4.86	        5.38	
     length=12, align1=12, align2=12:         5.98	        5.38	
     length=12, align1=12, align2=12:         5.98	        5.38	
     length=13, align1=13, align2=13:         5.98	        5.38	
     length=13, align1=13, align2=13:         4.78	        5.38	
     length=13, align1=13, align2=13:         4.78	        5.38	
     length=14, align1=14, align2=14:         5.98	        5.38	
     length=14, align1=14, align2=14:         5.47	        5.38	
     length=14, align1=14, align2=14:         5.38	        5.38	
     length=15, align1=15, align2=15:         5.38	        5.38	
     length=15, align1=15, align2=15:         5.98	        5.38	
     length=15, align1=15, align2=15:         6.05	        5.38	
     length=16, align1=16, align2=16:         4.79	        4.79	
     length=16, align1=16, align2=16:         4.78	        4.78	
     length=16, align1=16, align2=16:         5.38	        4.79	
     length=17, align1=17, align2=17:         6.58	        7.18	
     length=17, align1=17, align2=17:         6.58	        7.18	
     length=17, align1=17, align2=17:         6.58	        7.20	
     length=18, align1=18, align2=18:         6.58	        7.20	
     length=18, align1=18, align2=18:         6.58	        7.20	
     length=18, align1=18, align2=18:         6.58	        7.20	
     length=19, align1=19, align2=19:         6.58	        7.20	
     length=19, align1=19, align2=19:         6.58	        7.18	
     length=19, align1=19, align2=19:         6.58	        7.20	
     length=20, align1=20, align2=20:         6.58	        7.18	
     length=20, align1=20, align2=20:         6.58	        7.17	
     length=20, align1=20, align2=20:         6.58	        7.18	
     length=21, align1=21, align2=21:         6.58	        7.07	
     length=21, align1=21, align2=21:         7.18	        5.98	
     length=21, align1=21, align2=21:         7.18	        5.98	
     length=22, align1=22, align2=22:         6.58	        5.98	
     length=22, align1=22, align2=22:         7.18	        5.98	
     length=22, align1=22, align2=22:         7.18	        6.06	
     length=23, align1=23, align2=23:         6.58	        5.98	
     length=23, align1=23, align2=23:         6.58	        5.98	
     length=23, align1=23, align2=23:         6.58	        5.98	
     length=24, align1=24, align2=24:         4.86	        4.79	
     length=24, align1=24, align2=24:         5.38	        4.79	
     length=24, align1=24, align2=24:         5.38	        4.79	
     length=25, align1=25, align2=25:         4.78	        4.79	
     length=25, align1=25, align2=25:         5.38	        4.79	
     length=25, align1=25, align2=25:         5.38	        4.78	
     length=26, align1=26, align2=26:         5.46	        4.78	
     length=26, align1=26, align2=26:         5.38	        4.79	
     length=26, align1=26, align2=26:         5.38	        4.78	
     length=27, align1=27, align2=27:         4.78	        4.79	
     length=27, align1=27, align2=27:         4.78	        4.78	
     length=27, align1=27, align2=27:         4.78	        4.79	
     length=28, align1=28, align2=28:         5.38	        4.79	
     length=28, align1=28, align2=28:         4.78	        4.79	
     length=28, align1=28, align2=28:         5.38	        4.78	
     length=29, align1=29, align2=29:         4.78	        4.79	
     length=29, align1=29, align2=29:         5.38	        4.78	
     length=29, align1=29, align2=29:         4.78	        4.79	
     length=30, align1=30, align2=30:         4.78	        4.86	
     length=30, align1=30, align2=30:         5.38	        4.79	
     length=30, align1=30, align2=30:         4.78	        4.79	
     length=31, align1=31, align2=31:         4.78	        4.86	
     length=31, align1=31, align2=31:         5.38	        4.78	
     length=31, align1=31, align2=31:         5.38	        4.78	
        length=4, align1=0, align2=0:         6.00	        5.39	
        length=4, align1=0, align2=0:         6.00	        5.38	
        length=4, align1=0, align2=0:         6.00	        5.38	
        length=4, align1=0, align2=0:         5.98	        5.38	
        length=4, align1=0, align2=0:         6.02	        5.38	
        length=4, align1=0, align2=0:         5.98	        5.38	
        length=4, align1=0, align2=1:         5.98	        5.98	
        length=4, align1=1, align2=2:         5.38	        5.98	
        length=8, align1=0, align2=0:         5.98	        5.38	
        length=8, align1=0, align2=0:         6.02	        5.38	
        length=8, align1=0, align2=0:         6.00	        5.38	
        length=8, align1=0, align2=0:         6.00	        5.38	
        length=8, align1=0, align2=0:         6.02	        5.38	
        length=8, align1=0, align2=0:         5.98	        5.38	
        length=8, align1=0, align2=2:         5.98	        5.98	
        length=8, align1=2, align2=3:         5.38	        5.98	
       length=16, align1=0, align2=0:         5.38	        4.79	
       length=16, align1=0, align2=0:         5.38	        4.78	
       length=16, align1=0, align2=0:         4.87	        4.78	
       length=16, align1=0, align2=0:         5.38	        4.79	
       length=16, align1=0, align2=0:         4.78	        4.79	
       length=16, align1=0, align2=0:         5.38	        4.79	
       length=16, align1=0, align2=3:         6.00	        5.38	
       length=16, align1=3, align2=4:         5.98	        5.38	
       length=32, align1=0, align2=0:         7.82	        5.99	
       length=32, align1=0, align2=0:         7.71	        6.58	
       length=32, align1=0, align2=0:         6.44	        4.79	
       length=32, align1=0, align2=0:         6.81	        4.79	
       length=32, align1=0, align2=0:         6.53	        4.79	
       length=32, align1=0, align2=0:         6.33	        4.79	
       length=32, align1=0, align2=4:         8.61	        4.78	
       length=32, align1=4, align2=5:         6.74	        5.49	
       length=64, align1=0, align2=0:         9.67	        8.24	
       length=64, align1=0, align2=0:        11.11	        8.23	
       length=64, align1=0, align2=0:        10.00	        6.88	
       length=64, align1=0, align2=0:        12.82	        6.88	
       length=64, align1=0, align2=0:        10.42	        7.88	
       length=64, align1=0, align2=0:        10.37	        6.88	
       length=64, align1=0, align2=5:        11.08	        6.88	
       length=64, align1=5, align2=6:         9.29	        6.88	
      length=128, align1=0, align2=0:        14.06	       14.08	
      length=128, align1=0, align2=0:        14.23	       14.14	
      length=128, align1=0, align2=0:         8.41	        7.48	
      length=128, align1=0, align2=0:        10.55	        7.48	
      length=128, align1=0, align2=0:         8.45	        7.48	
      length=128, align1=0, align2=0:         9.38	        7.48	
      length=128, align1=0, align2=6:         8.44	        7.48	
      length=128, align1=6, align2=7:         8.66	        7.48	
      length=256, align1=0, align2=0:        16.54	       17.55	
      length=256, align1=0, align2=0:        16.42	       17.49	
      length=256, align1=0, align2=0:        17.03	       17.47	
      length=256, align1=0, align2=0:        17.57	       17.49	
      length=256, align1=0, align2=0:        16.63	       17.47	
      length=256, align1=0, align2=0:        17.88	       17.54	
      length=256, align1=0, align2=7:        20.20	       19.18	
      length=256, align1=7, align2=8:        20.17	       19.14	
      length=512, align1=0, align2=0:        25.17	       24.51	
      length=512, align1=0, align2=0:        24.60	       24.38	
      length=512, align1=0, align2=0:        24.53	       24.52	
      length=512, align1=0, align2=0:        25.71	       24.34	
      length=512, align1=0, align2=0:        24.55	       24.48	
      length=512, align1=0, align2=0:        25.15	       24.44	
      length=512, align1=0, align2=8:        25.97	       25.90	
      length=512, align1=8, align2=9:        25.88	       25.92	
     length=1024, align1=0, align2=0:        40.13	       36.75	
     length=1024, align1=0, align2=0:        39.84	       36.63	
     length=1024, align1=0, align2=0:        40.50	       36.84	
     length=1024, align1=0, align2=0:        40.16	       36.76	
     length=1024, align1=0, align2=0:        39.72	       36.76	
     length=1024, align1=0, align2=0:        40.67	       36.76	
     length=1024, align1=0, align2=9:        40.57	       39.59	
    length=1024, align1=9, align2=10:        40.66	       39.60	
       length=16, align1=1, align2=2:         6.59	        7.18	
       length=16, align1=2, align2=1:         7.18	        7.18	
       length=16, align1=1, align2=2:         5.39	        5.38	
       length=16, align1=2, align2=1:         5.97	        5.40	
       length=16, align1=1, align2=2:         5.41	        5.38	
       length=16, align1=2, align2=1:         5.98	        5.38	
       length=32, align1=2, align2=4:         8.81	        7.18	
       length=32, align1=4, align2=2:         8.79	        7.18	
       length=32, align1=2, align2=4:         7.57	        4.79	
       length=32, align1=4, align2=2:         6.79	        4.79	
       length=32, align1=2, align2=4:         7.03	        4.78	
       length=32, align1=4, align2=2:         7.04	        4.78	
       length=64, align1=3, align2=6:        10.00	        8.38	
       length=64, align1=6, align2=3:         8.89	        9.57	
       length=64, align1=3, align2=6:         9.31	        6.88	
       length=64, align1=6, align2=3:        10.06	        6.88	
       length=64, align1=3, align2=6:         9.38	        6.88	
       length=64, align1=6, align2=3:        10.42	        6.88	
      length=128, align1=4, align2=8:        17.36	       16.15	
      length=128, align1=8, align2=4:        14.30	       14.50	
      length=128, align1=4, align2=8:         8.48	        7.48	
      length=128, align1=8, align2=4:         8.78	        7.48	
      length=128, align1=4, align2=8:         8.45	        7.48	
      length=128, align1=8, align2=4:         8.57	        7.55	
     length=256, align1=5, align2=10:        20.73	       19.26	
     length=256, align1=10, align2=5:        16.81	       18.56	
     length=256, align1=5, align2=10:        20.44	       19.14	
     length=256, align1=10, align2=5:        16.76	       18.57	
     length=256, align1=5, align2=10:        20.03	       19.22	
     length=256, align1=10, align2=5:        17.01	       18.55	
     length=512, align1=6, align2=12:        26.50	       25.81	
     length=512, align1=12, align2=6:        24.64	       25.61	
     length=512, align1=6, align2=12:        26.23	       25.90	
     length=512, align1=12, align2=6:        24.78	       25.70	
     length=512, align1=6, align2=12:        25.85	       25.90	
     length=512, align1=12, align2=6:        25.98	       25.71	
    length=1024, align1=7, align2=14:        40.62	       39.69	
    length=1024, align1=14, align2=7:        39.74	       39.06	
    length=1024, align1=7, align2=14:        40.70	       39.58	
    length=1024, align1=14, align2=7:        40.16	       39.04	
    length=1024, align1=7, align2=14:        40.62	       39.65	
    length=1024, align1=14, align2=7:        39.68	       39.12	
length=128, align1=8063, align2=8063:        14.19	       14.43	
length=128, align1=8063, align2=8062:        14.57	       14.48	
length=129, align1=8062, align2=8063:        17.52	       16.06	
length=129, align1=8062, align2=8062:        14.13	       14.08	
length=129, align1=8062, align2=8062:        14.16	       14.08	
length=129, align1=8062, align2=8061:        15.59	       14.54	
length=130, align1=8061, align2=8062:        17.53	       16.14	
length=130, align1=8061, align2=8061:        14.66	       14.08	
length=130, align1=8061, align2=8061:        13.80	       14.09	
length=130, align1=8061, align2=8060:        14.28	       14.47	
length=131, align1=8060, align2=8061:        17.84	       16.11	
length=131, align1=8060, align2=8060:        14.08	       14.07	
length=131, align1=8060, align2=8060:        14.02	       14.07	
length=131, align1=8060, align2=8059:        15.05	       14.48	
length=132, align1=8059, align2=8060:        17.46	       16.10	
length=132, align1=8059, align2=8059:        13.99	       14.07	
length=132, align1=8059, align2=8059:        14.01	       14.08	
length=132, align1=8059, align2=8058:        14.54	       14.54	
length=133, align1=8058, align2=8059:        17.38	       16.17	
length=133, align1=8058, align2=8058:        14.14	       14.08	
length=133, align1=8058, align2=8058:        13.88	       14.06	
length=133, align1=8058, align2=8057:        14.66	       14.47	
length=134, align1=8057, align2=8058:        17.45	       16.13	
length=134, align1=8057, align2=8057:        14.10	       14.07	
length=134, align1=8057, align2=8057:        14.54	       14.07	
length=134, align1=8057, align2=8056:        14.58	       14.49	
length=135, align1=8056, align2=8057:        17.65	       16.10	
length=135, align1=8056, align2=8056:        13.91	       14.08	
length=135, align1=8056, align2=8056:        14.16	       14.07	
length=135, align1=8056, align2=8055:        15.19	       14.74	
length=136, align1=8055, align2=8056:        18.17	       16.10	
length=136, align1=8055, align2=8055:        14.68	       14.64	
length=136, align1=8055, align2=8055:        14.58	       14.64	
length=136, align1=8055, align2=8054:        15.21	       15.03	
length=137, align1=8054, align2=8055:        17.75	       16.22	
length=137, align1=8054, align2=8054:        14.51	       14.62	
length=137, align1=8054, align2=8054:        15.15	       14.69	
length=137, align1=8054, align2=8053:        15.11	       14.94	
length=138, align1=8053, align2=8054:        18.13	       16.22	
length=138, align1=8053, align2=8053:        14.61	       14.70	
length=138, align1=8053, align2=8053:        14.41	       14.70	
length=138, align1=8053, align2=8052:        14.96	       14.94	
length=139, align1=8052, align2=8053:        17.98	       16.21	
length=139, align1=8052, align2=8052:        14.63	       14.68	
length=139, align1=8052, align2=8052:        15.30	       14.62	
length=139, align1=8052, align2=8051:        15.20	       14.95	
length=140, align1=8051, align2=8052:        17.66	       16.13	
length=140, align1=8051, align2=8051:        14.60	       14.68	
length=140, align1=8051, align2=8051:        14.58	       14.62	
length=140, align1=8051, align2=8050:        15.51	       14.94	
length=141, align1=8050, align2=8051:        17.41	       16.14	
length=141, align1=8050, align2=8050:        14.77	       14.71	
length=141, align1=8050, align2=8050:        14.50	       14.62	
length=141, align1=8050, align2=8049:        14.95	       14.97	
length=142, align1=8049, align2=8050:        17.55	       16.14	
length=142, align1=8049, align2=8049:        14.46	       14.70	
length=142, align1=8049, align2=8049:        14.60	       14.61	
length=142, align1=8049, align2=8048:        14.77	       14.78	
length=143, align1=8048, align2=8049:        18.15	       16.15	
length=143, align1=8048, align2=8048:        13.92	       14.02	
length=143, align1=8048, align2=8048:        13.88	       14.02	
length=143, align1=8048, align2=8047:        14.11	       14.32	
length=144, align1=8047, align2=8048:        17.64	       16.19	
length=144, align1=8047, align2=8047:        14.20	       13.96	
length=144, align1=8047, align2=8047:        14.03	       13.95	
length=144, align1=8047, align2=8046:        14.36	       14.32	
length=145, align1=8046, align2=8047:        17.82	       16.11	
length=145, align1=8046, align2=8046:        14.39	       13.95	
length=145, align1=8046, align2=8046:        13.88	       13.95	
length=145, align1=8046, align2=8045:        14.55	       14.33	
length=146, align1=8045, align2=8046:        18.02	       16.10	
length=146, align1=8045, align2=8045:        13.91	       13.95	
length=146, align1=8045, align2=8045:        13.77	       13.95	
length=146, align1=8045, align2=8044:        14.26	       14.32	
length=147, align1=8044, align2=8045:        17.43	       16.17	
length=147, align1=8044, align2=8044:        14.02	       14.01	
length=147, align1=8044, align2=8044:        13.99	       13.89	
length=147, align1=8044, align2=8043:        14.40	       14.32	
length=148, align1=8043, align2=8044:        17.57	       16.08	
length=148, align1=8043, align2=8043:        14.00	       13.95	
length=148, align1=8043, align2=8043:        14.18	       13.95	
length=148, align1=8043, align2=8042:        14.66	       14.33	
length=149, align1=8042, align2=8043:        17.50	       16.20	
length=149, align1=8042, align2=8042:        13.87	       13.95	
length=149, align1=8042, align2=8042:        14.12	       13.96	
length=149, align1=8042, align2=8041:        14.74	       14.32	
length=150, align1=8041, align2=8042:        17.63	       16.13	
length=150, align1=8041, align2=8041:        13.87	       13.95	
length=150, align1=8041, align2=8041:        13.73	       13.94	
length=150, align1=8041, align2=8040:        14.31	       14.34	
length=151, align1=8040, align2=8041:        18.46	       16.09	
length=151, align1=8040, align2=8040:        15.37	       13.95	
length=151, align1=8040, align2=8040:        14.01	       13.95	
length=151, align1=8040, align2=8039:        14.25	       14.32	
length=152, align1=8039, align2=8040:        17.70	       16.11	
length=152, align1=8039, align2=8039:        13.89	       14.03	
length=152, align1=8039, align2=8039:        14.49	       14.02	
length=152, align1=8039, align2=8038:        14.31	       14.39	
length=153, align1=8038, align2=8039:        17.62	       16.10	
length=153, align1=8038, align2=8038:        13.75	       13.95	
length=153, align1=8038, align2=8038:        14.00	       13.94	
length=153, align1=8038, align2=8037:        14.25	       14.33	
length=154, align1=8037, align2=8038:        18.33	       16.11	
length=154, align1=8037, align2=8037:        14.12	       13.96	
length=154, align1=8037, align2=8037:        14.08	       13.95	
length=154, align1=8037, align2=8036:        15.15	       14.33	
length=155, align1=8036, align2=8037:        17.66	       16.09	
length=155, align1=8036, align2=8036:        14.22	       14.01	
length=155, align1=8036, align2=8036:        13.87	       14.02	
length=155, align1=8036, align2=8035:        14.63	       14.32	
length=156, align1=8035, align2=8036:        17.57	       16.10	
length=156, align1=8035, align2=8035:        14.00	       13.96	
length=156, align1=8035, align2=8035:        13.88	       13.95	
length=156, align1=8035, align2=8034:        14.79	       14.41	
length=157, align1=8034, align2=8035:        17.74	       16.15	
length=157, align1=8034, align2=8034:        14.13	       13.94	
length=157, align1=8034, align2=8034:        14.86	       13.95	
length=157, align1=8034, align2=8033:        14.35	       14.33	
length=158, align1=8033, align2=8034:        17.68	       16.16	
length=158, align1=8033, align2=8033:        13.94	       13.94	
H.J. Lu (2):
  x86-64: Improve EVEX strcmp with masked load
  x86-64: Remove Prefer_AVX2_STRCMP

 sysdeps/x86/cpu-features.c                    |   8 -
 sysdeps/x86/cpu-tunables.c                    |   2 -
 ...cpu-features-preferred_feature_index_1.def |   1 -
 sysdeps/x86_64/multiarch/strcmp-evex.S        | 461 +++++++++---------
 sysdeps/x86_64/multiarch/strcmp.c             |   3 +-
 sysdeps/x86_64/multiarch/strncmp.c            |   3 +-
 6 files changed, 245 insertions(+), 233 deletions(-)

-- 
2.33.1



* [PATCH 1/2] x86-64: Improve EVEX strcmp with masked load
  2021-11-01 12:54 [PATCH 0/2] Enable EVEX strcmp H.J. Lu
@ 2021-11-01 12:54 ` H.J. Lu
  2022-04-23  1:30   ` Sunil Pandey
  2021-11-01 12:54 ` [PATCH 2/2] x86-64: Remove Prefer_AVX2_STRCMP H.J. Lu
  1 sibling, 1 reply; 5+ messages in thread
From: H.J. Lu @ 2021-11-01 12:54 UTC (permalink / raw)
  To: libc-alpha

In strcmp-evex.S, to compare two 32-byte strings, replace

        VMOVU   (%rdi, %rdx), %YMM0
        VMOVU   (%rsi, %rdx), %YMM1
        /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
        VPCMP   $4, %YMM0, %YMM1, %k0
        VPCMP   $0, %YMMZERO, %YMM0, %k1
        VPCMP   $0, %YMMZERO, %YMM1, %k2
        /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
        kord    %k1, %k2, %k1
        /* Each bit in K1 represents a NULL or a mismatch.  */
        kord    %k0, %k1, %k1
        kmovd   %k1, %ecx
        testl   %ecx, %ecx
        jne     L(last_vector)

with

        VMOVU   (%rdi, %rdx), %YMM0
        VPTESTM %YMM0, %YMM0, %k2
        /* Each bit cleared in K1 represents a mismatch or a null CHAR
           in YMM0 and 32 bytes at (%rsi, %rdx).  */
        VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
        kmovd   %k1, %ecx
        incl    %ecx
        jne     L(last_vector)

It makes EVEX strcmp faster than AVX2 strcmp by up to 30% on Tiger Lake
and Ice Lake.
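
The main loop in the diff below is reworked in the same spirit: four
vectors per iteration are folded into a single masked check via VPMINU,
VPXORQ and VPORQ.  A standalone C model of that combine (illustration
only, shrunk to 4-byte vectors; names and values are made up):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NVEC 4   /* vectors combined per loop iteration */
#define VLEN 4   /* bytes per vector (32 in the real code) */

/* VPMINU exposes a null anywhere as a zero element; VPXORQ/VPORQ expose
   a mismatch anywhere as a non-zero element; one masked compare of the
   OR result against zero then covers both cases at once.  */
static bool
block_needs_check (const unsigned char a[NVEC][VLEN],
                   const unsigned char b[NVEC][VLEN])
{
  for (int i = 0; i < VLEN; i++)
    {
      unsigned char min_a = 0xff;   /* VPMINU across the four a-vectors */
      unsigned char ors = 0;        /* VPORQ of the four a^b vectors */
      for (int v = 0; v < NVEC; v++)
        {
          if (a[v][i] < min_a)
            min_a = a[v][i];
          ors |= (unsigned char) (a[v][i] ^ b[v][i]);
        }
      /* A cleared mask bit in the real code: a null (min_a == 0) or a
         mismatch (ors != 0) somewhere in the four vectors.  */
      if (min_a == 0 || ors != 0)
        return true;   /* leave the loop and locate the exact vector */
    }
  return false;        /* all four vectors equal and null-free */
}

int
main (void)
{
  unsigned char a[NVEC][VLEN], b[NVEC][VLEN];
  memset (a, 'x', sizeof a);
  memset (b, 'x', sizeof b);
  printf ("%d\n", block_needs_check (a, b));   /* 0: stay in the loop */
  b[2][1] = 'y';                               /* mismatch in vector 2 */
  printf ("%d\n", block_needs_check (a, b));   /* 1 */
  return 0;
}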

Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
---
 sysdeps/x86_64/multiarch/strcmp-evex.S | 461 +++++++++++++------------
 1 file changed, 243 insertions(+), 218 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
index 459eeed09f..0bea318abd 100644
--- a/sysdeps/x86_64/multiarch/strcmp-evex.S
+++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
@@ -41,6 +41,8 @@
 # ifdef USE_AS_WCSCMP
 /* Compare packed dwords.  */
 #  define VPCMP		vpcmpd
+#  define VPMINU	vpminud
+#  define VPTESTM	vptestmd
 #  define SHIFT_REG32	r8d
 #  define SHIFT_REG64	r8
 /* 1 dword char == 4 bytes.  */
@@ -48,6 +50,8 @@
 # else
 /* Compare packed bytes.  */
 #  define VPCMP		vpcmpb
+#  define VPMINU	vpminub
+#  define VPTESTM	vptestmb
 #  define SHIFT_REG32	ecx
 #  define SHIFT_REG64	rcx
 /* 1 byte char == 1 byte.  */
@@ -67,6 +71,9 @@
 # define YMM5		ymm22
 # define YMM6		ymm23
 # define YMM7		ymm24
+# define YMM8		ymm25
+# define YMM9		ymm26
+# define YMM10		ymm27
 
 /* Warning!
            wcscmp/wcsncmp have to use SIGNED comparison for elements.
@@ -76,7 +83,7 @@
 /* The main idea of the string comparison (byte or dword) using 256-bit
    EVEX instructions consists of comparing (VPCMP) two ymm vectors. The
    latter can be on either packed bytes or dwords depending on
-   USE_AS_WCSCMP. In order to check the null char, algorithm keeps the
+   USE_AS_WCSCMP. In order to check the null CHAR, algorithm keeps the
    matched bytes/dwords, requiring 5 EVEX instructions (3 VPCMP and 2
    KORD). In general, the costs of comparing VEC_SIZE bytes (32-bytes)
    are 3 VPCMP and 2 KORD instructions, together with VMOVU and ktestd
@@ -113,27 +120,21 @@ ENTRY (STRCMP)
 	jg	L(cross_page)
 	/* Start comparing 4 vectors.  */
 	VMOVU	(%rdi), %YMM0
-	VMOVU	(%rsi), %YMM1
 
-	/* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
-	VPCMP	$4, %YMM0, %YMM1, %k0
+	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
+	VPTESTM	%YMM0, %YMM0, %k2
 
-	/* Check for NULL in YMM0.  */
-	VPCMP	$0, %YMMZERO, %YMM0, %k1
-	/* Check for NULL in YMM1.  */
-	VPCMP	$0, %YMMZERO, %YMM1, %k2
-	/* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
-	kord	%k1, %k2, %k1
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in YMM0 and 32 bytes at (%rsi).  */
+	VPCMP	$0, (%rsi), %YMM0, %k1{%k2}
 
-	/* Each bit in K1 represents:
-	   1. A mismatch in YMM0 and YMM1.  Or
-	   2. A NULL in YMM0 or YMM1.
-	 */
-	kord	%k0, %k1, %k1
-
-	ktestd	%k1, %k1
-	je	L(next_3_vectors)
 	kmovd	%k1, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
+	je	L(next_3_vectors)
 	tzcntl	%ecx, %edx
 # ifdef USE_AS_WCSCMP
 	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
@@ -162,9 +163,7 @@ L(return):
 # endif
 	ret
 
-	.p2align 4
 L(return_vec_size):
-	kmovd	%k1, %ecx
 	tzcntl	%ecx, %edx
 # ifdef USE_AS_WCSCMP
 	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
@@ -200,9 +199,7 @@ L(return_vec_size):
 # endif
 	ret
 
-	.p2align 4
 L(return_2_vec_size):
-	kmovd	%k1, %ecx
 	tzcntl	%ecx, %edx
 # ifdef USE_AS_WCSCMP
 	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
@@ -238,9 +235,7 @@ L(return_2_vec_size):
 # endif
 	ret
 
-	.p2align 4
 L(return_3_vec_size):
-	kmovd	%k1, %ecx
 	tzcntl	%ecx, %edx
 # ifdef USE_AS_WCSCMP
 	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
@@ -279,43 +274,45 @@ L(return_3_vec_size):
 	.p2align 4
 L(next_3_vectors):
 	VMOVU	VEC_SIZE(%rdi), %YMM0
-	VMOVU	VEC_SIZE(%rsi), %YMM1
-	/* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
-	VPCMP	$4, %YMM0, %YMM1, %k0
-	VPCMP	$0, %YMMZERO, %YMM0, %k1
-	VPCMP	$0, %YMMZERO, %YMM1, %k2
-	/* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
-	kord	%k1, %k2, %k1
-	/* Each bit in K1 represents a NULL or a mismatch.  */
-	kord	%k0, %k1, %k1
-	ktestd	%k1, %k1
+	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
+	VPTESTM	%YMM0, %YMM0, %k2
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in YMM0 and 32 bytes at VEC_SIZE(%rsi).  */
+	VPCMP	$0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
 	jne	L(return_vec_size)
 
-	VMOVU	(VEC_SIZE * 2)(%rdi), %YMM2
-	VMOVU	(VEC_SIZE * 3)(%rdi), %YMM3
-	VMOVU	(VEC_SIZE * 2)(%rsi), %YMM4
-	VMOVU	(VEC_SIZE * 3)(%rsi), %YMM5
-
-	/* Each bit in K0 represents a mismatch in YMM2 and YMM4.  */
-	VPCMP	$4, %YMM2, %YMM4, %k0
-	VPCMP	$0, %YMMZERO, %YMM2, %k1
-	VPCMP	$0, %YMMZERO, %YMM4, %k2
-	/* Each bit in K1 represents a NULL in YMM2 or YMM4.  */
-	kord	%k1, %k2, %k1
-	/* Each bit in K1 represents a NULL or a mismatch.  */
-	kord	%k0, %k1, %k1
-	ktestd	%k1, %k1
+	VMOVU	(VEC_SIZE * 2)(%rdi), %YMM0
+	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
+	VPTESTM	%YMM0, %YMM0, %k2
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
+	VPCMP	$0, (VEC_SIZE * 2)(%rsi), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
 	jne	L(return_2_vec_size)
 
-	/* Each bit in K0 represents a mismatch in YMM3 and YMM5.  */
-	VPCMP	$4, %YMM3, %YMM5, %k0
-	VPCMP	$0, %YMMZERO, %YMM3, %k1
-	VPCMP	$0, %YMMZERO, %YMM5, %k2
-	/* Each bit in K1 represents a NULL in YMM3 or YMM5.  */
-	kord	%k1, %k2, %k1
-	/* Each bit in K1 represents a NULL or a mismatch.  */
-	kord	%k0, %k1, %k1
-	ktestd	%k1, %k1
+	VMOVU	(VEC_SIZE * 3)(%rdi), %YMM0
+	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
+	VPTESTM	%YMM0, %YMM0, %k2
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in YMM0 and 32 bytes at (VEC_SIZE * 3)(%rsi).  */
+	VPCMP	$0, (VEC_SIZE * 3)(%rsi), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
 	jne	L(return_3_vec_size)
 L(main_loop_header):
 	leaq	(VEC_SIZE * 4)(%rdi), %rdx
@@ -365,56 +362,51 @@ L(back_to_loop):
 	VMOVA	VEC_SIZE(%rax), %YMM2
 	VMOVA	(VEC_SIZE * 2)(%rax), %YMM4
 	VMOVA	(VEC_SIZE * 3)(%rax), %YMM6
-	VMOVU	(%rdx), %YMM1
-	VMOVU	VEC_SIZE(%rdx), %YMM3
-	VMOVU	(VEC_SIZE * 2)(%rdx), %YMM5
-	VMOVU	(VEC_SIZE * 3)(%rdx), %YMM7
-
-	VPCMP	$4, %YMM0, %YMM1, %k0
-	VPCMP	$0, %YMMZERO, %YMM0, %k1
-	VPCMP	$0, %YMMZERO, %YMM1, %k2
-	kord	%k1, %k2, %k1
-	/* Each bit in K4 represents a NULL or a mismatch in YMM0 and
-	   YMM1.  */
-	kord	%k0, %k1, %k4
-
-	VPCMP	$4, %YMM2, %YMM3, %k0
-	VPCMP	$0, %YMMZERO, %YMM2, %k1
-	VPCMP	$0, %YMMZERO, %YMM3, %k2
-	kord	%k1, %k2, %k1
-	/* Each bit in K5 represents a NULL or a mismatch in YMM2 and
-	   YMM3.  */
-	kord	%k0, %k1, %k5
-
-	VPCMP	$4, %YMM4, %YMM5, %k0
-	VPCMP	$0, %YMMZERO, %YMM4, %k1
-	VPCMP	$0, %YMMZERO, %YMM5, %k2
-	kord	%k1, %k2, %k1
-	/* Each bit in K6 represents a NULL or a mismatch in YMM4 and
-	   YMM5.  */
-	kord	%k0, %k1, %k6
-
-	VPCMP	$4, %YMM6, %YMM7, %k0
-	VPCMP	$0, %YMMZERO, %YMM6, %k1
-	VPCMP	$0, %YMMZERO, %YMM7, %k2
-	kord	%k1, %k2, %k1
-	/* Each bit in K7 represents a NULL or a mismatch in YMM6 and
-	   YMM7.  */
-	kord	%k0, %k1, %k7
-
-	kord	%k4, %k5, %k0
-	kord	%k6, %k7, %k1
-
-	/* Test each mask (32 bits) individually because for VEC_SIZE
-	   == 32 is not possible to OR the four masks and keep all bits
-	   in a 64-bit integer register, differing from SSE2 strcmp
-	   where ORing is possible.  */
-	kortestd %k0, %k1
-	je	L(loop)
-	ktestd	%k4, %k4
+
+	VPMINU	%YMM0, %YMM2, %YMM8
+	VPMINU	%YMM4, %YMM6, %YMM9
+
+	/* A zero CHAR in YMM8 means that there is a null CHAR.  */
+	VPMINU	%YMM8, %YMM9, %YMM8
+
+	/* Each bit set in K1 represents a non-null CHAR in YMM8.  */
+	VPTESTM	%YMM8, %YMM8, %k1
+
+	/* (YMM ^ YMM): A non-zero CHAR represents a mismatch.  */
+	vpxorq	(%rdx), %YMM0, %YMM1
+	vpxorq	VEC_SIZE(%rdx), %YMM2, %YMM3
+	vpxorq	(VEC_SIZE * 2)(%rdx), %YMM4, %YMM5
+	vpxorq	(VEC_SIZE * 3)(%rdx), %YMM6, %YMM7
+
+	vporq	%YMM1, %YMM3, %YMM9
+	vporq	%YMM5, %YMM7, %YMM10
+
+	/* A non-zero CHAR in YMM9 represents a mismatch.  */
+	vporq	%YMM9, %YMM10, %YMM9
+
+	/* Each bit cleared in K0 represents a mismatch or a null CHAR.  */
+	VPCMP	$0, %YMMZERO, %YMM9, %k0{%k1}
+	kmovd   %k0, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
+	je	 L(loop)
+
+	/* Each bit set in K1 represents a non-null CHAR in YMM0.  */
+	VPTESTM	%YMM0, %YMM0, %k1
+	/* Each bit cleared in K0 represents a mismatch or a null CHAR
+	   in YMM0 and (%rdx).  */
+	VPCMP	$0, %YMMZERO, %YMM1, %k0{%k1}
+	kmovd	%k0, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
 	je	L(test_vec)
-	kmovd	%k4, %edi
-	tzcntl	%edi, %ecx
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_WCSCMP
 	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
 	sall	$2, %ecx
@@ -456,9 +448,18 @@ L(test_vec):
 	cmpq	$VEC_SIZE, %r11
 	jbe	L(zero)
 # endif
-	ktestd	%k5, %k5
+	/* Each bit set in K1 represents a non-null CHAR in YMM2.  */
+	VPTESTM	%YMM2, %YMM2, %k1
+	/* Each bit cleared in K0 represents a mismatch or a null CHAR
+	   in YMM2 and VEC_SIZE(%rdx).  */
+	VPCMP	$0, %YMMZERO, %YMM3, %k0{%k1}
+	kmovd	%k0, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
 	je	L(test_2_vec)
-	kmovd	%k5, %ecx
 	tzcntl	%ecx, %edi
 # ifdef USE_AS_WCSCMP
 	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
@@ -502,9 +503,18 @@ L(test_2_vec):
 	cmpq	$(VEC_SIZE * 2), %r11
 	jbe	L(zero)
 # endif
-	ktestd	%k6, %k6
+	/* Each bit set in K1 represents a non-null CHAR in YMM4.  */
+	VPTESTM	%YMM4, %YMM4, %k1
+	/* Each bit cleared in K0 represents a mismatch or a null CHAR
+	   in YMM4 and (VEC_SIZE * 2)(%rdx).  */
+	VPCMP	$0, %YMMZERO, %YMM5, %k0{%k1}
+	kmovd	%k0, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
 	je	L(test_3_vec)
-	kmovd	%k6, %ecx
 	tzcntl	%ecx, %edi
 # ifdef USE_AS_WCSCMP
 	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
@@ -548,8 +558,18 @@ L(test_3_vec):
 	cmpq	$(VEC_SIZE * 3), %r11
 	jbe	L(zero)
 # endif
-	kmovd	%k7, %esi
-	tzcntl	%esi, %ecx
+	/* Each bit set in K1 represents a non-null CHAR in YMM6.  */
+	VPTESTM	%YMM6, %YMM6, %k1
+	/* Each bit cleared in K0 represents a mismatch or a null CHAR
+	   in YMM6 and (VEC_SIZE * 3)(%rdx).  */
+	VPCMP	$0, %YMMZERO, %YMM7, %k0{%k1}
+	kmovd	%k0, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_WCSCMP
 	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
 	sall	$2, %ecx
@@ -605,39 +625,51 @@ L(loop_cross_page):
 
 	VMOVU	(%rax, %r10), %YMM2
 	VMOVU	VEC_SIZE(%rax, %r10), %YMM3
-	VMOVU	(%rdx, %r10), %YMM4
-	VMOVU	VEC_SIZE(%rdx, %r10), %YMM5
-
-	VPCMP	$4, %YMM4, %YMM2, %k0
-	VPCMP	$0, %YMMZERO, %YMM2, %k1
-	VPCMP	$0, %YMMZERO, %YMM4, %k2
-	kord	%k1, %k2, %k1
-	/* Each bit in K1 represents a NULL or a mismatch in YMM2 and
-	   YMM4.  */
-	kord	%k0, %k1, %k1
-
-	VPCMP	$4, %YMM5, %YMM3, %k3
-	VPCMP	$0, %YMMZERO, %YMM3, %k4
-	VPCMP	$0, %YMMZERO, %YMM5, %k5
-	kord	%k4, %k5, %k4
-	/* Each bit in K3 represents a NULL or a mismatch in YMM3 and
-	   YMM5.  */
-	kord	%k3, %k4, %k3
+
+	/* Each bit set in K2 represents a non-null CHAR in YMM2.  */
+	VPTESTM	%YMM2, %YMM2, %k2
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in YMM2 and 32 bytes at (%rdx, %r10).  */
+	VPCMP	$0, (%rdx, %r10), %YMM2, %k1{%k2}
+	kmovd	%k1, %r9d
+	/* Don't use subl since it is the lower 16/32 bits of RDI
+	   below.  */
+	notl	%r9d
+# ifdef USE_AS_WCSCMP
+	/* Only last 8 bits are valid.  */
+	andl	$0xff, %r9d
+# endif
+
+	/* Each bit set in K4 represents a non-null CHAR in YMM3.  */
+	VPTESTM	%YMM3, %YMM3, %k4
+	/* Each bit cleared in K3 represents a mismatch or a null CHAR
+	   in YMM3 and 32 bytes at VEC_SIZE(%rdx, %r10).  */
+	VPCMP	$0, VEC_SIZE(%rdx, %r10), %YMM3, %k3{%k4}
+	kmovd	%k3, %edi
+# ifdef USE_AS_WCSCMP
+	/* Don't use subl since it is the upper 8 bits of EDI below.  */
+	notl	%edi
+	andl	$0xff, %edi
+# else
+	incl	%edi
+# endif
 
 # ifdef USE_AS_WCSCMP
-	/* NB: Each bit in K1/K3 represents 4-byte element.  */
-	kshiftlw $8, %k3, %k2
+	/* NB: Each bit in EDI/R9D represents 4-byte element.  */
+	sall	$8, %edi
 	/* NB: Divide shift count by 4 since each bit in K1 represent 4
 	   bytes.  */
 	movl	%ecx, %SHIFT_REG32
 	sarl	$2, %SHIFT_REG32
+
+	/* Each bit in EDI represents a null CHAR or a mismatch.  */
+	orl	%r9d, %edi
 # else
-	kshiftlq $32, %k3, %k2
-# endif
+	salq	$32, %rdi
 
-	/* Each bit in K1 represents a NULL or a mismatch.  */
-	korq	%k1, %k2, %k1
-	kmovq	%k1, %rdi
+	/* Each bit in RDI represents a null CHAR or a mismatch.  */
+	orq	%r9, %rdi
+# endif
 
 	/* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
 	shrxq	%SHIFT_REG64, %rdi, %rdi
@@ -682,35 +714,45 @@ L(loop_cross_page_2_vec):
 	/* The first VEC_SIZE * 2 bytes match or are ignored.  */
 	VMOVU	(VEC_SIZE * 2)(%rax, %r10), %YMM0
 	VMOVU	(VEC_SIZE * 3)(%rax, %r10), %YMM1
-	VMOVU	(VEC_SIZE * 2)(%rdx, %r10), %YMM2
-	VMOVU	(VEC_SIZE * 3)(%rdx, %r10), %YMM3
-
-	VPCMP	$4, %YMM0, %YMM2, %k0
-	VPCMP	$0, %YMMZERO, %YMM0, %k1
-	VPCMP	$0, %YMMZERO, %YMM2, %k2
-	kord	%k1, %k2, %k1
-	/* Each bit in K1 represents a NULL or a mismatch in YMM0 and
-	   YMM2.  */
-	kord	%k0, %k1, %k1
-
-	VPCMP	$4, %YMM1, %YMM3, %k3
-	VPCMP	$0, %YMMZERO, %YMM1, %k4
-	VPCMP	$0, %YMMZERO, %YMM3, %k5
-	kord	%k4, %k5, %k4
-	/* Each bit in K3 represents a NULL or a mismatch in YMM1 and
-	   YMM3.  */
-	kord	%k3, %k4, %k3
 
+	VPTESTM	%YMM0, %YMM0, %k2
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rdx, %r10).  */
+	VPCMP	$0, (VEC_SIZE * 2)(%rdx, %r10), %YMM0, %k1{%k2}
+	kmovd	%k1, %r9d
+	/* Don't use subl since it is the lower 16/32 bits of RDI
+	   below.  */
+	notl	%r9d
 # ifdef USE_AS_WCSCMP
-	/* NB: Each bit in K1/K3 represents 4-byte element.  */
-	kshiftlw $8, %k3, %k2
+	/* Only last 8 bits are valid.  */
+	andl	$0xff, %r9d
+# endif
+
+	VPTESTM	%YMM1, %YMM1, %k4
+	/* Each bit cleared in K3 represents a mismatch or a null CHAR
+	   in YMM1 and 32 bytes at (VEC_SIZE * 3)(%rdx, %r10).  */
+	VPCMP	$0, (VEC_SIZE * 3)(%rdx, %r10), %YMM1, %k3{%k4}
+	kmovd	%k3, %edi
+# ifdef USE_AS_WCSCMP
+	/* Don't use subl since it is the upper 8 bits of EDI below.  */
+	notl	%edi
+	andl	$0xff, %edi
 # else
-	kshiftlq $32, %k3, %k2
+	incl	%edi
 # endif
 
-	/* Each bit in K1 represents a NULL or a mismatch.  */
-	korq	%k1, %k2, %k1
-	kmovq	%k1, %rdi
+# ifdef USE_AS_WCSCMP
+	/* NB: Each bit in EDI/R9D represents 4-byte element.  */
+	sall	$8, %edi
+
+	/* Each bit in EDI represents a null CHAR or a mismatch.  */
+	orl	%r9d, %edi
+# else
+	salq	$32, %rdi
+
+	/* Each bit in RDI represents a null CHAR or a mismatch.  */
+	orq	%r9, %rdi
+# endif
 
 	xorl	%r8d, %r8d
 	/* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
@@ -719,12 +761,15 @@ L(loop_cross_page_2_vec):
 	/* R8 has number of bytes skipped.  */
 	movl	%ecx, %r8d
 # ifdef USE_AS_WCSCMP
-	/* NB: Divide shift count by 4 since each bit in K1 represent 4
+	/* NB: Divide shift count by 4 since each bit in RDI represent 4
 	   bytes.  */
 	sarl	$2, %ecx
-# endif
+	/* Skip ECX bytes.  */
+	shrl	%cl, %edi
+# else
 	/* Skip ECX bytes.  */
 	shrq	%cl, %rdi
+# endif
 1:
 	/* Before jumping back to the loop, set ESI to the number of
 	   VEC_SIZE * 4 blocks before page crossing.  */
@@ -808,7 +853,7 @@ L(cross_page_loop):
 	movzbl	(%rdi, %rdx), %eax
 	movzbl	(%rsi, %rdx), %ecx
 # endif
-	/* Check null char.  */
+	/* Check null CHAR.  */
 	testl	%eax, %eax
 	jne	L(cross_page_loop)
 	/* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
@@ -891,18 +936,17 @@ L(cross_page):
 	jg	L(cross_page_1_vector)
 L(loop_1_vector):
 	VMOVU	(%rdi, %rdx), %YMM0
-	VMOVU	(%rsi, %rdx), %YMM1
-
-	/* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
-	VPCMP	$4, %YMM0, %YMM1, %k0
-	VPCMP	$0, %YMMZERO, %YMM0, %k1
-	VPCMP	$0, %YMMZERO, %YMM1, %k2
-	/* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
-	kord	%k1, %k2, %k1
-	/* Each bit in K1 represents a NULL or a mismatch.  */
-	kord	%k0, %k1, %k1
+
+	VPTESTM	%YMM0, %YMM0, %k2
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in YMM0 and 32 bytes at (%rsi, %rdx).  */
+	VPCMP	$0, (%rsi, %rdx), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
-	testl	%ecx, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xff, %ecx
+# else
+	incl	%ecx
+# endif
 	jne	L(last_vector)
 
 	addl	$VEC_SIZE, %edx
@@ -921,18 +965,17 @@ L(cross_page_1_vector):
 	cmpl	$(PAGE_SIZE - 16), %eax
 	jg	L(cross_page_1_xmm)
 	VMOVU	(%rdi, %rdx), %XMM0
-	VMOVU	(%rsi, %rdx), %XMM1
-
-	/* Each bit in K0 represents a mismatch in XMM0 and XMM1.  */
-	VPCMP	$4, %XMM0, %XMM1, %k0
-	VPCMP	$0, %XMMZERO, %XMM0, %k1
-	VPCMP	$0, %XMMZERO, %XMM1, %k2
-	/* Each bit in K1 represents a NULL in XMM0 or XMM1.  */
-	korw	%k1, %k2, %k1
-	/* Each bit in K1 represents a NULL or a mismatch.  */
-	korw	%k0, %k1, %k1
-	kmovw	%k1, %ecx
-	testl	%ecx, %ecx
+
+	VPTESTM	%YMM0, %YMM0, %k2
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in XMM0 and 16 bytes at (%rsi, %rdx).  */
+	VPCMP	$0, (%rsi, %rdx), %XMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+# ifdef USE_AS_WCSCMP
+	subl	$0xf, %ecx
+# else
+	subl	$0xffff, %ecx
+# endif
 	jne	L(last_vector)
 
 	addl	$16, %edx
@@ -955,25 +998,16 @@ L(cross_page_1_xmm):
 	vmovq	(%rdi, %rdx), %XMM0
 	vmovq	(%rsi, %rdx), %XMM1
 
-	/* Each bit in K0 represents a mismatch in XMM0 and XMM1.  */
-	VPCMP	$4, %XMM0, %XMM1, %k0
-	VPCMP	$0, %XMMZERO, %XMM0, %k1
-	VPCMP	$0, %XMMZERO, %XMM1, %k2
-	/* Each bit in K1 represents a NULL in XMM0 or XMM1.  */
-	kord	%k1, %k2, %k1
-	/* Each bit in K1 represents a NULL or a mismatch.  */
-	kord	%k0, %k1, %k1
-	kmovd	%k1, %ecx
-
+	VPTESTM	%YMM0, %YMM0, %k2
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in XMM0 and XMM1.  */
+	VPCMP	$0, %XMM1, %XMM0, %k1{%k2}
+	kmovb	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	/* Only last 2 bits are valid.  */
-	andl	$0x3, %ecx
+	subl	$0x3, %ecx
 # else
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %ecx
+	subl	$0xff, %ecx
 # endif
-
-	testl	%ecx, %ecx
 	jne	L(last_vector)
 
 	addl	$8, %edx
@@ -992,25 +1026,16 @@ L(cross_page_8bytes):
 	vmovd	(%rdi, %rdx), %XMM0
 	vmovd	(%rsi, %rdx), %XMM1
 
-	/* Each bit in K0 represents a mismatch in XMM0 and XMM1.  */
-	VPCMP	$4, %XMM0, %XMM1, %k0
-	VPCMP	$0, %XMMZERO, %XMM0, %k1
-	VPCMP	$0, %XMMZERO, %XMM1, %k2
-	/* Each bit in K1 represents a NULL in XMM0 or XMM1.  */
-	kord	%k1, %k2, %k1
-	/* Each bit in K1 represents a NULL or a mismatch.  */
-	kord	%k0, %k1, %k1
+	VPTESTM	%YMM0, %YMM0, %k2
+	/* Each bit cleared in K1 represents a mismatch or a null CHAR
+	   in XMM0 and XMM1.  */
+	VPCMP	$0, %XMM1, %XMM0, %k1{%k2}
 	kmovd	%k1, %ecx
-
 # ifdef USE_AS_WCSCMP
-	/* Only the last bit is valid.  */
-	andl	$0x1, %ecx
+	subl	$0x1, %ecx
 # else
-	/* Only last 4 bits are valid.  */
-	andl	$0xf, %ecx
+	subl	$0xf, %ecx
 # endif
-
-	testl	%ecx, %ecx
 	jne	L(last_vector)
 
 	addl	$4, %edx
-- 
2.33.1



* [PATCH 2/2] x86-64: Remove Prefer_AVX2_STRCMP
  2021-11-01 12:54 [PATCH 0/2] Enable EVEX strcmp H.J. Lu
  2021-11-01 12:54 ` [PATCH 1/2] x86-64: Improve EVEX strcmp with masked load H.J. Lu
@ 2021-11-01 12:54 ` H.J. Lu
  2022-04-23  1:34   ` Sunil Pandey
  1 sibling, 1 reply; 5+ messages in thread
From: H.J. Lu @ 2021-11-01 12:54 UTC (permalink / raw)
  To: libc-alpha

Remove Prefer_AVX2_STRCMP to enable EVEX strcmp.  When comparing two
32-byte strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM,
1 VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL, while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL.  EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
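
With the preference bit gone, the IFUNC selectors in strcmp.c and
strncmp.c (see the diff below) pick EVEX whenever AVX512VL, AVX512BW and
BMI2 are usable.  A standalone model of that choice (a sketch only: plain
booleans stand in for the CPU_FEATURE_USABLE_P checks, and the RTM/AVX2
fallback details are omitted):

#include <stdbool.h>
#include <stdio.h>

enum impl { IMPL_SSE2, IMPL_AVX2, IMPL_EVEX };

static enum impl
select_strcmp (bool avx2, bool avx512vl, bool avx512bw, bool bmi2)
{
  if (avx2)
    {
      /* Prefer_AVX2_STRCMP no longer blocks this branch: EVEX is used
         whenever AVX512VL, AVX512BW and BMI2 are all usable.  */
      if (avx512vl && avx512bw && bmi2)
        return IMPL_EVEX;
      return IMPL_AVX2;
    }
  return IMPL_SSE2;
}

int
main (void)
{
  /* e.g. Tiger Lake / Ice Lake: all four features usable -> EVEX.  */
  printf ("%d\n", select_strcmp (true, true, true, true) == IMPL_EVEX);
  return 0;
}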
---
 sysdeps/x86/cpu-features.c                                | 8 --------
 sysdeps/x86/cpu-tunables.c                                | 2 --
 .../include/cpu-features-preferred_feature_index_1.def    | 1 -
 sysdeps/x86_64/multiarch/strcmp.c                         | 3 +--
 sysdeps/x86_64/multiarch/strncmp.c                        | 3 +--
 5 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 645bba6314..be2498b2e7 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -546,14 +546,6 @@ init_cpu_features (struct cpu_features *cpu_features)
 	  if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
 	    cpu_features->preferred[index_arch_Prefer_No_VZEROUPPER]
 	      |= bit_arch_Prefer_No_VZEROUPPER;
-
-	  /* Since to compare 2 32-byte strings, 256-bit EVEX strcmp
-	     requires 2 loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp
-	     requires 1 load, 2 VPCMPEQs, 1 VPMINU and 1 VPMOVMSKB,
-	     AVX2 strcmp is faster than EVEX strcmp.  */
-	  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2))
-	    cpu_features->preferred[index_arch_Prefer_AVX2_STRCMP]
-	      |= bit_arch_Prefer_AVX2_STRCMP;
 	}
 
       /* Avoid avoid short distance REP MOVSB on processor with FSRM.  */
diff --git a/sysdeps/x86/cpu-tunables.c b/sysdeps/x86/cpu-tunables.c
index 00fe5045eb..61b05e5b1d 100644
--- a/sysdeps/x86/cpu-tunables.c
+++ b/sysdeps/x86/cpu-tunables.c
@@ -239,8 +239,6 @@ TUNABLE_CALLBACK (set_hwcaps) (tunable_val_t *valp)
 	      CHECK_GLIBC_IFUNC_PREFERRED_BOTH (n, cpu_features,
 						Fast_Copy_Backward,
 						disable, 18);
-	      CHECK_GLIBC_IFUNC_PREFERRED_NEED_BOTH
-		(n, cpu_features, Prefer_AVX2_STRCMP, AVX2, disable, 18);
 	    }
 	  break;
 	case 19:
diff --git a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
index d7c93f00c5..1530d594b3 100644
--- a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
+++ b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
@@ -32,5 +32,4 @@ BIT (Prefer_ERMS)
 BIT (Prefer_No_AVX512)
 BIT (MathVec_Prefer_No_AVX512)
 BIT (Prefer_FSRM)
-BIT (Prefer_AVX2_STRCMP)
 BIT (Avoid_Short_Distance_REP_MOVSB)
diff --git a/sysdeps/x86_64/multiarch/strcmp.c b/sysdeps/x86_64/multiarch/strcmp.c
index 62b7abeeee..7c2901bf44 100644
--- a/sysdeps/x86_64/multiarch/strcmp.c
+++ b/sysdeps/x86_64/multiarch/strcmp.c
@@ -43,8 +43,7 @@ IFUNC_SELECTOR (void)
     {
       if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
 	  && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
-	  && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
-	  && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_AVX2_STRCMP))
+	  && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
 	return OPTIMIZE (evex);
 
       if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
diff --git a/sysdeps/x86_64/multiarch/strncmp.c b/sysdeps/x86_64/multiarch/strncmp.c
index 60ba0fe356..f94a421784 100644
--- a/sysdeps/x86_64/multiarch/strncmp.c
+++ b/sysdeps/x86_64/multiarch/strncmp.c
@@ -43,8 +43,7 @@ IFUNC_SELECTOR (void)
     {
       if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
 	  && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
-	  && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
-	  && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_AVX2_STRCMP))
+	  && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
 	return OPTIMIZE (evex);
 
       if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
-- 
2.33.1



* Re: [PATCH 1/2] x86-64: Improve EVEX strcmp with masked load
  2021-11-01 12:54 ` [PATCH 1/2] x86-64: Improve EVEX strcmp with masked load H.J. Lu
@ 2022-04-23  1:30   ` Sunil Pandey
  0 siblings, 0 replies; 5+ messages in thread
From: Sunil Pandey @ 2022-04-23  1:30 UTC (permalink / raw)
  To: H.J. Lu, libc-stable; +Cc: GNU C Library

On Mon, Nov 1, 2021 at 5:56 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> In strcmp-evex.S, to compare 2 32-byte strings, replace
>
>         VMOVU   (%rdi, %rdx), %YMM0
>         VMOVU   (%rsi, %rdx), %YMM1
>         /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
>         VPCMP   $4, %YMM0, %YMM1, %k0
>         VPCMP   $0, %YMMZERO, %YMM0, %k1
>         VPCMP   $0, %YMMZERO, %YMM1, %k2
>         /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
>         kord    %k1, %k2, %k1
>         /* Each bit in K1 represents a NULL or a mismatch.  */
>         kord    %k0, %k1, %k1
>         kmovd   %k1, %ecx
>         testl   %ecx, %ecx
>         jne     L(last_vector)
>
> with
>
>         VMOVU   (%rdi, %rdx), %YMM0
>         VPTESTM %YMM0, %YMM0, %k2
>         /* Each bit cleared in K1 represents a mismatch or a null CHAR
>            in YMM0 and 32 bytes at (%rsi, %rdx).  */
>         VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
>         kmovd   %k1, %ecx
>         incl    %ecx
>         jne     L(last_vector)
>
> It makes EVEX strcmp faster than AVX2 strcmp by up to 30% on Tiger Lake
> and Ice Lake.
>
> Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
>  sysdeps/x86_64/multiarch/strcmp-evex.S | 461 +++++++++++++------------
>  1 file changed, 243 insertions(+), 218 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
> index 459eeed09f..0bea318abd 100644
> --- a/sysdeps/x86_64/multiarch/strcmp-evex.S
> +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
> @@ -41,6 +41,8 @@
>  # ifdef USE_AS_WCSCMP
>  /* Compare packed dwords.  */
>  #  define VPCMP                vpcmpd
> +#  define VPMINU       vpminud
> +#  define VPTESTM      vptestmd
>  #  define SHIFT_REG32  r8d
>  #  define SHIFT_REG64  r8
>  /* 1 dword char == 4 bytes.  */
> @@ -48,6 +50,8 @@
>  # else
>  /* Compare packed bytes.  */
>  #  define VPCMP                vpcmpb
> +#  define VPMINU       vpminub
> +#  define VPTESTM      vptestmb
>  #  define SHIFT_REG32  ecx
>  #  define SHIFT_REG64  rcx
>  /* 1 byte char == 1 byte.  */
> @@ -67,6 +71,9 @@
>  # define YMM5          ymm22
>  # define YMM6          ymm23
>  # define YMM7          ymm24
> +# define YMM8          ymm25
> +# define YMM9          ymm26
> +# define YMM10         ymm27
>
>  /* Warning!
>             wcscmp/wcsncmp have to use SIGNED comparison for elements.
> @@ -76,7 +83,7 @@
>  /* The main idea of the string comparison (byte or dword) using 256-bit
>     EVEX instructions consists of comparing (VPCMP) two ymm vectors. The
>     latter can be on either packed bytes or dwords depending on
> -   USE_AS_WCSCMP. In order to check the null char, algorithm keeps the
> +   USE_AS_WCSCMP. In order to check the null CHAR, algorithm keeps the
>     matched bytes/dwords, requiring 5 EVEX instructions (3 VPCMP and 2
>     KORD). In general, the costs of comparing VEC_SIZE bytes (32-bytes)
>     are 3 VPCMP and 2 KORD instructions, together with VMOVU and ktestd
> @@ -113,27 +120,21 @@ ENTRY (STRCMP)
>         jg      L(cross_page)
>         /* Start comparing 4 vectors.  */
>         VMOVU   (%rdi), %YMM0
> -       VMOVU   (%rsi), %YMM1
>
> -       /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
> -       VPCMP   $4, %YMM0, %YMM1, %k0
> +       /* Each bit set in K2 represents a non-null CHAR in YMM0.  */
> +       VPTESTM %YMM0, %YMM0, %k2
>
> -       /* Check for NULL in YMM0.  */
> -       VPCMP   $0, %YMMZERO, %YMM0, %k1
> -       /* Check for NULL in YMM1.  */
> -       VPCMP   $0, %YMMZERO, %YMM1, %k2
> -       /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
> -       kord    %k1, %k2, %k1
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in YMM0 and 32 bytes at (%rsi).  */
> +       VPCMP   $0, (%rsi), %YMM0, %k1{%k2}
>
> -       /* Each bit in K1 represents:
> -          1. A mismatch in YMM0 and YMM1.  Or
> -          2. A NULL in YMM0 or YMM1.
> -        */
> -       kord    %k0, %k1, %k1
> -
> -       ktestd  %k1, %k1
> -       je      L(next_3_vectors)
>         kmovd   %k1, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
> +       je      L(next_3_vectors)
>         tzcntl  %ecx, %edx
>  # ifdef USE_AS_WCSCMP
>         /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> @@ -162,9 +163,7 @@ L(return):
>  # endif
>         ret
>
> -       .p2align 4
>  L(return_vec_size):
> -       kmovd   %k1, %ecx
>         tzcntl  %ecx, %edx
>  # ifdef USE_AS_WCSCMP
>         /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> @@ -200,9 +199,7 @@ L(return_vec_size):
>  # endif
>         ret
>
> -       .p2align 4
>  L(return_2_vec_size):
> -       kmovd   %k1, %ecx
>         tzcntl  %ecx, %edx
>  # ifdef USE_AS_WCSCMP
>         /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> @@ -238,9 +235,7 @@ L(return_2_vec_size):
>  # endif
>         ret
>
> -       .p2align 4
>  L(return_3_vec_size):
> -       kmovd   %k1, %ecx
>         tzcntl  %ecx, %edx
>  # ifdef USE_AS_WCSCMP
>         /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> @@ -279,43 +274,45 @@ L(return_3_vec_size):
>         .p2align 4
>  L(next_3_vectors):
>         VMOVU   VEC_SIZE(%rdi), %YMM0
> -       VMOVU   VEC_SIZE(%rsi), %YMM1
> -       /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
> -       VPCMP   $4, %YMM0, %YMM1, %k0
> -       VPCMP   $0, %YMMZERO, %YMM0, %k1
> -       VPCMP   $0, %YMMZERO, %YMM1, %k2
> -       /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K1 represents a NULL or a mismatch.  */
> -       kord    %k0, %k1, %k1
> -       ktestd  %k1, %k1
> +       /* Each bit set in K2 represents a non-null CHAR in YMM0.  */
> +       VPTESTM %YMM0, %YMM0, %k2
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in YMM0 and 32 bytes at VEC_SIZE(%rsi).  */
> +       VPCMP   $0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
>         jne     L(return_vec_size)
>
> -       VMOVU   (VEC_SIZE * 2)(%rdi), %YMM2
> -       VMOVU   (VEC_SIZE * 3)(%rdi), %YMM3
> -       VMOVU   (VEC_SIZE * 2)(%rsi), %YMM4
> -       VMOVU   (VEC_SIZE * 3)(%rsi), %YMM5
> -
> -       /* Each bit in K0 represents a mismatch in YMM2 and YMM4.  */
> -       VPCMP   $4, %YMM2, %YMM4, %k0
> -       VPCMP   $0, %YMMZERO, %YMM2, %k1
> -       VPCMP   $0, %YMMZERO, %YMM4, %k2
> -       /* Each bit in K1 represents a NULL in YMM2 or YMM4.  */
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K1 represents a NULL or a mismatch.  */
> -       kord    %k0, %k1, %k1
> -       ktestd  %k1, %k1
> +       VMOVU   (VEC_SIZE * 2)(%rdi), %YMM0
> +       /* Each bit set in K2 represents a non-null CHAR in YMM0.  */
> +       VPTESTM %YMM0, %YMM0, %k2
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
> +       VPCMP   $0, (VEC_SIZE * 2)(%rsi), %YMM0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
>         jne     L(return_2_vec_size)
>
> -       /* Each bit in K0 represents a mismatch in YMM3 and YMM5.  */
> -       VPCMP   $4, %YMM3, %YMM5, %k0
> -       VPCMP   $0, %YMMZERO, %YMM3, %k1
> -       VPCMP   $0, %YMMZERO, %YMM5, %k2
> -       /* Each bit in K1 represents a NULL in YMM3 or YMM5.  */
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K1 represents a NULL or a mismatch.  */
> -       kord    %k0, %k1, %k1
> -       ktestd  %k1, %k1
> +       VMOVU   (VEC_SIZE * 3)(%rdi), %YMM0
> +       /* Each bit set in K2 represents a non-null CHAR in YMM0.  */
> +       VPTESTM %YMM0, %YMM0, %k2
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
> +       VPCMP   $0, (VEC_SIZE * 3)(%rsi), %YMM0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
>         jne     L(return_3_vec_size)
>  L(main_loop_header):
>         leaq    (VEC_SIZE * 4)(%rdi), %rdx
> @@ -365,56 +362,51 @@ L(back_to_loop):
>         VMOVA   VEC_SIZE(%rax), %YMM2
>         VMOVA   (VEC_SIZE * 2)(%rax), %YMM4
>         VMOVA   (VEC_SIZE * 3)(%rax), %YMM6
> -       VMOVU   (%rdx), %YMM1
> -       VMOVU   VEC_SIZE(%rdx), %YMM3
> -       VMOVU   (VEC_SIZE * 2)(%rdx), %YMM5
> -       VMOVU   (VEC_SIZE * 3)(%rdx), %YMM7
> -
> -       VPCMP   $4, %YMM0, %YMM1, %k0
> -       VPCMP   $0, %YMMZERO, %YMM0, %k1
> -       VPCMP   $0, %YMMZERO, %YMM1, %k2
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K4 represents a NULL or a mismatch in YMM0 and
> -          YMM1.  */
> -       kord    %k0, %k1, %k4
> -
> -       VPCMP   $4, %YMM2, %YMM3, %k0
> -       VPCMP   $0, %YMMZERO, %YMM2, %k1
> -       VPCMP   $0, %YMMZERO, %YMM3, %k2
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K5 represents a NULL or a mismatch in YMM2 and
> -          YMM3.  */
> -       kord    %k0, %k1, %k5
> -
> -       VPCMP   $4, %YMM4, %YMM5, %k0
> -       VPCMP   $0, %YMMZERO, %YMM4, %k1
> -       VPCMP   $0, %YMMZERO, %YMM5, %k2
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K6 represents a NULL or a mismatch in YMM4 and
> -          YMM5.  */
> -       kord    %k0, %k1, %k6
> -
> -       VPCMP   $4, %YMM6, %YMM7, %k0
> -       VPCMP   $0, %YMMZERO, %YMM6, %k1
> -       VPCMP   $0, %YMMZERO, %YMM7, %k2
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K7 represents a NULL or a mismatch in YMM6 and
> -          YMM7.  */
> -       kord    %k0, %k1, %k7
> -
> -       kord    %k4, %k5, %k0
> -       kord    %k6, %k7, %k1
> -
> -       /* Test each mask (32 bits) individually because for VEC_SIZE
> -          == 32 is not possible to OR the four masks and keep all bits
> -          in a 64-bit integer register, differing from SSE2 strcmp
> -          where ORing is possible.  */
> -       kortestd %k0, %k1
> -       je      L(loop)
> -       ktestd  %k4, %k4
> +
> +       VPMINU  %YMM0, %YMM2, %YMM8
> +       VPMINU  %YMM4, %YMM6, %YMM9
> +
> +       /* A zero CHAR in YMM8 means that there is a null CHAR.  */
> +       VPMINU  %YMM8, %YMM9, %YMM8
> +
> +       /* Each bit set in K1 represents a non-null CHAR in YMM8.  */
> +       VPTESTM %YMM8, %YMM8, %k1
> +
> +       /* (YMM ^ YMM): A non-zero CHAR represents a mismatch.  */
> +       vpxorq  (%rdx), %YMM0, %YMM1
> +       vpxorq  VEC_SIZE(%rdx), %YMM2, %YMM3
> +       vpxorq  (VEC_SIZE * 2)(%rdx), %YMM4, %YMM5
> +       vpxorq  (VEC_SIZE * 3)(%rdx), %YMM6, %YMM7
> +
> +       vporq   %YMM1, %YMM3, %YMM9
> +       vporq   %YMM5, %YMM7, %YMM10
> +
> +       /* A non-zero CHAR in YMM9 represents a mismatch.  */
> +       vporq   %YMM9, %YMM10, %YMM9
> +
> +       /* Each bit cleared in K0 represents a mismatch or a null CHAR.  */
> +       VPCMP   $0, %YMMZERO, %YMM9, %k0{%k1}
> +       kmovd   %k0, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
> +       je       L(loop)
> +
> +       /* Each bit set in K1 represents a non-null CHAR in YMM0.  */
> +       VPTESTM %YMM0, %YMM0, %k1
> +       /* Each bit cleared in K0 represents a mismatch or a null CHAR
> +          in YMM0 and (%rdx).  */
> +       VPCMP   $0, %YMMZERO, %YMM1, %k0{%k1}
> +       kmovd   %k0, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
>         je      L(test_vec)
> -       kmovd   %k4, %edi
> -       tzcntl  %edi, %ecx
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_WCSCMP
>         /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
>         sall    $2, %ecx
> @@ -456,9 +448,18 @@ L(test_vec):
>         cmpq    $VEC_SIZE, %r11
>         jbe     L(zero)
>  # endif
> -       ktestd  %k5, %k5
> +       /* Each bit set in K1 represents a non-null CHAR in YMM2.  */
> +       VPTESTM %YMM2, %YMM2, %k1
> +       /* Each bit cleared in K0 represents a mismatch or a null CHAR
> +          in YMM2 and VEC_SIZE(%rdx).  */
> +       VPCMP   $0, %YMMZERO, %YMM3, %k0{%k1}
> +       kmovd   %k0, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
>         je      L(test_2_vec)
> -       kmovd   %k5, %ecx
>         tzcntl  %ecx, %edi
>  # ifdef USE_AS_WCSCMP
>         /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> @@ -502,9 +503,18 @@ L(test_2_vec):
>         cmpq    $(VEC_SIZE * 2), %r11
>         jbe     L(zero)
>  # endif
> -       ktestd  %k6, %k6
> +       /* Each bit set in K1 represents a non-null CHAR in YMM4.  */
> +       VPTESTM %YMM4, %YMM4, %k1
> +       /* Each bit cleared in K0 represents a mismatch or a null CHAR
> +          in YMM4 and (VEC_SIZE * 2)(%rdx).  */
> +       VPCMP   $0, %YMMZERO, %YMM5, %k0{%k1}
> +       kmovd   %k0, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
>         je      L(test_3_vec)
> -       kmovd   %k6, %ecx
>         tzcntl  %ecx, %edi
>  # ifdef USE_AS_WCSCMP
>         /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> @@ -548,8 +558,18 @@ L(test_3_vec):
>         cmpq    $(VEC_SIZE * 3), %r11
>         jbe     L(zero)
>  # endif
> -       kmovd   %k7, %esi
> -       tzcntl  %esi, %ecx
> +       /* Each bit set in K1 represents a non-null CHAR in YMM6.  */
> +       VPTESTM %YMM6, %YMM6, %k1
> +       /* Each bit cleared in K0 represents a mismatch or a null CHAR
> +          in YMM6 and (VEC_SIZE * 3)(%rdx).  */
> +       VPCMP   $0, %YMMZERO, %YMM7, %k0{%k1}
> +       kmovd   %k0, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_WCSCMP
>         /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
>         sall    $2, %ecx
> @@ -605,39 +625,51 @@ L(loop_cross_page):
>
>         VMOVU   (%rax, %r10), %YMM2
>         VMOVU   VEC_SIZE(%rax, %r10), %YMM3
> -       VMOVU   (%rdx, %r10), %YMM4
> -       VMOVU   VEC_SIZE(%rdx, %r10), %YMM5
> -
> -       VPCMP   $4, %YMM4, %YMM2, %k0
> -       VPCMP   $0, %YMMZERO, %YMM2, %k1
> -       VPCMP   $0, %YMMZERO, %YMM4, %k2
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K1 represents a NULL or a mismatch in YMM2 and
> -          YMM4.  */
> -       kord    %k0, %k1, %k1
> -
> -       VPCMP   $4, %YMM5, %YMM3, %k3
> -       VPCMP   $0, %YMMZERO, %YMM3, %k4
> -       VPCMP   $0, %YMMZERO, %YMM5, %k5
> -       kord    %k4, %k5, %k4
> -       /* Each bit in K3 represents a NULL or a mismatch in YMM3 and
> -          YMM5.  */
> -       kord    %k3, %k4, %k3
> +
> +       /* Each bit set in K2 represents a non-null CHAR in YMM2.  */
> +       VPTESTM %YMM2, %YMM2, %k2
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in YMM2 and 32 bytes at (%rdx, %r10).  */
> +       VPCMP   $0, (%rdx, %r10), %YMM2, %k1{%k2}
> +       kmovd   %k1, %r9d
> +       /* Don't use subl since it is the lower 16/32 bits of RDI
> +          below.  */
> +       notl    %r9d
> +# ifdef USE_AS_WCSCMP
> +       /* Only last 8 bits are valid.  */
> +       andl    $0xff, %r9d
> +# endif
> +
> +       /* Each bit set in K4 represents a non-null CHAR in YMM3.  */
> +       VPTESTM %YMM3, %YMM3, %k4
> +       /* Each bit cleared in K3 represents a mismatch or a null CHAR
> +          in YMM3 and 32 bytes at VEC_SIZE(%rdx, %r10).  */
> +       VPCMP   $0, VEC_SIZE(%rdx, %r10), %YMM3, %k3{%k4}
> +       kmovd   %k3, %edi
> +# ifdef USE_AS_WCSCMP
> +       /* Don't use subl since it is the upper 8 bits of EDI below.  */
> +       notl    %edi
> +       andl    $0xff, %edi
> +# else
> +       incl    %edi
> +# endif
>
>  # ifdef USE_AS_WCSCMP
> -       /* NB: Each bit in K1/K3 represents 4-byte element.  */
> -       kshiftlw $8, %k3, %k2
> +       /* NB: Each bit in EDI/R9D represents 4-byte element.  */
> +       sall    $8, %edi
>         /* NB: Divide shift count by 4 since each bit in K1 represent 4
>            bytes.  */
>         movl    %ecx, %SHIFT_REG32
>         sarl    $2, %SHIFT_REG32
> +
> +       /* Each bit in EDI represents a null CHAR or a mismatch.  */
> +       orl     %r9d, %edi
>  # else
> -       kshiftlq $32, %k3, %k2
> -# endif
> +       salq    $32, %rdi
>
> -       /* Each bit in K1 represents a NULL or a mismatch.  */
> -       korq    %k1, %k2, %k1
> -       kmovq   %k1, %rdi
> +       /* Each bit in RDI represents a null CHAR or a mismatch.  */
> +       orq     %r9, %rdi
> +# endif
>
>         /* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
>         shrxq   %SHIFT_REG64, %rdi, %rdi
> @@ -682,35 +714,45 @@ L(loop_cross_page_2_vec):
>         /* The first VEC_SIZE * 2 bytes match or are ignored.  */
>         VMOVU   (VEC_SIZE * 2)(%rax, %r10), %YMM0
>         VMOVU   (VEC_SIZE * 3)(%rax, %r10), %YMM1
> -       VMOVU   (VEC_SIZE * 2)(%rdx, %r10), %YMM2
> -       VMOVU   (VEC_SIZE * 3)(%rdx, %r10), %YMM3
> -
> -       VPCMP   $4, %YMM0, %YMM2, %k0
> -       VPCMP   $0, %YMMZERO, %YMM0, %k1
> -       VPCMP   $0, %YMMZERO, %YMM2, %k2
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K1 represents a NULL or a mismatch in YMM0 and
> -          YMM2.  */
> -       kord    %k0, %k1, %k1
> -
> -       VPCMP   $4, %YMM1, %YMM3, %k3
> -       VPCMP   $0, %YMMZERO, %YMM1, %k4
> -       VPCMP   $0, %YMMZERO, %YMM3, %k5
> -       kord    %k4, %k5, %k4
> -       /* Each bit in K3 represents a NULL or a mismatch in YMM1 and
> -          YMM3.  */
> -       kord    %k3, %k4, %k3
>
> +       VPTESTM %YMM0, %YMM0, %k2
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rdx, %r10).  */
> +       VPCMP   $0, (VEC_SIZE * 2)(%rdx, %r10), %YMM0, %k1{%k2}
> +       kmovd   %k1, %r9d
> +       /* Don't use subl since it is the lower 16/32 bits of RDI
> +          below.  */
> +       notl    %r9d
>  # ifdef USE_AS_WCSCMP
> -       /* NB: Each bit in K1/K3 represents 4-byte element.  */
> -       kshiftlw $8, %k3, %k2
> +       /* Only last 8 bits are valid.  */
> +       andl    $0xff, %r9d
> +# endif
> +
> +       VPTESTM %YMM1, %YMM1, %k4
> +       /* Each bit cleared in K3 represents a mismatch or a null CHAR
> +          in YMM1 and 32 bytes at (VEC_SIZE * 3)(%rdx, %r10).  */
> +       VPCMP   $0, (VEC_SIZE * 3)(%rdx, %r10), %YMM1, %k3{%k4}
> +       kmovd   %k3, %edi
> +# ifdef USE_AS_WCSCMP
> +       /* Don't use subl since it is the upper 8 bits of EDI below.  */
> +       notl    %edi
> +       andl    $0xff, %edi
>  # else
> -       kshiftlq $32, %k3, %k2
> +       incl    %edi
>  # endif
>
> -       /* Each bit in K1 represents a NULL or a mismatch.  */
> -       korq    %k1, %k2, %k1
> -       kmovq   %k1, %rdi
> +# ifdef USE_AS_WCSCMP
> +       /* NB: Each bit in EDI/R9D represents 4-byte element.  */
> +       sall    $8, %edi
> +
> +       /* Each bit in EDI represents a null CHAR or a mismatch.  */
> +       orl     %r9d, %edi
> +# else
> +       salq    $32, %rdi
> +
> +       /* Each bit in RDI represents a null CHAR or a mismatch.  */
> +       orq     %r9, %rdi
> +# endif
>
>         xorl    %r8d, %r8d
>         /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
> @@ -719,12 +761,15 @@ L(loop_cross_page_2_vec):
>         /* R8 has number of bytes skipped.  */
>         movl    %ecx, %r8d
>  # ifdef USE_AS_WCSCMP
> -       /* NB: Divide shift count by 4 since each bit in K1 represent 4
> +       /* NB: Divide shift count by 4 since each bit in RDI represent 4
>            bytes.  */
>         sarl    $2, %ecx
> -# endif
> +       /* Skip ECX bytes.  */
> +       shrl    %cl, %edi
> +# else
>         /* Skip ECX bytes.  */
>         shrq    %cl, %rdi
> +# endif
>  1:
>         /* Before jumping back to the loop, set ESI to the number of
>            VEC_SIZE * 4 blocks before page crossing.  */
> @@ -808,7 +853,7 @@ L(cross_page_loop):
>         movzbl  (%rdi, %rdx), %eax
>         movzbl  (%rsi, %rdx), %ecx
>  # endif
> -       /* Check null char.  */
> +       /* Check null CHAR.  */
>         testl   %eax, %eax
>         jne     L(cross_page_loop)
>         /* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
> @@ -891,18 +936,17 @@ L(cross_page):
>         jg      L(cross_page_1_vector)
>  L(loop_1_vector):
>         VMOVU   (%rdi, %rdx), %YMM0
> -       VMOVU   (%rsi, %rdx), %YMM1
> -
> -       /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
> -       VPCMP   $4, %YMM0, %YMM1, %k0
> -       VPCMP   $0, %YMMZERO, %YMM0, %k1
> -       VPCMP   $0, %YMMZERO, %YMM1, %k2
> -       /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K1 represents a NULL or a mismatch.  */
> -       kord    %k0, %k1, %k1
> +
> +       VPTESTM %YMM0, %YMM0, %k2
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in YMM0 and 32 bytes at (%rsi, %rdx).  */
> +       VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
>         kmovd   %k1, %ecx
> -       testl   %ecx, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xff, %ecx
> +# else
> +       incl    %ecx
> +# endif
>         jne     L(last_vector)
>
>         addl    $VEC_SIZE, %edx
> @@ -921,18 +965,17 @@ L(cross_page_1_vector):
>         cmpl    $(PAGE_SIZE - 16), %eax
>         jg      L(cross_page_1_xmm)
>         VMOVU   (%rdi, %rdx), %XMM0
> -       VMOVU   (%rsi, %rdx), %XMM1
> -
> -       /* Each bit in K0 represents a mismatch in XMM0 and XMM1.  */
> -       VPCMP   $4, %XMM0, %XMM1, %k0
> -       VPCMP   $0, %XMMZERO, %XMM0, %k1
> -       VPCMP   $0, %XMMZERO, %XMM1, %k2
> -       /* Each bit in K1 represents a NULL in XMM0 or XMM1.  */
> -       korw    %k1, %k2, %k1
> -       /* Each bit in K1 represents a NULL or a mismatch.  */
> -       korw    %k0, %k1, %k1
> -       kmovw   %k1, %ecx
> -       testl   %ecx, %ecx
> +
> +       VPTESTM %YMM0, %YMM0, %k2
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in XMM0 and 16 bytes at (%rsi, %rdx).  */
> +       VPCMP   $0, (%rsi, %rdx), %XMM0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +# ifdef USE_AS_WCSCMP
> +       subl    $0xf, %ecx
> +# else
> +       subl    $0xffff, %ecx
> +# endif
>         jne     L(last_vector)
>
>         addl    $16, %edx
> @@ -955,25 +998,16 @@ L(cross_page_1_xmm):
>         vmovq   (%rdi, %rdx), %XMM0
>         vmovq   (%rsi, %rdx), %XMM1
>
> -       /* Each bit in K0 represents a mismatch in XMM0 and XMM1.  */
> -       VPCMP   $4, %XMM0, %XMM1, %k0
> -       VPCMP   $0, %XMMZERO, %XMM0, %k1
> -       VPCMP   $0, %XMMZERO, %XMM1, %k2
> -       /* Each bit in K1 represents a NULL in XMM0 or XMM1.  */
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K1 represents a NULL or a mismatch.  */
> -       kord    %k0, %k1, %k1
> -       kmovd   %k1, %ecx
> -
> +       VPTESTM %YMM0, %YMM0, %k2
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in XMM0 and XMM1.  */
> +       VPCMP   $0, %XMM1, %XMM0, %k1{%k2}
> +       kmovb   %k1, %ecx
>  # ifdef USE_AS_WCSCMP
> -       /* Only last 2 bits are valid.  */
> -       andl    $0x3, %ecx
> +       subl    $0x3, %ecx
>  # else
> -       /* Only last 8 bits are valid.  */
> -       andl    $0xff, %ecx
> +       subl    $0xff, %ecx
>  # endif
> -
> -       testl   %ecx, %ecx
>         jne     L(last_vector)
>
>         addl    $8, %edx
> @@ -992,25 +1026,16 @@ L(cross_page_8bytes):
>         vmovd   (%rdi, %rdx), %XMM0
>         vmovd   (%rsi, %rdx), %XMM1
>
> -       /* Each bit in K0 represents a mismatch in XMM0 and XMM1.  */
> -       VPCMP   $4, %XMM0, %XMM1, %k0
> -       VPCMP   $0, %XMMZERO, %XMM0, %k1
> -       VPCMP   $0, %XMMZERO, %XMM1, %k2
> -       /* Each bit in K1 represents a NULL in XMM0 or XMM1.  */
> -       kord    %k1, %k2, %k1
> -       /* Each bit in K1 represents a NULL or a mismatch.  */
> -       kord    %k0, %k1, %k1
> +       VPTESTM %YMM0, %YMM0, %k2
> +       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> +          in XMM0 and XMM1.  */
> +       VPCMP   $0, %XMM1, %XMM0, %k1{%k2}
>         kmovd   %k1, %ecx
> -
>  # ifdef USE_AS_WCSCMP
> -       /* Only the last bit is valid.  */
> -       andl    $0x1, %ecx
> +       subl    $0x1, %ecx
>  # else
> -       /* Only last 4 bits are valid.  */
> -       andl    $0xf, %ecx
> +       subl    $0xf, %ecx
>  # endif
> -
> -       testl   %ecx, %ecx
>         jne     L(last_vector)
>
>         addl    $4, %edx
> --
> 2.33.1
>
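To make the masked-compare idiom above concrete, here is a rough scalar
model of what one VPTESTM + VPCMP{%k2} + KMOVD + INCL sequence computes
for a single 32-byte vector.  It is only a sketch: the function name and
layout are invented for illustration, not code from the patch.

    #include <stdint.h>

    /* Scalar model of: VPTESTM %YMM0, %YMM0, %k2;
       VPCMP $0, (mem), %YMM0, %k1{%k2}; KMOVD %k1, %ecx; INCL %ecx.  */
    static int
    vec32_has_null_or_diff (const unsigned char *s1, const unsigned char *s2)
    {
      uint32_t k1 = 0;
      for (int i = 0; i < 32; i++)
        {
          int nonnull = s1[i] != 0;     /* VPTESTM: bit i of %k2.  */
          int equal = s1[i] == s2[i];   /* VPCMP $0: equality test.  */
          if (nonnull && equal)         /* zero-masked through %k2.  */
            k1 |= (uint32_t) 1 << i;
        }
      /* INCL: 0xffffffff + 1 wraps to 0, so the zero flag is set only
         when every byte is non-null and equal; any cleared bit takes
         the jne to the return-vector code.  */
      return k1 + 1 != 0;
    }

Folding the null test into the compare through the %k2 write mask is
what removes the extra VPCMPs and KORDs counted in the patch
description; in the wide-character build only the low 8 mask bits are
valid, which is why the USE_AS_WCSCMP paths use subl $0xff instead of
incl.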

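The main loop batches the same idea across four vectors at a time: a
VPMINU chain detects a null CHAR in any of the four source vectors, a
vpxorq/vporq chain detects a mismatch in any of the four pairs, and a
single masked VPCMP against zero folds both into one mask.  A scalar
model, again with invented names and intended only as a sketch of the
logic:

    #include <stdint.h>

    /* Model of one loop iteration's combined test over 4 x 32 bytes.  */
    static int
    block128_clean (const unsigned char *s1, const unsigned char *s2)
    {
      uint32_t k0 = 0;
      for (int i = 0; i < 32; i++)
        {
          /* VPMINU chain: zero iff some vector has a null CHAR in lane i.  */
          unsigned int min = s1[i];
          /* vpxorq/vporq chain: non-zero iff some pair mismatches in lane i.  */
          unsigned int diff = s1[i] ^ s2[i];
          for (int v = 1; v < 4; v++)
            {
              if (s1[v * 32 + i] < min)
                min = s1[v * 32 + i];
              diff |= s1[v * 32 + i] ^ s2[v * 32 + i];
            }
          /* VPCMP $0 against zero, write-masked by VPTESTM's result.  */
          if (min != 0 && diff == 0)
            k0 |= (uint32_t) 1 << i;
        }
      /* KMOVD + INCL: stay in the loop only if all 32 lanes are clean;
         otherwise the tail code re-tests each vector to locate the
         offending one.  */
      return k0 + 1 == 0;
    }
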
I would like to backport this patch to release branches.
Any comments or objections?

--Sunil


* Re: [PATCH 2/2] x86-64: Remove Prefer_AVX2_STRCMP
  2021-11-01 12:54 ` [PATCH 2/2] x86-64: Remove Prefer_AVX2_STRCMP H.J. Lu
@ 2022-04-23  1:34   ` Sunil Pandey
  0 siblings, 0 replies; 5+ messages in thread
From: Sunil Pandey @ 2022-04-23  1:34 UTC (permalink / raw)
  To: H.J. Lu, libc-stable; +Cc: GNU C Library

On Mon, Nov 1, 2021 at 5:54 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Remove Prefer_AVX2_STRCMP to enable EVEX strcmp.  When comparing 2 32-byte
> strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
> VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
> and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
> VPMOVMSKB and 1 TESTL.  EVEX strcmp is now faster than AVX2 strcmp by up
> to 40% on Tiger Lake and Ice Lake.
> ---
>  sysdeps/x86/cpu-features.c                                | 8 --------
>  sysdeps/x86/cpu-tunables.c                                | 2 --
>  .../include/cpu-features-preferred_feature_index_1.def    | 1 -
>  sysdeps/x86_64/multiarch/strcmp.c                         | 3 +--
>  sysdeps/x86_64/multiarch/strncmp.c                        | 3 +--
>  5 files changed, 2 insertions(+), 15 deletions(-)
>
> diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
> index 645bba6314..be2498b2e7 100644
> --- a/sysdeps/x86/cpu-features.c
> +++ b/sysdeps/x86/cpu-features.c
> @@ -546,14 +546,6 @@ init_cpu_features (struct cpu_features *cpu_features)
>           if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
>             cpu_features->preferred[index_arch_Prefer_No_VZEROUPPER]
>               |= bit_arch_Prefer_No_VZEROUPPER;
> -
> -         /* Since to compare 2 32-byte strings, 256-bit EVEX strcmp
> -            requires 2 loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp
> -            requires 1 load, 2 VPCMPEQs, 1 VPMINU and 1 VPMOVMSKB,
> -            AVX2 strcmp is faster than EVEX strcmp.  */
> -         if (CPU_FEATURE_USABLE_P (cpu_features, AVX2))
> -           cpu_features->preferred[index_arch_Prefer_AVX2_STRCMP]
> -             |= bit_arch_Prefer_AVX2_STRCMP;
>         }
>
>        /* Avoid short distance REP MOVSB on processor with FSRM.  */
> diff --git a/sysdeps/x86/cpu-tunables.c b/sysdeps/x86/cpu-tunables.c
> index 00fe5045eb..61b05e5b1d 100644
> --- a/sysdeps/x86/cpu-tunables.c
> +++ b/sysdeps/x86/cpu-tunables.c
> @@ -239,8 +239,6 @@ TUNABLE_CALLBACK (set_hwcaps) (tunable_val_t *valp)
>               CHECK_GLIBC_IFUNC_PREFERRED_BOTH (n, cpu_features,
>                                                 Fast_Copy_Backward,
>                                                 disable, 18);
> -             CHECK_GLIBC_IFUNC_PREFERRED_NEED_BOTH
> -               (n, cpu_features, Prefer_AVX2_STRCMP, AVX2, disable, 18);
>             }
>           break;
>         case 19:
> diff --git a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
> index d7c93f00c5..1530d594b3 100644
> --- a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
> +++ b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
> @@ -32,5 +32,4 @@ BIT (Prefer_ERMS)
>  BIT (Prefer_No_AVX512)
>  BIT (MathVec_Prefer_No_AVX512)
>  BIT (Prefer_FSRM)
> -BIT (Prefer_AVX2_STRCMP)
>  BIT (Avoid_Short_Distance_REP_MOVSB)
> diff --git a/sysdeps/x86_64/multiarch/strcmp.c b/sysdeps/x86_64/multiarch/strcmp.c
> index 62b7abeeee..7c2901bf44 100644
> --- a/sysdeps/x86_64/multiarch/strcmp.c
> +++ b/sysdeps/x86_64/multiarch/strcmp.c
> @@ -43,8 +43,7 @@ IFUNC_SELECTOR (void)
>      {
>        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
>           && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> -         && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
> -         && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_AVX2_STRCMP))
> +         && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
>         return OPTIMIZE (evex);
>
>        if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
> diff --git a/sysdeps/x86_64/multiarch/strncmp.c b/sysdeps/x86_64/multiarch/strncmp.c
> index 60ba0fe356..f94a421784 100644
> --- a/sysdeps/x86_64/multiarch/strncmp.c
> +++ b/sysdeps/x86_64/multiarch/strncmp.c
> @@ -43,8 +43,7 @@ IFUNC_SELECTOR (void)
>      {
>        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
>           && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> -         && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
> -         && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_AVX2_STRCMP))
> +         && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
>         return OPTIMIZE (evex);
>
>        if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
> --
> 2.33.1
>
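The net effect of dropping the preference bit is easiest to see in the
two IFUNC selectors above.  Below is a condensed sketch of the EVEX gate
as it stands after the patch; the feature-check macro and feature names
are the ones in the quoted diff, while the header, the wrapper function
and its name are stand-ins rather than the literal glibc sources:

    #include <stdbool.h>
    /* cpu-features.h here stands for the glibc-internal header that
       defines struct cpu_features and CPU_FEATURE_USABLE_P.  */
    #include <cpu-features.h>

    static bool
    evex_strcmp_selected (const struct cpu_features *cpu_features)
    {
      /* With Prefer_AVX2_STRCMP gone, these three usable features alone
         decide whether OPTIMIZE (evex) is returned.  */
      return CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
             && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
             && CPU_FEATURE_USABLE_P (cpu_features, BMI2);
    }

On Tiger Lake and Ice Lake all three features are usable, so strcmp and
strncmp now resolve to the EVEX implementation that patch 1/2 speeds up.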

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil


end of thread, other threads:[~2022-04-23  1:34 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-01 12:54 [PATCH 0/2] Enable EVEX strcmp H.J. Lu
2021-11-01 12:54 ` [PATCH 1/2] x86-64: Improve EVEX strcmp with masked load H.J. Lu
2022-04-23  1:30   ` Sunil Pandey
2021-11-01 12:54 ` [PATCH 2/2] x86-64: Remove Prefer_AVX2_STRCMP H.J. Lu
2022-04-23  1:34   ` Sunil Pandey
