From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=bGGZ=C2=free.fr=slash.tmp@sourceware.org>
Received: from smtp5-g21.free.fr (smtp5-g21.free.fr [212.27.42.5])
	by sourceware.org (Postfix) with ESMTPS id AE1943858D1E
	for <gcc-help@gcc.gnu.org>; Sat,  8 Jul 2023 15:02:40 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org AE1943858D1E
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=free.fr
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=free.fr
Received: from [IPV6:2a02:8428:2a4:1a01:ac9a:3220:3127:4781] (unknown [IPv6:2a02:8428:2a4:1a01:ac9a:3220:3127:4781])
	(Authenticated sender: slash.tmp@free.fr)
	by smtp5-g21.free.fr (Postfix) with ESMTPSA id 7DB0B5FF96;
	Sat,  8 Jul 2023 17:02:28 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=free.fr;
	s=smtp-20201208; t=1688828559;
	bh=bed3h9ZPyH7c8SQGEiCb5f6KYwVxABzHG5yZafUzE+o=;
	h=Date:To:Cc:From:Subject:From;
	b=MB0jteTKyyEtNqzJP8F2qKSaPvPC3iNWqkKp4Xy9rVgScxXVe7SHaR9OrpHaNDC1Q
	 ubWJg8Z1F1C8/iGBCuycPvMCkfnBdhloHjBN39x8Jv8OS3oXVv3aZ8b4gJG0F9n1Fg
	 PtV5DmVIR1jj79ggW9RjhoDhKcZ/uDkorWrcOiEPCTHG/1Z6HktYQ84bOp4EaLfft5
	 VM6jhIVIR98AzxBmOMj0/I1S21aP/JnCqqfPrZ3eMOxfuDFgoe1aAlIIjPOZ5vX4TU
	 TgOM+NIrK8Z+5dAwKhzhFRGfb4JvCf3qvleTumCZZDcktXOfnqeGrXIiGBbVov9Tg5
	 IDWGTJxXDEDwg==
Message-ID: <d28df349-f510-69e9-b99b-9f72ff3a5436@free.fr>
Date: Sat, 8 Jul 2023 17:02:27 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.11.0
Content-Language: en-US
To: gcc-help@gcc.gnu.org
Cc: Jakub Jelinek <jakub@redhat.com>, Jeffrey Walton <noloader@gmail.com>,
 Uros Bizjak <ubizjak@gmail.com>, Roger Sayle <roger@nextmovesoftware.com>,
 Vincent Lefevre <vincent-gcc@vinc17.net>,
 Michael_S <already5chosen@yahoo.com>, Terje Mathisen
 <terje.mathisen@tmsw.no>, HPeter Anvin <hpa@zytor.com>,
 Wolfgang Kern <kesys@utanet.at>
From: Mason <slash.tmp@free.fr>
Subject: Tickling a weird CPU stall on Haswell
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-1.7 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,JMQ_SPF_NEUTRAL,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H5,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-help.gcc.gnu.org>

Hello everyone,

I'm running Linux on Haswell.
https://en.wikichip.org/wiki/intel/cpuid
https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client)

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 60
model name	: Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
stepping	: 3
microcode	: 0x28
cpu MHz		: 3265.821
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips	: 6585.02
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual


I was playing around with some code when I ran into a weird stall
that makes one variant run 2.5 times slower than the others(!!)

Back in the K7 days (which dates me) I used to know the µ-architecture guide
by heart (I even reported an undocumented stall condition), but I will confess
that I haven't kept up with details of new µ-arches for over a decade :(


Here's the code in question:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 512

typedef unsigned long long u64;

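/* Add the two-word value (*b:*a) into the three-word accumulator acc[2]:acc[1]:acc[0]. */
void inner(u64 *acc, const u64 *a, const u64 *b)
{
#if V1
  /* add, adc, adc: the carry is propagated all the way into acc[2] */
  asm("add %[LO], %[D0]\n\t" "adc %[HI], %[D1]\n\t" "adc $0, %[D2]" :
  [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2]) :
  [LO] "r" (*a), [HI] "r" (*b) : "cc");
#elif V2
  /* same operands as V1, but the final adc is hidden from the assembler by '#' */
  asm("add %[LO], %[D0]\n\t" "adc %[HI], %[D1]\n\t" "#adc $0, %[D2]" :
  [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2]) :
  [LO] "r" (*a), [HI] "r" (*b) : "cc");
#endif
}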

static int min(int u, int v) { return u < v ? u : v; }

void fun1(u64 *acc, const u64 *a, const u64 *b)
{
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      inner(acc+i+j, a+i, b+j);
}

void fun2(u64 *acc, const u64 *a, const u64 *b)
{
  for (int sum = 0; sum < 2*N-1; ++sum) {
    int v = min(sum, N-1);
    int u = sum - v;
    for (int i = u; i <= v; ++i)
      inner(acc+sum, a+i, b+sum-i);
  }
}

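/* One spare element per accumulator: inner() names acc[2], and acc+i+j
   reaches offset 2*N-2, so index 2*N would otherwise be out of bounds. */
u64 A[N], B[N], ACC1[N*2+1], ACC2[N*2+1];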

int main(int argc, char **argv)
{
  for (int i = 0; i < N; ++i) {
    A[i] = rand();
    B[i] = rand();
  }

  if (argc < 2) {
    fun1(ACC1, A, B);
    fun2(ACC2, A, B);
    printf("fun1 vs fun2 = %d\n", memcmp(ACC1, ACC2, sizeof(ACC1)));
    return 0;
  }

  int nf = atoi(argv[1]);
  for (int xp = 0; xp < 1000; ++xp) {
    if (nf == 1) fun1(ACC1, A, B);
    if (nf == 2) fun2(ACC1, A, B);
  }

  return 0;
}


The timed loops touch A, B and one accumulator array,
i.e. about 4*N*8 data bytes = 16 KB,
which should /entirely/ fit in L1 D$
(Haswell has 32 KB/core, 8-way set associative)
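
For anyone who wants to double-check that arithmetic, here is a minimal sketch.
The names footprint_bytes, fits_in_l1d and L1D_BYTES are made up for this
illustration (they are not in slower.c), and the 32 KB figure is simply the
L1 D$ size quoted above.

```c
#include <stddef.h>

typedef unsigned long long u64;

/* Assumption: 32 KB of L1 D$ per core, the figure quoted for Haswell. */
#define L1D_BYTES (32u * 1024u)

/* Bytes touched by one timed run: A[n], B[n] and one accumulator
   array of 2*n elements, all of type u64. */
size_t footprint_bytes(size_t n)
{
  return (n + n + 2 * n) * sizeof(u64);
}

/* Nonzero when that working set is no larger than L1 D$. */
int fits_in_l1d(size_t n)
{
  return footprint_bytes(n) <= L1D_BYTES;
}
```

With n = 512 this gives 16384 bytes, half of the assumed L1 D$ size. It says
nothing about set conflicts, only about total size.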

fun1 and fun2 perform /exactly/ the same calculation,
but in a different order.
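
To make "same calculation, different order" concrete, here is a small sketch
that replays both loop nests on a tiny grid and checks that each (i, j) pair
is visited exactly once, at the same accumulator offset i+j, by both orders.
M, visit_fun1, visit_fun2 and same_work are names invented for this
illustration; the loop bounds are copied from fun1 and fun2.

```c
#include <string.h>

#define M 4   /* hypothetical small stand-in for N */

static int min_int(int u, int v) { return u < v ? u : v; }

/* Row-major order, as in fun1: the acc offset i+j moves on every call. */
void visit_fun1(int hits[M][M], int off[M][M])
{
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < M; ++j) {
      hits[i][j]++;
      off[i][j] = i + j;
    }
}

/* Anti-diagonal order, as in fun2: the acc offset 'sum' stays fixed
   for the whole inner loop. */
void visit_fun2(int hits[M][M], int off[M][M])
{
  for (int sum = 0; sum < 2*M-1; ++sum) {
    int v = min_int(sum, M-1);
    int u = sum - v;
    for (int i = u; i <= v; ++i) {
      hits[i][sum-i]++;
      off[i][sum-i] = sum;
    }
  }
}

/* Returns 1 when both orders visit every pair exactly once
   and at identical offsets, 0 otherwise. */
int same_work(void)
{
  int h1[M][M], h2[M][M], o1[M][M], o2[M][M];
  memset(h1, 0, sizeof h1); memset(h2, 0, sizeof h2);
  memset(o1, 0, sizeof o1); memset(o2, 0, sizeof o2);
  visit_fun1(h1, o1);
  visit_fun2(h2, o2);
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < M; ++j)
      if (h1[i][j] != 1 || h2[i][j] != 1 || o1[i][j] != o2[i][j])
        return 0;
  return 1;
}
```

The only difference the sketch exposes is the access pattern: the fun1 order
slides the three-word window by one word per call, while the fun2 order
reuses the same window for a whole diagonal.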

inner_V2 = inner_V1 with the 3rd instruction (the final "adc $0") commented out.
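
For readers who don't speak inline asm, here is a plain-C model of what the
two variants compute. This is only a sketch of the arithmetic: the function
names inner_v1_model and inner_v2_model are invented here, the operands are
passed by value instead of through pointers, and the compiler is free to emit
something quite different from add/adc/adc, so it is not meant to reproduce
the timings.

```c
typedef unsigned long long u64;

/* Model of inner_V1: acc[0] += lo; acc[1] += hi + carry; acc[2] += carry. */
void inner_v1_model(u64 *acc, u64 lo, u64 hi)
{
  u64 s0 = acc[0] + lo;
  u64 c0 = s0 < lo;        /* carry out of the low word (add) */

  u64 t  = acc[1] + hi;
  u64 c1 = t < hi;         /* carry from acc[1] + hi */
  u64 s1 = t + c0;
  c1 |= s1 < t;            /* carry from adding c0; both cannot be set at once */

  acc[0] = s0;
  acc[1] = s1;
  acc[2] += c1;            /* the final "adc $0" */
}

/* Model of inner_V2: identical, except that the carry out of acc[1]
   is dropped instead of being added into acc[2]. */
void inner_v2_model(u64 *acc, u64 lo, u64 hi)
{
  u64 s0 = acc[0] + lo;
  u64 c0 = s0 < lo;

  acc[0] = s0;
  acc[1] = acc[1] + hi + c0;
}
```

Note that with inputs from rand() the carries are never set in practice, so
both variants store the same values; V1 just performs one more
read-modify-write per call, on acc[2].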

Here are the observed run-times on my system:

$ gcc -Wall -O2 -march=native -DV1 slower.c -o v1.out
$ gcc -Wall -O2 -march=native -DV2 slower.c -o v2.out

$ time ./v1.out 1
real	0m7,362s
user	0m7,328s
sys	0m0,004s

$ time ./v1.out 2
real	0m2,895s
user	0m2,895s
sys	0m0,000s

$ time ./v2.out 1
real	0m2,896s
user	0m2,884s
sys	0m0,000s

$ time ./v2.out 2
real	0m2,888s
user	0m2,888s
sys	0m0,000s


Why in heaven's name is fun1 with inner_V1
2.5 times slower than any one of
fun1 with inner_V2
fun2 with inner_V1
fun2 with inner_V2 ???

Some kind of memory-aliasing issue?

(I can show generated assembly if someone thinks it's useful,
but it's pretty much what I expected.)

Regards