Subject: Re: [PATCH 2/2] x86: Add generic CPUID data dumper to ld.so --list-diagnostics
Date: Mon, 11 Sep 2023 13:08:54 -0300
From: Adhemerval Zanella Netto
Organization: Linaro
To: Florian Weimer, libc-alpha@sourceware.org
References: <4a77d6294e0023338a8115fad9a3d549c47cae87.1694203757.git.fweimer@redhat.com>

On 08/09/23 17:10, Florian Weimer wrote:
> This is surprisingly difficult to implement if the goal is to produce
> reasonably sized output.
> With the current approaches to output
> compression (suppressing zeros and repeated results between CPUs,
> folding ranges of identical subleaves, dealing with the %ecx
> reflection issue), the output is less than 600 KiB even for systems
> with 256 threads.
>
> Tested on i686-linux-gnu and x86_64-linux-gnu.  Built with a fairly
> broad build-many-glibcs.py subset (including both Hurd targets).
>
> ---
>  manual/dynlink.texi                           |  86 +++-
>  .../linux/x86/dl-diagnostics-cpu-kernel.c     | 457 ++++++++++++++++++
>  2 files changed, 542 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
>
> diff --git a/manual/dynlink.texi b/manual/dynlink.texi
> index 06a6c15533..1f02124722 100644
> --- a/manual/dynlink.texi
> +++ b/manual/dynlink.texi
> @@ -228,7 +228,91 @@ reported by the @code{uname} function.  @xref{Platform Type}.
>  @item x86.cpu_features.@dots{}
>  These items are specific to the i386 and x86-64 architectures.  They
>  reflect supported CPU features and information on cache geometry, mostly
> -collected using the @code{CPUID} instruction.
> +collected using the CPUID instruction.
> +
> +@item x86.processor[@var{index}].@dots{}
> +These are additional items for the i386 and x86-64 architectures, as
> +described below.  They mostly contain raw data from the CPUID
> +instruction.  The probes are performed for each active CPU for the
> +@code{ld.so} process, and data for different probed CPUs receives a
> +unique @var{index} value.  Some CPUID data is expected to differ from
> +CPU core to CPU core.  In some cases, CPUs are not correctly
> +initialized and indicate the presence of different feature sets.
> +
> +@item x86.processor[@var{index}].requested=@var{kernel-cpu}
> +The kernel is told to run the subsequent probing on the CPU numbered
> +@var{kernel-cpu}.  The values @var{kernel-cpu} and @var{index} can be
> +distinct if there are gaps in the process CPU affinity mask.
> +This line is not included if CPU affinity mask information is not
> +available.
> +
> +@item x86.processor[@var{index}].observed=@var{kernel-cpu}
> +This line reports the kernel CPU number @var{kernel-cpu} on which the
> +probing code initially ran.  This line is only printed if the requested
> +and observed kernel CPU numbers differ.  This can happen if the kernel
> +fails to act on a request to change the process CPU affinity mask.
> +
> +@item x86.processor[@var{index}].observed_node=@var{node}
> +This reports the observed NUMA node number, as reported by the
> +@code{getcpu} system call.  It is missing if the @code{getcpu} system
> +call failed.
> +
> +@item x86.processor[@var{index}].cpuid_leaves=@var{count}
> +This line indicates that @var{count} distinct CPUID leaves were
> +encountered.  (This reflects internal @code{ld.so} storage space; it
> +does not directly correspond to @code{CPUID} enumeration ranges.)
> +
> +@item x86.processor[@var{index}].ecx_limit=@var{value}
> +The CPUID data extraction code uses a brute-force approach to enumerate
> +subleaves (see the @samp{.subleaf_eax} lines below).  The last
> +@code{%rcx} value used in a CPUID query on this probed CPU was
> +@var{value}.
> +
> +@item x86.processor[@var{index}].cpuid.eax[@var{query_eax}].eax=@var{eax}
> +@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].ebx=@var{ebx}
> +@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].ecx=@var{ecx}
> +@itemx x86.processor[@var{index}].cpuid.eax[@var{query_eax}].edx=@var{edx}
> +These lines report the register contents after executing the CPUID
> +instruction with @samp{%rax == @var{query_eax}} and @samp{%rcx == 0} (a
> +@dfn{leaf}).  For the first probed CPU (with a zero @var{index}), only
> +leaves with non-zero register contents are reported.  For subsequent
> +CPUs, only leaves whose register contents differ from the previously
> +probed CPU (with @var{index} one less) are reported.
> +
> +Basic and extended leaves are reported using the same syntax.  This
> +means there is a large jump in @var{query_eax} for the first reported
> +extended leaf.
> +
> +@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].eax=@var{eax}
> +@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ebx=@var{ebx}
> +@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ecx=@var{ecx}
> +@itemx x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].edx=@var{edx}
> +This is similar to the leaves above, but for a @dfn{subleaf}.  For
> +subleaves, the CPUID instruction is executed with @samp{%rax ==
> +@var{query_eax}} and @samp{%rcx == @var{query_ecx}}, so the result
> +depends on both register values.  The same rules about filtering zero
> +and identical results apply.
> +
> +@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].until_ecx=@var{ecx_limit}
> +Some CPUID results are the same regardless of the @var{query_ecx}
> +value.  If this situation is detected, a line with the
> +@samp{.until_ecx} selector is included, indicating that the CPUID
> +register contents are the same for @code{%rcx} values between
> +@var{query_ecx} and @var{ecx_limit} (inclusive).
> +
> +@item x86.processor[@var{index}].cpuid.subleaf_eax[@var{query_eax}].ecx[@var{query_ecx}].ecx_query_mask=0xff
> +This line indicates that in an @samp{.until_ecx} range, the CPUID
> +instruction preserved the lowest 8 bits of the input @code{%rcx} in
> +the output @code{%rcx} register.  Otherwise, the subleaves in the range
> +have identical values.  This special treatment is necessary to report
> +compact range information in case such copying occurs (because the
> +subleaves would otherwise all be different).
> +
> +@item x86.processor[@var{index}].xgetbv.ecx[@var{query_ecx}]=@var{result}
> +This line shows the 64-bit @var{result} value in the @code{%rdx:%rax}
> +register pair after executing the XGETBV instruction with @code{%rcx}
> +set to @var{query_ecx}.  Zero values and values matching the previously
> +probed CPU are omitted.  Nothing is printed if the system does not
> +support the XGETBV instruction.
>  @end table
>
>  @node Dynamic Linker Introspection
> diff --git a/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c b/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
> new file mode 100644
> index 0000000000..f84331b33b
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/x86/dl-diagnostics-cpu-kernel.c
> @@ -0,0 +1,457 @@
> +/* Print CPU/kernel diagnostics data in ld.so.  Version for x86.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +/* Register arguments to CPUID.  Multiple ECX subleaf values yielding
> +   the same result are combined, to shorten the output.  This covers
> +   both identical matches (EAX to EDX are the same) and matches where
> +   EAX, EBX, and EDX are equal, and ECX is equal except in the lower
> +   byte, which must match the query ECX value.
> +   The latter is needed to compress ranges
> +   on CPUs which preserve the lowest byte in ECX if an unknown leaf is
> +   queried.  */
> +struct cpuid_query
> +{
> +  unsigned int eax;
> +  unsigned ecx_first;
> +  unsigned ecx_last;
> +  bool ecx_preserves_query_byte;
> +};
> +
> +/* Single integer value that can be used for sorting/ordering
> +   comparisons.  Uses Q->eax and Q->ecx_first only because ecx_last is
> +   always greater than the previous ecx_first value and less than the
> +   subsequent one.  */
> +static inline unsigned long long int
> +cpuid_query_combined (struct cpuid_query *q)
> +{
> +  /* ecx can be -1 (that is, ~0U).  If this happens, this is the only
> +     ecx value for this eax value, so the ordering does not matter.  */
> +  return ((unsigned long long int) q->eax << 32) | (unsigned int) q->ecx_first;
> +};
> +
> +/* Used for differential reporting of zero/non-zero values.  */
> +static const struct cpuid_registers cpuid_registers_zero;
> +
> +/* Register arguments to CPUID paired with the results that came back.  */
> +struct cpuid_query_result
> +{
> +  struct cpuid_query q;
> +  struct cpuid_registers r;
> +};
> +
> +/* During a first enumeration pass, we try to collect data for
> +   cpuid_initial_subleaf_limit subleaves per leaf/EAX value.  If we
> +   run out of space, we try once more, applying the lower limit.  */
> +enum { cpuid_main_leaf_limit = 128 };
> +enum { cpuid_initial_subleaf_limit = 512 };
> +enum { cpuid_subleaf_limit = 32 };
> +
> +/* Offset of the extended leaf area.  */
> +enum { cpuid_extended_leaf_offset = 0x80000000 };
> +
> +/* Collected CPUID data.  Everything is stored in a statically sized
> +   array that is sized so that the second pass will collect some data
> +   for all leaves, after the limit is applied.  On the second pass,
> +   ecx_limit is set to cpuid_subleaf_limit.
> +   */
> +struct cpuid_collected_data
> +{
> +  unsigned int used;
> +  unsigned int ecx_limit;
> +  uint64_t xgetbv_ecx_0;
> +  struct cpuid_query_result qr[cpuid_main_leaf_limit
> +                               * 2 * cpuid_subleaf_limit];
> +};
> +
> +/* Fill in the result of a CPUID query.  Returns true if there is
> +   room, false if nothing could be stored.  */
> +static bool
> +_dl_diagnostics_cpuid_store (struct cpuid_collected_data *ccd,
> +                             unsigned eax, int ecx)
> +{
> +  if (ccd->used >= array_length (ccd->qr))
> +    return false;
> +
> +  /* Tentatively fill in the next value.  */
> +  __cpuid_count (eax, ecx,
> +                 ccd->qr[ccd->used].r.eax,
> +                 ccd->qr[ccd->used].r.ebx,
> +                 ccd->qr[ccd->used].r.ecx,
> +                 ccd->qr[ccd->used].r.edx);
> +
> +  /* If the ECX subleaf is the next subleaf after the previous one (for
> +     the same leaf), and the values are the same, merge the result
> +     with the already-stored one.  Do this before skipping zero
> +     leaves, which avoids artifacts for ECX == 256 queries.  */
> +  if (ccd->used > 0
> +      && ccd->qr[ccd->used - 1].q.eax == eax
> +      && ccd->qr[ccd->used - 1].q.ecx_last + 1 == ecx)
> +    {
> +      /* Exact match of the previous result.  Ignore the value of
> +         ecx_preserves_query_byte if this is a singleton range so far
> +         because we can treat ECX as fixed if the same value repeats.  */
> +      if ((!ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte
> +           || (ccd->qr[ccd->used - 1].q.ecx_first
> +               == ccd->qr[ccd->used - 1].q.ecx_last))
> +          && memcmp (&ccd->qr[ccd->used - 1].r, &ccd->qr[ccd->used].r,
> +                     sizeof (ccd->qr[ccd->used].r)) == 0)
> +        {
> +          ccd->qr[ccd->used - 1].q.ecx_last = ecx;
> +          /* ECX is now fixed because the same value has been observed
> +             twice, even if we had a low-byte match before.  */
> +          ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte = false;
> +          return true;
> +        }
> +      /* Match except for the low byte in ECX, which must match the
> +         incoming ECX value.
> +         */
> +      if (ccd->qr[ccd->used - 1].q.ecx_preserves_query_byte
> +          && (ecx & 0xff) == (ccd->qr[ccd->used].r.ecx & 0xff)
> +          && ccd->qr[ccd->used].r.eax == ccd->qr[ccd->used - 1].r.eax
> +          && ccd->qr[ccd->used].r.ebx == ccd->qr[ccd->used - 1].r.ebx
> +          && ((ccd->qr[ccd->used].r.ecx & 0xffffff00)
> +              == (ccd->qr[ccd->used - 1].r.ecx & 0xffffff00))
> +          && ccd->qr[ccd->used].r.edx == ccd->qr[ccd->used - 1].r.edx)
> +        {
> +          ccd->qr[ccd->used - 1].q.ecx_last = ecx;
> +          return true;
> +        }
> +    }
> +
> +  /* Do not store zero results.  All-zero values usually mean that the
> +     subleaf is unsupported.  */
> +  if (ccd->qr[ccd->used].r.eax == 0
> +      && ccd->qr[ccd->used].r.ebx == 0
> +      && ccd->qr[ccd->used].r.ecx == 0
> +      && ccd->qr[ccd->used].r.edx == 0)
> +    return true;
> +
> +  /* The result needs to be stored.  Fill in the query parameters and
> +     consume the storage.  */
> +  ccd->qr[ccd->used].q.eax = eax;
> +  ccd->qr[ccd->used].q.ecx_first = ecx;
> +  ccd->qr[ccd->used].q.ecx_last = ecx;
> +  ccd->qr[ccd->used].q.ecx_preserves_query_byte
> +    = (ecx & 0xff) == (ccd->qr[ccd->used].r.ecx & 0xff);
> +  ++ccd->used;
> +  return true;
> +}
> +
> +/* Collect CPUID data into *CCD.  If LIMIT, apply per-leaf limits to
> +   avoid exceeding the pre-allocated space.  Return true if all data
> +   could be stored, false if retrying with the limit applied is
> +   requested.  */
> +static bool
> +_dl_diagnostics_cpuid_collect_1 (struct cpuid_collected_data *ccd, bool limit)
> +{
> +  ccd->used = 0;
> +  ccd->ecx_limit
> +    = (limit ? cpuid_subleaf_limit : cpuid_initial_subleaf_limit) - 1;
> +  _dl_diagnostics_cpuid_store (ccd, 0x00, 0x00);
> +  if (ccd->used == 0)
> +    /* CPUID reported all 0.  Should not happen.
> +       */
> +    return true;
> +  unsigned int maximum_leaf = ccd->qr[0x00].r.eax;
> +  if (limit && maximum_leaf >= cpuid_main_leaf_limit)
> +    maximum_leaf = cpuid_main_leaf_limit - 1;
> +
> +  for (unsigned int eax = 1; eax <= maximum_leaf; ++eax)
> +    {
> +      for (unsigned int ecx = 0; ecx <= ccd->ecx_limit; ++ecx)
> +        if (!_dl_diagnostics_cpuid_store (ccd, eax, ecx))
> +          return false;
> +    }
> +
> +  if (!_dl_diagnostics_cpuid_store (ccd, cpuid_extended_leaf_offset, 0x00))
> +    return false;
> +  maximum_leaf = ccd->qr[ccd->used - 1].r.eax;
> +  if (maximum_leaf < cpuid_extended_leaf_offset)
> +    /* No extended CPUID information.  */
> +    return true;
> +  if (limit
> +      && maximum_leaf - cpuid_extended_leaf_offset >= cpuid_main_leaf_limit)
> +    maximum_leaf = cpuid_extended_leaf_offset + cpuid_main_leaf_limit - 1;
> +  for (unsigned int eax = cpuid_extended_leaf_offset + 1;
> +       eax <= maximum_leaf; ++eax)
> +    {
> +      for (unsigned int ecx = 0; ecx <= ccd->ecx_limit; ++ecx)
> +        if (!_dl_diagnostics_cpuid_store (ccd, eax, ecx))
> +          return false;
> +    }
> +  return true;
> +}
> +
> +/* Call _dl_diagnostics_cpuid_collect_1 twice if necessary, the
> +   second time with the limit applied.  */
> +static void
> +_dl_diagnostics_cpuid_collect (struct cpuid_collected_data *ccd)
> +{
> +  if (!_dl_diagnostics_cpuid_collect_1 (ccd, false))
> +    _dl_diagnostics_cpuid_collect_1 (ccd, true);
> +
> +  /* Re-use the result of the official feature probing here.  */
> +  const struct cpu_features *cpu_features = __get_cpu_features ();
> +  if (CPU_FEATURES_CPU_P (cpu_features, OSXSAVE))
> +    {
> +      unsigned int xcrlow;
> +      unsigned int xcrhigh;
> +      asm ("xgetbv" : "=a" (xcrlow), "=d" (xcrhigh) : "c" (0));
> +      ccd->xgetbv_ecx_0 = ((uint64_t) xcrhigh << 32) + xcrlow;
> +    }
> +  else
> +    ccd->xgetbv_ecx_0 = 0;
> +}
> +
> +/* Print a CPUID register value (passed as REG_VALUE) if it differs
> +   from the expected REG_REFERENCE value.
> +   PROCESSOR_INDEX is the process sequence number (always starting
> +   at zero; not a kernel ID).  */
> +static void
> +_dl_diagnostics_cpuid_print_reg (unsigned int processor_index,
> +                                 const struct cpuid_query *q,
> +                                 const char *reg_label, unsigned int reg_value,
> +                                 bool subleaf)
> +{
> +  if (subleaf)
> +    _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
> +                ".ecx[0x%x].%s=0x%x\n",
> +                processor_index, q->eax, q->ecx_first, reg_label, reg_value);
> +  else
> +    _dl_printf ("x86.processor[0x%x].cpuid.eax[0x%x].%s=0x%x\n",
> +                processor_index, q->eax, reg_label, reg_value);
> +}
> +
> +/* Print CPUID result values in *RESULT for the query in
> +   CCD->qr[CCD_IDX].  PROCESSOR_INDEX is the process sequence number
> +   (always starting at zero; not a kernel ID).  */
> +static void
> +_dl_diagnostics_cpuid_print_query (unsigned int processor_index,
> +                                   struct cpuid_collected_data *ccd,
> +                                   unsigned int ccd_idx,
> +                                   const struct cpuid_registers *result)
> +{
> +  /* Treat this as a subleaf if ecx isn't zero (maybe
> +     within the [ecx_first, ecx_last] range), or if eax matches its
> +     neighbors.  If the range is [0, ecx_limit], then the subleaves
> +     are not distinct (independently of ecx_preserves_query_byte),
> +     so do not report them separately.
> +     */
> +  struct cpuid_query *q = &ccd->qr[ccd_idx].q;
> +  bool subleaf = (q->ecx_first > 0
> +                  || (q->ecx_first != q->ecx_last
> +                      && !(q->ecx_first == 0 && q->ecx_last == ccd->ecx_limit))
> +                  || (ccd_idx > 0 && q->eax == ccd->qr[ccd_idx - 1].q.eax)
> +                  || (ccd_idx + 1 < ccd->used
> +                      && q->eax == ccd->qr[ccd_idx + 1].q.eax));
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "eax", result->eax,
> +                                   subleaf);
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "ebx", result->ebx,
> +                                   subleaf);
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "ecx", result->ecx,
> +                                   subleaf);
> +  _dl_diagnostics_cpuid_print_reg (processor_index, q, "edx", result->edx,
> +                                   subleaf);
> +
> +  if (subleaf && q->ecx_first != q->ecx_last)
> +    {
> +      _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
> +                  ".ecx[0x%x].until_ecx=0x%x\n",
> +                  processor_index, q->eax, q->ecx_first, q->ecx_last);
> +      if (q->ecx_preserves_query_byte)
> +        _dl_printf ("x86.processor[0x%x].cpuid.subleaf_eax[0x%x]"
> +                    ".ecx[0x%x].ecx_query_mask=0xff\n",
> +                    processor_index, q->eax, q->ecx_first);
> +    }
> +}
> +
> +/* Perform differential reporting of the data in *CURRENT against
> +   *BASE.  REQUESTED_CPU is the kernel CPU ID the thread was
> +   configured to run on, or -1 if no configuration was possible.
> +   PROCESSOR_INDEX is the process sequence number (always starting at
> +   zero; not a kernel ID).  */
> +static void
> +_dl_diagnostics_cpuid_report (unsigned int processor_index, int requested_cpu,
> +                              struct cpuid_collected_data *current,
> +                              struct cpuid_collected_data *base)
> +{
> +  if (requested_cpu >= 0)
> +    _dl_printf ("x86.processor[0x%x].requested=0x%x\n",
> +                processor_index, requested_cpu);
> +
> +  /* Despite CPU pinning, the requested CPU number may be different
> +     from the one we are running on.  Some container hosts behave this
> +     way.
> +     */
> +  {
> +    unsigned int cpu_number;
> +    unsigned int node_number;
> +    if (INTERNAL_SYSCALL_CALL (getcpu, &cpu_number, &node_number) >= 0)
> +      {
> +        if (cpu_number != requested_cpu)
> +          _dl_printf ("x86.processor[0x%x].observed=0x%x\n",
> +                      processor_index, cpu_number);
> +        _dl_printf ("x86.processor[0x%x].observed_node=0x%x\n",
> +                    processor_index, node_number);
> +      }
> +  }
> +
> +  _dl_printf ("x86.processor[0x%x].cpuid_leaves=0x%x\n",
> +              processor_index, current->used);
> +  _dl_printf ("x86.processor[0x%x].ecx_limit=0x%x\n",
> +              processor_index, current->ecx_limit);
> +
> +  unsigned int base_idx = 0;
> +  for (unsigned int current_idx = 0; current_idx < current->used;
> +       ++current_idx)
> +    {
> +      /* Report missing data on the current CPU as 0.  */
> +      unsigned long long int current_query
> +        = cpuid_query_combined (&current->qr[current_idx].q);
> +      while (base_idx < base->used
> +             && cpuid_query_combined (&base->qr[base_idx].q) < current_query)
> +        {
> +          _dl_diagnostics_cpuid_print_query (processor_index, base, base_idx,
> +                                             &cpuid_registers_zero);
> +          ++base_idx;
> +        }
> +
> +      if (base_idx < base->used
> +          && cpuid_query_combined (&base->qr[base_idx].q) == current_query)
> +        {
> +          _Static_assert (sizeof (struct cpuid_registers) == 4 * 4,
> +                          "no padding in struct cpuid_registers");
> +          if (current->qr[current_idx].q.ecx_last
> +              != base->qr[base_idx].q.ecx_last
> +              || memcmp (&current->qr[current_idx].r,
> +                         &base->qr[base_idx].r,
> +                         sizeof (struct cpuid_registers)) != 0)
> +            /* The ECX range or the values have changed.  Show the
> +               new values.  */
> +            _dl_diagnostics_cpuid_print_query (processor_index,
> +                                               current, current_idx,
> +                                               &current->qr[current_idx].r);
> +          ++base_idx;
> +        }
> +      else
> +        /* Data is absent in the base reference.  Report the new data.  */
> +        _dl_diagnostics_cpuid_print_query (processor_index,
> +                                           current, current_idx,
> +                                           &current->qr[current_idx].r);
> +    }
> +
> +  if (current->xgetbv_ecx_0 != base->xgetbv_ecx_0)
> +    {
> +      /* Re-use the 64-bit printing routine.
> +         */
> +      _dl_printf ("x86.processor[0x%x].", processor_index);
> +      _dl_diagnostics_print_labeled_value ("xgetbv.ecx[0x0]",
> +                                           current->xgetbv_ecx_0);
> +    }
> +}
> +
> +void
> +_dl_diagnostics_cpu_kernel (void)
> +{
> +#if !HAS_CPUID
> +  /* CPUID is not supported, so there is nothing to dump.  */
> +  if (__get_cpuid_max (0, 0) == 0)
> +    return;
> +#endif

I think we don't support __i486__ anymore, so we can just assume
HAS_CPUID at sysdeps/x86/include/cpu-features.h.

> +
> +  /* The number of processors reported so far.  Note that this is a
> +     count, not a kernel CPU number.  */
> +  unsigned int processor_index = 0;
> +
> +  /* Two copies of the data are used.  Data is written to the index
> +     (processor_index & 1).  The previous version against which the
> +     data dump is reported is at index !(processor_index & 1).  */
> +  struct cpuid_collected_data ccd[2];
> +
> +  /* The initial data is presumed to be all zero.  Zero results are
> +     not recorded.  */
> +  ccd[1].used = 0;
> +  ccd[1].xgetbv_ecx_0 = 0;
> +
> +  /* Run the CPUID probing on a specific CPU.  There are expected
> +     differences for encoding core IDs and topology information in
> +     CPUID output, but some firmware/kernel bugs also may result in
> +     asymmetric data across CPUs in some cases.
> +
> +     The CPU mask arrays are large enough for 4096 or 8192 CPUs, which
> +     should give ample space for future expansion.  */
> +  unsigned long int mask_reference[1024];
> +  int length_reference
> +    = INTERNAL_SYSCALL_CALL (sched_getaffinity, 0,
> +                             sizeof (mask_reference), mask_reference);
> +
> +  /* A parallel bit mask that is used below to request running on a
> +     specific CPU.  */
> +  unsigned long int mask_request[array_length (mask_reference)];
> +
> +  if (length_reference >= sizeof (long))
> +    {
> +      /* The kernel is supposed to return a multiple of the word size.  */
> +      length_reference /= sizeof (long);
> +
> +      for (unsigned int i = 0; i < length_reference; ++i)
> +        {

Why not use the interfaces to work on cpuset?
  if (length_reference > 0)
    {
      CPU_ZERO_S (length_reference, mask_request);
      int cpu_count = CPU_COUNT_S (length_reference, mask_reference);
      for (int i = 0; cpu_count > 0; i++)
        {
          if (CPU_ISSET_S (i, length_reference, mask_reference))
            {
              cpu_count--;
              CPU_SET_S (i, length_reference, mask_request);
              if (INTERNAL_SYSCALL_CALL (sched_setaffinity, 0,
                                         length_reference,
                                         mask_request) == 0)
                {
                  _dl_diagnostics_cpuid_collect (&ccd[processor_index & 1]);
                  _dl_diagnostics_cpuid_report (processor_index, i,
                                                &ccd[processor_index & 1],
                                                &ccd[!(processor_index & 1)]);
                  ++processor_index;
                }
              CPU_CLR_S (i, length_reference, mask_request);
            }
        }
    }

It will iterate over the list twice, but I don't think this would
really matter here.

> +          /* Iterate over the bits in mask_request[i] and process
> +             those that are set; j is the bit index, bitmask is the
> +             derived mask for the bit at this index.  */
> +          unsigned int j = 0;
> +          for (unsigned long int bitmask = 1; bitmask != 0; bitmask <<= 1, ++j)
> +            {
> +              mask_request[i] = mask_reference[i] & bitmask;
> +              if (mask_request[i])
> +                {
> +                  unsigned int mask_array_length
> +                    = (i + 1) * sizeof (unsigned long int);
> +                  if (INTERNAL_SYSCALL_CALL (sched_setaffinity, 0,
> +                                             mask_array_length,
> +                                             mask_request) == 0)
> +                    {
> +                      /* This is the CPU ID number used by the
> +                         kernel.  It should match the first result
> +                         from getcpu.  */
> +                      int requested_cpu = i * ULONG_WIDTH + j;
> +                      _dl_diagnostics_cpuid_collect
> +                        (&ccd[processor_index & 1]);
> +                      _dl_diagnostics_cpuid_report
> +                        (processor_index, requested_cpu,
> +                         &ccd[processor_index & 1],
> +                         &ccd[!(processor_index & 1)]);
> +                      ++processor_index;
> +                    }
> +                }
> +            }
> +          /* Reset the mask word, so that the mask always has
> +             population count one.  */
> +          mask_request[i] = 0;
> +        }
> +    }
> +
> +  /* Fallback if we could not deliberately select a CPU.  */
> +  if (processor_index == 0)
> +    {
> +      _dl_diagnostics_cpuid_collect (&ccd[0]);
> +      _dl_diagnostics_cpuid_report (processor_index, -1, &ccd[0], &ccd[1]);
> +    }
> +}