From mboxrd@z Thu Jan  1 00:00:00 1970
From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
To: libc-alpha@sourceware.org,
	Wilco Dijkstra <Wilco.Dijkstra@arm.com>,
	"H . J . Lu" <hjl.tools@gmail.com>
Subject: [PATCH v4 0/5] Improve fmod and fmodf
Date: Mon, 20 Mar 2023 10:47:52 -0300
Message-Id: <20230320134757.340756-1-adhemerval.zanella@linaro.org>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id: <libc-alpha.sourceware.org>

This is an updated version of a previous submission by Kirill Okhotnikov
[1] that aimed to improve the fmod implementation.  I extended it with:

  1. Proper benchmarks for both single and double precision.  The
     inputs are divided into 3 subsets: subnormals, normal numbers, and
     close exponents.  Each subset is a list of randomly generated
     values.

  2. Use math_config.h definitions instead of math_private.h (so it
     might eventually be contributed back to optimized-routines).

  3. Implement the same strategy for the float version.

  4. Tune the final division to use multiplication by the inverse
     instead of a direct modulo.  It showed better performance on both
     the x86_64 and aarch64 chips I have tested.

  5. Remove the SVID error handling wrapper.
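
As an illustration of item 4, the sketch below reduces (m * 2^k) mod d
a few bits at a time, once with a direct modulo per step and once with a
quotient estimate obtained by multiplying by a precomputed inverse.  It
is not the code from the patch: the function names, the 11-bit step
width, and the bound d < 2^53 are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bits consumed per step.  With m < d < 2^53, (m << 11) still fits in
   64 bits.  The value 11 is an assumption chosen for this sketch.  */
#define STEP_BITS 11

/* Reference version: (m * 2^k) mod d with one hardware modulo per step.  */
static uint64_t
shift_mod_direct (uint64_t m, uint64_t d, unsigned int k)
{
  assert (d != 0 && d < (UINT64_C (1) << 53) && m < d);
  while (k > 0)
    {
      unsigned int s = k < STEP_BITS ? k : STEP_BITS;
      m = (m << s) % d;
      k -= s;
    }
  return m;
}

/* Same result, but only one division is executed (to compute INV); each
   step then estimates the quotient with a multiplication and a shift.  */
static uint64_t
shift_mod_inverse (uint64_t m, uint64_t d, unsigned int k)
{
  assert (d != 0 && d < (UINT64_C (1) << 53) && m < d);
  uint64_t inv = UINT64_MAX / d;	/* floor ((2^64 - 1) / d)  */
  while (k > 0)
    {
      unsigned int s = k < STEP_BITS ? k : STEP_BITS;
      /* m * inv < 2^64 because m < d, so the product cannot overflow.
	 The estimate never exceeds the true quotient of (m << s) / d.  */
      uint64_t q = (m * inv) >> (64 - s);
      m = (m << s) - q * d;
      /* Correct the remainder when the estimate was too small.  */
      while (m >= d)
	m -= d;
      k -= s;
    }
  return m;
}
```

Both functions require m < d, which keeps every intermediate value
within 64 bits; the correction loop makes the inverse version return the
same remainder as the direct one.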

The performance shows a good improvement over the current fmod
algorithm (using gcc 11):

  Architecture     | Input           | master   | patch
  -----------------|-----------------|----------|--------
  x86_64 (Ryzen 9) | subnormals      | 19.1584  | 9.40992
  x86_64 (Ryzen 9) | normal          | 1016.51  | 296.738
  x86_64 (Ryzen 9) | close-exponents | 18.4428  | 13.119
  aarch64 (N1)     | subnormals      | 11.153   | 4.33313
  aarch64 (N1)     | normal          | 528.649  | 158.339
  aarch64 (N1)     | close-exponents | 11.4517  | 5.76138
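
One way to read the gap between the "normal" and "close-exponents" rows
is that a shift-and-reduce fmod does work proportional to the exponent
difference of its arguments.  The sketch below counts how many
fixed-width reduction steps a pair of inputs would need.  It assumes
IEEE 754 binary64 doubles; the helper names and the bits_per_step
parameter are hypothetical and are not taken from the patch or from the
benchmark inputs.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as a 64-bit integer.  */
static uint64_t
double_bits (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

/* Biased exponent field of a binary64 value (0 for zero and
   subnormals, 1023 for values in [1, 2)).  */
static int
biased_exponent (double x)
{
  return (int) ((double_bits (x) >> 52) & 0x7ff);
}

/* Number of steps needed to consume the exponent difference between X
   and Y when each step handles BITS_PER_STEP bits.  Returns 0 when the
   exponent of X does not exceed the exponent of Y.  */
static int
reduction_steps (double x, double y, int bits_per_step)
{
  int gap = biased_exponent (x) - biased_exponent (y);
  if (gap <= 0)
    return 0;
  return (gap + bits_per_step - 1) / bits_per_step;
}
```

With 11 bits per step, for example, inputs whose exponents differ by
1000 need 91 steps, while inputs with close exponents need at most one.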

I also see improvements on arm-linux-gnueabihf for the subnormal and
normal inputs when running on the N1 aarch64 chips, where it uses many
soft-fp routines (for modulo, clz, ctz, and multiplication).  The
close-exponents case is slower there:

  Architecture     | Input           | master   | patch
  -----------------|-----------------|----------|--------
  armhf (N1)       | subnormals      | 15.7284  | 14.5746
  armhf (N1)       | normal          | 837.525  | 241.738
  armhf (N1)       | close-exponents | 16.2111  | 22.457


The fmodf shows a more moderate improvement:

  Architecture     | Input           | master   | patch
  -----------------|-----------------|----------|--------
  x86_64 (Ryzen 9) | subnormals      | 17.2549  | 9.35776
  x86_64 (Ryzen 9) | normal          | 85.4096  | 46.2761
  x86_64 (Ryzen 9) | close-exponents | 19.1072  | 12.6199
  aarch64 (N1)     | subnormals      | 10.2182  | 4.39188
  aarch64 (N1)     | normal          | 60.0616  | 18.3888
  aarch64 (N1)     | close-exponents | 11.5256  | 5.93518
  armhf (N1)       | subnormals      | 11.6662  | 7.75977
  armhf (N1)       | normal          | 69.2759  | 31.623
  armhf (N1)       | close-exponents | 13.6472  | 15.6689


I also checked against H.J.'s proposal to use fprem on x86_64 [2] and
against a recent suggestion on libc-alpha [3]; in both cases this newer
implementation shows better performance.

Changes from v3:
 * New tests cover more floating-point types.

Changes from v2:
 * Bug fixes and testsuite improvements.

Changes from v1:
 * Remove the SVID error handling wrapper.
 * Extend testing for subnormals with different signs.
 * Code cleanup.

Adhemerval Zanella (5):
  benchtests: Add fmod benchmark
  benchtests: Add fmodf benchmark
  math: Improve fmod
  math: Improve fmodf
  math: Remove the error handling wrapper from fmod and fmodf

 benchtests/Makefile                           |    2 +
 benchtests/fmod-inputs                        | 2182 +++++++++++++++++
 benchtests/fmodf-inputs                       | 2182 +++++++++++++++++
 math/Versions                                 |    4 +
 math/libm-test-fmod.inc                       |   18 +
 math/w_fmod_compat.c                          |   13 +-
 math/w_fmodf_compat.c                         |    6 +-
 sysdeps/i386/fpu/w_fmod_compat.c              |   14 +
 sysdeps/i386/fpu/w_fmodf_compat.c             |   14 +
 sysdeps/ieee754/dbl-64/e_fmod.c               |  248 +-
 sysdeps/ieee754/dbl-64/math_config.h          |   70 +
 sysdeps/ieee754/dbl-64/math_err.c             |    6 +
 sysdeps/ieee754/dbl-64/w_fmod.c               |    1 +
 sysdeps/ieee754/flt-32/e_fmodf.c              |  244 +-
 sysdeps/ieee754/flt-32/math_config.h          |   48 +
 sysdeps/ieee754/flt-32/math_errf.c            |    6 +
 sysdeps/ieee754/flt-32/w_fmodf.c              |    1 +
 sysdeps/m68k/m680x0/fpu/w_fmod_compat.c       |   14 +
 sysdeps/m68k/m680x0/fpu/w_fmodf_compat.c      |   14 +
 sysdeps/unix/sysv/linux/aarch64/libm.abilist  |    2 +
 sysdeps/unix/sysv/linux/alpha/libm.abilist    |    2 +
 sysdeps/unix/sysv/linux/arm/be/libm.abilist   |    2 +
 sysdeps/unix/sysv/linux/arm/le/libm.abilist   |    2 +
 sysdeps/unix/sysv/linux/hppa/libm.abilist     |    2 +
 .../sysv/linux/m68k/coldfire/libm.abilist     |    2 +
 .../sysv/linux/microblaze/be/libm.abilist     |    2 +
 .../sysv/linux/microblaze/le/libm.abilist     |    2 +
 .../unix/sysv/linux/mips/mips32/libm.abilist  |    2 +
 .../unix/sysv/linux/mips/mips64/libm.abilist  |    2 +
 sysdeps/unix/sysv/linux/nios2/libm.abilist    |    2 +
 .../linux/powerpc/powerpc32/fpu/libm.abilist  |    2 +
 .../powerpc/powerpc32/nofpu/libm.abilist      |    2 +
 .../linux/powerpc/powerpc64/be/libm.abilist   |    2 +
 .../linux/powerpc/powerpc64/le/libm.abilist   |    2 +
 .../unix/sysv/linux/s390/s390-32/libm.abilist |    2 +
 .../unix/sysv/linux/s390/s390-64/libm.abilist |    2 +
 sysdeps/unix/sysv/linux/sh/be/libm.abilist    |    2 +
 sysdeps/unix/sysv/linux/sh/le/libm.abilist    |    2 +
 .../sysv/linux/sparc/sparc32/libm.abilist     |    2 +
 .../sysv/linux/sparc/sparc64/libm.abilist     |    2 +
 .../unix/sysv/linux/x86_64/64/libm.abilist    |    2 +
 .../unix/sysv/linux/x86_64/x32/libm.abilist   |    2 +
 42 files changed, 4936 insertions(+), 197 deletions(-)
 create mode 100644 benchtests/fmod-inputs
 create mode 100644 benchtests/fmodf-inputs
 create mode 100644 sysdeps/i386/fpu/w_fmod_compat.c
 create mode 100644 sysdeps/i386/fpu/w_fmodf_compat.c
 create mode 100644 sysdeps/ieee754/dbl-64/w_fmod.c
 create mode 100644 sysdeps/ieee754/flt-32/w_fmodf.c
 create mode 100644 sysdeps/m68k/m680x0/fpu/w_fmod_compat.c
 create mode 100644 sysdeps/m68k/m680x0/fpu/w_fmodf_compat.c

-- 
2.34.1