public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16
@ 2023-05-28 19:15 hubicka at gcc dot gnu.org
  2023-05-28 19:42 ` [Bug middle-end/110015] " hubicka at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-28 19:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

            Bug ID: 110015
           Summary: openjpeg is slower when built with gcc13 compared to
                    clang16
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

I tried to reproduce openjpeg benchmarks from Phoronix
https://www.phoronix.com/review/gcc13-clang16-raptorlake/5

On zen3 hardware I get 42607ms for the clang build and 45702ms for the gcc
build, which is a 7% difference (Phoronix reports 10% on RaptorLake).

perf of clang build:
  88.64%  opj_t1_cblk_encode_processor
   6.68%  opj_dwt_encode_and_deinterleave_v
   1.30%  opj_dwt_encode_and_deinterleave_h_one_row

opj_t1_cblk_encode_processor is huge with no obvious hot spots.

perf of gcc build:

  70.36% opj_t1_cblk_encode_processor
  16.12% opj_t1_enc_refpass.lto_priv.0
   3.88% opj_dwt_encode_and_deinterleave_v
   2.46% opj_dwt_fetch_cols_vertical_pass
   2.35% opj_mqc_byteout

So we apparently inline less, even at -O3.


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
@ 2023-05-28 19:42 ` hubicka at gcc dot gnu.org
  2023-10-31 12:08 ` zhangjungcc at gmail dot com
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-28 19:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
opj_t1_enc_refpass is not inlined due to large function growth and some others
due to max-inline-insns-auto.  With inlining forced I get profile:

  87.35%   opj_t1_cblk_encode_processor
   6.22%  opj_dwt_encode_and_deinterleave_v.lto_priv.0
   1.80%  opj_mqc_byteout
   1.50%  opj_dwt_encode_and_deinterleave_h_one_row.lto_priv.0

So pretty much the same profile as for clang. However, the runtime is still 45573ms with
-O3 -flto -march=native -fno-semantic-interposition --param
large-function-insns=1000000  --param max-inline-insns-auto=50000

So it does not seem to be missing IPA optimizations.

There are a number of conditional moves in the clang code; -mbranch-cost helps
a bit, but not enough.


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
  2023-05-28 19:42 ` [Bug middle-end/110015] " hubicka at gcc dot gnu.org
@ 2023-10-31 12:08 ` zhangjungcc at gmail dot com
  2023-11-01  1:15 ` crazylht at gmail dot com
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: zhangjungcc at gmail dot com @ 2023-10-31 12:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

jun zhang <zhangjungcc at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |zhangjungcc at gmail dot com

--- Comment #2 from jun zhang <zhangjungcc at gmail dot com> ---
The following loop is not vectorized by gcc but is vectorized by llvm, which
gives a 3% improvement.
For more information, see https://godbolt.org/z/zMbjq41h5

#include<string.h>
typedef signed int  OPJ_INT32;
typedef unsigned int OPJ_UINT32;
typedef int OPJ_BOOL;
#define OPJ_TRUE 1
#define OPJ_FALSE 0
typedef char          OPJ_CHAR;
typedef float         OPJ_FLOAT32;
typedef double        OPJ_FLOAT64;
typedef unsigned char OPJ_BYTE;
#define T1_NMSEDEC_FRACBITS 6
#define OPJ_RESTRICT restrict
#define OPJ_TLS_KEY_T1  0
#include <stdio.h>
typedef size_t   OPJ_SIZE_T;

typedef struct opj_tcd_cblk_enc {
    OPJ_BYTE* data;               /* Data */
//    opj_tcd_layer_t* layers;      /* layer information */
//    opj_tcd_pass_t* passes;       /* information about the passes */
    OPJ_INT32 x0, y0, x1,
              y1;     /* dimension of the code-blocks : left upper corner (x0,
y0) right low corner (x1,y1) */
    OPJ_UINT32 numbps;
    OPJ_UINT32 numlenbits;
    OPJ_UINT32 data_size;         /* Size of allocated data buffer */
    OPJ_UINT32
    numpasses;         /* number of pass already done for the code-blocks */
    OPJ_UINT32 numpassesinlayers; /* number of passes in the layer */
    OPJ_UINT32 totalpasses;       /* total number of passes */
} opj_tcd_cblk_enc_t;
typedef struct opj_t1 {

    /** MQC component */
//    opj_mqc_t mqc;

    OPJ_INT32  *data;
    /** Flags used by decoder and encoder.
     * Such that flags[1+0] is for state of col=0,row=0..3,
       flags[1+1] for col=1, row=0..3, flags[1+flags_stride] for
col=0,row=4..7, ...
       This array avoids too much cache trashing when processing by 4 vertical
samples
       as done in the various decoding steps. */
//    opj_flag_t *flags;

    OPJ_UINT32 w;
    OPJ_UINT32 h;
    OPJ_UINT32 datasize;
    OPJ_UINT32 flagssize;
    OPJ_BOOL   encoder;

    /* Thre 3 variables below are only used by the decoder */
    /* set to TRUE in multithreaded context */
    OPJ_BOOL     mustuse_cblkdatabuffer;
    /* Temporary buffer to concatenate all chunks of a codebock */
    OPJ_BYTE    *cblkdatabuffer;
    /* Maximum size available in cblkdatabuffer */
    OPJ_UINT32   cblkdatabuffersize;
} opj_t1_t;

#define INLINE __inline__
static INLINE OPJ_INT32 opj_int_max(OPJ_INT32 a, OPJ_INT32 b)
{
    return (a > b) ? a : b;
}
#define opj_to_smr(x)   ((x) >= 0 ? (OPJ_UINT32)(x) : ((OPJ_UINT32)(-x) | 0x80000000U))
OPJ_FLOAT64 opj_t1_encode_cblk(opj_t1_t *t1,
                                      opj_tcd_cblk_enc_t* cblk,
                                      OPJ_UINT32 orient,
                                      OPJ_UINT32 compno,
                                      OPJ_UINT32 level,
                                      OPJ_UINT32 qmfbid,
                                      OPJ_FLOAT64 stepsize,
                                      OPJ_UINT32 cblksty,
                                      OPJ_UINT32 numcomps,
                                      const OPJ_FLOAT64 * mct_norms,
                                      OPJ_UINT32 mct_numcomps)
{
    OPJ_INT32 max;
    OPJ_UINT32 i, j;
    OPJ_INT32* datap;

    max = 0;
    datap = t1->data;
    for (j = 0; j < t1->h; ++j) {
        const OPJ_UINT32 w = t1->w;
        for (i = 0; i < w; ++i, ++datap) {
            OPJ_INT32 tmp = *datap;
            if (tmp < 0) {
                OPJ_UINT32 tmp_unsigned;
                max = opj_int_max(max, -tmp);
                tmp_unsigned = opj_to_smr(tmp);
                memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
            } else {
                max = opj_int_max(max, tmp);
            }
        }
    }
    cblk->numbps = max ? 6 : 0;
    return 0.0; /* return value is not used by this reduced test case */
}


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
  2023-05-28 19:42 ` [Bug middle-end/110015] " hubicka at gcc dot gnu.org
  2023-10-31 12:08 ` zhangjungcc at gmail dot com
@ 2023-11-01  1:15 ` crazylht at gmail dot com
  2023-11-01  1:28 ` crazylht at gmail dot com
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: crazylht at gmail dot com @ 2023-11-01  1:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
test.c:85:23: note:   vect_is_simple_use: operand max_38 = PHI <max_5(16), max_40(43)>, type of def: unknown
test.c:85:23: missed:   Unsupported pattern.
test.c:62:24: missed:   not vectorized: unsupported use in stmt.
test.c:85:23: missed:  unexpected pattern.
test.c:85:23: note:  ***** Analysis  failed with vector mode V8SI
test.c:85:23: note:  ***** The result for vector mode V32QI would be the same
test.c:85:23: missed: couldn't vectorize loop
test.c:65:13: note: vectorized 0 loops in function.
Removing basic block 5
;; basic block 5, loop depth 2
;;  pred:       16
;;              43
# max_38 = PHI <max_5(16), max_40(43)>
# i_42 = PHI <i_29(16), 0(43)>
# datap_44 = PHI <datap_30(16), datap_46(43)>
tmp_24 = *datap_44;
_35 = tmp_24 < 0;
_56 = (unsigned int) tmp_24;
_51 = -_56;
_1 = (int) _51;
_25 = MAX_EXPR <_1, max_38>;
_31 = _1 | -2147483648;
iftmp.0_27 = (unsigned int) _31;
.MASK_STORE (datap_44, 8B, _35, iftmp.0_27);
_26 = MAX_EXPR <tmp_24, max_38>;
max_5 = _35 ? _25 : _26;
i_29 = i_42 + 1;
datap_30 = datap_44 + 4;
if (w_22 > i_29)
  goto <bb 16>; [89.00%]
else
  goto <bb 9>; [11.00%]
;;  succ:       16

So here we have a reduction for MAX_EXPR, but there are 2 MAX_EXPRs, which
could be merged into a single MAX_EXPR <max_38, ABS_EXPR <tmp>>.

If the loop is manually changed to the form below, it can be vectorized.

    for (j = 0; j < t1->h; ++j) {
        const OPJ_UINT32 w = t1->w;
        for (i = 0; i < w; ++i, ++datap) {
            OPJ_INT32 tmp = *datap;
            if (tmp < 0)
              {
                OPJ_UINT32 tmp_unsigned;
                tmp_unsigned = opj_to_smr(tmp);
                memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
                tmp = -tmp;
              }
            max = opj_int_max(max, tmp);
        }
    }

Maybe it's related to phiopt?


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2023-11-01  1:15 ` crazylht at gmail dot com
@ 2023-11-01  1:28 ` crazylht at gmail dot com
  2023-11-07  1:12 ` pinskia at gcc dot gnu.org
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: crazylht at gmail dot com @ 2023-11-01  1:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
> So here we have a reduction for MAX_EXPR, but there's 2 MAX_EXPR which can
> be merge together with MAX_EXPR <max_38, ABS_EXPR <tmp>>
> 
Filed PR 112324 for this.


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2023-11-01  1:28 ` crazylht at gmail dot com
@ 2023-11-07  1:12 ` pinskia at gcc dot gnu.org
  2023-11-07  2:14 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-07  1:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
After fixing PR 112324 (and a secondary patch to phiopt to do
factor_out_conditional_operation for all phi nodes rather than just a single
one) we still miss the abs detection:

  _34 = tmp_24 < 0;
  _55 = (unsigned int) tmp_24;
  _56 = -_55;
  _1 = (intD.6) _56;
  _30 = _1 | -2147483648;
  iftmp.0_26 = (unsigned intD.9) _30;
  # .MEM_27 = VDEF <.MEM_46>
  # USE = anything
  # CLB = anything
  .MASK_STORE (datap_43, 8B, _34, iftmp.0_26);
  # RANGE [irange] int [0, +INF]
  _25 = _34 ? _1 : tmp_24;

basically

`a < 0 ? (int)-(unsigned)a : a` needs to be detected as `(int)ABSU_EXPR<a>`.


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2023-11-07  1:12 ` pinskia at gcc dot gnu.org
@ 2023-11-07  2:14 ` pinskia at gcc dot gnu.org
  2023-11-07  2:14 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-07  2:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |112416

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> 
> `a < 0 ? (int)-(unsigned)a : a` needs to be detected as `(int)ABSU_EXPR<a>`.

Filed PR 112416 for that.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112416
[Bug 112416] absu is not detected


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-11-07  2:14 ` pinskia at gcc dot gnu.org
@ 2023-11-07  2:14 ` pinskia at gcc dot gnu.org
  2023-11-07  2:42 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-07  2:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
           Severity|normal                      |enhancement
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-11-07

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2023-11-07  2:14 ` pinskia at gcc dot gnu.org
@ 2023-11-07  2:42 ` pinskia at gcc dot gnu.org
  2023-11-07  2:50 ` pinskia at gcc dot gnu.org
  2023-11-24 23:22 ` hubicka at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-07  2:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |112418

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> After fixing PR 112324 (and a secondary patch to phiopt to do
> factor_out_conditional_operation for all phi nodes rather than just a single
> one) we still miss the abs detection:

Filed PR 112418 for the secondary patch.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112418
[Bug 112418] factor_out_conditional_operation could be done for more phis


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2023-11-07  2:42 ` pinskia at gcc dot gnu.org
@ 2023-11-07  2:50 ` pinskia at gcc dot gnu.org
  2023-11-24 23:22 ` hubicka at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-07  2:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I should note that PR 112416 is not needed to vectorize the loop, though it
would improve code.


* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
  2023-05-28 19:15 [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16 hubicka at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2023-11-07  2:50 ` pinskia at gcc dot gnu.org
@ 2023-11-24 23:22 ` hubicka at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-11-24 23:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #10 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Runtimes on zen4 hardware:

trunk -O3 -flto -march=native
        42171
        42964
        42106
clang -O3 -flto -march=native
        37393
        37423
        37508
gcc 13 -O3 -flto -march=native
        42380
        42314
        43285

So it seems the performance did not change.

