public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/110015] New: openjpeg is slower when built with gcc13 compared to clang16
From: hubicka at gcc dot gnu.org @ 2023-05-28 19:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
Bug ID: 110015
Summary: openjpeg is slower when built with gcc13 compared to
clang16
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
I tried to reproduce the openjpeg benchmarks from Phoronix:
https://www.phoronix.com/review/gcc13-clang16-raptorlake/5
On zen3 hardware I get 42607ms for the clang build and 45702ms for the gcc
build, which is a 7% difference (Phoronix reports 10% on RaptorLake).
perf of clang build:
88.64% opj_t1_cblk_encode_processor
6.68% opj_dwt_encode_and_deinterleave_v
1.30% opj_dwt_encode_and_deinterleave_h_one_row
opj_t1_cblk_encode_processor is huge with no obvious hot spots.
perf of gcc build:
70.36% opj_t1_cblk_encode_processor
16.12% opj_t1_enc_refpass.lto_priv.0
3.88% opj_dwt_encode_and_deinterleave_v
2.46% opj_dwt_fetch_cols_vertical_pass
2.35% opj_mqc_byteout
So GCC apparently inlines less than clang, even at -O3.
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: hubicka at gcc dot gnu.org @ 2023-05-28 19:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
opj_t1_enc_refpass is not inlined because of the large function growth limit,
and some other functions are not inlined because of max-inline-insns-auto.
With inlining forced I get this profile:
87.35% opj_t1_cblk_encode_processor
6.22% opj_dwt_encode_and_deinterleave_v.lto_priv.0
1.80% opj_mqc_byteout
1.50% opj_dwt_encode_and_deinterleave_h_one_row.lto_priv.0
So this is pretty much the same profile as for clang. However, the runtime is
still 45573ms with
-O3 -flto -march=native -fno-semantic-interposition --param
large-function-insns=1000000 --param max-inline-insns-auto=50000
So the difference does not seem to come from missing IPA optimizations.
There are a number of conditional moves in the clang code; -mbranch-cost helps
a bit, but not enough.
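For experiments like the one above, inlining can also be forced for a single helper instead of raising the limits globally. A minimal sketch, assuming GCC or clang; refpass_step and encode_block are hypothetical stand-ins for illustration, not openjpeg functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a hot helper such as opj_t1_enc_refpass.
   always_inline asks GCC/clang to inline it even when the automatic
   inliner limits would otherwise reject it. */
static inline __attribute__((always_inline))
uint32_t refpass_step(uint32_t acc, int32_t sample)
{
    /* Magnitude computed in unsigned arithmetic to avoid signed overflow. */
    uint32_t mag = sample < 0 ? 0u - (uint32_t)sample : (uint32_t)sample;
    return acc ^ (mag & 1u);
}

/* Hypothetical caller standing in for opj_t1_cblk_encode_processor. */
uint32_t encode_block(const int32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc = refpass_step(acc, data[i]);
    return acc;
}
```

This only changes where the code ends up; as the numbers above suggest, matching clang's inlining decisions does not by itself close the runtime gap.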
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: zhangjungcc at gmail dot com @ 2023-10-31 12:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
jun zhang <zhangjungcc at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |zhangjungcc at gmail dot com
--- Comment #2 from jun zhang <zhangjungcc at gmail dot com> ---
GCC cannot vectorize the following loop, but LLVM can. Vectorizing it gives a
3% improvement.
For more information, see https://godbolt.org/z/zMbjq41h5
#include <string.h>
typedef signed int OPJ_INT32;
typedef unsigned int OPJ_UINT32;
typedef int OPJ_BOOL;
#define OPJ_TRUE 1
#define OPJ_FALSE 0
typedef char OPJ_CHAR;
typedef float OPJ_FLOAT32;
typedef double OPJ_FLOAT64;
typedef unsigned char OPJ_BYTE;
#define T1_NMSEDEC_FRACBITS 6
#define OPJ_RESTRICT restrict
#define OPJ_TLS_KEY_T1 0
#include <stdio.h>
typedef size_t OPJ_SIZE_T;
typedef struct opj_tcd_cblk_enc {
    OPJ_BYTE* data; /* Data */
    // opj_tcd_layer_t* layers; /* layer information */
    // opj_tcd_pass_t* passes; /* information about the passes */
    OPJ_INT32 x0, y0, x1, y1; /* dimensions of the code-block: upper-left
                                 corner (x0, y0), lower-right corner (x1, y1) */
    OPJ_UINT32 numbps;
    OPJ_UINT32 numlenbits;
    OPJ_UINT32 data_size; /* Size of allocated data buffer */
    OPJ_UINT32 numpasses; /* number of passes already done for the code-block */
    OPJ_UINT32 numpassesinlayers; /* number of passes in the layer */
    OPJ_UINT32 totalpasses; /* total number of passes */
} opj_tcd_cblk_enc_t;
typedef struct opj_t1 {
    /** MQC component */
    // opj_mqc_t mqc;
    OPJ_INT32 *data;
    /** Flags used by decoder and encoder.
     * Such that flags[1+0] is for state of col=0,row=0..3,
     * flags[1+1] for col=1, row=0..3, flags[1+flags_stride] for
     * col=0,row=4..7, ...
     * This array avoids too much cache thrashing when processing by 4
     * vertical samples as done in the various decoding steps. */
    // opj_flag_t *flags;
    OPJ_UINT32 w;
    OPJ_UINT32 h;
    OPJ_UINT32 datasize;
    OPJ_UINT32 flagssize;
    OPJ_BOOL encoder;
    /* The 3 variables below are only used by the decoder */
    /* set to TRUE in multithreaded context */
    OPJ_BOOL mustuse_cblkdatabuffer;
    /* Temporary buffer to concatenate all chunks of a codeblock */
    OPJ_BYTE *cblkdatabuffer;
    /* Maximum size available in cblkdatabuffer */
    OPJ_UINT32 cblkdatabuffersize;
} opj_t1_t;
#define INLINE __inline__
static INLINE OPJ_INT32 opj_int_max(OPJ_INT32 a, OPJ_INT32 b)
{
    return (a > b) ? a : b;
}
#define opj_to_smr(x) ((x) >= 0 ? (OPJ_UINT32)(x) : ((OPJ_UINT32)(-x) | 0x80000000U))
OPJ_FLOAT64 opj_t1_encode_cblk(opj_t1_t *t1,
                               opj_tcd_cblk_enc_t* cblk,
                               OPJ_UINT32 orient,
                               OPJ_UINT32 compno,
                               OPJ_UINT32 level,
                               OPJ_UINT32 qmfbid,
                               OPJ_FLOAT64 stepsize,
                               OPJ_UINT32 cblksty,
                               OPJ_UINT32 numcomps,
                               const OPJ_FLOAT64 * mct_norms,
                               OPJ_UINT32 mct_numcomps)
{
    OPJ_INT32 max;
    OPJ_UINT32 i, j;
    OPJ_INT32* datap;
    max = 0;
    datap = t1->data;
    /* Track the largest magnitude and rewrite negative samples in
       sign-magnitude form. */
    for (j = 0; j < t1->h; ++j) {
        const OPJ_UINT32 w = t1->w;
        for (i = 0; i < w; ++i, ++datap) {
            OPJ_INT32 tmp = *datap;
            if (tmp < 0) {
                OPJ_UINT32 tmp_unsigned;
                max = opj_int_max(max, -tmp);
                tmp_unsigned = opj_to_smr(tmp);
                memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
            } else {
                max = opj_int_max(max, tmp);
            }
        }
    }
    cblk->numbps = max ? 6 : 0;
}
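To make the store in the loop concrete: opj_to_smr converts a two's-complement value to sign-magnitude form, with the sign kept in bit 31. A small sketch using a hypothetical function-style equivalent, to_smr (not part of openjpeg), which negates in unsigned arithmetic so that INT32_MIN is also well defined:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function-style equivalent of the opj_to_smr macro.
   Non-negative values pass through unchanged; negative values become
   (magnitude | sign bit), where the sign bit is bit 31. */
uint32_t to_smr(int32_t x)
{
    return x >= 0 ? (uint32_t)x : ((0u - (uint32_t)x) | 0x80000000u);
}

/* A few sample conversions. */
void to_smr_examples(void)
{
    assert(to_smr(5) == 5u);
    assert(to_smr(-5) == 0x80000005u);
    assert(to_smr(0) == 0u);
}
```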
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: crazylht at gmail dot com @ 2023-11-01 1:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
test.c:85:23: note: vect_is_simple_use: operand max_38 = PHI <max_5(16), max_40(43)>, type of def: unknown
test.c:85:23: missed: Unsupported pattern.
test.c:62:24: missed: not vectorized: unsupported use in stmt.
test.c:85:23: missed: unexpected pattern.
test.c:85:23: note: ***** Analysis failed with vector mode V8SI
test.c:85:23: note: ***** The result for vector mode V32QI would be the same
test.c:85:23: missed: couldn't vectorize loop
test.c:65:13: note: vectorized 0 loops in function.
Removing basic block 5
;; basic block 5, loop depth 2
;; pred: 16
;; 43
# max_38 = PHI <max_5(16), max_40(43)>
# i_42 = PHI <i_29(16), 0(43)>
# datap_44 = PHI <datap_30(16), datap_46(43)>
tmp_24 = *datap_44;
_35 = tmp_24 < 0;
_56 = (unsigned int) tmp_24;
_51 = -_56;
_1 = (int) _51;
_25 = MAX_EXPR <_1, max_38>;
_31 = _1 | -2147483648;
iftmp.0_27 = (unsigned int) _31;
.MASK_STORE (datap_44, 8B, _35, iftmp.0_27);
_26 = MAX_EXPR <tmp_24, max_38>;
max_5 = _35 ? _25 : _26;
i_29 = i_42 + 1;
datap_30 = datap_44 + 4;
if (w_22 > i_29)
  goto <bb 16>; [89.00%]
else
  goto <bb 9>; [11.00%]
;; succ: 16
So here we have a MAX_EXPR reduction, but there are two MAX_EXPRs, which could
be merged into a single MAX_EXPR <max_38, ABS_EXPR <tmp>>.
If the loop is manually changed to the following, it can be vectorized:
for (j = 0; j < t1->h; ++j) {
    const OPJ_UINT32 w = t1->w;
    for (i = 0; i < w; ++i, ++datap) {
        OPJ_INT32 tmp = *datap;
        if (tmp < 0) {
            OPJ_UINT32 tmp_unsigned;
            tmp_unsigned = opj_to_smr(tmp);
            memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
            tmp = -tmp;
        }
        max = opj_int_max(max, tmp);
    }
}
Maybe it is related to phiopt?
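The two forms of the reduction compute the same maximum. A minimal sketch that isolates just the reduction (the store is left out, and the function names are made up for illustration); it assumes the input never contains INT32_MIN, since -tmp would overflow there:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduction as in the original test case: one max per branch. */
int32_t max_two_branches(const int32_t *data, size_t n)
{
    int32_t max = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t tmp = data[i];
        if (tmp < 0)
            max = (max > -tmp) ? max : -tmp;
        else
            max = (max > tmp) ? max : tmp;
    }
    return max;
}

/* Rewritten form: take the magnitude first, then a single max. */
int32_t max_of_abs(const int32_t *data, size_t n)
{
    int32_t max = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t tmp = data[i];
        if (tmp < 0)
            tmp = -tmp;
        max = (max > tmp) ? max : tmp;
    }
    return max;
}

/* Both forms agree on a small sample. */
void check_same_result(void)
{
    const int32_t sample[] = {3, -7, 0, 5, -2};
    assert(max_two_branches(sample, 5) == max_of_abs(sample, 5));
}
```

The second form exposes a single MAX_EXPR over an absolute value, which is the shape described above.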
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: crazylht at gmail dot com @ 2023-11-01 1:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
> So here we have a reduction for MAX_EXPR, but there's 2 MAX_EXPR which can
> be merge together with MAX_EXPR <max_38, ABS_EXPR <tmp>>
>
Created PR 112324.
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: pinskia at gcc dot gnu.org @ 2023-11-07 1:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
After fixing PR 112324 (and a secondary patch to phiopt to do
factor_out_conditional_operation for all phi nodes rather than just a single
one) we still miss the abs detection:
_34 = tmp_24 < 0;
_55 = (unsigned int) tmp_24;
_56 = -_55;
_1 = (intD.6) _56;
_30 = _1 | -2147483648;
iftmp.0_26 = (unsigned intD.9) _30;
# .MEM_27 = VDEF <.MEM_46>
# USE = anything
# CLB = anything
.MASK_STORE (datap_43, 8B, _34, iftmp.0_26);
# RANGE [irange] int [0, +INF]
_25 = _34 ? _1 : tmp_24;
Basically,
`a < 0 ? (int)-(unsigned)a : a` needs to be detected as `(int)ABSU_EXPR<a>`.
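At the source level, the pattern is an absolute value computed in unsigned arithmetic, which is what ABSU_EXPR represents inside GCC. A minimal sketch (absu_pattern is a made-up name for illustration); unlike abs(), the unsigned negation is well defined for INT_MIN:

```c
#include <assert.h>
#include <limits.h>

/* Absolute value with an unsigned result. The negation happens in
   unsigned arithmetic, so INT_MIN maps to INT_MAX + 1 without
   signed overflow. */
unsigned absu_pattern(int a)
{
    return a < 0 ? 0u - (unsigned)a : (unsigned)a;
}

/* A few sample values, including the INT_MIN corner case. */
void absu_examples(void)
{
    assert(absu_pattern(-5) == 5u);
    assert(absu_pattern(7) == 7u);
    assert(absu_pattern(INT_MIN) == (unsigned)INT_MAX + 1u);
}
```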
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: pinskia at gcc dot gnu.org @ 2023-11-07 2:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Depends on| |112416
--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
>
> `a < 0 ? (int)-(unsigned)a : a` needs to detected to be `(int)ABSU_EXPR<a>`.
Filed PR 112416 for that.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112416
[Bug 112416] absu is not detected
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: pinskia at gcc dot gnu.org @ 2023-11-07 2:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Severity|normal |enhancement
Ever confirmed|0 |1
Last reconfirmed| |2023-11-07
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: pinskia at gcc dot gnu.org @ 2023-11-07 2:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Depends on| |112418
--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> After fixing PR 112324 (and a secondary patch to phiopt to do
> factor_out_conditional_operation for all phi nodes rather than just a single
> one) we still miss the abs detection:
Filed PR 112418 for the secondary patch.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112418
[Bug 112418] factor_out_conditional_operation could be done for more phis
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: pinskia at gcc dot gnu.org @ 2023-11-07 2:50 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I should note that PR 112416 is not needed to vectorize the loop, though it
would improve code.
* [Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
From: hubicka at gcc dot gnu.org @ 2023-11-24 23:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015
--- Comment #10 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Runtimes on zen4 hardware:
trunk -O3 -flto -march=native
42171
42964
42106
clang -O3 -flto -march=native
37393
37423
37508
gcc 13 -O3 -flto -march=native
42380
42314
43285
So it seems the performance did not change.
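To put a number on the gap, the three runs per compiler can be averaged and compared. A small sketch; the helper names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n runtimes. */
double mean_runtime(const double *runs, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += runs[i];
    return sum / (double)n;
}

/* Relative slowdown of `slow` versus `fast`, in percent. */
double slowdown_percent(double slow, double fast)
{
    assert(fast > 0.0);
    return (slow / fast - 1.0) * 100.0;
}
```

With the figures above, the means come out at roughly 42414 for trunk, 37441 for clang and 42660 for gcc 13, so trunk is about 13.3% slower than clang and gcc 13 about 13.9% slower.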