public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
@ 2012-07-13 20:00 burnus at gcc dot gnu.org
  2012-07-18 11:26 ` [Bug middle-end/53957] " rguenth at gcc dot gnu.org
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: burnus at gcc dot gnu.org @ 2012-07-13 20:00 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

             Bug #: 53957
           Summary: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long
                    as other compiler
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: burnus@gcc.gnu.org


[Note that MP_PROP_DESIGN is also discussed at the gcc-graphite mailing list,
albeit more with regards to automatic parallelization.]

The polyhedron benchmark (2011 version) is available at:
http://www.polyhedron.com/polyhedron_benchmark_suite0html, namely:
http://www.polyhedron.com/web_images/documents/pb11.zip

(The original program, which also contains a ready-to-go benchmark, is at
http://propdesign.weebly.com/; note that you may have to rename some input
*.txt files to *TXT.)


The program takes twice as long with GCC as with ifort. The program is just 502
lines long (w/o comments) and contains no subroutines or functions. It mainly
consists of loops and some math functions (sin, cos, pow, tan, atan, acos,
exp).


[Result on CentOS 5.7, x86-64-gnu-linux, Intel Xeon X3430 @2.40GHz]


Using GCC 4.8.0 20120622 (experimental) [trunk revision 188871], I get:

$ gfortran -Ofast -funroll-loops -fwhole-program -march=native
mp_prop_design.f90
$ time ./a.out > /dev/null 

real    2m47.138s
user    2m46.808s
sys     0m0.236s


Using Intel's ifort on Intel(R) 64, Version 12.1 Build 20120212:

$ ifort -fast mp_prop_design.f90
$ time ./a.out > /dev/null 
real    1m25.906s
user    1m25.598s
sys     0m0.244s


With Intel's libimf preloaded (LD_PRELOAD=.../libimf.so), GCC has:

real    2m0.524s
user    1m59.809s
sys     0m0.689s



The code features expressions like a**2.0D0, but those are converted in GCC to
a*a.

Using -mveclibabi=svml (and no preloading) gives the same timings as without
(or slightly worse); it just calls vmldAtan2.


Vectorizer: I haven't profiled this part, but I want to note that ifort
vectorizes more, namely:

GCC vectorizes:

662: LOOP VECTORIZED.
1032: LOOP VECTORIZED.
1060: LOOP VECTORIZED.


While ifort has:

mp_prop_design.f90(271): (col. 10) remark: LOOP WAS VECTORIZED.
  (Loop "m1 =2, 45" with conditional jump out of the loop)
mp_prop_design.f90(552): (col. 16) remark: LOOP WAS VECTORIZED.
  (Loop with condition)
mp_prop_design.f90(576): (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.
  (Loop with two IF blocks)
mp_prop_design.f90(639): (col. 16) remark: LOOP WAS VECTORIZED.
  (Rather simple loop)
mp_prop_design.f90(662): (col.  2) remark: LOOP WAS VECTORIZED.
  (Vectorized by GCC)
mp_prop_design.f90(677): (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.
   (Line number points to the outermost of the three loops; there are also
    conditional jumps)
mp_prop_design.f90(818): (col. 16) remark: LOOP WAS VECTORIZED.
   (Nested "if" blocks)
mp_prop_design.f90(1032): (col. 2) remark: LOOP WAS VECTORIZED.
mp_prop_design.f90(1060): (col. 2) remark: LOOP WAS VECTORIZED.
   (The last two are handled by GCC)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug middle-end/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
@ 2012-07-18 11:26 ` rguenth at gcc dot gnu.org
  2012-07-18 12:02 ` rguenth at gcc dot gnu.org
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-18 11:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-07-18
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-18 11:25:55 UTC ---
On trunk we do vectorize the loop at 552, but I'm not sure that unconditionally
calling vmldAtan2 is profitable.  That is, trunk for me has (-Ofast
-mveclibabi=svml):

552: LOOP VECTORIZED.
576: LOOP VECTORIZED.
662: LOOP VECTORIZED.
1032: LOOP VECTORIZED.
1060: LOOP VECTORIZED.

The loop at 639 is converted to two memset calls.

mp_prop_design.f90(677): (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.
   (Line number points to the outermost of the three loops; there are also
    conditional jumps)

seems to be the important one to tackle.

For the loop at 818 we fail to if-convert the nested if

                  IF ( j.EQ.1 ) THEN
                     tempa(j) = ZERO
                  ELSE
                     arg1 = -vefz(j)
                     arg2 = vefphi(j)
                     IF ( (arg2.LT.ZERO) .OR. (arg2.GT.ZERO) ) THEN
                        tempa(j) = ATAN(arg1/arg2) - theta(j)
                     ELSE
                        tempa(j) = -theta(j)
                     ENDIF
                  ENDIF

where we also fail to apply store-motion of tempa(j).  The if (j == 1)
conditional code makes the loop suitable for peeling, too.

That said, this loop is suitable for analysis as well.



* [Bug middle-end/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
  2012-07-18 11:26 ` [Bug middle-end/53957] " rguenth at gcc dot gnu.org
@ 2012-07-18 12:02 ` rguenth at gcc dot gnu.org
  2012-07-18 12:09 ` [Bug fortran/53957] " rguenth at gcc dot gnu.org
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-18 12:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-18 12:02:32 UTC ---
All time is spent in the loop nest starting at lines 677, 683, 694 and 696;
for all of them we claim they are in bad loop form.



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
  2012-07-18 11:26 ` [Bug middle-end/53957] " rguenth at gcc dot gnu.org
  2012-07-18 12:02 ` rguenth at gcc dot gnu.org
@ 2012-07-18 12:09 ` rguenth at gcc dot gnu.org
  2012-07-18 12:46 ` burnus at gcc dot gnu.org
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-18 12:09 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |fortran

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-18 12:09:23 UTC ---
The issue seems to be that the frontend uses two induction variables, one
signed and one unsigned, for

                        DO i = 1 , 1 + NINT(2.0D0*PI*trns/dphit) ,      &
     &                     NINT(ainc/(dphit*(180.0D0/PI)))
...
                        END DO

<bb 78>:
  # i_5 = PHI <[mp_prop_design.f90 : 697:0] 1(77), [mp_prop_design.f90 : 696:0]
i_621(79)>
  # countm1.38_32 = PHI <[mp_prop_design.f90 : 696:0] countm1.38_466(77),
[mp_prop_design.f90 : 696:0] countm1.38_622(79)>
  # prephitmp.386_3285 = PHI <pretmp.385_3284(77), D.2618_614(79)>
  # prephitmp.386_3287 = PHI <pretmp.385_3286(77), D.2620_620(79)>
...
  [mp_prop_design.f90 : 696:0] i_621 = i_5 + pretmp.378_3242;
  [mp_prop_design.f90 : 696:0] # DEBUG i => i_621
  [mp_prop_design.f90 : 696:0] if (countm1.38_32 == 0)
    goto <bb 80>;
  else
    goto <bb 79>;

<bb 79>:
  [mp_prop_design.f90 : 696:0] countm1.38_622 = countm1.38_32 + 4294967295;
  [mp_prop_design.f90 : 696 : 0] goto <bb 78>;

and the "decrement" of countm1 happens in the loop latch block.  It would
be better to shape this like other loops I see:

       bool flag = end-value == i;
       i = i + 1;
       if (flag) goto loop_exit;



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2012-07-18 12:09 ` [Bug fortran/53957] " rguenth at gcc dot gnu.org
@ 2012-07-18 12:46 ` burnus at gcc dot gnu.org
  2012-07-18 13:18 ` rguenther at suse dot de
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: burnus at gcc dot gnu.org @ 2012-07-18 12:46 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu.org

--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> 2012-07-18 12:46:29 UTC ---
(In reply to comment #3)
> It would be better to have this similar to other loops I see,
>
>        bool flag = end-value == i;
>        i = i + 1;
>        if (flag) goto loop_exit;

That's not as simple, since one might not reach the end value exactly, due to
the step.  If "step" is (plus or minus) unity and the DO variable is an integer
(and not real; real DO variables were added in Fortran 77 and deleted in
Fortran 95), it is simple.

But if abs(step) != 1 or if the loop variable is not an integer, one either
needs to calculate the number of trips beforehand, or has to use ">" or "<"
rather than "==". The problem with "<" / ">" is that one has to do another
comparison, unless the sign of "step" is known:

  if (step > 0 ? dovar > to : dovar < to)
    goto exit_label;

I don't see that this version would be better than the current one.
Suggestions or comments?


The current code is (comment from trans-stmt.c's gfc_trans_do):

------------<cut>-----------------
   We translate a do loop from:

   DO dovar = from, to, step
      body
   END DO

   to:

   [evaluate loop bounds and step]
   empty = (step > 0 ? to < from : to > from);
   countm1 = (to - from) / step;
   dovar = from;
   if (empty) goto exit_label;
   for (;;)
     {
       body;
cycle_label:
       dovar += step
       if (countm1 ==0) goto exit_label;
       countm1--;
     }
exit_label:

   countm1 is an unsigned integer.  It is equal to the loop count minus one,
   because the loop count itself can overflow.  */
------------</cut>-----------------



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2012-07-18 12:46 ` burnus at gcc dot gnu.org
@ 2012-07-18 13:18 ` rguenther at suse dot de
  2012-07-18 14:05 ` burnus at gcc dot gnu.org
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenther at suse dot de @ 2012-07-18 13:18 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> 2012-07-18 13:18:13 UTC ---
On Wed, 18 Jul 2012, burnus at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957
> 
> Tobias Burnus <burnus at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |burnus at gcc dot gnu.org
> 
> --- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> 2012-07-18 12:46:29 UTC ---
> (In reply to comment #3)
> > It would be better to have this similar to other loops I see,
> >
> >        bool flag = end-value == i;
> >        i = i + 1;
> >        if (flag) goto loop_exit;
> 
> That's not that simple as one might not reach the end value due to the step. If
> "step" is (plus or minus) unity and if one has integers (and not reals, added
> in Fortran 77, deleted in Fortran 90), it is simple.
> 
> But if abs(step) != 1 or if the loop variable is not an integer, one either
> needs to calculate the number of trips beforehand, or has to use ">" or "<"
> rather than "==". The problem with "<" / ">" is that one has to do another
> comparison, unless the sign of "step" is known:
> 
>   if (step > 0 ? dovar > to : dovar < to)
>     goto exit_label;
> 
> I don't see whether that version is better than the current version.
> Suggestions or comments?
> 
> 
> The current code is (comment from trans-stmt.c's gfc_trans_do):
> 
> ------------<cut>-----------------
>    We translate a do loop from:
> 
>    DO dovar = from, to, step
>       body
>    END DO
> 
>    to:
> 
>    [evaluate loop bounds and step]
>    empty = (step > 0 ? to < from : to > from);
>    countm1 = (to - from) / step;
>    dovar = from;
>    if (empty) goto exit_label;
>    for (;;)
>      {
>        body;
> cycle_label:
>        dovar += step
>        if (countm1 ==0) goto exit_label;
>        countm1--;
>      }
> exit_label:
> 
>    countm1 is an unsigned integer.  It is equal to the loop count minus one,
>    because the loop count itself can overflow.  */

If you do

>    [evaluate loop bounds and step]
>    empty = (step > 0 ? to < from : to > from);
>    countm1 = (to - from) / step;
>    dovar = from;
>    if (empty) goto exit_label;
>    for (;;)
>      { 
>        body;
> cycle_label:
>        dovar += step
         exit = countm1 == 0;
         countm1--;
>        if (exit) goto exit_label;
>      }
> exit_label:

it would work for this case.



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2012-07-18 13:18 ` rguenther at suse dot de
@ 2012-07-18 14:05 ` burnus at gcc dot gnu.org
  2012-07-18 14:48 ` rguenth at gcc dot gnu.org
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: burnus at gcc dot gnu.org @ 2012-07-18 14:05 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #6 from Tobias Burnus <burnus at gcc dot gnu.org> 2012-07-18 14:04:45 UTC ---
Created attachment 27823
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27823
Draft patch: Change comparison into bool assignment, decrement conditional jump

(In reply to comment #5)
> If you do

>          exit = countm1 == 0;
>          countm1--;
> >        if (exit) goto exit_label;

> it would work for this case.


If I apply the attached patch, I do not see any performance difference on my
AMD Athlon64 x2 4800+ with -Ofast -funroll-loops -march=native. 

real  3m45.711s  3m45.589s  3m44.308s  | 3m45.363s  3m45.328s  3m44.220s
user  3m45.710s  3m45.582s  3m44.274s  | 3m45.282s  3m45.286s  3m44.218s



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2012-07-18 14:05 ` burnus at gcc dot gnu.org
@ 2012-07-18 14:48 ` rguenth at gcc dot gnu.org
  2012-07-26  9:59 ` burnus at gcc dot gnu.org
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-18 14:48 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-18 14:47:49 UTC ---
It helps in that it makes us consider the loop at all.  We now run into

696: worklist: examine stmt: D.2574_254 = (real(kind=4)) i_5;

696: vect_is_simple_use: operand i_5
696: def_stmt: i_5 = PHI <1(77), i_324(80)>

696: Unsupported pattern.
696: not vectorized: unsupported use in stmt.
696: unexpected pattern.

that is, the following induction is not handled:

                           phit = phib + phie(k) + (REAL(i)-0.50D0)     &
     &                            *dphit

so it would still be worthwhile to pursue your patch if it does not have
negative effects elsewhere.  We should be able to fix the induction code
to handle this case.

If you can help isolate the innermost two loops into a smaller testcase,
that would be great, too.



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2012-07-18 14:48 ` rguenth at gcc dot gnu.org
@ 2012-07-26  9:59 ` burnus at gcc dot gnu.org
  2012-07-26 10:19 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: burnus at gcc dot gnu.org @ 2012-07-26  9:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #8 from Tobias Burnus <burnus at gcc dot gnu.org> 2012-07-26 09:58:41 UTC ---
(In reply to comment #7)
> so it would be still worthwhile to pursue your patch if it does not have
> negative effects elsewhere.  We should be able to fix the induction code
> to handle this case.

Regarding a negative (or positive) impact on performance: that's
difficult to test :-(

However, with the patch, f951 stops with the following ICE
  internal compiler error: in free_regset_pool, at sel-sched-ir.c:994
with gfortran.dg/pr42294.f and gfortran.dg/pr44691.f.



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2012-07-26  9:59 ` burnus at gcc dot gnu.org
@ 2012-07-26 10:19 ` rguenth at gcc dot gnu.org
  2012-09-11 15:02 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-26 10:19 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #9 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-26 10:18:55 UTC ---
(In reply to comment #8)
> (In reply to comment #7)
> > so it would be still worthwhile to pursue your patch if it does not have
> > negative effects elsewhere.  We should be able to fix the induction code
> > to handle this case.
> 
> Regarding negative (or positive) impact with regards to performance: That's
> difficult to test :-(
> 
> However, with the patch, f951 stops with the following ICE
>   internal compiler error: in free_regset_pool, at sel-sched-ir.c:994
> with gfortran.dg/pr42294.f and gfortran.dg/pr44691.f.

That's a pre-existing issue on current trunk, unrelated to the patch.



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2012-07-26 10:19 ` rguenth at gcc dot gnu.org
@ 2012-09-11 15:02 ` rguenth at gcc dot gnu.org
  2013-01-16 22:53 ` burnus at gcc dot gnu.org
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-09-11 15:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #10 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-09-11 15:02:00 UTC ---
There are a lot more reasons why we do not vectorize this loop :(



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2012-09-11 15:02 ` rguenth at gcc dot gnu.org
@ 2013-01-16 22:53 ` burnus at gcc dot gnu.org
  2013-06-10  4:41 ` prop_design at yahoo dot com
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-01-16 22:53 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu.org

--- Comment #11 from Tobias Burnus <burnus at gcc dot gnu.org> 2013-01-16 22:53:35 UTC ---
(In reply to comment #6)
> Created attachment 27823 [details]
> Draft patch: Change comparison into bool assignment, decrement conditional jump

A similar but slightly different patch has been committed, cf. PR 52865 comment
13.



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2013-01-16 22:53 ` burnus at gcc dot gnu.org
@ 2013-06-10  4:41 ` prop_design at yahoo dot com
  2013-06-10 17:44 ` prop_design at yahoo dot com
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: prop_design at yahoo dot com @ 2013-06-10  4:41 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

Anthony Falzone <prop_design at yahoo dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |prop_design at yahoo dot com

--- Comment #12 from Anthony Falzone <prop_design at yahoo dot com> ---
Hi Guys,

I'm the developer of PROP_DESIGN.  I originally posted on the Google GCC
Graphite Group.  Thanks Tobias for creating this bug and realizing the root
issue.  I originally thought auto-parallelization would be of benefit. 
However, I recently started experimenting with the Intel Fortran compiler and
have found some things that may help you out.  I have found that Intel Fortran
IPO, auto-vectorization, and/or auto-parallelization are of no benefit to
PROP_DESIGN.  I also found, as Tobias mentioned here, that gfortran creates
significantly slower executables than Intel Fortran.  I have narrowed it down
to just the basic optimizations; it does not have to do with anything else.
If you compare gfortran -O3 against Intel Fortran /O3, you see a big
difference.  In one case I ran, the executable built with Intel Fortran /O3
was about 38.65% faster than the one built with gfortran -O3.

I have a measly AMD C-60 processor and use Windows 7 64-bit.  I have tried
many gfortran builds for Windows in the past, but I'm currently using the
latest version of TDM-GCC.  I have also tried Linux but am not currently using
it.  I have not tried Intel Fortran on Linux.

I am not much of a programmer, so I can't say why gfortran -O3 is making slower
executable files than Intel Fortran /O3.  Perhaps you guys would know.  I
thought this information might help you out.  If I can be of any help to you,
let me know.  My website has the latest version of PROP_DESIGN.  Polyhedron
refuses to update the version I sent them years ago.  It would probably be
better if you used the latest version for testing your software.

Sincerely,

Anthony Falzone
http://propdesign.weebly.com/



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2013-06-10  4:41 ` prop_design at yahoo dot com
@ 2013-06-10 17:44 ` prop_design at yahoo dot com
  2013-06-22 13:04 ` dominiq at lps dot ens.fr
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: prop_design at yahoo dot com @ 2013-06-10 17:44 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #13 from Anthony Falzone <prop_design at yahoo dot com> ---
My previous post needs a correction.  Comparing gfortran O3 to Intel Fortran
O3, I see a 60% speed improvement in favor of the Intel Fortran compiler.
There is a 40% improvement over past releases of PROP_DESIGN, which used
gfortran Ofast.  There is not much difference between Intel Fortran O3 and
Ofast, so I am using O3 to ensure accurate calculations.



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2013-06-10 17:44 ` prop_design at yahoo dot com
@ 2013-06-22 13:04 ` dominiq at lps dot ens.fr
  2013-06-23  5:25 ` prop_design at yahoo dot com
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: dominiq at lps dot ens.fr @ 2013-06-22 13:04 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #14 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
Anthony, could you provide a reduced test showing the problem?



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2013-06-22 13:04 ` dominiq at lps dot ens.fr
@ 2013-06-23  5:25 ` prop_design at yahoo dot com
  2020-06-27 23:34 ` prop_design at protonmail dot com
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: prop_design at yahoo dot com @ 2013-06-23  5:25 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #15 from Anthony Falzone <prop_design at yahoo dot com> ---
(In reply to Dominique d'Humieres from comment #14)
> Anthony, could you provide a reduced test showing the problem?

Hi Dominique,

About the most reduced I can think of is PROP_DESIGN_ANALYSIS.  It contains the
core calculations that are required to determine aircraft propeller
performance.  PROP_DESIGN_ANALYSIS_BENCHMARK just adds some looping options
that in my mind could run in parallel.  However, I don't know anything about
parallel programming, and when I tried some Fortran compilers with
auto-parallelization, none of them could pick up on the loops that to me seem
obviously parallel.  So what I think should be parallelizable and what is
currently feasible with auto-parallelization are not the same.

In any event, I have noticed that just using O3 optimizations there is a
substantial difference between Intel Fortran and gfortran.  So I am just
confirming what Tobias is saying here in this bug report.

If PROP_DESIGN_ANALYSIS_BENCHMARK is too complex a test case for you, the only
thing I can think to do is take PROP_DESIGN_ANALYSIS and strip out most of the
end parts where various outputs are created.  The program would then just take
the inputs, run the minimum calculations with the least looping possible, and
output pretty much nothing.  It wouldn't be hard for me to do something like
that if it would be of any benefit to you.

Also, if there is anything programming-wise that you would like changed as far
as Fortran syntax goes, I can try that too.  My knowledge of
Fortran is fairly basic.  I tried to stick strictly to Fortran 77, since that
is what I was trained in and have a lot of experience with.  I don't know any
other programming languages or even any other versions of Fortran such as 90/95
etc...

Anthony



* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2013-06-23  5:25 ` prop_design at yahoo dot com
@ 2020-06-27 23:34 ` prop_design at protonmail dot com
  2020-06-28 10:49 ` tkoenig at gcc dot gnu.org
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: prop_design at protonmail dot com @ 2020-06-27 23:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

prop_design at protonmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |prop_design at protonmail dot com

--- Comment #19 from prop_design at protonmail dot com ---
hi everyone,

I'm not sure if this is the right place to ask this or not, but it relates to
the topic. I can't find the other thread about graphite auto-parallelization
that I made a long time ago.

I tried gfortran 10.1.0 via MSYS2. It seems to work very well on the latest
version of PROP_DESIGN. MP_PROP_DESIGN had some extra loops for benchmarking; I
found they made things harder for the optimizer, so I deleted that code and
just use the 'real' version of the code it was based on, called
PROP_DESIGN_MAPS. So that's the actual propeller design code with no additional
looping for benchmarking purposes.

I've found no Fortran compiler that does the auto-parallelization the way I
would like. The only compiler that would implement any at run time actually
slowed the code way down instead of speeding it up.

I still have my original problem with gfortran. That is, at run time no actual
parallelization occurs. The code runs exactly the same as if the commands were
not present. Oddly, though, the compiler does say it auto-parallelized many
loops. They aren't the loops that would really help, but at least it shows it's
doing something. That's an improvement from when I started these threads.

The problem is if I compile with the following:

gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -static
-march=x86-64 -mtune=generic -mfpmath=sse -mieee-fp -pthread
-ftree-parallelize-loops=2 -floop-parallelize-all -fopt-info-loop


It runs the exact same way as if I compile with:

gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -static
-march=x86-64 -mtune=generic -mfpmath=sse -mieee-fp


Again, gfortran does say it auto-parallelized some loops, so it's very odd. I
have searched the net and can't find anything that has helped.

I'm wondering whether, for Linux users, the code actually will run in parallel.
That would at least narrow the problem down some. I'm using Windows 10 and the
code will only run on one core. Compiling both ways, it shows 2 threads used
for a while and then drops to 1 thread.

The good news from when this was posted is that gfortran ran the code at the
same speed as the PGI Community Edition Compiler. Since they just stopped
developing that, I switched back to gfortran. I no longer have Intel Fortran to
test. That was the compiler that actually did run the code in parallel, but it
ran twice as slow instead of twice as fast. That was a year or two ago. I don't
know if it's any better now.

I'm wondering if there is some sort of issue with -pthread not being able to
call anything more than one core on Windows 10.

You can download PROP_DESIGN at https://propdesign.jimdofree.com

Inside the download are all the *.f files. I also have c.bat files in there
with the compiler options I used. The auto-parallelization commands are not
present, since they still don't seem to be working, at least on Windows 10.

The code now runs much faster than it used to, due to many bug fixes and
improvements I've made over the years. However, you can get it to run really
slow for testing purposes. In the settings file for the program change the
defaults like this:

1           ALLOW VORTEX INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM,
DEFAULT = 2)
2           ALLOW BLADE-TO-BLADE INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER,
NON-DIM, DEFAULT = 2)

or like this

1           ALLOW VORTEX INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM,
DEFAULT = 2)
1           ALLOW BLADE-TO-BLADE INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER,
NON-DIM, DEFAULT = 2)

The first runs very slow, the second incredibly slow. I just close the command
window once I've seen if the code is running in parallel or not. With the
defaults set at 2 for each of those values the code runs so fast you can't
really get a sense of what's going on.

Thanks for any help,

Anthony

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2020-06-27 23:34 ` prop_design at protonmail dot com
@ 2020-06-28 10:49 ` tkoenig at gcc dot gnu.org
  2020-06-28 11:03 ` tkoenig at gcc dot gnu.org
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-06-28 10:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #20 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
I looked at the output of 

gfortran -Ofast -funroll-loops -march=native -mtune=native -fopt-info-vec
mp_prop_design.f90  2>&1 | wc -l

in the Polyhedron testsuite, and I can confirm it really shows
much more vectorization than before.  So, I'm glad that part seems
to be fixed.

Regarding auto-parallelization: I can confirm your observation under
Linux; it doesn't do any more than on Windows.

I think I will open up a meta-PR and make this PR depend on it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2020-06-28 10:49 ` tkoenig at gcc dot gnu.org
@ 2020-06-28 11:03 ` tkoenig at gcc dot gnu.org
  2020-06-28 15:40 ` prop_design at protonmail dot com
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-06-28 11:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #21 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Another question: Is there anything left to be done with the
vectorizer, or could we remove that dependency?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2020-06-28 11:03 ` tkoenig at gcc dot gnu.org
@ 2020-06-28 15:40 ` prop_design at protonmail dot com
  2020-06-29  9:36 ` rguenther at suse dot de
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: prop_design at protonmail dot com @ 2020-06-28 15:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #22 from Anthony <prop_design at protonmail dot com> ---
(In reply to Thomas Koenig from comment #21)
> Another question: Is there anything left to be done with the
> vectorizer, or could we remove that dependency?

Thanks for looking into this again for me. I'm surprised it worked the same on
Linux, but knowing that at least helps debug this issue some more. I'm not
sure about the vectorizer question; maybe it was intended for someone else.
The runtimes seem good as is, though. I doubt auto-parallelization will add
much speed, but it's an interesting feature that I've always hoped would work.
I've never gotten it to work, though. The only compiler that actually
implemented something was Intel Fortran. It parallelized one trivial loop, but
that slowed the code down instead of speeding it up. The output from gfortran
shows more loops it wants to run in parallel. They aren't important ones, but
something would be better than nothing. If it slowed the code down, I would
just not use it.

There is something different in gfortran where it mentions a lot of 16-bit
vectorization. I don't recall that from the past, but whatever it's doing
seems fine from a speed perspective.

Some compliments to the developers: the code compiles very fast compared to
other compilers. I'm really glad it doesn't rely on Microsoft Visual Studio;
that's a huge, time-consuming install, and I was very happy I could finally
uninstall it. Also, gfortran handles all my STOP statements properly. PGI
Community Edition was adding a bunch of nonsense output anytime a STOP
command was issued, so it's nice to have the code work as intended again.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2020-06-28 15:40 ` prop_design at protonmail dot com
@ 2020-06-29  9:36 ` rguenther at suse dot de
  2020-06-29 14:09 ` prop_design at protonmail dot com
  2020-07-29 22:25 ` prop_design at protonmail dot com
  21 siblings, 0 replies; 23+ messages in thread
From: rguenther at suse dot de @ 2020-06-29  9:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #23 from rguenther at suse dot de <rguenther at suse dot de> ---
On Sun, 28 Jun 2020, prop_design at protonmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957
> 
> --- Comment #22 from Anthony <prop_design at protonmail dot com> ---
> (In reply to Thomas Koenig from comment #21)
> > Another question: Is there anything left to be done with the
> > vectorizer, or could we remove that dependency?
> 
> thanks for looking into this again for me. i'm surprised it worked the same on
> Linux, but knowing that, at least helps debug this issue some more. I'm not
> sure about the vectorizer question, maybe that question was intended for
> someone else. the runtimes seem good as is though. i doubt the
> auto-parallelization will add much speed. but it's an interesting feature that
> i've always hoped would work. i've never got it to work though. the only code
> that did actually implement something was Intel Fortran. it implemented one
> trivial loop, but it slowed the code down instead of speeding it up. the output
> from gfortran shows more loops it wants to run in parallel. they aren't
> important ones. but something would be better than nothing. if it slowed the
> code down, i would just not use it.

GCC adds runtime checks for a minimal number of iterations before
dispatching to the parallelized code - I guess we simply never hit
the threshold.  This is configurable via --param parloops-min-per-thread,
the default is 100, the default number of threads is determined the same
as for OpenMP so you can probably tune that via OMP_NUM_THREADS.
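A minimal sketch of what that tuning could look like on the command line
(the lowered threshold value, output name, and thread count here are
assumptions for illustration, not taken from the report):

```shell
# Rebuild with auto-parallelization enabled and a much lower per-thread
# iteration threshold, so the runtime check dispatches to the parallel
# code path sooner; then cap the worker count via OMP_NUM_THREADS.
gfortran PROP_DESIGN_MAPS.f -o prop_design_maps -O3 -ffixed-form \
    -ftree-parallelize-loops=2 --param parloops-min-per-thread=2

OMP_NUM_THREADS=2 ./prop_design_maps
```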

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2020-06-29  9:36 ` rguenther at suse dot de
@ 2020-06-29 14:09 ` prop_design at protonmail dot com
  2020-07-29 22:25 ` prop_design at protonmail dot com
  21 siblings, 0 replies; 23+ messages in thread
From: prop_design at protonmail dot com @ 2020-06-29 14:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #24 from Anthony <prop_design at protonmail dot com> ---
(In reply to rguenther@suse.de from comment #23)
> On Sun, 28 Jun 2020, prop_design at protonmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957
> > 
> > --- Comment #22 from Anthony <prop_design at protonmail dot com> ---
> > (In reply to Thomas Koenig from comment #21)
> > > Another question: Is there anything left to be done with the
> > > vectorizer, or could we remove that dependency?
> > 
> > thanks for looking into this again for me. i'm surprised it worked the same on
> > Linux, but knowing that, at least helps debug this issue some more. I'm not
> > sure about the vectorizer question, maybe that question was intended for
> > someone else. the runtimes seem good as is though. i doubt the
> > auto-parallelization will add much speed. but it's an interesting feature that
> > i've always hoped would work. i've never got it to work though. the only code
> > that did actually implement something was Intel Fortran. it implemented one
> > trivial loop, but it slowed the code down instead of speeding it up. the output
> > from gfortran shows more loops it wants to run in parallel. they aren't
> > important ones. but something would be better than nothing. if it slowed the
> > code down, i would just not use it.
> 
> GCC adds runtime checks for a minimal number of iterations before
> dispatching to the parallelized code - I guess we simply never hit
> the threshold.  This is configurable via --param parloops-min-per-thread,
> the default is 100, the default number of threads is determined the same
> as for OpenMP so you can probably tune that via OMP_NUM_THREADS.

Thanks for that tip. I tried changing the parloops parameters, but no luck.
The only difference was that the maximum thread use went from 2 to 3; core use
was the same.

I added the following, and some variations of these:

--param parloops-min-per-thread=2 (the default was 100, like you said)
--param parloops-chunk-size=1 (the default was zero, so I removed this
parameter later)
--param parloops-schedule=auto (tried all options except guided; the default
is static)

I was able to check that they were set via:

--help=param -Q

Some other things I tried were adding -mthreads and removing -static, but so
far no luck. I also tried using -mthreads instead of -pthread.

I should make clear I'm testing PROP_DESIGN_MAPS, not MP_PROP_DESIGN.
MP_PROP_DESIGN is ancient, and the added benchmarking loops were messing with
the optimizer's ability to auto-parallelize (in the past, at least).
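As a side note, that verification step can be narrowed to just the relevant
knobs (the grep pattern is an assumption; any parloops-related parameter name
will match it):

```shell
# Print the effective values of the parloops-related --param knobs for
# this set of options, without compiling anything.
gfortran -Q --help=param -O3 --param parloops-min-per-thread=2 2>&1 \
  | grep parloops
```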

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler
  2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2020-06-29 14:09 ` prop_design at protonmail dot com
@ 2020-07-29 22:25 ` prop_design at protonmail dot com
  21 siblings, 0 replies; 23+ messages in thread
From: prop_design at protonmail dot com @ 2020-07-29 22:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #25 from Anthony <prop_design at protonmail dot com> ---
(In reply to Anthony from comment #24)
> (In reply to rguenther@suse.de from comment #23)
> > On Sun, 28 Jun 2020, prop_design at protonmail dot com wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957
> > > 
> > > --- Comment #22 from Anthony <prop_design at protonmail dot com> ---
> > > (In reply to Thomas Koenig from comment #21)
> > > > Another question: Is there anything left to be done with the
> > > > vectorizer, or could we remove that dependency?
> > > 
> > > thanks for looking into this again for me. i'm surprised it worked the same on
> > > Linux, but knowing that, at least helps debug this issue some more. I'm not
> > > sure about the vectorizer question, maybe that question was intended for
> > > someone else. the runtimes seem good as is though. i doubt the
> > > auto-parallelization will add much speed. but it's an interesting feature that
> > > i've always hoped would work. i've never got it to work though. the only code
> > > that did actually implement something was Intel Fortran. it implemented one
> > > trivial loop, but it slowed the code down instead of speeding it up. the output
> > > from gfortran shows more loops it wants to run in parallel. they aren't
> > > important ones. but something would be better than nothing. if it slowed the
> > > code down, i would just not use it.
> > 
> > GCC adds runtime checks for a minimal number of iterations before
> > dispatching to the parallelized code - I guess we simply never hit
> > the threshold.  This is configurable via --param parloops-min-per-thread,
> > the default is 100, the default number of threads is determined the same
> > as for OpenMP so you can probably tune that via OMP_NUM_THREADS.
> 
> thanks for that tip. i tried changing the parloops parameters but no luck.
> the only difference was the max thread use went from 2 to 3. core use was
> the same.
> 
> i added the following an some variations of these:
> 
> --param parloops-min-per-thread=2 (the default was 100 like you said)
> --param parloops-chunk-size=1 (the default was zero so i removed this
> parameter later) --param parloops-schedule=auto (tried all options except
> guided, the default is static)
> 
> i was able to check that they were set via:
> 
> --help=param -Q
> 
> some other things i tried was adding -mthreads and removing -static. but so
> far no luck. i also tried using -mthreads instead of -pthread.
> 
> i should make clear i'm testing PROP_DESIGN_MAPS, not MP_PROP_DESIGN.
> MP_PROP_DESIGN is ancient and the added benchmarking loops were messing with
> the ability of the optimizer to auto-parallelize (in the past at least).

I did more testing, and the added options actually slow the code way down.
However, it is still only using one core. From what I can tell, if I set
OMP_PLACES it doesn't seem like it's working. I saw a thread from years ago
where someone had the same problem; it said OMP_PLACES might be working on
Linux but not on Windows. I don't really know, but I've exhausted all the
possibilities at this point. The only thing I know for sure is that I can't
get it to use anything more than one core.

--- Comment #27 from Anthony <prop_design at protonmail dot com> ---
so after trying a bunch of things, i think the final problem may be this. i get
the following result when i try to set thread affinity:

set GOMP_CPU_AFFINITY="0 1"

gives the following feedback at run time; libgomp: Affinity not supported on
this configuration

i have to close the command prompt window to stop the program. the program
doesn't run properly if i try to set thread affinity.

so this still makes me thing it might work on linux and not windows 10, but i
have no way to test that.

the extra threads that auto-parallelization create will only go to one core, on
my machine at least.
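For reference, a sketch of how the same affinity request is usually expressed
from a Linux shell, where libgomp does support it (the executable name is an
assumption):

```shell
# Pin the libgomp worker threads to cores 0 and 1. On the Windows/MinGW
# libgomp build described above, this same setting instead produces
# "libgomp: Affinity not supported on this configuration" at run time.
export GOMP_CPU_AFFINITY="0 1"
OMP_NUM_THREADS=2 ./prop_design_maps
```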

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-07-29 22:25 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-13 20:00 [Bug middle-end/53957] New: Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler burnus at gcc dot gnu.org
2012-07-18 11:26 ` [Bug middle-end/53957] " rguenth at gcc dot gnu.org
2012-07-18 12:02 ` rguenth at gcc dot gnu.org
2012-07-18 12:09 ` [Bug fortran/53957] " rguenth at gcc dot gnu.org
2012-07-18 12:46 ` burnus at gcc dot gnu.org
2012-07-18 13:18 ` rguenther at suse dot de
2012-07-18 14:05 ` burnus at gcc dot gnu.org
2012-07-18 14:48 ` rguenth at gcc dot gnu.org
2012-07-26  9:59 ` burnus at gcc dot gnu.org
2012-07-26 10:19 ` rguenth at gcc dot gnu.org
2012-09-11 15:02 ` rguenth at gcc dot gnu.org
2013-01-16 22:53 ` burnus at gcc dot gnu.org
2013-06-10  4:41 ` prop_design at yahoo dot com
2013-06-10 17:44 ` prop_design at yahoo dot com
2013-06-22 13:04 ` dominiq at lps dot ens.fr
2013-06-23  5:25 ` prop_design at yahoo dot com
2020-06-27 23:34 ` prop_design at protonmail dot com
2020-06-28 10:49 ` tkoenig at gcc dot gnu.org
2020-06-28 11:03 ` tkoenig at gcc dot gnu.org
2020-06-28 15:40 ` prop_design at protonmail dot com
2020-06-29  9:36 ` rguenther at suse dot de
2020-06-29 14:09 ` prop_design at protonmail dot com
2020-07-29 22:25 ` prop_design at protonmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).