Re: Implement stack arrays even for unknown sizes

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: Implement stack arrays even for unknown sizes
@ 2011-04-09 10:08 Dominique Dhumieres
  2011-04-09 12:17 ` Paul Richard Thomas
  0 siblings, 1 reply; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-09 10:08 UTC (permalink / raw)
  To: fortran; +Cc: gcc-patches, matz

Michael,

I have applied your patch on top of revision 172217 on x86_64-apple-darwin10.7.0.
So far I have only limited tests on the polyhedron test suite.
The test nf.f90 (containing an automatic array) executes in less than 20s, compares
to ~28s without the patch. However capacita.f90 is miscompiled (Segmentation fault
at -O) and fatigue.f90 with -fwhole-program executes in ~7.5s instead of ~6.3s
(with -Ofast).

Cheers,

Dominique

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-09 10:08 Implement stack arrays even for unknown sizes Dominique Dhumieres
@ 2011-04-09 12:17 ` Paul Richard Thomas
  2011-04-10 13:29   ` Dominique Dhumieres
  2011-04-11 16:04   ` Michael Matz
  0 siblings, 2 replies; 40+ messages in thread
From: Paul Richard Thomas @ 2011-04-09 12:17 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: fortran, gcc-patches, matz

Dear Dominique and Michael,

I find that both nf.f90 and capacita.f90 segfault in runtime for any stack size.

Cheers

Paul

On Sat, Apr 9, 2011 at 12:08 PM, Dominique Dhumieres <dominiq@lps.ens.fr> wrote:
> Michael,
>
> I have applied your patch on top of revision 172217 on x86_64-apple-darwin10.7.0.
> So far I have only limited tests on the polyhedron test suite.
> The test nf.f90 (containing an automatic array) executes in less than 20s, compares
> to ~28s without the patch. However capacita.f90 is miscompiled (Segmentation fault
> at -O) and fatigue.f90 with -fwhole-program executes in ~7.5s instead of ~6.3s
> (with -Ofast).
>
> Cheers,
>
> Dominique
>



-- 
The knack of flying is learning how to throw yourself at the ground and miss.
       --Hitchhikers Guide to the Galaxy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-09 12:17 ` Paul Richard Thomas
@ 2011-04-10 13:29   ` Dominique Dhumieres
  2011-04-11 11:49     ` Michael Matz
  2011-04-11 16:04   ` Michael Matz
  1 sibling, 1 reply; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-10 13:29 UTC (permalink / raw)
  To: paul.richard.thomas, dominiq; +Cc: matz, gcc-patches, fortran

> I find that both nf.f90 and capacita.f90 segfault in runtime for any stack size.

On x86_64-apple-darwin10, nf.f90 "works". However if I run it through
valgrind I get

==64815== Memcheck, a memory error detector
==64815== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==64815== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==64815== Command: a.out --max-stackframe=2118496
==64815==
==64815== Warning: set address range perms: large range [0x7ffe6c000000, 0x7fff5bc01000) (defined)
==64815== Warning: client switching stacks?  SP change: 0x7fff5bffe410 --> 0x7fff5be0cef0
==64815==          to suppress, use: --max-stackframe=2037024 or greater
==64815== Invalid write of size 8
==64815==    at 0x100003B22: nf2dprecon.1828 (nf.f90:262)
==64815==    by 0x10000C3B6: nfcg_ (nf.f90:279)
==64815==    by 0x3FEFFFFFFFFFFFFF: ???
==64815==    by 0x3FEFFFFFFFFFFFFF: ???
==64815==  Address 0x7fff5be0cee8 is on thread 1's stack
==64815==
==64815== Invalid read of size 8
==64815==    at 0x100001296: trisolve.1839 (nf.f90:238)
==64815==    by 0x10000C3B6: nfcg_ (nf.f90:279)
==64815==    by 0x3FEFFFFFFFFFFFFF: ???
==64815==    by 0x3FEFFFFFFFFFFFFF: ???
==64815==  Address 0x7fff5be0cee8 is on thread 1's stack
==64815==
...
==64815== Invalid write of size 8
==64815==    at 0x100000F38: trisolve.1839 (nf.f90:231)
==64815==    by 0x10000C3B6: nfcg_ (nf.f90:279)
==64815==    by 0x3FEFFFFFFFFFFFFF: ???
==64815==    by 0x3FEFFFFFFFFFFFFF: ???
==64815==  Address 0x7fff5bffdde8 is on thread 1's stack
==64815==
==64815== Invalid read of size 8
==64815==    at 0x100000F64: trisolve.1839 (nf.f90:233)
==64815==    by 0xC02396760B69D62B: ???
==64815==    by 0xC02396760B69D62C: ???
==64815==    by 0xC02396760B69D62D: ???
==64815==    by 0xC02396760B69D62D: ???
==64815==    by 0xC02396760B69D62D: ???
==64815==    by 0xC02396760B69D62D: ???
==64815==    by 0xC02396760B69D62D: ???
==64815==    by 0xC02396760B69D62D: ???
==64815==    by 0xC02396760B69D62B: ???
==64815==    by 0xC02396760B69D62B: ???
==64815==    by 0xC02396760B69D62B: ???
==64815==  Address 0x7fff5bffdde8 is on thread 1's stack
==64815==
...
 Time for setup          1.316
 Time per iteration      3.250
 Total Time            144.295
==64815==
==64815== HEAP SUMMARY:
==64815==     in use at exit: 656 bytes in 13 blocks
==64815==   total heap usage: 411 allocs, 398 frees, 387,314,592 bytes allocated
==64815==
==64815== LEAK SUMMARY:
==64815==    definitely lost: 0 bytes in 0 blocks
==64815==    indirectly lost: 0 bytes in 0 blocks
==64815==      possibly lost: 0 bytes in 0 blocks
==64815==    still reachable: 656 bytes in 13 blocks
==64815==         suppressed: 0 bytes in 0 blocks
==64815== Rerun with --leak-check=full to see details of leaked memory
==64815==
==64815== For counts of detected and suppressed errors, rerun with: -v
==64815== ERROR SUMMARY: 40050 errors from 1000 contexts (suppressed: 0 from 0)

The segfault for capacita.f90 occurs in the subroutine fourir at the line

            write(unit=*, fmt=*) "error in fourier: n=", ntot

AFAICT the problem occurs in the loop

      do m=1,ntot/4-1
        E(m) = exp(m*h)
      end do

If I print ntot, loc(ntot) before it I get

        2048      140734799794712

After the loop loc(ntot) is

              9205357642636066816

and any attemp to print its values yields a segfault.

Dominique

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-10 13:29   ` Dominique Dhumieres
@ 2011-04-11 11:49     ` Michael Matz
  2011-04-11 11:58       ` Axel Freyn
  2011-04-11 13:35       ` Eric Botcazou
  0 siblings, 2 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-11 11:49 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran

Hi,

On Sun, 10 Apr 2011, Dominique Dhumieres wrote:

> > I find that both nf.f90 and capacita.f90 segfault in runtime for any stack size.
> 
> On x86_64-apple-darwin10, nf.f90 "works". However if I run it through
> valgrind I get
> 
> ==64815== Memcheck, a memory error detector
> ==64815== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==64815== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
> ==64815== Command: a.out --max-stackframe=2118496
> ==64815==
> ==64815== Warning: set address range perms: large range [0x7ffe6c000000, 0x7fff5bc01000) (defined)
> ==64815== Warning: client switching stacks?  SP change: 0x7fff5bffe410 --> 0x7fff5be0cef0
> ==64815==          to suppress, use: --max-stackframe=2037024 or greater

See?  That's whay I meant with having to use a large ulimit for stack 
size.  Usually stack overflows symptom is a simple segfault.  What ulimit 
-s have you used for your capacita tests?

> The segfault for capacita.f90 occurs in the subroutine fourir at the line
> 
>             write(unit=*, fmt=*) "error in fourier: n=", ntot
> 
> AFAICT the problem occurs in the loop
> 
>       do m=1,ntot/4-1
>         E(m) = exp(m*h)
>       end do
> 
> If I print ntot, loc(ntot) before it I get
> 
>         2048      140734799794712
> 
> After the loop loc(ntot) is
> 
>               9205357642636066816
> 
> and any attemp to print its values yields a segfault.

I'll poke at polyhedron somewhat.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 11:49     ` Michael Matz
@ 2011-04-11 11:58       ` Axel Freyn
  2011-04-11 13:35       ` Eric Botcazou
  1 sibling, 0 replies; 40+ messages in thread
From: Axel Freyn @ 2011-04-11 11:58 UTC (permalink / raw)
  To: Michael Matz
  Cc: Dominique Dhumieres, paul.richard.thomas, gcc-patches, fortran

Hi,
On Mon, Apr 11, 2011 at 01:49:16PM +0200, Michael Matz wrote:
> Hi,
> 
> On Sun, 10 Apr 2011, Dominique Dhumieres wrote:
> 
> > > I find that both nf.f90 and capacita.f90 segfault in runtime for any stack size.
> > 
> > On x86_64-apple-darwin10, nf.f90 "works". However if I run it through
> > valgrind I get
> > 
> > ==64815== Memcheck, a memory error detector
> > ==64815== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> > ==64815== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
> > ==64815== Command: a.out --max-stackframe=2118496
> > ==64815==
> > ==64815== Warning: set address range perms: large range [0x7ffe6c000000, 0x7fff5bc01000) (defined)
> > ==64815== Warning: client switching stacks?  SP change: 0x7fff5bffe410 --> 0x7fff5be0cef0
> > ==64815==          to suppress, use: --max-stackframe=2037024 or greater
> 
> See?  That's whay I meant with having to use a large ulimit for stack 
> size.  Usually stack overflows symptom is a simple segfault.  What ulimit 
> -s have you used for your capacita tests?

Just a small remark: according to the message from valgrind, the
"max-stackframe" parameter is at the WRONG position. It has to be
valgrind --max-stackframe=2118496 ./a.out
and NOT
valgrind ./a.out --max-stackframe=2118496
(the second way, valgrind ignores "max-stackframe" and passes it as
parameter to ./a.out. According to the valgrind-output, you passed the
parameter AFTER the "./a.out". That might explain the problem?

Axel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 11:49     ` Michael Matz
  2011-04-11 11:58       ` Axel Freyn
@ 2011-04-11 13:35       ` Eric Botcazou
  2011-04-11 14:06         ` Dominique Dhumieres
  2011-04-11 14:59         ` Michael Matz
  1 sibling, 2 replies; 40+ messages in thread
From: Eric Botcazou @ 2011-04-11 13:35 UTC (permalink / raw)
  To: Michael Matz
  Cc: gcc-patches, Dominique Dhumieres, paul.richard.thomas, fortran

> See?  That's whay I meant with having to use a large ulimit for stack
> size.  Usually stack overflows symptom is a simple segfault.  What ulimit
> -s have you used for your capacita tests?

Compiling with -fstack-check should give the segfault reliably.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 13:35       ` Eric Botcazou
@ 2011-04-11 14:06         ` Dominique Dhumieres
  2011-04-11 14:59         ` Michael Matz
  1 sibling, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-11 14:06 UTC (permalink / raw)
  To: matz, ebotcazou; +Cc: paul.richard.thomas, gcc-patches, fortran, dominiq

> What ulimit -s have you used for your capacita tests?

On *-apple-darwin* the stacksize is limited to (kbytes, -s) 65532
and it is hard coded. It is my (very limited) understanding that
this limit can be bypassed by using something such as
-Wl,-stack_size,0xf0000000 (for 1Gbytes in 64 bit mode -in 32 bit mode
you need an additional trick). Doing so for capacita does not help.
For nf the best I get is

[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays nf.f90 -g -save-temps 
[macbook] lin/test% valgrind --max-stackframe=64118496 a.out
==25739== Memcheck, a memory error detector
==25739== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==25739== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==25739== Command: a.out
==25739== 

 Solve   97 by 105 by  99 FD problem - method=NFCG
 Band elements are   100.00,    10.00, and     1.00. stiffness=   1000.00
 Target RMS residual =   0.1000E-06 Maximum Iterations =  50

 Iter      Alpha        Beta     RMS Residual   Sum of Residuals
   0                             0.9958682E-01      100.0000    
   0     1.00000                 0.1412538E-01    -0.2314427E-09
   1     0.08835     0.00000     0.9698062E-02    -0.2522486E-09
...
  44     0.07743     0.55715     0.8934363E-07    -0.4388281E-14

 Time for setup          0.868
 Time per iteration      1.709
 Total Time             76.049
==25739== 
==25739== HEAP SUMMARY:
==25739==     in use at exit: 656 bytes in 13 blocks
==25739==   total heap usage: 409 allocs, 396 frees, 387,298,208 bytes allocated
==25739== 
==25739== LEAK SUMMARY:
==25739==    definitely lost: 0 bytes in 0 blocks
==25739==    indirectly lost: 0 bytes in 0 blocks
==25739==      possibly lost: 0 bytes in 0 blocks
==25739==    still reachable: 656 bytes in 13 blocks
==25739==         suppressed: 0 bytes in 0 blocks
==25739== Rerun with --leak-check=full to see details of leaked memory
==25739== 
==25739== For counts of detected and suppressed errors, rerun with: -v
==25739== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

> Compiling with -fstack-check should give the segfault reliably.

It does not seems to work for me.

Dominique

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 13:35       ` Eric Botcazou
  2011-04-11 14:06         ` Dominique Dhumieres
@ 2011-04-11 14:59         ` Michael Matz
  1 sibling, 0 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-11 14:59 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: gcc-patches, Dominique Dhumieres, paul.richard.thomas, fortran

Hi,

On Mon, 11 Apr 2011, Eric Botcazou wrote:

> > See?  That's whay I meant with having to use a large ulimit for stack
> > size.  Usually stack overflows symptom is a simple segfault.  What ulimit
> > -s have you used for your capacita tests?
> 
> Compiling with -fstack-check should give the segfault reliably.

Yeah, I see the problem.  Strange that it doesn't occure with spec or the 
testsuite, it's a scoping problem.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-09 12:17 ` Paul Richard Thomas
  2011-04-10 13:29   ` Dominique Dhumieres
@ 2011-04-11 16:04   ` Michael Matz
  2011-04-11 16:45     ` Steven Bosscher
                       ` (6 more replies)
  1 sibling, 7 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-11 16:04 UTC (permalink / raw)
  To: Paul Richard Thomas; +Cc: Dominique Dhumieres, fortran, gcc-patches

On Sat, 9 Apr 2011, Paul Richard Thomas wrote:

> I find that both nf.f90 and capacita.f90 segfault in runtime for any 
> stack size.

Try this patch.  I've verified that capacita and nf work with it and 
-march=native -ffast-math -funroll-loops -fstack-arrays -O3 .  In fact all 
of polyhedron works for me on these flags.  (I've set a ulimit -s of 
512MB, but I don't know if such a large amount is required).


Ciao,
Michael.

	* trans-array.c (toplevel): Include gimple.h.
	(gfc_trans_allocate_array_storage): Check flag_stack_arrays,
	properly expand variable length arrays.
	(gfc_trans_auto_array_allocation): If flag_stack_arrays create
	variable length decls and associate them with their scope.
	* gfortran.h (gfc_option_t): Add flag_stack_arrays member.
	* options.c (gfc_init_options): Handle -fstack_arrays option.
	* lang.opt (fstack-arrays): Add option.
	* invoke.texi (Code Gen Options): Document it.
	* Make-lang.in (trans-array.o): Depend on GIMPLE_H.

Index: trans-array.c
===================================================================
*** trans-array.c	(revision 172206)
--- trans-array.c	(working copy)
*************** along with GCC; see the file COPYING3.
*** 81,86 ****
--- 81,87 ----
  #include "system.h"
  #include "coretypes.h"
  #include "tree.h"
+ #include "gimple.h"
  #include "diagnostic-core.h"	/* For internal_error/fatal_error.  */
  #include "flags.h"
  #include "gfortran.h"
*************** gfc_trans_allocate_array_storage (stmtbl
*** 630,647 ****
      {
        /* Allocate the temporary.  */
        onstack = !dynamic && initial == NULL_TREE
! 			 && gfc_can_put_var_on_stack (size);
  
        if (onstack)
  	{
  	  /* Make a temporary variable to hold the data.  */
  	  tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
  				 nelem, gfc_index_one_node);
  	  tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
  				  tmp);
  	  tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
  				  tmp);
  	  tmp = gfc_create_var (tmp, "A");
  	  tmp = gfc_build_addr_expr (NULL_TREE, tmp);
  	  gfc_conv_descriptor_data_set (pre, desc, tmp);
  	}
--- 631,654 ----
      {
        /* Allocate the temporary.  */
        onstack = !dynamic && initial == NULL_TREE
! 			 && (gfc_option.flag_stack_arrays
! 			     || gfc_can_put_var_on_stack (size));
  
        if (onstack)
  	{
  	  /* Make a temporary variable to hold the data.  */
  	  tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
  				 nelem, gfc_index_one_node);
+ 	  tmp = gfc_evaluate_now (tmp, pre);
  	  tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
  				  tmp);
  	  tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
  				  tmp);
  	  tmp = gfc_create_var (tmp, "A");
+ 	  gfc_add_expr_to_block (pre,
+ 				 fold_build1_loc (input_location,
+ 						  DECL_EXPR, TREE_TYPE (tmp),
+ 						  tmp));
  	  tmp = gfc_build_addr_expr (NULL_TREE, tmp);
  	  gfc_conv_descriptor_data_set (pre, desc, tmp);
  	}
*************** gfc_trans_auto_array_allocation (tree de
*** 4744,4749 ****
--- 4751,4758 ----
    tree tmp;
    tree size;
    tree offset;
+   tree space;
+   tree inittree;
    bool onstack;
  
    gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
*************** gfc_trans_auto_array_allocation (tree de
*** 4800,4814 ****
        return;
      }
  
!   /* The size is the number of elements in the array, so multiply by the
!      size of an element to get the total size.  */
!   tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
!   size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! 			  size, fold_convert (gfc_array_index_type, tmp));
  
!   /* Allocate memory to hold the data.  */
!   tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
!   gfc_add_modify (&init, decl, tmp);
  
    /* Set offset of the array.  */
    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
--- 4809,4838 ----
        return;
      }
  
!   if (gfc_option.flag_stack_arrays)
!     {
!       gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
!       space = build_decl (sym->declared_at.lb->location,
! 			  VAR_DECL, create_tmp_var_name ("A"),
! 			  TREE_TYPE (TREE_TYPE (decl)));
!       gfc_trans_vla_type_sizes (sym, &init);
!     }
!   else
!     {
!       /* The size is the number of elements in the array, so multiply by the
! 	 size of an element to get the total size.  */
!       tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
!       size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! 			      size, fold_convert (gfc_array_index_type, tmp));
  
!       /* Allocate memory to hold the data.  */
!       tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
!       gfc_add_modify (&init, decl, tmp);
! 
!       /* Free the temporary.  */
!       tmp = gfc_call_free (convert (pvoid_type_node, decl));
!       space = NULL_TREE;
!     }
  
    /* Set offset of the array.  */
    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
*************** gfc_trans_auto_array_allocation (tree de
*** 4817,4826 ****
    /* Automatic arrays should not have initializers.  */
    gcc_assert (!sym->value);
  
!   /* Free the temporary.  */
!   tmp = gfc_call_free (convert (pvoid_type_node, decl));
  
!   gfc_add_init_cleanup (block, gfc_finish_block (&init), tmp);
  }
  
  
--- 4841,4866 ----
    /* Automatic arrays should not have initializers.  */
    gcc_assert (!sym->value);
  
!   inittree = gfc_finish_block (&init);
! 
!   if (space)
!     {
!       tree addr;
!       pushdecl (space);
  
!       /* Don't create new scope, emit the DECL_EXPR in exactly the scope
!          where also space is located.  */
!       gfc_init_block (&init);
!       tmp = fold_build1_loc (input_location, DECL_EXPR,
! 			     TREE_TYPE (space), space);
!       gfc_add_expr_to_block (&init, tmp);
!       addr = fold_build1_loc (sym->declared_at.lb->location,
! 			      ADDR_EXPR, TREE_TYPE (decl), space);
!       gfc_add_modify (&init, decl, addr);
!       gfc_add_init_cleanup (block, gfc_finish_block (&init), NULL_TREE);
!       tmp = NULL_TREE;
!     }
!   gfc_add_init_cleanup (block, inittree, tmp);
  }
  
  
Index: Make-lang.in
===================================================================
*** Make-lang.in	(revision 172206)
--- Make-lang.in	(working copy)
*************** fortran/trans-stmt.o: $(GFORTRAN_TRANS_D
*** 353,359 ****
  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
    fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
    gt-fortran-trans-intrinsic.h
  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
--- 353,359 ----
  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
    fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS) $(GIMPLE_H)
  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
    gt-fortran-trans-intrinsic.h
  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
Index: gfortran.h
===================================================================
*** gfortran.h	(revision 172206)
--- gfortran.h	(working copy)
*************** typedef struct
*** 2220,2225 ****
--- 2220,2226 ----
    int flag_d_lines;
    int gfc_flag_openmp;
    int flag_sign_zero;
+   int flag_stack_arrays;
    int flag_module_private;
    int flag_recursive;
    int flag_init_local_zero;
Index: lang.opt
===================================================================
*** lang.opt	(revision 172206)
--- lang.opt	(working copy)
*************** fmax-stack-var-size=
*** 454,459 ****
--- 454,463 ----
  Fortran RejectNegative Joined UInteger
  -fmax-stack-var-size=<n>	Size in bytes of the largest array that will be put on the stack
  
+ fstack-arrays
+ Fortran
+ Put all local arrays on stack.
+ 
  fmodule-private
  Fortran
  Set default accessibility of module entities to PRIVATE.
Index: invoke.texi
===================================================================
*** invoke.texi	(revision 172206)
--- invoke.texi	(working copy)
*************** and warnings}.
*** 167,172 ****
--- 167,173 ----
  -fbounds-check -fcheck-array-temporaries  -fmax-array-constructor =@var{n} @gol
  -fcheck=@var{<all|array-temps|bounds|do|mem|pointer|recursion>} @gol
  -fcoarray=@var{<none|single|lib>} -fmax-stack-var-size=@var{n} @gol
+ -fstack-arrays @gol
  -fpack-derived  -frepack-arrays  -fshort-enums  -fexternal-blas @gol
  -fblas-matmul-limit=@var{n} -frecursive -finit-local-zero @gol
  -finit-integer=@var{n} -finit-real=@var{<zero|inf|-inf|nan|snan>} @gol
*************** Future versions of GNU Fortran may impro
*** 1361,1366 ****
--- 1362,1374 ----
  
  The default value for @var{n} is 32768.
  
+ @item -fstack-arrays
+ @opindex @code{fstack-arrays}
+ Adding this option will make the fortran compiler put all local arrays,
+ even those of unknown size onto stack memory.  If your program uses very
+ large local arrays it's possible that you'll have to extend your runtime
+ limits for stack memory on some operating systems.
+ 
  @item -fpack-derived
  @opindex @code{fpack-derived}
  @cindex structure packing
Index: options.c
===================================================================
*** options.c	(revision 172206)
--- options.c	(working copy)
*************** gfc_init_options (unsigned int decoded_o
*** 123,128 ****
--- 123,129 ----
  
    /* Default value of flag_max_stack_var_size is set in gfc_post_options.  */
    gfc_option.flag_max_stack_var_size = -2;
+   gfc_option.flag_stack_arrays = 0;
  
    gfc_option.flag_range_check = 1;
    gfc_option.flag_pack_derived = 0;
*************** gfc_handle_option (size_t scode, const c
*** 783,788 ****
--- 784,793 ----
        gfc_option.flag_max_stack_var_size = value;
        break;
  
+     case OPT_fstack_arrays:
+       gfc_option.flag_stack_arrays = value;
+       break;
+ 
      case OPT_fmodule_private:
        gfc_option.flag_module_private = value;
        break;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 16:04   ` Michael Matz
@ 2011-04-11 16:45     ` Steven Bosscher
  2011-04-12 11:54       ` Michael Matz
  2011-04-11 16:46     ` Dominique Dhumieres
                       ` (5 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Steven Bosscher @ 2011-04-11 16:45 UTC (permalink / raw)
  To: Michael Matz
  Cc: Paul Richard Thomas, Dominique Dhumieres, fortran, gcc-patches

On Mon, Apr 11, 2011 at 6:04 PM, Michael Matz <matz@suse.de> wrote:
> On Sat, 9 Apr 2011, Paul Richard Thomas wrote:
>
>> I find that both nf.f90 and capacita.f90 segfault in runtime for any
>> stack size.
>
> Try this patch.  I've verified that capacita and nf work with it and
> -march=native -ffast-math -funroll-loops -fstack-arrays -O3 .  In fact all
> of polyhedron works for me on these flags.  (I've set a ulimit -s of
> 512MB, but I don't know if such a large amount is required).

FWIW, I don't think it is reasonable to require ulimit -s to run
polyhedron. Isn't there a way to put a maximum on the size of the
arrays on stack, e.g. -fstack-arrays-limit= or something like that?

BTW why does trans-array need gimple.h?

Ciao!
Steven

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 16:04   ` Michael Matz
  2011-04-11 16:45     ` Steven Bosscher
@ 2011-04-11 16:46     ` Dominique Dhumieres
  2011-04-11 16:59     ` H.J. Lu
                       ` (4 subsequent siblings)
  6 siblings, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-11 16:46 UTC (permalink / raw)
  To: paul.richard.thomas, matz; +Cc: gcc-patches, fortran, dominiq

I am trying the new patch. Updating failed with

../../work/gcc/fortran/trans-array.c: In function 'gfc_trans_auto_array_allocation':
../../work/gcc/fortran/trans-array.c:4881:24: error: 'tmp' may be used uninitialized in this function [-Werror=uninitialized]
cc1: all warnings being treated as errors

I am continuing after having changed

tree tmp;

to

tree tmp = NULL_TREE;

Dominique

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 16:04   ` Michael Matz
  2011-04-11 16:45     ` Steven Bosscher
  2011-04-11 16:46     ` Dominique Dhumieres
@ 2011-04-11 16:59     ` H.J. Lu
  2011-04-11 17:06       ` N.M. Maclaren
  2011-04-11 18:28     ` Tobias Burnus
                       ` (3 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: H.J. Lu @ 2011-04-11 16:59 UTC (permalink / raw)
  To: Michael Matz
  Cc: Paul Richard Thomas, Dominique Dhumieres, fortran, gcc-patches

On Mon, Apr 11, 2011 at 9:04 AM, Michael Matz <matz@suse.de> wrote:
> On Sat, 9 Apr 2011, Paul Richard Thomas wrote:
>
>> I find that both nf.f90 and capacita.f90 segfault in runtime for any
>> stack size.
>
> Try this patch.  I've verified that capacita and nf work with it and
> -march=native -ffast-math -funroll-loops -fstack-arrays -O3 .  In fact all
> of polyhedron works for me on these flags.  (I've set a ulimit -s of
> 512MB, but I don't know if such a large amount is required).
>
>

Is that possible to raise stack limit when -fstack-arrays is used?


-- 
H.J.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 16:59     ` H.J. Lu
@ 2011-04-11 17:06       ` N.M. Maclaren
  0 siblings, 0 replies; 40+ messages in thread
From: N.M. Maclaren @ 2011-04-11 17:06 UTC (permalink / raw)
  To: fortran, gcc-patches

On Apr 11 2011, H.J. Lu wrote:
>
>Is that possible to raise stack limit when -fstack-arrays is used?

In some systems, yes.  In others (including most Unices), not without
evil contortions that will assuredly break something else (like
debuggers and some MPIs).

Regards,
Nick Maclaren.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 16:04   ` Michael Matz
                       ` (2 preceding siblings ...)
  2011-04-11 16:59     ` H.J. Lu
@ 2011-04-11 18:28     ` Tobias Burnus
  2011-04-11 21:18     ` Dominique Dhumieres
                       ` (2 subsequent siblings)
  6 siblings, 0 replies; 40+ messages in thread
From: Tobias Burnus @ 2011-04-11 18:28 UTC (permalink / raw)
  To: Michael Matz; +Cc: fortran, gcc-patches

On 04/11/2011 06:04 PM, Michael Matz wrote:
> ===================================================================
> *** trans-array.c	(revision 172206)
> --- trans-array.c	(working copy)
> *************** gfc_trans_auto_array_allocation (tree de
[...]
> !   gfc_add_init_cleanup (block, inittree, tmp);
>    }

I get the fatal warning:

.../gcc/fortran/trans-array.c: In function 
Â‘gfc_trans_auto_array_allocationÂ’:
..../gcc/fortran/trans-array.c:4881:24: error: Â‘tmpÂ’ may be used 
uninitialized in this function [-Werror=uninitialized]

which is the line quoted above.

  * * *

Regarding the goal of the patch: I like the idea and I think it is 
rather useful. It is unfortunate that running out of stack space causes 
segfaults, which is not very user friendly, but I think HPC users - and 
in particular Intel compiler users - are used to have "ulimit -s 
unlimited" (or some large value), which also HPC admins set. For 
instance, on the login node of our HPC cluster (2*quadcore/node; 24 GB 
RAM/node, 2200 nodes) one has 3 GB stack space by default (soft and hard 
limit) - I think the value on the compute nodes is the same.

Tobias

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 16:04   ` Michael Matz
                       ` (3 preceding siblings ...)
  2011-04-11 18:28     ` Tobias Burnus
@ 2011-04-11 21:18     ` Dominique Dhumieres
  2011-04-12  6:23     ` Paul Richard Thomas
  2011-04-14 20:23     ` Tobias Burnus
  6 siblings, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-11 21:18 UTC (permalink / raw)
  To: paul.richard.thomas, matz; +Cc: gcc-patches, fortran, dominiq

With the new patch (+the fix for the uninitialized tmp), the polyhedron
tests pass:

================================================================================
Date & Time     : 11 Apr 2011 21:20:59
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -fstack-arrays -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      8.10       54576      8.16       2  0.0000
      aermod    172.91     1468672     18.58       2  0.1884
         air     25.79       89992      6.81       5  0.1852
    capacita     15.01      101384     40.50       2  0.0012
     channel      3.43       34448      2.98       5  0.2042
       doduc     27.18      212280     27.25       2  0.0220
     fatigue      9.87       85264      7.07       2  0.0636
     gas_dyn     23.07      135944      4.72       2  0.0636
      induct     23.90      185600     14.24       2  0.1194
       linpk      2.54       21536     21.74       2  0.0207
        mdbx      9.11       84776     12.51       2  0.0080
          nf     15.29       87848     19.06       2  0.1862
     protein     28.07      155624     35.40       2  0.0268
      rnflow     38.21      183696     26.76       2  0.0374
    test_fpu     19.94      141720     10.87       2  0.0598
        tfft      1.70       22072      3.29       2  0.0759

Geometric Mean Execution Time =      12.33 seconds

================================================================================
Polyhedron Benchmark Validator
Copyright (C) Polyhedron Software Ltd - 2004 - All rights reserved

For comparison the results without -fstack-arrays are

================================================================================
Date & Time     : 11 Apr 2011 21:59:38
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      8.11       54576      8.17       2  0.0061
      aermod    173.69     1472520     18.81       2  0.1063
         air     25.43       89992      6.79       2  0.0516
    capacita     13.47      101344     40.33       2  0.0012
     channel      3.11       34448      2.97       5  0.2000
       doduc     27.33      212280     27.25       2  0.0239
     fatigue     10.08       81128      4.71       2  0.0212
     gas_dyn     24.89      144112      4.68       5  0.2030
      induct     23.86      185600     14.26       3  0.1813
       linpk      2.53       21536     21.75       2  0.0460
        mdbx      9.44       84776     12.50       2  0.0920
          nf     30.54      120568     28.22       5  0.2760
     protein     28.05      155624     35.55       2  0.0928
      rnflow     37.60      183696     26.83       2  0.0075
    test_fpu     20.20      145816     11.07       2  0.0135
        tfft      1.67       22072      3.30       4  0.1569

Geometric Mean Execution Time =      12.33 seconds

================================================================================
Polyhedron Benchmark Validator
Copyright (C) Polyhedron Software Ltd - 2004 - All rights reserved

Valgrind does not report any error for capacita, however it requires
--max-stackframe=8118496 for nf (--max-stackframe=6118496 yields
many errors).

Thanks for the work.

Dominique

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 16:04   ` Michael Matz
                       ` (4 preceding siblings ...)
  2011-04-11 21:18     ` Dominique Dhumieres
@ 2011-04-12  6:23     ` Paul Richard Thomas
  2011-04-12  6:35       ` Dominique Dhumieres
  2011-04-14 20:23     ` Tobias Burnus
  6 siblings, 1 reply; 40+ messages in thread
From: Paul Richard Thomas @ 2011-04-12  6:23 UTC (permalink / raw)
  To: Michael Matz; +Cc: Dominique Dhumieres, fortran, gcc-patches

Dear Michael,

Thanks for updating the patch.  I am afraid that my attention to
gfortran is somewhat limited at present.  However, I see that
Dominique has verified your patch and that all is well.

The resulting speed up for nf.f90 is rather remarkable.  What specific
feature of the fortran leads to a 30=>15s ?

Cheers

Paul

On Mon, Apr 11, 2011 at 6:04 PM, Michael Matz <matz@suse.de> wrote:
> On Sat, 9 Apr 2011, Paul Richard Thomas wrote:
>
>> I find that both nf.f90 and capacita.f90 segfault in runtime for any
>> stack size.
>
> Try this patch.  I've verified that capacita and nf work with it and
> -march=native -ffast-math -funroll-loops -fstack-arrays -O3 .  In fact all
> of polyhedron works for me on these flags.  (I've set a ulimit -s of
> 512MB, but I don't know if such a large amount is required).
>
>
> Ciao,
> Michael.
>
>        * trans-array.c (toplevel): Include gimple.h.
>        (gfc_trans_allocate_array_storage): Check flag_stack_arrays,
>        properly expand variable length arrays.
>        (gfc_trans_auto_array_allocation): If flag_stack_arrays create
>        variable length decls and associate them with their scope.
>        * gfortran.h (gfc_option_t): Add flag_stack_arrays member.
>        * options.c (gfc_init_options): Handle -fstack_arrays option.
>        * lang.opt (fstack-arrays): Add option.
>        * invoke.texi (Code Gen Options): Document it.
>        * Make-lang.in (trans-array.o): Depend on GIMPLE_H.
>
> Index: trans-array.c
> ===================================================================
> *** trans-array.c       (revision 172206)
> --- trans-array.c       (working copy)
> *************** along with GCC; see the file COPYING3.
> *** 81,86 ****
> --- 81,87 ----
>  #include "system.h"
>  #include "coretypes.h"
>  #include "tree.h"
> + #include "gimple.h"
>  #include "diagnostic-core.h"  /* For internal_error/fatal_error.  */
>  #include "flags.h"
>  #include "gfortran.h"
> *************** gfc_trans_allocate_array_storage (stmtbl
> *** 630,647 ****
>      {
>        /* Allocate the temporary.  */
>        onstack = !dynamic && initial == NULL_TREE
> !                        && gfc_can_put_var_on_stack (size);
>
>        if (onstack)
>        {
>          /* Make a temporary variable to hold the data.  */
>          tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
>                                 nelem, gfc_index_one_node);
>          tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
>                                  tmp);
>          tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
>                                  tmp);
>          tmp = gfc_create_var (tmp, "A");
>          tmp = gfc_build_addr_expr (NULL_TREE, tmp);
>          gfc_conv_descriptor_data_set (pre, desc, tmp);
>        }
> --- 631,654 ----
>      {
>        /* Allocate the temporary.  */
>        onstack = !dynamic && initial == NULL_TREE
> !                        && (gfc_option.flag_stack_arrays
> !                            || gfc_can_put_var_on_stack (size));
>
>        if (onstack)
>        {
>          /* Make a temporary variable to hold the data.  */
>          tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
>                                 nelem, gfc_index_one_node);
> +         tmp = gfc_evaluate_now (tmp, pre);
>          tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
>                                  tmp);
>          tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
>                                  tmp);
>          tmp = gfc_create_var (tmp, "A");
> +         gfc_add_expr_to_block (pre,
> +                                fold_build1_loc (input_location,
> +                                                 DECL_EXPR, TREE_TYPE (tmp),
> +                                                 tmp));
>          tmp = gfc_build_addr_expr (NULL_TREE, tmp);
>          gfc_conv_descriptor_data_set (pre, desc, tmp);
>        }
> *************** gfc_trans_auto_array_allocation (tree de
> *** 4744,4749 ****
> --- 4751,4758 ----
>    tree tmp;
>    tree size;
>    tree offset;
> +   tree space;
> +   tree inittree;
>    bool onstack;
>
>    gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
> *************** gfc_trans_auto_array_allocation (tree de
> *** 4800,4814 ****
>        return;
>      }
>
> !   /* The size is the number of elements in the array, so multiply by the
> !      size of an element to get the total size.  */
> !   tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
> !   size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
> !                         size, fold_convert (gfc_array_index_type, tmp));
>
> !   /* Allocate memory to hold the data.  */
> !   tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
> !   gfc_add_modify (&init, decl, tmp);
>
>    /* Set offset of the array.  */
>    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
> --- 4809,4838 ----
>        return;
>      }
>
> !   if (gfc_option.flag_stack_arrays)
> !     {
> !       gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
> !       space = build_decl (sym->declared_at.lb->location,
> !                         VAR_DECL, create_tmp_var_name ("A"),
> !                         TREE_TYPE (TREE_TYPE (decl)));
> !       gfc_trans_vla_type_sizes (sym, &init);
> !     }
> !   else
> !     {
> !       /* The size is the number of elements in the array, so multiply by the
> !        size of an element to get the total size.  */
> !       tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
> !       size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
> !                             size, fold_convert (gfc_array_index_type, tmp));
>
> !       /* Allocate memory to hold the data.  */
> !       tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
> !       gfc_add_modify (&init, decl, tmp);
> !
> !       /* Free the temporary.  */
> !       tmp = gfc_call_free (convert (pvoid_type_node, decl));
> !       space = NULL_TREE;
> !     }
>
>    /* Set offset of the array.  */
>    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
> *************** gfc_trans_auto_array_allocation (tree de
> *** 4817,4826 ****
>    /* Automatic arrays should not have initializers.  */
>    gcc_assert (!sym->value);
>
> !   /* Free the temporary.  */
> !   tmp = gfc_call_free (convert (pvoid_type_node, decl));
>
> !   gfc_add_init_cleanup (block, gfc_finish_block (&init), tmp);
>  }
>
>
> --- 4841,4866 ----
>    /* Automatic arrays should not have initializers.  */
>    gcc_assert (!sym->value);
>
> !   inittree = gfc_finish_block (&init);
> !
> !   if (space)
> !     {
> !       tree addr;
> !       pushdecl (space);
>
> !       /* Don't create new scope, emit the DECL_EXPR in exactly the scope
> !          where also space is located.  */
> !       gfc_init_block (&init);
> !       tmp = fold_build1_loc (input_location, DECL_EXPR,
> !                            TREE_TYPE (space), space);
> !       gfc_add_expr_to_block (&init, tmp);
> !       addr = fold_build1_loc (sym->declared_at.lb->location,
> !                             ADDR_EXPR, TREE_TYPE (decl), space);
> !       gfc_add_modify (&init, decl, addr);
> !       gfc_add_init_cleanup (block, gfc_finish_block (&init), NULL_TREE);
> !       tmp = NULL_TREE;
> !     }
> !   gfc_add_init_cleanup (block, inittree, tmp);
>  }
>
>
> Index: Make-lang.in
> ===================================================================
> *** Make-lang.in        (revision 172206)
> --- Make-lang.in        (working copy)
> *************** fortran/trans-stmt.o: $(GFORTRAN_TRANS_D
> *** 353,359 ****
>  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
>  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
>    fortran/ioparm.def
> ! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS)
>  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
>    gt-fortran-trans-intrinsic.h
>  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
> --- 353,359 ----
>  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
>  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
>    fortran/ioparm.def
> ! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS) $(GIMPLE_H)
>  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
>    gt-fortran-trans-intrinsic.h
>  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
> Index: gfortran.h
> ===================================================================
> *** gfortran.h  (revision 172206)
> --- gfortran.h  (working copy)
> *************** typedef struct
> *** 2220,2225 ****
> --- 2220,2226 ----
>    int flag_d_lines;
>    int gfc_flag_openmp;
>    int flag_sign_zero;
> +   int flag_stack_arrays;
>    int flag_module_private;
>    int flag_recursive;
>    int flag_init_local_zero;
> Index: lang.opt
> ===================================================================
> *** lang.opt    (revision 172206)
> --- lang.opt    (working copy)
> *************** fmax-stack-var-size=
> *** 454,459 ****
> --- 454,463 ----
>  Fortran RejectNegative Joined UInteger
>  -fmax-stack-var-size=<n>      Size in bytes of the largest array that will be put on the stack
>
> + fstack-arrays
> + Fortran
> + Put all local arrays on stack.
> +
>  fmodule-private
>  Fortran
>  Set default accessibility of module entities to PRIVATE.
> Index: invoke.texi
> ===================================================================
> *** invoke.texi (revision 172206)
> --- invoke.texi (working copy)
> *************** and warnings}.
> *** 167,172 ****
> --- 167,173 ----
>  -fbounds-check -fcheck-array-temporaries  -fmax-array-constructor =@var{n} @gol
>  -fcheck=@var{<all|array-temps|bounds|do|mem|pointer|recursion>} @gol
>  -fcoarray=@var{<none|single|lib>} -fmax-stack-var-size=@var{n} @gol
> + -fstack-arrays @gol
>  -fpack-derived  -frepack-arrays  -fshort-enums  -fexternal-blas @gol
>  -fblas-matmul-limit=@var{n} -frecursive -finit-local-zero @gol
>  -finit-integer=@var{n} -finit-real=@var{<zero|inf|-inf|nan|snan>} @gol
> *************** Future versions of GNU Fortran may impro
> *** 1361,1366 ****
> --- 1362,1374 ----
>
>  The default value for @var{n} is 32768.
>
> + @item -fstack-arrays
> + @opindex @code{fstack-arrays}
> + Adding this option will make the fortran compiler put all local arrays,
> + even those of unknown size onto stack memory.  If your program uses very
> + large local arrays it's possible that you'll have to extend your runtime
> + limits for stack memory on some operating systems.
> +
>  @item -fpack-derived
>  @opindex @code{fpack-derived}
>  @cindex structure packing
> Index: options.c
> ===================================================================
> *** options.c   (revision 172206)
> --- options.c   (working copy)
> *************** gfc_init_options (unsigned int decoded_o
> *** 123,128 ****
> --- 123,129 ----
>
>    /* Default value of flag_max_stack_var_size is set in gfc_post_options.  */
>    gfc_option.flag_max_stack_var_size = -2;
> +   gfc_option.flag_stack_arrays = 0;
>
>    gfc_option.flag_range_check = 1;
>    gfc_option.flag_pack_derived = 0;
> *************** gfc_handle_option (size_t scode, const c
> *** 783,788 ****
> --- 784,793 ----
>        gfc_option.flag_max_stack_var_size = value;
>        break;
>
> +     case OPT_fstack_arrays:
> +       gfc_option.flag_stack_arrays = value;
> +       break;
> +
>      case OPT_fmodule_private:
>        gfc_option.flag_module_private = value;
>        break;
>



-- 
The knack of flying is learning how to throw yourself at the ground and miss.
       --Hitchhikers Guide to the Galaxy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-12  6:23     ` Paul Richard Thomas
@ 2011-04-12  6:35       ` Dominique Dhumieres
  2011-04-13  9:42         ` Paul Richard Thomas
  2011-04-14 13:59         ` Michael Matz
  0 siblings, 2 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-12  6:35 UTC (permalink / raw)
  To: paul.richard.thomas, matz; +Cc: gcc-patches, fortran, dominiq

> The resulting speed up for nf.f90 is rather remarkable.  What specific
> feature of the fortran leads to a 30=>15s ?

I think it is the automatic array in the subroutine trisolve. Note that the 
speedup is rather 27->19s and may be darwin specific (slow malloc).

Note also that -fstack-arrays prevents some optimizations on
fatigue: 4.7->7s. This may be related to pr45810.

Dominique

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 16:45     ` Steven Bosscher
@ 2011-04-12 11:54       ` Michael Matz
  2011-04-12 12:42         ` N.M. Maclaren
  0 siblings, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-12 11:54 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Paul Richard Thomas, Dominique Dhumieres, fortran, gcc-patches

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1456 bytes --]

Hello,

On Mon, 11 Apr 2011, Steven Bosscher wrote:

> > Try this patch. Â I've verified that capacita and nf work with it and
> > -march=native -ffast-math -funroll-loops -fstack-arrays -O3 . Â In fact all
> > of polyhedron works for me on these flags. Â (I've set a ulimit -s of
> > 512MB, but I don't know if such a large amount is required).
> 
> FWIW, I don't think it is reasonable to require ulimit -s to run
> polyhedron.

Actually only nf requires about 16MB of stack size, all other benchmarks 
run fine with 8MB (which is the default for my systems).  Note that ICC 
compiled binaries have the same behaviour and also require the user to set 
ulimits.

> Isn't there a way to put a maximum on the size of the arrays on stack, 
> e.g. -fstack-arrays-limit= or something like that?

Not without generating contorded code.  The problem is that these arrays 
are variable length, hence runtime tests would be required, branching to 
malloc or stack_save/alloca, and also for the deallocation side, requiring 
a flag for branching to free or stack_restore.  And as we use the variable 
length facilities this is actually not controllable in such a way, the 
frontend doesn't generate alloca calls at all.

Nope, it's either all or nothing, like with ICC.

> BTW why does trans-array need gimple.h?

create_tmp_var_name.  gfc_create_var_np doesn't take a location and at 
that point input_location is already at the end of file.

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-12 11:54       ` Michael Matz
@ 2011-04-12 12:42         ` N.M. Maclaren
  0 siblings, 0 replies; 40+ messages in thread
From: N.M. Maclaren @ 2011-04-12 12:42 UTC (permalink / raw)
  To: fortran, gcc-patches

On Apr 12 2011, Michael Matz wrote:
>On Mon, 11 Apr 2011, Steven Bosscher wrote:
>
>> Isn't there a way to put a maximum on the size of the arrays on stack, 
>> e.g. -fstack-arrays-limit= or something like that?
>
>Not without generating contorded code.  The problem is that these arrays 
>are variable length, hence runtime tests would be required, branching to 
>malloc or stack_save/alloca, and also for the deallocation side, requiring 
>a flag for branching to free or stack_restore.  And as we use the variable 
>length facilities this is actually not controllable in such a way, the 
>frontend doesn't generate alloca calls at all.

It would almost certainly be cleaner and simpler to use a secondary
stack than to have such dynamic testing.  It is very similar (conceptually)
to the dynamic code that switches between inline and library code for
sizes of DO-loop or array on some specialist HPC compilers.  And that gets
very messy, very fast.

Regards,
Nick Maclaren.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-12  6:35       ` Dominique Dhumieres
@ 2011-04-13  9:42         ` Paul Richard Thomas
  2011-04-13 10:38           ` Richard Guenther
  2011-04-14 13:59         ` Michael Matz
  1 sibling, 1 reply; 40+ messages in thread
From: Paul Richard Thomas @ 2011-04-13  9:42 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: matz, gcc-patches, fortran

Dear Dominique,

> I think it is the automatic array in the subroutine trisolve. Note that the
> speedup is rather 27->19s and may be darwin specific (slow malloc).

I saw a speed-up of similar order on FC9/x86_64.

I strongly doubt that it is anything to do with the automatic array -
if it is there is an error somewhere, since none of the references to
trisolve need copy-in/copy-out.

>
> Note also that -fstack-arrays prevents some optimizations on
> fatigue: 4.7->7s. This may be related to pr45810.

Has PR45810 converged now?  If I have understood properly, a patch has
been devised that cures the problem and does not cause slow-downs
anywhere else?

Cheers

Paul

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-13  9:42         ` Paul Richard Thomas
@ 2011-04-13 10:38           ` Richard Guenther
  2011-04-13 11:13             ` Richard Guenther
  2011-04-13 13:46             ` Paul Richard Thomas
  0 siblings, 2 replies; 40+ messages in thread
From: Richard Guenther @ 2011-04-13 10:38 UTC (permalink / raw)
  To: Paul Richard Thomas; +Cc: Dominique Dhumieres, matz, gcc-patches, fortran

On Wed, Apr 13, 2011 at 11:42 AM, Paul Richard Thomas
<paul.richard.thomas@gmail.com> wrote:
> Dear Dominique,
>
>> I think it is the automatic array in the subroutine trisolve. Note that the
>> speedup is rather 27->19s and may be darwin specific (slow malloc).
>
> I saw a speed-up of similar order on FC9/x86_64.
>
> I strongly doubt that it is anything to do with the automatic array -
> if it is there is an error somewhere, since none of the references to
> trisolve need copy-in/copy-out.
>
>>
>> Note also that -fstack-arrays prevents some optimizations on
>> fatigue: 4.7->7s. This may be related to pr45810.
>
> Has PR45810 converged now?  If I have understood properly, a patch has
> been devised that cures the problem and does not cause slow-downs
> anywhere else?

VLAs and malloc based arrays may behave differently with respect to alias
analysis (I'd have to look at some examples).  All effects other than malloc
overhead I would attribute to that.  That said, the general idea of the patch
is sound and I see nothing wrong with it.  Both performance improvements
and regressions are worth looking at - they hint at possible improvements
in the middle-end parts of the compiler.

Richard.

> Cheers
>
> Paul
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-13 10:38           ` Richard Guenther
@ 2011-04-13 11:13             ` Richard Guenther
  2011-04-14 13:32               ` Dominique Dhumieres
  2011-04-13 13:46             ` Paul Richard Thomas
  1 sibling, 1 reply; 40+ messages in thread
From: Richard Guenther @ 2011-04-13 11:13 UTC (permalink / raw)
  To: Paul Richard Thomas; +Cc: Dominique Dhumieres, matz, gcc-patches, fortran

On Wed, Apr 13, 2011 at 12:38 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Wed, Apr 13, 2011 at 11:42 AM, Paul Richard Thomas
> <paul.richard.thomas@gmail.com> wrote:
>> Dear Dominique,
>>
>>> I think it is the automatic array in the subroutine trisolve. Note that the
>>> speedup is rather 27->19s and may be darwin specific (slow malloc).
>>
>> I saw a speed-up of similar order on FC9/x86_64.
>>
>> I strongly doubt that it is anything to do with the automatic array -
>> if it is there is an error somewhere, since none of the references to
>> trisolve need copy-in/copy-out.
>>
>>>
>>> Note also that -fstack-arrays prevents some optimizations on
>>> fatigue: 4.7->7s. This may be related to pr45810.
>>
>> Has PR45810 converged now?  If I have understood properly, a patch has
>> been devised that cures the problem and does not cause slow-downs
>> anywhere else?
>
> VLAs and malloc based arrays may behave differently with respect to alias
> analysis (I'd have to look at some examples).  All effects other than malloc
> overhead I would attribute to that.  That said, the general idea of the patch
> is sound and I see nothing wrong with it.  Both performance improvements
> and regressions are worth looking at - they hint at possible improvements
> in the middle-end parts of the compiler.

I have opened PR48590 for at least one issue that I see.  It has a very
simple patch which you might want to try ...

Richard.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-13 10:38           ` Richard Guenther
  2011-04-13 11:13             ` Richard Guenther
@ 2011-04-13 13:46             ` Paul Richard Thomas
  1 sibling, 0 replies; 40+ messages in thread
From: Paul Richard Thomas @ 2011-04-13 13:46 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Dominique Dhumieres, matz, gcc-patches, fortran

Dear Richard and Dominique,

> VLAs and malloc based arrays may behave differently with respect to alias
> analysis (I'd have to look at some examples).  All effects other than malloc

- hmmm, yes.  I was forgetting what might happen with inlining.  It's
not evident, is it?

> overhead I would attribute to that.  That said, the general idea of the patch
> is sound and I see nothing wrong with it.  Both performance improvements
> and regressions are worth looking at - they hint at possible improvements
> in the middle-end parts of the compiler.

I absolutely agree.  I realise my input to this thread is possibly a
bit negative but this is far from my reaction to the general idea and
what is evidently within our reach.

Cheers

Paul

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-13 11:13             ` Richard Guenther
@ 2011-04-14 13:32               ` Dominique Dhumieres
  0 siblings, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-14 13:32 UTC (permalink / raw)
  To: richard.guenther, paul.richard.thomas; +Cc: matz, gcc-patches, fortran, dominiq

> I have opened PR48590 for at least one issue that I see. ...

The fix for pr48590 (r172427) does not speedup fatigue.f90 compiled
with -fstack-arrays.

Dominique

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-12  6:35       ` Dominique Dhumieres
  2011-04-13  9:42         ` Paul Richard Thomas
@ 2011-04-14 13:59         ` Michael Matz
  2011-04-14 15:28           ` Michael Matz
  1 sibling, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-14 13:59 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran

Hi,

On Tue, 12 Apr 2011, Dominique Dhumieres wrote:

> > The resulting speed up for nf.f90 is rather remarkable.  What specific
> > feature of the fortran leads to a 30=>15s ?
> 
> I think it is the automatic array in the subroutine trisolve. Note that the 
> speedup is rather 27->19s and may be darwin specific (slow malloc).
> 
> Note also that -fstack-arrays prevents some optimizations on
> fatigue: 4.7->7s. This may be related to pr45810.

That's the effect of -finline-limit=600 that you use it seems.  For me 
(opteron 2356) fatigue behaves like this: (base options: -march=native 
-ffast-math -funroll-loops -O3)
                                 no stack-arrays    with stack-arrays
no addtional options:                      10.2s          8.8s
+ -fwhole-program:                          7.1s          8.8s
+ -fwhole-program -flto:                   10.1s          8.9s
+ -fwhole-program -flto -finline-limit=600  4.8s          8.7s

The perdida subroutine isn't inlined with -fstack-arrays, although it's 
called only once.  The dump report doesn't reveal any reasons, perdida 
doesn't seem to be in the candidate list for the called-once functions.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-14 13:59         ` Michael Matz
@ 2011-04-14 15:28           ` Michael Matz
  2011-04-14 19:29             ` Dominique Dhumieres
  2011-04-15  2:01             ` Dominique Dhumieres
  0 siblings, 2 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-14 15:28 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran

Hi,

On Thu, 14 Apr 2011, Michael Matz wrote:

>                                  no stack-arrays    with stack-arrays
> no addtional options:                      10.2s          8.8s
> + -fwhole-program:                          7.1s          8.8s
> + -fwhole-program -flto:                   10.1s          8.9s
> + -fwhole-program -flto -finline-limit=600  4.8s          8.7s
> 
> The perdida subroutine isn't inlined with -fstack-arrays, although it's 
> called only once.  The dump report doesn't reveal any reasons, perdida 
> doesn't seem to be in the candidate list for the called-once functions.

See http://gcc.gnu.org/ml/gcc-patches/2011-04/msg01087.html .  With that 
patch, -fwhole-program -finline-limit=600 and stack arrays I get down to 
3.6s.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-14 15:28           ` Michael Matz
@ 2011-04-14 19:29             ` Dominique Dhumieres
  2011-04-15  2:01             ` Dominique Dhumieres
  1 sibling, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-14 19:29 UTC (permalink / raw)
  To: matz, dominiq; +Cc: paul.richard.thomas, gcc-patches, fortran

> See http://gcc.gnu.org/ml/gcc-patches/2011-04/msg01087.html . ...

With this patch on top of revision 172429 (and the second patch for
-fstack-arrays) I now get

================================================================================
Date & Time     : 14 Apr 2011 20:48:24
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -fstack-arrays -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      8.12       54576      8.12       2  0.0062
      aermod    178.02     1476528     18.62       2  0.1531
         air     25.70       89992      6.82       5  0.5056
    capacita     13.54      105440     40.10       2  0.0287
     channel      3.12       34448      2.94       5  0.0849
       doduc     27.50      212280     27.18       2  0.0754
     fatigue     10.14       77032      2.76       4  0.1889
     gas_dyn     24.89      144112      4.67       2  0.1927
      induct     24.24      189696     14.22       2  0.0738
       linpk      2.56       21536     21.69       2  0.0507
        mdbx      9.17       84776     12.55       2  0.0120
          nf     34.27      124640     18.41       4  0.1577
     protein     28.37      155624     35.37       2  0.0170
      rnflow     38.31      183696     26.73       2  0.0224
    test_fpu     20.22      141720     10.85       2  0.0921
        tfft      1.71       22072      3.29       2  0.0152

Geometric Mean Execution Time =      11.57 seconds

================================================================================
Polyhedron Benchmark Validator
Copyright (C) Polyhedron Software Ltd - 2004 - All rights reserved

Pretty spectacular!-)

Thanks for the work.

Dominique

PS. for the final patch for -fstack-array, don't forget to initialize tmp.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-11 16:04   ` Michael Matz
                       ` (5 preceding siblings ...)
  2011-04-12  6:23     ` Paul Richard Thomas
@ 2011-04-14 20:23     ` Tobias Burnus
  2011-04-14 20:30       ` Tobias Burnus
  6 siblings, 1 reply; 40+ messages in thread
From: Tobias Burnus @ 2011-04-14 20:23 UTC (permalink / raw)
  To: Michael Matz; +Cc: fortran, gcc-patches

Michael Matz wrote:
> Try this patch.  I've verified that capacita and nf work with it and
> -march=native -ffast-math -funroll-loops -fstack-arrays -O3 .  In fact all
> of polyhedron works for me on these flags.  (I've set a ulimit -s of
> 512MB, but I don't know if such a large amount is required).

The patch is OK.

* As mentioned a couple of times, you need a "tmp = NULL_TREE" in 
gfc_trans_auto_array_allocation - either when "tmp" is declared or in 
the "if (...flag_stack_arrays)" block.

* Can you also update http://gcc.gnu.org/wiki/GFortran#GCC4.7 and/or 
http://gcc.gnu.org/gcc-4.7/changes.html#fortran ?

Thanks a lot for your patch(es). As Dominique has remarked, the results 
are "Pretty spectacular". (Twice as fast for fatigue and the geometric 
mean of the Polyhedron test improves by 6%.)

  * * *

Side remark: For 4.7 I now hope that inlining gets better tuned for 
Fortran as the is room for improvement; quoting your numbers for fatigue:

                                  no stack-arrays    with stack-arrays
+ -fwhole-program -flto:                   10.1s          8.9s
+ -fwhole-program -flto -finline-limit=600  4.8s          3.6s


Thus, the program is twice as fast with the special inline setting. But 
as users will typically never set more than "-O3 -march=native 
-ffast-math" (and maybe: -funroll-loops), the inline opportunity is lost 
in real life.

Tobias

> 	* trans-array.c (toplevel): Include gimple.h.
> 	(gfc_trans_allocate_array_storage): Check flag_stack_arrays,
> 	properly expand variable length arrays.
> 	(gfc_trans_auto_array_allocation): If flag_stack_arrays create
> 	variable length decls and associate them with their scope.
> 	* gfortran.h (gfc_option_t): Add flag_stack_arrays member.
> 	* options.c (gfc_init_options): Handle -fstack_arrays option.
> 	* lang.opt (fstack-arrays): Add option.
> 	* invoke.texi (Code Gen Options): Document it.
> 	* Make-lang.in (trans-array.o): Depend on GIMPLE_H.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-14 20:23     ` Tobias Burnus
@ 2011-04-14 20:30       ` Tobias Burnus
  0 siblings, 0 replies; 40+ messages in thread
From: Tobias Burnus @ 2011-04-14 20:30 UTC (permalink / raw)
  To: Michael Matz; +Cc: fortran, gcc-patches

Tobias Burnus wrote:
>                                  no stack-arrays    with stack-arrays
> + -fwhole-program -flto:                   10.1s          8.9s
> + -fwhole-program -flto -finline-limit=600  4.8s          3.6s

I wonder whether the following is special to my system* or generally 
true. I use:
gfortran -O3 -march=native -ffast-math -funroll-loops -fwhole-program 
-finline-limit=600 fatigue.f90

        no stack-arrays  with stack-arrays
        0m6.622s         0m8.174s
-flto  0m8.444s         0m8.174s

Thus, the non "-flto" version is faster (in particular without stack 
arrays). I assume that it has to do with the declaration issues of the 
front end. However, the last time I tried to find the problem, I failed 
to spot anything which looked wrong - especially, the UIDs seemed to be 
OK. (Besides, I would like to have the -fno-lto performance also with 
.-flto ... ;-)

Tobias

* AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (2.4 GHz), x86-64 Linux

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-14 15:28           ` Michael Matz
  2011-04-14 19:29             ` Dominique Dhumieres
@ 2011-04-15  2:01             ` Dominique Dhumieres
  2011-04-15 12:38               ` Michael Matz
  1 sibling, 1 reply; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-15  2:01 UTC (permalink / raw)
  To: matz, dominiq; +Cc: paul.richard.thomas, gcc-patches, fortran

I have forgotten to mentionned that I have a variant of fatigue
in which I have done the inlining manually along with few other
optimizations and the timing for it is

[macbook] lin/test% gfc -Ofast fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.793u 0.002s 0:02.79 100.0%	0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.680u 0.003s 0:02.68 100.0%	0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -flto fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.671u 0.002s 0:02.67 100.0%	0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -fstack-arrays fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.680u 0.003s 0:02.68 100.0%	0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -flto -fstack-arrays fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.677u 0.003s 0:02.68 99.6%	0+0k 0+0io 0pf+0w

So the timing of the original code with 
-Ofast -finline-limit=600 -fwhole-program -flto -fstack-arrays
is quite close to this lower bound.

I have also looked at the failure for gfortran.dg/elemental_dependency_1.f90
and it seems due to a spurious integer(kind=4) A.37[4]; (and friends) in

    integer(kind=8) D.1674;
    integer(kind=4) A.37[4];
    struct array1_integer(kind=4) atmp.36;
    void * D.1669;
    integer(kind=8) D.1668;
    struct array1_integer(kind=4) parm.35;

    parm.35.dtype = 265;
    parm.35.dim[0].lbound = 1;
    parm.35.dim[0].ubound = 4;
    parm.35.dim[0].stride = 1;
    parm.35.data = (void *) &a[1];
    parm.35.offset = -2;
    atmp.36.dtype = 265;
    atmp.36.dim[0].stride = 1;
    atmp.36.dim[0].lbound = 0;
    atmp.36.dim[0].ubound = 3;
        integer(kind=4) A.37[4];
    atmp.36.data = (void * restrict) &A.37;

compared to

    integer(kind=8) D.1658;
    integer(kind=4) A.37[4];
    struct array1_integer(kind=4) atmp.36;
    void * D.1653;
    integer(kind=8) D.1652;
    struct array1_integer(kind=4) parm.35;

    parm.35.dtype = 265;
    parm.35.dim[0].lbound = 1;
    parm.35.dim[0].ubound = 4;
    parm.35.dim[0].stride = 1;
    parm.35.data = (void *) &a[1];
    parm.35.offset = -2;
    atmp.36.dtype = 265;
    atmp.36.dim[0].stride = 1;
    atmp.36.dim[0].lbound = 0;
    atmp.36.dim[0].ubound = 3;
    atmp.36.data = (void * restrict) &A.37;

Note that this is without the -fstack-arrays option.
Same thing for gfortran.dg/vector_subscript_4.f90.

Dominique

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-15  2:01             ` Dominique Dhumieres
@ 2011-04-15 12:38               ` Michael Matz
  2011-04-15 14:06                 ` Dominique Dhumieres
  0 siblings, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-15 12:38 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran

Hi,

On Thu, 14 Apr 2011, Dominique Dhumieres wrote:

> I have forgotten to mentionned that I have a variant of fatigue
> in which I have done the inlining manually along with few other
> optimizations and the timing for it is
> 
> [macbook] lin/test% gfc -Ofast fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.793u 0.002s 0:02.79 100.0%	0+0k 0+1io 0pf+0w
> [macbook] lin/test% gfc -Ofast -fwhole-program fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.680u 0.003s 0:02.68 100.0%	0+0k 0+2io 0pf+0w
> [macbook] lin/test% gfc -Ofast -fwhole-program -flto fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.671u 0.002s 0:02.67 100.0%	0+0k 0+2io 0pf+0w
> [macbook] lin/test% gfc -Ofast -fwhole-program -fstack-arrays fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.680u 0.003s 0:02.68 100.0%	0+0k 0+2io 0pf+0w
> [macbook] lin/test% gfc -Ofast -fwhole-program -flto -fstack-arrays fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.677u 0.003s 0:02.68 99.6%	0+0k 0+0io 0pf+0w
> 
> So the timing of the original code with 
> -Ofast -finline-limit=600 -fwhole-program -flto -fstack-arrays
> is quite close to this lower bound.
> 
> I have also looked at the failure for gfortran.dg/elemental_dependency_1.f90
> and it seems due to a spurious integer(kind=4) A.37[4]; (and friends) in

Yes, this is due to the DECL_EXPR statement which is rendered by the 
dumper just the same as a normal decl.  The testcase looks for exactly one 
such decl, but with -fstack-arrays there are exactly two for each such 
array.

>     integer(kind=4) A.37[4];

So, this is the normal decl.

>     atmp.36.dim[0].ubound = 3;
>         integer(kind=4) A.37[4];

And this is the DECL_EXPR statement, which then actually is transformed 
into the stack_save/alloca/stack_restore sequence.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-15 12:38               ` Michael Matz
@ 2011-04-15 14:06                 ` Dominique Dhumieres
  2011-04-15 15:19                   ` Michael Matz
  0 siblings, 1 reply; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-15 14:06 UTC (permalink / raw)
  To: matz, dominiq; +Cc: paul.richard.thomas, gcc-patches, fortran

Michael,

> Yes, this is due to the DECL_EXPR statement which is rendered by the 
> dumper just the same as a normal decl.  The testcase looks for exactly one 
> such decl, but with -fstack-arrays there are exactly two for each such 
> array.

The testsuite is run without -fstack-arrays, so I dont' understand why
the "DECL_EXPR statement" appears.

Dominique

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-15 14:06                 ` Dominique Dhumieres
@ 2011-04-15 15:19                   ` Michael Matz
  2011-04-15 15:26                     ` Jerry DeLisle
  0 siblings, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-15 15:19 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran

Hi,

On Fri, 15 Apr 2011, Dominique Dhumieres wrote:

> Michael,
> 
> > Yes, this is due to the DECL_EXPR statement which is rendered by the 
> > dumper just the same as a normal decl.  The testcase looks for exactly one 
> > such decl, but with -fstack-arrays there are exactly two for each such 
> > array.
> 
> The testsuite is run without -fstack-arrays, so I dont' understand why
> the "DECL_EXPR statement" appears.

Bummer, you're right.  I unconditionally emit a DECL_EXPR for arrays even 
when they don't have a variable length.  It's harmless, but makes the 
testcase fail (I wasn't seeing the fail because I've changed the testcase 
already to make it not fail with -fstack-arrays).

I'll make the DECL_EXPR conditional on the size being variable.  As Tobias 
already okayed the patch I'm planning to check in the slightly modified 
variant as below, after a new round of testing.


Ciao,
Michael.

	* trans-array.c (toplevel): Include gimple.h.
	(gfc_trans_allocate_array_storage): Check flag_stack_arrays,
	properly expand variable length arrays.
	(gfc_trans_auto_array_allocation): If flag_stack_arrays create
	variable length decls and associate them with their scope.
	* gfortran.h (gfc_option_t): Add flag_stack_arrays member.
	* options.c (gfc_init_options): Handle -fstack_arrays option.
	* lang.opt (fstack-arrays): Add option.
	* invoke.texi (Code Gen Options): Document it.
	* Make-lang.in (trans-array.o): Depend on GIMPLE_H.

Index: fortran/trans-array.c
===================================================================
*** fortran/trans-array.c	(revision 172431)
--- fortran/trans-array.c	(working copy)
*************** along with GCC; see the file COPYING3.
*** 81,86 ****
--- 81,87 ----
  #include "system.h"
  #include "coretypes.h"
  #include "tree.h"
+ #include "gimple.h"
  #include "diagnostic-core.h"	/* For internal_error/fatal_error.  */
  #include "flags.h"
  #include "gfortran.h"
*************** gfc_trans_allocate_array_storage (stmtbl
*** 630,647 ****
      {
        /* Allocate the temporary.  */
        onstack = !dynamic && initial == NULL_TREE
! 			 && gfc_can_put_var_on_stack (size);
  
        if (onstack)
  	{
  	  /* Make a temporary variable to hold the data.  */
  	  tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
  				 nelem, gfc_index_one_node);
  	  tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
  				  tmp);
  	  tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
  				  tmp);
  	  tmp = gfc_create_var (tmp, "A");
  	  tmp = gfc_build_addr_expr (NULL_TREE, tmp);
  	  gfc_conv_descriptor_data_set (pre, desc, tmp);
  	}
--- 631,657 ----
      {
        /* Allocate the temporary.  */
        onstack = !dynamic && initial == NULL_TREE
! 			 && (gfc_option.flag_stack_arrays
! 			     || gfc_can_put_var_on_stack (size));
  
        if (onstack)
  	{
  	  /* Make a temporary variable to hold the data.  */
  	  tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
  				 nelem, gfc_index_one_node);
+ 	  tmp = gfc_evaluate_now (tmp, pre);
  	  tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
  				  tmp);
  	  tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
  				  tmp);
  	  tmp = gfc_create_var (tmp, "A");
+ 	  /* If we're here only because of -fstack-arrays we have to
+ 	     emit a DECL_EXPR to make the gimplifier emit alloca calls.  */
+ 	  if (!gfc_can_put_var_on_stack (size))
+ 	    gfc_add_expr_to_block (pre,
+ 				   fold_build1_loc (input_location,
+ 						    DECL_EXPR, TREE_TYPE (tmp),
+ 						    tmp));
  	  tmp = gfc_build_addr_expr (NULL_TREE, tmp);
  	  gfc_conv_descriptor_data_set (pre, desc, tmp);
  	}
*************** gfc_trans_auto_array_allocation (tree de
*** 4759,4767 ****
  {
    stmtblock_t init;
    tree type;
!   tree tmp;
    tree size;
    tree offset;
    bool onstack;
  
    gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
--- 4769,4779 ----
  {
    stmtblock_t init;
    tree type;
!   tree tmp = NULL_TREE;
    tree size;
    tree offset;
+   tree space;
+   tree inittree;
    bool onstack;
  
    gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
*************** gfc_trans_auto_array_allocation (tree de
*** 4818,4832 ****
        return;
      }
  
!   /* The size is the number of elements in the array, so multiply by the
!      size of an element to get the total size.  */
!   tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
!   size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! 			  size, fold_convert (gfc_array_index_type, tmp));
  
!   /* Allocate memory to hold the data.  */
!   tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
!   gfc_add_modify (&init, decl, tmp);
  
    /* Set offset of the array.  */
    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
--- 4830,4859 ----
        return;
      }
  
!   if (gfc_option.flag_stack_arrays)
!     {
!       gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
!       space = build_decl (sym->declared_at.lb->location,
! 			  VAR_DECL, create_tmp_var_name ("A"),
! 			  TREE_TYPE (TREE_TYPE (decl)));
!       gfc_trans_vla_type_sizes (sym, &init);
!     }
!   else
!     {
!       /* The size is the number of elements in the array, so multiply by the
! 	 size of an element to get the total size.  */
!       tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
!       size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! 			      size, fold_convert (gfc_array_index_type, tmp));
  
!       /* Allocate memory to hold the data.  */
!       tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
!       gfc_add_modify (&init, decl, tmp);
! 
!       /* Free the temporary.  */
!       tmp = gfc_call_free (convert (pvoid_type_node, decl));
!       space = NULL_TREE;
!     }
  
    /* Set offset of the array.  */
    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
*************** gfc_trans_auto_array_allocation (tree de
*** 4835,4844 ****
    /* Automatic arrays should not have initializers.  */
    gcc_assert (!sym->value);
  
!   /* Free the temporary.  */
!   tmp = gfc_call_free (convert (pvoid_type_node, decl));
  
!   gfc_add_init_cleanup (block, gfc_finish_block (&init), tmp);
  }
  
  
--- 4862,4887 ----
    /* Automatic arrays should not have initializers.  */
    gcc_assert (!sym->value);
  
!   inittree = gfc_finish_block (&init);
! 
!   if (space)
!     {
!       tree addr;
!       pushdecl (space);
  
!       /* Don't create new scope, emit the DECL_EXPR in exactly the scope
!          where also space is located.  */
!       gfc_init_block (&init);
!       tmp = fold_build1_loc (input_location, DECL_EXPR,
! 			     TREE_TYPE (space), space);
!       gfc_add_expr_to_block (&init, tmp);
!       addr = fold_build1_loc (sym->declared_at.lb->location,
! 			      ADDR_EXPR, TREE_TYPE (decl), space);
!       gfc_add_modify (&init, decl, addr);
!       gfc_add_init_cleanup (block, gfc_finish_block (&init), NULL_TREE);
!       tmp = NULL_TREE;
!     }
!   gfc_add_init_cleanup (block, inittree, tmp);
  }
  
  
Index: fortran/Make-lang.in
===================================================================
*** fortran/Make-lang.in	(revision 172431)
--- fortran/Make-lang.in	(working copy)
*************** fortran/trans-stmt.o: $(GFORTRAN_TRANS_D
*** 353,359 ****
  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
    fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
    gt-fortran-trans-intrinsic.h
  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
--- 353,359 ----
  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
    fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS) $(GIMPLE_H)
  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
    gt-fortran-trans-intrinsic.h
  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
Index: fortran/gfortran.h
===================================================================
*** fortran/gfortran.h	(revision 172431)
--- fortran/gfortran.h	(working copy)
*************** typedef struct
*** 2221,2226 ****
--- 2221,2227 ----
    int flag_d_lines;
    int gfc_flag_openmp;
    int flag_sign_zero;
+   int flag_stack_arrays;
    int flag_module_private;
    int flag_recursive;
    int flag_init_local_zero;
Index: fortran/lang.opt
===================================================================
*** fortran/lang.opt	(revision 172431)
--- fortran/lang.opt	(working copy)
*************** fmax-stack-var-size=
*** 462,467 ****
--- 462,471 ----
  Fortran RejectNegative Joined UInteger
  -fmax-stack-var-size=<n>	Size in bytes of the largest array that will be put on the stack
  
+ fstack-arrays
+ Fortran
+ Put all local arrays on stack.
+ 
  fmodule-private
  Fortran
  Set default accessibility of module entities to PRIVATE.
Index: fortran/invoke.texi
===================================================================
*** fortran/invoke.texi	(revision 172431)
--- fortran/invoke.texi	(working copy)
*************** and warnings}.
*** 167,172 ****
--- 167,173 ----
  -fbounds-check -fcheck-array-temporaries  -fmax-array-constructor =@var{n} @gol
  -fcheck=@var{<all|array-temps|bounds|do|mem|pointer|recursion>} @gol
  -fcoarray=@var{<none|single|lib>} -fmax-stack-var-size=@var{n} @gol
+ -fstack-arrays @gol
  -fpack-derived  -frepack-arrays  -fshort-enums  -fexternal-blas @gol
  -fblas-matmul-limit=@var{n} -frecursive -finit-local-zero @gol
  -finit-integer=@var{n} -finit-real=@var{<zero|inf|-inf|nan|snan>} @gol
*************** Future versions of GNU Fortran may impro
*** 1370,1375 ****
--- 1371,1383 ----
  
  The default value for @var{n} is 32768.
  
+ @item -fstack-arrays
+ @opindex @code{fstack-arrays}
+ Adding this option will make the fortran compiler put all local arrays,
+ even those of unknown size onto stack memory.  If your program uses very
+ large local arrays it's possible that you'll have to extend your runtime
+ limits for stack memory on some operating systems.
+ 
  @item -fpack-derived
  @opindex @code{fpack-derived}
  @cindex structure packing
Index: fortran/options.c
===================================================================
*** fortran/options.c	(revision 172431)
--- fortran/options.c	(working copy)
*************** gfc_init_options (unsigned int decoded_o
*** 124,129 ****
--- 124,130 ----
  
    /* Default value of flag_max_stack_var_size is set in gfc_post_options.  */
    gfc_option.flag_max_stack_var_size = -2;
+   gfc_option.flag_stack_arrays = 0;
  
    gfc_option.flag_range_check = 1;
    gfc_option.flag_pack_derived = 0;
*************** gfc_handle_option (size_t scode, const c
*** 795,800 ****
--- 796,805 ----
        gfc_option.flag_max_stack_var_size = value;
        break;
  
+     case OPT_fstack_arrays:
+       gfc_option.flag_stack_arrays = value;
+       break;
+ 
      case OPT_fmodule_private:
        gfc_option.flag_module_private = value;
        break;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-15 15:19                   ` Michael Matz
@ 2011-04-15 15:26                     ` Jerry DeLisle
  2011-04-15 22:41                       ` Michael Matz
  0 siblings, 1 reply; 40+ messages in thread
From: Jerry DeLisle @ 2011-04-15 15:26 UTC (permalink / raw)
  To: Michael Matz
  Cc: Dominique Dhumieres, paul.richard.thomas, gcc-patches, fortran

On 04/15/2011 07:28 AM, Michael Matz wrote:
--- snip ---
>
> I'll make the DECL_EXPR conditional on the size being variable.  As Tobias
> already okayed the patch I'm planning to check in the slightly modified
> variant as below, after a new round of testing.
>

Thats A-OK

Thanks,

Jerry

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-15 15:26                     ` Jerry DeLisle
@ 2011-04-15 22:41                       ` Michael Matz
  0 siblings, 0 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-15 22:41 UTC (permalink / raw)
  To: Jerry DeLisle
  Cc: Dominique Dhumieres, paul.richard.thomas, gcc-patches, fortran

Hi,

On Fri, 15 Apr 2011, Jerry DeLisle wrote:

> > I'll make the DECL_EXPR conditional on the size being variable.  As 
> > Tobias already okayed the patch I'm planning to check in the slightly 
> > modified variant as below, after a new round of testing.
> 
> Thats A-OK

r172524


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-09  9:02   ` Eric Botcazou
@ 2011-04-09  9:49     ` N.M. Maclaren
  0 siblings, 0 replies; 40+ messages in thread
From: N.M. Maclaren @ 2011-04-09  9:49 UTC (permalink / raw)
  To: gcc-patches, fortran

On Apr 9 2011, Magnus Fromreide wrote:
>> 
>> There is actually a much better approach, which was done in Algol 68
>> and seems now to be done only in Ada.  As far as the compiler
>> implementation goes, it is a trivial variation on what you have done,
>> but there is a little more work in the run-time system.
>> 
>> That is to use primary and secondary stacks. ...
>
>How does your secondary stack interact with threads? Do it force the use
>of more memory per thread?

No, there is no difference.  You need a pair of stacks per thread, rather
than one, but that's all.


On Apr 9 2011, Eric Botcazou wrote:
>
> Obviously this depends on the compiler. GNAT allocates all local arrays 
> on the primary stack for example.

Interesting.  I was relying on what (several) other people had told me.
But, yes, it's entirely compiler-dependent, and can be done or not in
any conventional 3GL language.

>> You need a single secondary stack pointer (and preferably limit, for
>> checking), which can be global variables.  If a procedure uses the
>> secondary stack, it needs to restore it on leaving, or when leaving
>> a scope including an allocation - but you have already implemented
>> that!
>
> Not exactly, managing a secondary stack isn't supported in the GENERIC 
> IL, you need to do it explicitly by inserting calls to the runtime at 
> relevant points. As a consequence, this isn't an efficient mechanism, 
> hence the restricted use.

Naturally, every mechanism can be done badly, and I agree that it is
very common for the best solution to be infeasible because of some other
aspect.  You still get the cache benefits with a pair of calls, but they
do add a lot of overhead.


Regards,
Nick Maclaren.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-09  8:21 ` N.M. Maclaren
  2011-04-09  8:51   ` Magnus Fromreide
@ 2011-04-09  9:02   ` Eric Botcazou
  2011-04-09  9:49     ` N.M. Maclaren
  1 sibling, 1 reply; 40+ messages in thread
From: Eric Botcazou @ 2011-04-09  9:02 UTC (permalink / raw)
  To: N.M. Maclaren; +Cc: gcc-patches, fortran

> There is actually a much better approach, which was done in Algol 68
> and seems now to be done only in Ada.  As far as the compiler
> implementation goes, it is a trivial variation on what you have done,
> but there is a little more work in the run-time system.

Obviously this depends on the compiler.  GNAT allocates all local arrays on the 
primary stack for example.

> That is to use primary and secondary stacks.  The primary is for the
> linkage, scalars, array descriptors and very small, fixed-size arrays.
> The secondary is for all other arrays with automatic scope.  You get
> all of the benefits that you mention, because both are true stacks,
> and the scope of the secondary stack is controlled by that of the
> primary.

GNAT has a secondary stack, but it is only used to return objects whose size is 
self-referential, i.e. isn't given by the return type of the function.

> You need a single secondary stack pointer (and preferably limit, for
> checking), which can be global variables.  If a procedure uses the
> secondary stack, it needs to restore it on leaving, or when leaving
> a scope including an allocation - but you have already implemented
> that!

Not exactly, managing a secondary stack isn't supported in the GENERIC IL, you 
need to do it explicitly by inserting calls to the runtime at relevant points.
As a consequence, this isn't an efficient mechanism, hence the restricted use.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-09  8:21 ` N.M. Maclaren
@ 2011-04-09  8:51   ` Magnus Fromreide
  2011-04-09  9:02   ` Eric Botcazou
  1 sibling, 0 replies; 40+ messages in thread
From: Magnus Fromreide @ 2011-04-09  8:51 UTC (permalink / raw)
  To: N.M. Maclaren; +Cc: gcc-patches, fortran

On Sat, 2011-04-09 at 09:21 +0100, N.M. Maclaren wrote:
> On Apr 8 2011, Michael Matz wrote:
> >
> >It adds a new option -fstack-arrays which makes the frontend put 
> >all local arrays on stack memory.  ...
> 
> Excellent!
> 
> >I haven't rechecked performance now, but four months ago this was the 
> >result for the fortran benchmarks in cpu2006:
> 
> There is actually a much better approach, which was done in Algol 68
> and seems now to be done only in Ada.  As far as the compiler
> implementation goes, it is a trivial variation on what you have done,
> but there is a little more work in the run-time system.
> 
> That is to use primary and secondary stacks. ...

How does your secondary stack interact with threads? Do it force the use
of more memory per thread?

/MF

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Implement stack arrays even for unknown sizes
  2011-04-08 21:29 Michael Matz
@ 2011-04-09  8:21 ` N.M. Maclaren
  2011-04-09  8:51   ` Magnus Fromreide
  2011-04-09  9:02   ` Eric Botcazou
  0 siblings, 2 replies; 40+ messages in thread
From: N.M. Maclaren @ 2011-04-09  8:21 UTC (permalink / raw)
  To: gcc-patches, fortran

On Apr 8 2011, Michael Matz wrote:
>
>It adds a new option -fstack-arrays which makes the frontend put 
>all local arrays on stack memory.  ...

Excellent!

>I haven't rechecked performance now, but four months ago this was the 
>result for the fortran benchmarks in cpu2006:

There is actually a much better approach, which was done in Algol 68
and seems now to be done only in Ada.  As far as the compiler
implementation goes, it is a trivial variation on what you have done,
but there is a little more work in the run-time system.

That is to use primary and secondary stacks.  The primary is for the
linkage, scalars, array descriptors and very small, fixed-size arrays.
The secondary is for all other arrays with automatic scope.  You get
all of the benefits that you mention, because both are true stacks,
and the scope of the secondary stack is controlled by that of the
primary.

You need a single secondary stack pointer (and preferably limit, for
checking), which can be global variables.  If a procedure uses the
secondary stack, it needs to restore it on leaving, or when leaving
a scope including an allocation - but you have already implemented
that!

The reason that it gives better performance is locality.  Because the
primary stack is much smaller and contains the critical scalars and
pointers to arrays etc., calling procedures can involve MANY fewer
cache misses.

It also reduces the number of array overflows that cause complete chaos,
because most small bounds errors trash other data, not constants or
linkage.

It also helps with Unixoid systems with fixed (and small) stack limits,
because the secondary stack is just another data segment.

I would love to say that I would try to work on this, but I really can't
promise to do anything else until I have delivered on some of my long
overdue promises.  But I am very happy to discuss details - not that there
ARE many more than I have described!

Regards,
Nick Maclaren.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Implement stack arrays even for unknown sizes
@ 2011-04-08 21:29 Michael Matz
  2011-04-09  8:21 ` N.M. Maclaren
  0 siblings, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-08 21:29 UTC (permalink / raw)
  To: gcc-patches; +Cc: fortran

Hello,

I developed this patch during stage3 of 4.6, so it wasn't appropriate 
then, but now should be a good time.

It adds a new option -fstack-arrays which makes the frontend put 
all local arrays on stack memory.  Even those of non-constant size, by 
using alloca.  Or rather not by explicitely using alloca, but by 
leveraging the middle-ends possibilities of having variable length arrays 
(which that then implements via alloca/stack_save/stack_restore).

Via that we'll get automatic allocation and free (the latter being the 
important part for stack memory) when entering or leaving the scope of the 
so declared arrays.  That means temporary arrays generated for looping 
constructs only need stack memory when the loop is active, not during the 
whole function.

By doing that we save one malloc/free pair per such loop iteration, and 
some benchmarks _heavily_ churn in the libc allocator (tonto for 
instance).  Also in some cases the data layout will be much better 
(e.g. bwaves).  This patch bootstraps fine on x86_64-linux.  Even if I'm 
forcing the use of stack arrays by default the testsuite runs without 
regressions, except for a few cases where regexps need to be changed to 
accept the changed output (e.g. explicitely matching for builtin_free).
The patch as is (i.e. without using stack arrays by default) doesn't 
regress at all.

I haven't rechecked performance now, but four months ago this was the 
result for the fortran benchmarks in cpu2006:

(base is -O3 -g -ffast-math -funroll-loops -fpeel-loops
 peak is + -fstack-arrays)

410.bwaves      13590   1340         10.2   *   13590    813         16.7   *
416.gamess      19580   1400         14.0   *   19580   1410         13.9   *
434.zeusmp       9100    798         11.4   *    9100    799         11.4   *
435.gromacs      7140    622         11.5   *    7140    621         11.5   *
436.cactusADM   11950   1280          9.36  *   11950   1300          9.18 *
437.leslie3d     9400    765         12.3   *    9400    765         12.3   *
454.calculix     8250    734         11.2   *    8250    735         11.2   *
459.GemsFDTD    10610   1070          9.95  *   10610   1060         10.0   *
465.tonto        9840   1090          9.00  *    9840    952         10.3   *
481.wrf         11170    899         12.4   *   11170    794         14.1 *

That is tonto and wrf show nice speedups, and bwaves increases performance 
by 80% (that's data layout, bwaves doesn't heavily call malloc/free).

It should be noted that e.g. ICC does this placement of local arrays on 
stack by default, and that doing it for GCC has similar requirements.  For 
cpu2006 one needs to increase the stack ulimit quite much (around 1G) to 
not segfault.

I would consider enabling stack-arrays for -Ofast, but that would be a 
separate patch.

So, what do you think?


Ciao,
Michael.

	* trans-array.c (toplevel): Include gimple.h.
	(gfc_trans_allocate_array_storage): Check flag_stack_arrays,
	properly expand variable length arrays.
	(gfc_trans_auto_array_allocation): If flag_stack_arrays create
	variable length decls and associate them with their scope.
	* gfortran.h (gfc_option_t): Add flag_stack_arrays member.
	* options.c (gfc_init_options): Handle -fstack_arrays option.
	* lang.opt (fstack-arrays): Add option.
	* invoke.texi (Code Gen Options): Document it.
	* Make-lang.in (trans-array.o): Depend on GIMPLE_H.

Index: trans-array.c
===================================================================
*** trans-array.c	(Revision 172206)
--- trans-array.c	(Arbeitskopie)
*************** along with GCC; see the file COPYING3.
*** 81,86 ****
--- 81,87 ----
  #include "system.h"
  #include "coretypes.h"
  #include "tree.h"
+ #include "gimple.h"
  #include "diagnostic-core.h"	/* For internal_error/fatal_error.  */
  #include "flags.h"
  #include "gfortran.h"
*************** gfc_trans_allocate_array_storage (stmtbl
*** 630,647 ****
      {
        /* Allocate the temporary.  */
        onstack = !dynamic && initial == NULL_TREE
! 			 && gfc_can_put_var_on_stack (size);
  
        if (onstack)
  	{
  	  /* Make a temporary variable to hold the data.  */
  	  tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
  				 nelem, gfc_index_one_node);
  	  tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
  				  tmp);
  	  tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
  				  tmp);
  	  tmp = gfc_create_var (tmp, "A");
  	  tmp = gfc_build_addr_expr (NULL_TREE, tmp);
  	  gfc_conv_descriptor_data_set (pre, desc, tmp);
  	}
--- 631,654 ----
      {
        /* Allocate the temporary.  */
        onstack = !dynamic && initial == NULL_TREE
! 			 && (gfc_option.flag_stack_arrays
! 			     || gfc_can_put_var_on_stack (size));
  
        if (onstack)
  	{
  	  /* Make a temporary variable to hold the data.  */
  	  tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
  				 nelem, gfc_index_one_node);
+ 	  tmp = gfc_evaluate_now (tmp, pre);
  	  tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
  				  tmp);
  	  tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
  				  tmp);
  	  tmp = gfc_create_var (tmp, "A");
+ 	  gfc_add_expr_to_block (pre,
+ 				 fold_build1_loc (input_location,
+ 						  DECL_EXPR, TREE_TYPE (tmp),
+ 						  tmp));
  	  tmp = gfc_build_addr_expr (NULL_TREE, tmp);
  	  gfc_conv_descriptor_data_set (pre, desc, tmp);
  	}
*************** gfc_trans_auto_array_allocation (tree de
*** 4744,4749 ****
--- 4751,4758 ----
    tree tmp;
    tree size;
    tree offset;
+   tree space;
+   tree inittree;
    bool onstack;
  
    gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
*************** gfc_trans_auto_array_allocation (tree de
*** 4800,4814 ****
        return;
      }
  
!   /* The size is the number of elements in the array, so multiply by the
!      size of an element to get the total size.  */
!   tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
!   size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! 			  size, fold_convert (gfc_array_index_type, tmp));
  
!   /* Allocate memory to hold the data.  */
!   tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
!   gfc_add_modify (&init, decl, tmp);
  
    /* Set offset of the array.  */
    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
--- 4809,4846 ----
        return;
      }
  
!   if (gfc_option.flag_stack_arrays)
!     {
!       tree addr;
!       gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
!       space = build_decl (sym->declared_at.lb->location,
! 			  VAR_DECL, create_tmp_var_name ("A"),
! 			  TREE_TYPE (TREE_TYPE (decl)));
!       gfc_trans_vla_type_sizes (sym, &init);
!       tmp = fold_build1_loc (input_location, DECL_EXPR,
! 			     TREE_TYPE (space), space);
!       gfc_add_expr_to_block (&init, tmp);
!       addr = fold_build1_loc (sym->declared_at.lb->location,
! 			      ADDR_EXPR, TREE_TYPE (decl), space);
!       gfc_add_modify (&init, decl, addr);
!       tmp = NULL_TREE;
!     }
!   else
!     {
!       /* The size is the number of elements in the array, so multiply by the
! 	 size of an element to get the total size.  */
!       tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
!       size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! 			      size, fold_convert (gfc_array_index_type, tmp));
! 
!       /* Allocate memory to hold the data.  */
!       tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
!       gfc_add_modify (&init, decl, tmp);
  
!       /* Free the temporary.  */
!       tmp = gfc_call_free (convert (pvoid_type_node, decl));
!       space = NULL_TREE;
!     }
  
    /* Set offset of the array.  */
    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
*************** gfc_trans_auto_array_allocation (tree de
*** 4817,4826 ****
    /* Automatic arrays should not have initializers.  */
    gcc_assert (!sym->value);
  
!   /* Free the temporary.  */
!   tmp = gfc_call_free (convert (pvoid_type_node, decl));
! 
!   gfc_add_init_cleanup (block, gfc_finish_block (&init), tmp);
  }
  
  
--- 4849,4858 ----
    /* Automatic arrays should not have initializers.  */
    gcc_assert (!sym->value);
  
!   inittree = gfc_finish_block (&init);
!   if (space)
!     pushdecl (space);
!   gfc_add_init_cleanup (block, inittree, tmp);
  }
  
  
Index: Make-lang.in
===================================================================
*** Make-lang.in	(Revision 172206)
--- Make-lang.in	(Arbeitskopie)
*************** fortran/trans-stmt.o: $(GFORTRAN_TRANS_D
*** 353,359 ****
  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
    fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
    gt-fortran-trans-intrinsic.h
  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
--- 353,359 ----
  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
    fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS) $(GIMPLE_H)
  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
    gt-fortran-trans-intrinsic.h
  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
Index: gfortran.h
===================================================================
*** gfortran.h	(Revision 172206)
--- gfortran.h	(Arbeitskopie)
*************** typedef struct
*** 2220,2225 ****
--- 2220,2226 ----
    int flag_d_lines;
    int gfc_flag_openmp;
    int flag_sign_zero;
+   int flag_stack_arrays;
    int flag_module_private;
    int flag_recursive;
    int flag_init_local_zero;
Index: lang.opt
===================================================================
*** lang.opt	(Revision 172206)
--- lang.opt	(Arbeitskopie)
*************** fmax-stack-var-size=
*** 454,459 ****
--- 454,463 ----
  Fortran RejectNegative Joined UInteger
  -fmax-stack-var-size=<n>	Size in bytes of the largest array that will be put on the stack
  
+ fstack-arrays
+ Fortran
+ Put all local arrays on stack.
+ 
  fmodule-private
  Fortran
  Set default accessibility of module entities to PRIVATE.
Index: invoke.texi
===================================================================
*** invoke.texi	(Revision 172206)
--- invoke.texi	(Arbeitskopie)
*************** and warnings}.
*** 167,172 ****
--- 167,173 ----
  -fbounds-check -fcheck-array-temporaries  -fmax-array-constructor =@var{n} @gol
  -fcheck=@var{<all|array-temps|bounds|do|mem|pointer|recursion>} @gol
  -fcoarray=@var{<none|single|lib>} -fmax-stack-var-size=@var{n} @gol
+ -fstack-arrays @gol
  -fpack-derived  -frepack-arrays  -fshort-enums  -fexternal-blas @gol
  -fblas-matmul-limit=@var{n} -frecursive -finit-local-zero @gol
  -finit-integer=@var{n} -finit-real=@var{<zero|inf|-inf|nan|snan>} @gol
*************** Future versions of GNU Fortran may impro
*** 1361,1366 ****
--- 1362,1374 ----
  
  The default value for @var{n} is 32768.
  
+ @item -fstack-arrays
+ @opindex @code{fstack-arrays}
+ Adding this option will make the fortran compiler put all local arrays,
+ even those of unknown size onto stack memory.  If your program uses very
+ large local arrays it's possible that you'll have to extend your runtime
+ limits for stack memory on some operating systems.
+ 
  @item -fpack-derived
  @opindex @code{fpack-derived}
  @cindex structure packing
Index: options.c
===================================================================
*** options.c	(Revision 172206)
--- options.c	(Arbeitskopie)
*************** gfc_init_options (unsigned int decoded_o
*** 123,128 ****
--- 123,129 ----
  
    /* Default value of flag_max_stack_var_size is set in gfc_post_options.  */
    gfc_option.flag_max_stack_var_size = -2;
+   gfc_option.flag_stack_arrays = 0;
  
    gfc_option.flag_range_check = 1;
    gfc_option.flag_pack_derived = 0;
*************** gfc_handle_option (size_t scode, const c
*** 783,788 ****
--- 784,793 ----
        gfc_option.flag_max_stack_var_size = value;
        break;
  
+     case OPT_fstack_arrays:
+       gfc_option.flag_stack_arrays = value;
+       break;
+ 
      case OPT_fmodule_private:
        gfc_option.flag_module_private = value;
        break;

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2011-04-15 21:49 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-09 10:08 Implement stack arrays even for unknown sizes Dominique Dhumieres
2011-04-09 12:17 ` Paul Richard Thomas
2011-04-10 13:29   ` Dominique Dhumieres
2011-04-11 11:49     ` Michael Matz
2011-04-11 11:58       ` Axel Freyn
2011-04-11 13:35       ` Eric Botcazou
2011-04-11 14:06         ` Dominique Dhumieres
2011-04-11 14:59         ` Michael Matz
2011-04-11 16:04   ` Michael Matz
2011-04-11 16:45     ` Steven Bosscher
2011-04-12 11:54       ` Michael Matz
2011-04-12 12:42         ` N.M. Maclaren
2011-04-11 16:46     ` Dominique Dhumieres
2011-04-11 16:59     ` H.J. Lu
2011-04-11 17:06       ` N.M. Maclaren
2011-04-11 18:28     ` Tobias Burnus
2011-04-11 21:18     ` Dominique Dhumieres
2011-04-12  6:23     ` Paul Richard Thomas
2011-04-12  6:35       ` Dominique Dhumieres
2011-04-13  9:42         ` Paul Richard Thomas
2011-04-13 10:38           ` Richard Guenther
2011-04-13 11:13             ` Richard Guenther
2011-04-14 13:32               ` Dominique Dhumieres
2011-04-13 13:46             ` Paul Richard Thomas
2011-04-14 13:59         ` Michael Matz
2011-04-14 15:28           ` Michael Matz
2011-04-14 19:29             ` Dominique Dhumieres
2011-04-15  2:01             ` Dominique Dhumieres
2011-04-15 12:38               ` Michael Matz
2011-04-15 14:06                 ` Dominique Dhumieres
2011-04-15 15:19                   ` Michael Matz
2011-04-15 15:26                     ` Jerry DeLisle
2011-04-15 22:41                       ` Michael Matz
2011-04-14 20:23     ` Tobias Burnus
2011-04-14 20:30       ` Tobias Burnus
  -- strict thread matches above, loose matches on Subject: below --
2011-04-08 21:29 Michael Matz
2011-04-09  8:21 ` N.M. Maclaren
2011-04-09  8:51   ` Magnus Fromreide
2011-04-09  9:02   ` Eric Botcazou
2011-04-09  9:49     ` N.M. Maclaren

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).