* Re: Implement stack arrays even for unknown sizes
@ 2011-04-09 10:08 Dominique Dhumieres
2011-04-09 12:17 ` Paul Richard Thomas
0 siblings, 1 reply; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-09 10:08 UTC (permalink / raw)
To: fortran; +Cc: gcc-patches, matz
Michael,
I have applied your patch on top of revision 172217 on x86_64-apple-darwin10.7.0.
So far I have only limited tests on the polyhedron test suite.
The test nf.f90 (containing an automatic array) executes in less than 20s, compares
to ~28s without the patch. However capacita.f90 is miscompiled (Segmentation fault
at -O) and fatigue.f90 with -fwhole-program executes in ~7.5s instead of ~6.3s
(with -Ofast).
Cheers,
Dominique
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-09 10:08 Implement stack arrays even for unknown sizes Dominique Dhumieres
@ 2011-04-09 12:17 ` Paul Richard Thomas
2011-04-10 13:29 ` Dominique Dhumieres
2011-04-11 16:04 ` Michael Matz
0 siblings, 2 replies; 40+ messages in thread
From: Paul Richard Thomas @ 2011-04-09 12:17 UTC (permalink / raw)
To: Dominique Dhumieres; +Cc: fortran, gcc-patches, matz
Dear Dominique and Michael,
I find that both nf.f90 and capacita.f90 segfault in runtime for any stack size.
Cheers
Paul
On Sat, Apr 9, 2011 at 12:08 PM, Dominique Dhumieres <dominiq@lps.ens.fr> wrote:
> Michael,
>
> I have applied your patch on top of revision 172217 on x86_64-apple-darwin10.7.0.
> So far I have only limited tests on the polyhedron test suite.
> The test nf.f90 (containing an automatic array) executes in less than 20s, compares
> to ~28s without the patch. However capacita.f90 is miscompiled (Segmentation fault
> at -O) and fatigue.f90 with -fwhole-program executes in ~7.5s instead of ~6.3s
> (with -Ofast).
>
> Cheers,
>
> Dominique
>
--
The knack of flying is learning how to throw yourself at the ground and miss.
--Hitchhikers Guide to the Galaxy
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-09 12:17 ` Paul Richard Thomas
@ 2011-04-10 13:29 ` Dominique Dhumieres
2011-04-11 11:49 ` Michael Matz
2011-04-11 16:04 ` Michael Matz
1 sibling, 1 reply; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-10 13:29 UTC (permalink / raw)
To: paul.richard.thomas, dominiq; +Cc: matz, gcc-patches, fortran
> I find that both nf.f90 and capacita.f90 segfault in runtime for any stack size.
On x86_64-apple-darwin10, nf.f90 "works". However if I run it through
valgrind I get
==64815== Memcheck, a memory error detector
==64815== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==64815== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==64815== Command: a.out --max-stackframe=2118496
==64815==
==64815== Warning: set address range perms: large range [0x7ffe6c000000, 0x7fff5bc01000) (defined)
==64815== Warning: client switching stacks? SP change: 0x7fff5bffe410 --> 0x7fff5be0cef0
==64815== to suppress, use: --max-stackframe=2037024 or greater
==64815== Invalid write of size 8
==64815== at 0x100003B22: nf2dprecon.1828 (nf.f90:262)
==64815== by 0x10000C3B6: nfcg_ (nf.f90:279)
==64815== by 0x3FEFFFFFFFFFFFFF: ???
==64815== by 0x3FEFFFFFFFFFFFFF: ???
==64815== Address 0x7fff5be0cee8 is on thread 1's stack
==64815==
==64815== Invalid read of size 8
==64815== at 0x100001296: trisolve.1839 (nf.f90:238)
==64815== by 0x10000C3B6: nfcg_ (nf.f90:279)
==64815== by 0x3FEFFFFFFFFFFFFF: ???
==64815== by 0x3FEFFFFFFFFFFFFF: ???
==64815== Address 0x7fff5be0cee8 is on thread 1's stack
==64815==
...
==64815== Invalid write of size 8
==64815== at 0x100000F38: trisolve.1839 (nf.f90:231)
==64815== by 0x10000C3B6: nfcg_ (nf.f90:279)
==64815== by 0x3FEFFFFFFFFFFFFF: ???
==64815== by 0x3FEFFFFFFFFFFFFF: ???
==64815== Address 0x7fff5bffdde8 is on thread 1's stack
==64815==
==64815== Invalid read of size 8
==64815== at 0x100000F64: trisolve.1839 (nf.f90:233)
==64815== by 0xC02396760B69D62B: ???
==64815== by 0xC02396760B69D62C: ???
==64815== by 0xC02396760B69D62D: ???
==64815== by 0xC02396760B69D62D: ???
==64815== by 0xC02396760B69D62D: ???
==64815== by 0xC02396760B69D62D: ???
==64815== by 0xC02396760B69D62D: ???
==64815== by 0xC02396760B69D62D: ???
==64815== by 0xC02396760B69D62B: ???
==64815== by 0xC02396760B69D62B: ???
==64815== by 0xC02396760B69D62B: ???
==64815== Address 0x7fff5bffdde8 is on thread 1's stack
==64815==
...
Time for setup 1.316
Time per iteration 3.250
Total Time 144.295
==64815==
==64815== HEAP SUMMARY:
==64815== in use at exit: 656 bytes in 13 blocks
==64815== total heap usage: 411 allocs, 398 frees, 387,314,592 bytes allocated
==64815==
==64815== LEAK SUMMARY:
==64815== definitely lost: 0 bytes in 0 blocks
==64815== indirectly lost: 0 bytes in 0 blocks
==64815== possibly lost: 0 bytes in 0 blocks
==64815== still reachable: 656 bytes in 13 blocks
==64815== suppressed: 0 bytes in 0 blocks
==64815== Rerun with --leak-check=full to see details of leaked memory
==64815==
==64815== For counts of detected and suppressed errors, rerun with: -v
==64815== ERROR SUMMARY: 40050 errors from 1000 contexts (suppressed: 0 from 0)
The segfault for capacita.f90 occurs in the subroutine fourir at the line
write(unit=*, fmt=*) "error in fourier: n=", ntot
AFAICT the problem occurs in the loop
do m=1,ntot/4-1
E(m) = exp(m*h)
end do
If I print ntot, loc(ntot) before it I get
2048 140734799794712
After the loop loc(ntot) is
9205357642636066816
and any attemp to print its values yields a segfault.
Dominique
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-10 13:29 ` Dominique Dhumieres
@ 2011-04-11 11:49 ` Michael Matz
2011-04-11 11:58 ` Axel Freyn
2011-04-11 13:35 ` Eric Botcazou
0 siblings, 2 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-11 11:49 UTC (permalink / raw)
To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran
Hi,
On Sun, 10 Apr 2011, Dominique Dhumieres wrote:
> > I find that both nf.f90 and capacita.f90 segfault in runtime for any stack size.
>
> On x86_64-apple-darwin10, nf.f90 "works". However if I run it through
> valgrind I get
>
> ==64815== Memcheck, a memory error detector
> ==64815== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==64815== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
> ==64815== Command: a.out --max-stackframe=2118496
> ==64815==
> ==64815== Warning: set address range perms: large range [0x7ffe6c000000, 0x7fff5bc01000) (defined)
> ==64815== Warning: client switching stacks? SP change: 0x7fff5bffe410 --> 0x7fff5be0cef0
> ==64815== to suppress, use: --max-stackframe=2037024 or greater
See? That's whay I meant with having to use a large ulimit for stack
size. Usually stack overflows symptom is a simple segfault. What ulimit
-s have you used for your capacita tests?
> The segfault for capacita.f90 occurs in the subroutine fourir at the line
>
> write(unit=*, fmt=*) "error in fourier: n=", ntot
>
> AFAICT the problem occurs in the loop
>
> do m=1,ntot/4-1
> E(m) = exp(m*h)
> end do
>
> If I print ntot, loc(ntot) before it I get
>
> 2048 140734799794712
>
> After the loop loc(ntot) is
>
> 9205357642636066816
>
> and any attemp to print its values yields a segfault.
I'll poke at polyhedron somewhat.
Ciao,
Michael.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 11:49 ` Michael Matz
@ 2011-04-11 11:58 ` Axel Freyn
2011-04-11 13:35 ` Eric Botcazou
1 sibling, 0 replies; 40+ messages in thread
From: Axel Freyn @ 2011-04-11 11:58 UTC (permalink / raw)
To: Michael Matz
Cc: Dominique Dhumieres, paul.richard.thomas, gcc-patches, fortran
Hi,
On Mon, Apr 11, 2011 at 01:49:16PM +0200, Michael Matz wrote:
> Hi,
>
> On Sun, 10 Apr 2011, Dominique Dhumieres wrote:
>
> > > I find that both nf.f90 and capacita.f90 segfault in runtime for any stack size.
> >
> > On x86_64-apple-darwin10, nf.f90 "works". However if I run it through
> > valgrind I get
> >
> > ==64815== Memcheck, a memory error detector
> > ==64815== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> > ==64815== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
> > ==64815== Command: a.out --max-stackframe=2118496
> > ==64815==
> > ==64815== Warning: set address range perms: large range [0x7ffe6c000000, 0x7fff5bc01000) (defined)
> > ==64815== Warning: client switching stacks? SP change: 0x7fff5bffe410 --> 0x7fff5be0cef0
> > ==64815== to suppress, use: --max-stackframe=2037024 or greater
>
> See? That's whay I meant with having to use a large ulimit for stack
> size. Usually stack overflows symptom is a simple segfault. What ulimit
> -s have you used for your capacita tests?
Just a small remark: according to the message from valgrind, the
"max-stackframe" parameter is at the WRONG position. It has to be
valgrind --max-stackframe=2118496 ./a.out
and NOT
valgrind ./a.out --max-stackframe=2118496
(the second way, valgrind ignores "max-stackframe" and passes it as
parameter to ./a.out. According to the valgrind-output, you passed the
parameter AFTER the "./a.out". That might explain the problem?
Axel
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 11:49 ` Michael Matz
2011-04-11 11:58 ` Axel Freyn
@ 2011-04-11 13:35 ` Eric Botcazou
2011-04-11 14:06 ` Dominique Dhumieres
2011-04-11 14:59 ` Michael Matz
1 sibling, 2 replies; 40+ messages in thread
From: Eric Botcazou @ 2011-04-11 13:35 UTC (permalink / raw)
To: Michael Matz
Cc: gcc-patches, Dominique Dhumieres, paul.richard.thomas, fortran
> See? That's whay I meant with having to use a large ulimit for stack
> size. Usually stack overflows symptom is a simple segfault. What ulimit
> -s have you used for your capacita tests?
Compiling with -fstack-check should give the segfault reliably.
--
Eric Botcazou
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 13:35 ` Eric Botcazou
@ 2011-04-11 14:06 ` Dominique Dhumieres
2011-04-11 14:59 ` Michael Matz
1 sibling, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-11 14:06 UTC (permalink / raw)
To: matz, ebotcazou; +Cc: paul.richard.thomas, gcc-patches, fortran, dominiq
> What ulimit -s have you used for your capacita tests?
On *-apple-darwin* the stacksize is limited to (kbytes, -s) 65532
and it is hard coded. It is my (very limited) understanding that
this limit can be bypassed by using something such as
-Wl,-stack_size,0xf0000000 (for 1Gbytes in 64 bit mode -in 32 bit mode
you need an additional trick). Doing so for capacita does not help.
For nf the best I get is
[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays nf.f90 -g -save-temps
[macbook] lin/test% valgrind --max-stackframe=64118496 a.out
==25739== Memcheck, a memory error detector
==25739== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==25739== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==25739== Command: a.out
==25739==
Solve 97 by 105 by 99 FD problem - method=NFCG
Band elements are 100.00, 10.00, and 1.00. stiffness= 1000.00
Target RMS residual = 0.1000E-06 Maximum Iterations = 50
Iter Alpha Beta RMS Residual Sum of Residuals
0 0.9958682E-01 100.0000
0 1.00000 0.1412538E-01 -0.2314427E-09
1 0.08835 0.00000 0.9698062E-02 -0.2522486E-09
...
44 0.07743 0.55715 0.8934363E-07 -0.4388281E-14
Time for setup 0.868
Time per iteration 1.709
Total Time 76.049
==25739==
==25739== HEAP SUMMARY:
==25739== in use at exit: 656 bytes in 13 blocks
==25739== total heap usage: 409 allocs, 396 frees, 387,298,208 bytes allocated
==25739==
==25739== LEAK SUMMARY:
==25739== definitely lost: 0 bytes in 0 blocks
==25739== indirectly lost: 0 bytes in 0 blocks
==25739== possibly lost: 0 bytes in 0 blocks
==25739== still reachable: 656 bytes in 13 blocks
==25739== suppressed: 0 bytes in 0 blocks
==25739== Rerun with --leak-check=full to see details of leaked memory
==25739==
==25739== For counts of detected and suppressed errors, rerun with: -v
==25739== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
> Compiling with -fstack-check should give the segfault reliably.
It does not seems to work for me.
Dominique
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 13:35 ` Eric Botcazou
2011-04-11 14:06 ` Dominique Dhumieres
@ 2011-04-11 14:59 ` Michael Matz
1 sibling, 0 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-11 14:59 UTC (permalink / raw)
To: Eric Botcazou
Cc: gcc-patches, Dominique Dhumieres, paul.richard.thomas, fortran
Hi,
On Mon, 11 Apr 2011, Eric Botcazou wrote:
> > See? That's whay I meant with having to use a large ulimit for stack
> > size. Usually stack overflows symptom is a simple segfault. What ulimit
> > -s have you used for your capacita tests?
>
> Compiling with -fstack-check should give the segfault reliably.
Yeah, I see the problem. Strange that it doesn't occure with spec or the
testsuite, it's a scoping problem.
Ciao,
Michael.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-09 12:17 ` Paul Richard Thomas
2011-04-10 13:29 ` Dominique Dhumieres
@ 2011-04-11 16:04 ` Michael Matz
2011-04-11 16:45 ` Steven Bosscher
` (6 more replies)
1 sibling, 7 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-11 16:04 UTC (permalink / raw)
To: Paul Richard Thomas; +Cc: Dominique Dhumieres, fortran, gcc-patches
On Sat, 9 Apr 2011, Paul Richard Thomas wrote:
> I find that both nf.f90 and capacita.f90 segfault in runtime for any
> stack size.
Try this patch. I've verified that capacita and nf work with it and
-march=native -ffast-math -funroll-loops -fstack-arrays -O3 . In fact all
of polyhedron works for me on these flags. (I've set a ulimit -s of
512MB, but I don't know if such a large amount is required).
Ciao,
Michael.
* trans-array.c (toplevel): Include gimple.h.
(gfc_trans_allocate_array_storage): Check flag_stack_arrays,
properly expand variable length arrays.
(gfc_trans_auto_array_allocation): If flag_stack_arrays create
variable length decls and associate them with their scope.
* gfortran.h (gfc_option_t): Add flag_stack_arrays member.
* options.c (gfc_init_options): Handle -fstack_arrays option.
* lang.opt (fstack-arrays): Add option.
* invoke.texi (Code Gen Options): Document it.
* Make-lang.in (trans-array.o): Depend on GIMPLE_H.
Index: trans-array.c
===================================================================
*** trans-array.c (revision 172206)
--- trans-array.c (working copy)
*************** along with GCC; see the file COPYING3.
*** 81,86 ****
--- 81,87 ----
#include "system.h"
#include "coretypes.h"
#include "tree.h"
+ #include "gimple.h"
#include "diagnostic-core.h" /* For internal_error/fatal_error. */
#include "flags.h"
#include "gfortran.h"
*************** gfc_trans_allocate_array_storage (stmtbl
*** 630,647 ****
{
/* Allocate the temporary. */
onstack = !dynamic && initial == NULL_TREE
! && gfc_can_put_var_on_stack (size);
if (onstack)
{
/* Make a temporary variable to hold the data. */
tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
nelem, gfc_index_one_node);
tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
tmp);
tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
tmp);
tmp = gfc_create_var (tmp, "A");
tmp = gfc_build_addr_expr (NULL_TREE, tmp);
gfc_conv_descriptor_data_set (pre, desc, tmp);
}
--- 631,654 ----
{
/* Allocate the temporary. */
onstack = !dynamic && initial == NULL_TREE
! && (gfc_option.flag_stack_arrays
! || gfc_can_put_var_on_stack (size));
if (onstack)
{
/* Make a temporary variable to hold the data. */
tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
nelem, gfc_index_one_node);
+ tmp = gfc_evaluate_now (tmp, pre);
tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
tmp);
tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
tmp);
tmp = gfc_create_var (tmp, "A");
+ gfc_add_expr_to_block (pre,
+ fold_build1_loc (input_location,
+ DECL_EXPR, TREE_TYPE (tmp),
+ tmp));
tmp = gfc_build_addr_expr (NULL_TREE, tmp);
gfc_conv_descriptor_data_set (pre, desc, tmp);
}
*************** gfc_trans_auto_array_allocation (tree de
*** 4744,4749 ****
--- 4751,4758 ----
tree tmp;
tree size;
tree offset;
+ tree space;
+ tree inittree;
bool onstack;
gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
*************** gfc_trans_auto_array_allocation (tree de
*** 4800,4814 ****
return;
}
! /* The size is the number of elements in the array, so multiply by the
! size of an element to get the total size. */
! tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
! size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! size, fold_convert (gfc_array_index_type, tmp));
! /* Allocate memory to hold the data. */
! tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
! gfc_add_modify (&init, decl, tmp);
/* Set offset of the array. */
if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
--- 4809,4838 ----
return;
}
! if (gfc_option.flag_stack_arrays)
! {
! gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
! space = build_decl (sym->declared_at.lb->location,
! VAR_DECL, create_tmp_var_name ("A"),
! TREE_TYPE (TREE_TYPE (decl)));
! gfc_trans_vla_type_sizes (sym, &init);
! }
! else
! {
! /* The size is the number of elements in the array, so multiply by the
! size of an element to get the total size. */
! tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
! size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! size, fold_convert (gfc_array_index_type, tmp));
! /* Allocate memory to hold the data. */
! tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
! gfc_add_modify (&init, decl, tmp);
!
! /* Free the temporary. */
! tmp = gfc_call_free (convert (pvoid_type_node, decl));
! space = NULL_TREE;
! }
/* Set offset of the array. */
if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
*************** gfc_trans_auto_array_allocation (tree de
*** 4817,4826 ****
/* Automatic arrays should not have initializers. */
gcc_assert (!sym->value);
! /* Free the temporary. */
! tmp = gfc_call_free (convert (pvoid_type_node, decl));
! gfc_add_init_cleanup (block, gfc_finish_block (&init), tmp);
}
--- 4841,4866 ----
/* Automatic arrays should not have initializers. */
gcc_assert (!sym->value);
! inittree = gfc_finish_block (&init);
!
! if (space)
! {
! tree addr;
! pushdecl (space);
! /* Don't create new scope, emit the DECL_EXPR in exactly the scope
! where also space is located. */
! gfc_init_block (&init);
! tmp = fold_build1_loc (input_location, DECL_EXPR,
! TREE_TYPE (space), space);
! gfc_add_expr_to_block (&init, tmp);
! addr = fold_build1_loc (sym->declared_at.lb->location,
! ADDR_EXPR, TREE_TYPE (decl), space);
! gfc_add_modify (&init, decl, addr);
! gfc_add_init_cleanup (block, gfc_finish_block (&init), NULL_TREE);
! tmp = NULL_TREE;
! }
! gfc_add_init_cleanup (block, inittree, tmp);
}
Index: Make-lang.in
===================================================================
*** Make-lang.in (revision 172206)
--- Make-lang.in (working copy)
*************** fortran/trans-stmt.o: $(GFORTRAN_TRANS_D
*** 353,359 ****
fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS)
fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
gt-fortran-trans-intrinsic.h
fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
--- 353,359 ----
fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS) $(GIMPLE_H)
fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
gt-fortran-trans-intrinsic.h
fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
Index: gfortran.h
===================================================================
*** gfortran.h (revision 172206)
--- gfortran.h (working copy)
*************** typedef struct
*** 2220,2225 ****
--- 2220,2226 ----
int flag_d_lines;
int gfc_flag_openmp;
int flag_sign_zero;
+ int flag_stack_arrays;
int flag_module_private;
int flag_recursive;
int flag_init_local_zero;
Index: lang.opt
===================================================================
*** lang.opt (revision 172206)
--- lang.opt (working copy)
*************** fmax-stack-var-size=
*** 454,459 ****
--- 454,463 ----
Fortran RejectNegative Joined UInteger
-fmax-stack-var-size=<n> Size in bytes of the largest array that will be put on the stack
+ fstack-arrays
+ Fortran
+ Put all local arrays on stack.
+
fmodule-private
Fortran
Set default accessibility of module entities to PRIVATE.
Index: invoke.texi
===================================================================
*** invoke.texi (revision 172206)
--- invoke.texi (working copy)
*************** and warnings}.
*** 167,172 ****
--- 167,173 ----
-fbounds-check -fcheck-array-temporaries -fmax-array-constructor =@var{n} @gol
-fcheck=@var{<all|array-temps|bounds|do|mem|pointer|recursion>} @gol
-fcoarray=@var{<none|single|lib>} -fmax-stack-var-size=@var{n} @gol
+ -fstack-arrays @gol
-fpack-derived -frepack-arrays -fshort-enums -fexternal-blas @gol
-fblas-matmul-limit=@var{n} -frecursive -finit-local-zero @gol
-finit-integer=@var{n} -finit-real=@var{<zero|inf|-inf|nan|snan>} @gol
*************** Future versions of GNU Fortran may impro
*** 1361,1366 ****
--- 1362,1374 ----
The default value for @var{n} is 32768.
+ @item -fstack-arrays
+ @opindex @code{fstack-arrays}
+ Adding this option will make the fortran compiler put all local arrays,
+ even those of unknown size onto stack memory. If your program uses very
+ large local arrays it's possible that you'll have to extend your runtime
+ limits for stack memory on some operating systems.
+
@item -fpack-derived
@opindex @code{fpack-derived}
@cindex structure packing
Index: options.c
===================================================================
*** options.c (revision 172206)
--- options.c (working copy)
*************** gfc_init_options (unsigned int decoded_o
*** 123,128 ****
--- 123,129 ----
/* Default value of flag_max_stack_var_size is set in gfc_post_options. */
gfc_option.flag_max_stack_var_size = -2;
+ gfc_option.flag_stack_arrays = 0;
gfc_option.flag_range_check = 1;
gfc_option.flag_pack_derived = 0;
*************** gfc_handle_option (size_t scode, const c
*** 783,788 ****
--- 784,793 ----
gfc_option.flag_max_stack_var_size = value;
break;
+ case OPT_fstack_arrays:
+ gfc_option.flag_stack_arrays = value;
+ break;
+
case OPT_fmodule_private:
gfc_option.flag_module_private = value;
break;
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 16:04 ` Michael Matz
@ 2011-04-11 16:45 ` Steven Bosscher
2011-04-12 11:54 ` Michael Matz
2011-04-11 16:46 ` Dominique Dhumieres
` (5 subsequent siblings)
6 siblings, 1 reply; 40+ messages in thread
From: Steven Bosscher @ 2011-04-11 16:45 UTC (permalink / raw)
To: Michael Matz
Cc: Paul Richard Thomas, Dominique Dhumieres, fortran, gcc-patches
On Mon, Apr 11, 2011 at 6:04 PM, Michael Matz <matz@suse.de> wrote:
> On Sat, 9 Apr 2011, Paul Richard Thomas wrote:
>
>> I find that both nf.f90 and capacita.f90 segfault in runtime for any
>> stack size.
>
> Try this patch. I've verified that capacita and nf work with it and
> -march=native -ffast-math -funroll-loops -fstack-arrays -O3 . In fact all
> of polyhedron works for me on these flags. (I've set a ulimit -s of
> 512MB, but I don't know if such a large amount is required).
FWIW, I don't think it is reasonable to require ulimit -s to run
polyhedron. Isn't there a way to put a maximum on the size of the
arrays on stack, e.g. -fstack-arrays-limit= or something like that?
BTW why does trans-array need gimple.h?
Ciao!
Steven
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 16:04 ` Michael Matz
2011-04-11 16:45 ` Steven Bosscher
@ 2011-04-11 16:46 ` Dominique Dhumieres
2011-04-11 16:59 ` H.J. Lu
` (4 subsequent siblings)
6 siblings, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-11 16:46 UTC (permalink / raw)
To: paul.richard.thomas, matz; +Cc: gcc-patches, fortran, dominiq
I am trying the new patch. Updating failed with
../../work/gcc/fortran/trans-array.c: In function 'gfc_trans_auto_array_allocation':
../../work/gcc/fortran/trans-array.c:4881:24: error: 'tmp' may be used uninitialized in this function [-Werror=uninitialized]
cc1: all warnings being treated as errors
I am continuing after having changed
tree tmp;
to
tree tmp = NULL_TREE;
Dominique
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 16:04 ` Michael Matz
2011-04-11 16:45 ` Steven Bosscher
2011-04-11 16:46 ` Dominique Dhumieres
@ 2011-04-11 16:59 ` H.J. Lu
2011-04-11 17:06 ` N.M. Maclaren
2011-04-11 18:28 ` Tobias Burnus
` (3 subsequent siblings)
6 siblings, 1 reply; 40+ messages in thread
From: H.J. Lu @ 2011-04-11 16:59 UTC (permalink / raw)
To: Michael Matz
Cc: Paul Richard Thomas, Dominique Dhumieres, fortran, gcc-patches
On Mon, Apr 11, 2011 at 9:04 AM, Michael Matz <matz@suse.de> wrote:
> On Sat, 9 Apr 2011, Paul Richard Thomas wrote:
>
>> I find that both nf.f90 and capacita.f90 segfault in runtime for any
>> stack size.
>
> Try this patch. I've verified that capacita and nf work with it and
> -march=native -ffast-math -funroll-loops -fstack-arrays -O3 . In fact all
> of polyhedron works for me on these flags. (I've set a ulimit -s of
> 512MB, but I don't know if such a large amount is required).
>
>
Is that possible to raise stack limit when -fstack-arrays is used?
--
H.J.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 16:59 ` H.J. Lu
@ 2011-04-11 17:06 ` N.M. Maclaren
0 siblings, 0 replies; 40+ messages in thread
From: N.M. Maclaren @ 2011-04-11 17:06 UTC (permalink / raw)
To: fortran, gcc-patches
On Apr 11 2011, H.J. Lu wrote:
>
>Is that possible to raise stack limit when -fstack-arrays is used?
In some systems, yes. In others (including most Unices), not without
evil contortions that will assuredly break something else (like
debuggers and some MPIs).
Regards,
Nick Maclaren.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 16:04 ` Michael Matz
` (2 preceding siblings ...)
2011-04-11 16:59 ` H.J. Lu
@ 2011-04-11 18:28 ` Tobias Burnus
2011-04-11 21:18 ` Dominique Dhumieres
` (2 subsequent siblings)
6 siblings, 0 replies; 40+ messages in thread
From: Tobias Burnus @ 2011-04-11 18:28 UTC (permalink / raw)
To: Michael Matz; +Cc: fortran, gcc-patches
On 04/11/2011 06:04 PM, Michael Matz wrote:
> ===================================================================
> *** trans-array.c (revision 172206)
> --- trans-array.c (working copy)
> *************** gfc_trans_auto_array_allocation (tree de
[...]
> ! gfc_add_init_cleanup (block, inittree, tmp);
> }
I get the fatal warning:
.../gcc/fortran/trans-array.c: In function
‘gfc_trans_auto_array_allocation’:
..../gcc/fortran/trans-array.c:4881:24: error: ‘tmp’ may be used
uninitialized in this function [-Werror=uninitialized]
which is the line quoted above.
* * *
Regarding the goal of the patch: I like the idea and I think it is
rather useful. It is unfortunate that running out of stack space causes
segfaults, which is not very user friendly, but I think HPC users - and
in particular Intel compiler users - are used to have "ulimit -s
unlimited" (or some large value), which also HPC admins set. For
instance, on the login node of our HPC cluster (2*quadcore/node; 24 GB
RAM/node, 2200 nodes) one has 3 GB stack space by default (soft and hard
limit) - I think the value on the compute nodes is the same.
Tobias
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 16:04 ` Michael Matz
` (3 preceding siblings ...)
2011-04-11 18:28 ` Tobias Burnus
@ 2011-04-11 21:18 ` Dominique Dhumieres
2011-04-12 6:23 ` Paul Richard Thomas
2011-04-14 20:23 ` Tobias Burnus
6 siblings, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-11 21:18 UTC (permalink / raw)
To: paul.richard.thomas, matz; +Cc: gcc-patches, fortran, dominiq
With the new patch (+the fix for the uninitialized tmp), the polyhedron
tests pass:
================================================================================
Date & Time : 11 Apr 2011 21:20:59
Test Name : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -fstack-arrays -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 300.0
Target Error % : 0.200
Minimum Repeats : 2
Maximum Repeats : 5
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 8.10 54576 8.16 2 0.0000
aermod 172.91 1468672 18.58 2 0.1884
air 25.79 89992 6.81 5 0.1852
capacita 15.01 101384 40.50 2 0.0012
channel 3.43 34448 2.98 5 0.2042
doduc 27.18 212280 27.25 2 0.0220
fatigue 9.87 85264 7.07 2 0.0636
gas_dyn 23.07 135944 4.72 2 0.0636
induct 23.90 185600 14.24 2 0.1194
linpk 2.54 21536 21.74 2 0.0207
mdbx 9.11 84776 12.51 2 0.0080
nf 15.29 87848 19.06 2 0.1862
protein 28.07 155624 35.40 2 0.0268
rnflow 38.21 183696 26.76 2 0.0374
test_fpu 19.94 141720 10.87 2 0.0598
tfft 1.70 22072 3.29 2 0.0759
Geometric Mean Execution Time = 12.33 seconds
================================================================================
Polyhedron Benchmark Validator
Copyright (C) Polyhedron Software Ltd - 2004 - All rights reserved
For comparison the results without -fstack-arrays are
================================================================================
Date & Time : 11 Apr 2011 21:59:38
Test Name : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 300.0
Target Error % : 0.200
Minimum Repeats : 2
Maximum Repeats : 5
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 8.11 54576 8.17 2 0.0061
aermod 173.69 1472520 18.81 2 0.1063
air 25.43 89992 6.79 2 0.0516
capacita 13.47 101344 40.33 2 0.0012
channel 3.11 34448 2.97 5 0.2000
doduc 27.33 212280 27.25 2 0.0239
fatigue 10.08 81128 4.71 2 0.0212
gas_dyn 24.89 144112 4.68 5 0.2030
induct 23.86 185600 14.26 3 0.1813
linpk 2.53 21536 21.75 2 0.0460
mdbx 9.44 84776 12.50 2 0.0920
nf 30.54 120568 28.22 5 0.2760
protein 28.05 155624 35.55 2 0.0928
rnflow 37.60 183696 26.83 2 0.0075
test_fpu 20.20 145816 11.07 2 0.0135
tfft 1.67 22072 3.30 4 0.1569
Geometric Mean Execution Time = 12.33 seconds
================================================================================
Polyhedron Benchmark Validator
Copyright (C) Polyhedron Software Ltd - 2004 - All rights reserved
Valgrind does not report any error for capacita, however it requires
--max-stackframe=8118496 for nf (--max-stackframe=6118496 yields
many errors).
Thanks for the work.
Dominique
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 16:04 ` Michael Matz
` (4 preceding siblings ...)
2011-04-11 21:18 ` Dominique Dhumieres
@ 2011-04-12 6:23 ` Paul Richard Thomas
2011-04-12 6:35 ` Dominique Dhumieres
2011-04-14 20:23 ` Tobias Burnus
6 siblings, 1 reply; 40+ messages in thread
From: Paul Richard Thomas @ 2011-04-12 6:23 UTC (permalink / raw)
To: Michael Matz; +Cc: Dominique Dhumieres, fortran, gcc-patches
Dear Michael,
Thanks for updating the patch. I am afraid that my attention to
gfortran is somewhat limited at present. However, I see that
Dominique has verified your patch and that all is well.
The resulting speed up for nf.f90 is rather remarkable. What specific
feature of the fortran leads to a 30=>15s ?
Cheers
Paul
On Mon, Apr 11, 2011 at 6:04 PM, Michael Matz <matz@suse.de> wrote:
> On Sat, 9 Apr 2011, Paul Richard Thomas wrote:
>
>> I find that both nf.f90 and capacita.f90 segfault in runtime for any
>> stack size.
>
> Try this patch. I've verified that capacita and nf work with it and
> -march=native -ffast-math -funroll-loops -fstack-arrays -O3 . In fact all
> of polyhedron works for me on these flags. (I've set a ulimit -s of
> 512MB, but I don't know if such a large amount is required).
>
>
> Ciao,
> Michael.
>
> * trans-array.c (toplevel): Include gimple.h.
> (gfc_trans_allocate_array_storage): Check flag_stack_arrays,
> properly expand variable length arrays.
> (gfc_trans_auto_array_allocation): If flag_stack_arrays create
> variable length decls and associate them with their scope.
> * gfortran.h (gfc_option_t): Add flag_stack_arrays member.
> * options.c (gfc_init_options): Handle -fstack_arrays option.
> * lang.opt (fstack-arrays): Add option.
> * invoke.texi (Code Gen Options): Document it.
> * Make-lang.in (trans-array.o): Depend on GIMPLE_H.
>
> Index: trans-array.c
> ===================================================================
> *** trans-array.c (revision 172206)
> --- trans-array.c (working copy)
> *************** along with GCC; see the file COPYING3.
> *** 81,86 ****
> --- 81,87 ----
> #include "system.h"
> #include "coretypes.h"
> #include "tree.h"
> + #include "gimple.h"
> #include "diagnostic-core.h" /* For internal_error/fatal_error. */
> #include "flags.h"
> #include "gfortran.h"
> *************** gfc_trans_allocate_array_storage (stmtbl
> *** 630,647 ****
> {
> /* Allocate the temporary. */
> onstack = !dynamic && initial == NULL_TREE
> ! && gfc_can_put_var_on_stack (size);
>
> if (onstack)
> {
> /* Make a temporary variable to hold the data. */
> tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
> nelem, gfc_index_one_node);
> tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
> tmp);
> tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
> tmp);
> tmp = gfc_create_var (tmp, "A");
> tmp = gfc_build_addr_expr (NULL_TREE, tmp);
> gfc_conv_descriptor_data_set (pre, desc, tmp);
> }
> --- 631,654 ----
> {
> /* Allocate the temporary. */
> onstack = !dynamic && initial == NULL_TREE
> ! && (gfc_option.flag_stack_arrays
> ! || gfc_can_put_var_on_stack (size));
>
> if (onstack)
> {
> /* Make a temporary variable to hold the data. */
> tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
> nelem, gfc_index_one_node);
> + tmp = gfc_evaluate_now (tmp, pre);
> tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
> tmp);
> tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
> tmp);
> tmp = gfc_create_var (tmp, "A");
> + gfc_add_expr_to_block (pre,
> + fold_build1_loc (input_location,
> + DECL_EXPR, TREE_TYPE (tmp),
> + tmp));
> tmp = gfc_build_addr_expr (NULL_TREE, tmp);
> gfc_conv_descriptor_data_set (pre, desc, tmp);
> }
> *************** gfc_trans_auto_array_allocation (tree de
> *** 4744,4749 ****
> --- 4751,4758 ----
> tree tmp;
> tree size;
> tree offset;
> + tree space;
> + tree inittree;
> bool onstack;
>
> gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
> *************** gfc_trans_auto_array_allocation (tree de
> *** 4800,4814 ****
> return;
> }
>
> ! /* The size is the number of elements in the array, so multiply by the
> ! size of an element to get the total size. */
> ! tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
> ! size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
> ! size, fold_convert (gfc_array_index_type, tmp));
>
> ! /* Allocate memory to hold the data. */
> ! tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
> ! gfc_add_modify (&init, decl, tmp);
>
> /* Set offset of the array. */
> if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
> --- 4809,4838 ----
> return;
> }
>
> ! if (gfc_option.flag_stack_arrays)
> ! {
> ! gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
> ! space = build_decl (sym->declared_at.lb->location,
> ! VAR_DECL, create_tmp_var_name ("A"),
> ! TREE_TYPE (TREE_TYPE (decl)));
> ! gfc_trans_vla_type_sizes (sym, &init);
> ! }
> ! else
> ! {
> ! /* The size is the number of elements in the array, so multiply by the
> ! size of an element to get the total size. */
> ! tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
> ! size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
> ! size, fold_convert (gfc_array_index_type, tmp));
>
> ! /* Allocate memory to hold the data. */
> ! tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
> ! gfc_add_modify (&init, decl, tmp);
> !
> ! /* Free the temporary. */
> ! tmp = gfc_call_free (convert (pvoid_type_node, decl));
> ! space = NULL_TREE;
> ! }
>
> /* Set offset of the array. */
> if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
> *************** gfc_trans_auto_array_allocation (tree de
> *** 4817,4826 ****
> /* Automatic arrays should not have initializers. */
> gcc_assert (!sym->value);
>
> ! /* Free the temporary. */
> ! tmp = gfc_call_free (convert (pvoid_type_node, decl));
>
> ! gfc_add_init_cleanup (block, gfc_finish_block (&init), tmp);
> }
>
>
> --- 4841,4866 ----
> /* Automatic arrays should not have initializers. */
> gcc_assert (!sym->value);
>
> ! inittree = gfc_finish_block (&init);
> !
> ! if (space)
> ! {
> ! tree addr;
> ! pushdecl (space);
>
> ! /* Don't create new scope, emit the DECL_EXPR in exactly the scope
> ! where also space is located. */
> ! gfc_init_block (&init);
> ! tmp = fold_build1_loc (input_location, DECL_EXPR,
> ! TREE_TYPE (space), space);
> ! gfc_add_expr_to_block (&init, tmp);
> ! addr = fold_build1_loc (sym->declared_at.lb->location,
> ! ADDR_EXPR, TREE_TYPE (decl), space);
> ! gfc_add_modify (&init, decl, addr);
> ! gfc_add_init_cleanup (block, gfc_finish_block (&init), NULL_TREE);
> ! tmp = NULL_TREE;
> ! }
> ! gfc_add_init_cleanup (block, inittree, tmp);
> }
>
>
> Index: Make-lang.in
> ===================================================================
> *** Make-lang.in (revision 172206)
> --- Make-lang.in (working copy)
> *************** fortran/trans-stmt.o: $(GFORTRAN_TRANS_D
> *** 353,359 ****
> fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
> fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
> fortran/ioparm.def
> ! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS)
> fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
> gt-fortran-trans-intrinsic.h
> fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
> --- 353,359 ----
> fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
> fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
> fortran/ioparm.def
> ! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS) $(GIMPLE_H)
> fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
> gt-fortran-trans-intrinsic.h
> fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
> Index: gfortran.h
> ===================================================================
> *** gfortran.h (revision 172206)
> --- gfortran.h (working copy)
> *************** typedef struct
> *** 2220,2225 ****
> --- 2220,2226 ----
> int flag_d_lines;
> int gfc_flag_openmp;
> int flag_sign_zero;
> + int flag_stack_arrays;
> int flag_module_private;
> int flag_recursive;
> int flag_init_local_zero;
> Index: lang.opt
> ===================================================================
> *** lang.opt (revision 172206)
> --- lang.opt (working copy)
> *************** fmax-stack-var-size=
> *** 454,459 ****
> --- 454,463 ----
> Fortran RejectNegative Joined UInteger
> -fmax-stack-var-size=<n> Size in bytes of the largest array that will be put on the stack
>
> + fstack-arrays
> + Fortran
> + Put all local arrays on stack.
> +
> fmodule-private
> Fortran
> Set default accessibility of module entities to PRIVATE.
> Index: invoke.texi
> ===================================================================
> *** invoke.texi (revision 172206)
> --- invoke.texi (working copy)
> *************** and warnings}.
> *** 167,172 ****
> --- 167,173 ----
> -fbounds-check -fcheck-array-temporaries -fmax-array-constructor =@var{n} @gol
> -fcheck=@var{<all|array-temps|bounds|do|mem|pointer|recursion>} @gol
> -fcoarray=@var{<none|single|lib>} -fmax-stack-var-size=@var{n} @gol
> + -fstack-arrays @gol
> -fpack-derived -frepack-arrays -fshort-enums -fexternal-blas @gol
> -fblas-matmul-limit=@var{n} -frecursive -finit-local-zero @gol
> -finit-integer=@var{n} -finit-real=@var{<zero|inf|-inf|nan|snan>} @gol
> *************** Future versions of GNU Fortran may impro
> *** 1361,1366 ****
> --- 1362,1374 ----
>
> The default value for @var{n} is 32768.
>
> + @item -fstack-arrays
> + @opindex @code{fstack-arrays}
> + Adding this option will make the fortran compiler put all local arrays,
> + even those of unknown size onto stack memory. If your program uses very
> + large local arrays it's possible that you'll have to extend your runtime
> + limits for stack memory on some operating systems.
> +
> @item -fpack-derived
> @opindex @code{fpack-derived}
> @cindex structure packing
> Index: options.c
> ===================================================================
> *** options.c (revision 172206)
> --- options.c (working copy)
> *************** gfc_init_options (unsigned int decoded_o
> *** 123,128 ****
> --- 123,129 ----
>
> /* Default value of flag_max_stack_var_size is set in gfc_post_options. */
> gfc_option.flag_max_stack_var_size = -2;
> + gfc_option.flag_stack_arrays = 0;
>
> gfc_option.flag_range_check = 1;
> gfc_option.flag_pack_derived = 0;
> *************** gfc_handle_option (size_t scode, const c
> *** 783,788 ****
> --- 784,793 ----
> gfc_option.flag_max_stack_var_size = value;
> break;
>
> + case OPT_fstack_arrays:
> + gfc_option.flag_stack_arrays = value;
> + break;
> +
> case OPT_fmodule_private:
> gfc_option.flag_module_private = value;
> break;
>
--
The knack of flying is learning how to throw yourself at the ground and miss.
--Hitchhikers Guide to the Galaxy
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-12 6:23 ` Paul Richard Thomas
@ 2011-04-12 6:35 ` Dominique Dhumieres
2011-04-13 9:42 ` Paul Richard Thomas
2011-04-14 13:59 ` Michael Matz
0 siblings, 2 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-12 6:35 UTC (permalink / raw)
To: paul.richard.thomas, matz; +Cc: gcc-patches, fortran, dominiq
> The resulting speed up for nf.f90 is rather remarkable. What specific
> feature of the fortran leads to a 30=>15s ?
I think it is the automatic array in the subroutine trisolve. Note that the
speedup is rather 27->19s and may be darwin specific (slow malloc).
Note also that -fstack-arrays prevents some optimizations on
fatigue: 4.7->7s. This may be related to pr45810.
Dominique
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 16:45 ` Steven Bosscher
@ 2011-04-12 11:54 ` Michael Matz
2011-04-12 12:42 ` N.M. Maclaren
0 siblings, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-12 11:54 UTC (permalink / raw)
To: Steven Bosscher
Cc: Paul Richard Thomas, Dominique Dhumieres, fortran, gcc-patches
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1456 bytes --]
Hello,
On Mon, 11 Apr 2011, Steven Bosscher wrote:
> > Try this patch. Â I've verified that capacita and nf work with it and
> > -march=native -ffast-math -funroll-loops -fstack-arrays -O3 . Â In fact all
> > of polyhedron works for me on these flags. Â (I've set a ulimit -s of
> > 512MB, but I don't know if such a large amount is required).
>
> FWIW, I don't think it is reasonable to require ulimit -s to run
> polyhedron.
Actually only nf requires about 16MB of stack size, all other benchmarks
run fine with 8MB (which is the default for my systems). Note that ICC
compiled binaries have the same behaviour and also require the user to set
ulimits.
> Isn't there a way to put a maximum on the size of the arrays on stack,
> e.g. -fstack-arrays-limit= or something like that?
Not without generating contorded code. The problem is that these arrays
are variable length, hence runtime tests would be required, branching to
malloc or stack_save/alloca, and also for the deallocation side, requiring
a flag for branching to free or stack_restore. And as we use the variable
length facilities this is actually not controllable in such a way, the
frontend doesn't generate alloca calls at all.
Nope, it's either all or nothing, like with ICC.
> BTW why does trans-array need gimple.h?
create_tmp_var_name. gfc_create_var_np doesn't take a location and at
that point input_location is already at the end of file.
Ciao,
Michael.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-12 11:54 ` Michael Matz
@ 2011-04-12 12:42 ` N.M. Maclaren
0 siblings, 0 replies; 40+ messages in thread
From: N.M. Maclaren @ 2011-04-12 12:42 UTC (permalink / raw)
To: fortran, gcc-patches
On Apr 12 2011, Michael Matz wrote:
>On Mon, 11 Apr 2011, Steven Bosscher wrote:
>
>> Isn't there a way to put a maximum on the size of the arrays on stack,
>> e.g. -fstack-arrays-limit= or something like that?
>
>Not without generating contorded code. The problem is that these arrays
>are variable length, hence runtime tests would be required, branching to
>malloc or stack_save/alloca, and also for the deallocation side, requiring
>a flag for branching to free or stack_restore. And as we use the variable
>length facilities this is actually not controllable in such a way, the
>frontend doesn't generate alloca calls at all.
It would almost certainly be cleaner and simpler to use a secondary
stack than to have such dynamic testing. It is very similar (conceptually)
to the dynamic code that switches between inline and library code for
sizes of DO-loop or array on some specialist HPC compilers. And that gets
very messy, very fast.
Regards,
Nick Maclaren.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-12 6:35 ` Dominique Dhumieres
@ 2011-04-13 9:42 ` Paul Richard Thomas
2011-04-13 10:38 ` Richard Guenther
2011-04-14 13:59 ` Michael Matz
1 sibling, 1 reply; 40+ messages in thread
From: Paul Richard Thomas @ 2011-04-13 9:42 UTC (permalink / raw)
To: Dominique Dhumieres; +Cc: matz, gcc-patches, fortran
Dear Dominique,
> I think it is the automatic array in the subroutine trisolve. Note that the
> speedup is rather 27->19s and may be darwin specific (slow malloc).
I saw a speed-up of similar order on FC9/x86_64.
I strongly doubt that it is anything to do with the automatic array -
if it is there is an error somewhere, since none of the references to
trisolve need copy-in/copy-out.
>
> Note also that -fstack-arrays prevents some optimizations on
> fatigue: 4.7->7s. This may be related to pr45810.
Has PR45810 converged now? If I have understood properly, a patch has
been devised that cures the problem and does not cause slow-downs
anywhere else?
Cheers
Paul
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-13 9:42 ` Paul Richard Thomas
@ 2011-04-13 10:38 ` Richard Guenther
2011-04-13 11:13 ` Richard Guenther
2011-04-13 13:46 ` Paul Richard Thomas
0 siblings, 2 replies; 40+ messages in thread
From: Richard Guenther @ 2011-04-13 10:38 UTC (permalink / raw)
To: Paul Richard Thomas; +Cc: Dominique Dhumieres, matz, gcc-patches, fortran
On Wed, Apr 13, 2011 at 11:42 AM, Paul Richard Thomas
<paul.richard.thomas@gmail.com> wrote:
> Dear Dominique,
>
>> I think it is the automatic array in the subroutine trisolve. Note that the
>> speedup is rather 27->19s and may be darwin specific (slow malloc).
>
> I saw a speed-up of similar order on FC9/x86_64.
>
> I strongly doubt that it is anything to do with the automatic array -
> if it is there is an error somewhere, since none of the references to
> trisolve need copy-in/copy-out.
>
>>
>> Note also that -fstack-arrays prevents some optimizations on
>> fatigue: 4.7->7s. This may be related to pr45810.
>
> Has PR45810 converged now? If I have understood properly, a patch has
> been devised that cures the problem and does not cause slow-downs
> anywhere else?
VLAs and malloc based arrays may behave differently with respect to alias
analysis (I'd have to look at some examples). All effects other than malloc
overhead I would attribute to that. That said, the general idea of the patch
is sound and I see nothing wrong with it. Both performance improvements
and regressions are worth looking at - they hint at possible improvements
in the middle-end parts of the compiler.
Richard.
> Cheers
>
> Paul
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-13 10:38 ` Richard Guenther
@ 2011-04-13 11:13 ` Richard Guenther
2011-04-14 13:32 ` Dominique Dhumieres
2011-04-13 13:46 ` Paul Richard Thomas
1 sibling, 1 reply; 40+ messages in thread
From: Richard Guenther @ 2011-04-13 11:13 UTC (permalink / raw)
To: Paul Richard Thomas; +Cc: Dominique Dhumieres, matz, gcc-patches, fortran
On Wed, Apr 13, 2011 at 12:38 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Wed, Apr 13, 2011 at 11:42 AM, Paul Richard Thomas
> <paul.richard.thomas@gmail.com> wrote:
>> Dear Dominique,
>>
>>> I think it is the automatic array in the subroutine trisolve. Note that the
>>> speedup is rather 27->19s and may be darwin specific (slow malloc).
>>
>> I saw a speed-up of similar order on FC9/x86_64.
>>
>> I strongly doubt that it is anything to do with the automatic array -
>> if it is there is an error somewhere, since none of the references to
>> trisolve need copy-in/copy-out.
>>
>>>
>>> Note also that -fstack-arrays prevents some optimizations on
>>> fatigue: 4.7->7s. This may be related to pr45810.
>>
>> Has PR45810 converged now? If I have understood properly, a patch has
>> been devised that cures the problem and does not cause slow-downs
>> anywhere else?
>
> VLAs and malloc based arrays may behave differently with respect to alias
> analysis (I'd have to look at some examples). All effects other than malloc
> overhead I would attribute to that. That said, the general idea of the patch
> is sound and I see nothing wrong with it. Both performance improvements
> and regressions are worth looking at - they hint at possible improvements
> in the middle-end parts of the compiler.
I have opened PR48590 for at least one issue that I see. It has a very
simple patch which you might want to try ...
Richard.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-13 10:38 ` Richard Guenther
2011-04-13 11:13 ` Richard Guenther
@ 2011-04-13 13:46 ` Paul Richard Thomas
1 sibling, 0 replies; 40+ messages in thread
From: Paul Richard Thomas @ 2011-04-13 13:46 UTC (permalink / raw)
To: Richard Guenther; +Cc: Dominique Dhumieres, matz, gcc-patches, fortran
Dear Richard and Dominique,
> VLAs and malloc based arrays may behave differently with respect to alias
> analysis (I'd have to look at some examples). All effects other than malloc
- hmmm, yes. I was forgetting what might happen with inlining. It's
not evident, is it?
> overhead I would attribute to that. That said, the general idea of the patch
> is sound and I see nothing wrong with it. Both performance improvements
> and regressions are worth looking at - they hint at possible improvements
> in the middle-end parts of the compiler.
I absolutely agree. I realise my input to this thread is possibly a
bit negative but this is far from my reaction to the general idea and
what is evidently within our reach.
Cheers
Paul
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-13 11:13 ` Richard Guenther
@ 2011-04-14 13:32 ` Dominique Dhumieres
0 siblings, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-14 13:32 UTC (permalink / raw)
To: richard.guenther, paul.richard.thomas; +Cc: matz, gcc-patches, fortran, dominiq
> I have opened PR48590 for at least one issue that I see. ...
The fix for pr48590 (r172427) does not speedup fatigue.f90 compiled
with -fstack-arrays.
Dominique
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-12 6:35 ` Dominique Dhumieres
2011-04-13 9:42 ` Paul Richard Thomas
@ 2011-04-14 13:59 ` Michael Matz
2011-04-14 15:28 ` Michael Matz
1 sibling, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-14 13:59 UTC (permalink / raw)
To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran
Hi,
On Tue, 12 Apr 2011, Dominique Dhumieres wrote:
> > The resulting speed up for nf.f90 is rather remarkable. What specific
> > feature of the fortran leads to a 30=>15s ?
>
> I think it is the automatic array in the subroutine trisolve. Note that the
> speedup is rather 27->19s and may be darwin specific (slow malloc).
>
> Note also that -fstack-arrays prevents some optimizations on
> fatigue: 4.7->7s. This may be related to pr45810.
That's the effect of -finline-limit=600 that you use it seems. For me
(opteron 2356) fatigue behaves like this: (base options: -march=native
-ffast-math -funroll-loops -O3)
no stack-arrays with stack-arrays
no addtional options: 10.2s 8.8s
+ -fwhole-program: 7.1s 8.8s
+ -fwhole-program -flto: 10.1s 8.9s
+ -fwhole-program -flto -finline-limit=600 4.8s 8.7s
The perdida subroutine isn't inlined with -fstack-arrays, although it's
called only once. The dump report doesn't reveal any reasons, perdida
doesn't seem to be in the candidate list for the called-once functions.
Ciao,
Michael.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-14 13:59 ` Michael Matz
@ 2011-04-14 15:28 ` Michael Matz
2011-04-14 19:29 ` Dominique Dhumieres
2011-04-15 2:01 ` Dominique Dhumieres
0 siblings, 2 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-14 15:28 UTC (permalink / raw)
To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran
Hi,
On Thu, 14 Apr 2011, Michael Matz wrote:
> no stack-arrays with stack-arrays
> no addtional options: 10.2s 8.8s
> + -fwhole-program: 7.1s 8.8s
> + -fwhole-program -flto: 10.1s 8.9s
> + -fwhole-program -flto -finline-limit=600 4.8s 8.7s
>
> The perdida subroutine isn't inlined with -fstack-arrays, although it's
> called only once. The dump report doesn't reveal any reasons, perdida
> doesn't seem to be in the candidate list for the called-once functions.
See http://gcc.gnu.org/ml/gcc-patches/2011-04/msg01087.html . With that
patch, -fwhole-program -finline-limit=600 and stack arrays I get down to
3.6s.
Ciao,
Michael.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-14 15:28 ` Michael Matz
@ 2011-04-14 19:29 ` Dominique Dhumieres
2011-04-15 2:01 ` Dominique Dhumieres
1 sibling, 0 replies; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-14 19:29 UTC (permalink / raw)
To: matz, dominiq; +Cc: paul.richard.thomas, gcc-patches, fortran
> See http://gcc.gnu.org/ml/gcc-patches/2011-04/msg01087.html . ...
With this patch on top of revision 172429 (and the second patch for
-fstack-arrays) I now get
================================================================================
Date & Time : 14 Apr 2011 20:48:24
Test Name : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -fstack-arrays -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 300.0
Target Error % : 0.200
Minimum Repeats : 2
Maximum Repeats : 5
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 8.12 54576 8.12 2 0.0062
aermod 178.02 1476528 18.62 2 0.1531
air 25.70 89992 6.82 5 0.5056
capacita 13.54 105440 40.10 2 0.0287
channel 3.12 34448 2.94 5 0.0849
doduc 27.50 212280 27.18 2 0.0754
fatigue 10.14 77032 2.76 4 0.1889
gas_dyn 24.89 144112 4.67 2 0.1927
induct 24.24 189696 14.22 2 0.0738
linpk 2.56 21536 21.69 2 0.0507
mdbx 9.17 84776 12.55 2 0.0120
nf 34.27 124640 18.41 4 0.1577
protein 28.37 155624 35.37 2 0.0170
rnflow 38.31 183696 26.73 2 0.0224
test_fpu 20.22 141720 10.85 2 0.0921
tfft 1.71 22072 3.29 2 0.0152
Geometric Mean Execution Time = 11.57 seconds
================================================================================
Polyhedron Benchmark Validator
Copyright (C) Polyhedron Software Ltd - 2004 - All rights reserved
Pretty spectacular!-)
Thanks for the work.
Dominique
PS. for the final patch for -fstack-array, don't forget to initialize tmp.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-11 16:04 ` Michael Matz
` (5 preceding siblings ...)
2011-04-12 6:23 ` Paul Richard Thomas
@ 2011-04-14 20:23 ` Tobias Burnus
2011-04-14 20:30 ` Tobias Burnus
6 siblings, 1 reply; 40+ messages in thread
From: Tobias Burnus @ 2011-04-14 20:23 UTC (permalink / raw)
To: Michael Matz; +Cc: fortran, gcc-patches
Michael Matz wrote:
> Try this patch. I've verified that capacita and nf work with it and
> -march=native -ffast-math -funroll-loops -fstack-arrays -O3 . In fact all
> of polyhedron works for me on these flags. (I've set a ulimit -s of
> 512MB, but I don't know if such a large amount is required).
The patch is OK.
* As mentioned a couple of times, you need a "tmp = NULL_TREE" in
gfc_trans_auto_array_allocation - either when "tmp" is declared or in
the "if (...flag_stack_arrays)" block.
* Can you also update http://gcc.gnu.org/wiki/GFortran#GCC4.7 and/or
http://gcc.gnu.org/gcc-4.7/changes.html#fortran ?
Thanks a lot for your patch(es). As Dominique has remarked, the results
are "Pretty spectacular". (Twice as fast for fatigue and the geometric
mean of the Polyhedron test improves by 6%.)
* * *
Side remark: For 4.7 I now hope that inlining gets better tuned for
Fortran as the is room for improvement; quoting your numbers for fatigue:
no stack-arrays with stack-arrays
+ -fwhole-program -flto: 10.1s 8.9s
+ -fwhole-program -flto -finline-limit=600 4.8s 3.6s
Thus, the program is twice as fast with the special inline setting. But
as users will typically never set more than "-O3 -march=native
-ffast-math" (and maybe: -funroll-loops), the inline opportunity is lost
in real life.
Tobias
> * trans-array.c (toplevel): Include gimple.h.
> (gfc_trans_allocate_array_storage): Check flag_stack_arrays,
> properly expand variable length arrays.
> (gfc_trans_auto_array_allocation): If flag_stack_arrays create
> variable length decls and associate them with their scope.
> * gfortran.h (gfc_option_t): Add flag_stack_arrays member.
> * options.c (gfc_init_options): Handle -fstack_arrays option.
> * lang.opt (fstack-arrays): Add option.
> * invoke.texi (Code Gen Options): Document it.
> * Make-lang.in (trans-array.o): Depend on GIMPLE_H.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-14 20:23 ` Tobias Burnus
@ 2011-04-14 20:30 ` Tobias Burnus
0 siblings, 0 replies; 40+ messages in thread
From: Tobias Burnus @ 2011-04-14 20:30 UTC (permalink / raw)
To: Michael Matz; +Cc: fortran, gcc-patches
Tobias Burnus wrote:
> no stack-arrays with stack-arrays
> + -fwhole-program -flto: 10.1s 8.9s
> + -fwhole-program -flto -finline-limit=600 4.8s 3.6s
I wonder whether the following is special to my system* or generally
true. I use:
gfortran -O3 -march=native -ffast-math -funroll-loops -fwhole-program
-finline-limit=600 fatigue.f90
no stack-arrays with stack-arrays
0m6.622s 0m8.174s
-flto 0m8.444s 0m8.174s
Thus, the non "-flto" version is faster (in particular without stack
arrays). I assume that it has to do with the declaration issues of the
front end. However, the last time I tried to find the problem, I failed
to spot anything which looked wrong - especially, the UIDs seemed to be
OK. (Besides, I would like to have the -fno-lto performance also with
.-flto ... ;-)
Tobias
* AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (2.4 GHz), x86-64 Linux
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-14 15:28 ` Michael Matz
2011-04-14 19:29 ` Dominique Dhumieres
@ 2011-04-15 2:01 ` Dominique Dhumieres
2011-04-15 12:38 ` Michael Matz
1 sibling, 1 reply; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-15 2:01 UTC (permalink / raw)
To: matz, dominiq; +Cc: paul.richard.thomas, gcc-patches, fortran
I have forgotten to mentionned that I have a variant of fatigue
in which I have done the inlining manually along with few other
optimizations and the timing for it is
[macbook] lin/test% gfc -Ofast fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.793u 0.002s 0:02.79 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.680u 0.003s 0:02.68 100.0% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -flto fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.671u 0.002s 0:02.67 100.0% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -fstack-arrays fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.680u 0.003s 0:02.68 100.0% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program -flto -fstack-arrays fatigue_v8.f90
[macbook] lin/test% time a.out > /dev/null
2.677u 0.003s 0:02.68 99.6% 0+0k 0+0io 0pf+0w
So the timing of the original code with
-Ofast -finline-limit=600 -fwhole-program -flto -fstack-arrays
is quite close to this lower bound.
I have also looked at the failure for gfortran.dg/elemental_dependency_1.f90
and it seems due to a spurious integer(kind=4) A.37[4]; (and friends) in
integer(kind=8) D.1674;
integer(kind=4) A.37[4];
struct array1_integer(kind=4) atmp.36;
void * D.1669;
integer(kind=8) D.1668;
struct array1_integer(kind=4) parm.35;
parm.35.dtype = 265;
parm.35.dim[0].lbound = 1;
parm.35.dim[0].ubound = 4;
parm.35.dim[0].stride = 1;
parm.35.data = (void *) &a[1];
parm.35.offset = -2;
atmp.36.dtype = 265;
atmp.36.dim[0].stride = 1;
atmp.36.dim[0].lbound = 0;
atmp.36.dim[0].ubound = 3;
integer(kind=4) A.37[4];
atmp.36.data = (void * restrict) &A.37;
compared to
integer(kind=8) D.1658;
integer(kind=4) A.37[4];
struct array1_integer(kind=4) atmp.36;
void * D.1653;
integer(kind=8) D.1652;
struct array1_integer(kind=4) parm.35;
parm.35.dtype = 265;
parm.35.dim[0].lbound = 1;
parm.35.dim[0].ubound = 4;
parm.35.dim[0].stride = 1;
parm.35.data = (void *) &a[1];
parm.35.offset = -2;
atmp.36.dtype = 265;
atmp.36.dim[0].stride = 1;
atmp.36.dim[0].lbound = 0;
atmp.36.dim[0].ubound = 3;
atmp.36.data = (void * restrict) &A.37;
Note that this is without the -fstack-arrays option.
Same thing for gfortran.dg/vector_subscript_4.f90.
Dominique
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-15 2:01 ` Dominique Dhumieres
@ 2011-04-15 12:38 ` Michael Matz
2011-04-15 14:06 ` Dominique Dhumieres
0 siblings, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-15 12:38 UTC (permalink / raw)
To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran
Hi,
On Thu, 14 Apr 2011, Dominique Dhumieres wrote:
> I have forgotten to mentionned that I have a variant of fatigue
> in which I have done the inlining manually along with few other
> optimizations and the timing for it is
>
> [macbook] lin/test% gfc -Ofast fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.793u 0.002s 0:02.79 100.0% 0+0k 0+1io 0pf+0w
> [macbook] lin/test% gfc -Ofast -fwhole-program fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.680u 0.003s 0:02.68 100.0% 0+0k 0+2io 0pf+0w
> [macbook] lin/test% gfc -Ofast -fwhole-program -flto fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.671u 0.002s 0:02.67 100.0% 0+0k 0+2io 0pf+0w
> [macbook] lin/test% gfc -Ofast -fwhole-program -fstack-arrays fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.680u 0.003s 0:02.68 100.0% 0+0k 0+2io 0pf+0w
> [macbook] lin/test% gfc -Ofast -fwhole-program -flto -fstack-arrays fatigue_v8.f90
> [macbook] lin/test% time a.out > /dev/null
> 2.677u 0.003s 0:02.68 99.6% 0+0k 0+0io 0pf+0w
>
> So the timing of the original code with
> -Ofast -finline-limit=600 -fwhole-program -flto -fstack-arrays
> is quite close to this lower bound.
>
> I have also looked at the failure for gfortran.dg/elemental_dependency_1.f90
> and it seems due to a spurious integer(kind=4) A.37[4]; (and friends) in
Yes, this is due to the DECL_EXPR statement which is rendered by the
dumper just the same as a normal decl. The testcase looks for exactly one
such decl, but with -fstack-arrays there are exactly two for each such
array.
> integer(kind=4) A.37[4];
So, this is the normal decl.
> atmp.36.dim[0].ubound = 3;
> integer(kind=4) A.37[4];
And this is the DECL_EXPR statement, which then actually is transformed
into the stack_save/alloca/stack_restore sequence.
Ciao,
Michael.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-15 12:38 ` Michael Matz
@ 2011-04-15 14:06 ` Dominique Dhumieres
2011-04-15 15:19 ` Michael Matz
0 siblings, 1 reply; 40+ messages in thread
From: Dominique Dhumieres @ 2011-04-15 14:06 UTC (permalink / raw)
To: matz, dominiq; +Cc: paul.richard.thomas, gcc-patches, fortran
Michael,
> Yes, this is due to the DECL_EXPR statement which is rendered by the
> dumper just the same as a normal decl. The testcase looks for exactly one
> such decl, but with -fstack-arrays there are exactly two for each such
> array.
The testsuite is run without -fstack-arrays, so I dont' understand why
the "DECL_EXPR statement" appears.
Dominique
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-15 14:06 ` Dominique Dhumieres
@ 2011-04-15 15:19 ` Michael Matz
2011-04-15 15:26 ` Jerry DeLisle
0 siblings, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-15 15:19 UTC (permalink / raw)
To: Dominique Dhumieres; +Cc: paul.richard.thomas, gcc-patches, fortran
Hi,
On Fri, 15 Apr 2011, Dominique Dhumieres wrote:
> Michael,
>
> > Yes, this is due to the DECL_EXPR statement which is rendered by the
> > dumper just the same as a normal decl. The testcase looks for exactly one
> > such decl, but with -fstack-arrays there are exactly two for each such
> > array.
>
> The testsuite is run without -fstack-arrays, so I dont' understand why
> the "DECL_EXPR statement" appears.
Bummer, you're right. I unconditionally emit a DECL_EXPR for arrays even
when they don't have a variable length. It's harmless, but makes the
testcase fail (I wasn't seeing the fail because I've changed the testcase
already to make it not fail with -fstack-arrays).
I'll make the DECL_EXPR conditional on the size being variable. As Tobias
already okayed the patch I'm planning to check in the slightly modified
variant as below, after a new round of testing.
Ciao,
Michael.
* trans-array.c (toplevel): Include gimple.h.
(gfc_trans_allocate_array_storage): Check flag_stack_arrays,
properly expand variable length arrays.
(gfc_trans_auto_array_allocation): If flag_stack_arrays create
variable length decls and associate them with their scope.
* gfortran.h (gfc_option_t): Add flag_stack_arrays member.
* options.c (gfc_init_options): Handle -fstack_arrays option.
* lang.opt (fstack-arrays): Add option.
* invoke.texi (Code Gen Options): Document it.
* Make-lang.in (trans-array.o): Depend on GIMPLE_H.
Index: fortran/trans-array.c
===================================================================
*** fortran/trans-array.c (revision 172431)
--- fortran/trans-array.c (working copy)
*************** along with GCC; see the file COPYING3.
*** 81,86 ****
--- 81,87 ----
#include "system.h"
#include "coretypes.h"
#include "tree.h"
+ #include "gimple.h"
#include "diagnostic-core.h" /* For internal_error/fatal_error. */
#include "flags.h"
#include "gfortran.h"
*************** gfc_trans_allocate_array_storage (stmtbl
*** 630,647 ****
{
/* Allocate the temporary. */
onstack = !dynamic && initial == NULL_TREE
! && gfc_can_put_var_on_stack (size);
if (onstack)
{
/* Make a temporary variable to hold the data. */
tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
nelem, gfc_index_one_node);
tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
tmp);
tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
tmp);
tmp = gfc_create_var (tmp, "A");
tmp = gfc_build_addr_expr (NULL_TREE, tmp);
gfc_conv_descriptor_data_set (pre, desc, tmp);
}
--- 631,657 ----
{
/* Allocate the temporary. */
onstack = !dynamic && initial == NULL_TREE
! && (gfc_option.flag_stack_arrays
! || gfc_can_put_var_on_stack (size));
if (onstack)
{
/* Make a temporary variable to hold the data. */
tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
nelem, gfc_index_one_node);
+ tmp = gfc_evaluate_now (tmp, pre);
tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
tmp);
tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
tmp);
tmp = gfc_create_var (tmp, "A");
+ /* If we're here only because of -fstack-arrays we have to
+ emit a DECL_EXPR to make the gimplifier emit alloca calls. */
+ if (!gfc_can_put_var_on_stack (size))
+ gfc_add_expr_to_block (pre,
+ fold_build1_loc (input_location,
+ DECL_EXPR, TREE_TYPE (tmp),
+ tmp));
tmp = gfc_build_addr_expr (NULL_TREE, tmp);
gfc_conv_descriptor_data_set (pre, desc, tmp);
}
*************** gfc_trans_auto_array_allocation (tree de
*** 4759,4767 ****
{
stmtblock_t init;
tree type;
! tree tmp;
tree size;
tree offset;
bool onstack;
gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
--- 4769,4779 ----
{
stmtblock_t init;
tree type;
! tree tmp = NULL_TREE;
tree size;
tree offset;
+ tree space;
+ tree inittree;
bool onstack;
gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
*************** gfc_trans_auto_array_allocation (tree de
*** 4818,4832 ****
return;
}
! /* The size is the number of elements in the array, so multiply by the
! size of an element to get the total size. */
! tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
! size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! size, fold_convert (gfc_array_index_type, tmp));
! /* Allocate memory to hold the data. */
! tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
! gfc_add_modify (&init, decl, tmp);
/* Set offset of the array. */
if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
--- 4830,4859 ----
return;
}
! if (gfc_option.flag_stack_arrays)
! {
! gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
! space = build_decl (sym->declared_at.lb->location,
! VAR_DECL, create_tmp_var_name ("A"),
! TREE_TYPE (TREE_TYPE (decl)));
! gfc_trans_vla_type_sizes (sym, &init);
! }
! else
! {
! /* The size is the number of elements in the array, so multiply by the
! size of an element to get the total size. */
! tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
! size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! size, fold_convert (gfc_array_index_type, tmp));
! /* Allocate memory to hold the data. */
! tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
! gfc_add_modify (&init, decl, tmp);
!
! /* Free the temporary. */
! tmp = gfc_call_free (convert (pvoid_type_node, decl));
! space = NULL_TREE;
! }
/* Set offset of the array. */
if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
*************** gfc_trans_auto_array_allocation (tree de
*** 4835,4844 ****
/* Automatic arrays should not have initializers. */
gcc_assert (!sym->value);
! /* Free the temporary. */
! tmp = gfc_call_free (convert (pvoid_type_node, decl));
! gfc_add_init_cleanup (block, gfc_finish_block (&init), tmp);
}
--- 4862,4887 ----
/* Automatic arrays should not have initializers. */
gcc_assert (!sym->value);
! inittree = gfc_finish_block (&init);
!
! if (space)
! {
! tree addr;
! pushdecl (space);
! /* Don't create new scope, emit the DECL_EXPR in exactly the scope
! where also space is located. */
! gfc_init_block (&init);
! tmp = fold_build1_loc (input_location, DECL_EXPR,
! TREE_TYPE (space), space);
! gfc_add_expr_to_block (&init, tmp);
! addr = fold_build1_loc (sym->declared_at.lb->location,
! ADDR_EXPR, TREE_TYPE (decl), space);
! gfc_add_modify (&init, decl, addr);
! gfc_add_init_cleanup (block, gfc_finish_block (&init), NULL_TREE);
! tmp = NULL_TREE;
! }
! gfc_add_init_cleanup (block, inittree, tmp);
}
Index: fortran/Make-lang.in
===================================================================
*** fortran/Make-lang.in (revision 172431)
--- fortran/Make-lang.in (working copy)
*************** fortran/trans-stmt.o: $(GFORTRAN_TRANS_D
*** 353,359 ****
fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS)
fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
gt-fortran-trans-intrinsic.h
fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
--- 353,359 ----
fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS) $(GIMPLE_H)
fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
gt-fortran-trans-intrinsic.h
fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
Index: fortran/gfortran.h
===================================================================
*** fortran/gfortran.h (revision 172431)
--- fortran/gfortran.h (working copy)
*************** typedef struct
*** 2221,2226 ****
--- 2221,2227 ----
int flag_d_lines;
int gfc_flag_openmp;
int flag_sign_zero;
+ int flag_stack_arrays;
int flag_module_private;
int flag_recursive;
int flag_init_local_zero;
Index: fortran/lang.opt
===================================================================
*** fortran/lang.opt (revision 172431)
--- fortran/lang.opt (working copy)
*************** fmax-stack-var-size=
*** 462,467 ****
--- 462,471 ----
Fortran RejectNegative Joined UInteger
-fmax-stack-var-size=<n> Size in bytes of the largest array that will be put on the stack
+ fstack-arrays
+ Fortran
+ Put all local arrays on stack.
+
fmodule-private
Fortran
Set default accessibility of module entities to PRIVATE.
Index: fortran/invoke.texi
===================================================================
*** fortran/invoke.texi (revision 172431)
--- fortran/invoke.texi (working copy)
*************** and warnings}.
*** 167,172 ****
--- 167,173 ----
-fbounds-check -fcheck-array-temporaries -fmax-array-constructor =@var{n} @gol
-fcheck=@var{<all|array-temps|bounds|do|mem|pointer|recursion>} @gol
-fcoarray=@var{<none|single|lib>} -fmax-stack-var-size=@var{n} @gol
+ -fstack-arrays @gol
-fpack-derived -frepack-arrays -fshort-enums -fexternal-blas @gol
-fblas-matmul-limit=@var{n} -frecursive -finit-local-zero @gol
-finit-integer=@var{n} -finit-real=@var{<zero|inf|-inf|nan|snan>} @gol
*************** Future versions of GNU Fortran may impro
*** 1370,1375 ****
--- 1371,1383 ----
The default value for @var{n} is 32768.
+ @item -fstack-arrays
+ @opindex @code{fstack-arrays}
+ Adding this option will make the fortran compiler put all local arrays,
+ even those of unknown size onto stack memory. If your program uses very
+ large local arrays it's possible that you'll have to extend your runtime
+ limits for stack memory on some operating systems.
+
@item -fpack-derived
@opindex @code{fpack-derived}
@cindex structure packing
Index: fortran/options.c
===================================================================
*** fortran/options.c (revision 172431)
--- fortran/options.c (working copy)
*************** gfc_init_options (unsigned int decoded_o
*** 124,129 ****
--- 124,130 ----
/* Default value of flag_max_stack_var_size is set in gfc_post_options. */
gfc_option.flag_max_stack_var_size = -2;
+ gfc_option.flag_stack_arrays = 0;
gfc_option.flag_range_check = 1;
gfc_option.flag_pack_derived = 0;
*************** gfc_handle_option (size_t scode, const c
*** 795,800 ****
--- 796,805 ----
gfc_option.flag_max_stack_var_size = value;
break;
+ case OPT_fstack_arrays:
+ gfc_option.flag_stack_arrays = value;
+ break;
+
case OPT_fmodule_private:
gfc_option.flag_module_private = value;
break;
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-15 15:19 ` Michael Matz
@ 2011-04-15 15:26 ` Jerry DeLisle
2011-04-15 22:41 ` Michael Matz
0 siblings, 1 reply; 40+ messages in thread
From: Jerry DeLisle @ 2011-04-15 15:26 UTC (permalink / raw)
To: Michael Matz
Cc: Dominique Dhumieres, paul.richard.thomas, gcc-patches, fortran
On 04/15/2011 07:28 AM, Michael Matz wrote:
--- snip ---
>
> I'll make the DECL_EXPR conditional on the size being variable. As Tobias
> already okayed the patch I'm planning to check in the slightly modified
> variant as below, after a new round of testing.
>
Thats A-OK
Thanks,
Jerry
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-15 15:26 ` Jerry DeLisle
@ 2011-04-15 22:41 ` Michael Matz
0 siblings, 0 replies; 40+ messages in thread
From: Michael Matz @ 2011-04-15 22:41 UTC (permalink / raw)
To: Jerry DeLisle
Cc: Dominique Dhumieres, paul.richard.thomas, gcc-patches, fortran
Hi,
On Fri, 15 Apr 2011, Jerry DeLisle wrote:
> > I'll make the DECL_EXPR conditional on the size being variable. As
> > Tobias already okayed the patch I'm planning to check in the slightly
> > modified variant as below, after a new round of testing.
>
> Thats A-OK
r172524
Ciao,
Michael.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-09 9:02 ` Eric Botcazou
@ 2011-04-09 9:49 ` N.M. Maclaren
0 siblings, 0 replies; 40+ messages in thread
From: N.M. Maclaren @ 2011-04-09 9:49 UTC (permalink / raw)
To: gcc-patches, fortran
On Apr 9 2011, Magnus Fromreide wrote:
>>
>> There is actually a much better approach, which was done in Algol 68
>> and seems now to be done only in Ada. As far as the compiler
>> implementation goes, it is a trivial variation on what you have done,
>> but there is a little more work in the run-time system.
>>
>> That is to use primary and secondary stacks. ...
>
>How does your secondary stack interact with threads? Do it force the use
>of more memory per thread?
No, there is no difference. You need a pair of stacks per thread, rather
than one, but that's all.
On Apr 9 2011, Eric Botcazou wrote:
>
> Obviously this depends on the compiler. GNAT allocates all local arrays
> on the primary stack for example.
Interesting. I was relying on what (several) other people had told me.
But, yes, it's entirely compiler-dependent, and can be done or not in
any conventional 3GL language.
>> You need a single secondary stack pointer (and preferably limit, for
>> checking), which can be global variables. If a procedure uses the
>> secondary stack, it needs to restore it on leaving, or when leaving
>> a scope including an allocation - but you have already implemented
>> that!
>
> Not exactly, managing a secondary stack isn't supported in the GENERIC
> IL, you need to do it explicitly by inserting calls to the runtime at
> relevant points. As a consequence, this isn't an efficient mechanism,
> hence the restricted use.
Naturally, every mechanism can be done badly, and I agree that it is
very common for the best solution to be infeasible because of some other
aspect. You still get the cache benefits with a pair of calls, but they
do add a lot of overhead.
Regards,
Nick Maclaren.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-09 8:21 ` N.M. Maclaren
2011-04-09 8:51 ` Magnus Fromreide
@ 2011-04-09 9:02 ` Eric Botcazou
2011-04-09 9:49 ` N.M. Maclaren
1 sibling, 1 reply; 40+ messages in thread
From: Eric Botcazou @ 2011-04-09 9:02 UTC (permalink / raw)
To: N.M. Maclaren; +Cc: gcc-patches, fortran
> There is actually a much better approach, which was done in Algol 68
> and seems now to be done only in Ada. As far as the compiler
> implementation goes, it is a trivial variation on what you have done,
> but there is a little more work in the run-time system.
Obviously this depends on the compiler. GNAT allocates all local arrays on the
primary stack for example.
> That is to use primary and secondary stacks. The primary is for the
> linkage, scalars, array descriptors and very small, fixed-size arrays.
> The secondary is for all other arrays with automatic scope. You get
> all of the benefits that you mention, because both are true stacks,
> and the scope of the secondary stack is controlled by that of the
> primary.
GNAT has a secondary stack, but it is only used to return objects whose size is
self-referential, i.e. isn't given by the return type of the function.
> You need a single secondary stack pointer (and preferably limit, for
> checking), which can be global variables. If a procedure uses the
> secondary stack, it needs to restore it on leaving, or when leaving
> a scope including an allocation - but you have already implemented
> that!
Not exactly, managing a secondary stack isn't supported in the GENERIC IL, you
need to do it explicitly by inserting calls to the runtime at relevant points.
As a consequence, this isn't an efficient mechanism, hence the restricted use.
--
Eric Botcazou
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-09 8:21 ` N.M. Maclaren
@ 2011-04-09 8:51 ` Magnus Fromreide
2011-04-09 9:02 ` Eric Botcazou
1 sibling, 0 replies; 40+ messages in thread
From: Magnus Fromreide @ 2011-04-09 8:51 UTC (permalink / raw)
To: N.M. Maclaren; +Cc: gcc-patches, fortran
On Sat, 2011-04-09 at 09:21 +0100, N.M. Maclaren wrote:
> On Apr 8 2011, Michael Matz wrote:
> >
> >It adds a new option -fstack-arrays which makes the frontend put
> >all local arrays on stack memory. ...
>
> Excellent!
>
> >I haven't rechecked performance now, but four months ago this was the
> >result for the fortran benchmarks in cpu2006:
>
> There is actually a much better approach, which was done in Algol 68
> and seems now to be done only in Ada. As far as the compiler
> implementation goes, it is a trivial variation on what you have done,
> but there is a little more work in the run-time system.
>
> That is to use primary and secondary stacks. ...
How does your secondary stack interact with threads? Do it force the use
of more memory per thread?
/MF
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Implement stack arrays even for unknown sizes
2011-04-08 21:29 Michael Matz
@ 2011-04-09 8:21 ` N.M. Maclaren
2011-04-09 8:51 ` Magnus Fromreide
2011-04-09 9:02 ` Eric Botcazou
0 siblings, 2 replies; 40+ messages in thread
From: N.M. Maclaren @ 2011-04-09 8:21 UTC (permalink / raw)
To: gcc-patches, fortran
On Apr 8 2011, Michael Matz wrote:
>
>It adds a new option -fstack-arrays which makes the frontend put
>all local arrays on stack memory. ...
Excellent!
>I haven't rechecked performance now, but four months ago this was the
>result for the fortran benchmarks in cpu2006:
There is actually a much better approach, which was done in Algol 68
and seems now to be done only in Ada. As far as the compiler
implementation goes, it is a trivial variation on what you have done,
but there is a little more work in the run-time system.
That is to use primary and secondary stacks. The primary is for the
linkage, scalars, array descriptors and very small, fixed-size arrays.
The secondary is for all other arrays with automatic scope. You get
all of the benefits that you mention, because both are true stacks,
and the scope of the secondary stack is controlled by that of the
primary.
You need a single secondary stack pointer (and preferably limit, for
checking), which can be global variables. If a procedure uses the
secondary stack, it needs to restore it on leaving, or when leaving
a scope including an allocation - but you have already implemented
that!
The reason that it gives better performance is locality. Because the
primary stack is much smaller and contains the critical scalars and
pointers to arrays etc., calling procedures can involve MANY fewer
cache misses.
It also reduces the number of array overflows that cause complete chaos,
because most small bounds errors trash other data, not constants or
linkage.
It also helps with Unixoid systems with fixed (and small) stack limits,
because the secondary stack is just another data segment.
I would love to say that I would try to work on this, but I really can't
promise to do anything else until I have delivered on some of my long
overdue promises. But I am very happy to discuss details - not that there
ARE many more than I have described!
Regards,
Nick Maclaren.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Implement stack arrays even for unknown sizes
@ 2011-04-08 21:29 Michael Matz
2011-04-09 8:21 ` N.M. Maclaren
0 siblings, 1 reply; 40+ messages in thread
From: Michael Matz @ 2011-04-08 21:29 UTC (permalink / raw)
To: gcc-patches; +Cc: fortran
Hello,
I developed this patch during stage3 of 4.6, so it wasn't appropriate
then, but now should be a good time.
It adds a new option -fstack-arrays which makes the frontend put
all local arrays on stack memory. Even those of non-constant size, by
using alloca. Or rather not by explicitely using alloca, but by
leveraging the middle-ends possibilities of having variable length arrays
(which that then implements via alloca/stack_save/stack_restore).
Via that we'll get automatic allocation and free (the latter being the
important part for stack memory) when entering or leaving the scope of the
so declared arrays. That means temporary arrays generated for looping
constructs only need stack memory when the loop is active, not during the
whole function.
By doing that we save one malloc/free pair per such loop iteration, and
some benchmarks _heavily_ churn in the libc allocator (tonto for
instance). Also in some cases the data layout will be much better
(e.g. bwaves). This patch bootstraps fine on x86_64-linux. Even if I'm
forcing the use of stack arrays by default the testsuite runs without
regressions, except for a few cases where regexps need to be changed to
accept the changed output (e.g. explicitely matching for builtin_free).
The patch as is (i.e. without using stack arrays by default) doesn't
regress at all.
I haven't rechecked performance now, but four months ago this was the
result for the fortran benchmarks in cpu2006:
(base is -O3 -g -ffast-math -funroll-loops -fpeel-loops
peak is + -fstack-arrays)
410.bwaves 13590 1340 10.2 * 13590 813 16.7 *
416.gamess 19580 1400 14.0 * 19580 1410 13.9 *
434.zeusmp 9100 798 11.4 * 9100 799 11.4 *
435.gromacs 7140 622 11.5 * 7140 621 11.5 *
436.cactusADM 11950 1280 9.36 * 11950 1300 9.18 *
437.leslie3d 9400 765 12.3 * 9400 765 12.3 *
454.calculix 8250 734 11.2 * 8250 735 11.2 *
459.GemsFDTD 10610 1070 9.95 * 10610 1060 10.0 *
465.tonto 9840 1090 9.00 * 9840 952 10.3 *
481.wrf 11170 899 12.4 * 11170 794 14.1 *
That is tonto and wrf show nice speedups, and bwaves increases performance
by 80% (that's data layout, bwaves doesn't heavily call malloc/free).
It should be noted that e.g. ICC does this placement of local arrays on
stack by default, and that doing it for GCC has similar requirements. For
cpu2006 one needs to increase the stack ulimit quite much (around 1G) to
not segfault.
I would consider enabling stack-arrays for -Ofast, but that would be a
separate patch.
So, what do you think?
Ciao,
Michael.
* trans-array.c (toplevel): Include gimple.h.
(gfc_trans_allocate_array_storage): Check flag_stack_arrays,
properly expand variable length arrays.
(gfc_trans_auto_array_allocation): If flag_stack_arrays create
variable length decls and associate them with their scope.
* gfortran.h (gfc_option_t): Add flag_stack_arrays member.
* options.c (gfc_init_options): Handle -fstack_arrays option.
* lang.opt (fstack-arrays): Add option.
* invoke.texi (Code Gen Options): Document it.
* Make-lang.in (trans-array.o): Depend on GIMPLE_H.
Index: trans-array.c
===================================================================
*** trans-array.c (Revision 172206)
--- trans-array.c (Arbeitskopie)
*************** along with GCC; see the file COPYING3.
*** 81,86 ****
--- 81,87 ----
#include "system.h"
#include "coretypes.h"
#include "tree.h"
+ #include "gimple.h"
#include "diagnostic-core.h" /* For internal_error/fatal_error. */
#include "flags.h"
#include "gfortran.h"
*************** gfc_trans_allocate_array_storage (stmtbl
*** 630,647 ****
{
/* Allocate the temporary. */
onstack = !dynamic && initial == NULL_TREE
! && gfc_can_put_var_on_stack (size);
if (onstack)
{
/* Make a temporary variable to hold the data. */
tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
nelem, gfc_index_one_node);
tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
tmp);
tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
tmp);
tmp = gfc_create_var (tmp, "A");
tmp = gfc_build_addr_expr (NULL_TREE, tmp);
gfc_conv_descriptor_data_set (pre, desc, tmp);
}
--- 631,654 ----
{
/* Allocate the temporary. */
onstack = !dynamic && initial == NULL_TREE
! && (gfc_option.flag_stack_arrays
! || gfc_can_put_var_on_stack (size));
if (onstack)
{
/* Make a temporary variable to hold the data. */
tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
nelem, gfc_index_one_node);
+ tmp = gfc_evaluate_now (tmp, pre);
tmp = build_range_type (gfc_array_index_type, gfc_index_zero_node,
tmp);
tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
tmp);
tmp = gfc_create_var (tmp, "A");
+ gfc_add_expr_to_block (pre,
+ fold_build1_loc (input_location,
+ DECL_EXPR, TREE_TYPE (tmp),
+ tmp));
tmp = gfc_build_addr_expr (NULL_TREE, tmp);
gfc_conv_descriptor_data_set (pre, desc, tmp);
}
*************** gfc_trans_auto_array_allocation (tree de
*** 4744,4749 ****
--- 4751,4758 ----
tree tmp;
tree size;
tree offset;
+ tree space;
+ tree inittree;
bool onstack;
gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
*************** gfc_trans_auto_array_allocation (tree de
*** 4800,4814 ****
return;
}
! /* The size is the number of elements in the array, so multiply by the
! size of an element to get the total size. */
! tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
! size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! size, fold_convert (gfc_array_index_type, tmp));
! /* Allocate memory to hold the data. */
! tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
! gfc_add_modify (&init, decl, tmp);
/* Set offset of the array. */
if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
--- 4809,4846 ----
return;
}
! if (gfc_option.flag_stack_arrays)
! {
! tree addr;
! gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
! space = build_decl (sym->declared_at.lb->location,
! VAR_DECL, create_tmp_var_name ("A"),
! TREE_TYPE (TREE_TYPE (decl)));
! gfc_trans_vla_type_sizes (sym, &init);
! tmp = fold_build1_loc (input_location, DECL_EXPR,
! TREE_TYPE (space), space);
! gfc_add_expr_to_block (&init, tmp);
! addr = fold_build1_loc (sym->declared_at.lb->location,
! ADDR_EXPR, TREE_TYPE (decl), space);
! gfc_add_modify (&init, decl, addr);
! tmp = NULL_TREE;
! }
! else
! {
! /* The size is the number of elements in the array, so multiply by the
! size of an element to get the total size. */
! tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
! size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
! size, fold_convert (gfc_array_index_type, tmp));
!
! /* Allocate memory to hold the data. */
! tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
! gfc_add_modify (&init, decl, tmp);
! /* Free the temporary. */
! tmp = gfc_call_free (convert (pvoid_type_node, decl));
! space = NULL_TREE;
! }
/* Set offset of the array. */
if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
*************** gfc_trans_auto_array_allocation (tree de
*** 4817,4826 ****
/* Automatic arrays should not have initializers. */
gcc_assert (!sym->value);
! /* Free the temporary. */
! tmp = gfc_call_free (convert (pvoid_type_node, decl));
!
! gfc_add_init_cleanup (block, gfc_finish_block (&init), tmp);
}
--- 4849,4858 ----
/* Automatic arrays should not have initializers. */
gcc_assert (!sym->value);
! inittree = gfc_finish_block (&init);
! if (space)
! pushdecl (space);
! gfc_add_init_cleanup (block, inittree, tmp);
}
Index: Make-lang.in
===================================================================
*** Make-lang.in (Revision 172206)
--- Make-lang.in (Arbeitskopie)
*************** fortran/trans-stmt.o: $(GFORTRAN_TRANS_D
*** 353,359 ****
fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS)
fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
gt-fortran-trans-intrinsic.h
fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
--- 353,359 ----
fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS) $(GIMPLE_H)
fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
gt-fortran-trans-intrinsic.h
fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
Index: gfortran.h
===================================================================
*** gfortran.h (Revision 172206)
--- gfortran.h (Arbeitskopie)
*************** typedef struct
*** 2220,2225 ****
--- 2220,2226 ----
int flag_d_lines;
int gfc_flag_openmp;
int flag_sign_zero;
+ int flag_stack_arrays;
int flag_module_private;
int flag_recursive;
int flag_init_local_zero;
Index: lang.opt
===================================================================
*** lang.opt (Revision 172206)
--- lang.opt (Arbeitskopie)
*************** fmax-stack-var-size=
*** 454,459 ****
--- 454,463 ----
Fortran RejectNegative Joined UInteger
-fmax-stack-var-size=<n> Size in bytes of the largest array that will be put on the stack
+ fstack-arrays
+ Fortran
+ Put all local arrays on stack.
+
fmodule-private
Fortran
Set default accessibility of module entities to PRIVATE.
Index: invoke.texi
===================================================================
*** invoke.texi (Revision 172206)
--- invoke.texi (Arbeitskopie)
*************** and warnings}.
*** 167,172 ****
--- 167,173 ----
-fbounds-check -fcheck-array-temporaries -fmax-array-constructor =@var{n} @gol
-fcheck=@var{<all|array-temps|bounds|do|mem|pointer|recursion>} @gol
-fcoarray=@var{<none|single|lib>} -fmax-stack-var-size=@var{n} @gol
+ -fstack-arrays @gol
-fpack-derived -frepack-arrays -fshort-enums -fexternal-blas @gol
-fblas-matmul-limit=@var{n} -frecursive -finit-local-zero @gol
-finit-integer=@var{n} -finit-real=@var{<zero|inf|-inf|nan|snan>} @gol
*************** Future versions of GNU Fortran may impro
*** 1361,1366 ****
--- 1362,1374 ----
The default value for @var{n} is 32768.
+ @item -fstack-arrays
+ @opindex @code{fstack-arrays}
+ Adding this option will make the fortran compiler put all local arrays,
+ even those of unknown size onto stack memory. If your program uses very
+ large local arrays it's possible that you'll have to extend your runtime
+ limits for stack memory on some operating systems.
+
@item -fpack-derived
@opindex @code{fpack-derived}
@cindex structure packing
Index: options.c
===================================================================
*** options.c (Revision 172206)
--- options.c (Arbeitskopie)
*************** gfc_init_options (unsigned int decoded_o
*** 123,128 ****
--- 123,129 ----
/* Default value of flag_max_stack_var_size is set in gfc_post_options. */
gfc_option.flag_max_stack_var_size = -2;
+ gfc_option.flag_stack_arrays = 0;
gfc_option.flag_range_check = 1;
gfc_option.flag_pack_derived = 0;
*************** gfc_handle_option (size_t scode, const c
*** 783,788 ****
--- 784,793 ----
gfc_option.flag_max_stack_var_size = value;
break;
+ case OPT_fstack_arrays:
+ gfc_option.flag_stack_arrays = value;
+ break;
+
case OPT_fmodule_private:
gfc_option.flag_module_private = value;
break;
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2011-04-15 21:49 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-09 10:08 Implement stack arrays even for unknown sizes Dominique Dhumieres
2011-04-09 12:17 ` Paul Richard Thomas
2011-04-10 13:29 ` Dominique Dhumieres
2011-04-11 11:49 ` Michael Matz
2011-04-11 11:58 ` Axel Freyn
2011-04-11 13:35 ` Eric Botcazou
2011-04-11 14:06 ` Dominique Dhumieres
2011-04-11 14:59 ` Michael Matz
2011-04-11 16:04 ` Michael Matz
2011-04-11 16:45 ` Steven Bosscher
2011-04-12 11:54 ` Michael Matz
2011-04-12 12:42 ` N.M. Maclaren
2011-04-11 16:46 ` Dominique Dhumieres
2011-04-11 16:59 ` H.J. Lu
2011-04-11 17:06 ` N.M. Maclaren
2011-04-11 18:28 ` Tobias Burnus
2011-04-11 21:18 ` Dominique Dhumieres
2011-04-12 6:23 ` Paul Richard Thomas
2011-04-12 6:35 ` Dominique Dhumieres
2011-04-13 9:42 ` Paul Richard Thomas
2011-04-13 10:38 ` Richard Guenther
2011-04-13 11:13 ` Richard Guenther
2011-04-14 13:32 ` Dominique Dhumieres
2011-04-13 13:46 ` Paul Richard Thomas
2011-04-14 13:59 ` Michael Matz
2011-04-14 15:28 ` Michael Matz
2011-04-14 19:29 ` Dominique Dhumieres
2011-04-15 2:01 ` Dominique Dhumieres
2011-04-15 12:38 ` Michael Matz
2011-04-15 14:06 ` Dominique Dhumieres
2011-04-15 15:19 ` Michael Matz
2011-04-15 15:26 ` Jerry DeLisle
2011-04-15 22:41 ` Michael Matz
2011-04-14 20:23 ` Tobias Burnus
2011-04-14 20:30 ` Tobias Burnus
-- strict thread matches above, loose matches on Subject: below --
2011-04-08 21:29 Michael Matz
2011-04-09 8:21 ` N.M. Maclaren
2011-04-09 8:51 ` Magnus Fromreide
2011-04-09 9:02 ` Eric Botcazou
2011-04-09 9:49 ` N.M. Maclaren
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).