Date: Fri, 08 Apr 2011 21:29:00 -0000
From: Michael Matz
To: gcc-patches@gcc.gnu.org
Cc: fortran@gcc.gnu.org
Subject: Implement stack arrays even for unknown sizes

Hello,

I developed this patch during stage 3 of 4.6, so it wasn't appropriate then, but now should be a good time.

It adds a new option -fstack-arrays which makes the frontend put all local arrays on stack memory, even those of non-constant size. This uses alloca, though not explicitly: the frontend leverages the middle end's support for variable length arrays, which the middle end then implements via alloca/stack_save/stack_restore. That way we get automatic allocation and freeing (the latter being the important part for stack memory) when entering or leaving the scope of the arrays so declared. It means temporary arrays generated for looping constructs only need stack memory while the loop is active, not during the whole function. By doing that we save one malloc/free pair per such loop iteration, and some benchmarks _heavily_ churn in the libc allocator (tonto, for instance).
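To make the scoping argument concrete, here is a rough C analogy; it is not the GENERIC trees gfortran actually builds. The first function mimics a loop temporary obtained with one malloc/free pair per iteration; the second uses a C99 variable length array scoped to the loop body, the same construct the middle end implements with alloca and stack save/restore. Both function names are hypothetical, made up for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Heap temporary: one malloc/free pair per loop iteration, analogous to a
   non-constant-size temporary allocated without -fstack-arrays.  */
static double sum_heap_temp(int iters, int n)
{
  double total = 0.0;
  for (int it = 0; it < iters; it++)
    {
      double *tmp = malloc((size_t) n * sizeof *tmp);
      if (tmp == NULL)
        abort();
      for (int i = 0; i < n; i++)
        tmp[i] = (double) (i + it);
      for (int i = 0; i < n; i++)
        total += tmp[i];
      free(tmp);                      /* explicit release every iteration */
    }
  return total;
}

/* Stack temporary: a variable length array whose scope is the loop body.
   Its storage is reserved on entry to the block and released on exit,
   with no allocator calls at all.  */
static double sum_vla_temp(int iters, int n)
{
  assert(n > 0);                      /* a zero-length VLA is undefined */
  double total = 0.0;
  for (int it = 0; it < iters; it++)
    {
      double tmp[n];                  /* lives only while this iteration runs */
      for (int i = 0; i < n; i++)
        tmp[i] = (double) (i + it);
      for (int i = 0; i < n; i++)
        total += tmp[i];
    }
  return total;
}
```

Both functions compute the same sum (2004000 for 4 iterations over 1000 elements); only the source of the temporary's storage differs. The VLA version needs n * sizeof (double) bytes of stack per active iteration, which is why very large arrays run into the stack limit.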
Also, in some cases the data layout will be much better (e.g. bwaves).

This patch bootstraps fine on x86_64-linux. Even if I force the use of stack arrays by default, the testsuite runs without regressions, except for a few cases where regexps need to be changed to accept the changed output (e.g. ones explicitly matching for builtin_free). The patch as is (i.e. without using stack arrays by default) doesn't regress at all.

I haven't rechecked performance now, but four months ago this was the result for the Fortran benchmarks in cpu2006 (base is -O3 -g -ffast-math -funroll-loops -fpeel-loops, peak is base plus -fstack-arrays; times in seconds):

                ---------- base -----------    ---------- peak -----------
Benchmark       Ref time  Run time  Ratio      Ref time  Run time  Ratio
410.bwaves         13590      1340   10.2 *       13590       813   16.7 *
416.gamess         19580      1400   14.0 *       19580      1410   13.9 *
434.zeusmp          9100       798   11.4 *        9100       799   11.4 *
435.gromacs         7140       622   11.5 *        7140       621   11.5 *
436.cactusADM      11950      1280   9.36 *       11950      1300   9.18 *
437.leslie3d        9400       765   12.3 *        9400       765   12.3 *
454.calculix        8250       734   11.2 *        8250       735   11.2 *
459.GemsFDTD       10610      1070   9.95 *       10610      1060   10.0 *
465.tonto           9840      1090   9.00 *        9840       952   10.3 *
481.wrf            11170       899   12.4 *       11170       794   14.1 *

That is, tonto and wrf show nice speedups, and bwaves improves by roughly 65% (run time drops from 1340 s to 813 s). That's data layout; bwaves doesn't call malloc/free heavily.

Note that ICC, for example, places local arrays on the stack by default, and doing so in GCC has similar requirements: for cpu2006 one needs to raise the stack ulimit considerably (to around 1 GB) to avoid segfaults.

I would consider enabling -fstack-arrays for -Ofast, but that would be a separate patch.

So, what do you think?

Ciao,
Michael.

	* trans-array.c (toplevel): Include gimple.h.
	(gfc_trans_allocate_array_storage): Check flag_stack_arrays,
	properly expand variable length arrays.
	(gfc_trans_auto_array_allocation): If flag_stack_arrays, create
	variable length decls and associate them with their scope.
	* gfortran.h (gfc_option_t): Add flag_stack_arrays member.
	* options.c (gfc_init_options): Initialize flag_stack_arrays.
	(gfc_handle_option): Handle -fstack-arrays option.
	* lang.opt (fstack-arrays): Add option.
	* invoke.texi (Code Gen Options): Document it.
	* Make-lang.in (trans-array.o): Depend on GIMPLE_H.

Index: trans-array.c
===================================================================
*** trans-array.c	(revision 172206)
--- trans-array.c	(working copy)
*************** along with GCC; see the file COPYING3.
*** 81,86 ****
--- 81,87 ----
  #include "system.h"
  #include "coretypes.h"
  #include "tree.h"
+ #include "gimple.h"
  #include "diagnostic-core.h"	/* For internal_error/fatal_error.  */
  #include "flags.h"
  #include "gfortran.h"
*************** gfc_trans_allocate_array_storage (stmtbl
*** 630,647 ****
      {
        /* Allocate the temporary.  */
        onstack = !dynamic && initial == NULL_TREE
!                   && gfc_can_put_var_on_stack (size);

        if (onstack)
          {
            /* Make a temporary variable to hold the data.  */
            tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
                                   nelem, gfc_index_one_node);
            tmp = build_range_type (gfc_array_index_type,
                                    gfc_index_zero_node, tmp);
            tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
                                    tmp);
            tmp = gfc_create_var (tmp, "A");
            tmp = gfc_build_addr_expr (NULL_TREE, tmp);
            gfc_conv_descriptor_data_set (pre, desc, tmp);
          }
--- 631,654 ----
      {
        /* Allocate the temporary.  */
        onstack = !dynamic && initial == NULL_TREE
!                   && (gfc_option.flag_stack_arrays
!                       || gfc_can_put_var_on_stack (size));

        if (onstack)
          {
            /* Make a temporary variable to hold the data.  */
            tmp = fold_build2_loc (input_location, MINUS_EXPR, TREE_TYPE (nelem),
                                   nelem, gfc_index_one_node);
+           tmp = gfc_evaluate_now (tmp, pre);
            tmp = build_range_type (gfc_array_index_type,
                                    gfc_index_zero_node, tmp);
            tmp = build_array_type (gfc_get_element_type (TREE_TYPE (desc)),
                                    tmp);
            tmp = gfc_create_var (tmp, "A");
+           gfc_add_expr_to_block (pre,
+                                  fold_build1_loc (input_location,
+                                                   DECL_EXPR, TREE_TYPE (tmp),
+                                                   tmp));
            tmp = gfc_build_addr_expr (NULL_TREE, tmp);
            gfc_conv_descriptor_data_set (pre, desc, tmp);
          }
*************** gfc_trans_auto_array_allocation (tree de
*** 4744,4749 ****
--- 4751,4758 ----
    tree tmp;
    tree size;
    tree offset;
+   tree space;
+   tree inittree;
    bool onstack;

    gcc_assert (!(sym->attr.pointer || sym->attr.allocatable));
*************** gfc_trans_auto_array_allocation (tree de
*** 4800,4814 ****
        return;
      }

!   /* The size is the number of elements in the array, so multiply by the
!      size of an element to get the total size.  */
!   tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
!   size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
!                           size, fold_convert (gfc_array_index_type, tmp));

!   /* Allocate memory to hold the data.  */
!   tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
!   gfc_add_modify (&init, decl, tmp);

    /* Set offset of the array.  */
    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
--- 4809,4846 ----
        return;
      }

!   if (gfc_option.flag_stack_arrays)
!     {
!       tree addr;
!       gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
!       space = build_decl (sym->declared_at.lb->location,
!                           VAR_DECL, create_tmp_var_name ("A"),
!                           TREE_TYPE (TREE_TYPE (decl)));
!       gfc_trans_vla_type_sizes (sym, &init);
!       tmp = fold_build1_loc (input_location, DECL_EXPR,
!                              TREE_TYPE (space), space);
!       gfc_add_expr_to_block (&init, tmp);
!       addr = fold_build1_loc (sym->declared_at.lb->location,
!                               ADDR_EXPR, TREE_TYPE (decl), space);
!       gfc_add_modify (&init, decl, addr);
!       tmp = NULL_TREE;
!     }
!   else
!     {
!       /* The size is the number of elements in the array, so multiply by the
!          size of an element to get the total size.  */
!       tmp = TYPE_SIZE_UNIT (gfc_get_element_type (type));
!       size = fold_build2_loc (input_location, MULT_EXPR, gfc_array_index_type,
!                               size, fold_convert (gfc_array_index_type, tmp));
!
!       /* Allocate memory to hold the data.  */
!       tmp = gfc_call_malloc (&init, TREE_TYPE (decl), size);
!       gfc_add_modify (&init, decl, tmp);

!       /* Free the temporary.  */
!       tmp = gfc_call_free (convert (pvoid_type_node, decl));
!       space = NULL_TREE;
!     }

    /* Set offset of the array.  */
    if (TREE_CODE (GFC_TYPE_ARRAY_OFFSET (type)) == VAR_DECL)
*************** gfc_trans_auto_array_allocation (tree de
*** 4817,4826 ****
    /* Automatic arrays should not have initializers.  */
    gcc_assert (!sym->value);

!   /* Free the temporary.  */
!   tmp = gfc_call_free (convert (pvoid_type_node, decl));
!
!   gfc_add_init_cleanup (block, gfc_finish_block (&init), tmp);
  }


--- 4849,4858 ----
    /* Automatic arrays should not have initializers.  */
    gcc_assert (!sym->value);

!   inittree = gfc_finish_block (&init);
!   if (space)
!     pushdecl (space);
!   gfc_add_init_cleanup (block, inittree, tmp);
  }


Index: Make-lang.in
===================================================================
*** Make-lang.in	(revision 172206)
--- Make-lang.in	(working copy)
*************** fortran/trans-stmt.o: $(GFORTRAN_TRANS_D
*** 353,359 ****
  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
    fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
    gt-fortran-trans-intrinsic.h
  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
--- 353,359 ----
  fortran/trans-openmp.o: $(GFORTRAN_TRANS_DEPS)
  fortran/trans-io.o: $(GFORTRAN_TRANS_DEPS) gt-fortran-trans-io.h \
    fortran/ioparm.def
! fortran/trans-array.o: $(GFORTRAN_TRANS_DEPS) $(GIMPLE_H)
  fortran/trans-intrinsic.o: $(GFORTRAN_TRANS_DEPS) fortran/mathbuiltins.def \
    gt-fortran-trans-intrinsic.h
  fortran/dependency.o: $(GFORTRAN_TRANS_DEPS) fortran/dependency.h
Index: gfortran.h
===================================================================
*** gfortran.h	(revision 172206)
--- gfortran.h	(working copy)
*************** typedef struct
*** 2220,2225 ****
--- 2220,2226 ----
    int flag_d_lines;
    int gfc_flag_openmp;
    int flag_sign_zero;
+   int flag_stack_arrays;
    int flag_module_private;
    int flag_recursive;
    int flag_init_local_zero;
Index: lang.opt
===================================================================
*** lang.opt	(revision 172206)
--- lang.opt	(working copy)
*************** fmax-stack-var-size=
*** 454,459 ****
--- 454,463 ----
  Fortran RejectNegative Joined UInteger
  -fmax-stack-var-size= Size in bytes of the largest array that will be put on the stack

+ fstack-arrays
+ Fortran
+ Put all local arrays on stack.
+
  fmodule-private
  Fortran
  Set default accessibility of module entities to PRIVATE.
Index: invoke.texi
===================================================================
*** invoke.texi	(revision 172206)
--- invoke.texi	(working copy)
*************** and warnings}.
*** 167,172 ****
--- 167,173 ----
  -fbounds-check -fcheck-array-temporaries -fmax-array-constructor =@var{n} @gol
  -fcheck=@var{} @gol
  -fcoarray=@var{} -fmax-stack-var-size=@var{n} @gol
+ -fstack-arrays @gol
  -fpack-derived -frepack-arrays -fshort-enums -fexternal-blas @gol
  -fblas-matmul-limit=@var{n} -frecursive -finit-local-zero @gol
  -finit-integer=@var{n} -finit-real=@var{} @gol
*************** Future versions of GNU Fortran may impro
*** 1361,1366 ****
--- 1362,1374 ----

  The default value for @var{n} is 32768.

+ @item -fstack-arrays
+ @opindex @code{fstack-arrays}
+ Adding this option will make the Fortran compiler put all local arrays,
+ even those of unknown size, onto stack memory.  If your program uses very
+ large local arrays it's possible that you'll have to extend your runtime
+ limits for stack memory on some operating systems.
+
  @item -fpack-derived
  @opindex @code{fpack-derived}
  @cindex structure packing
Index: options.c
===================================================================
*** options.c	(revision 172206)
--- options.c	(working copy)
*************** gfc_init_options (unsigned int decoded_o
*** 123,128 ****
--- 123,129 ----

    /* Default value of flag_max_stack_var_size is set in gfc_post_options.  */
    gfc_option.flag_max_stack_var_size = -2;
+   gfc_option.flag_stack_arrays = 0;

    gfc_option.flag_range_check = 1;
    gfc_option.flag_pack_derived = 0;
*************** gfc_handle_option (size_t scode, const c
*** 783,788 ****
--- 784,793 ----
        gfc_option.flag_max_stack_var_size = value;
        break;

+     case OPT_fstack_arrays:
+       gfc_option.flag_stack_arrays = value;
+       break;
+
      case OPT_fmodule_private:
        gfc_option.flag_module_private = value;
        break;
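Because -fstack-arrays moves large local arrays onto the stack, a program may need a higher stack limit, as with the roughly 1 GB ulimit mentioned for cpu2006. Below is a minimal POSIX sketch, assuming a Unix-like system, that checks a planned amount of stack-array storage against the process's soft RLIMIT_STACK. The helper names and the 1 MiB headroom are hypothetical choices for illustration, and the check only reflects the main thread's limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical headroom reserved for the call frames of the rest of the
   program; 1 MiB is an arbitrary choice for this sketch.  */
#define STACK_MARGIN ((rlim_t) 1 << 20)

/* Pure check: would BYTES of stack arrays, plus STACK_MARGIN, fit under a
   soft stack limit of SOFT_LIMIT bytes?  */
static bool fits_under_stack_limit(rlim_t soft_limit, size_t bytes)
{
  if (soft_limit == RLIM_INFINITY)
    return true;                      /* stack size is unlimited */
  if (soft_limit < STACK_MARGIN)
    return false;
  /* Compare against limit - margin so bytes + margin cannot overflow.  */
  return (rlim_t) bytes <= soft_limit - STACK_MARGIN;
}

/* Query the soft RLIMIT_STACK of the current process (the limit the
   shell's "ulimit -s" adjusts) and apply the check above.  */
static bool stack_arrays_likely_fit(size_t bytes)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0)
    return false;                     /* be conservative if the query fails */
  return fits_under_stack_limit(rl.rlim_cur, bytes);
}
```

In practice the limit is normally raised from the shell before starting the program, e.g. with ulimit -s unlimited or ulimit -s 1048576 (bash counts in kilobytes, so the latter is 1 GB); a helper like this would merely report in advance that a run is likely to overflow the stack.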