From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm1-x32f.google.com (mail-wm1-x32f.google.com [IPv6:2a00:1450:4864:20::32f]) by sourceware.org (Postfix) with ESMTPS id 608253858D39 for ; Wed, 17 May 2023 15:24:17 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 608253858D39 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org Received: by mail-wm1-x32f.google.com with SMTP id 5b1f17b1804b1-3f42d937d2eso6232395e9.2 for ; Wed, 17 May 2023 08:24:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; t=1684337056; x=1686929056; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date:message-id:reply-to; bh=uYz/AUa5nklw71xEjpZkO3/KuI/ecvzPYNDQGtWKSZM=; b=wbPhLwzHCMZh/3r14HFExozhd0rzZ3Wot2LbiVsFIbhTEw+GJDFBLlFtnx4AANVvB1 0/LyDXiVZwbbKN3YsG3F0GLDGKw0hArovO93M3i0eE0owHI6zaVD+dvZ4gbORJ8Wc+Mk JLI16EYC3vMTXO6X6f/AwSR3oVlv+1N61lsZxShRUCb2NUj6go2Ua5Fos7b9oYUzlkiS ggencbr6rVFUs6zbSI1fNwSRzQxg/krBl5ZVxTlces4sGS/CWTLP/nVtPY1MiRKph8gj hY7okhVFxFrljs19siKhV+kiOkr3W3zypw8Gywz7/UIRsrBOy9pOuk44m2qmHCXiIbpa x5fA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1684337056; x=1686929056; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=uYz/AUa5nklw71xEjpZkO3/KuI/ecvzPYNDQGtWKSZM=; b=Py74jpJQ6YMYyNPU2i/g/eIX0XSc9HAau68HgtgXA4ZBOBUUWMSVSYHrHdn0Ft1saJ 8o/ucdl3fv7GR9p20KhUEZJNe9I3ES0IdvY20KBpMzQJMuNn6Lq8h1lphQLaTyPrH6fB 7if/zcEcUXEQm8lUIHHc4VHp0/wcqjQABny6C46mgADHkRbSu2hOJLwv/vCSQEln/ejP txmyvYLDFntSUs5fMVPqBsOKCAqsiX9XRn4vK6UmRgfBFn3z7NzC6eaP3Nt/rGbXfcOd FFUyr9Oa5N6Zzo60H4Wr30oHRLAidFmrEgYFV/TuBGmrH/Qp9ofqdr/HXWPdYC/eSgs8 ts5g== X-Gm-Message-State: AC+VfDzmlpatKkHB4Qwvok87LozMJFwYlVfB18CS22IMvviXm+dJS8ou SMgL4DGbMdu7EZVLtmscFWlyzqoo0DC5CO+9glSV1dbInJdifLkktr4= X-Google-Smtp-Source: ACHHUZ7GqUIt1Z+GGY4UrDZugwa7zZnOLnKMS/tnbn4G8Y+avsJXDlqQOHM5tiQHOvAhqeDoFTiKR1UYrBrsxIFy5uQ= X-Received: by 2002:a5d:410a:0:b0:307:93ec:b323 with SMTP id l10-20020a5d410a000000b0030793ecb323mr892649wrp.69.1684337055909; Wed, 17 May 2023 08:24:15 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Prathamesh Kulkarni Date: Wed, 17 May 2023 20:53:38 +0530 Message-ID: Subject: Re: [aarch64] Code-gen for vector initialization involving constants To: Prathamesh Kulkarni , gcc Patches , richard.sandiford@arm.com Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-3.6 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Tue, 16 May 2023 at 00:29, Richard Sandiford wrote: > > Prathamesh Kulkarni writes: > > Hi Richard, > > After committing the interleave+zip1 patch for vector initialization, > > it seems to regress the s32 case for this patch: > > > > int32x4_t f_s32(int32_t x) > > { > > return (int32x4_t) { x, x, x, 1 }; > > } > > > > code-gen: > > f_s32: > > movi v30.2s, 0x1 > > fmov s31, w0 > > dup v0.2s, v31.s[0] > > ins v30.s[0], v31.s[0] > > zip1 v0.4s, v0.4s, v30.4s > > ret > > > > instead of expected code-gen: > > f_s32: > > movi v31.2s, 0x1 > > dup v0.4s, w0 > > ins v0.s[3], v31.s[0] > > ret > > > > Cost for fallback sequence: 16 > > Cost for interleave and zip sequence: 12 > > > > For the above case, the cost for interleave+zip1 sequence is computed as: > > halves[0]: > > (set (reg:V2SI 96) > > (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))) > > cost = 8 > > > > halves[1]: > > (set (reg:V2SI 97) > > (const_vector:V2SI [ > > (const_int 1 [0x1]) repeated x2 > > ])) > > (set (reg:V2SI 97) > > (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ])) > > (reg:V2SI 97) > > (const_int 1 [0x1]))) > > cost = 8 > > > > followed by: > > (set (reg:V4SI 95) > > (unspec:V4SI [ > > (subreg:V4SI (reg:V2SI 96) 0) > > (subreg:V4SI (reg:V2SI 97) 0) > > ] UNSPEC_ZIP1)) > > cost = 4 > > > > So the total cost becomes > > max(costs[0], costs[1]) + zip1_insn_cost > > = max(8, 8) + 4 > > = 12 > > > > While the fallback rtl sequence is: > > (set (reg:V4SI 95) > > (vec_duplicate:V4SI (reg/v:SI 93 [ x ]))) > > cost = 8 > > (set (reg:SI 98) > > (const_int 1 [0x1])) > > cost = 4 > > (set (reg:V4SI 95) > > (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98)) > > (reg:V4SI 95) > > (const_int 8 [0x8]))) > > cost = 4 > > > > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence. > > > > I think the issue is probably that for the interleave+zip1 sequence we take > > max(costs[0], costs[1]) to reflect that both halves are interleaved, > > but for the fallback seq we use seq_cost, which assumes serial execution > > of insns in the sequence. > > For above fallback sequence, > > set (reg:V4SI 95) > > (vec_duplicate:V4SI (reg/v:SI 93 [ x ]))) > > and > > (set (reg:SI 98) > > (const_int 1 [0x1])) > > could be executed in parallel, which would make it's cost max(8, 4) + 4 = 12. > > Agreed. > > A good-enough substitute for this might be to ignore scalar moves > (for both alternatives) when costing for speed. Thanks for the suggestions. Just wondering for aarch64, if there's an easy way we can check if insn is a scalar move, similar to riscv's scalar_move_insn_p that checks if get_attr_type(insn) is TYPE_VIMOVXV or TYPE_VFMOVFV ? > > > I was wondering if we should we make cost for interleave+zip1 sequence > > more conservative > > by not taking max, but summing up costs[0] + costs[1] even for speed ? > > For this case, > > that would be 8 + 8 + 4 = 20. > > > > It generates the fallback sequence for other cases (s8, s16, s64) from > > the test-case. > > What does it do for the tests in the interleave+zip1 patch? If it doesn't > make a difference there then it sounds like we don't have enough tests. :) Oh right, the tests in interleave+zip1 patch only check for s16 case, sorry about that :/ Looking briefly at the code generated for s8, s32 and s64 case, (a) s8, and s16 seem to use same sequence for all cases. (b) s64 seems to use fallback sequence. (c) For vec-init-21.c, s8 and s16 cases prefer fallback sequence because costs are tied, while s32 case prefers interleave+zip1: int32x4_t f_s32(int32_t x, int32_t y) { return (int32x4_t) { x, y, 1, 2 }; } Code-gen with interleave+zip1 sequence: f_s32: movi v31.2s, 0x1 movi v0.2s, 0x2 ins v31.s[0], w0 ins v0.s[0], w1 zip1 v0.4s, v31.4s, v0.4s ret Code-gen with fallback sequence: f_s32: adrp x2, .LC0 ldr q0, [x2, #:lo12:.LC0] ins v0.s[0], w0 ins v0.s[1], w1 ret Fallback sequence cost = 20 interleave+zip1 sequence cost = 12 I assume interleave+zip1 sequence is better in this case (chosen currently) ? I will send a patch to add cases for s8, s16 and s64 in a follow up patch soon. > > Summing is only conservative if the fallback sequence is somehow "safer". > But I don't think it is. Building an N-element vector from N scalars > can be done using N instructions in the fallback case and N+1 instructions > in the interleave+zip1 case. But the interleave+zip1 case is still > better (speedwise) for N==16. Ack, thanks. Should we also prefer interleave+zip1 when the costs are tied ? For eg, for the following case: int32x4_t f_s32(int32_t x) { return (int32x4_t) { x, 1, x, 1 }; } costs for both fallback and interleave+zip1 sequence = 12, and we currently choose fallback sequence. Code-gen: f_s32: movi v0.4s, 0x1 fmov s31, w0 ins v0.s[0], v31.s[0] ins v0.s[2], v31.s[0] ret while, if we choose interleave+zip1, code-gen is: f_s32: dup v31.2s, w0 movi v0.2s, 0x1 zip1 v0.4s, v31.4s, v0.4s ret I suppose the interleave+zip1 sequence is better in this case ? And more generally, if the costs are tied, would it be OK to prefer interleave+zip1 sequence since it will have parallel execution of two halves, which may not always be the case with fallback sequence ? Also, would it be OK to commit the above patch that addresses the issue with single constant case and xfail the s32 case for now ? Thanks, Prathamesh > > Thanks, > Richard