From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jeffreyalaw@gmail.com>
Received: from mail-pg1-x536.google.com (mail-pg1-x536.google.com [IPv6:2607:f8b0:4864:20::536])
	by sourceware.org (Postfix) with ESMTPS id 35D403851AB7
	for <gcc-patches@gcc.gnu.org>; Fri, 26 Aug 2022 16:26:17 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 35D403851AB7
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-pg1-x536.google.com with SMTP id bh13so1823089pgb.4
        for <gcc-patches@gcc.gnu.org>; Fri, 26 Aug 2022 09:26:17 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=content-transfer-encoding:in-reply-to:from:references:to
         :content-language:subject:user-agent:mime-version:date:message-id
         :from:to:cc;
        bh=XGztFhng9wTvAR/WA5VpxnOvroAmVcCh+J7fYT+IQMQ=;
        b=L5M2JzUV+FsOJ2KIPz9Z5YQhUt3WMyJlLH2JGfkzAvkILL/G5OPy3BrHzVNsRvM+da
         ElZeUZE5pqVbpzNzNDsKTPxwfzS0bmmkh9RhINANirdkCBgUKzTEws5XVoDW/G6j4iwP
         f2K6LP/Hg73NI3/G+Bi0ROqjdMngvsBh6byurVDzd9qKNxIebge2hU6zRN2HseaOagJK
         jP5jt9fwTzYq5txRXIhgSS8WDWwFdQBDkHJqKSY6ayz8iW2tsqcl/W/uiyMS5n89V2Ey
         7npda33jWsweqSDvvsv1Ikhcp4sVF9+iefSKfMXIDboTqBE0Vx63nMW9KalOsFxH2eTj
         xoTw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=content-transfer-encoding:in-reply-to:from:references:to
         :content-language:subject:user-agent:mime-version:date:message-id
         :x-gm-message-state:from:to:cc;
        bh=XGztFhng9wTvAR/WA5VpxnOvroAmVcCh+J7fYT+IQMQ=;
        b=XJjyKGZALltOmavPhlI+oRsTJh9qwehaB5mj0lncMcNMed3a2jocfyhYurbtvk4IRG
         ptQRVwLJ1NWHzjIWAV5UuVfNdp4u2mtOta2+sE13cl7dbuSdYcwCftKEP+gztKHKPn1c
         SEgeeolke57demRMRwsnELn+3I6428nbOfFDt453BbzCa7onsjwlRJprBIJGoSfDGlkD
         TfY8AHcTwW41kjRmeRgIr8pSIySwZW0D90FkMwrcfDxQ68Bx9/jk1bBOZevbalcrMlzB
         l62HZuz5R64L2GNds1KgBbElQ3GUimFZjk+tF0NbsVSbc5AwAJZLgtdJiY9D+WvCAzb8
         admQ==
X-Gm-Message-State: ACgBeo3HyTcSZF2gu09JosFPV31hRjy3+Mwi5Pso7an+HdOZqCfGVayT
	6kpvra76RwZ2C4nTjJ22HGn20MifCes=
X-Google-Smtp-Source: AA6agR4LDbQWXdXrcv0JMgBINNgjvn9pf80nKTUp2YnAXAB1ZvPVqr+HEwRRfLtQVAlBKyVdtead7g==
X-Received: by 2002:a63:e618:0:b0:42a:dcd6:25e0 with SMTP id g24-20020a63e618000000b0042adcd625e0mr3821531pgh.379.1661531175306;
        Fri, 26 Aug 2022 09:26:15 -0700 (PDT)
Received: from [172.31.0.204] (c-73-98-188-51.hsd1.ut.comcast.net. [73.98.188.51])
        by smtp.gmail.com with ESMTPSA id e23-20020a63db17000000b00429b6e6c539sm1620963pgg.61.2022.08.26.09.26.14
        for <gcc-patches@gcc.gnu.org>
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Fri, 26 Aug 2022 09:26:14 -0700 (PDT)
Message-ID: <1106b505-65b0-4681-c53b-fe7579490e92@gmail.com>
Date: Fri, 26 Aug 2022 10:26:13 -0600
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
 Thunderbird/91.12.0
Subject: Re: [PATCH 6/6] Extend SLP permutation optimisations
Content-Language: en-US
To: gcc-patches@gcc.gnu.org
References: <mptwnawiids.fsf@arm.com> <mpt7d2wii99.fsf@arm.com>
From: Jeff Law <jeffreyalaw@gmail.com>
In-Reply-To: <mpt7d2wii99.fsf@arm.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-2.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,NICE_REPLY_A,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>


On 8/25/2022 7:07 AM, Richard Sandiford via Gcc-patches wrote:
> Currently SLP tries to force permute operations "down" the graph
> from loads in the hope of reducing the total number of permutations
> needed or (in the best case) removing the need for the permutations
> entirely.  This patch tries to extend it as follows:
>
> - Allow loads to take a different permutation from the one they
>    started with, rather than choosing between "original permutation"
>    and "no permutation".
>
> - Allow changes in both directions, if the target supports the
>    reverse permutation.
>
> - Treat the placement of permutations as a two-way dataflow problem:
>    after propagating information from leaves to roots (as now), propagate
>    information back up the graph.
>
> - Take execution frequency into account when optimising for speed,
>    so that (for example) permutations inside loops have a higher
>    cost than permutations outside loops.
>
> - Try to reduce the total number of permutations when optimising for
>    size, even if that increases the number of permutations on a given
>    execution path.
>
> See the big block comment above vect_optimize_slp_pass for
> a detailed description.
>
> The original motivation for doing this was to add a framework that would
> allow other layout differences in future.  The two main ones are:
>
> - Make it easier to represent predicated operations, including
>    predicated operations with gaps.  E.g.:
>
>       a[0] += 1;
>       a[1] += 1;
>       a[3] += 1;
>
>    could be a single load/add/store for SVE.  We could handle this
>    by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>    (depending on what's being counted).  We might need to move
>    elements between lanes at various points, like with permutes.
>
>    (This would first mean adding support for stores with gaps.)
>
> - Make it easier to switch between an even/odd and unpermuted layout
>    when switching between wide and narrow elements.  E.g. if a widening
>    operation produces an even vector and an odd vector, we should try
>    to keep operations on the wide elements in that order rather than
>    force them to be permuted back "in order".
Very cool.  Richi and I discussed this a bit a year or so ago -- 
basically noting that bi-directional movement is really the way to go 
and that the worst thing to do is push a permute down into the *middle* 
of a computation chain since that will tend to break FMA generation.  
Moving to the loads or stores or to another permute point ought to be 
the goal.
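
To make the FMA point concrete, here is a rough sketch in plain C, with
4-lane "vectors" modelled as arrays.  None of this is taken from the patch:
permute, permute_mid_chain and permute_at_loads are hypothetical names, and
the code only illustrates where the permute sits relative to the multiply
and the add, not how the vectorizer represents it.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4

/* Hypothetical helper: out[i] = in[sel[i]].  */
static void
permute (const double *in, const int *sel, double *out)
{
  for (size_t i = 0; i < LANES; i++)
    {
      assert (sel[i] >= 0 && sel[i] < LANES);
      out[i] = in[sel[i]];
    }
}

/* The permute sits between the multiply and the add, so the add
   consumes the permuted product rather than the product itself.  */
static void
permute_mid_chain (const double *a, const double *b, const double *c,
                   const int *sel, double *out)
{
  double prod[LANES], perm[LANES];

  for (size_t i = 0; i < LANES; i++)
    prod[i] = a[i] * b[i];
  permute (prod, sel, perm);
  for (size_t i = 0; i < LANES; i++)
    out[i] = perm[i] + c[i];
}

/* Same lane-by-lane computation with the permute moved up to the
   loads: the multiply and the add are adjacent again.  */
static void
permute_at_loads (const double *a, const double *b, const double *c,
                  const int *sel, double *out)
{
  double pa[LANES], pb[LANES];

  permute (a, sel, pa);
  permute (b, sel, pb);
  for (size_t i = 0; i < LANES; i++)
    out[i] = pa[i] * pb[i] + c[i];
}
```

Both functions compute a[sel[i]] * b[sel[i]] + c[i] for each lane.  The
second form spends two permutes instead of one in this toy version, so
whether it wins depends on whether those can be folded into the loads, but
it keeps the multiply feeding the add directly.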

It looks like you've covered that exceedingly well :-)

Jeff