From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 9 Jun 2022 05:40:43 +0100
From: Tamar Christina
To: gcc-patches@gcc.gnu.org
Cc: nd@arm.com, Richard.Earnshaw@arm.com, Marcus.Shawcroft@arm.com, Kyrylo.Tkachov@arm.com, richard.sandiford@arm.com
Subject: [PATCH 2/2]AArch64 aarch64: Add implementation for pow2 bitmask division.
Content-Type: multipart/mixed; boundary="tX2WD1YoU5aqHstC"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0

--tX2WD1YoU5aqHstC
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Hi All,

This adds an implementation for the new optab for unsigned pow2 bitmask for
AArch64.

The implementation rewrites:

   y = x / (2 ^ (bitsize (x)/2) - 1)

into e.g. (for bytes, i.e. a 16-bit x divided by 255)

   (x + ((x + 257) >> 8)) >> 8

where it's required that the additions be done in double the precision of x
such that we don't lose any bits during an overflow.

Essentially the sequence decomposes the division into two smaller divisions,
one for the top and one for the bottom part of the number, and adds the
results back together.

To account for the fact that a shift by 8 would be a division by 256, we add 1
to both parts of x such that when x is 255 we still get 1 as the answer.

Because the amount we shift by is half the width of the original datatype, we
can use the halving instructions the ISA provides to do the operation instead
of using actual shifts.

For AArch64 this means we generate for:

void draw_bitmap1(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < (n & -16); i+=1)
    pixel[i] = (pixel[i] * level) / 0xff;
}

the following:

	movi	v3.16b, 0x1
	umull2	v1.8h, v0.16b, v2.16b
	umull	v0.8h, v0.8b, v2.8b
	addhn	v5.8b, v1.8h, v3.8h
	addhn	v4.8b, v0.8h, v3.8h
	uaddw	v1.8h, v1.8h, v5.8b
	uaddw	v0.8h, v0.8h, v4.8b
	uzp2	v0.16b, v0.16b, v1.16b

instead of:

	umull	v2.8h, v1.8b, v5.8b
	umull2	v1.8h, v1.16b, v5.16b
	umull	v0.4s, v2.4h, v3.4h
	umull2	v2.4s, v2.8h, v3.8h
	umull	v4.4s, v1.4h, v3.4h
	umull2	v1.4s, v1.8h, v3.8h
	uzp2	v0.8h, v0.8h, v2.8h
	uzp2	v1.8h, v4.8h, v1.8h
	shrn	v0.8b, v0.8h, 7
	shrn2	v0.16b, v1.8h, 7

which is shorter (8 instructions instead of 10, with two widening multiplies
instead of six) and results in significantly faster code.

Thanks to Wilco for the concept.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (udiv_pow2_bitmask2): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/div-by-bitmask.c: New test.

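As a sanity check of the rewrite described above (an illustration only, not part of the patch), the identity can be verified exhaustively in scalar C. The sketch below makes a few assumptions: all function names are hypothetical, `div255_rewrite` evaluates the formula in 32 bits so that no addition can wrap, and `div255_lane` is a scalar reading of what one 16-bit lane of the addhn/uaddw/shift sequence computes when the input is a product of two bytes, as in draw_bitmap1. The constant 0x0101 in the lane model corresponds to `movi v3.16b, 0x1` viewed as 16-bit lanes.

```c
#include <stdint.h>

/* Reference result: plain unsigned division by 0xff.  */
static unsigned div255_ref (unsigned x)
{
  return x / 0xffu;
}

/* The rewrite from the mail, evaluated in 32 bits so that neither addition
   can wrap for any 16-bit input.  */
static unsigned div255_rewrite (uint16_t x)
{
  uint32_t wide = x;
  return (wide + ((wide + 257u) >> 8)) >> 8;
}

/* Scalar model of one 16-bit lane of the generated sequence.  This is an
   assumption made for illustration, not code from the patch.  All
   arithmetic stays in 16 bits, as it would inside a vector lane.  */
static unsigned div255_lane (uint8_t a, uint8_t b)
{
  /* umull: 8x8 -> 16 bit widening multiply.  */
  uint16_t prod = (uint16_t) (a * b);
  /* addhn: add 0x0101 in 16 bits, keep only the high byte.  */
  uint8_t high = (uint8_t) ((uint16_t) (prod + 0x0101u) >> 8);
  /* uaddw: add the narrowed byte back onto the 16-bit value.  */
  uint16_t sum = (uint16_t) (prod + high);
  /* Final shift by 8 (the uzp2 in the listing above).  */
  return sum >> 8;
}

/* Count the inputs on which either model disagrees with real division:
   every 16-bit value for the wide model, every byte pair for the lane
   model.  */
unsigned div255_count_mismatches (void)
{
  unsigned mismatches = 0;

  for (uint32_t x = 0; x <= 0xffffu; x++)
    if (div255_rewrite ((uint16_t) x) != div255_ref (x))
      mismatches++;

  for (unsigned a = 0; a <= 0xffu; a++)
    for (unsigned b = 0; b <= 0xffu; b++)
      if (div255_lane ((uint8_t) a, (uint8_t) b) != div255_ref (a * b))
        mismatches++;

  return mismatches;
}
```

Calling `div255_count_mismatches ()` from a small driver is expected to return 0. The lane model is only checked for products of two bytes (at most 255 * 255 = 65025), which is the range where neither 16-bit addition wraps.
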
--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 18733428f3fb91d937346aa360f6d1fe13ca1eae..6b0405924a03a243949a6741f4c0e989d9ca2869 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4845,6 +4845,57 @@ (define_expand "aarch64_hn2"
   }
 )
 
+;; div optimizations using narrowings
+;; we can do the division e.g. shorts by 255 faster by calculating it as
+;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
+;; double the precision of x.
+;;
+;; If we imagine a short as being composed of two blocks of bytes then
+;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
+;; adding 1 to each sub component:
+;;
+;;      short value of 16-bits
+;; ┌──────────────┬────────────────┐
+;; │              │                │
+;; └──────────────┴────────────────┘
+;;   8-bit part1 ▲  8-bit part2   ▲
+;;               │                │
+;;               │                │
+;;              +1               +1
+;;
+;; After the first addition, we have to shift right by 8, and narrow the
+;; results back to a byte.  Remember that the addition must be done in
+;; double the precision of the input.  Since 8 is half the size of a short
+;; we can use a narrowing halving instruction in AArch64, addhn which also
+;; does the addition in a wider precision and narrows back to a byte.  The
+;; shift itself is implicit in the operation as it writes back only the top
+;; half of the result. i.e. bits 2*esize-1:esize.
+;;
+;; Since we have narrowed the result of the first part back to a byte, for
+;; the second addition we can use a widening addition, uaddw.
+;;
+;; For the final shift, since it's unsigned arithmetic, we emit a ushr by 8
+;; to do the shift.
+;;
+;; The shift is later optimized by combine to a uzp2 with movi #0.
+(define_expand "udiv_pow2_bitmask2"
+  [(match_operand:VQN 0 "register_operand")
+   (match_operand:VQN 1 "register_operand")]
+  "TARGET_SIMD"
+{
+  rtx addend = gen_reg_rtx (mode);
+  rtx val = aarch64_simd_gen_const_vector_dup (mode, 1);
+  emit_move_insn (addend, lowpart_subreg (mode, val, mode));
+  rtx tmp1 = gen_reg_rtx (mode);
+  rtx tmp2 = gen_reg_rtx (mode);
+  emit_insn (gen_aarch64_addhn (tmp1, operands[1], addend));
+  unsigned bitsize = GET_MODE_UNIT_BITSIZE (mode);
+  rtx shift_vector = aarch64_simd_gen_const_vector_dup (mode, bitsize);
+  emit_insn (gen_aarch64_uaddw (tmp2, operands[1], tmp1));
+  emit_insn (gen_aarch64_simd_lshr (operands[0], tmp2, shift_vector));
+  DONE;
+})
+
 ;; pmul.
 
 (define_insn "aarch64_pmul"
diff --git a/gcc/testsuite/gcc.target/aarch64/div-by-bitmask.c b/gcc/testsuite/gcc.target/aarch64/div-by-bitmask.c
new file mode 100644
index 0000000000000000000000000000000000000000..c03aee695ef834fbe3533a21d54a218160b0007d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/div-by-bitmask.c
@@ -0,0 +1,70 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -std=c99 -fdump-tree-vect -save-temps" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+#include <stdint.h>
+
+/*
+** draw_bitmap1:
+** ...
+** 	umull2	v[0-9]+.8h, v[0-9]+.16b, v[0-9]+.16b
+** 	umull	v[0-9]+.8h, v[0-9]+.8b, v[0-9]+.8b
+** 	addhn	v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
+** 	addhn	v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
+** 	uaddw	v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8b
+** 	uaddw	v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8b
+** 	uzp2	v[0-9]+.16b, v[0-9]+.16b, v[0-9]+.16b
+** ...
+*/
+void draw_bitmap1(uint8_t* restrict pixel, uint8_t level, int n)
+{
+  for (int i = 0; i < (n & -16); i+=1)
+    pixel[i] = (pixel[i] * level) / 0xff;
+}
+
+void draw_bitmap2(uint8_t* restrict pixel, uint8_t level, int n)
+{
+  for (int i = 0; i < (n & -16); i+=1)
+    pixel[i] = (pixel[i] * level) / 0xfe;
+}
+
+/*
+** draw_bitmap3:
+** ...
+** 	umull2	v[0-9]+.4s, v[0-9]+.8h, v[0-9]+.8h
+** 	umull	v[0-9]+.4s, v[0-9]+.4h, v[0-9]+.4h
+** 	addhn	v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
+** 	addhn	v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
+** 	uaddw	v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4h
+** 	uaddw	v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4h
+** 	uzp2	v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
+** ...
+*/
+void draw_bitmap3(uint16_t* restrict pixel, uint16_t level, int n)
+{
+  for (int i = 0; i < (n & -16); i+=1)
+    pixel[i] = (pixel[i] * level) / 0xffffU;
+}
+
+/*
+** draw_bitmap4:
+** ...
+** 	umull2	v[0-9]+.2d, v[0-9]+.4s, v[0-9]+.4s
+** 	umull	v[0-9]+.2d, v[0-9]+.2s, v[0-9]+.2s
+** 	addhn	v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
+** 	addhn	v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
+** 	uaddw	v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2s
+** 	uaddw	v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2s
+** 	uzp2	v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
+** ...
+*/
+/* Costing for long vectorization seems off, so disable
+   the cost model to test the codegen.  */
+__attribute__ ((optimize("-fno-vect-cost-model")))
+void draw_bitmap4(uint32_t* restrict pixel, uint32_t level, int n)
+{
+  for (int i = 0; i < (n & -16); i+=1)
+    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL;
+}
+
+/* { dg-final { scan-tree-dump-times "\.DIV_POW2_BITMASK" 6 "vect" } } */

-- 
--tX2WD1YoU5aqHstC--