From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk1-x734.google.com (mail-qk1-x734.google.com [IPv6:2607:f8b0:4864:20::734]) by sourceware.org (Postfix) with ESMTPS id 985493870924 for ; Mon, 9 Nov 2020 16:00:22 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 985493870924 Received: by mail-qk1-x734.google.com with SMTP id l2so8371423qkf.0 for ; Mon, 09 Nov 2020 08:00:22 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:references:cc:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=b2KuEUuM3HEHnrhBuFsqRTS1L4AuMPCspnG4f7JEj10=; b=ThG65jtqMpC6Tq26W/G8ImoQ6zKMFAkcH9WSwGyXjDuqOe7dEF4G/DCpnsW6zLI4ad SGNOHdf6RdWCJUoTQl13FTj56Yqj3QxH0so6kZvmtTfdRfpYUKqV2tpU6oM4i8A4k/Cj d5t/BdmgV8rpYPhiOHa9t20wGYtb7hMK6AANsNKelJFjrgERUBJgPxzozdGZTEUEKUYN u/te8d/0idxeIhSswQo9Lgs9BRLe792kiZgmZYNgr/N9TNYkn2af7pRjczNDo42kLkmO YRrA9QV+8fnqI5fjyjGzP5ZIQbiIpOPTgwB9ZBMoMxq6vHHq1mzJMPHLLMtx2JjszeXZ q88A== X-Gm-Message-State: AOAM530OXr/GF0vKvaPZry/4lb2mU+HYIrKU5VjIIkS7tOxPn0rzaVqK GEjt6cZYMI2NDDocnNFUkmY= X-Google-Smtp-Source: ABdhPJx+zt8VWEX7zIk9UMiBahrhcbX1CpblhW5UrBwPsKkBcHU8W2iz0jDE6XowhR3c5cp6y/4mrQ== X-Received: by 2002:a37:8703:: with SMTP id j3mr9590401qkd.5.1604937621995; Mon, 09 Nov 2020 08:00:21 -0800 (PST) Received: from [192.168.0.41] (174-16-106-146.hlrn.qwest.net. [174.16.106.146]) by smtp.gmail.com with ESMTPSA id p1sm6239435qkc.100.2020.11.09.08.00.20 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 09 Nov 2020 08:00:21 -0800 (PST) Subject: Re: [PATCH] warn for integer overflow in allocation calls (PR 96838) From: Martin Sebor To: Jeff Law , gcc-patches References: <83a00f2c-de4f-60d8-9399-a64671795eda@gmail.com> <1fea8304-a48b-6591-bab4-eb1469865bbd@redhat.com> <5905e37e-46eb-c12b-751b-fd1375a122f9@gmail.com> <5ad6fd82-18fb-eeb2-0c5f-959a39bcee1a@redhat.com> <06dc9a8b-7fdc-e2b7-86f2-77d0285c7349@redhat.com> <3ed754a9-474c-c4ee-d140-c8020322c59d@gmail.com> <81664cc8-0aff-6cac-20a1-55fdf401a1af@gmail.com> Message-ID: Date: Mon, 9 Nov 2020 09:00:19 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 In-Reply-To: <81664cc8-0aff-6cac-20a1-55fdf401a1af@gmail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-3.2 required=5.0 tests=BAYES_00, BODY_8BITS, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, KAM_SHORT, NICE_REPLY_A, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 09 Nov 2020 16:00:25 -0000 Ping: https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554000.html Jeff, I don't expect to have the cycles to reimplement this patch using the Ranger APIs before stage 1 closes. I'm open to giving it a try in stage 3 if it's still in scope for GCC 11. Otherwise, is this patch okay to commit? On 9/21/20 9:13 AM, Martin Sebor wrote: > On 9/20/20 12:39 AM, Aldy Hernandez wrote: >> >> >> On 9/19/20 11:22 PM, Martin Sebor wrote: >>> On 9/18/20 12:29 AM, Aldy Hernandez wrote: >>>> >>>> >>>> On 9/17/20 10:18 PM, Martin Sebor wrote: >>>>> On 9/17/20 12:39 PM, Andrew MacLeod wrote: >>>>>> On 9/17/20 12:08 PM, Martin Sebor via Gcc-patches wrote: >>>>>>> On 9/16/20 9:23 PM, Jeff Law wrote: >>>>>>>> >>>>>>>> On 9/15/20 1:47 PM, Martin Sebor wrote: >>>>>>>>> Overflowing the size of a dynamic allocation (e.g., malloc or VLA) >>>>>>>>> can lead to a subsequent buffer overflow corrupting the heap or >>>>>>>>> stack.  The attached patch diagnoses a subset of these cases where >>>>>>>>> the overflow/wraparound is still detectable. >>>>>>>>> >>>>>>>>> Besides regtesting GCC on x86_64-linux I also verified the warning >>>>>>>>> doesn't introduce any false positives into Glibc or Binutils/GDB >>>>>>>>> builds on the same target. >>>>>>>>> >>>>>>>>> Martin >>>>>>>>> >>>>>>>>> gcc-96838.diff >>>>>>>>> >>>>>>>>> PR middle-end/96838 - missing warning on integer overflow in >>>>>>>>> calls to allocation functions >>>>>>>>> >>>>>>>>> gcc/ChangeLog: >>>>>>>>> >>>>>>>>>     PR middle-end/96838 >>>>>>>>>     * calls.c (eval_size_vflow): New function. >>>>>>>>>     (get_size_range): Call it.  Add argument. >>>>>>>>>     (maybe_warn_alloc_args_overflow): Diagnose >>>>>>>>> overflow/wraparound. >>>>>>>>>     * calls.h (get_size_range): Add argument. >>>>>>>>> >>>>>>>>> gcc/testsuite/ChangeLog: >>>>>>>>> >>>>>>>>>     PR middle-end/96838 >>>>>>>>>     * gcc.dg/Walloc-size-larger-than-19.c: New test. >>>>>>>>>     * gcc.dg/Walloc-size-larger-than-20.c: New test. >>>>>>>> >>>>>>>> If an attacker can control an integer overflow that feeds an >>>>>>>> allocation, then they can do all kinds of bad things.  In fact, >>>>>>>> when my son was asking me attack vectors, this is one I said I'd >>>>>>>> look at if I were a bad guy. >>>>>>>> >>>>>>>> >>>>>>>> I'm a bit surprised you can't just query the range of the >>>>>>>> argument and get the overflow status out of that range, but I >>>>>>>> don't see that in the APIs.  How painful would it be to make >>>>>>>> that part of the API? The conceptual model would be to just ask >>>>>>>> for the range of the argument to malloc which would include the >>>>>>>> range and a status bit indicating the computation might have >>>>>>>> overflowed. >>>>>>> >>>>>>   Do we know if it did/would have wrapped? sure.  since we have to >>>>>> do the math.    so you are correct in that the information is >>>>>> there. but is it useful? >>>>>> >>>>>> We are in the very annoying habit of subtracting one by adding >>>>>> 0xFFFFFFF.  which means you get an overflow for unsigned when you >>>>>> subtract one.   From what I have seen of unsigned math, we would >>>>>> be flagging very many operations as overflows, so you would still >>>>>> have the difficulty of figuring out whether its a "real" overflow >>>>>> or a fake one because of the way we do unsigned math >>>>> >>>>> You and me both :) >>>>> >>>>>> >>>>>> At the very start, I did have an overflow flag in the range >>>>>> class... but it was turning out to be fairly useless so it was >>>>>> removed. >>>>>> . >>>>>> >>>>>>> I agree that being able to evaluate an expression in an as-if >>>>>>> infinite precision (in addition to its type) would be helpful. >>>>>> >>>>>> SO again, we get back to adding 0x0fffff when we are trying to >>>>>> subtract one...  now, with infinite precision you are going to see >>>>>> >>>>>>   [2,10]  - 1      we end up with [2,10]+0xFFFFFF, which will now >>>>>> give you  [0x100000001, 0x100000009]    so its going to look like >>>>>> it overflowed? >>>>>> >>>>>> >>>>>> >>>>>>> But just to make sure I understood correctly, let me ask again >>>>>>> using an example: >>>>>>> >>>>>>>   void* f (size_t n) >>>>>>>   { >>>>>>>     if (n < PTRDIFF_MAX / 2) >>>>>>>       n = PTRDIFF_MAX / 2; >>>>>>> >>>>>>>     return malloc (n * sizeof (int)); >>>>>>>   } >>>>>>> >>>>>>> Can the unsigned wraparound in the argument be readily detected? >>>>>>> >>>>>>> On trunk, this ends up with the following: >>>>>>> >>>>>>>   # RANGE [4611686018427387903, 18446744073709551615] >>>>>>>   _6 = MAX_EXPR ; >>>>>>>   # RANGE [0, 18446744073709551615] NONZERO 18446744073709551612 >>>>>>>   _1 = _6 * 4; >>>>>>>   ... >>>>>>>   p_5 = mallocD.1206 (_1); [tail call] >>>>>>>   ... >>>>>>>   return p_5; >>>>>>> >>>>>>> so _1's range reflects the wraparound in size_t, but _6's range >>>>>>> has enough information to uncover it.  So detecting it is possible >>>>>>> and is done in the patch so we get a warning: >>>>>>> >>>>>>> warning: argument 1 range [18446744073709551612, >>>>>>> 0x3fffffffffffffffc] is too large to represent in ‘long unsigned >>>>>>> int’ [-Walloc-size-larger-than=] >>>>>>>     6 |   return malloc (n * sizeof (int)); >>>>>>>       |          ^~~~~~~~~~~~~~~~~~~~~~~~~ >>>>>>> >>>>>>> The code is very simplistic and only handles a small subset of >>>>>>> cases. >>>>>>> It could be generalized and exposed by a more generic API but it >>>>>>> does >>>>>>> seem like the ranger must already have all the logic built into >>>>>>> it so >>>>>>> if it isn't exposed now it should be a matter of opening it up. >>>>>> >>>>>> everything is exposed in range-ops.  well, mostly. >>>>>> if we have _1 = _6 * 4 >>>>>> >>>>>> if one wanted to do that infinite precision, you query the range >>>>>> for _6, and the range for 4 (which would be [4,4] :-) >>>>>> range_of_expr (op1r, _6, stmt) >>>>>> range_of_expr (op2r, 4, stmt) >>>>>> >>>>>> you could take their current types, and cast those ranges to >>>>>> whatever the next higher precsion is, >>>>>> range_cast  (op1r, highertype) >>>>>> range_cast (op2r, highertype) >>>>>> then invoke the operation on those parameters >>>>>> >>>>>> gimple_range_fold (r, stmt,  op1r, op2r) >>>>>> >>>>>> and that will do your operation in the higher precision.  you >>>>>> could compare that to the value in regular precision too i suppose. >>>>> >>>>> The patch does pretty much exactly what you described, except in >>>>> offset_int, and only for a limited set of arithmetic operations. >>>>> It figures out if an unsigned expression wrapped simply by checking >>>>> to see if the mathematically correct result fits in its type. >>>>> >>>>> It sounds like I should be able to use the range APIs above instead >>>>> and get the same result, but for arbitrary expressions.  I don't >>>>> need to (and very well can't) compare the infinitely precise result >>>>> to the regular result because when it's a range that wraps the result >>>>> is the full range of the type (i.e., [0, SIZE_MAX] in the test above). >>>>> >>>>>> >>>>>> The ranger is designed to track ranges as they are represented in >>>>>> the IL.  You are asking for us to interpret the IL as something >>>>>> other than what is there, and increase the precision.  Thats a >>>>>> different task. >>>>>> >>>>>> could that be done?    maybe. It might involve parallel tracking >>>>>> of ssa-Name ranges in a higher precision... and recalculating >>>>>> every expression using those values.   Im not convinced that a >>>>>> generalized ranger which works in higher precision is really going >>>>>> to solve the problem the way you hope it would.. i think we'd get >>>>>> a lot of false info when unsigned values are involved >>>>> >>>>> I don't think that's necessary for unsigned wrapping.  Most of >>>>> the time the (possibly) wrapped result is what we need because >>>>> that's what the languages say is supposed to happen.  The cases >>>>> when we need to check for the dangerous wraparound (or overflow) >>>>> should be rare (just allocation calls) and can be handled on >>>>> demand by walking the IL and doing the computation in the infinite >>>>> precision. >>>>> >>>>> For signed overflow, though, computing the mathematically correct >>>>> result seems preferable.  Or at least tracking (and propagating) >>>>> the overflow bit should be.  That way we could tell when >>>>> an expression is definitely invalid. >>>>> >>>>>> >>>>>> If I were to see a clear solution/description as to what is a >>>>>> problematic overflow and what isnt, there might be something more >>>>>> general that could be done... >>>>> >>>>> Does the distinction above (unsigned wrapping/signed overflow) help? >>>>> >>>>>> >>>>>>> I mentioned it to Aldy in response to the irange best practices >>>>>>> document.  He says there's no way to do that and no plans to >>>>>>> make it possible. >>>>>>> >>>>>> Thats not quite what he said.   . He said  "If the operation may >>>>>> wrap or overflow, it will show up as so in the range.  I don't >>>>>> think we have any plans of changing this behavior." >>>>>> >>>>>> In fact. looking back, he said the same thing I just did.... >>>>>> There IS a mechanism there for working in higher precision should >>>>>> one desire to do so, but as I pointed out above, Im not convinced >>>>>> as a general thing in the ranger it will work how you imagine. >>>>>> >>>>>> My guess is what you really want is to be invoking range-ops on >>>>>> stmts you care about whether they might overflow or not with >>>>>> higher precision versions and then identify the cases which >>>>>> trigger that you actually care about. >>>>> >>>>> Yes, it looks like that's the way to go.  I was hoping the ranger >>>>> had an API that would do all that for me (i.e., walk the IL and >>>>> compute the range of the result on demand in the precision of >>>>> my choice), since it has to do that as it is, except that it uses >>>>> the precision of the expression.  But if not, it sounds like with >>>>> range_ops I should be able to fairly easily write one myself. >>>> >>>> The range_ops class is too low-level for this.  It folds pairs of >>>> ranges, and has no concept of gimple or tree expressions.  To roll >>>> your own solution you'd have to walk the IL, parse the gimple >>>> statements and then call range-ops on your adjusted arguments.  A >>>> better approach would be to overload range_of_stmt as Andrew >>>> suggested (see below). >>>> >>>>> >>>>>> So maybe you want an "overflow calculator" which is similar to the >>>>>> ranger, and  identifies which kinds of stmts expect certain range >>>>>> of results, and if they get results outside of  the expected >>>>>> ranges, triggers a flag. >>>>>> maybe it could convert all unsigned and signed  values to signed, >>>>>> and then work in a higher precision,  and then if any intermediate >>>>>> result cant be accurately cast back to the original type,  it gets >>>>>> flagged? is that the kind of overflow that is cared about? >>>>> >>>>> There are two kinds of "overflow:" unsigned wrapping and signed >>>>> overflow.  Both are a problem in allocation calls like: >>>>> >>>>>    int *p = malloc (nelts * size); >>>>> >>>>> If nelts and size are signed, the compile-time result seems to >>>>> be done in saturation arithmetic (its range looks to be capped >>>>> at TYPE_MAX).  If they are unsigned, they wrap around zero. >>>>> Neither is usually intended by the user but the wraparound is >>>>> the more dangerous of the two because it's well-defined and >>>>> because it shrinks the size to a very small number. >>>>> >>>>>> >>>>>> I dont know... this isnt my space :-)       Im just telling you >>>>>> that I'm not sure the current range query mechanism is the right >>>>>> place to be asking if an "overflow" occurred because I think the >>>>>> general answer wont be useful.. too many kinds of overflows thrown >>>>>> together. >>>>>> >>>>>> I think this requires more specifoc detailed information, and >>>>>> maybe a simple overflow analyzer will do the trick based on >>>>>> range-ops.... I think range-ops has the required abilities. >>>>>> whether it needs some enhancing, or something else exposed,  I'm >>>>>> not sure yet >>>>> >>>>> Thank you!  This was a useful clarification to what Aldy said. >>>>> I'd been meaning to follow up with him on this point when I was >>>>> done with what I'm working on but Jeff's question just got me >>>>> the answers I was looking for quicker.  Let me experiment with >>>>> range_ops and get back to you. >>>>> >>>>>> >>>>>> >>>>>> Andrew >>>>>> >>>>>> PS  One could always take a ranger and overload the >>>>>> range_of_stmt() routine that calculates results, so that it calls >>>>>> current functionality, then try converting the arguements to >>>>>> higher precision ranges and re-invoking range-ops on those >>>>>> arguments, get a new higher precision range back and then look at >>>>>> the results vs the normal ones. Might be a place to start >>>>>> experimenting...   then pass a flag along sayign an overflow >>>>>> occurred. >>>>> >>>>> That sounds like something to look into.  I don't see range_of_stmt >>>> >>>> This would be my preferred approach.  It's clean and builds on top >>>> of the current infrastructure without polluting other users of the >>>> ranger. It definitely sounds like something worth pursuing. >>>> >>>>  > on trunk.  Is that the right name or has it not been merged yet? >>>> >>>> This is in the ranger proper which should come in, in the next week >>>> or two (??).  You can see it in our staging branch >>>> (users/aldyh/ranger-staging): >>>> >>>> class gimple_ranger : public range_query >>>> { >>>> public: >>>>    gimple_ranger (bool use_loop_info) : m_use_loop_info >>>> (use_loop_info) { } >>>>    virtual bool range_of_stmt (irange &r, gimple *, tree name = >>>> NULL) OVERRIDE; >>>>    virtual bool range_of_expr (irange &r, tree name, gimple * = >>>> NULL) OVERRIDE; >>>>    virtual bool range_on_edge (irange &r, edge e, tree name) OVERRIDE; >>>>    virtual void range_on_entry (irange &r, basic_block bb, tree name); >>>>    virtual void range_on_exit (irange &r, basic_block bb, tree name); >>>> ... >>>> ... >>>> }; >>>> >>>> The idea is that range_of_stmt() is called for all statements as the >>>> IL is walked backwards.  You could just overload this, build a >>>> throwaway statement you could pass to gimple_ranger::range_of_stmt, >>>> and work from there. >>> >>> Thanks for the tutorial!  I see the patch you posted with this API >>> so I'll give it a try once it's in. >> >> Note that the patch I posted was to standardize valuation throughout >> the compiler.  It's not the ranger per se, but the underlying API for >> querying ranges that ranger, vr_values, etc will use.  So you need to >> wait for the full ranger to work on this. > > I see.  In that case it might make sense to commit the patch as is > and switch it over to the ranger API when it's in.  Jeff, what's > your preference? > > Martin