From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <msebor@gmail.com>
Received: from mail-qk1-x734.google.com (mail-qk1-x734.google.com
 [IPv6:2607:f8b0:4864:20::734])
 by sourceware.org (Postfix) with ESMTPS id 985493870924
 for <gcc-patches@gcc.gnu.org>; Mon,  9 Nov 2020 16:00:22 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 985493870924
Received: by mail-qk1-x734.google.com with SMTP id l2so8371423qkf.0
 for <gcc-patches@gcc.gnu.org>; Mon, 09 Nov 2020 08:00:22 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:subject:from:to:references:cc:message-id:date
 :user-agent:mime-version:in-reply-to:content-language
 :content-transfer-encoding;
 bh=b2KuEUuM3HEHnrhBuFsqRTS1L4AuMPCspnG4f7JEj10=;
 b=ThG65jtqMpC6Tq26W/G8ImoQ6zKMFAkcH9WSwGyXjDuqOe7dEF4G/DCpnsW6zLI4ad
 SGNOHdf6RdWCJUoTQl13FTj56Yqj3QxH0so6kZvmtTfdRfpYUKqV2tpU6oM4i8A4k/Cj
 d5t/BdmgV8rpYPhiOHa9t20wGYtb7hMK6AANsNKelJFjrgERUBJgPxzozdGZTEUEKUYN
 u/te8d/0idxeIhSswQo9Lgs9BRLe792kiZgmZYNgr/N9TNYkn2af7pRjczNDo42kLkmO
 YRrA9QV+8fnqI5fjyjGzP5ZIQbiIpOPTgwB9ZBMoMxq6vHHq1mzJMPHLLMtx2JjszeXZ
 q88A==
X-Gm-Message-State: AOAM530OXr/GF0vKvaPZry/4lb2mU+HYIrKU5VjIIkS7tOxPn0rzaVqK
 GEjt6cZYMI2NDDocnNFUkmY=
X-Google-Smtp-Source: ABdhPJx+zt8VWEX7zIk9UMiBahrhcbX1CpblhW5UrBwPsKkBcHU8W2iz0jDE6XowhR3c5cp6y/4mrQ==
X-Received: by 2002:a37:8703:: with SMTP id j3mr9590401qkd.5.1604937621995;
 Mon, 09 Nov 2020 08:00:21 -0800 (PST)
Received: from [192.168.0.41] (174-16-106-146.hlrn.qwest.net. [174.16.106.146])
 by smtp.gmail.com with ESMTPSA id p1sm6239435qkc.100.2020.11.09.08.00.20
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Mon, 09 Nov 2020 08:00:21 -0800 (PST)
Subject: Re: [PATCH] warn for integer overflow in allocation calls (PR 96838)
From: Martin Sebor <msebor@gmail.com>
To: Jeff Law <law@redhat.com>, gcc-patches <gcc-patches@gcc.gnu.org>
References: <83a00f2c-de4f-60d8-9399-a64671795eda@gmail.com>
 <1fea8304-a48b-6591-bab4-eb1469865bbd@redhat.com>
 <5905e37e-46eb-c12b-751b-fd1375a122f9@gmail.com>
 <5ad6fd82-18fb-eeb2-0c5f-959a39bcee1a@redhat.com>
 <b06e6ad3-bbc8-e036-6090-7afce4d9c1b1@gmail.com>
 <06dc9a8b-7fdc-e2b7-86f2-77d0285c7349@redhat.com>
 <3ed754a9-474c-c4ee-d140-c8020322c59d@gmail.com>
 <d9720831-b383-d157-1c43-226be97306bc@redhat.com>
 <81664cc8-0aff-6cac-20a1-55fdf401a1af@gmail.com>
Message-ID: <ab0391cf-7eb6-ab6e-4618-4beb172671e0@gmail.com>
Date: Mon, 9 Nov 2020 09:00:19 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.2.2
MIME-Version: 1.0
In-Reply-To: <81664cc8-0aff-6cac-20a1-55fdf401a1af@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-3.2 required=5.0 tests=BAYES_00, BODY_8BITS,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM,
 KAM_SHORT, NICE_REPLY_A, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Nov 2020 16:00:25 -0000

Ping: https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554000.html

Jeff, I don't expect to have the cycles to reimplement this patch
using the Ranger APIs before stage 1 closes.  I'm open to giving
it a try in stage 3 if it's still in scope for GCC 11.  Otherwise,
is this patch okay to commit?

On 9/21/20 9:13 AM, Martin Sebor wrote:
> On 9/20/20 12:39 AM, Aldy Hernandez wrote:
>>
>>
>> On 9/19/20 11:22 PM, Martin Sebor wrote:
>>> On 9/18/20 12:29 AM, Aldy Hernandez wrote:
>>>>
>>>>
>>>> On 9/17/20 10:18 PM, Martin Sebor wrote:
>>>>> On 9/17/20 12:39 PM, Andrew MacLeod wrote:
>>>>>> On 9/17/20 12:08 PM, Martin Sebor via Gcc-patches wrote:
>>>>>>> On 9/16/20 9:23 PM, Jeff Law wrote:
>>>>>>>>
>>>>>>>> On 9/15/20 1:47 PM, Martin Sebor wrote:
>>>>>>>>> Overflowing the size of a dynamic allocation (e.g., malloc or VLA)
>>>>>>>>> can lead to a subsequent buffer overflow corrupting the heap or
>>>>>>>>> stack.  The attached patch diagnoses a subset of these cases where
>>>>>>>>> the overflow/wraparound is still detectable.
>>>>>>>>>
>>>>>>>>> Besides regtesting GCC on x86_64-linux I also verified the warning
>>>>>>>>> doesn't introduce any false positives into Glibc or Binutils/GDB
>>>>>>>>> builds on the same target.
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>>
>>>>>>>>> gcc-96838.diff
>>>>>>>>>
>>>>>>>>> PR middle-end/96838 - missing warning on integer overflow in 
>>>>>>>>> calls to allocation functions
>>>>>>>>>
>>>>>>>>> gcc/ChangeLog:
>>>>>>>>>
>>>>>>>>>     PR middle-end/96838
>>>>>>>>>     * calls.c (eval_size_vflow): New function.
>>>>>>>>>     (get_size_range): Call it.  Add argument.
>>>>>>>>>     (maybe_warn_alloc_args_overflow): Diagnose 
>>>>>>>>> overflow/wraparound.
>>>>>>>>>     * calls.h (get_size_range): Add argument.
>>>>>>>>>
>>>>>>>>> gcc/testsuite/ChangeLog:
>>>>>>>>>
>>>>>>>>>     PR middle-end/96838
>>>>>>>>>     * gcc.dg/Walloc-size-larger-than-19.c: New test.
>>>>>>>>>     * gcc.dg/Walloc-size-larger-than-20.c: New test.
>>>>>>>>
>>>>>>>> If an attacker can control an integer overflow that feeds an 
>>>>>>>> allocation, then they can do all kinds of bad things.  In fact, 
>>>>>>>> when my son was asking me attack vectors, this is one I said I'd 
>>>>>>>> look at if I were a bad guy.
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm a bit surprised you can't just query the range of the 
>>>>>>>> argument and get the overflow status out of that range, but I 
>>>>>>>> don't see that in the APIs.  How painful would it be to make 
>>>>>>>> that part of the API? The conceptual model would be to just ask 
>>>>>>>> for the range of the argument to malloc which would include the 
>>>>>>>> range and a status bit indicating the computation might have 
>>>>>>>> overflowed.
>>>>>>>
>>>>>>   Do we know if it did/would have wrapped? sure.  since we have to 
>>>>>> do the math.    so you are correct in that the information is 
>>>>>> there. but is it useful?
>>>>>>
>>>>>> We are in the very annoying habit of subtracting one by adding 
>>>>>> 0xFFFFFFF.  which means you get an overflow for unsigned when you 
>>>>>> subtract one.   From what I have seen of unsigned math, we would 
>>>>>> be flagging very many operations as overflows, so you would still 
>>>>>> have the difficulty of figuring out whether its a "real" overflow 
>>>>>> or a fake one because of the way we do unsigned math
>>>>>
>>>>> You and me both :)
>>>>>
>>>>>>
>>>>>> At the very start, I did have an overflow flag in the range 
>>>>>> class... but it was turning out to be fairly useless so it was 
>>>>>> removed.
>>>>>> .
>>>>>>
>>>>>>> I agree that being able to evaluate an expression in an as-if
>>>>>>> infinite precision (in addition to its type) would be helpful.
>>>>>>
>>>>>> SO again, we get back to adding 0x0fffff when we are trying to 
>>>>>> subtract one...  now, with infinite precision you are going to see
>>>>>>
>>>>>>   [2,10]  - 1      we end up with [2,10]+0xFFFFFF, which will now 
>>>>>> give you  [0x100000001, 0x100000009]    so its going to look like 
>>>>>> it overflowed?
>>>>>>
>>>>>>
>>>>>>
>>>>>>> But just to make sure I understood correctly, let me ask again
>>>>>>> using an example:
>>>>>>>
>>>>>>>   void* f (size_t n)
>>>>>>>   {
>>>>>>>     if (n < PTRDIFF_MAX / 2)
>>>>>>>       n = PTRDIFF_MAX / 2;
>>>>>>>
>>>>>>>     return malloc (n * sizeof (int));
>>>>>>>   }
>>>>>>>
>>>>>>> Can the unsigned wraparound in the argument be readily detected?
>>>>>>>
>>>>>>> On trunk, this ends up with the following:
>>>>>>>
>>>>>>>   # RANGE [4611686018427387903, 18446744073709551615]
>>>>>>>   _6 = MAX_EXPR <n_2(D), 4611686018427387903>;
>>>>>>>   # RANGE [0, 18446744073709551615] NONZERO 18446744073709551612
>>>>>>>   _1 = _6 * 4;
>>>>>>>   ...
>>>>>>>   p_5 = mallocD.1206 (_1); [tail call]
>>>>>>>   ...
>>>>>>>   return p_5;
>>>>>>>
>>>>>>> so _1's range reflects the wraparound in size_t, but _6's range
>>>>>>> has enough information to uncover it.  So detecting it is possible
>>>>>>> and is done in the patch so we get a warning:
>>>>>>>
>>>>>>> warning: argument 1 range [18446744073709551612, 
>>>>>>> 0x3fffffffffffffffc] is too large to represent in ‘long unsigned 
>>>>>>> int’ [-Walloc-size-larger-than=]
>>>>>>>     6 |   return malloc (n * sizeof (int));
>>>>>>>       |          ^~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>>
>>>>>>> The code is very simplistic and only handles a small subset of 
>>>>>>> cases.
>>>>>>> It could be generalized and exposed by a more generic API but it 
>>>>>>> does
>>>>>>> seem like the ranger must already have all the logic built into 
>>>>>>> it so
>>>>>>> if it isn't exposed now it should be a matter of opening it up.
>>>>>>
>>>>>> everything is exposed in range-ops.  well, mostly.
>>>>>> if we have _1 = _6 * 4
>>>>>>
>>>>>> if one wanted to do that infinite precision, you query the range 
>>>>>> for _6, and the range for 4 (which would be [4,4] :-)
>>>>>> range_of_expr (op1r, _6, stmt)
>>>>>> range_of_expr (op2r, 4, stmt)
>>>>>>
>>>>>> you could take their current types, and cast those ranges to 
>>>>>> whatever the next higher precsion is,
>>>>>> range_cast  (op1r, highertype)
>>>>>> range_cast (op2r, highertype)
>>>>>> then invoke the operation on those parameters
>>>>>>
>>>>>> gimple_range_fold (r, stmt,  op1r, op2r)
>>>>>>
>>>>>> and that will do your operation in the higher precision.  you 
>>>>>> could compare that to the value in regular precision too i suppose.
>>>>>
>>>>> The patch does pretty much exactly what you described, except in
>>>>> offset_int, and only for a limited set of arithmetic operations.
>>>>> It figures out if an unsigned expression wrapped simply by checking
>>>>> to see if the mathematically correct result fits in its type.
>>>>>
>>>>> It sounds like I should be able to use the range APIs above instead
>>>>> and get the same result, but for arbitrary expressions.  I don't
>>>>> need to (and very well can't) compare the infinitely precise result
>>>>> to the regular result because when it's a range that wraps the result
>>>>> is the full range of the type (i.e., [0, SIZE_MAX] in the test above).
>>>>>
>>>>>>
>>>>>> The ranger is designed to track ranges as they are represented in 
>>>>>> the IL.  You are asking for us to interpret the IL as something 
>>>>>> other than what is there, and increase the precision.  Thats a 
>>>>>> different task.
>>>>>>
>>>>>> could that be done?    maybe. It might involve parallel tracking 
>>>>>> of ssa-Name ranges in a higher precision... and recalculating 
>>>>>> every expression using those values.   Im not convinced that a 
>>>>>> generalized ranger which works in higher precision is really going 
>>>>>> to solve the problem the way you hope it would.. i think we'd get 
>>>>>> a lot of false info when unsigned values are involved
>>>>>
>>>>> I don't think that's necessary for unsigned wrapping.  Most of
>>>>> the time the (possibly) wrapped result is what we need because
>>>>> that's what the languages say is supposed to happen.  The cases
>>>>> when we need to check for the dangerous wraparound (or overflow)
>>>>> should be rare (just allocation calls) and can be handled on
>>>>> demand by walking the IL and doing the computation in the infinite
>>>>> precision.
>>>>>
>>>>> For signed overflow, though, computing the mathematically correct
>>>>> result seems preferable.  Or at least tracking (and propagating)
>>>>> the overflow bit should be.  That way we could tell when
>>>>> an expression is definitely invalid.
>>>>>
>>>>>>
>>>>>> If I were to see a clear solution/description as to what is a 
>>>>>> problematic overflow and what isnt, there might be something more 
>>>>>> general that could be done...
>>>>>
>>>>> Does the distinction above (unsigned wrapping/signed overflow) help?
>>>>>
>>>>>>
>>>>>>> I mentioned it to Aldy in response to the irange best practices
>>>>>>> document.  He says there's no way to do that and no plans to
>>>>>>> make it possible.
>>>>>>>
>>>>>> Thats not quite what he said.   . He said  "If the operation may 
>>>>>> wrap or overflow, it will show up as so in the range.  I don't 
>>>>>> think we have any plans of changing this behavior."
>>>>>>
>>>>>> In fact. looking back, he said the same thing I just did....  
>>>>>> There IS a mechanism there for working in higher precision should 
>>>>>> one desire to do so, but as I pointed out above, Im not convinced 
>>>>>> as a general thing in the ranger it will work how you imagine.
>>>>>>
>>>>>> My guess is what you really want is to be invoking range-ops on 
>>>>>> stmts you care about whether they might overflow or not with 
>>>>>> higher precision versions and then identify the cases which 
>>>>>> trigger that you actually care about.
>>>>>
>>>>> Yes, it looks like that's the way to go.  I was hoping the ranger
>>>>> had an API that would do all that for me (i.e., walk the IL and
>>>>> compute the range of the result on demand in the precision of
>>>>> my choice), since it has to do that as it is, except that it uses
>>>>> the precision of the expression.  But if not, it sounds like with
>>>>> range_ops I should be able to fairly easily write one myself.
>>>>
>>>> The range_ops class is too low-level for this.  It folds pairs of 
>>>> ranges, and has no concept of gimple or tree expressions.  To roll 
>>>> your own solution you'd have to walk the IL, parse the gimple 
>>>> statements and then call range-ops on your adjusted arguments.  A 
>>>> better approach would be to overload range_of_stmt as Andrew 
>>>> suggested (see below).
>>>>
>>>>>
>>>>>> So maybe you want an "overflow calculator" which is similar to the 
>>>>>> ranger, and  identifies which kinds of stmts expect certain range 
>>>>>> of results, and if they get results outside of  the expected 
>>>>>> ranges, triggers a flag.
>>>>>> maybe it could convert all unsigned and signed  values to signed, 
>>>>>> and then work in a higher precision,  and then if any intermediate 
>>>>>> result cant be accurately cast back to the original type,  it gets 
>>>>>> flagged? is that the kind of overflow that is cared about?
>>>>>
>>>>> There are two kinds of "overflow:" unsigned wrapping and signed
>>>>> overflow.  Both are a problem in allocation calls like:
>>>>>
>>>>>    int *p = malloc (nelts * size);
>>>>>
>>>>> If nelts and size are signed, the compile-time result seems to
>>>>> be done in saturation arithmetic (its range looks to be capped
>>>>> at TYPE_MAX).  If they are unsigned, they wrap around zero.
>>>>> Neither is usually intended by the user but the wraparound is
>>>>> the more dangerous of the two because it's well-defined and
>>>>> because it shrinks the size to a very small number.
>>>>>
>>>>>>
>>>>>> I dont know... this isnt my space :-)       Im just telling you 
>>>>>> that I'm not sure the current range query mechanism is the right 
>>>>>> place to be asking if an "overflow" occurred because I think the 
>>>>>> general answer wont be useful.. too many kinds of overflows thrown 
>>>>>> together.
>>>>>>
>>>>>> I think this requires more specifoc detailed information, and 
>>>>>> maybe a simple overflow analyzer will do the trick based on 
>>>>>> range-ops.... I think range-ops has the required abilities.  
>>>>>> whether it needs some enhancing, or something else exposed,  I'm 
>>>>>> not sure yet
>>>>>
>>>>> Thank you!  This was a useful clarification to what Aldy said.
>>>>> I'd been meaning to follow up with him on this point when I was
>>>>> done with what I'm working on but Jeff's question just got me
>>>>> the answers I was looking for quicker.  Let me experiment with
>>>>> range_ops and get back to you.
>>>>>
>>>>>>
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> PS  One could always take a ranger and overload the 
>>>>>> range_of_stmt() routine that calculates results, so that it calls 
>>>>>> current functionality, then try converting the arguements to  
>>>>>> higher precision ranges and re-invoking range-ops on those 
>>>>>> arguments, get a new higher precision range back and then look at 
>>>>>> the results vs the normal ones. Might be a place to start 
>>>>>> experimenting...   then pass a flag along sayign an overflow 
>>>>>> occurred.
>>>>>
>>>>> That sounds like something to look into.  I don't see range_of_stmt
>>>>
>>>> This would be my preferred approach.  It's clean and builds on top 
>>>> of the current infrastructure without polluting other users of the 
>>>> ranger. It definitely sounds like something worth pursuing.
>>>>
>>>>  > on trunk.  Is that the right name or has it not been merged yet?
>>>>
>>>> This is in the ranger proper which should come in, in the next week 
>>>> or two (??).  You can see it in our staging branch 
>>>> (users/aldyh/ranger-staging):
>>>>
>>>> class gimple_ranger : public range_query
>>>> {
>>>> public:
>>>>    gimple_ranger (bool use_loop_info) : m_use_loop_info 
>>>> (use_loop_info) { }
>>>>    virtual bool range_of_stmt (irange &r, gimple *, tree name = 
>>>> NULL) OVERRIDE;
>>>>    virtual bool range_of_expr (irange &r, tree name, gimple * = 
>>>> NULL) OVERRIDE;
>>>>    virtual bool range_on_edge (irange &r, edge e, tree name) OVERRIDE;
>>>>    virtual void range_on_entry (irange &r, basic_block bb, tree name);
>>>>    virtual void range_on_exit (irange &r, basic_block bb, tree name);
>>>> ...
>>>> ...
>>>> };
>>>>
>>>> The idea is that range_of_stmt() is called for all statements as the 
>>>> IL is walked backwards.  You could just overload this, build a 
>>>> throwaway statement you could pass to gimple_ranger::range_of_stmt, 
>>>> and work from there.
>>>
>>> Thanks for the tutorial!  I see the patch you posted with this API
>>> so I'll give it a try once it's in.
>>
>> Note that the patch I posted was to standardize valuation throughout 
>> the compiler.  It's not the ranger per se, but the underlying API for 
>> querying ranges that ranger, vr_values, etc will use.  So you need to 
>> wait for the full ranger to work on this.
> 
> I see.  In that case it might make sense to commit the patch as is
> and switch it over to the ranger API when it's in.  Jeff, what's
> your preference?
> 
> Martin