* determining aggregate member from MEM_REF
From: Martin Sebor @ 2018-02-15 17:28 UTC
To: GCC Mailing List

There are APIs to determine the base object and an offset
into it from all sorts of expressions, including ARRAY_REF,
COMPONENT_REF, and MEM_REF, but none of those I know about
makes it also possible to discover the member being referred
to.

Is there an API that I'm missing or a combination of calls
to some that would let me determine the (approximate) member
and/or element of an aggregate from a MEM_REF expression,
plus the offset from its beginning?

Say, given

  struct A
  {
    void *p;
    char b[3][9];
  } a[2];

and an expression like

  a[1].b[2] + 3

represented as the expr

  MEM_REF (char[9], a, 69)

where offsetof (struct A, a[1].b[2]) == 66

I'd like to be able to determine that expr refers to the field
b of struct A, and more specifically, b[2], plus 3.  It's not
important what the index into the array a is, or any other
arrays on the way to b.

I realize the reference can be ambiguous in some cases (arrays
of structs with multiple array members) and so the result wouldn't
be guaranteed to be 100% reliable.  It would only be used in
diagnostics.  (I think with some effort the type of the MEM_REF
could be used to disambiguate the majority (though not all) of
these references in practice.)

Thanks
Martin
* Re: determining aggregate member from MEM_REF
From: Richard Biener @ 2018-02-16 11:22 UTC
To: Martin Sebor; +Cc: GCC Mailing List

On Thu, Feb 15, 2018 at 6:28 PM, Martin Sebor <msebor@gmail.com> wrote:
> There are APIs to determine the base object and an offset
> into it from all sorts of expressions, including ARRAY_REF,
> COMPONENT_REF, and MEM_REF, but none of those I know about
> makes it also possible to discover the member being referred
> to.
>
> Is there an API that I'm missing or a combination of calls
> to some that would let me determine the (approximate) member
> and/or element of an aggregate from a MEM_REF expression,
> plus the offset from its beginning?
>
> Say, given
>
>   struct A
>   {
>     void *p;
>     char b[3][9];
>   } a[2];
>
> and an expression like
>
>   a[1].b[2] + 3
>
> represented as the expr
>
>   MEM_REF (char[9], a, 69)

&MEM_REF (&a, 69)

you probably mean.

> where offsetof (struct A, a[1].b[2]) == 66
>
> I'd like to be able to determine that expr refers to the field
> b of struct A, and more specifically, b[2], plus 3.  It's not
> important what the index into the array a is, or any other
> arrays on the way to b.

There is code in initializer folding that searches for a field in
a CONSTRUCTOR by base and offset.  There's no existing
helper that gives you exactly what you want -- I guess you'd
ideally want to have a path to the referred object.  But it may
be possible to follow what fold_ctor_reference does and build
such a helper.

> I realize the reference can be ambiguous in some cases (arrays
> of structs with multiple array members) and so the result wouldn't
> be guaranteed to be 100% reliable.  It would only be used in
> diagnostics.  (I think with some effort the type of the MEM_REF
> could be used to disambiguate the majority (though not all) of
> these references in practice.)

Given you have the address of the MEM_REF in your example above
the type of the MEM_REF doesn't mean anything.

I think ambiguity only happens with unions given MEM_REF offsets
are constant.

Note that even the type of 'a' might not be correct as it may have had
a different dynamic type.

So not sure what context you are trying to use this in diagnostics.

Richard.

>
> Thanks
> Martin
* Re: determining aggregate member from MEM_REF
From: Martin Sebor @ 2018-02-16 19:07 UTC
To: Richard Biener; +Cc: GCC Mailing List

On 02/16/2018 04:22 AM, Richard Biener wrote:
> On Thu, Feb 15, 2018 at 6:28 PM, Martin Sebor <msebor@gmail.com> wrote:
>> There are APIs to determine the base object and an offset
>> into it from all sorts of expressions, including ARRAY_REF,
>> COMPONENT_REF, and MEM_REF, but none of those I know about
>> makes it also possible to discover the member being referred
>> to.
>>
>> Is there an API that I'm missing or a combination of calls
>> to some that would let me determine the (approximate) member
>> and/or element of an aggregate from a MEM_REF expression,
>> plus the offset from its beginning?
>>
>> Say, given
>>
>>   struct A
>>   {
>>     void *p;
>>     char b[3][9];
>>   } a[2];
>>
>> and an expression like
>>
>>   a[1].b[2] + 3
>>
>> represented as the expr
>>
>>   MEM_REF (char[9], a, 69)
>
> &MEM_REF (&a, 69)
>
> you probably mean.

Yes.  I was using the notation from the Wiki
https://gcc.gnu.org/wiki/MemRef

>> where offsetof (struct A, a[1].b[2]) == 66
>>
>> I'd like to be able to determine that expr refers to the field
>> b of struct A, and more specifically, b[2], plus 3.  It's not
>> important what the index into the array a is, or any other
>> arrays on the way to b.
>
> There is code in initializer folding that searches for a field in
> a CONSTRUCTOR by base and offset.  There's no existing
> helper that gives you exactly what you want -- I guess you'd
> ideally want to have a path to the referred object.  But it may
> be possible to follow what fold_ctor_reference does and build
> such a helper.

Thanks.  I'll see what I can come up with if/when I get to it
in stage 1.

>
>> I realize the reference can be ambiguous in some cases (arrays
>> of structs with multiple array members) and so the result wouldn't
>> be guaranteed to be 100% reliable.  It would only be used in
>> diagnostics.  (I think with some effort the type of the MEM_REF
>> could be used to disambiguate the majority (though not all) of
>> these references in practice.)
>
> Given you have the address of the MEM_REF in your example above
> the type of the MEM_REF doesn't mean anything.

You're right, it doesn't always correspond to the type of
the member.  It does in some cases but those may be uncommon.
Too bad.

> I think ambiguity only happens with unions given MEM_REF offsets
> are constant.
>
> Note that even the type of 'a' might not be correct as it may have had
> a different dynamic type.
>
> So not sure what context you are trying to use this in diagnostics.

Say I have a struct like this:

  struct A {
    char a[4], b[5];
  };

then in

  extern struct A *a;

  memset (&a[0].a[0] + 14, 0, 3);   // invalid

  memset (&a[1].b[0] + 1, 0, 3);    // valid

both references are the same:

  &MEM_REF[char*, (void *)a + 14];

and there's no way to unambiguously tell which member each refers
to, or even to distinguish the valid one from the other.  MEM_REF
makes the kind of analysis I'm interested in very difficult (or
impossible) to do reliably.

Being able to determine the member is useful in -Wrestrict where
rather than printing the offsets from the base object I'd like
to be able to print the offsets relative to the referenced
member.  Beyond -Wrestrict, identifying the member is key in
detecting writes that span multiple members (e.g., strcpy).
Those could (for example) overwrite a member that's a pointer
to a function and cause code injection.  As it is, GCC has no
way to do that because __builtin_object_size considers the
size of the entire enclosing object, not that of the member.
For the same reason: MEM_REF makes it impossible.

Martin
* Re: determining aggregate member from MEM_REF
From: Richard Biener @ 2018-02-26 12:08 UTC
To: Martin Sebor; +Cc: GCC Mailing List

On Fri, Feb 16, 2018 at 8:07 PM, Martin Sebor <msebor@gmail.com> wrote:
> On 02/16/2018 04:22 AM, Richard Biener wrote:
>>
>> On Thu, Feb 15, 2018 at 6:28 PM, Martin Sebor <msebor@gmail.com> wrote:
>>>
>>> There are APIs to determine the base object and an offset
>>> into it from all sorts of expressions, including ARRAY_REF,
>>> COMPONENT_REF, and MEM_REF, but none of those I know about
>>> makes it also possible to discover the member being referred
>>> to.
>>>
>>> Is there an API that I'm missing or a combination of calls
>>> to some that would let me determine the (approximate) member
>>> and/or element of an aggregate from a MEM_REF expression,
>>> plus the offset from its beginning?
>>>
>>> Say, given
>>>
>>>   struct A
>>>   {
>>>     void *p;
>>>     char b[3][9];
>>>   } a[2];
>>>
>>> and an expression like
>>>
>>>   a[1].b[2] + 3
>>>
>>> represented as the expr
>>>
>>>   MEM_REF (char[9], a, 69)
>>
>>
>> &MEM_REF (&a, 69)
>>
>> you probably mean.
>
>
> Yes.  I was using the notation from the Wiki
> https://gcc.gnu.org/wiki/MemRef
>
>>> where offsetof (struct A, a[1].b[2]) == 66
>>>
>>> I'd like to be able to determine that expr refers to the field
>>> b of struct A, and more specifically, b[2], plus 3.  It's not
>>> important what the index into the array a is, or any other
>>> arrays on the way to b.
>>
>>
>> There is code in initializer folding that searches for a field in
>> a CONSTRUCTOR by base and offset.  There's no existing
>> helper that gives you exactly what you want -- I guess you'd
>> ideally want to have a path to the referred object.  But it may
>> be possible to follow what fold_ctor_reference does and build
>> such a helper.
>
>
> Thanks.  I'll see what I can come up with if/when I get to it
> in stage 1.
>
>>
>>> I realize the reference can be ambiguous in some cases (arrays
>>> of structs with multiple array members) and so the result wouldn't
>>> be guaranteed to be 100% reliable.  It would only be used in
>>> diagnostics.  (I think with some effort the type of the MEM_REF
>>> could be used to disambiguate the majority (though not all) of
>>> these references in practice.)
>>
>>
>> Given you have the address of the MEM_REF in your example above
>> the type of the MEM_REF doesn't mean anything.
>
>
> You're right, it doesn't always correspond to the type of
> the member.  It does in some cases but those may be uncommon.
> Too bad.
>
>> I think ambiguity only happens with unions given MEM_REF offsets
>> are constant.
>>
>> Note that even the type of 'a' might not be correct as it may have had
>> a different dynamic type.
>>
>> So not sure what context you are trying to use this in diagnostics.
>
>
> Say I have a struct like this:
>
>   struct A {
>     char a[4], b[5];
>   };
>
> then in
>
>   extern struct A *a;
>
>   memset (&a[0].a[0] + 14, 0, 3);   // invalid
>
>   memset (&a[1].b[0] + 1, 0, 3);    // valid
>
> both references are the same:
>
>   &MEM_REF[char*, (void *)a + 14];
>
> and there's no way to unambiguously tell which member each refers
> to, or even to distinguish the valid one from the other.  MEM_REF
> makes the kind of analysis I'm interested in very difficult (or
> impossible) to do reliably.

Yes.  Similar issues exist for the objsz pass (aka fortify stuff).

> Being able to determine the member is useful in -Wrestrict where
> rather than printing the offsets from the base object I'd like
> to be able to print the offsets relative to the referenced
> member.  Beyond -Wrestrict, identifying the member is key in
> detecting writes that span multiple members (e.g., strcpy).
> Those could (for example) overwrite a member that's a pointer
> to a function and cause code injection.  As it is, GCC has no
> way to do that because __builtin_object_size considers the
> size of the entire enclosing object, not that of the member.
> For the same reason: MEM_REF makes it impossible.

We're first and foremost an optimizing compiler and not a
static analysis tool.  People seem to want some optimization
to make static analysis easier but then they have to live with
imperfect results.  There's no easy way around this kind of
issues.

Richard.

> Martin
* Re: determining aggregate member from MEM_REF
From: Martin Sebor @ 2018-02-26 15:44 UTC
To: Richard Biener; +Cc: GCC Mailing List

On 02/26/2018 05:08 AM, Richard Biener wrote:
> On Fri, Feb 16, 2018 at 8:07 PM, Martin Sebor <msebor@gmail.com> wrote:
>> On 02/16/2018 04:22 AM, Richard Biener wrote:
>>>
>>> On Thu, Feb 15, 2018 at 6:28 PM, Martin Sebor <msebor@gmail.com> wrote:
>>>>
>>>> There are APIs to determine the base object and an offset
>>>> into it from all sorts of expressions, including ARRAY_REF,
>>>> COMPONENT_REF, and MEM_REF, but none of those I know about
>>>> makes it also possible to discover the member being referred
>>>> to.
>>>>
>>>> Is there an API that I'm missing or a combination of calls
>>>> to some that would let me determine the (approximate) member
>>>> and/or element of an aggregate from a MEM_REF expression,
>>>> plus the offset from its beginning?
>>>>
>>>> Say, given
>>>>
>>>>   struct A
>>>>   {
>>>>     void *p;
>>>>     char b[3][9];
>>>>   } a[2];
>>>>
>>>> and an expression like
>>>>
>>>>   a[1].b[2] + 3
>>>>
>>>> represented as the expr
>>>>
>>>>   MEM_REF (char[9], a, 69)
>>>
>>>
>>> &MEM_REF (&a, 69)
>>>
>>> you probably mean.
>>
>>
>> Yes.  I was using the notation from the Wiki
>> https://gcc.gnu.org/wiki/MemRef
>>
>>>> where offsetof (struct A, a[1].b[2]) == 66
>>>>
>>>> I'd like to be able to determine that expr refers to the field
>>>> b of struct A, and more specifically, b[2], plus 3.  It's not
>>>> important what the index into the array a is, or any other
>>>> arrays on the way to b.
>>>
>>>
>>> There is code in initializer folding that searches for a field in
>>> a CONSTRUCTOR by base and offset.  There's no existing
>>> helper that gives you exactly what you want -- I guess you'd
>>> ideally want to have a path to the referred object.  But it may
>>> be possible to follow what fold_ctor_reference does and build
>>> such a helper.
>>
>>
>> Thanks.  I'll see what I can come up with if/when I get to it
>> in stage 1.
>>
>>>
>>>> I realize the reference can be ambiguous in some cases (arrays
>>>> of structs with multiple array members) and so the result wouldn't
>>>> be guaranteed to be 100% reliable.  It would only be used in
>>>> diagnostics.  (I think with some effort the type of the MEM_REF
>>>> could be used to disambiguate the majority (though not all) of
>>>> these references in practice.)
>>>
>>>
>>> Given you have the address of the MEM_REF in your example above
>>> the type of the MEM_REF doesn't mean anything.
>>
>>
>> You're right, it doesn't always correspond to the type of
>> the member.  It does in some cases but those may be uncommon.
>> Too bad.
>>
>>> I think ambiguity only happens with unions given MEM_REF offsets
>>> are constant.
>>>
>>> Note that even the type of 'a' might not be correct as it may have had
>>> a different dynamic type.
>>>
>>> So not sure what context you are trying to use this in diagnostics.
>>
>>
>> Say I have a struct like this:
>>
>>   struct A {
>>     char a[4], b[5];
>>   };
>>
>> then in
>>
>>   extern struct A *a;
>>
>>   memset (&a[0].a[0] + 14, 0, 3);   // invalid
>>
>>   memset (&a[1].b[0] + 1, 0, 3);    // valid
>>
>> both references are the same:
>>
>>   &MEM_REF[char*, (void *)a + 14];
>>
>> and there's no way to unambiguously tell which member each refers
>> to, or even to distinguish the valid one from the other.  MEM_REF
>> makes the kind of analysis I'm interested in very difficult (or
>> impossible) to do reliably.
>
> Yes.  Similar issues exist for the objsz pass (aka fortify stuff).
>
>> Being able to determine the member is useful in -Wrestrict where
>> rather than printing the offsets from the base object I'd like
>> to be able to print the offsets relative to the referenced
>> member.  Beyond -Wrestrict, identifying the member is key in
>> detecting writes that span multiple members (e.g., strcpy).
>> Those could (for example) overwrite a member that's a pointer
>> to a function and cause code injection.  As it is, GCC has no
>> way to do that because __builtin_object_size considers the
>> size of the entire enclosing object, not that of the member.
>> For the same reason: MEM_REF makes it impossible.
>
> We're first and foremost an optimizing compiler and not a
> static analysis tool.  People seem to want some optimization
> to make static analysis easier but then they have to live with
> imperfect results.  There's no easy way around this kind of
> issues.

There certainly are limits, but I don't think the two need to
be mutually exclusive.  I believe MEM_REF was introduced mainly
as a solution to avoid the complexity (and bugs) of having to
traverse all the other XXX_REFs all over the place.  There is
no fundamental reason why MEM_REF couldn't be improved or even
replaced to preserve more of the original detail.

Folding things to MEM_REF (or rather, folding them too early)
makes all kinds of analysis harder: not just warnings but even
optimization.  I've raised a whole slew of bugs for the strlen
pass alone where folding string functions to MEM_REF defeats
useful downstream optimizations.  Making strlen (and all other
passes that might benefit from the original detail) work hard
to deal with MEM_REF isn't a good design solution.  It forces
the complexity that MEM_REF is meant to remove back into its
clients.  Worse, because of the loss of detail, the results
are unavoidably suboptimal (at least for certain kinds of
analyses).

Martin
* Re: determining aggregate member from MEM_REF
From: Jeff Law @ 2018-02-26 20:05 UTC
To: Martin Sebor, Richard Biener; +Cc: GCC Mailing List

On 02/26/2018 08:44 AM, Martin Sebor wrote:
>
> Folding things to MEM_REF (or rather, folding them too early)
> makes all kinds of analysis harder: not just warnings but even
> optimization.  I've raised a whole slew of bugs for the strlen
> pass alone where folding string functions to MEM_REF defeats
> useful downstream optimizations.  Making strlen (and all other
> passes that might benefit from the original detail) work hard
> to deal with MEM_REF isn't a good design solution.  It forces
> the complexity that MEM_REF is meant to remove back into its
> clients.  Worse, because of the loss of detail, the results
> are unavoidably suboptimal (at least for certain kinds of
> analyses).

I haven't looked specifically at the MEM_REF folding, but I wouldn't be
surprised to find cases where deferral ultimately results in regressions.

When to fold & lower is a hard problem.  There is a constant tension
between trying to fold early as it often leads to generally better code
vs folding later which may help other sets of code, particularly when
folding results in an inability to recover data.

There generally is not an optimal solution to these problems; we have to
take a pragmatic approach.  So if you can defer and not regress, then by
all means propose patches.  But I think you'll find that to defer means
you have to beef up stuff later in the pipeline.

jeff
* Re: determining aggregate member from MEM_REF
From: Richard Biener @ 2018-02-27 13:27 UTC
To: Jeff Law; +Cc: Martin Sebor, GCC Mailing List

On Mon, Feb 26, 2018 at 9:04 PM, Jeff Law <law@redhat.com> wrote:
> On 02/26/2018 08:44 AM, Martin Sebor wrote:
>>
>> Folding things to MEM_REF (or rather, folding them too early)
>> makes all kinds of analysis harder: not just warnings but even
>> optimization.  I've raised a whole slew of bugs for the strlen
>> pass alone where folding string functions to MEM_REF defeats
>> useful downstream optimizations.  Making strlen (and all other
>> passes that might benefit from the original detail) work hard
>> to deal with MEM_REF isn't a good design solution.  It forces
>> the complexity that MEM_REF is meant to remove back into its
>> clients.  Worse, because of the loss of detail, the results
>> are unavoidably suboptimal (at least for certain kinds of
>> analyses).
> I haven't looked specifically at the MEM_REF folding, but I wouldn't be
> surprised to find cases where deferral ultimately results in regressions.

MEM_REF was introduced to fix representational shortcomings, not
mainly to make the representation more compact.  For example there
wasn't a way to record that an access is volatile or unaligned without
taking the address of something, casting that thing and then
dereferencing it.  That causes havoc in alias analysis because many
more things are now addressable.

Also it was introduced to fix wrong-code bugs but at the same time
not cause missed optimizations.  That's when propagating
"structural addresses" into dereferences, which is generally invalid
if you preserve the structure of the address.

With losing some of the canonicalization we could allow
MEM[&a.b.c.d, + 5] instead of forcing the .b.c.d into the constant
offset.  Or we could add a third operand so we have
MEM[&a, +10, &a.b.c.d] or so.

But all that comes at a cost and cannot solve all the issues so I'm
hesitant to do something (costly) like that.

> When to fold & lower is a hard problem.  There is a constant tension
> between trying to fold early as it often leads to generally better code
> vs folding later which may help other sets of code, particularly when
> folding results in an inability to recover data.
>
> There generally is not an optimal solution to these problems; we have to
> take a pragmatic approach.  So if you can defer and not regress, then by
> all means propose patches.  But I think you'll find that to defer means
> you have to beef up stuff later in the pipeline.

Note we already do defer some stuff - not that I like that - but
ultimately you get to the point where deferring hinders exactly the
optimization you want to perform to get good diagnostics.  So it
remains a chicken-and-egg issue.

Deferring to a reasonable point also means that by practical means
you could have done a proper IPA static analysis pass in the first
place.

Richard.

> jeff
* Re: determining aggregate member from MEM_REF
From: Jeff Law @ 2018-02-26 19:57 UTC
To: Richard Biener, Martin Sebor; +Cc: GCC Mailing List

On 02/26/2018 05:08 AM, Richard Biener wrote:
> On Fri, Feb 16, 2018 at 8:07 PM, Martin Sebor <msebor@gmail.com> wrote:

>> Say I have a struct like this:
>>
>>   struct A {
>>     char a[4], b[5];
>>   };
>>
>> then in
>>
>>   extern struct A *a;
>>
>>   memset (&a[0].a[0] + 14, 0, 3);   // invalid
>>
>>   memset (&a[1].b[0] + 1, 0, 3);    // valid
>>
>> both references are the same:
>>
>>   &MEM_REF[char*, (void *)a + 14];
>>
>> and there's no way to unambiguously tell which member each refers
>> to, or even to distinguish the valid one from the other.  MEM_REF
>> makes the kind of analysis I'm interested in very difficult (or
>> impossible) to do reliably.
>
> Yes.  Similar issues exist for the objsz pass (aka fortify stuff).

In fact, I think we have a long standing regression in this space.

>
>> Being able to determine the member is useful in -Wrestrict where
>> rather than printing the offsets from the base object I'd like
>> to be able to print the offsets relative to the referenced
>> member.  Beyond -Wrestrict, identifying the member is key in
>> detecting writes that span multiple members (e.g., strcpy).
>> Those could (for example) overwrite a member that's a pointer
>> to a function and cause code injection.  As it is, GCC has no
>> way to do that because __builtin_object_size considers the
>> size of the entire enclosing object, not that of the member.
>> For the same reason: MEM_REF makes it impossible.
>
> We're first and foremost an optimizing compiler and not a
> static analysis tool.  People seem to want some optimization
> to make static analysis easier but then they have to live with
> imperfect results.  There's no easy way around this kind of
> issues.

True, but there is significant value in generating good diagnostics.
IMHO it's worth thinking about if/how we can get the refinements we want
on the diagnostic side without regressing on the code generation side.

jeff