* gcc 3.3.6 - stack corruption questions
From: Louis LeBlanc @ 2005-07-25 14:55 UTC
To: gcc

Hey folks.

I'm having some trouble with a process compiled with gcc 3.3.6. This code is pretty complex and has several features that are not typically in use because they involve non-production test cases.

The problem is that I'm getting core dumps (SEGV) that appear to come from this code when I know it shouldn't be in the execution path. The code in question is switched on only by a command line argument, and the process is managed by a parent process that monitors and manages its execution, reporting crashes and restarting it if necessary.

Here's my environment: gcc 3.3.6 built on SunOS 5.8 sun4u sparc SUNW,Ultra-60; the app is built on the same platform and executes on SunOS 5.8 sun4u sparc SUNW,UltraSPARC-IIi-cEngine.

The entire codebase is written in C, and is compiled as follows:

    /usr/local/gcc-3.3.6/bin/gcc -ggdb -g3 -Wall -D_REENTRANT -Wno-multichar -Wno-unused-function -D_SOLARIS -DUSE_DEV_POLL -mcpu=ultrasparc -O2 -DTIMING=1 -DDB_TIMING=1 -Icommon/include -I/opt/oracle/8.1.7/include -I/opt/oracle/8.1.7/rdbms/public -c -o store.o store.c

These problems have popped up time and again over the last 6 years, going as far back as gcc 2.95, but gdb has never been able to tell me any more than where the problem came from (the Solaris pstack utility always agrees with gdb).

These problems only reproduce over longer execution times, and only after some thousands or even millions of transactions. The application is supposed to provide 99.97% availability, so having this happen 12 times over the course of a weekend is a bit concerning.

Sometimes a build will prove wonderfully stable, but then a very small code change made to tweak some behavior will completely destabilize it.
Recently, I added a handler to catch segfaults and bus errors to try to extract more info through the ucontext interface. I am able to get a little explicit detail, but not much new information. The problem with this approach is that it doesn't preserve the originating stack as well.

At this point, I'm at a loss as to where to start. This is a pretty important codebase (to my employer, anyway) and the frequency of these inexplicable problems is starting to cause some concern.

Any suggestions as to where to go next? If I've forgotten any potentially useful information please don't hesitate to request it.

Please CC me directly, as I am not on the dev list.

Thanks for your time.
Lou
--
Louis LeBlanc leblanc@keyslapper.net
Fully Funded Hobbyist, KeySlapper Extrordinaire :þ
http://www.keyslapper.net Ô¿Ô¬
Key fingerprint = C5E7 4762 F071 CE3B ED51 4FB8 AF85 A2FE 80C8 D9A2

Flugg's Law: When you need to knock on wood is when you realize that the world is composed of vinyl, naugahyde and aluminum.
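A fault handler of the kind described in the first message can be arranged so that the core dump still reflects the faulting frame. The following is only a sketch under stated assumptions, not the handler from the original application: the names `install_fault_handler`, `fault_handler`, `safe_write`, `write_hex` and the size `ALT_STACK_SIZE` are hypothetical. It runs the handler on an alternate stack (`SA_ONSTACK`), reports `si_addr` using only async-signal-safe calls, and relies on `SA_RESETHAND` so that returning from the handler re-executes the faulting instruction under the default action.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALT_STACK_SIZE (64 * 1024)  /* hypothetical size, chosen generously */

/* write() wrapper that deliberately ignores the result inside the handler. */
static void safe_write(int fd, const void *buf, size_t len)
{
    ssize_t rc = write(fd, buf, len);
    (void)rc;
}

/* Print a value as fixed-width hex without calling stdio. */
static void write_hex(int fd, uintptr_t value)
{
    static const char digits[] = "0123456789abcdef";
    char buf[2 + 2 * sizeof value];
    size_t i = sizeof buf;

    while (i > 2) {
        buf[--i] = digits[value & 0xf];
        value >>= 4;
    }
    buf[0] = '0';
    buf[1] = 'x';
    safe_write(fd, buf, sizeof buf);
}

static void fault_handler(int signo, siginfo_t *info, void *context)
{
    static const char msg[] = "fatal signal, fault address ";

    (void)signo;
    (void)context;  /* a ucontext_t; its register layout is platform-specific */
    safe_write(STDERR_FILENO, msg, sizeof msg - 1);
    write_hex(STDERR_FILENO, (uintptr_t)info->si_addr);
    safe_write(STDERR_FILENO, "\n", 1);
    /* SA_RESETHAND has already restored the default action. Returning
     * re-runs the faulting instruction, which faults again and dumps
     * core from the original frame. */
}

/* Returns 0 on success, -1 on failure. The alternate stack is per-thread,
 * so each thread that should be covered needs its own sigaltstack() call. */
int install_fault_handler(void)
{
    stack_t alt;
    struct sigaction sa;

    assert(ALT_STACK_SIZE >= MINSIGSTKSZ);
    alt.ss_sp = malloc(ALT_STACK_SIZE);
    if (alt.ss_sp == NULL)
        return -1;
    alt.ss_size = ALT_STACK_SIZE;
    alt.ss_flags = 0;
    if (sigaltstack(&alt, NULL) != 0) {
        free(alt.ss_sp);
        return -1;
    }

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fault_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK | SA_RESETHAND;
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);
}
```

Calling `install_fault_handler()` early in `main()` would be the intended use. Whether the resulting core is more informative than the one produced without a handler depends on how badly the stack was damaged before the fault.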
* Re: gcc 3.3.6 - stack corruption questions
From: Giovanni Bajo @ 2005-07-25 15:15 UTC
To: Louis LeBlanc; +Cc: gcc

Louis LeBlanc <leblanc@keyslapper.net> wrote:

> The problem is I'm getting core dumps (SEGV) that appear to come from
> this code when I know it shouldn't be in the execution path. The code
> in question is switched on by a command line argument only, and the
> process is managed by a parent process that monitors and manages its
> execution - reporting crashes and restarting it if necessary.

Looks like a bug hidden in your code. Several things you could try:

- valgrind
- GCC 4.0 with -fmudflap
- GCC 4.1 CVS with -fstack-protector

--
Giovanni Bajo
* Re: gcc 3.3.6 - stack corruption questions
From: Louis LeBlanc @ 2005-07-25 15:23 UTC
To: Giovanni Bajo; +Cc: Louis LeBlanc, gcc

On 07/25/05 05:15 PM, Giovanni Bajo sat at the `puter and typed:
> Louis LeBlanc <leblanc@keyslapper.net> wrote:
>
> > The problem is I'm getting core dumps (SEGV) that appear to come from
> > this code when I know it shouldn't be in the execution path. The code
> > in question is switched on by a command line argument only, and the
> > process is managed by a parent process that monitors and manages its
> > execution - reporting crashes and restarting it if necessary.
>
> Looks like a bug hidden in your code. Several things you could try:
>
> - valgrind
> - GCC 4.0 with -fmudflap
> - GCC 4.1 CVS with -fstack-protector

Thanks for the tips. Since I'm on Solaris, I don't think Valgrind is an option (Linux and FreeBSD on x86/PowerPC/AMD64 only).

I will check out the gcc versions and features you mention.

Lou
* Re: gcc 3.3.6 - stack corruption questions
From: Louis LeBlanc @ 2005-07-25 22:00 UTC
To: Giovanni Bajo; +Cc: Louis LeBlanc, gcc

On 07/25/05 05:15 PM, Giovanni Bajo sat at the `puter and typed:
> Louis LeBlanc <leblanc@keyslapper.net> wrote:
>
> > The problem is I'm getting core dumps (SEGV) that appear to come from
> > this code when I know it shouldn't be in the execution path. The code
> > in question is switched on by a command line argument only, and the
> > process is managed by a parent process that monitors and manages its
> > execution - reporting crashes and restarting it if necessary.
>
> Looks like a bug hidden in your code. Several things you could try:
>
> - valgrind
> - GCC 4.0 with -fmudflap
> - GCC 4.1 CVS with -fstack-protector

I've not gotten the chance to build with gcc 4.0.1 yet (still building), but I've tried a couple of other things with 3.3.6 that you might find interesting; maybe something will raise a flag.

I added the -fstack-check switch to my makefile and recompiled with various optimizations. I was pretty surprised at the file sizes that showed up:

No optimization:
    -rwxr-xr-x 1 leblanc daemon 1128660 Jul 25 16:25 myprocess*

Optimized with -O2:
    -rwxr-xr-x 1 leblanc daemon 1058228 Jul 25 17:36 myprocess*

Optimized with -O3:
    -rwxr-xr-x 1 leblanc daemon 1129792 Jul 25 17:32 myprocess*

I would have expected much different results. Shouldn't the file sizes be smaller (at least a little) with the -O3 switch? Maybe there's a loop unrolled to make it faster, resulting in a larger codebase?

Anyway, the code that generated these files (identical between the three, except for the compilation flags) appears to be hitting something in the optimizer that doesn't like my code.
The builds with optimization (the last 2) core after fewer than 50K transactions, both in calls to pthread_mutex_unlock(). I have verified that the mutex passed in is valid (it would have been locked and unlocked some 70K times before failing in my last test).

The unoptimized version completed a 401,900 transaction test with no problem. I've been playing with different things all day. BTW, all these executables were compiled with -ggdb -g3 -Wall.

Thanks to everyone who sent ideas so far.

Lou
* Re: gcc 3.3.6 - stack corruption questions
From: Giovanni Bajo @ 2005-07-25 22:28 UTC
To: Louis LeBlanc; +Cc: gcc

Louis LeBlanc <leblanc@keyslapper.net> wrote:

> I added the -fstack-check switch to my makefile and recompiled with
> various optimizations. I was pretty surprised at the file sizes that
> showed up:
>
> No Optimization:
> -rwxr-xr-x 1 leblanc daemon 1128660 Jul 25 16:25 myprocess*
>
> Optimized with -O2
> -rwxr-xr-x 1 leblanc daemon 1058228 Jul 25 17:36 myprocess*
>
> Optimized with -O3
> -rwxr-xr-x 1 leblanc daemon 1129792 Jul 25 17:32 myprocess*
>
> I would have expected much different results. Shouldn't the file
> sizes be smaller (at least a little) with the -O3 switch? Maybe
> there's a loop unrolled to make it faster, resulting in a larger
> codebase?

Or inlining, or many other things. If you care about size, use -Os.

--
Giovanni Bajo
* Re: gcc 3.3.6 - stack corruption questions
From: Louis LeBlanc @ 2005-07-26 21:06 UTC
To: Giovanni Bajo; +Cc: Louis LeBlanc, gcc

On 07/26/05 12:28 AM, Giovanni Bajo sat at the `puter and typed:
> Louis LeBlanc <leblanc@keyslapper.net> wrote:
>
> > I added the -fstack-check switch to my makefile and recompiled with
> > various optimizations. I was pretty surprised at the file sizes that
> > showed up:
> >
> > No Optimization:
> > -rwxr-xr-x 1 leblanc daemon 1128660 Jul 25 16:25 myprocess*
> >
> > Optimized with -O2
> > -rwxr-xr-x 1 leblanc daemon 1058228 Jul 25 17:36 myprocess*
> >
> > Optimized with -O3
> > -rwxr-xr-x 1 leblanc daemon 1129792 Jul 25 17:32 myprocess*
> >
> > I would have expected much different results. Shouldn't the file
> > sizes be smaller (at least a little) with the -O3 switch? Maybe
> > there's a loop unrolled to make it faster, resulting in a larger
> > codebase?
>
> Or inlining, or many other things. If you care about size, use -Os.

Well, I don't care about size that much. This isn't an embedded app, and performance and stability usually contend with each other for the primary concern.

I did finally get a build with 4.0.1. I certainly was surprised at some of the things I found, but it did help a lot.

Funny thing is that the same compiler flags produce a much bigger executable with 4.0.1 than with 3.3.6, but it doesn't core trying to release a thread lock. I've also confirmed that this happens without contention because one of the apps that uses this code actually only runs a single thread.

I used the following flags:

    -ggdb -g3 -Wall -fstack-check -fno-strict-aliasing -O2 -mcpu=ultrasparc

The -fmudflap flag requires some odd linking I haven't figured out yet, so I'm skipping it for now.
I also found, to my delight and surprise, that the same code appears to perform between 10% and 20% better, in a rough, fairly imprecise environment.

I still cannot find the cause of the core I get with the 3.3.6 version though.

What I do get with 4.0.1 is some warnings I'd never seen before:

    disk.c: In function 'disk_find':
    disk.c:281: warning: frame size too large for reliable stack checking
    disk.c:281: warning: try reducing the number of local variables

I was initially getting this for about 20 of my functions. I know very well that I'm using some large stacks, so I wasn't surprised about that; I was just surprised that the threshold here appears to be quite low (is it around 8K? - please clarify for me if you know). This process does some pretty heavy text management for URL transforms and transaction logging, so several routines statically allocate some fairly large character buffers on the stack for this.

When I first saw these warnings, I went through and moved a few buffers to different scopes and changed a couple to dynamic allocations, since they weren't needed in all cases. This helped eliminate most of these (down to 6 functions), but some routines still require these buffers for all transactions, so it doesn't make much sense (to me, anyway) to allocate them dynamically.

The source file that is still generating a core with 3.3.6 did have a couple of these warnings (in different functions, not the one that was coring), but they were easily accommodated. Even though the warnings are gone in the 4.0.1 build, the file still cores with the 3.3.6 build unless I turn off optimization for it.

I know it's confusing, but I'm trying to use the behavior from the two compiler versions to make the code itself better, rather than just picking the compiler that "seems to work". I was actually hoping to get more discernible info in the 3.3.6 build core after these changes, but the core hasn't actually changed.
Looking at the gcc docs, I noticed the -fstack-limit flags, but I'm not really clear on whether or how they can help, or exactly how they're used. I am using the gcc linker, so the method described in the docs should be fine if it applies, but it looks like an absolute address is needed for the lower bound.

Any pointers on where I can get clarification on these flags and the default stack size limits?

If I should be taking this to a usenet group or some other forum instead of here, please let me know.

Once again, many thanks to all those that have provided input so far. It's been very helpful.

Lou
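For buffers that are needed on every transaction, one alternative to both large stack frames and per-call allocation is a per-thread context allocated once when the worker starts and passed to each routine. The sketch below illustrates that idea only; `struct worker_ctx`, `worker_ctx_create`, `worker_ctx_destroy`, `transform_url`, the buffer sizes and the "/cache/" prefix are all hypothetical and do not come from the application discussed in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define URL_BUF_SIZE (32 * 1024)  /* hypothetical sizes */
#define LOG_BUF_SIZE (32 * 1024)

/* Scratch space owned by one worker thread, allocated once at startup
 * instead of living in the stack frame of every routine that needs it. */
struct worker_ctx {
    char url_buf[URL_BUF_SIZE];
    char log_buf[LOG_BUF_SIZE];
};

struct worker_ctx *worker_ctx_create(void)
{
    return malloc(sizeof(struct worker_ctx));
}

void worker_ctx_destroy(struct worker_ctx *ctx)
{
    free(ctx);
}

/* Hypothetical transform: writes a prefixed URL into the context buffer.
 * Returns the length written, or -1 if the result would not fit. */
int transform_url(struct worker_ctx *ctx, const char *url)
{
    int n;

    assert(ctx != NULL && url != NULL);
    n = snprintf(ctx->url_buf, sizeof ctx->url_buf, "/cache/%s", url);
    if (n < 0 || (size_t)n >= sizeof ctx->url_buf)
        return -1;
    return n;
}
```

Each thread would call `worker_ctx_create()` once and reuse the context for every transaction, so the per-call cost is a pointer argument and the frames of the text-handling routines stay small.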
* Re: gcc 3.3.6 - stack corruption questions
From: Robert Dewar @ 2005-07-26 21:52 UTC
To: Louis LeBlanc; +Cc: Giovanni Bajo, gcc

Louis LeBlanc wrote:

> I also found, to my delight and surprise, that the same code appears
> to perform between 10% and 20% better - in a rough, fairly imprecise
> environment.

why surprise?
* Re: gcc 3.3.6 - stack corruption questions
From: Louis LeBlanc @ 2005-07-26 22:27 UTC
To: Robert Dewar; +Cc: Louis LeBlanc, Giovanni Bajo, gcc

On 07/26/05 05:52 PM, Robert Dewar sat at the `puter and typed:
> Louis LeBlanc wrote:
>
> > I also found, to my delight and surprise, that the same code appears
> > to perform between 10% and 20% better - in a rough, fairly imprecise
> > environment.
>
> why surprise?

I wasn't aware it was possible to get such a large boost just going to a newer compiler version and relatively minor code restructuring.

Lou
* Re: gcc 3.3.6 - stack corruption questions
From: Robert Dewar @ 2005-07-25 22:50 UTC
To: Louis LeBlanc; +Cc: Giovanni Bajo, gcc

Louis LeBlanc wrote:

> I would have expected much different results. Shouldn't the file
> sizes be smaller (at least a little) with the -O3 switch? Maybe
> there's a loop unrolled to make it faster, resulting in a larger
> codebase?

You generally expect -O3 files to be larger. Inlining does not save space. Indeed in the context of Ada, where well thought out inlining is achieved by -O2 -gnatn (and the use of pragma Inline), we often see -O3 executables being not only larger but slower, presumably due to increased icache pressure.

> The unoptimized version completed a 401,900 transaction test with no
> problem. All day, I've been playing with different things,

There are many bugs, most notably uninitialized vars, that show up only when you turn on optimization.
* Re: gcc 3.3.6 - stack corruption questions
From: Dale Johannesen @ 2005-07-25 23:00 UTC
To: Robert Dewar; +Cc: gcc, Dale Johannesen, Louis LeBlanc, Giovanni Bajo

On Jul 25, 2005, at 3:50 PM, Robert Dewar wrote:

>> The unoptimized version completed a 401,900 transaction test with no
>> problem. All day, I've been playing with different things,
>
> there are many bugs, most notably uninitialized vars, that show
> up only when you turn on optimization.

Also violations of strict aliasing rules are common. -Wuninitialized -fno-strict-aliasing [after the -O] will exercise those two. Also, mixed builds with some -O0 and some -O3 files should narrow it down.
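The two bug classes named in the last two replies can be illustrated with a small sketch. Nothing here is taken from the application under discussion: `float_bits` and `parse_flag` are hypothetical functions, and the expected bit pattern assumes a 32-bit IEEE-754 `float`. The problematic forms appear only in comments; the function bodies show a well-defined replacement for each.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Strict-aliasing violation (undefined behaviour, may only misbehave
 * once the optimizer is enabled):
 *
 *     uint32_t bits = *(uint32_t *)&f;
 *
 * Well-defined alternative: copy the object representation.
 */
uint32_t float_bits(float f)
{
    uint32_t bits;

    assert(sizeof bits == sizeof f);  /* assumes a 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/*
 * Uninitialized read (often appears to work at -O0 because the stack
 * slot happens to hold zero):
 *
 *     int enabled;
 *     if (arg[0] == 'y') enabled = 1;
 *     return enabled;
 *
 * Fix: give the variable a value on every path.
 */
int parse_flag(const char *arg)
{
    int enabled = 0;

    if (arg != NULL && arg[0] == 'y')
        enabled = 1;
    return enabled;
}
```

Building with -fno-strict-aliasing, as suggested above, makes the compiler tolerate the first pattern rather than fixing it, so it is mainly useful for confirming whether an aliasing violation is involved in a crash.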