From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 3110 invoked by alias); 11 Sep 2012 17:25:38 -0000 Received: (qmail 3095 invoked by uid 22791); 11 Sep 2012 17:25:33 -0000 X-SWARE-Spam-Status: No, hits=-4.3 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00,KHOP_THREADED,TW_FC X-Spam-Check-By: sourceware.org Received: from localhost (HELO gcc.gnu.org) (127.0.0.1) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Tue, 11 Sep 2012 17:25:17 +0000 From: "tejohnson at google dot com" To: gcc-bugs@gcc.gnu.org Subject: [Bug gcov-profile/54487] [4.8 Regression] profiledbootstrap broken by r190952 Date: Tue, 11 Sep 2012 17:25:00 -0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: gcov-profile X-Bugzilla-Keywords: build X-Bugzilla-Severity: normal X-Bugzilla-Who: tejohnson at google dot com X-Bugzilla-Status: NEW X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: 4.8.0 X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated Content-Type: text/plain; charset="UTF-8" MIME-Version: 1.0 Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-bugs-owner@gcc.gnu.org X-SW-Source: 2012-09/txt/msg00880.txt.bz2 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487 --- Comment #16 from Teresa Johnson 2012-09-11 17:24:58 UTC --- On Thu, Sep 6, 2012 at 10:18 PM, Teresa Johnson wrote: > On Thu, Sep 6, 2012 at 1:49 PM, hjl.tools at gmail dot com > wrote: >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487 >> >> --- Comment #13 from H.J. Lu 2012-09-06 20:49:02 UTC --- >> It works for me now after syncing with revision 191037. > > Unfortunately, I have now hit this myself a couple times in further > testing, so I am still digging... After digging into this for several days now, I am convinced that the gcda file is changing out from underneath the profile-use compile during the read. I just don't understand how, since fcntl is being used to lock the file. Details below. I am hoping someone has some ideas or advice on where to go. First, some info on the compile step and gcda file locking, then I will explain why I belive the gcda file is being written mid-read. The failure is occurring sporadically during the profile-use step of a parallel profiledbootstrap. In this step, not only are the gcc profiles being read in by the profile-use compilation, but the gcc binaries being used in the compile are instrumented, so the gcda files are also being written by the parallel build processes. However, the fcntl file locking should prevent interference between the reads and the writes of the same gcda file. According to the fcntl documentation, the lock is lost on any fclose on the file by the same process, even if it uses a different file descriptor. But for a given profile use compile, the read step should be complete before the write step when the gcc exits and updates the gcda files with libgcov. So I don't think the lock should be lost this way. The failure is not reproducible manually. Therefore, I have been adding instrumentation to the compiler to spit out information at both the point of failure ("error: corrupted profile info: edge from 54 to 55 exceeds maximal count" from read_profile_edge_counts()), and then at the point when the gcda counts are being read in by read_counts_file(). This is the loop that looks like: for (ix = 0; ix != n_counts; ix++) entry->counts[ix] += gcov_read_counter (); I modified the above loop to check each count as it is read in against the sum_max, and if it exceeds sum_max to print the summary and counters at that point. I then compared this to the counters obtained for that function from a gcov-dump of the gcda file afterwards. The counters in the gcda file from gcov-dump are slightly higher because of subsequent merges into it by other parallel builds before compilation aborts, but it is still fairly easy to correlate the counter values between them. What I found is that the counter values being read by read_counts_file look good up to a certain point, and then they go bad. Looking at the huge bad values in hex, they are actually valid values from the gcda file, but it looks like we suddenly jumped back or ahead several locations. So for example, in some cases where it looks like we suddenly jumped ahead in the gcda file, I see that some of the last counter values being read into the array as counters are actually the tag values from just after the counter array in the gcda file. Or in some cases, the counter values are being read mis-aligned by one word, so they look huge because instead of having 0x00000000 in the most significant of the 2 counter words, the 0x00000000 is in the least significant of the 2 counter words. In other words, we jumped some odd number of words ahead (or behind) in the gcda file. If another process is writing into the gcda file at the same time, this could happen. Specifically, since the histogram now included in the gcda file summaries with my patch only write non-zero values, after a merge the size of the histogram, and therefore the size of the summaries, could change. This will cause the starting offsets of different tags/sections throughout the rest of the gcda file to shift. This shouldn't be a problem if the file locking is working though. But if the file locking is not working, then that would explain why suddenly during the read we suddenly seem to jump to a different spot. A couple of other notes: - I added some code after each counter read in the above loop to seek back to the offset where we read the tag, re-read it, and compare it to the tag we read earlier (then re-seek back to the current location in the counter array before continuing the read). Normally this check succeeds, but in the cases where I am hitting the above errors, the "tag" at that location has changed to something that doesn't look like a tag. - Jumping to a different spot should corrupt the reads of the rest of the file. But the main loop in read_counts_file will simply ignore any tag it doesn't understand, and exits when it reads a '0' tag. That's why there was no error being flagged during the read when this was happening. - If something like this was happening on trunk before my patch, it most likely would have been hidden because the size of the summaries was static, and therefore so were the offsets within the file, except in the case when a new summary was being added during the merge (which is not very often). So it is possible that this problem existed prior but was hidden. Any suggestions or ideas? Thanks, Teresa > > Teresa > >> >> -- >> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email >> ------- You are receiving this mail because: ------- >> You are on the CC list for the bug. > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413