Message-ID: <23227917960b1e002be56c1bd93435a11f109077.camel@klomp.org>
Subject: Re: Buildbot failure in Wildebeest Builder on whole buildset
From: Mark Wielaard
To: buildbot@builder.wildebeest.org, elfutils-devel@sourceware.org
Date: Thu, 16 Dec 2021 18:05:04 +0100
In-Reply-To: <20211216011009.B2BBF800E29@builder.wildebeest.org>
References: <20211216011009.B2BBF800E29@builder.wildebeest.org>

Hi,

On Thu, 2021-12-16 at 01:10 +0000, buildbot@builder.wildebeest.org wrote:
> The Buildbot has detected a new failure on builder
> elfutils-centos-x86_64 while building elfutils.
> Full details are available at:
>     https://builder.wildebeest.org/buildbot/#builders/1/builds/884
>
> Buildbot URL: https://builder.wildebeest.org/buildbot/
>
> Worker for this Build: centos-x86_64
>
> Build Reason:
> Blamelist: Alexander Kanavin
>
> BUILD FAILED: failed test (failure)
>
> Sincerely,
>  -The Buildbot
>
> The Buildbot has detected a new failure on builder
> elfutils-fedora-x86_64 while building elfutils.
> Full details are available at:
>     https://builder.wildebeest.org/buildbot/#builders/3/builds/876
>
> Buildbot URL: https://builder.wildebeest.org/buildbot/
>
> Worker for this Build: fedora-x86_64
>
> Build Reason:
> Blamelist: Alexander Kanavin
>
> BUILD FAILED: failed test (failure)

So this is really unfortunate and has nothing to do with the patch from
Alexander. These are two different, but related failures.

On centos-x86_64 this is:

FAIL: run-backtrace-native-core-biarch.sh
=========================================

/usr/bin/coredumpctl
0xf77ac000  0xf77ad000  linux-gate.so.1
0xf77ad000  0xf77d08fc  ld-linux.so.2
0xf75b4000  0xf777ea1c  libc.so.6
0xf777f000  0xf7799248  libpthread.so.0
0x5658e000  0x56591050  backtrace-child-biarch
TID 24658:
# 0 0xf77ac430      __kernel_vsyscall
# 1 0xf778dd16 - 1  raise
# 2 0x5658eafc - 1  sigusr2
# 3 0x5658ebeb - 1  stdarg
# 4 0x5658ec2f - 1  backtracegen
# 5 0x5658ec38 - 1  start
# 6 0xf7785bbc - 1  start_thread
# 7 0xf76b227e - 1  __clone
TID 24656:
# 0 0xf76b2268      __clone
/srv/buildbot/worker/elfutils-centos-x86_64/build/tests/backtrace: dwfl_thread_getframes: No DWARF information found
backtrace: backtrace.c:81: callback_verify: Assertion `seen_main' failed.
./test-subr.sh: line 84: 24682 Aborted (core dumped) LD_LIBRARY_PATH="${built_library_path}${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH" $VALGRIND_CMD "$@"
backtrace-child-biarch-core.24656: no main

Note that this is an i386 process being backtraced on x86_64.

On fedora-x86_64 this is:

FAIL: run-backtrace-native-core.sh
==================================

/usr/bin/coredumpctl
0x7ffd3d934000  0x7ffd3d935000  linux-vdso.so.1
0x7f4ccbf99000  0x7f4ccbfcd200  ld-linux-x86-64.so.2
0x7f4ccbd7b000  0x7f4ccbf84ad0  libc.so.6
0x56038dbfc000  0x56038dc000a8  backtrace-child
TID 3043057:
# 0 0x7f4ccbe0a89c      __pthread_kill_implementation
# 1 0x7f4ccbdbd6b6 - 1  raise
# 2 0x56038dbfd3fd - 1  sigusr2
# 3 0x56038dbfd4ca - 1  stdarg
# 4 0x56038dbfd4e0 - 1  backtracegen
# 5 0x56038dbfd4e9 - 1  start
# 6 0x7f4ccbe08ad7 - 1  start_thread
# 7 0x7f4ccbe8d770 - 1  __clone3
TID 3043052:
# 0 0x7f4ccbe8d75d      __clone3
/srv/buildbot/worker/elfutils-fedora-x86_64/build/tests/backtrace: dwfl_thread_getframes: address out of range
backtrace: backtrace.c:81: callback_verify: Assertion `seen_main' failed.
./test-subr.sh: line 84: 3043062 Aborted (core dumped) LD_LIBRARY_PATH="${built_library_path}${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH" $VALGRIND_CMD "$@"
backtrace-child-core.3043052: no main
rmdir: failed to remove 'test-3043029': Directory not empty
FAIL run-backtrace-native-core.sh (exit status: 1)

This is an x86_64 process core being backtraced on x86_64.

The problem in both cases is that the parent cannot unwind from the
exact pc it is stuck at. With eu-stack -v --core we can see (for the
parent TID):

TID 3043052:
#0  0x00007f4ccbe8d75d     __clone3 - libc.so.6
    ../sysdeps/unix/sysv/linux/x86_64/clone3.S:62
eu-stack: dwfl_thread_getframes tid 3043052 at 0x7f4ccbe8d75d in libc.so.6: address out of range

That is this source code:

ENTRY (__clone3)
        /* Sanity check arguments.  */
        movl    $-EINVAL, %eax
        test    %RDI_LP, %RDI_LP        /* No NULL cl_args pointer.  */
        jz      SYSCALL_ERROR_LABEL
        test    %RDX_LP, %RDX_LP        /* No NULL function pointer.  */
        jz      SYSCALL_ERROR_LABEL

        /* Save the cl_args pointer in R8 which is preserved by the
           syscall.  */
        mov     %RCX_LP, %R8_LP

        /* Do the system call.  */
        movl    $SYS_ify(clone3), %eax

        /* End FDE now, because in the child the unwind info will be
           wrong.  */
        cfi_endproc
        syscall
=>      test    %RAX_LP, %RAX_LP
        jl      SYSCALL_ERROR_LABEL
        jz      L(thread_start)

        ret

L(thread_start):
        cfi_startproc
        /* Clearing frame pointer is insufficient, use CFI.  */
        cfi_undefined (rip)
        /* Clear the frame pointer.  The ABI suggests this be done, to mark
           the outermost frame obviously.  */
        xorl    %ebp, %ebp

        /* Align stack to 16 bytes per the x86-64 psABI.  */
        and     $-16, %RSP_LP
[...]

So the PC is right after the syscall, where, as the code says, there is
no CFI. Apparently the child ran first and quickly got to the
terminating kill, while the parent was still stuck in the syscall (or
just out of it, but not yet returned from the clone3 call).

I think some synchronization is missing between the parent and child.
But the test code is fairly complex.

Cheers,

Mark
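
To make the missing synchronization a bit more concrete, here is a
minimal sketch. It is not the actual backtrace-child test code, and all
names in it (parent_ready, child_start, run_synchronized_child) are
hypothetical. It assumes it would be enough for the child to wait on a
semaphore that the parent only posts once pthread_create has returned,
so the parent can no longer be sitting inside the clone call when the
child goes on to raise its signal:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical names throughout; this is not the elfutils test code.  */
static sem_t parent_ready;    /* Posted once pthread_create has returned.  */
static int parent_returned;   /* Set by the parent before posting.  */
static int child_saw_parent;  /* What the child observed after waiting.  */

static void *
child_start (void *arg)
{
  (void) arg;
  /* Block until the parent is known to be out of the clone call,
     retrying if the wait is interrupted by a signal.  */
  while (sem_wait (&parent_ready) == -1 && errno == EINTR)
    continue;
  assert (parent_returned);
  child_saw_parent = parent_returned;
  /* A real test would raise its fatal signal here.  */
  return NULL;
}

/* Returns 1 if the child only proceeded after the parent had returned
   from pthread_create, and -1 if setup failed.  */
int
run_synchronized_child (void)
{
  pthread_t thread;

  parent_returned = 0;
  child_saw_parent = 0;
  if (sem_init (&parent_ready, 0, 0) != 0)
    return -1;
  if (pthread_create (&thread, NULL, child_start, NULL) != 0)
    {
      sem_destroy (&parent_ready);
      return -1;
    }
  /* Back from pthread_create, so no longer inside clone/clone3.  */
  parent_returned = 1;
  sem_post (&parent_ready);
  pthread_join (thread, NULL);
  sem_destroy (&parent_ready);
  return child_saw_parent;
}
```

Calling run_synchronized_child () from main should return 1. Whether
something this simple fits into the existing test is a separate
question, since the parent would then be blocked in pthread_join when
the core is taken and has to be unwindable from there.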