Hi,

On Mon, 2021-09-13 at 11:06 +0200, Mark Wielaard wrote:
> On Sun, Sep 12, 2021 at 11:16:09PM +0000, 
> buildbot@builder.wildebeest.org wrote:
> > The Buildbot has detected a new failure on builder
> > elfutils-fedora-s390x while building elfutils.
> > Full details are available at:
> >     https://builder.wildebeest.org/buildbot/#builders/10/builds/795
> > 
> > Buildbot URL: https://builder.wildebeest.org/buildbot/
> > 
> > Worker for this Build: fedora-s390x
> 
> This is the same failure we saw on fedora-ppc64 and centos-x86_64
> yesterday.
> https://builder.wildebeest.org/buildbot/#/builders/10/builds/795/steps/8/logs/test-suite_log
> 
> I still don't understand why. In the logs we can see (for the PORT2
> server):
> 
> [Sun Sep 12 22:56:26 2021] (1493056/1493066): recorded
> buildid=a0a48245eb29786f7b6853df68ab23cb608b344b
> file=/home/mjw/bb/wildebeest/elfutils-fedora-s390x/build/tests/dwfllines
> mtime=1631486319 atype=ED
> 
> But then, 2 seconds later:
> [Sun Sep 12 22:56:28 2021] (1493056/1493388): searching for
> buildid=a0a48245eb29786f7b6853df68ab23cb608b344b
> artifacttype=debuginfo suffix=
> [Sun Sep 12 22:56:28 2021] (1493056/1493388): not found
> [Sun Sep 12 22:56:28 2021] (1493056/1493388): 127.0.0.1:47886
> UA:elfutils/0.185,Linux/s390x,fedora/34 XFF: GET
> /buildid/a0a48245eb29786f7b6853df68ab23cb608b344b/debuginfo 404 9
> 0+2ms
> 
> Somewhere inbetween the buildid seems to have been forgotten. But I
> cannot figure out why or where. It is clearly non-deterministic since
> normally the tests PASS.

So the issue is triggered by this part in groom ():

   // delete buildids with no references in _r_de or _f_de tables;
   // cascades to _r_sref & _f_s records
   sqlite_ps buildids_del (db, "nuke orphan buildids",
                           "delete from " BUILDIDS "_buildids "
                           "where not exists (select 1 from " BUILDIDS "_f_de d where " BUILDIDS "_buildids.id = d.buildid) "
                           "and not exists (select 1 from " BUILDIDS "_r_de d where " BUILDIDS "_buildids.id = d.buildid)");
   buildids_del.reset().step_ok_done();

With that part commented out I can run the tests (or a simplified
version using just one server and one client request, as attached)
30000 times without issue. With groom executing that part of the
code, the test fails after a couple hundred cycles.

Now the question is whether it is reasonable that groom removes the
buildid here. Is that because of the way the test is written? Or is
this a real bug where there is a bad interaction between a (partial?)
scan run and a groom cycle?
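
To make the second possibility concrete, here is a minimal sketch of one
interleaving that would produce the same symptom. It is not a confirmed
diagnosis. It assumes (without having verified it against the scanner
code) that the buildid row and the row referencing it are written as two
separate steps, so a groom pass can run between them. The table names,
the helper functions and the insert statements are hypothetical,
simplified stand-ins written in Python with an in-memory sqlite database;
only the delete statement mirrors the one quoted above.

```python
import sqlite3

# Hypothetical, simplified stand-ins for the BUILDIDS "_buildids",
# "_f_de" and "_r_de" tables; this is NOT the real debuginfod schema.
SCHEMA = """
create table buildids (id integer primary key, hex text unique not null);
create table f_de (buildid integer not null, file text not null);
create table r_de (buildid integer not null, file text not null);
"""

# Same shape as the "nuke orphan buildids" statement in groom ().
NUKE_ORPHANS = """
delete from buildids
where not exists (select 1 from f_de d where buildids.id = d.buildid)
and not exists (select 1 from r_de d where buildids.id = d.buildid)
"""


def record_buildid(db, hexid):
    # Step 1 of the assumed two-step scan: remember the buildid itself.
    db.execute("insert or ignore into buildids (hex) values (?)", (hexid,))


def record_file(db, hexid, path):
    # Step 2: reference the buildid from f_de. The id is looked up by hex,
    # so nothing is inserted if the buildid row has vanished meanwhile.
    db.execute(
        "insert into f_de (buildid, file) "
        "select id, ? from buildids where hex = ?",
        (path, hexid),
    )


def groom(db):
    db.execute(NUKE_ORPHANS)


def lookup(db, hexid):
    # Roughly what a debuginfo request for this buildid needs to find.
    return db.execute(
        "select d.file from buildids b join f_de d on d.buildid = b.id "
        "where b.hex = ?",
        (hexid,),
    ).fetchall()


def run(groom_in_between):
    # isolation_level=None means autocommit: each statement stands alone.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.executescript(SCHEMA)
    hexid = "a0a48245eb29786f7b6853df68ab23cb608b344b"
    record_buildid(db, hexid)
    if groom_in_between:
        groom(db)  # buildid has no f_de/r_de reference yet: it is an orphan
    record_file(db, hexid, "tests/dwfllines")
    if not groom_in_between:
        groom(db)  # after both steps the buildid is referenced and survives
    found = lookup(db, hexid)
    db.close()
    return found
```

In this sketch run(False) finds the file again, while run(True) finds
nothing, which would look like the "not found" / 404 above. If the real
scan path wraps both steps in one transaction, or orders them
differently, this particular interleaving cannot happen and the cause
must be elsewhere.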

Cheers,

Mark