Date: Tue, 16 Aug 2022 23:36:17 +0200
From: Mark Wielaard
To: Overseers mailing list
Cc: Simon Marchi
Subject: Re: inbox.sourceware.org experiment
References: <20220813141403.GL5520@gnu.wildebeest.org>
In-Reply-To: <20220813141403.GL5520@gnu.wildebeest.org>

Hi,

On Sat, Aug 13, 2022 at 04:14:03PM +0200, Mark Wielaard via Overseers wrote:
> Looking at the mailman2inbox.sh script I have a few suggestions (I can
> make them to the script myself, but don't know if you are currently
> editing/running it):
>
> - public-inbox-init should probably use -V2 (see above). You can then
>   also use -j JOBS to speed up the import.
>
> - --indexlevel should be full to make the Xapian searching more useful
>   (this is the default, so you can also not set it). Note that this
>   also affects the incremental updating done by public-inbox-mda.
>
> - You want to kill public-inbox-httpd using -SIGHUP so it just reloads
>   the new config files. You also want to kill the other daemons,
>   public-inbox-imapd and public-inbox-nntpd.
>
> - The --ng name should be based on the primary domain name (see
>   above). I don't know how to determine that easily though. Maybe
>   mailman knows, then we can also set the initial ADDRESS properly.
>
> The formail -s public-inbox-mda seems to work well for batch
> importing, but is it efficient enough for keeping the importing up to
> date? It looks like the last .mbox file is just really big and new
> messages are appended at the end, so we would be trying to import all
> messages all the time. And how do we make sure it is triggered when new
> messages come in?

It turns out public-inbox does support importing a full mbox in one
go. But it doesn't have a nice binary for it yet. There is however
scripts/import_vger_from_mbox in upstream git which is easily adapted
(just remove the vger specific filtering). I put this in the inbox
homedir as import_from_mbox.

And to test I removed the already imported elfutils-devel and
reimported it with the import_from_mbox script:

  $ public-inbox-init -V2 --ng inbox.sourceware.elfutils-devel \
      -L full elfutils-devel /home/inbox/lists/elfutils-devel \
      https://inbox.sourceware.org/elfutils-devel \
      elfutils@sourceware.org elfutils-devel@lists.fedorahosted.org
  $ ./import_from_mbox elfutils-devel \
      elfutils-devel@lists.fedorahosted.org lists/elfutils-devel \
      < /sourceware/projects/elfutils-home/elfutils-devel.nospam.mbox
  $ for i in /var/lib/mailman/archives/private/elfutils-devel.mbox/*mbox; do
      ./import_from_mbox elfutils-devel elfutils-devel@sourceware.org \
        lists/elfutils-devel < "$i"; done

Note this is V2 plus full indexing and includes an extra historical
elfutils-devel.nospam.mbox. Surprisingly this only took ~30 seconds in
total.

The elfutils-devel.nospam.mbox unfortunately doesn't contain enough
headers to do proper threading. But the full index does make it
possible to match on similar subjects.

I don't have a solution for keeping the archive up to date. Parsing
mboxes is really discouraged upstream because it needs reparsing all
messages and there is no locking mechanism for mboxes, so if mailman
writes to the mbox while public-inbox reads from it odd things can
happen.

One way to make it work with public-inbox-watch is to subscribe the
inbox user to each list and create a Maildir of messages. But then the
message headers will have been rewritten by mailman. So it would be
better to somehow get the inbox user the messages before mailman sees
them, or somehow get the inbox user a copy of the message as mailman
would add it to the mbox archive instead of what it sends to list
subscribers.

Cheers,

Mark
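
For repeating the batch import on other lists, the for loop above could
be wrapped in a small helper. This is only a sketch: import_list and
the IMPORT_CMD override are made-up names, and it assumes
import_from_mbox keeps the argument order used above (inbox name,
address, inbox directory, mbox on stdin) and exits non-zero on failure.

```shell
#!/bin/sh
# Sketch only: import every mbox of one list, in shell glob (sorted) order.
# IMPORT_CMD is a hypothetical override so the loop can be dry-run with a
# stub; it defaults to the ./import_from_mbox script in the inbox homedir.
IMPORT_CMD=${IMPORT_CMD:-./import_from_mbox}

# Usage: import_list NAME ADDRESS INBOXDIR MBOXDIR
import_list() {
    name=$1 address=$2 inboxdir=$3 mboxdir=$4
    for mbox in "$mboxdir"/*mbox; do
        # An unmatched glob stays literal; skip it rather than fail on "<".
        [ -f "$mbox" ] || continue
        # Quote the path and stop at the first failed import.
        "$IMPORT_CMD" "$name" "$address" "$inboxdir" < "$mbox" || return 1
    done
}
```

Called as

  import_list elfutils-devel elfutils-devel@sourceware.org \
      lists/elfutils-devel \
      /var/lib/mailman/archives/private/elfutils-devel.mbox

it does the same work as the loop above. It does nothing about the mbox
locking problem, so it is still only suitable for a one-off import
while mailman is not writing to the archive.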