From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hans-Peter Nilsson
To:
Subject: Indexing incomplete due to document-size truncation
Date: Mon, 31 Dec 2001 19:40:00 -0000
Message-id:
X-SW-Source: 2001/msg00503.html

I investigated a report from a user complaining that htdig didn't
find a message in gcc-patches/1999-09n/, while hits were found in
gcc-bugs.

I think I found the cause of that problem: while indexing, the max
document size is set to 200000 bytes. The most visible effect is
that, because the indexes list messages in reverse order, any ml time
period whose index.html exceeds that length will not have its first
messages indexed whenever the search DB is re-indexed after that time
period. That's kind of obvious, though I thought I set that limit
high enough at the time. :-(

The next question was "how long do those ml index.html files get, so
what should be a reasonable limit"? Find+sort on both the gcc and
sourceware sides shows that they come up as large as (find -ls):

  1430616  468 -rw-r--r--   1 listarch gcc        478262 Jan 31 23:10 ./gcc-patches/2001-01/index.html

only beaten by an index which seems flawed:

  1816724 2278 -rw-r--r--   1 listarch sourcewa  2332154 Mar 31  2000 ./sourceware/ml/dssslist/2000-03/index.html

Can that last one be fixed? It is still functional html, but for
each new message, all the previous messages are listed, plus a
searchbox (or something like that). Its size should probably be on
the order of 30k. It's not incredibly bad, just bad.

That 478262-byte index.html is no particular exception. The
runners-up are of about the same order.

I'm thinking of raising the limit to 1M per message. Having the
largest index.html as recent as January is a sign that they might
grow that large (though they should probably then be split into
weekly indexes or something, as Jason indicated). Still, this would
mean indexing more messages, so there would have to be more disk
space before I dare change this. So here we are, back at my
previous rant.

brgds, H-P

PS:

Index: site.conf
===================================================================
RCS file: /cvs/sourceware/infra/htdig-conf/site.conf,v
retrieving revision 1.8
diff -p -c -c -p -1 -r1.8 site.conf
*** site.conf 2001/03/26 09:55:17 1.8
--- site.conf 2001/03/26 13:56:38
*************** no_excerpt_show_top: true
*** 85,87 ****
  #
! max_doc_size: 200000

--- 85,87 ----
  #
! max_doc_size: 1000000
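
The find+sort survey described in the message can be reproduced
along the following lines. This is a minimal sketch, not the command
that was actually run: the function name oversized_indexes and its
parameters are illustrative assumptions. Only the 200000-byte limit
and the index.html file name come from the message.

```python
import os

# The limit quoted in the message (max_doc_size in site.conf).
MAX_DOC_SIZE = 200000


def oversized_indexes(root, limit=MAX_DOC_SIZE, name="index.html"):
    """Return (size, path) pairs for every file called `name` under
    `root` whose size in bytes exceeds `limit`, largest first.

    The function and its parameters are hypothetical; they only
    illustrate the kind of size survey described in the message.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                hits.append((size, path))
    # Largest files first, like a descending sort on the size column.
    hits.sort(reverse=True)
    return hits


# Example use (the archive root is an assumption):
#   for size, path in oversized_indexes("."):
#       print(f"{size:>10} {path}")
```

Every path reported this way is an index page that would be cut off
at the current limit, so, given the reverse-order listing, the
earliest messages of that period would be the ones left out.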