public inbox for overseers@sourceware.org
* Indexing incomplete due to document-size truncation
From: Hans-Peter Nilsson @ 2001-03-26  6:00 UTC
  To: overseers

I investigated a report from a user complaining that htdig didn't find a
message in gcc-patches/1999-09n/, while hits were found in gcc-bugs.

I think I found the cause of that problem: while indexing, the max
document size is set to 200000 bytes.  The most visible effect is that,
due to reverse-order listing, mailing-list time periods with an index.html
over that length will not have their first (oldest) messages indexed
whenever the search DB is re-indexed after that time period.
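
To illustrate the effect, here is a minimal Python sketch.  It assumes
the indexer simply reads the first max_doc_size bytes of a page and
follows only the links it finds there.  The page layout, the msgNNNNN.html
link names and both function names are made up for the illustration; they
are not taken from htdig or from the real archive pages.

```python
import re


def build_reverse_index(n_messages, padding=100):
    """Build a hypothetical index page listing the newest message first."""
    lines = []
    for i in range(n_messages - 1, -1, -1):
        lines.append(
            b'<li><a href="msg%05d.html">Subject %d</a>' % (i, i)
            + b" " * padding
            + b"</li>\n"
        )
    return b"".join(lines)


def indexed_links(index_html, max_doc_size):
    """Return the links visible in the first max_doc_size bytes only."""
    truncated = index_html[:max_doc_size]
    # A link cut in half by the truncation does not match and is lost too.
    return re.findall(rb'href="(msg\d+\.html)"', truncated)


page = build_reverse_index(3000)
seen = indexed_links(page, 200000)
# The newest message sits at the top of the page and is found; the
# oldest one sits at the bottom, past the cut-off, and is never reached.
print(b"msg02999.html" in seen, b"msg00000.html" in seen)
```

With a large enough limit, every link on the same page is seen again.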

That's kind of obvious, though I thought I set that limit high enough at
the time. :-(  The next question was "how long do those ml index.html
files get, so what would be a reasonable limit?"  Find+sort on both the
gcc and sourceware sides shows that they get as large as (find -ls):

1430616  468 -rw-r--r--   1 listarch gcc        478262 Jan 31 23:10 ./gcc-patches/2001-01/index.html

only beaten by an index which seems flawed:

1816724 2278 -rw-r--r--   1 listarch sourcewa  2332154 Mar 31  2000 ./sourceware/ml/dssslist/2000-03/index.html

Can that last one be fixed?  It is still functional HTML, but for each new
message, all the previous messages are listed, plus a searchbox (or
something like that).  Its size should probably be on the order of 30k.
It's not incredibly bad, just bad.
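
The size survey above was done with find and sort.  A rough Python
equivalent, as a sketch only: the function names and the default of ten
results are assumptions for illustration, and the archive root directory
has to be supplied by the caller.

```python
import os


def largest_index_files(root, name="index.html", top=10):
    """Return (size, path) pairs for files called `name`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:top]  # top=None returns every entry


def over_limit(root, limit, name="index.html"):
    """Return the (size, path) pairs whose size exceeds `limit` bytes."""
    return [(size, path)
            for size, path in largest_index_files(root, name, top=None)
            if size > limit]
```

Calling over_limit(root, 200000) on an archive tree would list the index
pages that the current limit truncates.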

That 478262-byte index.html is no particular exception.  The runners-up
are of about the same order.  I'm thinking of raising the limit to 1M per
document.  Having the largest index.html as recent as January is a sign
that they might grow that large (though they should probably then be split
into weekly indexes or something, as Jason indicated).

Still, this would mean indexing more messages, so there would have to be
more disk space before I dare change this.  So here we are, back at my
previous rant.

brgds, H-P

PS:
Index: site.conf
===================================================================
RCS file: /cvs/sourceware/infra/htdig-conf/site.conf,v
retrieving revision 1.8
diff -p -c -c -p -1 -r1.8 site.conf
*** site.conf	2001/03/26 09:55:17	1.8
--- site.conf	2001/03/26 13:56:38
*************** no_excerpt_show_top: true
*** 85,87 ****
  #
! max_doc_size:		200000

--- 85,87 ----
  #
! max_doc_size:		1000000


