public inbox for overseers@sourceware.org
* Source(s|ware) ht://Dig indexing does not index ml:s correctly
@ 2000-12-30  6:08 Hans-Peter Nilsson
  2000-07-11  7:23 ` Hans-Peter Nilsson
  2000-12-30  6:08 ` Hans-Peter Nilsson
  0 siblings, 2 replies; 6+ messages in thread
From: Hans-Peter Nilsson @ 2000-12-30  6:08 UTC
  To: overseers

Searching for a gem
hidden under stone on stone
gets buried in pile

For many projects (ehrm, at least newlib), searches do not "get hits" for
posts beyond the first few new ones in the latest period (month, quarter,
year).

This is a result of having pages marked meta noindex,follow and pointing
only to the site URL when updating.  The update will only process pages
that have changed.  If an unchanged page points to a page marked with
meta noindex,follow (like the mailing list index for a time period), new
messages will not be indexed (or will only be indexed if they are pointed
to from an updated page elsewhere).
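
To make the failure mode concrete, here is a toy model of the update pass as
described above.  It is only a sketch of my understanding, not ht://Dig's
actual code; the function name, the page dictionary layout and the example
URLs are all made up for illustration.

```python
# Toy model of an incremental update run (NOT ht://Dig's real logic).
# pages: url -> {"links": [...], "changed": bool, "noindex": bool}
# db:    set of URLs already stored in the index database.

def update_dig(start_urls, pages, db):
    """Return the set of URLs that this update run (re)indexes."""
    indexed = set()
    seen = set()
    queue = list(start_urls)
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        page = pages.get(url)
        if page is None:
            continue  # dangling link, nothing to fetch
        # A page that is already known and unchanged is skipped entirely,
        # so the links on it are never followed.
        if url in db and not page["changed"]:
            continue
        # noindex,follow pages are parsed for links but never stored,
        # so they are never "known" on the next run either.
        if not page["noindex"]:
            indexed.add(url)
        queue.extend(page["links"])
    return indexed


pages = {
    "/": {"links": ["/ml/newlib/"], "changed": False, "noindex": False},
    "/ml/newlib/": {"links": ["/ml/newlib/2000/"], "changed": True, "noindex": True},
    "/ml/newlib/2000/": {
        "links": ["/ml/newlib/2000/msg1.html", "/ml/newlib/2000/msg2.html"],
        "changed": True,
        "noindex": True,
    },
    "/ml/newlib/2000/msg1.html": {"links": [], "changed": False, "noindex": False},
    "/ml/newlib/2000/msg2.html": {"links": [], "changed": True, "noindex": False},
}
db = {"/", "/ml/newlib/2000/msg1.html"}

only_site_url = update_dig(["/"], pages, db)
with_ml_index = update_dig(["/", "/ml/newlib/"], pages, db)
```

In this model, starting only from the unchanged site URL never reaches the new
message, while also starting from the noindex,follow list index does.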

The obvious hack is to remove the noindex,follow mark everywhere,
but a better solution is to add a list of such (topmost) noindex,follow
URLs to start_url.  The trick (if there is one) is to form such a list
without assuming anything static, like what the current mailing lists are,
or at least to do it with *enough* room for things to work seamlessly,
without lots of fiddling when things change or projects are added.
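
One way to form such a list without hardcoding the mailing lists might be to
scan the web tree for it.  The sketch below assumes (and these are only
assumptions) that the archives are static files under a document root, that
index pages are named index.html, and that the robots meta tag is written
with name before content.  The function names are hypothetical.

```python
import os
import re

# Assumed form of the tag: <meta name="robots" content="noindex,follow">
ROBOTS_RE = re.compile(
    r'<meta\s+name=["\']?robots["\']?\s+content=["\']?noindex,\s*follow',
    re.IGNORECASE,
)


def is_noindex_follow(path):
    """True if the first 4 KB of the file carry a noindex,follow robots tag."""
    with open(path, encoding="latin-1") as f:
        return ROBOTS_RE.search(f.read(4096)) is not None


def topmost_noindex_urls(docroot, base_url, index_name="index.html"):
    """Walk docroot top-down and return URLs of the topmost directories
    whose index page is marked noindex,follow."""
    urls = []
    for dirpath, dirnames, filenames in os.walk(docroot):
        index_path = os.path.join(dirpath, index_name)
        if index_name in filenames and is_noindex_follow(index_path):
            rel = os.path.relpath(dirpath, docroot).replace(os.sep, "/")
            suffix = "" if rel == "." else rel + "/"
            urls.append(base_url.rstrip("/") + "/" + suffix)
            # Do not descend: pages below are reachable by following links
            # from this one, so only the topmost URL is needed.
            dirnames[:] = []
    return sorted(urls)
```

The result, written one URL per line to a file, could then be what the
backquoted file name in start_url refers to.  New projects would be picked
up on the next scan without anyone editing the configuration.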

Suggestions welcome.

The same problem exists with gcc.gnu.org, and someone complained about
missing expected hits.  I haven't answered him yet, maybe because I'm
arrogant or something, or thought I should just fix it first.  :-P

Patch for visualization purposes only.

Index: sourceware.conf
===================================================================
RCS file: /cvs/sourceware/infra/htdig-conf/sourceware.conf,v
retrieving revision 1.8
diff -p -c -r1.8 sourceware.conf
*** sourceware.conf	2000/07/11 10:23:58	1.8
--- sourceware.conf	2000/07/11 13:51:23
*************** database_dir:		/sourceware/htdig/sourcew
*** 19,25 ****
  # You could also index all the URLs in a file like so:
  # start_url:	       `${common_dir}/start.url`
  #
! start_url:		http://sources.redhat.com/
  
  # The old hostname (left side) is here changed to the canonical hostname
  # (right side), to avoid a loop of redirects.
--- 19,25 ----
  # You could also index all the URLs in a file like so:
  # start_url:	       `${common_dir}/start.url`
  #
! start_url:		http://sources.redhat.com/ `/path/to/list_of_noindex_follow_urls`
  
  # The old hostname (left side) is here changed to the canonical hostname
  # (right side), to avoid a loop of redirects.

Sorry for not fixing it yet, but I thought I should at least report the
current badness.

brgds, H-P


