From: Hans-Peter Nilsson <hp@bitrange.com>
To: overseers@sources.redhat.com
Subject: Source(s|ware) ht://Dig indexing does not index ml:s correctly
Date: Sat, 30 Dec 2000 06:08:00 -0000
Message-ID: <Pine.BSF.4.10.10007110937110.22049-100000@dair.pair.com>

Searching for a gem
hidden under stone on stone
gets buried in pile

For many projects (ehrm, at least newlib), searches do not "get hits" for
posts that arrive after the first few new ones in the latest period (month,
quarter, year).

This is a result of having pages marked meta noindex,follow while start_url
points only to the site URL when updating.  The update will only process
pages that have changed.  If such an unchanged page points to a page
marked with meta noindex,follow (like the mailing list index for a
time period), new messages will not be indexed (or will only be indexed if
they are pointed to from an updated page elsewhere).
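
To make the failure mode concrete, here is a toy model of the behaviour
described above.  It is only a sketch: it is not ht://Dig's actual update
algorithm, and the function name and all paths in it are hypothetical.  It
assumes that an unchanged page is skipped entirely, so its links are never
followed.

```python
from collections import deque


def update_crawl(start_urls, links, changed):
    """Toy incremental update: only changed pages are re-parsed for links.

    links maps a URL to the URLs it points to; changed is the set of URLs
    whose content differs from the previous run.  Both are assumptions made
    for illustration, not data taken from ht://Dig.
    """
    seen = set()
    processed = []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        if url not in changed:
            # Unchanged page: skipped, so its outgoing links are not followed.
            continue
        processed.append(url)
        queue.extend(links.get(url, []))
    return processed


# Hypothetical site layout: root -> list page -> period index -> messages.
links = {
    "/": ["/ml/newlib/"],
    "/ml/newlib/": ["/ml/newlib/2000/"],
    "/ml/newlib/2000/": [
        "/ml/newlib/2000/msg00001.html",
        "/ml/newlib/2000/msg00002.html",
    ],
}
# Only the period index and the newest message changed since the last run.
changed = {"/ml/newlib/2000/", "/ml/newlib/2000/msg00002.html"}

# Starting from the site root alone, nothing new is processed.
print(update_crawl(["/"], links, changed))
# Adding the period index as a start URL makes the new message reachable.
print(update_crawl(["/", "/ml/newlib/2000/"], links, changed))
```

In this model the new message is only reached once the changed index page is
itself a starting point, which is what adding it to start_url would do.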

The obvious hack is to remove the noindex,follow mark everywhere,
but a better solution is to add a list of such (topmost) noindex,follow
URLs to start_url.  The trick (if there is one) is to form such a list
without assuming anything static, like what the current mailing lists are,
or at least to do it with *enough* room for things to work seamlessly
without lots of fiddling when things change or projects are added.
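
One way to form such a list without hard-coding the mailing lists might be
to scan the generated pages for the robots meta tag itself.  The following
is a minimal sketch using only the Python standard library.  The class and
function names are hypothetical, and it assumes the caller already has a
mapping from URL to HTML text; how that mapping is obtained (reading the
document root, fetching over HTTP) is left open.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def is_noindex_follow(html_text):
    """True if the page carries both the noindex and follow directives."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return {"noindex", "follow"} <= parser.directives


def noindex_follow_urls(pages):
    """pages maps URL -> HTML text; returns the matching URLs, sorted."""
    return sorted(url for url, text in pages.items() if is_noindex_follow(text))


if __name__ == "__main__":
    # Hypothetical sample input; real input would come from the site itself.
    sample = {
        "http://example.org/ml/": (
            '<html><head><META NAME="robots" CONTENT="noindex,follow">'
            "</head></html>"
        ),
        "http://example.org/ml/msg00001.html": "<html><head></head></html>",
    }
    for url in noindex_follow_urls(sample):
        print(url)
```

Written out one URL per line, the result could serve as the file that a
backquoted include in start_url refers to.  This sketch finds every
noindex,follow page, not just the topmost ones, so some filtering may still
be wanted.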

Suggestions welcome.

The same problem exists with gcc.gnu.org, and someone complained about
missing expected hits.  I haven't answered him yet, maybe because I'm
arrogant or something, or thought I should just fix it first.  :-P

Patch for visualization purposes only.

Index: sourceware.conf
===================================================================
RCS file: /cvs/sourceware/infra/htdig-conf/sourceware.conf,v
retrieving revision 1.8
diff -p -c -r1.8 sourceware.conf
*** sourceware.conf	2000/07/11 10:23:58	1.8
--- sourceware.conf	2000/07/11 13:51:23
*************** database_dir:		/sourceware/htdig/sourcew
*** 19,25 ****
  # You could also index all the URLs in a file like so:
  # start_url:	       `${common_dir}/start.url`
  #
! start_url:		http://sources.redhat.com/
  
  # The old hostname (left side) is here changed to the canonical hostname
  # (right side), to avoid a loop of redirects.
--- 19,25 ----
  # You could also index all the URLs in a file like so:
  # start_url:	       `${common_dir}/start.url`
  #
! start_url:		http://sources.redhat.com/ `/path/to/list_of_noindex_follow_urls`
  
  # The old hostname (left side) is here changed to the canonical hostname
  # (right side), to avoid a loop of redirects.

Sorry for not fixing it yet, but I thought I should at least report the
current badness.

brgds, H-P

WARNING: multiple messages have this Message-ID
From: Hans-Peter Nilsson <hp@bitrange.com>
To: overseers@sources.redhat.com
Subject: Source(s|ware) ht://Dig indexing does not index ml:s correctly
Date: Tue, 11 Jul 2000 07:23:00 -0000
Message-ID: <Pine.BSF.4.10.10007110937110.22049-100000@dair.pair.com>
Message-ID: <20000711072300.njhB9ctRsfgLt85Ilff7iQ-XM5SsJBTtDe6RkGVsx44@z>

Thread overview: 6+ messages
2000-12-30  6:08 Hans-Peter Nilsson [this message]
2000-07-11  7:23 ` Hans-Peter Nilsson
2000-12-30  6:08 ` Hans-Peter Nilsson
2000-07-25 11:22   ` Hans-Peter Nilsson
2000-12-30  6:08   ` Gerald Pfeifer
2000-07-25 13:32     ` Gerald Pfeifer
