public inbox for overseers@sourceware.org
From: Hans-Peter Nilsson <hp@bitrange.com>
To: overseers@sources.redhat.com
Subject: Re: Source(s|ware) ht://Dig indexing does not index ml:s correctly
Date: Sat, 30 Dec 2000 06:08:00 -0000
Message-ID: <Pine.BSF.4.10.10007251259210.16931-100000@dair.pair.com>
In-Reply-To: <Pine.BSF.4.10.10007110937110.22049-100000@dair.pair.com>

On Tue, 11 Jul 2000, Hans-Peter Nilsson wrote:

(Talking to myself here; I didn't get any comments or requests for
clarification.)

> Many projects (ehrm, at least newlib) do not "get hits" for the posts
> after a few new ones on the latest (month, year, quarter).
> 
> This is a result of having pages marked meta noindex,follow and only
> pointing to the site URL when updating.  The update will only process
> pages that have changed.  If such a non-changed page points to a page
> marked with meta noindex,follow, (like the mailing list index for a
> time-period), new messages will not be indexed (or only be indexed if they
> are pointed to from an updated page elsewhere).

I think this is really a misfeature of ht://Dig: when doing the original
from-scratch indexing, it should save for later updates (not throw away)
the URLs it found whenever a "noindex,follow" meta tag stopped all
processing of a page other than adding that page's links to the indexing.

> The obvious hack is to remove the noindex,follow mark everywhere,
> but a better solution is to add a list of such (topmost) noindex,follow
> urls to start_url.  The trick (if there is one) is to form such a list
> without assuming anything static, like what the current mailing lists are.
> Or at least to do it with *enough* room for things to work seamlessly
> without lots of fiddling when things change or projects are added.
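
A toy model of the behaviour described above, as a sketch only: it is not
ht://Dig code, and the URLs, the Page class and update_run are hypothetical
names made up for illustration.  It assumes an update re-checks the start
URLs plus every URL already in the database, parses only new or changed
pages, and never stores noindex,follow pages in the database.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    links: list = field(default_factory=list)
    noindex: bool = False  # page carries meta "noindex,follow"
    changed: bool = False  # modified since the last indexing run


def update_run(site, db, start_urls):
    """Toy update pass: known, unchanged pages are skipped entirely."""
    queue = list(start_urls) + sorted(db)
    seen = set()
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        page = site[url]
        if url in db and not page.changed:
            continue  # already indexed and unchanged: links are not followed
        if not page.noindex:
            db.add(url)  # index the page itself
        queue.extend(page.links)  # "follow": queue the page's links
    return db


# Hypothetical site layout: a period index marked noindex,follow that
# has gained one new message since the from-scratch run.
PERIOD = "/ml/newlib/2000-q3/"
OLD_MSG = PERIOD + "msg00001.html"
NEW_MSG = PERIOD + "msg00002.html"
site = {
    "/": Page(links=["/ml/newlib/"]),
    "/ml/newlib/": Page(links=[PERIOD]),
    PERIOD: Page(links=[OLD_MSG, NEW_MSG], noindex=True, changed=True),
    OLD_MSG: Page(),
    NEW_MSG: Page(),
}
indexed = {"/", "/ml/newlib/", OLD_MSG}  # state after the from-scratch run

site_only = update_run(site, set(indexed), ["/"])
with_period = update_run(site, set(indexed), ["/", PERIOD])
```

In this model the run started from "/" alone never reaches the new message,
because nothing that links to the period index is re-parsed; listing the
period index itself as a start URL lets the new message be picked up.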

I see three solutions for handling the lists of pages with noindex,follow.

 1:
Hack ht://Dig to generate the list by itself.  Creating the list in an
external file is somewhat simple; using the existing DB to keep track of
it will be a bit harder.

 Pros:
- Almost everything will work as it stands configury-wise, with only (say)
an extra option at index time, and a change as the start_url patch in my
previous message.

 Cons:
- I have to go hack ht://Dig; it feels like it will take longer than the
other two options.
- A similar solution needs to go in future ht://Dig releases, or
sources.redhat.com/gcc.gnu.org will have to keep track of local ht://Dig
patches.


 2:
Do it in (a script called by) the htupdate-sourceware.sh and htupdate.sh
(for gcc) scripts, using configury from ht://Dig and find+grep+sed+sh
constructs.

 Pros:
- Changes are local to the ht://Dig configuration.
- Will handle occurrences of noindex,follow generally; mailing lists as
well as other places (F-O-M?).
- I've already started along this route (right, that's not a good reason).

 Cons:
- Some hundred lines in htupdate-sourceware.sh; perhaps hard to follow.
- A "find" will traverse the web-directories every update.
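
A rough sketch of the kind of traversal option 2 implies, written in Python
rather than as the find+grep+sed+sh constructs mentioned above.  The
function name, the base URL, the exclude prefixes, the assumption that the
robots meta tag appears in the first 4096 bytes with name before content,
and the mapping of index.html to its directory URL are all assumptions made
for illustration, not the actual htupdate-sourceware.sh logic.

```python
import os
import re

# Loose match for <meta name="robots" content="noindex,follow">.
ROBOTS_RE = re.compile(
    r'<meta\s+name=["\']?robots["\']?\s+content=["\']?\s*noindex\s*,\s*follow',
    re.IGNORECASE,
)


def noindex_follow_urls(web_root, base_url, excludes=()):
    """Return URLs of HTML files under web_root marked noindex,follow."""
    urls = []
    for dirpath, dirnames, filenames in os.walk(web_root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, web_root).replace(os.sep, "/")
            # Skip areas that should stay out of the index altogether.
            if any(rel.startswith(prefix) for prefix in excludes):
                continue
            with open(path, encoding="latin-1") as fh:
                head = fh.read(4096)  # the meta tag is assumed to be near the top
            if ROBOTS_RE.search(head):
                if name == "index.html":
                    rel = rel[: -len("index.html")]
                urls.append(base_url.rstrip("/") + "/" + rel)
    return urls
```

The resulting list could then be written to a file and appended to
start_url before each update run; the excludes argument stands in for the
handling of otherwise non-indexed lists.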


 3:
Do it in monthly-updates; appending indexes for new months to a file.

 Pros:
- Seems like the smallest change (tens of lines?).

 Cons:
- Will only handle the problem for mailing lists: future additions of
noindex,follow tags in other places will fail silently (as they do now).
- Unexpected dependency between the ht://Dig configury and the mailing
list archive management.
- Will have to add ht://Dig excludes for otherwise non-indexed pages
like "overseers" anyway, as with #2.
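
For comparison, option 3 could be as small as the following sketch, where a
monthly job records each new period index in a list file that is later fed
to start_url.  The function name and the file format (one URL per line) are
assumptions for illustration; the actual monthly-updates script is not
shown in this message.

```python
def append_period_index(list_file, url):
    """Append url to list_file unless it is already listed.

    Returns True if the URL was added, False if it was already present.
    """
    try:
        with open(list_file) as fh:
            known = {line.strip() for line in fh}
    except FileNotFoundError:
        known = set()  # first run: no list yet
    if url in known:
        return False
    with open(list_file, "a") as fh:
        fh.write(url + "\n")
    return True
```

Calling it twice with the same URL leaves a single entry, so the monthly
job can be re-run safely.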


I'll pick #2 for now: I don't like #3 and I think #1 will take more time
than I have right now.

If you have another opinion, please scream within 48 hours (as I'll be
gone for a week after that) or revert the patches I'll copy here when I'm
done.

brgds, H-P

WARNING: multiple messages have this Message-ID
From: Hans-Peter Nilsson <hp@bitrange.com>
To: overseers@sources.redhat.com
Subject: Re: Source(s|ware) ht://Dig indexing does not index ml:s correctly
Date: Tue, 25 Jul 2000 11:22:00 -0000
Message-ID: <Pine.BSF.4.10.10007251259210.16931-100000@dair.pair.com>
Message-ID: <20000725112200.Q6lk0ma2gI61WKQA--D4hcSLGRp_XuFfu_EPm5jXrao@z>
In-Reply-To: <Pine.BSF.4.10.10007110937110.22049-100000@dair.pair.com>

Thread overview: 6+ messages
2000-12-30  6:08 Hans-Peter Nilsson
2000-07-11  7:23 ` Hans-Peter Nilsson
2000-12-30  6:08 ` Hans-Peter Nilsson [this message]
2000-07-25 11:22   ` Hans-Peter Nilsson
2000-12-30  6:08   ` Gerald Pfeifer
2000-07-25 13:32     ` Gerald Pfeifer