From mboxrd@z Thu Jan  1 00:00:00 1970
From: Hans-Peter Nilsson <hp@bitrange.com>
To: overseers@sources.redhat.com
Subject: Re: Source(s|ware) ht://Dig indexing does not index ml:s correctly
Date: Sat, 30 Dec 2000 06:08:00 -0000
Message-id: <Pine.BSF.4.10.10007251259210.16931-100000@dair.pair.com>
References: <Pine.BSF.4.10.10007110937110.22049-100000@dair.pair.com>
X-SW-Source: 2000/msg00847.html

On Tue, 11 Jul 2000, Hans-Peter Nilsson wrote:

(Talking to myself here; I didn't get any comments or requests for
clarification.)

> Many projects (ehrm, at least newlib) do not "get hits" for the posts
> after a few new ones on the latest (month, year, quarter).
> 
> This is a result of having pages marked meta noindex,follow and only
> pointing to the site URL when updating.  The update will only process
> pages that have changed.  If such a non-changed page points to a page
> marked with meta noindex,follow, (like the mailing list index for a
> time-period), new messages will not be indexed (or only be indexed if they
> are pointed to from an updated page elsewhere).

I think this is really a misfeature of ht://Dig: when doing the original
from-scratch indexing, it should save for later updates (not throw away)
the URLs of pages where a "noindex,follow" meta tag stopped all processing
other than adding the page's links to the indexing.

> The obvious hack is to remove the noindex,follow mark everywhere,
> but a better solution is to add a list of such (topmost) noindex,follow
> urls to start_url.  The trick (if there is one) is to form such a list
> without assuming anything static, like what the current mailing lists are.
> Or at least to do it with *enough* room for things to work seamlessly
> without lots of fiddling when things change or projects are added.

I see three solutions for handling the lists of pages with noindex,follow.

 1:
Hack ht://Dig to generate the list by itself.  Creating the list in an
external file is somewhat simple; using the existing DB to keep track of
it will be a bit harder.

 Pros:
- Almost everything will work as it stands configury-wise, with only (say)
an extra option at index time, and a change like the start_url patch in my
previous message.

 Cons:
- I have to go hack ht://Dig; it feels like it will take longer than the
other two options.
- A similar solution needs to go in future ht://Dig releases, or
sources.redhat.com/gcc.gnu.org will have to keep track of local ht://Dig
patches.


 2:
Do it in (a script called by) the htupdate-sourceware.sh and htupdate.sh
(for gcc) scripts, using configury from ht://Dig and find+grep+sed+sh
constructs.

 Pros:
- Changes are local to the ht://Dig configuration.
- Will handle occurrences of noindex,follow generally; mailing lists as
well as other places (F-O-M?).
- I've already started along this route (right, that's not a good reason).

 Cons:
- Some hundred lines in htupdate-sourceware.sh; perhaps hard to follow.
- A "find" will traverse the web-directories every update.
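
To make the find+grep+sed idea in #2 concrete, here is a minimal sketch.
It is not the actual htupdate-sourceware.sh code.  The function name, the
web root and the base URL are all hypothetical, and it assumes the tag
is written as content="noindex,follow" in files named *.html, and that
the web root path contains no "|" characters.

```sh
#!/bin/sh
# Sketch only: list URLs of pages carrying a "noindex,follow" meta tag.
# The function name and both arguments are illustrative assumptions.

# list_noindex_follow_urls WEB_ROOT BASE_URL
# Prints one URL per matching *.html file below WEB_ROOT, sorted.
list_noindex_follow_urls() {
    web_root=$1
    base_url=$2
    # find: walk the tree (this is the per-update traversal cost).
    # grep -l: print only the names of files that contain the tag,
    #          case-insensitively, allowing spaces after the comma.
    # sed: swap the directory prefix for the URL prefix.
    find "$web_root" -type f -name '*.html' \
        -exec grep -l -i -E 'content="noindex, *follow"' {} + |
    sed "s|^$web_root/|$base_url/|" |
    sort
}
```

The output could be written to a file and added to start_url next to the
site URL, so that an update always revisits those pages.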


 3:
Do it in monthly-updates; appending indexes for new months to a file.

 Pros:
- Seems like the smallest change (tens of lines?).

 Cons:
- Will only handle the problem for mailing lists: future additions of
noindex,follow tags in other places will fail silently (as they do now).
- Unexpected dependency between the ht://Dig configury and the mailing
list archive management.
- Will have to add ht://Dig excludes for otherwise non-indexed pages
like "overseers" anyway, as with #2.
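
For comparison, the append step in #3 could be about this small.  Again
only a sketch: the function name, the list file and the URL layout are
hypothetical, and nothing here is taken from the real monthly-updates
script.

```sh
#!/bin/sh
# Sketch only: record the index URL of a new period in a list file.

# add_index_url LIST_FILE URL
# Appends URL to LIST_FILE unless an identical line is already there,
# so running it more than once for the same period is harmless.
add_index_url() {
    list_file=$1
    url=$2
    touch "$list_file"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex).
    if ! grep -q -x -F -e "$url" "$list_file"; then
        printf '%s\n' "$url" >> "$list_file"
    fi
}
```

This shows why #3 is short, and also why it is limited: only pages that
something explicitly registers ever end up in the list.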


I'll pick #2 for now: I don't like #3 and I think #1 will take more time
than I have right now.

If you have another opinion, please scream within 48 hours (as I'll be
gone for a week after that) or revert the patches I'll copy here when I'm
done.

brgds, H-P

From mboxrd@z Thu Jan  1 00:00:00 1970
From: Hans-Peter Nilsson <hp@bitrange.com>
To: overseers@sources.redhat.com
Subject: Re: Source(s|ware) ht://Dig indexing does not index ml:s correctly
Date: Tue, 25 Jul 2000 11:22:00 -0000
Message-ID: <Pine.BSF.4.10.10007251259210.16931-100000@dair.pair.com>
References: <Pine.BSF.4.10.10007110937110.22049-100000@dair.pair.com>
X-SW-Source: 2000-q3/msg00138.html
Message-ID: <20000725112200.Q6lk0ma2gI61WKQA--D4hcSLGRp_XuFfu_EPm5jXrao@z>
