public inbox for overseers@sourceware.org
* Source(s|ware) ht://Dig indexing does not index ml:s correctly
  2000-12-30  6:08 Source(s|ware) ht://Dig indexing does not index ml:s correctly Hans-Peter Nilsson
@ 2000-07-11  7:23 ` Hans-Peter Nilsson
  2000-12-30  6:08 ` Hans-Peter Nilsson
  1 sibling, 0 replies; 6+ messages in thread
From: Hans-Peter Nilsson @ 2000-07-11  7:23 UTC
  To: overseers

Searching for a gem
hidden under stone on stone
gets buried in pile

For many projects (ehrm, at least newlib), searches do not "get hits" for
posts beyond the first few in the latest archive period (month, quarter,
year).

This is a result of having pages marked meta noindex,follow and of
start_url pointing only to the site URL when updating.  The update will
only process pages that have changed.  If an unchanged page points to a
page marked with meta noindex,follow (like the mailing list index for a
time period), new messages will not be indexed (or will only be indexed if
they are pointed to from an updated page elsewhere).
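
To make the failure mode concrete, here is a toy Python model of an
incremental update as described above.  It is not ht://Dig code: the Page
class, update_run, and the example URLs are made up for illustration, and
the only rule it encodes is that an already-indexed, unchanged page is
skipped without following its links.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    links: list = field(default_factory=list)
    noindex: bool = False  # page carries meta robots "noindex,follow"
    changed: bool = False  # page was modified since the previous run


def update_run(pages, start_urls, indexed):
    """Return the set of indexed URLs after one update run (toy model)."""
    indexed = set(indexed)
    queue, seen = list(start_urls), set()
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        page = pages[url]
        # Already indexed and unchanged: skipped, so its links are not followed.
        if url in indexed and not page.changed:
            continue
        # A noindex,follow page is never stored, but its links are followed.
        if not page.noindex:
            indexed.add(url)
        queue.extend(page.links)
    return indexed


# Hypothetical URLs, for illustration only.
ROOT = "http://sources.redhat.com/"
PERIOD = ROOT + "ml/newlib/2000-q3/"
OLD_MSG = PERIOD + "msg00001.html"
NEW_MSG = PERIOD + "msg00002.html"

pages = {
    ROOT: Page(links=[PERIOD]),
    PERIOD: Page(links=[OLD_MSG, NEW_MSG], noindex=True, changed=True),
    OLD_MSG: Page(),
    NEW_MSG: Page(),
}
before = {ROOT, OLD_MSG}  # state left behind by the from-scratch run

without_list = update_run(pages, [ROOT], before)
with_list = update_run(pages, [ROOT, PERIOD], before)
```

In this model the new message is reached only when the period index
itself is listed among the start URLs, since the unchanged site root is
skipped before its links are looked at.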

The obvious hack is to remove the noindex,follow mark everywhere,
but a better solution is to add a list of such (topmost) noindex,follow
URLs to start_url.  The trick (if there is one) is to form such a list
without assuming anything static, like what the current mailing lists are.
Or at least to do it with *enough* room for things to work seamlessly
without lots of fiddling when things change or projects are added.

Suggestions welcome.

The same problem exists with gcc.gnu.org, and someone complained about
missing expected hits.  I haven't answered him yet, maybe because I'm
arrogant or something, or thought I should just fix it first.  :-P

Patch for visualization purposes only.

Index: sourceware.conf
===================================================================
RCS file: /cvs/sourceware/infra/htdig-conf/sourceware.conf,v
retrieving revision 1.8
diff -p -c -r1.8 sourceware.conf
*** sourceware.conf	2000/07/11 10:23:58	1.8
--- sourceware.conf	2000/07/11 13:51:23
*************** database_dir:		/sourceware/htdig/sourcew
*** 19,25 ****
  # You could also index all the URLs in a file like so:
  # start_url:	       `${common_dir}/start.url`
  #
! start_url:		http://sources.redhat.com/
  
  # The old hostname (left side) is here changed to the canonical hostname
  # (right side), to avoid a loop of redirects.
--- 19,25 ----
  # You could also index all the URLs in a file like so:
  # start_url:	       `${common_dir}/start.url`
  #
! start_url:		http://sources.redhat.com/ `/path/to/list_of_noindex_follow_urls`
  
  # The old hostname (left side) is here changed to the canonical hostname
  # (right side), to avoid a loop of redirects.

Sorry for not fixing it yet, but I thought I should at least report the
current badness.

brgds, H-P

* Re: Source(s|ware) ht://Dig indexing does not index ml:s correctly
  2000-12-30  6:08 ` Hans-Peter Nilsson
@ 2000-07-25 11:22   ` Hans-Peter Nilsson
  2000-12-30  6:08   ` Gerald Pfeifer
  1 sibling, 0 replies; 6+ messages in thread
From: Hans-Peter Nilsson @ 2000-07-25 11:22 UTC
  To: overseers

On Tue, 11 Jul 2000, Hans-Peter Nilsson wrote:

(Talking to myself here; I didn't get any comments or requests for
clarification.)

> Many projects (ehrm, at least newlib) do not "get hits" for the posts
> after a few new ones on the latest (month, year, quarter).
> 
> This is a result of having pages marked meta noindex,follow and only
> pointing to the site URL when updating.  The update will only process
> pages that have changed.  If such a non-changed page points to a page
> marked with meta noindex,follow, (like the mailing list index for a
> time-period), new messages will not be indexed (or only be indexed if they
> are pointed to from an updated page elsewhere).

I think this is really a misfeature of ht://Dig: during the original
from-scratch indexing, when a "noindex,follow" meta tag stops all
processing of a page other than adding its links to the indexing, it
should save that page's URL for later updates rather than throw it away.

> The obvious hack is to remove the noindex,follow mark everywhere,
> but a better solution is to add a list of such (topmost) noindex,follow
> urls to start_url.  The trick (if there is one) is to form such a list
> without assuming anything static, like what the current mailing lists are.
> Or at least to do it with *enough* room for things to work seamlessly
> without lots of fiddling when things change or projects are added.

I see three solutions for handling the lists of pages with noindex,follow.

 1:
Hack ht://Dig to generate the list by itself.  Creating the list in an
external file is somewhat simple; using the existing DB to keep track of
it will be a bit harder.

 Pros:
- Almost everything will work as it stands configury-wise, with only (say)
an extra option at index time, and a change as the start_url patch in my
previous message.

 Cons:
- I have to go hack ht://Dig, which feels like it will take longer than
the other two options.
- A similar solution needs to go in future ht://Dig releases, or
sources.redhat.com/gcc.gnu.org will have to keep track of local ht://Dig
patches.


 2:
Do it in (a script called by) the htupdate-sourceware.sh and htupdate.sh
(for gcc) scripts, using configury from ht://Dig and find+grep+sed+sh
constructs.

 Pros:
- Changes are local to the ht://Dig configuration.
- Will handle occurrences of noindex,follow generally; mailing lists as
well as other places (F-O-M?).
- I've already started along this route (right, that's not a good reason).

 Cons:
- Some hundred lines in htupdate-sourceware.sh; perhaps hard to follow.
- A "find" will traverse the web-directories every update.


 3:
Do it in monthly-updates; appending indexes for new months to a file.

 Pros:
- Seems like the smallest change (tens of lines?).

 Cons:
- Will only handle the problem for mailing lists: future additions of
noindex,follow tags in other places will fail silently (as they do now).
- Unexpected dependency between the ht://Dig configury and the mailing
list archive management.
- Will have to add ht://Dig excludes for otherwise non-indexed pages
like "overseers" anyway, as with #2.
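
For comparison, option 3 could be about as small as the following Python
sketch, called once per new archive period.  The function name, the list
file, and the idea of one URL per line are assumptions for illustration;
the actual monthly-updates script is not shown in this thread.

```python
def append_index_url(list_path, url):
    """Append url to the list file unless already present; return True if added."""
    try:
        with open(list_path, encoding="ascii") as f:
            known = {line.strip() for line in f}
    except FileNotFoundError:
        known = set()  # first run: the list file does not exist yet
    if url in known:
        return False
    with open(list_path, "a", encoding="ascii") as f:
        f.write(url + "\n")
    return True
```

Calling it twice with the same URL leaves a single entry in the file, so
rerunning the archive job would not grow the list.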


I'll pick #2 for now: I don't like #3 and I think #1 will take more time
than I have right now.
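
A rough Python sketch of what option 2 amounts to, standing in for the
find+grep+sed+sh constructs: walk the web directories, pick out HTML files
whose head carries a robots meta tag with noindex,follow, and map each
file to its URL.  The function name, the regular expression (which only
handles name= before content=), and the path-to-URL mapping are
assumptions; pruning to the topmost pages and excluding lists like
"overseers" are left out.

```python
import os
import re

# Matches e.g. <META NAME="robots" CONTENT="noindex,follow">, any letter case.
ROBOTS_RE = re.compile(
    r'<meta\s+name=["\']?robots["\']?\s+content=["\']?\s*noindex\s*,\s*follow',
    re.IGNORECASE,
)


def noindex_follow_urls(web_root, base_url):
    """Return URLs of HTML files under web_root marked noindex,follow."""
    urls = []
    for dirpath, dirnames, filenames in os.walk(web_root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            # latin-1 never fails to decode; only the start of the file is read.
            with open(path, encoding="latin-1") as f:
                head = f.read(4096)
            if ROBOTS_RE.search(head):
                rel = os.path.relpath(path, web_root).replace(os.sep, "/")
                urls.append(base_url.rstrip("/") + "/" + rel)
    return urls
```

The result could be written to the file named in backquotes in start_url,
assuming ht://Dig reads that file as a whitespace-separated list of URLs.
Like the "find" it replaces, this traverses the web directories on every
update.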

If you have another opinion, please scream within 48 hours (as I'll be
gone for a week after that) or revert the patches I'll copy here when I'm
done.

brgds, H-P

* Re: Source(s|ware) ht://Dig indexing does not index ml:s correctly
  2000-12-30  6:08   ` Gerald Pfeifer
@ 2000-07-25 13:32     ` Gerald Pfeifer
  0 siblings, 0 replies; 6+ messages in thread
From: Gerald Pfeifer @ 2000-07-25 13:32 UTC
  To: Hans-Peter Nilsson; +Cc: overseers

On Tue, 25 Jul 2000, Hans-Peter Nilsson wrote:
> I'll pick #2 for now: I don't like #3 and I think #1 will take more time
> than I have right now.

I think that #1 with those or similar changes getting included into
official htDig sources would be the best solution, but probably a bit
involved, so #2 seems like the way to go now.

But, of course, you are the one doing all of the work, so it's up to
you to decide anyway! :-)

Gerald
-- 
Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/
Have a look at http://petition.eurolinux.org -- it's not about Linux, BTW!

end of thread, other threads:[~2000-12-30  6:08 UTC | newest]

Thread overview: 6+ messages
2000-12-30  6:08 Source(s|ware) ht://Dig indexing does not index ml:s correctly Hans-Peter Nilsson
2000-07-11  7:23 ` Hans-Peter Nilsson
2000-12-30  6:08 ` Hans-Peter Nilsson
2000-07-25 11:22   ` Hans-Peter Nilsson
2000-12-30  6:08   ` Gerald Pfeifer
2000-07-25 13:32     ` Gerald Pfeifer
