From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hans-Peter Nilsson
To: overseers@sources.redhat.com
Subject: Re: Source(s|ware) ht://Dig indexing does not index ml:s correctly
Date: Sat, 30 Dec 2000 06:08:00 -0000
Message-id:
References:
X-SW-Source: 2000/msg00847.html

On Tue, 11 Jul 2000, Hans-Peter Nilsson wrote:

(Talking to myself here; I didn't get any comments or requests for
clarification.)

> Many projects (ehrm, at least newlib) do not "get hits" for the posts
> after a few new ones on the latest (month, year, quarter).
>
> This is a result of having pages marked meta noindex,follow and only
> pointing to the site URL when updating. The update will only process
> pages that have changed. If such a non-changed page points to a page
> marked with meta noindex,follow, (like the mailing list index for a
> time-period), new messages will not be indexed (or only be indexed if they
> are pointed to from an updated page elsewhere).

I think this is really a misfeature of ht://Dig: when doing the original
from-scratch indexing, it should save for updates (not throw away) the
URLs that were found when a meta-tag "noindex,follow" stopped all
processing other than adding its links to the indexing.

> The obvious hack is to remove the noindex,follow mark everywhere,
> but a better solution is to add a list of such (topmost) noindex,follow
> urls to start_url. The trick (if there is one) is to form such a list
> without assuming anything static, like what the current mailing lists are.
> Or at least to do it with *enough* room for things to work seamlessly
> without lots of fiddling when things change or projects are added.

I see three solutions for handling the lists of pages with
noindex,follow.

1: Hack ht://Dig to generate the list by itself. Creating the list in an
   external file is somewhat simple; using the existing DB to keep track
   of it will be a bit harder.
   Pros:
   - Almost everything will work as it stands configury-wise, with only
     (say) an extra option at index time, and a change like the start_url
     patch in my previous message.
   Cons:
   - I have to go hack ht://Dig; it feels like it will take longer than
     the other two options.
   - A similar solution needs to go in future ht://Dig releases, or
     sources.redhat.com/gcc.gnu.org will have to keep track of local
     ht://Dig patches.

2: Do it in (a script called by) the htupdate-sourceware.sh and
   htupdate.sh (for gcc) scripts, using configury from ht://Dig and
   find+grep+sed+sh constructs.
   Pros:
   - Changes are local to the ht://Dig configuration.
   - Will handle occurrences of noindex,follow generally; mailing lists
     as well as other places (F-O-M?).
   - I've already started along this route (right, that's not a good
     reason).
   Cons:
   - Some hundred lines in htupdate-sourceware.sh; perhaps hard to follow.
   - A "find" will traverse the web-directories every update.

3: Do it in monthly-updates; appending indexes for new months to a file.
   Pros:
   - Seems like the smallest change (tens of lines?).
   Cons:
   - Will only handle the problem for mailing lists: future additions of
     noindex,follow tags in other places will fail silently (as they do
     now).
   - Unexpected dependency between the ht://Dig configury and the mailing
     list archive management.
   - Will have to add ht://Dig excludes for otherwise non-indexed pages
     like "overseers" anyway, as with #2.

I'll pick #2 for now: I don't like #3 and I think #1 will take more time
than I have right now. If you have another opinion, please scream within
48 hours (as I'll be gone for a week after that) or revert the patches
I'll copy here when I'm done.
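
To make #2 a bit more concrete, here is a rough sketch of the idea,
written in Python rather than the find+grep+sed+sh constructs mentioned
above. It walks a web directory, picks out pages carrying a
noindex,follow meta tag, and turns them into URLs that could be appended
to start_url. The function name, the exact tag spelling
(<meta name="robots" content="noindex,follow">), the web root, the base
URL and the exclude prefixes are all assumptions for illustration, not
what the real scripts do.

```python
import os
import re

# Assumed tag shape: <meta name="robots" content="noindex,follow">.
# Attribute order and spacing on the real pages may differ.
META_RE = re.compile(
    r'<meta\s+name=["\']robots["\']\s+'
    r'content=["\']\s*noindex\s*,\s*follow\s*["\']',
    re.IGNORECASE,
)


def find_noindex_follow_urls(web_root, base_url, exclude=()):
    """Return sorted URLs of HTML files under web_root marked noindex,follow.

    exclude holds hypothetical path prefixes (relative to web_root) to
    skip, for archives that should stay out of the index altogether.
    """
    urls = []
    for dirpath, _dirnames, filenames in os.walk(web_root):
        for name in filenames:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, web_root).replace(os.sep, "/")
            if any(rel.startswith(prefix) for prefix in exclude):
                continue
            # Meta tags sit in the head, so the first few KB are enough.
            with open(path, "r", errors="replace") as f:
                head = f.read(4096)
            if META_RE.search(head):
                urls.append(base_url.rstrip("/") + "/" + rel)
    return sorted(urls)
```

The returned list could be joined with spaces and added to start_url
before each update run. Like the "find" in #2, this traverses the whole
web directory on every update.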

brgds, H-P