From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hans-Peter Nilsson
To: overseers@sources.redhat.com
Subject: Source(s|ware) ht://Dig indexing does not index ml:s correctly
Date: Sat, 30 Dec 2000 06:08:00 -0000
Message-id:
X-SW-Source: 2000/msg00761.html

Searching for a gem hidden under stone on stone gets buried in pile

Many projects (ehrm, at least newlib) do not "get hits" for the posts
that follow the first few new ones in the latest period (month, year,
quarter).

This is a result of having pages marked meta noindex,follow and only
pointing to the site URL when updating.  The update will only process
pages that have changed.  If such a non-changed page points to a page
marked with meta noindex,follow (like the mailing list index for a
time period), new messages will not be indexed (or will only be
indexed if they are pointed to from an updated page elsewhere).

The obvious hack is to remove the noindex,follow mark everywhere, but
a better solution is to add a list of such (topmost) noindex,follow
URLs to start_url.  The trick (if there is one) is to form such a list
without assuming anything static, like what the current mailing lists
are.  Or at least to do it with *enough* room for things to work
seamlessly without lots of fiddling when things change or projects are
added.  Suggestions welcome.

The same problem exists with gcc.gnu.org, and someone complained about
missing expected hits.  I haven't answered him yet, maybe because I'm
arrogant or something, or thought I should just fix it first. :-P

Patch for visualization purposes only.

Index: sourceware.conf
===================================================================
RCS file: /cvs/sourceware/infra/htdig-conf/sourceware.conf,v
retrieving revision 1.8
diff -p -c -r1.8 sourceware.conf
*** sourceware.conf 2000/07/11 10:23:58 1.8
--- sourceware.conf 2000/07/11 13:51:23
*************** database_dir: /sourceware/htdig/sourcew
*** 19,25 ****
  # You could also index all the URLs in a file like so:
  # start_url: `${common_dir}/start.url`
  #
! start_url: http://sources.redhat.com/

  # The old hostname (left side) is here changed to the canonical hostname
  # (right side), to avoid a loop of redirects.
--- 19,25 ----
  # You could also index all the URLs in a file like so:
  # start_url: `${common_dir}/start.url`
  #
! start_url: http://sources.redhat.com/ `/path/to/list_of_noindex_follow_urls`

  # The old hostname (left side) is here changed to the canonical hostname
  # (right side), to avoid a loop of redirects.

Sorry for not fixing it yet, but I thought I should at least report
the current badness.

brgds, H-P
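
One way the list for start_url might be formed without hard-coding the
current mailing lists is to scan the served document tree for pages that
carry the robots meta tag. The following is only a rough sketch, not a
tested part of the sourceware setup. The names RobotsMetaParser,
is_noindex_follow and collect_start_urls are made up for illustration,
as is the assumption that files under a local document root map directly
onto URLs under one base URL. It also lists every noindex,follow page it
finds rather than only the topmost ones.

```python
import os
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for word in attr.get("content", "").split(","):
                self.directives.add(word.strip().lower())


def is_noindex_follow(html_text):
    """True if the page is marked with both noindex and follow."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return {"noindex", "follow"} <= parser.directives


def collect_start_urls(doc_root, base_url):
    """Walk doc_root and return sorted URLs of all noindex,follow pages.

    Assumes (for illustration only) that a file at doc_root/a/b.html is
    served as base_url/a/b.html.
    """
    urls = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                if not is_noindex_follow(fh.read()):
                    continue
            rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
            urls.append(base_url.rstrip("/") + "/" + rel)
    return sorted(urls)
```

The returned URLs could then be written, one per line, to the file named
in backticks on the start_url line before each update run, so that newly
added projects or lists are picked up without touching the config.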