public inbox for overseers@sourceware.org
* a distressing trend
@ 2002-12-03 19:33 Christopher Faylor
  2002-12-04  0:49 ` Jason Molenda
  2002-12-04  4:15 ` Frank Ch. Eigler
  0 siblings, 2 replies; 4+ messages in thread
From: Christopher Faylor @ 2002-12-03 19:33 UTC (permalink / raw)
  To: overseers

I don't know if this is a new trend or not but I'm noticing more and
more sites connecting to the system which are just slurping down the
mailing list archives.  I suspect that they are doing this to mine email
addresses since most of the time the site doing the slurping is on the
other end of a cable modem or dsl connection (inferred from the host
name, if I can derive it) rather than from something like deja or
altavista or something similar.

So, I also suspect that our simple email munging is probably not
offering much of a barrier to this new breed of evil spammer.

I've been blocking any site I catch doing this and so far haven't had a
single complaint.  That would be rather odd if anyone were doing this
for a legitimate purpose.

But as much as I enjoy thwarting spammers, I am not scaling too well.  I
have to remember to continually scan the www port connections looking
for the evil ones and sometimes my day job gets in the way.

I'm thinking that we have to start munging the email even more which
really irritates me since it is nice to be able to derive real email
addresses if required.

Does anyone have any idea on how to stop this new escalation in the
war with the spammers?  Maybe we could translate the email address
into pig latin or something...

Or is there an apache plugin out there which would be helpful in
stopping this kind of abuse?

cgf

* Re: a distressing trend
  2002-12-03 19:33 a distressing trend Christopher Faylor
@ 2002-12-04  0:49 ` Jason Molenda
  2002-12-04  7:00   ` Christopher Faylor
  2002-12-04  4:15 ` Frank Ch. Eigler
  1 sibling, 1 reply; 4+ messages in thread
From: Jason Molenda @ 2002-12-04  0:49 UTC (permalink / raw)
  To: overseers

On Tue, Dec 03, 2002 at 10:34:21PM -0500, Christopher Faylor wrote:
> I don't know if this is a new trend or not but I'm noticing more and
> more sites connecting to the system which are just slurping down the
> mailing list archives.  


Oy, lame.  On the up side, I use "-swarelist" on all my mails to
sources.redhat.com lists and "-gcclist" on all gcc.gnu.org lists
and I haven't gotten spam to either addr in quite some time.  I
don't post a lot, I'll admit, but if they're being harvested and
sold, I haven't seen the end results of it.

I set up a bit of stuff on the web server long ago -

# Divert requests from known address-harvesting user agents.
RewriteEngine  on
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon       [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf         [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro      [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT     [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent          [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker      [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit   [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.*  [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO         [OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft          [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster     [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL     [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.Mozilla/2.01 [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
# Any request matching a condition above is rewritten to a static page.
RewriteRule ^.*$ /badspammer.html  [L]

These agent names are probably outdated and not catching the latest
robots -- if you happen to see these coming, we could look at the
"combined_log" which includes the User-Agent and see if there's a
harvester stupid enough to announce itself.
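
The log check described above could be sketched roughly as follows.  This
is only an illustration: it assumes the log uses Apache's standard
"combined" format, the sample lines and addresses are made up, and the
agent prefixes are a handful copied from the RewriteCond list above.

```python
import re
from collections import Counter

# Combined Log Format:
#   host ident user [time] "request" status bytes "referer" "user-agent"
# Assumption: no escaped double quotes inside the quoted fields.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"\s*$'
)

# A few prefixes taken from the RewriteCond list in this message.
HARVESTER_AGENT = re.compile(
    r'^(EmailSiphon|EmailWolf|ExtractorPro|CherryPicker|EmailCollector)'
)


def count_harvester_hits(lines):
    """Return a Counter mapping (host, agent) to matching request counts."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and HARVESTER_AGENT.match(m.group("agent")):
            hits[(m.group("host"), m.group("agent"))] += 1
    return hits


# Hypothetical sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [03/Dec/2002:19:33:00 +0000] '
    '"GET /ml/gcc/2002-12/ HTTP/1.0" 200 5120 "-" "EmailSiphon"',
    '10.0.0.2 - - [03/Dec/2002:19:33:01 +0000] '
    '"GET /index.html HTTP/1.0" 200 1024 "-" "Lynx/2.8.4"',
]

for (host, agent), n in count_harvester_hits(sample).items():
    print(host, agent, n)
```

A harvester that sends a browser-like User-Agent would not show up here,
so this only catches the ones that announce themselves.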

Oh, and these redirects are only done on sources.redhat.com.  It seemed
a little controversial (the potential existed to shunt real users off
to the badspammer.html page).  I looked around with Google a bit and
saw a more extensive list at 
	http://www.internet-tips.net/Webmaster/htaccess_rewrite.htm
But grepping through this week's sourceware-combined_log for half of the
agents listed on the above page, I turned up zero matches.


This site has a page with ideas for obscuring addresses:
	http://www.turnstep.com/Spambot/html.html

We could add some of these; they seem pretty clever.  
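
As an illustration of the kind of obscuring being discussed, here is a
minimal sketch of one widely used trick: writing every character of the
address as an HTML character reference.  Whether this particular trick
is among the ideas on the page above is an assumption, the function name
is hypothetical, and the address is a placeholder.

```python
import html


def entity_encode(address):
    """Encode each character as a decimal HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in address)


encoded = entity_encode("user@example.org")
print(encoded)

# A browser (or html.unescape) turns the references back into plain text,
# so readers still see the real address.
assert html.unescape(encoded) == "user@example.org"
```

This only stops harvesters that match on a literal "@" in the raw HTML;
one that decodes character references first would not be slowed down.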

> I'm thinking that we have to start munging the email even more which
> really irritates me since it is nice to be able to derive real email
> addresses if required.


Yeah, the only ironclad approaches I've seen are (1) drop all e-mail
addresses and just use the real usernames or the loginname sans
hostname, or (2) require people to have a login account and click
through a cgi link to send mail to anyone else.  As a user, I hate
both of these a lot.


> Or is there an apache plugin out there which would be helpful in
> stopping this kind of abuse?

Apache 2 is a big bag of new magical stuff; there might be some
kind of clever way to limit access like that...  


FWIW I think we're better off trying to catch known harvester robots
when possible and trying new ways of obscuring e-mail addresses.
I'd prefer we don't move to eliding e-mail addrs from web archives
altogether, and I don't know if we can come up with some kind of
rate limiter that would keep the harvesters out without hosing real
users.

(god help us if the harvesters ever figure out ftp and bzip2... :-)

J

* Re: a distressing trend
  2002-12-03 19:33 a distressing trend Christopher Faylor
  2002-12-04  0:49 ` Jason Molenda
@ 2002-12-04  4:15 ` Frank Ch. Eigler
  1 sibling, 0 replies; 4+ messages in thread
From: Frank Ch. Eigler @ 2002-12-04  4:15 UTC (permalink / raw)
  To: overseers

Hi -

cgf wrote:

> I don't know if this is a new trend or not but I'm noticing more and
> more sites connecting to the system which are just slurping down the
> mailing list archives.  [...]
> Or is there an apache plugin out there which would be helpful in
> stopping this kind of abuse?

To the extent that the problem is that this activity results in
unnecessary load on the sourceware machines, we could definitely
consider apache modules that throttle naughty users.  At home, I
use mod_throttle to limit average throughput to certain local
archive-like URL locations.  These modules can be tweaked to
limit per-IP-address throughput too, though configuration
combinations are often limited.
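
The per-IP-address throughput limit mentioned above can be pictured as a
token bucket kept for each client.  The sketch below is not how
mod_throttle is implemented or configured; the class names, rate and
capacity are all hypothetical, and the clock is passed in explicitly so
the behaviour is easy to follow.

```python
class TokenBucket:
    """Allow up to `capacity` bytes in a burst, refilled at `rate` bytes/s."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = None

    def allow(self, nbytes, now):
        # Refill according to the time elapsed since the previous call.
        if self.last is not None:
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


class PerClientThrottle:
    """Keep one independent bucket per client IP address."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, ip, nbytes, now):
        bucket = self.buckets.setdefault(
            ip, TokenBucket(self.rate, self.capacity)
        )
        return bucket.allow(nbytes, now)


throttle = PerClientThrottle(rate=100, capacity=1000)
print(throttle.allow("192.0.2.7", 800, now=0.0))   # within the initial burst
print(throttle.allow("192.0.2.7", 800, now=1.0))   # bucket not yet refilled
print(throttle.allow("198.51.100.3", 800, now=1.0))  # separate client
```

In real use `now` would come from something like time.monotonic(), and
a refused request could be delayed rather than rejected.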

Doing much more to hide email addresses is IMO a losing battle.

- FChE

* Re: a distressing trend
  2002-12-04  0:49 ` Jason Molenda
@ 2002-12-04  7:00   ` Christopher Faylor
  0 siblings, 0 replies; 4+ messages in thread
From: Christopher Faylor @ 2002-12-04  7:00 UTC (permalink / raw)
  To: Jason Molenda; +Cc: overseers

On Wed, Dec 04, 2002 at 12:49:57AM -0800, Jason Molenda wrote:
>On Tue, Dec 03, 2002 at 10:34:21PM -0500, Christopher Faylor wrote:
>> I don't know if this is a new trend or not but I'm noticing more and
>> more sites connecting to the system which are just slurping down the
>> mailing list archives.  
>
>
>Oy, lame.  On the up side, I use "-swarelist" on all my mails to
>sources.redhat.com lists and "-gcclist" on all gcc.gnu.org lists
>and I haven't gotten spam to either addr in quite some time.  I
>don't post a lot, I'll admit, but if they're being harvested and
>sold, I haven't seen the end results of it.
>
>I set up a bit of stuff on the web server long ago -
>
>RewriteEngine  on
>RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon       [OR]
>RewriteCond %{HTTP_USER_AGENT} ^EmailWolf         [OR]
>RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro      [OR]
>RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT     [OR]
>RewriteCond %{HTTP_USER_AGENT} ^Crescent          [OR]
>RewriteCond %{HTTP_USER_AGENT} ^CherryPicker      [OR]
>RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit   [OR]
>RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.*  [OR]
>RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO         [OR]
>RewriteCond %{HTTP_USER_AGENT} ^Telesoft          [OR]
>RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster     [OR]
>RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL     [OR]
>RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.Mozilla/2.01 [OR]
>RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
>RewriteRule ^.*$ /badspammer.html  [L]
>
>These agent names are probably outdated and not catching the latest
>robots -- if you happen to see these coming, we could look at the
>"combined_log" which includes the User-Agent and see if there's a
>harvester stupid enough to announce itself.

I've been adding bad agents that I've noticed to robots.txt but there are
at least three robots.txt files to keep up to date.  Changing httpd.conf
is a nicer solution.  I'll do that from now on.

>	http://www.internet-tips.net/Webmaster/htaccess_rewrite.htm
>But grepping through this week's sourceware-combined_log for half of the
>agents listed on the above page, I turned up zero matches.

Hmm.  Is there some other reason why someone would be slurping down all
of the archives?  I can't think of any.  Maybe I should ask in some selected
mailing lists.

>This site has a page with ideas for obscuring addresses:
>	http://www.turnstep.com/Spambot/html.html
>
>We could add some of these; they seem pretty clever.  

I was thinking about using the gif method.  That would rule lynx users
out though.

>> I'm thinking that we have to start munging the email even more which
>> really irritates me since it is nice to be able to derive real email
>> addresses if required.
>
>
>Yeah, the only ironclad approaches I've seen are (1) drop all e-mail
>addresses and just use the real usernames or the loginname sans
>hostname, or (2) require people to have a login account and click
>through a cgi link to send mail to anyone else.  As a user, I hate
>both of these a lot.

Me too.  I was perusing the mailing list archives in another project and
came across an article from someone who wanted a private response but
his email address was munged to something like: bob@em...  So, I
couldn't easily find out who he was.  And the spammers had inconvenienced
me yet again.  I wouldn't want sourceware to suffer a similar fate.

>> Or is there an apache plugin out there which would be helpful in
>> stopping this kind of abuse?
>
>apache 2 is a big bag of new magical stuff, there might be some
>kind of clever way to limit access like that...  
>
>FWIW I think we're better off trying to catch known harvester robots
>when possible and trying new ways of obscuring e-mail addresses.
>I'd prefer we don't move to eliding e-mail addrs from web archives
>altogether, and I don't know if we can come up with some kind of
>rate limiter that would keep the harvesters out without hosing real
>users.

I guess we should add some kind of trigger that says "Hey, this site
has 100 connections to httpd" so that we can find this kind of thing
quickly.  It seems like, invariably, whenever I see a site with 20+
connections, they are slurping down the mailing list archives.
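
A trigger like that could be sketched as below.  It assumes the remote
endpoints of the current httpd connections have already been collected
as "ip:port" strings (for example from netstat output, whose exact
format varies by system); the function name, the threshold and the
addresses are illustrative only.

```python
from collections import Counter


def busy_clients(remote_endpoints, threshold=20):
    """Return {ip: connection_count} for clients at or above the threshold."""
    counts = Counter(
        endpoint.rsplit(":", 1)[0] for endpoint in remote_endpoints
    )
    return {ip: n for ip, n in counts.items() if n >= threshold}


# Hypothetical snapshot: one client holding 25 connections, one holding 1.
snapshot = [f"192.0.2.7:{40000 + i}" for i in range(25)]
snapshot.append("198.51.100.3:51000")

for ip, n in busy_clients(snapshot).items():
    print(f"{ip} has {n} connections to httpd")
```

Run periodically, something like this could mail a warning instead of
relying on someone remembering to look.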

>(god help us if the harvesters ever figure out ftp and bzip2... :-)

Yep.  Sigh.

cgf
