
Doable strategies against blog spam?

In the Typo admin area there's a "blacklist" option which claims to compare the user's IP address to "local and remote blacklists". The interface explains this as "Good defense against spam bots."

Also, you're recommended to *not* "allow non-ajax comments", which means that comments will only be accepted when sent via Ajax (which most spammers don't seem to be capable of).

I don't know about the blacklist option yet; I think I'll read up on the code. But I wonder why there's any need for the non-ajax comments option at all if the IP comparison really does its job well.
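
I haven't looked at the actual implementation, but I'd expect the check to boil down to something like this - a minimal sketch in plain Ruby, assuming a simple local blacklist file and one example DNSBL zone (the helper and all names are mine, not Typo's):

    require 'resolv'

    # Hypothetical helper: true if the given IP is on the local list or
    # listed by a remote DNS blacklist (SORBS used here just as an example).
    def blacklisted?(ip, local_list_file = 'blacklist.txt')
      local = File.exist?(local_list_file) ? File.readlines(local_list_file).map { |l| l.strip } : []
      return true if local.include?(ip)

      # DNSBLs are queried by reversing the IP's octets and appending the zone.
      reversed = ip.split('.').reverse.join('.')
      begin
        Resolv.getaddress("#{reversed}.dnsbl.sorbs.net")
        true   # any answer means "listed"
      rescue Resolv::ResolvError
        false  # NXDOMAIN means "not listed"
      end
    end

A comment controller could then reject the POST early whenever blacklisted?(request.remote_ip) is true.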

While I've been thinking about this, someone dropped an announcement for a Rails act_as_classifiable plugin on the RoR mailing list.

Hmmm. I wonder if there's been any attempt to accomplish something like this:

  • Manually confirmed Typo blogs build up a "trusted blogs net" featuring a distributed notification service about obvious spam requests. By obvious I mean things like requests for WordPress files or other files missing from a Typo blog, or comments that have been manually moderated as spam. When there's any such request on any Typo blog, it will be made visible to all other Typo installations by distributing it somehow (either through a central server or some kind of peer-to-peer approach).
  • Each such HTTP request will be analysed (i.e. "learned") as a whole (like Bad Behaviour does it) through a Bayesian filter as "bad/spam", so that each blog contributes, like a single user, to the Bayesian knowledge of the Typo blog net as a whole.
  • This knowledge will be published back to the blogs so that each blog can decide on additional measures - like actually blocking the whole request, even a GET, with a 412 error ("you're not allowed to read my blog") when a request fails a local Bayesian test (see the sketch after this list).
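
To make the Bayesian part a bit more concrete, here's a minimal naive-Bayes sketch in plain Ruby. It treats the whole serialized request (request line plus headers plus body) as one document; everything here - class name, tokenizer, smoothing - is just my assumption, not an existing Typo feature:

    class RequestClassifier
      def initialize
        @counts = { :good => Hash.new(0), :bad => Hash.new(0) }
        @totals = { :good => 0, :bad => 0 }
      end

      # Very crude tokenizer over the raw request text.
      def tokenize(raw_request)
        raw_request.downcase.scan(/[a-z0-9._\/-]+/)
      end

      # Feed a whole request in as :good or :bad.
      def learn(raw_request, label)
        tokenize(raw_request).each do |t|
          @counts[label][t] += 1
          @totals[label]    += 1
        end
      end

      # Naive-Bayes score: probability that the request is spam.
      def spam_probability(raw_request)
        score = { :good => 0.0, :bad => 0.0 }
        tokenize(raw_request).each do |t|
          [:good, :bad].each do |label|
            p = (@counts[label][t] + 1.0) / (@totals[label] + 2.0)  # Laplace smoothing
            score[label] += Math.log(p)
          end
        end
        1.0 / (1.0 + Math.exp(score[:good] - score[:bad]))
      end
    end

A blog could then block the request outright (the 412 mentioned above) once spam_probability crosses some threshold, and ship the learned token counts - or just the verdicts - around the trusted blogs net.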

This way, the following conditions would cause a whole HTTP request to be marked/learned as "bad":

  • it comes from an IP address listed as a spam source by a remote blacklist
  • it has been moderated manually as spam

HTTP requests that come from IP addresses used by clients who occasionally log in will be learned as "good".
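
Put together, the labelling rules above could look roughly like this (reusing the hypothetical blacklisted? helper from further up; every name is made up):

    # Decide how (and whether) to learn a request, following the rules above.
    def label_for(ip, moderated_as_spam, recently_logged_in_ips)
      if moderated_as_spam || blacklisted?(ip)
        :bad
      elsif recently_logged_in_ips.include?(ip)
        :good
      else
        nil  # not enough evidence either way - leave it unlearned
      end
    end

    # Usage with the classifier sketched above:
    #   label = label_for(request_ip, moderated_as_spam, known_good_ips)
    #   classifier.learn(raw_request, label) if label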

This approach could also greatly enhance a comment moderation queue, I think. Depending on the results of the Bayesian analysis, the notification mails could include a marker in the subject line indicating that the system requires human feedback for this comment. Clicking on a given URL in the mail would learn/unlearn the comment as good/bad. Comments that have been recognized as "good" could be published immediately.
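
The learn/unlearn links in the moderation mail would need to be unguessable so nobody else can train the filter; a simple HMAC-signed URL would do. A sketch (the secret, the URL layout and all naming are made up):

    require 'openssl'

    SECRET = 'some-per-blog-secret'  # hypothetical per-installation secret

    def signed_url(base_url, comment_id, verdict)
      token = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('sha1'), SECRET,
                                      "#{comment_id}:#{verdict}")
      "#{base_url}/#{verdict}/#{comment_id}?token=#{token}"
    end

    # Both links go into the notification mail; the subject line could carry
    # a marker like "[needs review]" when the Bayesian score is inconclusive.
    def moderation_links(comment_id, base_url = 'http://example.com/admin/feedback')
      { 'good' => signed_url(base_url, comment_id, 'good'),
        'bad'  => signed_url(base_url, comment_id, 'bad') }
    end

The controller handling the click would recompute the HMAC and only learn/unlearn the comment when the token matches.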

Newly "learned" spam sources could be published to services like Akismet, SURBL or other DNSBLs.
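
For Akismet at least there's a submit-spam call this could feed; roughly like this (check the Akismet API docs for the exact parameter set - the key, blog URL and fields below are placeholders):

    require 'net/http'
    require 'uri'

    # Report a spam source to Akismet's submit-spam endpoint.
    def report_spam_to_akismet(api_key, blog_url, ip, user_agent, content)
      uri = URI.parse("http://#{api_key}.rest.akismet.com/1.1/submit-spam")
      Net::HTTP.post_form(uri, 'blog'            => blog_url,
                               'user_ip'         => ip,
                               'user_agent'      => user_agent,
                               'comment_content' => content)
    end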

Also there could be an additional interface to allow for complaints when your comment has been treated as "bad" by the system.

Another thought:

I don't know enough about how the majority of spam bots work, but I assume they're using some "fire and forget" strategy - just issuing the POST request to several known "candidate" URLs.

If so, this would allow for a very basic additional measure - simply add some kind of "please confirm your comment" page or (in case Ajax is used) alert box. Someone who just "fires and forgets" would fail this handshake.
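
The handshake itself needs almost nothing: the first POST only parks the comment and hands back a one-time token, and the comment is only published when the confirmation page (or the Ajax alert box) sends that token back. A sketch, with all names made up:

    require 'securerandom'

    # In-memory store of comments awaiting confirmation; a real
    # implementation would use the session or the database.
    PENDING = {}

    # Step 1: the initial POST publishes nothing, it just returns a token
    # that the "please confirm your comment" step has to echo back.
    def receive_comment(comment_params)
      token = SecureRandom.hex(16)
      PENDING[token] = comment_params
      token
    end

    # Step 2: only requests completing the handshake get a comment published.
    # A bot that just fires and forgets never sends this second request.
    def confirm_comment(token)
      PENDING.delete(token)  # returns the parked comment, or nil if unknown
    end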

If not, this would on the other hand allow sending the spammer to a honeypot, which could make him wait half an hour or so to complete the request while logging as much information as possible. (I've read about 1&1 having been asked to provide their logs for a lawsuit against a spammer but can't remember the details.)
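
A tarpit-style honeypot is equally simple in principle: log everything you know about the client, then dribble out a response so slowly that the bot's connection stays tied up for half an hour. A rough sketch, assuming you already hold the raw socket:

    require 'logger'

    TARPIT_LOG = Logger.new('honeypot.log')

    def tarpit(socket, client_ip, raw_request)
      # Keep as much evidence as possible before wasting the bot's time.
      TARPIT_LOG.info("tarpitted #{client_ip}: #{raw_request.inspect}")
      socket.write("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
      180.times do           # 180 * 10 seconds = roughly half an hour
        socket.write(' ')
        socket.flush
        sleep 10
      end
    ensure
      socket.close
    end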