Comment: Re-organised entire article to clarify problem, past approaches, and objectives

Blog Spam

As of this writing, this document consists entirely of excerpts from the initial request for comments for a BlogSpamAssassin project. This document needs to be re-organised, re-written and sub-divided.

Considering the latest press on blog comment spam, it is time that we organize a cross-platform project to address the problem. There are a considerable number of plugins implemented for various blog software with the intent of reducing blog spam but many are ineffective or require a tremendous amount of work to maintain (Jay Allen's mt-blacklist plugin is definitely the latter).

http://news.netcraft.com/archives/2004/12/17/hosts_disable_movable_type_as_comment_spam_slows_servers.html

http://it.slashdot.org/article.pl?sid=04/12/18/1827225&tid=111&tid=128

http://www.sixapart.com/log/2004/12/more_on_comment.shtml

Add a definition of blog spam and discuss the cause and symptoms of the problem.

Weblog spam is completely different from e-mail spam. The objective of the e-mail spammer is for you to read their message and respond quickly. The opposite holds true of the weblog spammer. The spammer needs their comments to remain undetected (or at least undeleted) to boost and maintain the pagerank of the site that they are spamming for.

Practically speaking, it doesn't matter if a spammer is able to create a comment on someone's blog. So long as the comment is deleted or otherwise altered before the search engine indexes the page, the spammer has gained nothing for their trouble. Care must be taken that legitimate links are not disabled: bloggers value their PageRank, and disabling every link reduces the effectiveness of search engines by preventing them from indexing pages that may only be mentioned in weblog comments.

A permanent solution to weblog spam would be one that requires the comment spammer to expend more resources to send the spam than they can gain through spamming while minimising the amount of resources that the weblog owner needs to expend. These resources include time, money and compute cycles.

Users need to feel that they can trust the solution. Two examples from the e-mail domain are blacklists such as the MAPS RBL and whitelists such as IronPort BondedSender. In the case of blacklists, listees complain that the listing is unfair. (See Rule #2.) In the case of BondedSender, some users incorrectly assume that IronPort is publishing a list of organisations who have paid them for the privilege of spamming.

The solution should not interfere with legitimate discussions. Delaying the posting of legitimate comments means that users cannot easily engage in discussions.

To summarize, I think that a permanent solution to weblog spam needs to:

  1. Eliminate the benefits of spamming (boost in pagerank).
  2. Not eliminate the benefits of linking in legitimate comments.
  3. Require minimal maintenance.
  4. Be accountable and trustworthy to the user.
  5. Not disrupt or delay legitimate comments.

I suspect that much of SpamAssassin's PMC is on vacation or otherwise occupied right now. Once people are back and I can get some +1s, I think that a good start for this project would be to use the wiki to critically analyse the various anti-blogspam offerings to identify their strengths and weaknesses.

Useful for wikis too.

Developing a solution

  1. Perform an in-depth analysis of blog spam motivation and methods. Cost/benefit analysis will be the focus of this document.

2. Categorise and analyse the current and possible anti blog-spam approaches. For each approach, the analysis will include a summary of how the approach deals with the cause and symptoms of the blog spam problem, how it affects the end users (blogger, reader), maintenance required by developers and bloggers, how the approach may be improved and a list of projects implementing each approach. The unique strengths and weaknesses of the individual projects should also be covered.

3. Using the knowledge that we have gathered in (1) and (2), select approaches for further study and improvement. Using volunteers from the weblogging community, run usability studies over the course of two or three months. Compare the effectiveness of the approaches against a control group.

4. Using what we have learned from (3), identify how weblog user interfaces need to be improved to cope with the symptoms of the blog spam problem and to best accommodate the approaches that we have determined to be the best. Author a "best practice" document detailing how blog spam should be dealt with.

5. Develop an Apache-licensed reference implementation of our solutions along the lines of spamd (re-using code where appropriate), client-side code and patches/plugins for blog software with the intent that our solutions will be incorporated into the release versions of major blog software.

The role of SpamAssassin and the ASF in this project will be to provide a neutral venue and to contribute expertise in the field of anti-spam. As previously mentioned, some of the SpamAssassin developers (myself included) are also bloggers and have first-hand experience with the problem of blog spam.

Related work and ideas

Mailing List

A mailing list has been created to begin work on this project.

blogspam@spamassassin.apache.org

You can subscribe via:

blogspam-subscribe@spamassassin.apache.org

I've been working on ideas for this on my own in my post "Detecting Comment Spam, Part 3".

Proof of work

http://elliottback.com/wp/archives/2004/11/29/spam-stopgap-extreme/

http://dev.wp-plugins.org/browser/wp-hashcash/trunk/

This is a JavaScript proof-of-work implementation that has been extremely (100%) effective in blocking non-human spam thus far. It is the only technique of this type that has worked for more than about a week. Other modifications, such as adding random fields, asking questions in the comment form, and changing the URI of the comment post script, have been bypassed by the bots within a few days.

Things along this line will not be effective in the long run because TrackBack, a commenting protocol popularized by Six Apart, is designed specifically to work without human involvement. This is an essential feature to many bloggers.

http://www.movabletype.org/trackback/

Pingback is more robust and requires a link back, but can still be spoofed:

http://www.hixie.ch/specs/pingback/pingback

The approach we're taking to that is whitelisting of URIs in the WP-integrated blogroll and moderation of all others. We also don't allow any markup within these comments.
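The whitelist-and-moderate policy described above can be sketched roughly as follows. This is a minimal illustration, not the WordPress implementation: the function names (`links_back`, `strip_markup`, `triage_linkback`), the idea of passing the blogroll as a set of hostnames, and the "publish"/"moderate" labels are all assumptions made for the example. The link-back check operates on HTML that the caller has already fetched.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_back(source_html, target_url):
    """Pingback-style check: does the source page link to our post?

    A passing check is not proof of good faith; a spammer can publish
    a page that links back, which is why the whitelist below still applies.
    """
    collector = _LinkCollector()
    collector.feed(source_html)
    return target_url in collector.hrefs


def strip_markup(text):
    """Naive tag removal (hypothetical helper); real code should use a sanitizer."""
    return re.sub(r"<[^>]*>", "", text)


def triage_linkback(source_url, excerpt, blogroll_hosts):
    """Publish linkbacks from blogroll hosts, hold everything else for moderation."""
    host = (urlsplit(source_url).hostname or "").lower()
    status = "publish" if host in blogroll_hosts else "moderate"
    return status, strip_markup(excerpt)
```

For example, `triage_linkback("http://friend.example/post/1", "Nice <b>post</b>!", {"friend.example"})` publishes the excerpt with its markup removed, while a linkback from any host outside the blogroll is held for moderation.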

WordPress systems

http://wordpress.org/development/2004/12/fight-spam/

http://mookitty.co.uk/devblog/category/kittens-spaminator/

http://www.unknowngenius.com/blog/static/spam-karma

These are the two plugins that combined about a dozen different efforts that were going on. Both have a scoring system that resembles SpamAssassin's in some ways: it uses content characteristics, RBL lookups, user agent characteristics (how long the visitor was on the page beforehand, whether the request is coming through a proxy) and contextual characteristics like the age of the post. Spaminator has a "tar pit" which tries to delay bots, once one has been identified, by inserting random delays before responses. This seems to have annoyed the spammers, because several of the bots now check for the Spaminator files before targeting a weblog. Spam Karma is interesting because if your comment is borderline spam (right on the threshold) you can get it through by filling out an image CAPTCHA or responding to an email confirmation; thus it combines CAPTCHA with an accessible alternative.
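To make the scoring idea concrete, here is a minimal sketch of a weighted rule set with a borderline band that triggers a challenge instead of an outright rejection. It is not code from Spaminator or Spam Karma: the rule names, weights, threshold and margin are invented for illustration, and the RBL result is assumed to have been looked up elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    body: str
    seconds_on_page: float   # time between loading the form and submitting
    via_proxy: bool
    post_age_days: int
    ip_on_rbl: bool          # result of an RBL lookup done by the caller


# (rule name, weight, test) -- all values are hypothetical.
RULES = [
    ("MANY_LINKS", 2.0, lambda c: c.body.lower().count("http://") >= 3),
    ("ON_RBL", 2.5, lambda c: c.ip_on_rbl),
    ("TOO_FAST", 1.5, lambda c: c.seconds_on_page < 5),
    ("VIA_PROXY", 0.5, lambda c: c.via_proxy),
    ("OLD_POST", 1.0, lambda c: c.post_age_days > 30),
]

THRESHOLD = 5.0  # score at which a comment is considered spam
MARGIN = 1.0     # width of the borderline band on either side


def score(comment):
    """Return the total score and the names of the rules that fired."""
    hits = [(name, weight) for name, weight, test in RULES if test(comment)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]


def classify(comment):
    """Accept, reject, or ask a borderline commenter to prove they are human."""
    total, _ = score(comment)
    if total >= THRESHOLD + MARGIN:
        return "reject"
    if total > THRESHOLD - MARGIN:
        return "challenge"  # e.g. image CAPTCHA or e-mail confirmation
    return "accept"
```

The "challenge" outcome corresponds to the borderline case described above, where the commenter can still get through by solving a CAPTCHA or confirming by e-mail.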

Collaborative filters

I've seen some interesting talk of centralized/decentralized systems, which operate much like razor or pyzor except the server is freely available and easy to install as an add-on to WordPress. Submissions can come from trusted sources with keys and then a web of trust can be extended out by utilizing XFN metadata that WordPress supports in its blogrolls.

http://gmpg.org/xfn/

This could be very interesting, as it would be hard to target in a central fashion (there can be hundreds or thousands of "servers") and it doesn't require much manual intervention by the person running the plugin; only the person running the server has to be proactive. It could also scale well. However, the code for this isn't ready for release yet, as it is undergoing a security audit and review.
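Since the code mentioned above has not been released, the following is only a guess at the general shape of such a system, loosely modelled on the digest-and-report idea behind razor and pyzor. The class and function names, the exact-hash digest (real systems use fuzzier signatures) and the "distinct trusted reporters" threshold are all assumptions made for this sketch.

```python
import hashlib
import re


def comment_digest(body):
    """Digest of a comment after collapsing case and whitespace.

    Simplification: an exact SHA-1 of the normalized text. A trivially
    altered spam would produce a different digest.
    """
    normalized = re.sub(r"\s+", " ", body.strip().lower())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DigestServer:
    """In-memory stand-in for one of the many small reporting servers."""

    def __init__(self, trusted_keys, threshold=3):
        self.trusted_keys = set(trusted_keys)
        self.threshold = threshold
        self.reporters = {}  # digest -> set of keys that reported it

    def report(self, key, digest):
        """Record a spam report; reports from unknown keys are ignored."""
        if key not in self.trusted_keys:
            return False
        self.reporters.setdefault(digest, set()).add(key)
        return True

    def is_spam(self, digest):
        """Spam once enough distinct trusted sources have reported the digest."""
        return len(self.reporters.get(digest, ())) >= self.threshold
```

Counting distinct reporters rather than raw reports means a single trusted key cannot push a digest over the threshold by reporting it repeatedly. Extending the set of trusted keys through XFN relationships is left out of the sketch.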

This type of spam is not limited to blogging systems and can easily be expanded to take into account other collaborative portals (e.g., wiki, forums, etc.). In regards to blogs, the main ports of entry for spam are:

The last two of these do not require any human interaction at all and are more automated processes of communication. While Pingbacks require links back to the system being commented on, they too can be spoofed. While many systems have anti-spam measures on the web interface to prevent automated comment spam (e.g., CAPTCHA, arithmetic or logic questions, obfuscated JavaScript code, etc.), the main concern of this article is processing spam that gets beyond the UI.

SpamAssassin Integration

While there are differences between e-mail spam and blog spam, SpamAssassin is a strong candidate as a basis for preventing blog spam. There have already been several attempts to integrate SpamAssassin with a blog (WordPress and Movable Type):

...

...

Content based filtering for blog spam has its uses. Jay's BL is doing a great job and catches a tremendous amount of spam. I got to the point where a few would get through every day so I wrote the SpamAssassin plugin and it has been getting better as time goes on. I ran the two alongside each other for a while but eventually turned off BL.

I had originally written the MT SpamAssassin plugin to send all URLs found in comments to a central database. I was intending to count them there and write new rules similar to what Jay is doing, but I thought that it would be too much work and something that could be done if the plugin became popular, which it hasn't (wink).

Google was the catalyst for all this. Now that most of the internet's search traffic runs through Google, everyone wants PageRank because it means big bucks, and the cheapest way to get PageRank is through links. Perhaps we need to speak to them to see what they have in mind. I doubt they would share it with us, but it would be nice to know where they stand on this issue. I am sure they hate it as much as we do.

...

...

...

...

These plugins basically take the content from a blog, test it with SpamAssassin, and flag it as needing moderation if deemed unsafe.
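As a rough sketch of that flow (not taken from any of the plugins above), a comment can be wrapped in a minimal e-mail message and piped to `spamc -c`, which prints `score/threshold` and exits with status 1 when the message is judged to be spam. The function names, the `X-Comment-URL` header and the fixed subject line are assumptions made for illustration, and running `spamc_check` requires a reachable `spamd`.

```python
import subprocess
from email.message import EmailMessage
from email.utils import formataddr


def comment_to_message(author, email, url, body):
    """Wrap a blog comment in a minimal RFC 822 message for SpamAssassin."""
    msg = EmailMessage()
    msg["From"] = formataddr((author, email))
    msg["Subject"] = "Blog comment"
    msg["X-Comment-URL"] = url  # hypothetical header, not a SpamAssassin convention
    msg.set_content(body)
    return msg.as_bytes()


def spamc_check(raw_message):
    """Run `spamc -c` and return (exit status, stdout text)."""
    result = subprocess.run(["spamc", "-c"], input=raw_message, capture_output=True)
    return result.returncode, result.stdout.decode("ascii", "replace")


def verdict(returncode, output):
    """Translate spamc's exit status and 'score/threshold' line into an action."""
    score, threshold = (float(part) for part in output.strip().split("/"))
    status = "moderate" if returncode == 1 else "publish"
    return status, score, threshold
```

A plugin would call `verdict(*spamc_check(comment_to_message(...)))` and hold the comment when the status is "moderate". Note that `spamc -c` reports `0/0` with exit status 0 on a processing failure, so a stricter policy may want to treat that output as "moderate" as well.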

Miscellaneous Notes

...

http://wordpress.org/support/10/12268

http://www.ioerror.us/2005/01/02/wp-spamassassin/

In addition, KahunaBurger.com (http://www.kahunaburger.com/blog/archives/000189.html) has a MovableType plugin which does something similar, checking comments against a running 'spamd'.

Another effort is Text::SpamAssassin/babycart (see http://www.austinimprov.com/~apthorpe/code/babycart/). This is a wrapper around Mail::SpamAssassin with a specially-tuned user_prefs file which returns SPAM, NOTSPAM, or DUNNO. Proof-of-concept code for WordPress is included.

...

  • Proof-of-work: A legitimate user will take several seconds to minutes
    to create each unique comment, while a comment spammer sends them out as fast as possible. Consider a proof-of-work algorithm executed within the browser (e.g. JavaScript, Java, ActiveX) added to comment submission forms. The weblog software can safely reject all comment submissions that lack valid proof of work. Legitimate users will not be inconvenienced by a short delay as they submit their comment, while spammers will not be able to easily submit comments in large volumes. For example, if a typical comment spammer sends 1,000,000 comments per day and the proof of work requires 2 seconds of compute time, then they need 2,000,000 seconds of computation per day. Since one machine supplies 86,400 seconds per day, they will need to dedicate 24 machines to proof-of-work computation to maintain their rate of transmission. The cons of this method are that users with older browsers or slow computers may not be able to post comments.

...

Note that with the current number of trojanned or zombie machines, we can safely assume that the attacker has infinite computing power, rendering any proof-of-work defense moot.
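For reference, the proof-of-work scheme and the machine-count estimate discussed above can be sketched as follows. This is a hashcash-style illustration written for this page, not the wp-hashcash code: the challenge format, the use of SHA-256 and the function names are assumptions. In a real deployment the `solve` step would run in the commenter's browser and only `verify` would run on the weblog server.

```python
import hashlib
import math
from itertools import count


def leading_zero_bits(digest):
    """Number of zero bits at the start of a hash digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def _hash(challenge, nonce):
    return hashlib.sha256(f"{challenge}:{nonce}".encode("utf-8")).digest()


def solve(challenge, bits):
    """Client side: search for a nonce whose hash has `bits` leading zero bits.

    Expected cost is about 2**bits hash evaluations.
    """
    for nonce in count():
        if leading_zero_bits(_hash(challenge, nonce)) >= bits:
            return nonce


def verify(challenge, nonce, bits):
    """Server side: a single hash evaluation checks the submitted proof."""
    return leading_zero_bits(_hash(challenge, nonce)) >= bits


def machines_needed(comments_per_day, seconds_per_proof):
    """Machines a spammer must dedicate to sustain a given comment rate."""
    return math.ceil(comments_per_day * seconds_per_proof / 86400)
```

With the figures used above, `machines_needed(1_000_000, 2)` gives 24. The asymmetry between `solve` and `verify` is what makes the scheme cheap for the weblog; the objection about zombie machines applies to the spammer's side of that calculation.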

  • Collaborative filtering: IronPort maintains a database of e-mail
    server traffic volumes called SenderBase. Mail servers can use SenderBase to find "traffic spikes" and potentially block e-mail from those servers. Something similar could be done for weblogs. As comments come in, weblogs could report the URLs in the comments to a central server. If a URL is reported too rapidly, it can be added to a list of probable spam URLs, and weblogs can quarantine or delete comments containing that URL.
  • DNS-based URI Blocklists: SpamAssassin has had great success using
    Jeff Chan's Spam URI Realtime Blocklists. When an e-mail arrives, SpamAssassin extracts the URLs contained within and performs a few DNS TXT queries to find whether the URL has been reported in spam. These blocklists can be used for weblogs too. Instead of Jay maintaining a central blocklist that people download and install manually, mt-blacklist could use a DNS-based blocklist that is effectively updated in real time. This would significantly cut down on comment spam because weblog owners would not need to actively maintain their blocklists. The submission process could be streamlined so that it doesn't consume so much of any one person's time.
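A DNS-based URI blocklist lookup for comments might look roughly like the sketch below. Several things here are assumptions: the zone name `multi.surbl.org` should be checked against the blocklist operator's current documentation and usage policy, the lookup uses an ordinary address (A record) query because the Python standard library has no TXT query function, and hostnames are queried as-is rather than being reduced to their registered domain as SpamAssassin does. The function names are invented for the example.

```python
import re
import socket
from urllib.parse import urlsplit

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_hosts(text):
    """Return the distinct hostnames of all URLs in a comment, in order."""
    hosts = []
    for url in URL_PATTERN.findall(text):
        host = urlsplit(url).hostname
        if host:
            hosts.append(host)
    return list(dict.fromkeys(hosts))


def blocklist_query_name(host, zone="multi.surbl.org"):
    """A listing is looked up by resolving '<host>.<zone>'."""
    return f"{host}.{zone}"


def dns_a_lookup(name):
    """Resolve a name to an address, or None if it does not exist."""
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        return None


def is_listed(host, zone="multi.surbl.org", resolver=dns_a_lookup):
    """Listed hosts resolve to an address in 127.0.0.0/8; unlisted ones do not resolve."""
    address = resolver(blocklist_query_name(host, zone))
    return address is not None and address.startswith("127.")


def comment_has_listed_url(text, zone="multi.surbl.org", resolver=dns_a_lookup):
    """True if any URL in the comment points at a blocklisted host."""
    return any(is_listed(host, zone, resolver) for host in extract_hosts(text))
```

The `resolver` parameter exists so the lookup can be stubbed out in tests or replaced with a caching resolver; a weblog plugin would quarantine or score any comment for which `comment_has_listed_url` returns true.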

Other Resources

http://codex.wordpress.org/Combat_Comment_Spam

http://drupal.org/node/14193 - Drupal: Fighting back at Spam

Blog Software

This list is incomplete and in alphabetical order.

PHSDL

Sharing its methodology with BlogSpamAssassin under PHSDL GNU.

...
