...

Users need to feel that they can trust the solution. Two examples from the e-mail domain are blacklists such as the MAPS RBL and whitelists such as IronPort BondedSender. In the case of blacklists, listees complain that the listing is unfair. (See Rule #2.) In the case of BondedSender, some users incorrectly assume that IronPort is publishing a list of organisations who have paid them for the privilege of spamming.

...

To summarize, I think that a permanent solution to weblog spam needs to:

  1. Eliminate the benefits of spamming (the boost in PageRank).
  2. Not eliminate the benefits of linking in legitimate comments.
  3. Require minimal maintenance.
  4. Be accountable and trustworthy to the user.
  5. Not disrupt or delay legitimate comments.

I suspect that much of SpamAssassin's PMC is on vacation or otherwise occupied right now. Once people are back and I can get some +1s, I think that a good start for this project would be to use the wiki to critically analyse the various anti-blogspam offerings to identify their strengths and weaknesses.

Useful for wikis too.

Developing a solution

...

  1. Perform an in-depth analysis of blog spam motivation and methods. Cost/benefit analysis will be the focus of this document.

  2. Categorise and analyse the current and possible anti-blog-spam approaches. For each approach, the analysis will include a summary of how the approach deals with the cause and symptoms of the blog spam problem, how it affects the end users (blogger, reader), the maintenance required of developers and bloggers, how the approach may be improved, and a list of projects implementing it. The unique strengths and weaknesses of the individual projects should also be covered.

...

blogspam-subscribe@spamassassin.apache.org

I've been working on ideas for this on my own in my Detecting Comment Spam, Part 3 post.

Proof of work

http://elliottback.com/wp/archives/2004/11/29/spam-stopgap-extreme/
http://dev.wp-plugins.org/browser/wp-hashcash/trunk/

...

Things along this line will not be effective in the long run because there is a commenting protocol popularized by Six Apart designed specifically for no human involvement: TrackBack. This is an essential feature for many bloggers.

...

I've seen some interesting talk of centralized/decentralized systems, which operate much like Razor or Pyzor, except that the server is freely available and easy to install as an add-on to WordPress. Submissions can come from trusted sources with keys, and then a web of trust can be extended out by utilizing the XFN metadata that WordPress supports in its blogrolls.
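As a rough illustration of how such a web of trust might be seeded, the sketch below harvests XFN-tagged links from a blogroll page. This is hypothetical code: the trusted_links helper and the set of XFN values treated as trustworthy are my own, and a real implementation would use HTML::Parser rather than a regex.

    #!/usr/bin/perl -w
    # Sketch: harvest XFN-tagged links from a blogroll page.  Links whose
    # rel attribute carries an XFN friendship value become candidate
    # trust edges for the collaborative filter.
    use strict;

    my %xfn_trust = map { ($_ => 1) } qw(friend acquaintance colleague met);

    sub trusted_links {
        my ($html) = @_;
        my @trusted;
        while ($html =~ /<a\s+([^>]+)>/gi) {   # naive tag scan
            my $attrs  = $1;
            my ($href) = $attrs =~ /href="([^"]+)"/i;
            my ($rel)  = $attrs =~ /rel="([^"]+)"/i;
            next unless defined $href && defined $rel;
            push @trusted, $href
                if grep { $xfn_trust{lc $_} } split ' ', $rel;
        }
        return @trusted;
    }

    my $blogroll = '<a href="http://example.org/blog" rel="friend met">Jo</a>';
    print "$_\n" for trusted_links($blogroll);   # prints http://example.org/blog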

...

Content-based filtering for blog spam has its uses. Jay's BL is doing a great job and catches a tremendous amount of spam. I got to the point where a few would get through every day, so I wrote the SpamAssassin plugin, and it has been getting better as time goes on. I ran the two alongside each other for a while but eventually turned off BL.

I had originally written the MT SpamAssassin plugin to send all URLs found in comments to a central database. I was intending to count them there and write new rules similar to what Jay is doing, but I thought that it would be too much work and something that could be done if it became popular, which it hasn't (wink)

Google was the catalyst for all this. Now that most of the internet's search traffic runs through Google, everyone wants PageRank because it means big bucks, and the cheapest way to get PageRank is through links. Perhaps we need to speak to them to see what they have in mind. I doubt they would share it with us, but it would be nice to know where they stand on this issue. I am sure they hate it as much as we do.

Google has suggested a way to mark links to be ignored for PageRank-type purposes: adding the rel="nofollow" attribute to a link tells search engines to ignore it. See http://www.google.com/googleblog/2005/01/preventing-comment-spam.html for more information. Spamming the wiki/blog is thus pretty pointless.
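As a minimal sketch of how weblog software could apply this, the snippet below rewrites the anchors in a comment body before it is published. The nofollow helper is illustrative only and assumes simple <a> tags with no existing rel attribute; production code should use a real HTML parser.

    #!/usr/bin/perl -w
    # Sketch: add rel="nofollow" to every anchor in a comment body so
    # that search engines do not count the links toward PageRank.
    use strict;

    sub nofollow {
        my ($html) = @_;
        $html =~ s/<a\s+/<a rel="nofollow" /gi;   # naive, see caveat above
        return $html;
    }

    print nofollow('Nice post! <a href="http://spam.example/">pills</a>'), "\n";
    # => Nice post! <a rel="nofollow" href="http://spam.example/">pills</a>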

...

In addition, KahunaBurger.com (http://www.kahunaburger.com/blog/archives/000189.html) has a MovableType plugin which does something similar, checking comments against a running 'spamd'.

Another effort is Text::SpamAssassin/babycart (see http://www.austinimprov.com/~apthorpe/code/babycart/). This is a wrapper around Mail::SpamAssassin with a specially-tuned user_prefs file which returns SPAM, NOTSPAM, or DUNNO. Proof-of-concept code for WordPress is included.
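To illustrate the shape of these wrappers, here is a rough sketch that wraps a comment in a minimal mail-like message and hands it to a running spamd over its network protocol. The comment_is_spam helper and the message framing are invented for this example, and the SPAMC protocol details are an assumption; verify them against the spamd documentation before relying on this.

    #!/usr/bin/perl -w
    # Sketch: ask a local spamd whether a blog comment looks like spam.
    use strict;
    use IO::Socket::INET;

    sub comment_is_spam {
        my ($author, $url, $body) = @_;
        # Wrap the comment so spamd's mail-oriented rules can apply.
        my $msg  = "From: $author\nSubject: blog comment\n\n$body\n$url\n";
        my $sock = IO::Socket::INET->new(PeerAddr => 'localhost',
                                         PeerPort => 783,
                                         Proto    => 'tcp')
            or die "cannot reach spamd: $!";
        print $sock "CHECK SPAMC/1.2\r\n",
                    "Content-length: ", length($msg), "\r\n\r\n", $msg;
        my @reply = <$sock>;          # reply contains e.g. "Spam: True ; 7.5 / 5.0"
        close $sock;
        return grep { /^Spam:\s*True/i } @reply;
    }

    print comment_is_spam('nobody@example.com', 'http://pills.example/',
                          'cheap meds') ? "spam\n" : "ham\n";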

Ideas

  • Proof-of-work: A legitimate user will take several seconds to minutes to create each unique comment, while a comment spammer sends them out as fast as possible. Consider a proof-of-work algorithm executed within the browser (e.g. JavaScript, Java, ActiveX) added to comment submission forms. The weblog software can safely reject all comment submissions that lack valid proof of work. Legitimate users will not be inconvenienced by a short delay as they submit their comment, while spammers will not be able to easily submit comments in large volumes. For example, if a typical comment spammer sends 1,000,000 comments per day and the proof of work requires 2 seconds of compute time, they would need to dedicate about 24 machines to proof-of-work computation to maintain that rate of transmission (2,000,000 CPU-seconds per day against 86,400 seconds per machine per day). The con of this method is that users without advanced browsers, or with older, slow computers, may not be able to post comments.

There is a javascript implementation of Hashcash that can be combined with SpamAssassin's hashcash verification and duplicate detection algorithms to quickly produce a prototype.
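To make the idea concrete, here is a toy hashcash-style proof of work. It is a sketch only: the token format is invented, while SpamAssassin's hashcash support and the JavaScript implementation above each define their own. The client must find a counter whose SHA-1 digest over the comment begins with a given number of zero bits; the server verifies with a single hash.

    #!/usr/bin/perl -w
    # Toy proof of work: find $ctr such that SHA-1("$comment:$ctr")
    # begins with $bits zero bits ($bits assumed a multiple of 4).
    use strict;
    use Digest::SHA qw(sha1_hex);

    sub mint {                        # done by the commenter's browser
        my ($comment, $bits) = @_;
        my $zeros = '0' x ($bits / 4);
        my $ctr = 0;
        $ctr++ until sha1_hex("$comment:$ctr") =~ /^$zeros/;
        return $ctr;
    }

    sub verify {                      # done cheaply by the weblog server
        my ($comment, $bits, $ctr) = @_;
        my $zeros = '0' x ($bits / 4);
        return sha1_hex("$comment:$ctr") =~ /^$zeros/;
    }

    my $work = mint("great post!", 16);    # ~65,000 hashes on average
    print verify("great post!", 16, $work) ? "accepted\n" : "rejected\n";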

Note that with the current number of trojanned or zombie machines, we can safely assume that the attacker has infinite computing power, rendering any proof-of-work defense moot.

  • Collaborative filtering: IronPort maintains a database of e-mail server traffic volumes called SenderBase. Mail servers can use SenderBase to find "traffic spikes" and potentially block e-mail from those servers. Something similar could be done for weblogs. As comments come in, weblogs could report the URLs in the comments to a central server. If a URL is sent in too rapidly, it can be added to a list of probable spam URLs, and weblogs can quarantine or delete comments containing that URL, as sketched below.
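The central server could be little more than a rate counter over a sliding window. In this sketch the report_url interface, the one-hour window, and the 50-report threshold are all illustrative.

    #!/usr/bin/perl -w
    # Sketch of the central server's spike detector: count how often
    # each URL is reported by participating weblogs within a window.
    use strict;

    my %reports;                # url => list of report timestamps
    my $window = 3600;          # one hour
    my $limit  = 50;            # reports per hour before we call it spam

    sub report_url {
        my ($url, $now) = @_;
        $now ||= time;
        # Drop reports that fell out of the window, then add this one.
        $reports{$url} = [ grep { $_ > $now - $window }
                           @{ $reports{$url} || [] } ];
        push @{ $reports{$url} }, $now;
        return @{ $reports{$url} } > $limit;    # true => probable spam
    }

    print report_url('http://pills.example/') ? "quarantine\n" : "ok\n";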

  • DNS-based URI Blocklists: SpamAssassin has had great success using Jeff Chan's Spam URI Realtime Blocklists. When an e-mail arrives, SpamAssassin extracts the URLs contained within and performs a few DNS TXT queries to find out whether each URL has been reported in spam. These blocklists could be used for weblogs too. Instead of Jay maintaining a central blocklist that people download and install manually, mt-blacklist could use a DNS-based blocklist that is effectively updated in real time (see the lookup sketch below). This would significantly cut down on comment spam because weblog owners would not need to actively maintain their blocklists, and the submission process could be streamlined so that it doesn't consume so much of any one person's time.
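Checking a comment URL against such a list amounts to one DNS lookup per domain. Here is a sketch using Net::DNS, querying an A record against a SURBL-style zone; a blog-specific zone would work the same way, and the crude host normalisation is an obvious simplification.

    #!/usr/bin/perl -w
    # Sketch: look up the domain of a comment URL in a SURBL-style
    # DNS blocklist.  Any answer means "listed".
    use strict;
    use Net::DNS;

    sub url_is_listed {
        my ($url, $zone) = @_;
        my ($domain) = $url =~ m{^https?://([^/:]+)}i or return 0;
        $domain =~ s/^www\.//i;          # crude normalisation
        my $res   = Net::DNS::Resolver->new;
        my $reply = $res->query("$domain.$zone", 'A');
        return defined $reply;
    }

    print url_is_listed('http://pills.example/buy', 'multi.surbl.org')
        ? "reject comment\n" : "pass comment\n";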

Other Resources

http://codex.wordpress.org/Combat_Comment_Spam

...

PHSDL

PHSDL is sharing its methodology with BlogSpamAssassin under the PHSDL GNU license.

Project Framework Constraints

  • Not intended to stop search engine spam
  • Stop comment malware and redirect-domain spam
  • Not intended to stop off-topic comment spam
  • Develop a universal anti-spam filter API for different scripting languages
  • The anti-spam filter API must be available for forums, blogs, and bookmarking services

BlogSpamAssassin Directives

Avoid Problematic Honeypots

Minimize Resource Consumption

  • Structure the SBL by domain, not by URL or subdomain (see the sketch after this list)
  • Query the SBL primary database for a domain match
  • Utilize push technology rather than pull technology
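As an illustration of the first two points, lookups could be keyed by registered domain, collapsing every URL and subdomain down to one database row. This sketch is hypothetical: sbl_key naively assumes the last two labels form the registered domain, and real code should consult the public suffix list (e.g. for .co.uk).

    #!/usr/bin/perl -w
    # Sketch: reduce a reported URL to a bare registered domain so the
    # SBL stores one row per domain instead of one per URL or subdomain.
    use strict;

    sub sbl_key {
        my ($url) = @_;
        my ($host) = $url =~ m{^https?://([^/:?#]+)}i or return;
        my @labels = split /\./, lc $host;
        return unless @labels >= 2;      # see public-suffix caveat above
        return join '.', @labels[-2, -1];
    }

    print sbl_key('http://x1.pills.example.com/buy?id=9'), "\n";  # example.com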

Adopt Standardized Filtering Techniques

  • The primary filter needs to be based on malware and cloaking-redirect domains
  • The secondary filter should be based on the parent project, SpamAssassin
  • The tertiary filter should be user-defined whitelist and blacklist filters

BlogSpamAssassin Sub-Algorithm PHSDL Filter Test

IgorBergerTalk