WikiSpam

Spamming is unwanted, antisocial, disruptive, and rude behavior. Webpage owners don't put up wikis, blogs, and guestbooks for the purpose of giving spammers a place to put links. Wikis are open to anyone to edit, but if your additions benefit you (and your PageRank) more than they benefit the wiki community, they are most likely unwanted and spam.

Discussion of antispam ideas

AutoBan

Described at UnrealWiki. It is a variation of the ShotGunSpam method.

There are two types of bans imposed by the automatic spam filter:

Temporary Bans are imposed for adding several, but not too many, links to a page. The edit is saved, but the user is put in read-only mode for ten minutes for each link added. For most spammers, even those adding only a small number of links, that will be a long enough ban to deter them.

Permanent Bans are imposed for adding too many links to a page. The changes to the page are discarded, and the user is put in read-only mode permanently.

The exact number of links required to trigger either case is not made public so spammers can't easily fine-tune their spamming attempts.

An email containing the time, IP address, modified page name and submitted page content is sent to the wiki admins so they can deal with false positives or set a Permanent Ban.
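The two ban types above can be sketched roughly as follows. This is only an illustration, not UnrealWiki's actual code: the thresholds are made up (the real ones are deliberately not public), and the names `decide_ban` and `BanDecision` are hypothetical. Only the ten-minutes-per-link rule comes from the description.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real values are kept secret on purpose.
TEMP_BAN_MIN_LINKS = 3
PERM_BAN_MIN_LINKS = 10
MINUTES_PER_LINK = 10  # from the description: ten minutes per link added


@dataclass
class BanDecision:
    kind: str         # "none", "temporary" or "permanent"
    save_edit: bool   # whether the edit itself is kept
    ban_minutes: int  # read-only time for temporary bans, 0 otherwise


def decide_ban(links_added: int) -> BanDecision:
    """Map the number of links added in one edit to a ban decision."""
    if links_added >= PERM_BAN_MIN_LINKS:
        # Too many links: discard the edit and ban permanently.
        return BanDecision("permanent", False, 0)
    if links_added >= TEMP_BAN_MIN_LINKS:
        # Several links: keep the edit, but make the user read-only for a while.
        return BanDecision("temporary", True, links_added * MINUTES_PER_LINK)
    return BanDecision("none", True, 0)
```

A caller would pass in the count of links the edit added, then apply the decision and send the notification email to the admins.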

On UnrealWiki, so far the Automatic Permanent Ban has caught most of the spammers. The Temporary Ban has had a few false positives and has not caught many spammers, since most post a ton of links and are given a Permanent Ban.

We should learn more about this method. It sounds a bit like what Manni and I have discussed for the spammer cookie that AntiSpamDan gives to spammers. Currently, visitors with the spammer cookie get a nasty redirect.

Banned Content

Content banning is better than IP banning. CommunityWiki has a page about the small network of wikis that exchange "banned content" pages, each containing a list of regular expressions that a page must not match when it is saved. (Oddmuse wikis, for the moment.) There's a similar solution on the MoinMoin master.

Getting organized is exactly what needs to be done to fight the spammers. Many of these banned content lists are regular expressions in a format that is compatible with several different wiki engines. We now have a blacklist of URLs in the chongqed database that can be used as a banned content list. MoinMoin can use our list and DokuWiki uses it by default. On the CaughtSpam page you can see the list of spam attempts on this wiki that our list has blocked automatically.
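A banned content check boils down to something like the sketch below. The patterns are invented placeholders, not entries from the chongqed blacklist, and `find_banned` is a hypothetical helper name; real wiki engines each have their own hooks for this.

```python
import re

# Placeholder patterns for illustration only (not from any real blacklist).
BANNED_PATTERNS = [
    r"cheap-pills\.example",
    r"casino[0-9]*\.example\.com",
]


def compile_banned(patterns):
    """Compile the shared list once; matching ignores case."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def find_banned(text, compiled):
    """Return the patterns that match the submitted page text."""
    return [rx.pattern for rx in compiled if rx.search(text)]


compiled = compile_banned(BANNED_PATTERNS)
hits = find_banned("Visit http://Casino77.example.com/win now", compiled)
# A non-empty result means the save should be rejected.
```

Because the list is just text, the same patterns can be exchanged between wikis and fed to any engine that supports regular expressions.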

CAPTCHA

CAPTCHA requires a visitor to enter numbers based on a distorted image that a machine using OCR will not be able to read. In terms of ending spam this is a very good solution, but it's annoying to users. For visually impaired visitors the image is difficult or impossible to read, so they cannot edit the wiki. Well-designed CAPTCHA systems include audio alternatives for blind users, but few are well designed.

Email spammers have found ways to trick users into solving CAPTCHAs for them so that the spammer can sign up for new free email accounts. More about this and CAPTCHAs in general can be found at Wikipedia under Circumvention.

For info on a non-image alternative, see Manni's blog about a "logic puzzle" or question-based CAPTCHA that overcomes the accessibility problem.

How about a CAPTCHA which is invoked only if the same IP makes more than X edits in one day? – BayleShanks

That seems like a waste since it's not going to block that much spam. Most smart spammers don't do massive attacks; they do a few pages in one session. They usually then come back about 12 or 24 hours later to make sure their spam is still there. – Joe

DelayedIndexing

Delayed Indexing involves adding a noindex meta tag to pages for a period of time after an edit. Pages with frequent changes may rarely be without this noindex tag. This is probably not a good method since Google should remove pages from its index when it finds the noindex meta tag.

Normally it should work fine. It would mostly affect pages that are edited pretty much non-stop, at least once every 10 hours (based on Ward's setting), which isn't common even on Wikipedia. If Google's crawler shows up at one of those times when the page gives noindex, I think it would remove the page. It's possible, but of course less likely, that Google could show up while noindex is on even after a single edit.

Even at worst though this is NOT a terrible idea if you are really desperate. But other than keeping the Google index of your wiki clean so you don't attract more spam, it is not going to deter spammers that do find your wiki since they seem unable to read any warnings.
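For illustration, the decision behind Delayed Indexing is tiny. This sketch assumes the 10 hour window mentioned above; the function name `robots_meta` is hypothetical and not taken from any wiki engine.

```python
from datetime import datetime, timedelta

# Window taken from the discussion above (Ward's setting); adjust as needed.
NOINDEX_WINDOW = timedelta(hours=10)


def robots_meta(last_edit: datetime, now: datetime) -> str:
    """Return a noindex meta tag while the page was edited recently."""
    if now - last_edit < NOINDEX_WINDOW:
        return '<meta name="robots" content="noindex">'
    return ""  # old enough: let search engines index the page


edited = datetime(2005, 5, 10, 5, 31)
recent = robots_meta(edited, edited + timedelta(hours=1))
settled = robots_meta(edited, edited + timedelta(hours=11))
```

The sketch also shows the weakness discussed above: a page edited more often than once per window never leaves the noindex state.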

IP Address Blocking

Banning single IP addresses usually does no good either, because spammers often use a large range of addresses or even proxies. Banning IP ranges without knowing what you are doing is dangerous. You may accidentally block entire countries from editing (and sometimes viewing) your wiki. If you are going to ban a small IP range, be sure you do your research and determine that you are blocking the smallest number of possibly legitimate users you can (hopefully zero). This site about email blacklists gives some more insight into the problem.

Javascript Hashcash

This technique uses Ajax to retrieve a piece of JavaScript that lets the browser compute a number, which is then compared with a reference number computed by the wiki application.

It's really effective, as no robot today is able to understand JavaScript, and the hashcash method is there to prevent future use of JavaScript by robots.

Source and implementation on a wiki can be found here: Wikini (French). The idea and code are from: http://wordpress-plugins.feifei.us/hashcash/
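To give a feel for the hashcash idea, here is a generic proof-of-work sketch. It is not the algorithm used by Wikini or the WordPress plugin: the challenge format, SHA-256, and the difficulty value are all assumptions. In a real setup `solve` would run as JavaScript in the browser, and only `new_challenge` and `is_valid` would run on the server.

```python
import hashlib
import secrets

DIFFICULTY_BITS = 12  # hypothetical: required number of leading zero bits


def new_challenge() -> str:
    """Server side: issue a random challenge with the edit form."""
    return secrets.token_hex(8)


def is_valid(challenge: str, nonce: int, bits: int = DIFFICULTY_BITS) -> bool:
    """Server side: cheap check that the hash starts with `bits` zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits) == 0


def solve(challenge: str, bits: int = DIFFICULTY_BITS) -> int:
    """Client side: brute-force a nonce (this is the costly part)."""
    nonce = 0
    while not is_valid(challenge, nonce, bits):
        nonce += 1
    return nonce


challenge = new_challenge()
nonce = solve(challenge)
```

Checking an answer costs the server one hash, while finding it costs the client thousands, which is what makes bulk posting expensive even for a robot that can run JavaScript.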

NoFollow

Google has started an initiative to devalue comment spam. They got together with MSN, Yahoo and a bunch of blogging companies and came up with a tag which instructs the search engines to ignore a link for PageRank/link popularity calculations. They are using a "rel=nofollow" attribute on href links to do this. See Google's announcement for more details.

The idea is that blog software is supposed to give any link in a comment that attribute while the links in the blog post itself can still be followed and bloggers can thus still influence the page rank of pages or google bomb each other.

This seems like a pretty good solution, but more for blogs than wikis. Wikis can use it too, but since the entire wiki is user contributed, all links would get nofollow rather than just those in the comments as on blogs, which is not ideal. The method is pretty controversial among blog and wiki users because it denies PageRank to legitimate links.

The problem with this method and many others is that many spammers don't know or care that their spam is useless. Evidence of that can be seen on our wiki.

This is similar to using Robots NoFollow but this is per link rather than per page or site.

From Wikipedia: As of March 6, 2005, use of rel="nofollow" is suspended for the time being on en.wikipedia.org. The vote is far from a consensus and discussion shows strong opposing positions with very wide differences in priorities which have not yet been resolved. More advanced heuristic use of rel="nofollow" is likely to come, when someone has the time to put in the effort to make it work.

Many people argue that nofollow has failed since spammers don't stop hitting sites that use it. As stated above, its purpose is to devalue spam links.
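For reference, tagging user-submitted links with rel="nofollow" can look like the sketch below. It is a deliberately simple regular expression pass under the assumption that the input is already rendered HTML; the helper name `add_nofollow` is hypothetical, and a real engine would more likely set the attribute where it renders links.

```python
import re

# Matches the opening tag of an anchor and captures its attributes.
_A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)


def add_nofollow(html: str) -> str:
    """Add rel="nofollow" to every <a> tag that has no rel attribute yet."""
    def fix(match):
        attrs = match.group(1)
        if re.search(r"\brel\s*=", attrs, re.IGNORECASE):
            return match.group(0)  # leave an existing rel untouched
        return f'<a{attrs} rel="nofollow">'
    return _A_TAG.sub(fix, html)


tagged = add_nofollow('<a href="http://example.com">x</a>')
```

Applied to a whole wiki page this marks every link, which is exactly the blanket effect criticized above; a blog would run it only over the comments.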

Robots NoFollow

On many wikis, spam is still visible on old page revisions and diffs. This means that spammers still benefit even if their spam is removed right away. The more revisions that end up in Google the better for them.

You can prevent search engines from indexing these pages by applying a patch like RobotsNoFollow. (The chongqed.org wiki uses a similar solution.) Or you can add a "robots.txt" file at the root of your server. See usemod.com for a sample file. Or here for more info. Don't exclude your complete wiki, but the old revisions should be off-limits.

This will help reduce the reward and hopefully motivation for spamming. It also makes your wiki less visible to spammers. Many search for new targets by looking for existing spam; if they find it then they know that wiki is a good target.
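The per-page variant can be sketched as a small function that picks the robots meta value for each request. The action names and the `revision` parameter are hypothetical; every wiki engine names these differently, and this is not the RobotsNoFollow patch itself.

```python
from typing import Optional

# Hypothetical action names; only plain page views should be indexed.
INDEXABLE_ACTIONS = {"view"}


def robots_directive(action: str, revision: Optional[int] = None) -> str:
    """Choose the robots meta content for one request."""
    if action in INDEXABLE_ACTIONS and revision is None:
        return "index,follow"  # current version of a page
    # Old revisions, diffs, histories and edit forms stay out of the index.
    return "noindex,nofollow"


current = robots_directive("view")
old_rev = robots_directive("view", revision=12)
diff = robots_directive("diff")
```

The current page stays visible to search engines while everything that could preserve removed spam is kept out.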

ShotGunSpam

When most spammers hit a wiki they stick to a small number of pages. Then they really load those up with their links, "akin to the loud burst of a shotgun and the wide visible area of damage." The proposed solution to this is to limit the number of URLs any editor can add to a page at one time. This method can affect some users, though, since large edits or refactoring could be blocked.

This method comes from MeatballWiki, where they have a preliminary patch for UseMod that implements a simple version of ShotGunSpam detection.
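A simple version of the URL limit might look like this. It is not the MeatballWiki patch: the URL pattern, the limit, and the helper names are assumptions for illustration. Comparing against the old text means URLs that were already on the page do not count, which softens the refactoring problem a little.

```python
import re

# Rough URL matcher; good enough for counting, not for validation.
URL_RE = re.compile(r'https?://[^\s<>"\]]+', re.IGNORECASE)
MAX_NEW_URLS = 5  # hypothetical limit per edit


def new_urls(old_text: str, new_text: str):
    """List URLs in the new text that were not on the page before."""
    old = set(URL_RE.findall(old_text))
    return [u for u in URL_RE.findall(new_text) if u not in old]


def is_shotgun(old_text: str, new_text: str, limit: int = MAX_NEW_URLS) -> bool:
    """True if the edit adds more new URLs than the limit allows."""
    return len(new_urls(old_text, new_text)) > limit


old = "See http://a.example/ for details."
new = old + " http://b.example/1 http://b.example/2"
added = new_urls(old, new)
```

Moving text between pages would still trip the limit, since the target page has never seen those URLs.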

SocialSpambat

Spambat's name comes from Spam Combat. Its aim is to let the spammer know that there is no point in spamming the page because of nofollow or link redirects. Whenever adding an external link, before the edit can be saved a user must click through the Spambat info page. This makes sense because spammers are usually too dumb to realize their spam on that wiki is useless otherwise. But it still requires the spammer to read the notice. So how effective it would be is questionable.

Surge Protection

Also known as Edit Throttle. The idea is to measure the speed of edits coming from the same IP address. When making multiple edits, most spammers will make them very fast, likely with no preview. A regular editor isn't going to edit and save in just a couple of seconds, which is what spammers usually do (whether they are using automated tools or not). This method could be combined with almost all other antispam methods to improve effectiveness, with little effect on regular users.
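A minimal edit throttle could be sketched like this. The five second interval and the class name `EditThrottle` are assumptions; timestamps are passed in as plain seconds so the logic is easy to follow.

```python
MIN_SECONDS_BETWEEN_EDITS = 5.0  # hypothetical minimum gap between saves


class EditThrottle:
    """Reject saves that follow the previous save from the same IP too fast."""

    def __init__(self, min_interval: float = MIN_SECONDS_BETWEEN_EDITS):
        self.min_interval = min_interval
        self._last_edit = {}  # IP address -> time of the last save attempt

    def allow(self, ip: str, now: float) -> bool:
        last = self._last_edit.get(ip)
        self._last_edit[ip] = now  # rejected attempts also reset the clock
        return last is None or now - last >= self.min_interval


throttle = EditThrottle()
first = throttle.allow("192.0.2.1", 100.0)
too_fast = throttle.allow("192.0.2.1", 102.0)
later = throttle.allow("192.0.2.1", 110.0)
```

A rejected save could simply show the preview again, so a regular user who is unusually quick loses nothing but a click.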

TarPit

The tarpit is a second wiki behind the real thing where only spammers will end up. The real wiki will thus not be spammed and since search engines won't get to see the tarpit either, all the spammers do is waste time. With this system, however, they waste their own time and not ours.

Read more on the TarPit page.

ShieldsUp

Allow trusted users (editors) to hit a panic button to lock the wiki from edits by normal users. See UseMod for a description. Similar methods already exist in some form on several wiki engines.

OpenProxy SelfBan

Trick open proxies into following a self ban link.

http://www.usemod.com/cgi-bin/mb.pl?OpenProxy

Bad Behavior

Bad Behavior is a WordPress plugin that denies spammers access by analyzing their HTTP requests and comparing them to profiles of known spambots. It targets any badly designed software directed at a website. It provides a generic interface that can be integrated into virtually any PHP-based software, meaning it should work for several kinds of wikis.

FAQ

Why not require logins and passwords?

Many current spambots are able to create accounts automatically so requiring logins is a temporary solution at best. Actually, it is likely that by requiring logins you end up with spam in your wiki AND lots of spam accounts in your user database (if you allow open user registration). Also, most wikis are meant to be open to anyone to contribute. By requiring logins and passwords, even if simple to get, you rule out a lot of quick edits by new or infrequent users. Many people don't want another registration to have to remember just to make a small edit on someone else's wiki and they will just move on.

Why not block all URLs?

Without outside links that would be a pretty pathetic wiki. Wikis are designed to share information. Limiting a wiki in such a major way would make it almost useless except for very limited purposes.

What about AntiSpamBots?

I'm writing a bot that will apply a list of regular expressions to a remote wiki. This will allow us to apply content banning to remote wikis without the need to upgrade the wiki software with a content ban feature. Of course, it also points up the security hazards of having wikis which don't defend against bots. – BayleShanks

Manni and I aren't too excited about automated cleaning methods. If you are not really careful you could do as much damage as the spammers. There has been one other guy who set up something similar. We suggested he only run it on wikis where he was a regular user or the owner, or where he had the permission of the wiki owner. To let other people know what happened, have the bot log in and create a user page that explains that you are automatically cleaning spam.

Manni has recently experimented with his own AntiSpamBot for use on mostly abandoned wikis that suffer heavy spam attacks and has found "it's pretty difficult to implement something that works on all wikis, even if they are using the same engine." Be careful not to make a worse mess than it was. If you start reverting legitimate edits people are going to get upset. – Joe

What about Laws?

Someone asked: What about criminal or civil prosecution of spammers, including investigations and data logging of their IP addresses, the sites they spam for, and tracking down those sites' owners?

The problem with that is that web spam itself specifically is not illegal in any country. Interpreting existing laws on vandalism, denial/theft of service, etc., it is obvious these activities should be illegal. Some spammers are illegally using hacked computer networks (zombies) to do their spamming, but that makes them difficult to track down. Look how long it took to get laws against email spam, and that is a problem for almost all users of email, especially large corporations. We already had laws against junk faxes in the US that could have been interpreted as making spam illegal. But now that we have real laws against spam we have even less protection against it. Long ago I read that the CAN-SPAM Act had very low compliance and hardly reduced the amount of email spam at all. Even if antispam laws were useful, web spam is not likely to be taken seriously by lawmakers in the near future. Blogs, wikis, guestbooks, etc. do not have the major backing that is needed to get the laws we need passed. And of course another big problem (just like with email spam) is that a large portion of spam comes from outside the US, where our laws aren't going to stop anything. And last, most government agencies think they have more important things to waste resources on, such as rapes, murders, assault, theft, drugs, terrorists, kidnapping, jaywalking, missing kittens, etc. – Joe - 2005-05-10 05:31 UTC


WikiSpam discussions on other wikis