
AntiSpamBot

Anti-Spam Bot

If you design an AntiSpamBot, we suggest you only use it on wikis where you are the owner or a regular user, or where you have the owner's permission. While we encourage you to clean up any wiki you find spammed, be careful not to break it worse than the spam did. If you are using an automated cleaning method, be sure to review the edited pages to make sure nothing is going wrong. And to let other people know what happened, have the bot log in and create a user page explaining that you are automatically cleaning spam.

Manni has recently experimented with his own AntiSpamBot for use on mostly abandoned wikis that suffer heavy spam attacks. Without this, data in abandoned wikis will be lost as old revisions expire. Lee's wiki has already suffered much data loss, which inspired Manni to create his bot to prevent further loss. He has already discovered "it's pretty difficult to implement something that works on all wikis, even if they are using the same engine." And apparently it is hard to tell legitimate edits (i.e. my cleaning) from spam. Not a big deal on an abandoned wiki, but if Lee ever comes back he is going to have trouble making his edits stick.

If you implement a software solution that automagically replaces spammy links with links to chongqed.org, make sure that the users of your wiki know about this. We are not suggesting this method of cleaning; if you wish to leave links to our site, it's best to only replace wiki spam with a link to us on Sandbox pages. Even if you are the owner of the wiki, we want to keep wikis clean, and a link to us only distracts from the purpose of the wiki. Links to us in the middle of your wiki, where spammers would often place their garbage, may look like spam to others. We don't want to be mistaken for spammers since, as you know, we are the spam fighters.

One method we have seen is to replace links to non-whitelisted sites with redirects (through Google, Yahoo, Blogger, etc.) to the site, to prevent any PageRank benefit to the spammer. This doesn't clean the wiki or restore the damage, though; it only makes the spammer's links useless for their purpose. It also affects legitimate users' links that aren't whitelisted, which may annoy them.
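A minimal sketch of the redirect idea, in Python. The whitelist entries and the redirector address below are made-up placeholders, not anything a real wiki uses; substitute whatever redirect service and trusted domains apply to your wiki.

```python
import re
from urllib.parse import quote, urlparse

# Hypothetical values, for illustration only.
WHITELIST = {"wiki.example.org", "chongqed.org"}
REDIRECT_PREFIX = "https://redirect.example/?url="

# Rough URL matcher; real wiki markup may need a stricter pattern.
URL_RE = re.compile(r"https?://[^\s\]\)<>\"]+")


def is_whitelisted(url):
    """True if the URL's host is a whitelisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in WHITELIST)


def neutralize_links(text):
    """Route every non-whitelisted link through the redirector."""
    def rewrite(match):
        url = match.group(0)
        if is_whitelisted(url):
            return url
        return REDIRECT_PREFIX + quote(url, safe="")
    return URL_RE.sub(rewrite, text)
```

As the paragraph above says, this only devalues the links. The spam text itself stays on the page.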

Another method, which is what Manni seems to be doing, is to identify which edits are from spammers and revert all their edits to a previous clean version. That requires a known clean version, so it will take a good manual cleaning at least once.

The Bot Experiment

Yes, I have come up with a little Perl script that can act as a cleaning bot for wikis. I currently have two versions, each targeted directly at one specific wiki. The first one is working on Lee's wiki (UseMod at the time, but now MediaWiki; it is now being protected by another bot, WikiMinion). (By the way, Lee seems to be a major contributor to the Wikimedia software [1].) Since Lee's wiki seems to be abandoned and is now a major target for every Chinese spammer out there, I based the bot on a whitelist. It will revert every change that was not done by the user called 'Cleaner'. Of course, that's bloody dangerous and will even revert good changes, like the one Joe quoted. I also try to keep an eye on the bot, monitoring its changes. But it seems to work pretty well and from what I see in my server logs, it seems that some of those spammers have even taken notice.

The second bot is a modification of that first one. I created it when I was confronted with the recent changes on the KnowHow-Wiki. There were too many changes to clean up manually. But since I didn't know how to find the good revisions, I settled on an IP blacklist this time. This is nearly as dangerous as the whitelist thing because you can easily revert to a spammed revision if the author of that spammed revision is not yet on your blacklist.
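To make the difference between the two approaches concrete, here is a rough Python sketch (not the actual Perl script). The `Revision` record, the IP addresses and the two rule functions are invented for illustration; a real bot would fill the history from the wiki's page history.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    number: int
    author: str
    ip: str
    text: str


def last_trusted(history, is_trusted):
    """Return the newest revision accepted by is_trusted, or None."""
    for rev in sorted(history, key=lambda r: r.number, reverse=True):
        if is_trusted(rev):
            return rev
    return None


TRUSTED_AUTHORS = {"Cleaner"}        # whitelist: only this user is trusted
BLACKLISTED_IPS = {"203.0.113.7"}    # blacklist: hypothetical spammer IP


def whitelist_rule(rev):
    return rev.author in TRUSTED_AUTHORS


def blacklist_rule(rev):
    return rev.ip not in BLACKLISTED_IPS


history = [
    Revision(1, "Cleaner", "198.51.100.1", "clean text"),
    Revision(2, "anon", "203.0.113.9", "spam from an IP not yet listed"),
    Revision(3, "anon", "203.0.113.7", "spam from a listed IP"),
]

# The whitelist rule goes all the way back to revision 1.
# The blacklist rule stops at revision 2, which is still spam.
revert_to_whitelist = last_trusted(history, whitelist_rule)
revert_to_blacklist = last_trusted(history, blacklist_rule)
```

The second result shows the danger described above: the blacklist bot happily restores a spammed revision because its author is not yet on the list.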

Halz pointed me towards the BannedContentBot. I haven't had a look at the code yet, but from what I've read, this bot is based on a module that simplifies accessing OddMuse and UseMod wikis. Now, that is really cool and I plan to have a look at that module today.

Manni

Bot Improvement Ideas

I still don't really like the idea of an AntiSpamBot, but I admit that for many sites it's just impossible to handle spam manually. So here are some more of my ideas for building a more effective and safer bot. Since Manni is dealing with abandoned wikis it's not totally necessary to have a perfect system, but I don't think it would be usable on an active wiki.

Another thing you could do is look at the diff between pages. If the addition is mostly URLs then revert it, otherwise flag it (on your end) for human review. Spammers will figure this out and start posting random junk like we see in mail (word salad) more than we already do, but spam and antispam methods will always evolve just like everything else.
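A rough Python sketch of that diff check, using the standard `difflib` module. The 0.5 threshold and the character-based measure are arbitrary assumptions; tune them against real spam before trusting them.

```python
import difflib
import re

URL_RE = re.compile(r"https?://\S+")


def added_lines(old_text, new_text):
    """Lines present in new_text but not in old_text, per difflib.ndiff."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def url_fraction(old_text, new_text):
    """Fraction of the added characters that belong to URLs."""
    added = " ".join(added_lines(old_text, new_text))
    if not added.strip():
        return 0.0
    url_chars = sum(len(u) for u in URL_RE.findall(added))
    return url_chars / len(added)


def classify(old_text, new_text, threshold=0.5):
    """'revert' if the addition is mostly URLs, otherwise 'review'."""
    if url_fraction(old_text, new_text) >= threshold:
        return "revert"
    return "review"
```

Word salad defeats this measure easily, which is why everything below the threshold goes to a human rather than being accepted automatically.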

Also, if the previous version of the page contained significantly more (non-link) content than the new version, then it's likely the page has been overwritten by a spammer. Otherwise the edit is probably by a legitimate user who is likely a regular on the wiki and will be manually or automatically whitelisted.
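The overwrite check could look something like this in Python. It compares the non-link word counts of the older and newer revisions; the `min_keep` ratio of 0.5 is a guess, not a tested value.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def non_link_words(text):
    """Number of words left after URLs are stripped out."""
    return len(URL_RE.sub("", text).split())


def looks_overwritten(previous_text, new_text, min_keep=0.5):
    """True if the new revision kept less than min_keep of the old non-link content."""
    before = non_link_words(previous_text)
    after = non_link_words(new_text)
    if before == 0:
        return False  # nothing to lose, so nothing was overwritten
    return after < before * min_keep
```

A legitimate refactoring that trims a page heavily would also trip this, so it is better used as one signal among several than as a verdict.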

Auto whitelisting and blacklisting could be done by measuring the amount of new content versus new URLs posted by users who log in. Posting too much content in a short period (even non-URL content) should be a sign of spamming. Spammers will hit many pages at once, but a normal user will only edit a few pages at a time.

This goes back to the edit throttle idea, but here it's done remotely. The Cleaner Bot could calculate who is editing too fast and know it's a spammer. The auto lists would be based on IP addresses but only used for a short period, maybe 12 to 48 hours. Most of the worst spammers return within 6 to 24 hours to respam in case the wiki has been cleaned. Whenever they respam they are adding to their auto blacklist timeout. But spammers often use different IPs in their return attacks. Not sure how to get around that problem currently. But it shouldn't matter much since you are not preventing spam, you are reverting existing spam. Spam protection built into the wiki would have to decide from the first few edits that it's being spammed and prevent further edits. From the remote cleaner's point of view you only need to identify existing spam, which makes it much easier because you can look at all the evidence.
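Here is one way the remote throttle and the short-lived auto blacklist might fit together, sketched in Python. The class name, the limits (more than 5 pages in 10 minutes) and the 24 hour timeout are all assumptions picked for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class TemporaryBlacklist:
    """Short-lived, IP-based blacklist driven by editing speed."""

    def __init__(self, max_pages=5, window=timedelta(minutes=10),
                 ban=timedelta(hours=24)):
        self.max_pages = max_pages
        self.window = window
        self.ban = ban
        self.edits = defaultdict(list)   # ip -> [(timestamp, page), ...]
        self.banned_until = {}           # ip -> expiry time

    def is_banned(self, ip, when):
        until = self.banned_until.get(ip)
        return until is not None and when < until

    def record_edit(self, ip, page, when):
        # Keep only this IP's edits that fall inside the sliding window.
        recent = [(t, p) for t, p in self.edits[ip] if when - t <= self.window]
        recent.append((when, page))
        self.edits[ip] = recent
        distinct_pages = {p for _, p in recent}
        # Too many pages at once, or a respam while already listed,
        # pushes the expiry forward from the time of this edit.
        if len(distinct_pages) > self.max_pages or self.is_banned(ip, when):
            self.banned_until[ip] = when + self.ban


# Example: four different pages within four minutes, limit of three.
start = datetime(2005, 7, 16, 10, 0)
blacklist = TemporaryBlacklist(max_pages=3)
for i in range(4):
    blacklist.record_edit("203.0.113.7", f"Page{i}", start + timedelta(minutes=i))
```

Because entries expire on their own, a dynamic IP that later passes to a legitimate user is only blocked for a day or so. It does nothing about spammers who come back from a different IP.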

And finally, I must mention that cleaner bots probably should not be open sourced or even the executable shared with lots of people. Keeping it private limits the people who use automated cleaners to those who have studied what they are doing with the wikis well enough to program it. Hopefully since they know what is going on well they are also careful in using it.

Joe

Hello all,

I'm the author of the spamclean.py bot that Manni mentioned above (see InterWikiSoftware:SpamClean for more info; there is currently also some discussion at CommunityWiki:BannedContentBot).

Spamclean.py is like MT-blacklist; it relies upon a content blacklist, rather than an IP blacklist. That is, certain things (main URLs which have been seen in past spam) are illegal to post. The blacklist is downloaded anew each run from an OddMuse wiki of your choice (this way you can take advantage of the most recent spam regex entries to the master list).
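For readers who haven't seen a content blacklist, the general technique looks roughly like the Python below. This is not the spamclean.py source. The list format (one regex per line, `#` starting a comment) is an assumption, and the download step is left out so the sketch stays self-contained.

```python
import re


def parse_blacklist(raw):
    """Compile one regex per non-empty line; '#' starts a comment.

    Simplification: a pattern that itself contains '#' would be cut short.
    """
    patterns = []
    for line in raw.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def banned_matches(text, patterns):
    """Return the blacklist patterns that match somewhere in the page text."""
    return [p.pattern for p in patterns if p.search(text)]
```

A page whose current revision yields any match would then be rolled back to the newest revision that yields none.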

I wrote spamclean.py as a matter of personal convenience, and also as a demo for WikiGateway; if anyone else is interested in maintaining and improving it, I'd be happy to add you to the SourceForge project. The source code is a little messy right now, sorry! I'll clean it up & comment it if anyone asks me to.

As Manni mentioned, spamclean.py uses a library module, WikiGateway, which aims to provide a unified API for automated interaction with all sorts of wiki engines (not only UseMod, OddMuse, and MoinMoin; those are just the first ones I've gotten around to doing). The potential applications of such a unified API go much further than spambots; see InterWikiSoftware:WikiGatewayMotivation for other things I'm planning to do with it eventually.

However, this library could also be used for creating spam, or for other malicious attacks. Months ago, I did the opposite of what Joe recommended, and open sourced the code. There's a discussion about the pros and cons of doing this at InterWikiSoftware:WikiGatewayGeneralDiscussion. In short, I feel that the spammers will write their own bots eventually, and helping them do it only makes it happen sooner (rather than determining whether it happens at all). Relying on spammers not writing such a library themselves is sort of like "security by obscurity"; it relies not upon any real security, but merely upon the lack of motivation of the attacker. Wikis will need to create better security mechanisms eventually, and all this can do is hasten that day.

So, if spammers will spam either way, we may as well take advantage of the other neat things that we can do with a tool like this (one of which is stopping spam). Anyway, see InterWikiSoftware:WikiGatewayGeneralDiscussion for more discussion on this point.

– BayleShanks

I have no doubt that eventually spammers will create their own cross-wiki libraries (if they haven't already), but those are only going to be worked on at a small scale. There is some cooperation and coordination between spammers, but with open source they benefit from potentially many programmers. Now that it's open source there isn't anything that can stop it, but I wouldn't make the cleaner bot open source too. That could easily be used for a wiki denial of service. – Joe

I dunno, scripting a denial of service is trivial already with just the underlying library; I don't think that having the spamclean.py code makes it any easier.

In general, though, you may well be right. To be honest, I'm kind of uncertain if I'm correct in releasing these things. As I said above, I've thought about it and I think the benefits outweigh the dangers, but I hope I'm not wrong.

As I said on InterWikiSoftware:WikiGatewayGeneralDiscussion, though, if all goes awry it'll be easy for anyone to block WikiGateway from their site; WikiGateway hardcodes the names of the HTML edit form elements for each WikiEngine, so changing the layout and names of the edit forms would render it inoperable, at least unless the user spends the time to update the WikiGateway code.

– BayleShanks

There is no way to say it's wrong. Any technology can be used to cause problems. If email had never been invented we wouldn't have spam, but email is too useful to say it shouldn't exist. I just am not too excited about any kind of wiki bot (even Manni's). Not only would releasing a cleaning bot potentially help spammers or give vandals something to play with, it gives everyone the ability to auto clean wikis, whether it's their wiki or not and whether their cleaning is wanted or not. It would be nice to kill off spam, but the potential problems a bot can cause worry me. And in the hands of inexperienced or overeager users it really worries me. – Joe

I'm having a huge problem with wiki spam. It's a rather bizarre circumstance. The corner-carvers Wiki is an old version of PhpWiki and the admin is out of contact, so upgrading to a new version is not an option. I have proposed a remote cleaner with off-site backup since it's a small wiki and the load won't be too high. I intend a rule-based approach: keywords + number of links + etc. The problem is I am not an experienced Perl user. Any pointers to existing modules or procedures that could be incorporated to save time would be appreciated. I'm looking at WikiGateway and XML-RPC but I may have to implement it some other way because there is no interface for PhpWiki (yet). I mirrored the problem Wiki with UseMod. Here is my page on WikiSpamFighting. Any suggestions or comments will be welcomed.

Roy.

Hi Roy. I don't think there are any existing anti-spam bots for PhpWiki. So far there seem to be three different bots, all working with UseMod / OddMuse wikis (we should refactor this page to make it clearer). WikiMinion is the most effective, mainly because RichardP is running the bot 24 hours a day. Fantastic service! Would be great if he could extend this to PhpWiki installations too, but obviously there's some development effort involved. He's the man to talk to though, I guess. Maybe he'll share the source code with you. – Halz - 2005-07-16 09:59 UTC

Halz & Roy,

Currently WikiMinion supports UseMod, OddMuse, PurpleWiki, OpenWiki, MediaWiki and MoinMoin. I've had requests in the past to add support for PhpWiki to WikiMinion, and I'd be more than happy to do so; unfortunately, PhpWiki doesn't support all of the features that WikiMinion requires. WikiMinion's robot and spam identification heuristics basically rely on four features from a wiki: the ability to get a list of recent edits, the ability to get for each page a page history consisting of a list of prior revisions of the page going back several days, the ability to reliably identify the IP address or host name from which edits originated, and the ability to retrieve the content of both the current and the older revisions of a page. Another feature, the ability to compare two revisions to see the differences, isn't actually required by WikiMinion. However, if a wiki lacks that feature I am unwilling to have WikiMinion keep the wiki clean, because it is too hard for me to occasionally check the edits made by WikiMinion to the wiki to make sure that WikiMinion isn't making mistakes. Most installations of PhpWiki lack one, two or all three of the following: an adequate page history, a mechanism to reliably identify an origin IP address or host name, or the ability to compare revisions. So I haven't been able to add support for PhpWiki to WikiMinion. – RichardP - 2005-07-16 11:31 UTC
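The requirements listed above can be written down as an interface that a per-engine adapter would have to satisfy. The Python below is only an illustration of that checklist; the method names are invented and none of it is WikiMinion's actual code.

```python
from typing import List, Protocol


class WikiAdapter(Protocol):
    """What a cleaning bot needs from a wiki engine (hypothetical names)."""

    def recent_edits(self) -> List[str]:
        """Names of recently edited pages."""
        ...

    def page_history(self, page: str) -> List[int]:
        """Revision numbers going back several days."""
        ...

    def edit_origin(self, page: str, revision: int) -> str:
        """IP address or host name the edit came from."""
        ...

    def revision_text(self, page: str, revision: int) -> str:
        """Content of the current or an older revision."""
        ...


REQUIRED = ("recent_edits", "page_history", "edit_origin", "revision_text")
OPTIONAL = ("diff",)  # only needed so a human can check the bot's work


def missing_features(adapter):
    """Names of required or optional operations the adapter does not offer."""
    return [name for name in REQUIRED + OPTIONAL
            if not callable(getattr(adapter, name, None))]
```

An adapter for a typical PhpWiki installation, as described above, would come back with some mix of `page_history`, `edit_origin` and `diff` missing.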

Hi Richard, As I mentioned above I am pretty new to Perl. I'm struggling a bit and it would be very helpful to have your script source as a reference. I can be reached at Wikiperl at protoworksdotcom. I have been in communication with Bayle but you're pretty good at remaining anonymous. ;) (mail sent to Manni) Roy.

Roy, I'm happy to discuss technical details with you. Let's move the discussion to the new PrivateAntiSpamBot page you created. – RichardP - 2005-08-05 00:55 UTC


There's a bot created by TilmannHolst called 'spare.pl'. See his message on usemod.com. Looks like he's making the code publicly available. – Halz - 2005-08-22 12:47 UTC


Just spotted something maybe Joe would find interesting. pywikipediabot is a wiki bot framework in Python, for remote operations on MediaWiki installations.

Hmmm. This page is a mess. It needs refactoring and splitting into a separate page for each bot. – Halz - 15th Sept 2005