Many wikis, forums, and blogs allow a limited set of HTML tags. Depending on which tags and attributes are permitted, some are vulnerable to spam hidden with inline CSS.
MediaWiki is the most popular software in which we have identified this problem. It blocks many tags, but not font tags or the style attribute, either of which can be used to hide spam. SpamHuntress first discovered this technique for spamming MediaWiki in June 2005. At chongqed.org we decided not to publicize the method at that time, but it is now becoming more common.
Halz discussed this problem with Brion, a MediaWiki developer. Brion's response was that the hidden spam is clearly visible in diffs and, because of nofollow, doesn't benefit the spammers anyway. The developers also don't want to remove features that users may already be relying on in their existing wikis.
Since MediaWiki is not interested in fixing it, we are going public. There are plenty of ways to accomplish this anyway, since spammers can use many CSS features to hide text.
MediaWiki development is focused on making the system work well for large wiki installations (especially Wikipedia) and is less concerned with features that matter mostly to small and inactive wikis. The MediaWiki readme admits this: "MediaWiki is primarily targeted as an in-house tool." It makes sense to work on things that will help Wikipedia, but MediaWiki is very popular for new small wikis, and those users have needs too.
Relying on the many users of a wiki to catch and revert spam works great for an active wiki, but most wikis aren't very active and spam can easily get overlooked. The rel="nofollow" attribute is a great feature for trying to fight spam, but it does not solve much. Spammers are too dumb and lazy to check whether a page uses the nofollow attribute, so they just continue spamming. Even if the spam links are not indexed as links, they still show up as text in Google. And as we know, spam attracts more spam.
For small wikis, hidden spam is especially hard to catch. If the spammer is the first user to edit the page, there is no diff to compare against and the page looks empty, so it is very unlikely to be caught as spam. This is especially bad on default pages.
Dirk of Geeklog says they allow users to post in HTML, but use a filter called kses to filter out any unwanted tags and attributes. It is written in PHP. This is a great solution if your software suffers from this problem.
If you want to protect your MediaWiki installation from CSS hidden spam, you can simply block the 'style' attribute. This is not a major restriction, since most styling can be achieved through other wiki syntax.
A less restrictive solution that covers most ways text can be hidden is to block specific style rules.
Both can be done by setting the $wgSpamRegex (or $wgSpamBlacklist in older versions) in your LocalSettings.php.
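For the stricter option, here is a minimal sketch of what a "block every style attribute" pattern might look like. It is written in Python so it can be tried offline before touching a live wiki. The pattern itself and the `has_style_attribute` helper are hypothetical illustrations, not something taken from MediaWiki or from the discussion below, and the PHP line in the comment is an untested assumption about how it would be written in LocalSettings.php.

```python
import re

# Hypothetical pattern (an assumption, not from the article): flag any tag
# that carries a style attribute, whatever the attribute contains.
# In LocalSettings.php this might be written as:
#   $wgSpamRegex = '/<[^>]*\bstyle\s*=/i';
STRICT_STYLE = re.compile(r"<[^>]*\bstyle\s*=", re.IGNORECASE)


def has_style_attribute(wikitext: str) -> bool:
    """Return True if any tag in the text has a style attribute."""
    return STRICT_STYLE.search(wikitext) is not None


# Made-up sample edits, for illustration only.
samples = [
    '<div style="display:none">spam</div>',
    "<span STYLE = 'color:red'>x</span>",
    '<div class="note">text</div>',
    "Writing style = personal taste.",
]
for text in samples:
    print(has_style_attribute(text), text)
```

Because `[^>]*` cannot cross a closing `>`, the word "style" only counts when it sits inside a tag, so ordinary prose that mentions style is left alone.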
The less restrictive method is:
$wgSpamRegex = "/\<.*style.*(display|position|overflow|visibility|height)\s*:.*>/i";
That should be much more effective and still allows you to use div tags and styles if you want. More on this at our honeypot wiki.
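Before putting a rule like this on a live wiki, it can help to try it against a few sample edits. The sketch below does that in Python, on the assumption that Python's `re` module treats this particular pattern the same way PHP's regex engine does. The pattern is copied from the line above without the PHP `/.../i` delimiters, and the sample strings and the `is_blocked` helper are made up for illustration.

```python
import re

# Pattern from the article, minus the PHP "/.../i" delimiters.
# re.IGNORECASE stands in for the trailing "i" flag.
SPAM_REGEX = re.compile(
    r"\<.*style.*(display|position|overflow|visibility|height)\s*:.*>",
    re.IGNORECASE,
)


def is_blocked(wikitext: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return SPAM_REGEX.search(wikitext) is not None


# Made-up sample edits, for illustration only.
samples = {
    "hidden div": '<div style="overflow:auto; height:1px;">[http://example.com pills]</div>',
    "plain styling": '<div style="color:red; font-weight:bold">Warning</div>',
    "no markup": "Ordinary wiki text with a [[link]].",
    "line-height": '<p style="line-height:1.5">Readable text</p>',
}
for label, text in samples.items():
    print(f"{label}: {'blocked' if is_blocked(text) else 'allowed'}")
```

Note the last sample: because the rule looks for the word "height" followed by a colon, it also catches line-height. That is the kind of legitimate use you may need to check for on an existing wiki.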
See Discussion below for other variations you may want to use depending on your site's needs.
PForret posted a solution for MediaWiki 1.4.7. It blocked any div tag, which solved the problem at the time, but style can be used on any tag. Still, he got us headed in the right direction. - Joe
I tested out that suggestion on my MediaWiki installation, then made a version of the regexp which caught the particular overflow/height attribute combination we have been seeing everywhere. It seemed to work well, so I wrote some instructions on meta wiki. In fact I wrote that whole section, so feel free to improve it!
I don't know much about regexps, so it took some trial and error to arrive at that. I also wrote that $wgSpamBlacklist had been renamed $wgSpamRegex, but that's actually just speculation about what software changes must have happened. Basically I was not the best person to write such instructions, but no MediaWiki developers seemed interested in providing this information.
I notice you've come up with another regexp, Joe. It looks better, actually. So that one says to disallow the style properties display, position, overflow, visibility, and height. That's a more restrictive setting, which will be more effective if the spammers decide to switch their CSS code around a bit. I guess we should recommend this one to administrators of small wikis out there which are getting hit by this spam, but maybe it would catch some legitimate wiki content in large existing MediaWiki installations. – Halz - 2006-01-04 23:10 UTC
Yep, it is more restrictive, but it allows more freedom since you can now use divs, even with styles, as long as they don't have the banned style properties. I think restricting divs totally is worse, since they are more likely to be used legitimately in existing wikis than those style properties. It would be even more spam proof if it included the font-size and color properties, but those clearly are useful. I really don't see a lot of need for formatting and positioning inside wiki text. Still, it could cause a conflict with a small number of wikis. In that case it would be easy to remove the property they need from the list.
It seems you were right about the change to $wgSpamRegex. I never found much about the change, but that has to be it. It is a better name, since there is now a spam blacklist extension and all this ever was is a regex.
I am no regex expert either, but I have messed with them several times before. I love the power they give you. But they require a lot of trial and error till you get used to them.
– Joe - 2006-01-05 05:03 UTC
Some spammer experimented with a font-size=2px setting. See: benmetcalfe.com/w/index.php?title=Main_Page&diff=1361&oldid=1305 – Halz - 2006-01-09 16:36 UTC
I said before that font-size was not as useful for hiding spam, but I was wrong. If you use a 0, the text is hidden. I think this regex will block the variations of extremely small font sizes, but it should be tested by others before I would recommend it. If there is a way to simplify or improve it, that would be great. The original was nice and short; this one doesn't fit on the screen.
$wgSpamRegex = "/\<.*style.*((display|position|overflow|visibility|height)\s*:|font-size\s*:\s*(\dp(x|t)|0+(\.[^3-9][\d]*)?\s*(e(m|x)|%|[^\.]))).*>/i";
That should catch any single-digit font-size in px or pt (0-9), anything less than 0.3 in em or ex, 0%, or just a bare 0.
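Since this longer rule is meant to be tested before anyone relies on it, here is a small sketch of a test harness. It uses Python on the assumption that Python's `re` module handles this pattern the same way PHP does. The pattern is copied from above without the PHP `/.../i` delimiters, and the `caught` helper and the sample values are made up for illustration.

```python
import re

# Pattern from the article, split across two string literals for readability.
FONT_REGEX = re.compile(
    r"\<.*style.*((display|position|overflow|visibility|height)\s*:"
    r"|font-size\s*:\s*(\dp(x|t)|0+(\.[^3-9][\d]*)?\s*(e(m|x)|%|[^\.]))).*>",
    re.IGNORECASE,
)


def caught(font_size: str) -> bool:
    """Wrap a font-size value in a sample span and test it against the rule."""
    tag = f'<span style="font-size:{font_size}">x</span>'
    return FONT_REGEX.search(tag) is not None


# Made-up values: the tiny sizes the rule targets, some normal sizes,
# and a fractional px size as an edge case.
for value in ["2px", "9pt", "0", "0%", "0.2em", "12px", "1em", "0.5em", "0.9px"]:
    print(f"font-size:{value:<6} {'blocked' if caught(value) else 'allowed'}")
```

With these samples, 2px, 9pt, 0, 0%, and 0.2em are blocked, while 12px, 1em, and 0.5em are allowed. The 0.9px case is also allowed, so fractional px or pt values from 0.3 up appear to slip through this version.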
One of my aims is to design this so that MediaWiki reports the entire tag that matches the whole regex, rather than just the part matched by an inner set of parentheses. This will make it a tiny bit harder for spammers to work around it. And if they do find a way, we will keep updating the rule.
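The difference between the whole match and an inner parenthesized match can be seen in a short Python sketch. This only illustrates how the regex itself behaves. Which of the two MediaWiki actually shows to the user is not established here, and the sample text is made up.

```python
import re

# Pattern from the article, minus the PHP "/.../i" delimiters.
FONT_REGEX = re.compile(
    r"\<.*style.*((display|position|overflow|visibility|height)\s*:"
    r"|font-size\s*:\s*(\dp(x|t)|0+(\.[^3-9][\d]*)?\s*(e(m|x)|%|[^\.]))).*>",
    re.IGNORECASE,
)

# Made-up sample edit, for illustration only.
sample = 'Buy now <div style="font-size:0">spam</div> thanks'
match = FONT_REGEX.search(sample)

# group(0) is the whole match: from the first "<" to the last ">" on the line.
whole_match = match.group(0)
# group(1) is only what the first set of parentheses captured.
inner_match = match.group(1)

print("whole match:", whole_match)
print("inner match:", inner_match)
```

The leading `\<.*` and trailing `.*>` are what stretch the whole match out to cover the full tag, so a report based on it gives the spammer less of a hint about which property tripped the rule.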
Of course, the simple solution is just to block all font-size properties. Or maybe just allow sizes specified as keywords (xx-small, etc.). That would conflict with a wiki that is already using numeric font sizes, but at least it would give users some control over font-size while preventing hidden spam. This version does that:
$wgSpamRegex = "/\<.*style.*((display|position|overflow|visibility|height)\s*:|font-size\s*:\s*\.?\d+).*>/i";
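A quick way to check that this version blocks numeric sizes while still allowing keyword sizes is sketched below in Python, again on the assumption that Python's `re` module treats the pattern the same way PHP does. The `caught` helper and the sample values are made up for illustration.

```python
import re

# Pattern from the article, minus the PHP "/.../i" delimiters.
KEYWORD_ONLY = re.compile(
    r"\<.*style.*((display|position|overflow|visibility|height)\s*:"
    r"|font-size\s*:\s*\.?\d+).*>",
    re.IGNORECASE,
)


def caught(font_size: str) -> bool:
    """Wrap a font-size value in a sample span and test it against the rule."""
    tag = f'<span style="font-size:{font_size}">x</span>'
    return KEYWORD_ONLY.search(tag) is not None


# Made-up values: numeric sizes should be blocked, keyword sizes allowed.
for value in ["12px", ".5em", "0", "120%", "xx-small", "larger"]:
    print(f"font-size:{value:<9} {'blocked' if caught(value) else 'allowed'}")
```

Any value that starts with a digit, or with a dot followed by a digit, is blocked regardless of its unit, so this rule is simpler than the previous one but also rejects ordinary sizes such as 12px.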
Then there is always the size attribute of the font tag, but IE 6 and FF 1.0 appear to enforce a minimum size (when the size is defined by HTML) that is far larger than invisible. Sizes less than 0.6em are displayed at 0.6em (at the browser's default text size setting). Sizes specified with a % or with no unit behave the same way and also have a minimum of 0.6em. Font sizes in px or pt units appear to be treated the same as sizes with no units. This minimum font size does not apply to sizes specified by CSS.
– Joe - 2006-01-19 18:08 UTC