Ann posted a link to Bad Behavior a few days ago, but I didn't realize how useful it was. It's a WordPress plugin that denies spammers based on analyzing the HTTP requests and comparing them to profiles from known spambots. Bad Behavior targets any badly designed software directed at a website. It provides a generic interface that can be integrated into virtually any PHP-based software. – Joe - 2005-04-30 11:41 UTC
Found something interesting, but strange, in my server logs. Take a look at PrivateStrangeLogEntries. – Manni - 2005-04-29 09:03
Just found this in our list of referrers. No idea who set it up, but from the HTML it seems like it was an IE user. – Manni - 2005-04-29 08:15
Ann and I have been studying the GooglePray spammer and have gathered about as much info as we can understand. See my post. – Joe - 2005-04-28 18:15 UTC
From Spam Huntress we learned about Reffy and TheTrafficProject.com. Some more info can be found here, here, and here.
I just chongqed a few of their domains that I was able to find as referrer spam. It's harder to chongq those since they don't have keywords, but they were worth it. This is the kind of site that really deserves a good chongqing: thetrafficproject.com adminshop.com reffy.net the mass referrer marketer reffy link dump generate traffic
Today is the one year anniversary of chongqing. More on my blog. – Joe - 2005-04-28 07:51 UTC
We are on the front page of spamfo.co.uk with this post by Chris Hunter.
And this post by Aunty Spam is interesting: she found some blog spam software being advertised. It tells us a lot about the features that are available, which we should look for ways to block. – Joe - 2005-04-25 20:23 UTC
Got a really odd referrer on my blog. Has anyone seen similar? Any clue what it is? – Joe - 2005-04-25 04:57 UTC
Mysterious. Could be anything hey? Maybe it's a forum with some spammers discussing your site, or some people from Chongqing getting confused by it. – Halz - 2005-04-25 09:04 UTC
Well, since it just happens to show up on a lot of referrer pages where other spam does, I suspect it's where some group of spammers are discussing what pages to spam and, in this case, what to go read to learn how to spam better (or laugh at how dumb they think we are). – Joe - 2005-04-25 09:16 UTC
Ann, the most popular PHP-based wiki engine is probably MediaWiki. Of the wiki engines I'm aware of, the one with the most user authentication features is TWiki, however, it is implemented in Perl. TWiki is not as popular as MediaWiki. The presentation of both MediaWiki and TWiki can be altered, but TWiki's templating system and skins are more flexible. Both MediaWiki and TWiki are more complex to install and maintain than other wiki engines. I'm not aware of any Wiki engine that prevents changes from going live until approved by a moderator (what you call 'premoderation'). However, you could implement premoderation by installing two copies of your preferred wiki software. Configure a "live" wiki to only allow edits by moderators and a "beta" wiki that allows edits by anyone (or just registered users). The moderators can then monitor the "beta" and periodically copy approved modifications to the "live" wiki. Neither MediaWiki nor TWiki has what I would consider to be best-of-breed anti-spam features, but both have adequate anti-spam features, especially if you intend to limit edits to registered users. Also consider MoinMoin, it has excellent anti-spam features and a decent security model, however, it is written in Python. – RichardP - 2005-04-22 12:23 UTC
I just found a rather crude premoderation feature in PHPWiki. Hopefully it's not too cranky to use and install. With premoderation, I may not need that many other antispam features? – Ann
I'm not familiar with PhpWiki, however after a quick glance at the documentation it looks like it will do most of what you want. However, it doesn't appear to implement something like the template feature you were interested in. It does look like it has some anti-spam features. – RichardP - 2005-04-22 13:20 UTC
I don't have much experience with PhpWiki either, but that is partly because it is not frequently spammed, so on spam hunts I don't see it often. Since you already use WordPress I guess you have some familiarity with PHP. If you are interested in customizing things it's always good to go with a scripting language you know. Of course you should find a wiki engine that has (at least most of) the features you need first, but it looks like you may have done that. Don't forget to check out plugins and modules, sometimes some really good features are not available by default.
Another PHP based wiki is ErfurtWiki but I only know slightly more about that one. It seems highly customizable but does not appear to use templates. I don't know about the features you were looking for.
– Joe - 2005-04-22 15:03 UTC
Unfortunately, I have to use premoderation. My bunch are very likely to try to defame someone else. They do it often enough via e-mail to me, and have on occasion tried doing it in public, so I don't want to tempt them too much. I would have gone for MediaWiki in a second, but they seem adamant about not implementing premoderation features. As for PHP, yeah, I have hacked a few small things. But the same is true of PERL. I just think PHP is kinder on the server. – Ann
Hi Ann,
Yeah you're running a tight ship if you use premoderation. Most wikis are more relaxed about defamation issues (after all, any user can remove the defamation). But if you are set on doing that, I would guess that spam will not be a big issue for you. For starters it will only be seen by your moderators, and they will, of course, decline the submission. And then because of the spam-attracts-spam effect we have observed, you are unlikely to get much. – Halz - 2005-04-22 15:43 UTC
Hmmmm thinking about it, there is a logical reason why you won't find so many wiki engines supporting 'premoderation'. It doesn't work so well. Imagine a page has a sentence with a typo. User A comes along and corrects it, but the correction is not available immediately, only submitted for moderation. Now if user B comes along and decides to move the sentence to a different part of the page, remember they are working with the original text, including the typo. How does the engine reconcile this now? It cannot just present a list of sequential changes to the moderator user. It requires some nasty merging tricks. See what I mean? So maybe the only way to do this really is to have two copies, as Richard suggested. A moderated snapshot, and an editable version. – Halz - 2005-04-22 15:57 UTC
Halz, that is a really good point. I had a quick look at how PhpWiki's moderation handles more than one edit to the same page. The changes stack up, each change consisting of an action along with a diff showing what was changed. The moderator can approve or deny any of the changes and he isn't limited to a strict ordering. However, PhpWiki makes no attempt to reconcile incompatible changes. It is up to the moderator to notice that two or more changes stomp on each other - presumably the moderator must then choose to apply one of the changes and manually make the other changes. – RichardP - 2005-04-22 16:28 UTC
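To make the "changes stomp on each other" case concrete, here is a minimal sketch of how a moderation queue could flag pending edits that touch the same lines of the same base text. It is not PhpWiki's code; the function names and the sample page are hypothetical, and it only uses Python's standard `difflib`. It is a warning heuristic for the moderator, not a merge tool.

```python
import difflib

def changed_lines(base, edited):
    """Return the set of line indexes in `base` that an edit touches."""
    matcher = difflib.SequenceMatcher(a=base, b=edited, autojunk=False)
    touched = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # A pure insertion has i1 == i2; record the insertion point instead.
        touched.update(range(i1, max(i2, i1 + 1)))
    return touched

def conflicting_pairs(base, pending):
    """Return index pairs of pending edits that touch overlapping base lines."""
    touched = [changed_lines(base, edit) for edit in pending]
    return [
        (a, b)
        for a in range(len(pending))
        for b in range(a + 1, len(pending))
        if touched[a] & touched[b]
    ]

# Hypothetical page: user A fixes the typo, user B moves the same sentence.
base = ["Intro.", "A sentance with a typo.", "Outro."]
edit_a = ["Intro.", "A sentence with a typo.", "Outro."]
edit_b = ["Intro.", "Outro.", "A sentance with a typo."]
print(conflicting_pairs(base, [edit_a, edit_b]))
```

Both edits are computed against the same original text, which is exactly why the second one still carries the typo.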
I saw that too. Most of the time I'd do all the moderation, except when I'm travelling. And I'd probably fix some things myself. Looks like it could work for me. I'm just concerned that quite a few have given up on PHPWiki, and some have had trouble installing it. I've heard MediaWiki is a dream, so I'm sad that I can't use it. – Ann
Since you are worried about users defaming each other then setting up a second wiki isn't going to work for you. You want to prevent people from even seeing the defamation in the first place. A second wiki only prevents unauthorized changes to the good copy.
Most users probably aren't going to attempt moderation on a wiki so it is another feature you just aren't likely to find a really good implementation for. You are likely to have conflicts with edits though as Halz and Richard have mentioned, but even with really good conflict management there is always a point where two or more changes to the same section have to be done manually.
Edit conflicts can be quite annoying even without moderation, we experience them sometimes here when two people end up editing the same page at the same time. Someone then has to go back and merge the two since this wiki doesn't automatically merge them. Depending on how active your wiki is that may not be a problem as long as you keep up with the moderation most of the time. You may want to add another user or two you trust to moderate as well (without full admin permissions if possible). – Joe - 2005-04-22 21:23 UTC
Note for Ann: if you put a -- ~~~~ after your post it will automatically put in a signature with time and date for you. It will use your login if you are logged in or use your IP address otherwise. Many wikis support this or something similar. – Joe - 2005-04-22 21:45 UTC
What a busy day on the forum! I have two announcements:
– Manni - 2005-04-20 14:21
A few questions/comments about the module are on the SpamCatching Module page. – Joe - 2005-04-20 17:31 UTC
I just noticed someone logged in as Texas Holdem about 29 minutes ago, I can only imagine that was a spammer who thought better before actually spamming. Manni, any more info you can pull out of the logs? This isn't the first login without spam that I suspected, but this one is pretty obviously not a normal user login name choice. – Joe - 2005-04-20 08:26 UTC
Hmm. I vaguely remember spam that was saved with this user name, so he might still have the old cookie that remembered that value. It was somebody from Bulgaria who came to the texas-holdem page yesterday (without providing a referrer). Today, he came back and took a fairly long look at random pages, clicking on the 'Edit' link a couple of times. No idea what he was trying to find out or do. Seems he never saved. – Manni - 2005-04-20 10:56
I found this interesting, a "Search Engine Relationship Chart." They have trademarked that phrase and only allow people to use it online when linking to their site, which I find slimy, since that is a totally descriptive title. They have an SEO Code of Ethics that does say they should follow all spamming laws but doesn't say not to do it; as we know, laws are far behind the spammers (spamdexing and webspam are not illegal but should be). And they list what major search engines consider spam. – Joe - 2005-04-20 07:25 UTC
Yeah there's some interesting stuff on that site actually. I think the first paragraph there explains the whole problem of SEO tactics well. They seem to be advocating playing within the rules, which is good, but they could be just as evil as any other SEO company behind the scenes. – Halz - 2005-04-20 09:07 UTC
That is a problem with choosing any SEO company. No matter how good their site looks and even if they say they are white hat you can never be sure unless you catch them doing something evil. There is no way to prove you are always a white hat SEO. – Joe - 2005-04-20 09:13 UTC
It seems that SE Chart (whose name cannot be used) is not very up to date. In a bit of research for Links Experiment I found a few relationships that didn't seem to match up. – Joe - 2005-04-21 08:49 UTC
Got a referrer from Yahoo Search for SEO chongqing and discovered posts on my blog are results 1 and 2, chongqed.org is 6 and 10. – Joe - 2005-04-20 01:02 UTC
After chongqing the jerk who spammed my blog I wrote this to blogger support. While I doubt it will do any good I figured it was worth it. I decided it would be fun to list them on the LoserWebmasters page as they should be if they don't do anything.
title: blogger member spamming other blogs A blogger member spammed my blog with many links to http://yomilf.com/, I had limited who could post comments to registered users to prevent abuse and still have the problem. I have now set it to accept comments only from blog members which I don't like doing. The blog post that was spammed: http://chongq.blogspot.com/2004/07/more-on-casino-online-on-line.html The profile of the spammer: http://www.blogger.com/profile/3783873 He also has two blogs that appear to be purely for the purpose of spamming. Months ago I reported similar blogs that were used entirely for spamming and blogger support was not interested in dealing with it. Blogs should not be allowed to be used for spamming.
Hostname: host-116-35.rev.vline.pl
IP Address: 217.76.116.30
UserAgent: "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL; rv:1.7.5) Gecko/20041108 Firefox/1.0"
Referrer: site:blogspot.com +"post a comment" +html (p 14)
time: 18/Apr/2005:17:56:13 -0400 (18 Apr, Mon, 17:12:42)
This appears to be a manual spam job. He loaded images from my blog which wouldn't happen with the usual bot.
From that search I identified a few other victims, I gave up but I am sure there are a lot more:
http://zaydoun.blogspot.com/2004/12/kuwaiti-christmas.html
http://minimsft.blogspot.com/2005/03/ex-microsoftie-spotting.html
http://harrykstammer.blogspot.com/archives/2004_09_01_harrykstammer_archive.html
http://larrylyons2.blogspot.com/2005/03/rashawn-brazell-blog-movement.html
http://mousewords.blogspot.com/2005/03/mouse-is-punk-rocker.html
http://orangepulp.blogspot.com/2005/03/partly-fruity-with-chance-of-pulp.html
http://iraqnow.blogspot.com/2005/03/amber-alert-another-missing-headline.html
http://shrillblog.blogspot.com/2004/09/matthew-yglesias-is-one-of-us-now.html
http://tigerhawk.blogspot.com/2005/03/american-traitor.html
http://montages.blogspot.com/2005/01/voter-turnout-in-iraqi-elections.html
http://rigorousintuition.blogspot.com/2005/01/do-you-know-this-woman.html
http://xiaxue.blogspot.com/2005/03/fucking-fucking-stupid.html
http://iraqpundit.blogspot.com/2004/11/hating-everybody.html
http://internet-apps.blogspot.com/2005/01/meta-muddle-at-google.html
Apparently he didn't spam the blogs at the top of the search results. My blog and those others listed were on page 13 and 14. And of course he didn't spam every blog probably because some of them only allowed members to comment, which sadly I have gone back to for now.
– Joe - 2005-04-19 02:22 UTC
He recently attempted to attack another post, but was unable since I limited comments to only members.
Hostname: host-116-23.rev.vline.pl
Referrer: site:blogspot.com +"post a comment" +html (p 17)
time: 19/Apr/2005:04:58:29 -0400 (19 Apr, Tue, 04:14:58)
A few victims from page 17:
http://canadiancomment.blogspot.com/2005/04/sponsorship-inquiry-buzz-ii.html
http://lyzzy.blogspot.com/2004_12_01_lyzzy_archive.html
http://myurbankvetch.blogspot.com/2004/12/just-not-that-into-this-book.html
– Joe - 2005-04-19 10:55 UTC
I can think of lots of evil wiki tactics a spammer could use to make the spam harder to notice. I guess you guys also ponder such things. Lucky we're not spammers, hey? I feel the need to describe these somewhere e.g. PrivateSmartSpamTactics. It might help us to think about counter-measures before the spammers actually think of these tactics, but really there's not much benefit of describing these tricks, just a bit of fun… and the risk that the information could fall into the wrong hands. Anyway in general it's amazing how rare it is that a spammer does anything remotely smart. There's the link replacement trick, seen recently from top-point.net here:
http://www.employees.org/~alokem/magenta/index.cgi?action=browse&diff=1&id=Contact
…but they're really only scratching the surface, when you think how sneaky spammers could be, to avoid being noticed. – Halz - 2005-04-18 16:36 UTC
That is the same guy that hit our wiki this weekend. While not the sneakiest we have seen, I named him SneakyBastard since he was certainly trying to be. How about PrivateWhatIfSpammersWereNotSoDumb, probably way too long. ;-) – Joe - 2005-04-18 17:14 UTC
I ran across an antispam blog with some good info for bloggers. It hasn't been updated since February though. It's run by the owner of netnerds.net which is an active blog so I am sure the antispam blog will get some attention someday. Even inactive it has more info on blog spam than we do since we are mainly wiki people. – Joe - 2005-04-17 09:14 UTC
Sam sent me a note "here's something you might find entertaining.. a SEO wiki: http://www.organicseo.org/"
This section has a quite interesting take on the whole thing, written as if it wasn't bad-mouthing the whole SEO wiki idea (almost). Basically it says the wiki isn't going to have any of the really good stuff, because why would SEO pros give it away? Makes sense to me. But there is still info there that could be useful for improved chongqing.
– Joe - 2005-04-16 10:42 UTC
Just a few days ago Manni mentioned: "Google is important, but there still are other engines." Yet according to this SEO tool, the other engines don't care much about chongqed.org. – Joe - 2005-04-15 21:14 UTC
Other tools at that site showed we need description and keyword tags for the wiki. – Joe - 2005-04-15 21:23 UTC
A domain name even more entertaining than chongqing.org. Rather than steal the spammer's favorite keyword, this spam fighter stole the domain. Likely to prevent whois lookups, the spammer started spamming for the domain before he even bought it. A victim of the pre-purchase referrer spam noticed this and became the proud owner of http://jagk.com/, now an antispam site.
I am getting sick of the phentermine spammer that keeps hitting WikiHome. Each time he uses a different domain so he can't be blocked in the usual way. This guy and the .ru/.su spammer keep using the same subdomains, pagenames, and keywords. I wonder if it's time to start a new keyword based blacklist. It shouldn't be in the main one since it could cause too many false positives. But we could have another database of the worst keywords. In the past I haven't been too excited about keyword based BannedContent, but it's obvious spammers are finding ways around domain based blacklists. I think a spammy keyword based list is better than adding every free host and redirect service a spammer uses. – Joe - 2005-04-14 23:33 UTC
If you are curious, WikiMinion is reliably identifying edits by the phentermine spammer by virtue of the fact that he isn't using open proxies: so far he has always inserted his spam from the following IP addresses: 69.50.184.211-221, 69.50.187.83-94, and 69.50.191.195-198. – RichardP - 2005-04-15 01:17 UTC
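For anyone who wants to test an address against the three ranges quoted above, a minimal sketch follows. It is not WikiMinion's code (which is Perl); the function name is hypothetical, and it assumes the ranges are inclusive at both ends, using only Python's standard `ipaddress` module.

```python
import ipaddress

# The three ranges quoted in the post, assumed inclusive at both ends.
SPAMMER_RANGES = [
    ("69.50.184.211", "69.50.184.221"),
    ("69.50.187.83", "69.50.187.94"),
    ("69.50.191.195", "69.50.191.198"),
]

def is_known_spammer_ip(addr):
    """Return True if `addr` falls inside one of the listed IPv4 ranges."""
    ip = ipaddress.ip_address(addr)
    if ip.version != 4:
        return False  # the listed ranges are IPv4 only
    return any(
        ipaddress.ip_address(low) <= ip <= ipaddress.ip_address(high)
        for low, high in SPAMMER_RANGES
    )

print(is_known_spammer_ip("69.50.187.90"))
```

A check like this only helps as long as the spammer keeps posting from the same addresses rather than through open proxies.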
The general strategy I use in WikiMinion for handling identification of spammer URLs is a domain/path blacklist, a domain/path whitelist, and a URL regular expression exception list. I don't list free hosts, instead I list exact spammer domain/paths and try to catch future spammer edits with "spammer_sub_domain\.($list_of_free_hosts)" regular expressions. I could, in theory, handle everything with regular expressions, but it is much more efficient to apply the simple domain blacklist/whitelist to the URLs before checking the regular expressions - in Perl this gives me nearly two orders of magnitude performance boost. – RichardP - 2005-04-15 01:17 UTC
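The two-stage idea described above (cheap exact lookups first, regular expressions only for what is left) can be sketched roughly as follows. This is an illustration in Python, not WikiMinion's Perl; every list entry, the order in which the whitelist and blacklist are consulted, and the names `classify` and `domain_path` are assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical sample data, not WikiMinion's real lists.
WHITELIST = {"www.example.org/"}
BLACKLIST = {"spam.example.com/pills"}
FREE_HOSTS = ["freehost.example", "freepages.example"]
SPAMMER_SUBDOMAINS = ["cheap-pills", "phentermine"]

# Equivalent in spirit to "spammer_sub_domain\.($list_of_free_hosts)".
SUBDOMAIN_RE = re.compile(
    r"^(?:%s)\.(?:%s)$" % (
        "|".join(map(re.escape, SPAMMER_SUBDOMAINS)),
        "|".join(map(re.escape, FREE_HOSTS)),
    ),
    re.IGNORECASE,
)

def domain_path(url):
    """Return (host, host + path) for a URL, lowercasing the host."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return host, host + (parts.path or "/")

def classify(url):
    """Stage 1: exact set lookups. Stage 2: regex, only if stage 1 is silent."""
    host, key = domain_path(url)
    if key in WHITELIST:
        return "ok"
    if key in BLACKLIST:
        return "spam"
    if SUBDOMAIN_RE.match(host):
        return "spam"
    return "unknown"

print(classify("http://cheap-pills.freehost.example/buy"))
```

Set membership is a constant-time hash lookup, so most URLs never reach the regular expression stage, which is the plausible source of the speedup mentioned in the post.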
That is some interesting stuff. Does WikiMinion do that check for list_of_free_hosts? If it does, how about a PrivateFreeHostsList? It certainly makes sense to be more strict on them since totally blocking them is not good. I always try to add the URL including subdomain when it's on a free domain (unless it's only used by spammers), but I have never liked that solution because it's so easy for spammers to just create another subdomain. – Joe - 2005-04-15 01:39 UTC
Yes, WikiMinion defines a number of Perl variables that can be used by the regular expressions in the regular expression based URL blacklist. The four that are relevant to this discussion are a list of free subdomain-based web hosts, a list of free subdomain-based redirection services, a list of free path-based web hosts, and a list of free path-based redirection services. I distinguish between subdomain-based vs. path-based and web host vs. redirection because most spammers seem to have a strong preference for one of these different kinds and rarely use the others. In addition, the subdomain-based vs. path-based lists are used differently in the regular expressions. Thanks for creating the PrivateFreeHostsList page, that is a good idea. It'd be nice to not have to track them all down myself. I'll add the ones that WikiMinion currently knows about to the PrivateFreeHostsList page. – RichardP - 2005-04-15 03:47 UTC
Reformat it however makes it easiest to work with WikiMinion if you need, currently you're the one using the data. If we do anything with it for chongqing we can base it on the format you use. I had thought about separating some of the different types (though not nearly as far as you have) but didn't realize it would be useful. This list would be helpful for my chongqing utility too. I finally added a command line switch to output subdomains or not, but it's still manual and for the entire input file. This would help in my plan to automate preparing spam for chongqing.
Sounds like WikiMinion is far more complex than I had imagined. – Joe - 2005-04-15 04:29 UTC
Now the spammer is into using dynamic dns hosted servers. SpamHuntress posted something today about this kind of spam. – Joe - 2005-04-15 20:18 UTC
Those IP addresses he has been using point to atrivo.com which is run by Emil Kacperski in California. Apparently Atrivo and Emil have been a spam problem for a while. – Joe - 2005-04-15 20:40 UTC
Discussion moved to DelayedIndexing Experiment.
Discussion moved to Links Experiment.
Swiss spam: http://www.emacswiki.org/cgi-bin/emacs/2005-03-16 – MattisManzel
Wow. I almost missed that one. Thanks, Mattis. Seems Alex Schröder found spam for inspiriertwohnen.ch on emacswiki.org. The discussion with the spammer (or the guy responsible for the domain) is pretty fruitless. – Manni - 2005-04-13 08:16
Spam and Broken Windows: http://taint.org/2005/04/11/222338a.html – Joe - 2005-04-11 22:01 UTC
Was just chongqing some of the Wakka spammer's most recent URLs and was wondering if plugging my script into the CaughtSpam details pages would be useful? It could suck the URLs and keywords out and get them ready for chongqing. Most spam Dan catches is because the URLs are already known, but not always. – Joe - 2005-04-11 06:31 UTC
I forgot about one big shortcoming of the script: it only works right if the URLs are one per line, which we know is not always true. Someday I will get that figured out hopefully. And of course the other one is to do something like your numericalizer or it won't work for Chinese spam. I can stand running it manually until I get those solved. – Joe - 2005-04-11 06:49 UTC
Thanks to Manni's numericalizer code, which I was able to translate to Python pretty easily (mostly just changing for loop syntax), I can now handle Chinese spam on my own. Next I will be trying to solve the multiline problem, which Manni also gave me an idea for. That I don't think will be as easy. – Joe - 2005-04-14 13:57 UTC
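The two problems mentioned in this thread can be sketched in a few lines of Python. This is not Joe's script or Manni's numericalizer, neither of which appears here; it assumes that "numericalizing" means replacing non-ASCII characters with numeric character references, and the regular expression and function names are illustrative only.

```python
import re

# Match http/https URLs wherever they occur on a line, stopping at
# whitespace, quotes, and angle or square brackets.
URL_RE = re.compile(r"https?://[^\s<>\"'\[\]]+")

def extract_urls(text):
    """Return unique URLs in order of appearance, any number per line."""
    urls = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url and url not in urls:
            urls.append(url)
    return urls

def numericalize(text):
    """Replace non-ASCII characters with numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

sample = "spam http://a.example/x, http://b.example/y. \u4e2d\u6587"
print(extract_urls(sample))
print(numericalize(sample))
```

Scanning the whole text with `findall` instead of reading line by line sidesteps the one-URL-per-line assumption, at the cost of occasionally catching stray trailing characters that the `rstrip` call does not cover.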
I just tried an experiment with the Wakka page that our most persistent spammer keeps hitting. I put a redirect on that page to see if the spammer would follow it to BackToTheFutureLooser. As I guessed, he didn't fall for it. I wonder about making the page readonly to see what would happen, but then we wouldn't be able to collect URLs from him if he quit spamming us. Any opinions? – Joe - 2005-04-08 21:33 UTC
Joe, can you change the code? Since his guestbook spamming robot didn't follow the redirect, it probably just processes a list of previously compiled web pages that appear to contain edit forms. An interesting experiment might be to globally change the name of the edit action from "edit" to "edit2" in chongqed's code. This would break any software that directly invokes an edit page, and presumably chase the guestbook spammer away until his software finds the new edit form. I'd be curious to see how long that takes. – RichardP - 2005-04-09 00:55 UTC
We will have to wait till Manni reappears for that experiment, I can't access that kind of stuff. It would be very interesting to see the results of that too and how fast he reacts, but if he quits spamming us we couldn't collect his URLs so easily. That's why I didn't make the page read only without asking what everyone else thought. – Joe - 2005-04-09 01:34 UTC
Thought of another thing to try. Will the spammer replace his existing spam? I copied his spam back to the Wakka page and logged in as his last used name. It really was me, don't clean it until we see another visit from him. I am sure it won't be there long. I would really like to know why he searches for the Back to the Future crew. Is he trying to prevent respamming pages he already has spammed? – Joe - 2005-04-09 07:23 UTC
Looks like I was unsuccessful in my attempts. I am not that interested in trying again right now, posting it as "Steven Wolff"; maybe we will learn something anyway. I got caught as a spammer and almost had to see the FF version of the cookie redirect since I guess I mistyped the password that would have allowed me to post spam. – Joe - 2005-04-09 07:42 UTC
The Malaysian spammer that spammed the WikiMinion page twice last night had an alarming referrer string: he searched Google for wikiminion. Richard, I guess you will have to do something about this. If spammers start using the word 'wikiminion' as a spam magnet (which makes sense!), the wikis you are cleaning are in trouble. – Manni - 2005-04-08 08:20
Yeah, this doesn't surprise me all that much. I've been expecting exactly this to occur eventually, although not this soon. As a minor countermeasure a while back I segmented the wikis protected by WikiMinion - on some WikiMinion makes changes under the username "WikiMinion" and on others it uses "RichardP". Hopefully that makes it a tiny bit more difficult to find every wiki protected by WikiMinion. At the other end of the spectrum I could implement a random username for every anti-spam edit, but frankly I think that specific solution is worse than the problem - it would be really really unfriendly to any valid wiki users who are trying to keep tabs on WikiMinion to make sure it isn't making mistakes.
If wiki spammers are indeed going to begin specifically targeting wikis protected by WikiMinion, I think the best response is to take advantage of the fact that this places WikiMinion in a good position to detect new spam domains and new spam hosts earlier than it would otherwise do so. If these wikis are going to be targets because they are protected by WikiMinion, they might as well act as honey pots/spam traps for the rest of the wiki community. Since WikiMinion already automatically discovers spam domains and hosts, the only missing piece is finding a good way for WikiMinion's database updates to be incorporated into public black lists. – RichardP - 2005-04-08 07:37 UTC
Sending the URLs WikiMinion collects would be very useful for early detection of new spam. But of course automated adding to a blacklist isn't safe. If spammers start targeting WikiMinion they could spam a protected wiki for their nonspammer competition to have them blocked. Which wouldn't make much difference since the competition is a nonspammer, but it's still not desirable to have a nonspammer in the blacklist. That is why Manni and I do all our additions manually. If anything looks fishy we research it carefully. But WikiMinion could help collect information that humans would confirm and then add to a blacklist. – Joe - 2005-04-08 07:50 UTC
Yes, I should have been clearer. WikiMinion's database updates involve a manual confirmation step. During the course of its normal operation WikiMinion assembles a report that contains, among other facts, a list of previously unidentified hosts and domain names that it suspects are in use by spammers and the location it found them. At least once a day I examine the contents of the report to confirm that suspect hosts and domain names are indeed in use by spammers. Using the contents of the report, information collated from such tools as whois/nslookup/google, and by inspecting the originating edits I mark those hosts/domains that should be changed from suspect to spammer status and authorize a database update.
However, while naturally I think WikiMinion's database is accurate I wholeheartedly agree that since both you and Manni are ultimately responsible for the quality of the chongqed.org blacklist it would not make sense for you to blindly trust changes I've authorized to WikiMinion's database when building your blacklist. – RichardP - 2005-04-08 08:17 UTC
Now that I know more about how WikiMinion works it sounds like you handle it pretty well. With your current setup though I doubt it would be useful to directly input into chongqed's DB since we concentrate on keywords and are not only being a blacklist. Automated collecting of URLs with keywords is not nearly as easy as collecting just URLs. If you could figure out some way of capturing keywords too I think we could set up something. I have a very rough utility that I sometimes use for preparing links where no keywords were used for chongqing. – Joe - 2005-04-08 08:37 UTC
Manni mentioned that he was going to release his OddMuse hacks. What does that include? Does Dan go with it? If he does, it would be nice to have him change the CaughtSpam page directly rather than an external file that is included. I suspect not everyone has the include files option active. And then the history for that page would be more useful. Currently if you hit diff as I often do on other pages to see the recent change it does no good, you just get the last reformatting of the page. For us it's fine, but for widespread use editing the actual page would be more desirable. What other stuff is included in your hacks module? – Joe - 2005-04-07 19:47 UTC
Here is a list of the features of that module:
You are right, what I currently do with the CaughtSpam page isn't the best solution possible. But I find it hard to come up with something different. A regular wiki page that gets auto-edited might be a solution. E.g., it would be easy to not list tests of the functionality there (you could simply delete the link). I'll give this a try. – Manni - 2005-04-08 08:14
The CaughtSpam page could be locked (as it has been here) and Dan given an editor account (if possible). It would be nice to preserve more info about the spammers Dan catches. Currently we lose the login they tried to post with. For us at least, UA would be nice to have too. And having an option to hide CaughtSpam edits (as minor) might be nice.
Be sure to set the refresh time on the blacklist for long enough, probably longer than it is here since that is just a local download. If this module becomes popular you may have a lot of traffic. You also might allow it to download other compatible blacklists instead of or in addition to ours.
What will the module be called? Manni's AntiSpam Module, chongqed.org AntiSpam Module, AntiSpamDan? I am not sure Dan being on lots of wikis is a good idea but the module could be named after him. We already have spammers logging in as Cleaner; Dan is less likely to be a victim of identity theft, but it's possible. Having each wiki choose their own 'bot' name probably would be best.
– Joe - 2005-04-08 07:02 UTC
Manni, will the version of CaughtSpam that you release automatically fetch new versions of the chongqed.org blacklist? That is one of the features of MoinMoin that I admire (although it uses a page at moinmaster as its source). It would be great if there was a module I could recommend to OddMuse administrators that would keep their BannedContent page up to date from a centralized source. If you did offer this feature, it would be great if the admin had a preference variable he could set to specify the URL for the blacklist (that way they could choose to use the emacs wiki blacklist or some other source). – RichardP - 2005-04-08 07:06 UTC
RichardP's edit just overwrote mine with no diff. Did you break the conflict detection again? – Joe - 2005-04-08 07:11 UTC
Who? Me? No!
I haven't decided on a name for the module, yet. I thought about "Spam catching extension" and "Advanced content blocking". I don't think I'll mention Dan, though.
Yes, the module will download the blacklist from chongqed.org. What I still have to do is 'parameterize' lots of things, like the name of the spam catcher, whether to have CaughtSpam show up as a minor edit or not, and where to get the blacklist. I will also make the output include more information. I could list all the posted parameters and some of the HTTP headers.
I planned to have the blacklist downloaded each time it gets older than 24 hours. Bandwidth isn't that big an issue because the blacklist is comparatively small. Currently, the module doesn't use the BannedContent page for the blacklist. I see at least three options here and I don't know which one might work best: 1) I could simply ignore BannedContent. 2) I could use our blacklist and BannedContent so people could influence what is being banned (at least add to what is being banned). 3) I could try to replace the contents of BannedContent with our blacklist each time the blacklist is retrieved.
– Manni - 2005-04-15 07:38
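The refresh rule described above (fetch the blacklist again once the local copy is older than 24 hours, from a URL the admin can configure) might look roughly like this. It is only a sketch in Python rather than the module's actual code; the function names, the cache path and the injectable `fetch` helper are assumptions made for illustration.

```python
import os
import time
import urllib.request

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour refresh interval mentioned above


def is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the cached file is missing or older than max_age."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age


def refresh_blacklist(url, path, fetch=None):
    """Download the blacklist to path if the cached copy is stale.

    url is whatever blacklist source the admin configured (hypothetical
    preference). Returns True if a download happened, False otherwise.
    """
    if not is_stale(path):
        return False
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u) as resp:
                return resp.read()
    data = fetch(url)
    tmp = path + ".tmp"
    with open(tmp, "wb") as fh:  # write to a temp file, then swap it in
        fh.write(data)
    os.replace(tmp, path)
    return True
```

Keeping the URL as a parameter covers the request for a preference variable, so an admin could point the module at a different compatible blacklist.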
I would just leave the existing BannedContent handling as it is and keep the blacklist your module uses totally separate. That way you're not changing existing features. And our list is so big it's not really worth putting on a wiki page. As the JoesTempSpamHolder pages have shown, there is a limit on page size.
What are the current daily download numbers on the blacklist?
– Joe - 2005-04-15 06:36 UTC
How did the spam holder pages show that there was a limit? Can't remember that. I have no idea how often the blacklist is downloaded. From the logs I cannot tell which domain's or subdomain's root was requested, all I get is '/' which doesn't tell me that much. – Manni - 2005-04-15 08:59
At a certain point, those pages wouldn't allow me to add any more text. I am not right at the limit, since I usually try to add 20 to 100 lines at a time, but each of those pages is near the max limit. I don't remember the exact error; it may just have been that nothing happens when you try to save an edit. But I have experienced it with most if not all of the JoesTempSpamHolder pages. – Joe - 2005-04-15 07:14 UTC
OddMuse has a default upper limit for the total size of the edit form of 210K, thus limiting pages to about 200K. In addition, some browsers impose a limit on the amount of data that can be entered in a form's textarea field, irrespective of any limits on the server side. I know of several popular browsers with a 32K limit and I've been told that some have a 64K limit. The page JoesTempSpamHolder is currently about 70K. The chongqed.org blacklist is about 170K. – RichardP - 2005-04-15 09:22 UTC
You got me interested in finding out the actual sizes. The search page gives them:
JoesTempSpamHolder: 84K
JoesTempSpamHolder2: 105K
JoesTempSpamHolder3: 105K
JoesTempSpamHolder4: 98K
JoesTempSpamHolder5: 7K
Around 105K seems an odd size limit based on what you said, but I have no doubt it's somehow related. According to Windows, JoesTempSpamHolder is 79.4K saved in ASCII. If OddMuse stores pages in UTF-8, that explains the slightly larger size: Windows reports 85.2K in that case.
– Joe - 2005-04-15 09:35 UTC
Did you notice that our inspiration's main domain appears to have been banned by Google? Search link or site. PageRank is 0. Sadly the .org version is still around. – Joe - 2005-04-07 07:57 UTC
Hmm. Nice. Searching for inurl:emmss or for emmss itself, returns the 'correct' results. Another SEO expert that did his job really well. – Manni - 2005-04-07 10:09 UTC
That bastard with his Russian links is driving me nuts. From now on, every link from a non-editor user to a site with a ru or su tld is considered spam and will be caught by AntiSpamDan. – Manni - 2005-04-05 17:31
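A rule like this one (any link from a non-editor to a .ru or .su host counts as spam) could be sketched as follows. This is Python for illustration only, not AntiSpamDan's actual code; the URL pattern, the function names and the `is_editor` flag are assumptions.

```python
import re
from urllib.parse import urlparse

BANNED_TLDS = {"ru", "su"}
# Rough URL matcher, good enough for a sketch; not a full URL grammar.
URL_RE = re.compile(r"https?://[^\s<>\"\]]+", re.IGNORECASE)


def banned_links(text, banned=BANNED_TLDS):
    """Return the URLs in text whose host ends in a banned TLD."""
    hits = []
    for url in URL_RE.findall(text):
        url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        try:
            host = urlparse(url).hostname or ""
        except ValueError:  # malformed host, e.g. broken brackets
            continue
        tld = host.rstrip(".").rsplit(".", 1)[-1]
        if tld in banned:
            hits.append(url)
    return hits


def is_spam(text, is_editor):
    """Editors are exempt; everyone else is caught on any banned link."""
    return (not is_editor) and bool(banned_links(text))
```

Note that the check looks only at the last label of the host name, so a host such as guru.example.com is not affected.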
I wondered about something similar yesterday. I don't have a problem with this. But what if a user comes here to post what was spammed to his wiki and doesn't know to put it inside a pre tag? Is he going to end up at your cookie redirect? – Joe - 2005-04-05 15:49 UTC
Yes, he will. I could change the directions given on each edit page, though. – Manni - 2005-04-05 17:52
When does the redirect get set off? Is it when they hit the second save (crash) button? – Joe - 2005-04-05 15:55 UTC
No. It works like this: Somebody saves a page that includes spam. Spam is detected and somebody gets to see the "Don't post spam" warning. This page will also set the cookie. The redirect happens when somebody accesses the wiki with the spammer cookie set. – Manni - 2005-04-05 17:57
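The sequence described above (spam save, then warning page plus cookie, then redirect on any later access) can be modelled in a few lines. This is a simplified Python sketch, not the wiki's real handler; the cookie name and the response labels are made up for illustration.

```python
SPAMMER_COOKIE = "spammer"  # hypothetical cookie name


def handle_request(action, cookies, looks_like_spam=False):
    """Return (response, cookies_to_set) for one request.

    action is "view" or "save"; cookies is the dict the browser sent.
    """
    # Any access with the cookie already set gets redirected.
    if cookies.get(SPAMMER_COOKIE) == "1":
        return "redirect", {}
    # A spammy save shows the warning page, which also sets the cookie.
    if action == "save" and looks_like_spam:
        return "warning", {SPAMMER_COOKIE: "1"}
    return "ok", {}
```

So the redirect never fires on the save itself, only on the next request that carries the cookie. A client that drops cookies would see the warning every time but never the redirect.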
Dan just got him!!! – Joe - 2005-04-05 15:57 UTC
Now he is trying a proxy from ufl.edu. I have an uncle that teaches there. I am going to try and find someone to contact there about this. – Joe - 2005-04-05 21:11 UTC
Can we assume from his continued spamming that he is not being affected by your redirect? Maybe he clears cookies frequently or disabled JS. We know he uses them since he logs in every time. – Joe - 2005-04-06 06:32 UTC
No, not quite. 1) these are all different machines and I don't think that he is using them as proxies. I guess these are zombies that run his bot. 2) You don't need to have cookies enabled to sign your postings with a name. You just need to fill out the Username field. Remember that this is the guy who will try to leave a user name everywhere, even if it isn't supported. – Manni - 2005-04-06 09:02
Can you look up the info on this guy from when he posted from ufl.edu? Maybe if I give them enough info they can track it down, and maybe they will let me know if it is a proxy or a zombie. And you could put up whatever other info you can gather on russianspammer. – Joe - 2005-04-06 07:36 UTC
Isn't there already enough about this guy on BackToTheFutureII? When he used x40-215.dhnet.ufl.edu, he didn't provide a referrer and identified as 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'. – Manni - 2005-04-06 09:53
I didn't know this was the same guy. --Joe - 2005-04-06 07:57 UTC
A site with a collection of tons of different UserAgents: http://www.pgts.com.au/pgtsj/pgtsj0208c.html
Some interesting blog posts I found just now:
Interview with "Texas Hold'em" blog spammer which points to a Lockergnome interview with a reformed spammer.
New Scam in Comment Spam in which the author discovers spammers are sneaky. We have known of this method for a long time. Is it really going unnoticed by bloggers?
– Joe - 2005-04-05 09:10 UTC
That Lockergnome interview is certainly interesting (and entertaining). But the stuff about spammers being 'sneaky'? Has this guy spent the last years hidden under a rock? "Nice site. <URL>" is certainly not quite new. – Manni - 2005-04-05 11:40
Not willing to try it myself at the moment, but I am wondering about the new spam prevention cookie stuff. Have we got any victims yet? And more interestingly, anything from its clipboard feature yet? Did you add the Russian Spammer's URLs to the DB? – Joe - 2005-04-04 15:01 UTC
Can't blame you for not wanting to try it ;-) – Currently, all I do is redirect to the site quoted on Slashdot. So the clipboard contents are beyond my reach. I'm working on a version hosted here, however. We haven't got any victims yet. Of course, I chongqed the Russian spammer. We know how persistent this guy is. Thank god we have Oddmuse's surge protection feature. – Manni - 2005-04-04 17:53
I just wondered since he had hit so many times. Must have been before he was chongqed. – Joe - 2005-04-04 18:06 UTC
I did a little logging on the redirects and it seems that we have redirected 9 spammers in the last 20 hours. – Manni - 2005-04-05 14:52
I just accidentally walked into that redirect (I didn't realize I had the spammer cookie set in my IE). Quite effective. I knew what was going on, so I was able to get out using Task Manager pretty quickly. I only heard "I am watching gay porn" two or three times before I was able to react and hit mute. – Joe - 2005-04-09 07:42 UTC
Regarding our PageRank: We might think about yet another reorganization of the db output. Currently, we have pages for keywords and pages for spammers. I guess fewer pages would be better for Google. But how could we cut down on the number of pages? The problem is that one spammer can have many keywords and that one keyword can be used by different spammers. – Manni - 2005-04-01 13:54
I have been trying to figure that out too. The only thing I can think of that still follows the keyword chongqing idea is to categorize the keywords. Like porn, shoe, seo, photo, cars, electronics, lamecontest, etc. That would be a pain to do for all the keywords we have now, but we need to do something. Some would obviously overlap, so we should be able to mark a word as fitting different categories depending on the spammer. If a porn spammer spams the word photos it's porn, but if someone is spamming car photos then it's car spam. Then instead of linking to each keyword page, the spammer page can link to the keyword category page. – Joe - 2005-04-01 12:10 UTC
For that each category page could contain some custom text and maybe list (some of) the keywords in the category. Another idea that I had long ago was to use a set of document pieces and based on a hash of the keyword build a page with those pieces. So pages would have a different combination of text. Just mixing 3 versions each of 4 paragraphs we would have a pretty good number of different page texts (81 if my math is good). I think our problem is not only the number of pages, but that they are all 99% identical. Both probably are hurting us since there is no other real content on the spammers subdomain. – Joe - 2005-04-01 12:29 UTC
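The hash idea above can be sketched like this: hash the keyword, then use the hash to pick one version of each paragraph, so the same keyword always gets the same page but different keywords get different mixes. With 3 versions of each of 4 paragraphs that gives 3^4 = 81 combinations. This is a Python illustration only; the paragraph texts are placeholders and the function names are invented.

```python
import hashlib

# Placeholder text pieces: 4 paragraph slots with 3 versions each.
PIECES = [
    ["Intro A about {kw}.", "Intro B about {kw}.", "Intro C about {kw}."],
    ["Why A.", "Why B.", "Why C."],
    ["How A.", "How B.", "How C."],
    ["Outro A.", "Outro B.", "Outro C."],
]


def pick_variants(keyword, pieces=PIECES):
    """Map a keyword to one version index per paragraph slot."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    n = int.from_bytes(digest, "big")
    picks = []
    for versions in pieces:
        n, r = divmod(n, len(versions))  # peel off one base-3 digit per slot
        picks.append(r)
    return picks


def build_page(keyword, pieces=PIECES):
    """Assemble the page text for a keyword from the chosen versions."""
    picks = pick_variants(keyword, pieces)
    return "\n\n".join(
        versions[i].format(kw=keyword) for versions, i in zip(pieces, picks)
    )
```

Because the choice depends only on the keyword, nothing has to be stored in the DB, and a page does not change between visits.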
So, regarding the categories, what you are suggesting is that people link to spammers.chongqed.org/category using the keyword 'keyword'? As in [http://spammers.chongqed.org/category keyword]? Yes, this would seriously cut down the number of pages served by the db. But we currently have 12,215 keywords in the db! And think about all the Chinese crap. Who's gonna categorize this? – Making the pages less identical sounds like a much better and viable idea. But how? – Manni - 2005-04-01 14:49
Maybe these pages shouldn't have so much text, in fact. We could just make it look like what it is, a database record on a data-driven website, i.e. remove the text paragraphs and just have a couple of links to explanation pages for 'What is a wiki?' etc. That way there's not so much text which is (seemingly) duplicated across many pages.
Providing more genuine information would increase the variety. E.g. when was the spammer submitted? When were they last spotted? On which wikis have they been spotted? How do we categorise this spammer? But this is coming back to my idea of making every spammer page a wiki page, which is a bad idea because there are so many of them (numerical domain names and such). – Halz - 2005-04-01 13:40 UTC
Less text might help make the pages not so identical, but the thousands of pages are also a problem. Google has indexed only a small fraction, if you go by how many pages link to chongqed.org. It's only a couple hundred, and many of those are internal links, most from the spammers subdomain. There should be 12,215 keyword pages and however many spammer pages too.
For now we could just categorize all the Chinese spam together. That sucks, but will do. Some of it we can categorize based on who left it. Emmss stuff is probably SEO keywords, the panasonic guy is electronics. For the English stuff a large number can probably be categorized by a few regex strings. A lot of our keywords are multiple-term keywords, and a match on one word should usually identify it.
We seem to be stuck though: we either do it now, or wait a few months and do it when the DB has twice as many keywords. A temporary solution may be to add the category as input to new stuff we add to the DB. But that is a pain too. We wouldn't be able to just throw a ton of stuff in and hit submit. We would have to do it a piece at a time based on category. That makes this idea sound even worse to me. But I don't know how else we are going to reduce the number of pages a lot. – Joe - 2005-04-01 15:06 UTC
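The "few regex strings" approach mentioned above could be sketched as an ordered list of patterns where the first match wins, with everything written in Chinese characters lumped into one category. This is Python for illustration; the patterns and category names are guesses, not anything taken from the DB.

```python
import re

# Ordered (category, pattern) pairs; the first match wins, so put the
# more specific categories first. All patterns here are illustrative.
CATEGORY_PATTERNS = [
    ("chinese", re.compile(r"[\u4e00-\u9fff]")),  # any CJK ideograph
    ("porn", re.compile(r"\b(porn|xxx)\b", re.I)),
    ("cars", re.compile(r"\b(cars?|auto)\b", re.I)),
    ("shoe", re.compile(r"\bshoes?\b", re.I)),
    ("seo", re.compile(r"\b(seo|search engine)\b", re.I)),
    ("electronics", re.compile(r"\b(camera|dvd|panasonic)\b", re.I)),
    ("photo", re.compile(r"\bphotos?\b", re.I)),
]


def categorize(keyword, default="uncategorized"):
    """Return the first category whose pattern matches the keyword."""
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.search(keyword):
            return name
    return default
```

With this ordering, "cheap car photos" ends up under cars rather than photo, which matches the idea that one matching word usually identifies a multi-term keyword. Categorizing by who left the spam would need the spammer as a second input and is not covered here.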
What are the chances of using PF to categorize the keywords? With the XML-RPC API we should be able to feed it keywords and train it without using email. Then once it's trained enough we feed it the rest of the keywords and hope. That still doesn't solve the extra annoying step for new entries, but it could really help in categorizing existing stuff. I am sure there would be lots of errors, but it would be a start. – Joe - 2005-04-01 15:13 UTC
And what about the existing chongqing links out there? Even if we did categorize all our keywords, we couldn't simply serve up 404 errors in response to existing keyword links. – Manni - 2005-04-01 17:14
Well, that is likely just something we are going to have to accept. I don't think we can leave all the pages as they are now. It's just too many for Google to want to index from a single site. We could redirect each keyword page to its new category page, but we will lose any PR benefit of existing links. But at this point it's not doing us any good because we have PR 0 on the subdomain anyway. Are there any spammer pages that have a higher ranking? I know the wikispam page does, but that is likely benefiting from being relatively new. We know Google likes new pages. – Joe - 2005-04-01 15:20 UTC
Hey, I get a reading of PR1 for spammers.chongqed.org. OK, I admit that this isn't exactly the kind of earth-shattering stuff we are looking for. [1] [2] have PR3. [3], [4], [5] and lots of others have PR2. But that's about it. – I trimmed the pages for keywords and spammers down quite a bit. Let's see if that helps. – Manni - 2005-04-01 21:04
By trimmed you mean less identical text? – Joe - 2005-04-01 19:20 UTC
Yup. I even removed the chongqed.org navigation links. – Manni - 2005-04-01 21:28
Hey! I just found a nice referrer from Amazon! – Manni - 2005-04-01 21:31