
DelayedIndexing Experiment

While we are experimenting, there is one experiment I have wanted to try out. DelayedIndexing seems likely to cause Google to remove a page from its index if the bot visits during one of the delay periods. I would like to see whether Google removes a page instantly when the bot hits it with noindex set. I would also like to know what happens to that page's PageRank when it comes back, and how long it takes to reappear in the index. It's not a terrible option if your wiki is really being hammered and you're desperate, but I don't think it's a good general solution. DokuWiki is planning on including this as an option in the next release. Ward's wiki is already using it. Someone should do some experimenting on this before it becomes a widespread "solution." – Joe - 2005-04-14 08:06 UTC
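For readers unfamiliar with the mechanism, delayed indexing as discussed here can be sketched roughly as follows. This is a minimal Python sketch, not how DokuWiki or Ward's wiki actually implement it; the function name `robots_meta` and the 10-hour window are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical delay window; the thread does not say what value any wiki uses.
NOINDEX_DELAY = timedelta(hours=10)


def robots_meta(last_edited, now, delay=NOINDEX_DELAY):
    """Return the robots meta tag for a page, based on how recently it was edited."""
    if now - last_edited < delay:
        # Recently edited: ask crawlers to skip the page until it has settled.
        content = "NOINDEX,NOFOLLOW"
    else:
        # No edits within the delay window: let crawlers index it normally.
        content = "INDEX,FOLLOW"
    return f'<meta name="robots" content="{content}">'


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(robots_meta(now - timedelta(hours=1), now))
    print(robots_meta(now - timedelta(days=3), now))
```

The tag flips back to INDEX,FOLLOW on its own once the window passes, so what a crawler sees depends entirely on when it happens to visit.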

That would be an interesting experiment. However, how long it takes for Google to re-add a page after removing a meta robots noindex tag probably depends a great deal on how often the server is indexed by Google. Google says that they revisit popular sites that change frequently as often as several times a day, while months may go by between visits to lower popularity sites that change infrequently. Personally, if I were to implement delayed indexing for a wiki, I would probably add "rel=nofollow" to external links on recently edited pages rather than a whole page noindex tag. – RichardP - 2005-04-14 09:06 UTC

I wonder, though, whether the noindex will cause Google to check that page even less frequently. It may crawl the rest of the site, but why keep checking a page that said noindex last time? It would be more efficient not to check it. Google does a lot to optimize its crawler and avoid putting a strain on the webserver it's crawling. If the page has already been identified as noindex, it makes sense to me not to look at it again for a while.

Adding the rel=nofollow would be good for not helping spammers, but a spammed revision can still make it into Google, which will attract more spam. Spammers aren't smart enough to check for nofollow. An ethical spammer (if that paradox can even exist) would stop spamming a page that used it, but just spamming the page is easy and checking for nofollow is another step. Why take the extra minor step? Just spam everything and hope it works. Many probably have no clue what the tag even does.

Joe - 2005-04-14 09:40 UTC

I think I can already provide some data on this one. The last time the Google bot hit a page on this wiki that has NOINDEX,NOFOLLOW was just a couple of minutes ago: it had a look at the history of the SpamBlockLoop page. I grepped the logs and here are the visit dates to that page's history: 14-Apr, 5-Apr, 25-Mar, 4-Mar, 23-Feb. Of course, these aren't enough data points to try to find a pattern, but how about "twice a month"? This page (WikiForum) should be a bit more popular, but I find the same pattern for this one's history.

It's interesting to note that no other spider had a look at either page. Either other robots actually never look at a NOINDEX page or they don't like to visit pages with cgi-script parameters. But we should not forget about other robots. Google is important, but there still are other engines.

How about the INDEX,FOLLOW pages? SpamBlockLoop shows the same pattern as its history page. WikiForum gets one visit per day.

Of course, the history pages always had a NOINDEX,NOFOLLOW tag. But it's hard to imagine that Google is prepared for a robots meta tag that can change daily.

Manni - 2005-04-14 13:06
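The log check described above (grepping for Googlebot hits on a single page) could be scripted along these lines. This sketch assumes the Apache "combined" log format and simply looks for the substring "Googlebot" in the user agent; neither detail is confirmed in the thread, and user agents can be spoofed. The function name `googlebot_visits` is made up for the example.

```python
import re
from datetime import datetime

# Assumption: Apache "combined" log format. Adjust the pattern for other servers.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_visits(lines, path_fragment):
    """Return the dates on which a Googlebot user agent requested a matching path."""
    visits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # Skip lines that are not in the expected format.
        if "Googlebot" in m.group("agent") and path_fragment in m.group("path"):
            ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            visits.append(ts.date())
    return visits


def days_between(visits):
    """Return the gaps, in days, between consecutive visit dates."""
    ordered = sorted(visits)
    return [(b - a).days for a, b in zip(ordered, ordered[1:])]
```

Calling `googlebot_visits(open("access.log"), "SpamBlockLoop")` (the log file name is hypothetical) and passing the result to `days_between` would give the revisit intervals directly instead of eyeballing the dates.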

I can see this is going to be a hard thing to fully figure out. The main question, though, is whether the page gets dropped immediately (it should). How soon it is put back seems to depend on how often the page normally changes. Google revisits active sites much more frequently than inactive ones, so it makes sense that it would do the same for specific pages of a site. How it affects the PageRank is still an important question. PR is supposed to be based on links, so it's hard to believe that you would lose PR in the long run, but newly indexed pages don't have any PR. So if the existing score is deleted from the DB when the page is removed from the index, it could take a good while to get it back. Google doesn't update PR that frequently. – Joe - 2005-04-14 11:29 UTC

Halz just pointed out that TWiki's BlackListPlugin uses the method suggested by RichardP above. They put rel=nofollow on new external links for a certain amount of time. I am not sure how it's implemented, but it is definitely worth looking into. – Joe - 2005-04-20 08:54 UTC

Sounds very interesting. But I thought about nofollow for a while yesterday and I'm just not sure about it. Let's see how our experiment turns out, but: of course it's good that nofollow links will not enhance the page rank. On the other hand, Google did make it sound like the linked page would nevertheless be indexed even if no one else linked to it. That could be bad. But then again, I guess our friends do use the submit pages of Google et al., so there wouldn't be any harm. – Manni - 2005-04-20 11:10

I think that TWiki plugin applies nofollow not just to new external links, but to all external links on the page that was edited. Doing it on just the newly added links would be the best solution, but that's got to be a tricky thing to implement. I don't think they've achieved that with this plug-in. See also the discussion here. – Halz - 2005-04-20 09:18 UTC
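One way the "only newly added links" variant might work is to compare the URLs in the previous and current revisions and mark only the difference. The following is a rough Python sketch of that idea, not a description of TWiki's BlackListPlugin; the URL pattern and the names `new_external_links` and `render_link` are assumptions for illustration.

```python
import html
import re

# Simplified URL pattern (an assumption); a real wiki would reuse its own link parser.
# Limitation: trailing punctuation directly after a URL is captured as part of it.
URL_RE = re.compile(r'https?://[^\s<>"\[\]]+')


def new_external_links(old_text, new_text):
    """Return the URLs that appear in the new revision but not in the old one."""
    return set(URL_RE.findall(new_text)) - set(URL_RE.findall(old_text))


def render_link(url, nofollow_urls):
    """Render an anchor tag, adding rel="nofollow" only for newly added URLs."""
    rel = ' rel="nofollow"' if url in nofollow_urls else ""
    safe = html.escape(url, quote=True)
    return f'<a href="{safe}"{rel}>{safe}</a>'


if __name__ == "__main__":
    old = "See http://example.org/a for details"
    new = "See http://example.org/a and http://spam.example.com/x for details"
    added = new_external_links(old, new)
    for url in URL_RE.findall(new):
        print(render_link(url, added))
```

The tricky part mentioned above remains: the wiki has to remember which links are "new" for the length of the delay, across several edits, rather than only comparing the last two revisions as this sketch does.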

I started running the experiment on April 27 2:30am CST. I had 3 pages with PR2 that I was willing to play with. The main domain is PR3 (root last crawled 4-26-05, Google cache date 4-16-05 which may not mean much). All three are linked from the main page. This experiment could take a while since the site is not frequently crawled. – Joe - 2005-04-27 08:07 UTC

I gave up on the other two pages. One was only a control page to compare how often it was crawled, but Googlebot does not crawl all pages at once so it doesn't really matter. The other was trying only noindex, but Google did not seem to be interested in crawling it. – Joe - 2005-05-07 07:32 UTC

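Google April/May 2005

| testing | cache date | prev crawl | noindex + | gone | noindex - | crawl | back | cache date | noindex + | gone | noindex - |
|---|---|---|---|---|---|---|---|---|---|---|---|
| meta noindex, nofollow | not cached | 4-26 | 4-27* | 4-30 | 4-30 | ? | 5-1 | 4-27 | 5-1* | 5-7 | 5-7 |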

* Somehow the page is gone, but I can't find any access that looks like Googlebot from after I added the noindex.

? Somehow the page is back in the index and retained its original PageRank, but no one but me has accessed that page since I changed it. There were a total of 5 accesses to that page according to my logs and they all came from my IP Address. Looking at accesses to robots.txt and User Agents, Google did not crawl my page at all today. Any ideas how this is possible? I have kept all the logs if anyone has any ideas what to look for. – Joe - 2005-05-01 03:31 UTC

Just in case anyone wonders, the exact meta tag used was:

<meta name="robots" content="NOINDEX,NOFOLLOW">

Text from c2:

Perhaps we could show Google an older version of the page until the "no index" timeout expires. – JeffGrigg

I wondered about that, but then you would have to detect when Google or other search engines visit. Luckily, it looks like there was nothing to worry about. I have been testing it out on our wiki, and so far there seem to be no bad effects. PageRank is restored if the page is only temporarily removed from the index. That could be different if a page happened to be out of the index while Google recalculates PageRank, but that doesn't happen very frequently. I am relying on the PageRank info we can get from the Google Toolbar, which may not be totally accurate or up to date with what they actually use for ranking, but it's all we can do. I will give it one more try just to make sure. See: http://wiki.chongqed.org//DelayedIndexing_Experiment

This could still be a problem if a page is very frequently edited, since it could be removed from Google's index for a long time, but that probably isn't a problem for most wikis if the noindex expires soon enough. I still don't especially like this idea, but it is good protection for a wiki. It is probably better to have your page not indexed than to have it indexed with spam, because that will attract other spammers. If the wiki is large and active with a good PageRank, Google will probably crawl it often enough that, if a page is removed from the index, it would not be very long before Google recrawled it and hopefully did not find another noindex. If the wiki is not active, this method will have little benefit anyway, since no one is likely to find and clean the spam before the noindex expires. Maybe the noindex mode could be turned on only if external links are added to the page. That would allow normal discussion to go on without the page always going into noindex mode.
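The "only when external links are added" variant suggested at the end of the paragraph above might look something like this. It is a minimal Python sketch under stated assumptions: the URL pattern is simplified, `own_host` stands for the wiki's own hostname, and the 10-hour default delay is invented for the example.

```python
import re
from datetime import datetime, timedelta
from urllib.parse import urlparse

# Simplified URL pattern (an assumption); a real wiki would use its own link parser.
URL_RE = re.compile(r'https?://[^\s<>"\[\]]+')


def external_urls(text, own_host):
    """Return the set of URLs in text that point away from the wiki itself."""
    return {u for u in URL_RE.findall(text) if urlparse(u).hostname != own_host}


def should_noindex(old_text, new_text, edited_at, now, own_host,
                   delay=timedelta(hours=10)):
    """True only if the edit added external links and is still inside the delay."""
    added = external_urls(new_text, own_host) - external_urls(old_text, own_host)
    return bool(added) and (now - edited_at) < delay


if __name__ == "__main__":
    now = datetime(2005, 5, 7, 12, 0)
    old = "Discussion text."
    print(should_noindex(old, old + " I agree.", now - timedelta(hours=1), now,
                         "wiki.example.org"))
    print(should_noindex(old, old + " http://spam.example.com/pills",
                         now - timedelta(hours=1), now, "wiki.example.org"))
```

With a check like this, ordinary comments without links would leave the page indexable, while an edit that introduces a link to another host would trigger the noindex period.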

The ideal version of delayed indexing would be to show an old revision to the search engines, but that could cause trouble, since it could be considered trying to trick the search engine, which can get a site banned. The chances of being caught are low, and if it is being done for a good reason, as it is here, it probably would be no problem. But you would also have to be sure to go back to a clean version. If some spam does slip past the expiry, the next cleanup would be the revision that gets the noindex, while the wiki shows the spammed revision to Google.
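To make the failure case in the paragraph above concrete, here is a small Python sketch of picking which revision a crawler would be shown. It assumes revisions are stored as `(edited_at, text)` tuples sorted oldest first and that "settled" simply means older than the delay; both are assumptions for illustration, and the sketch deliberately leaves out crawler detection.

```python
from datetime import datetime, timedelta


def revision_for_crawler(revisions, now, delay):
    """Return the newest revision older than the delay, or None if there is none.

    revisions: list of (edited_at, text) tuples, sorted oldest first (assumed layout).
    """
    cutoff = now - delay
    settled = [rev for rev in revisions if rev[0] <= cutoff]
    return settled[-1] if settled else None


if __name__ == "__main__":
    now = datetime(2005, 5, 7, 12, 0)
    delay = timedelta(hours=10)
    history = [
        (datetime(2005, 5, 1, 0, 0), "clean"),
        (now - timedelta(hours=20), "spam"),     # slipped past the delay unnoticed
        (now - timedelta(hours=1), "cleaned"),   # the cleanup is now the "new" edit
    ]
    print(revision_for_crawler(history, now, delay))
```

In this history the cleanup is too recent to be shown, so the function returns the spammed revision, which is exactly the problem described: age alone does not tell you that a revision is clean.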

There are potential problems with the whole delayed indexing idea, but it also provides relatively good protection from attracting new spammers, so it's hard to say whether it's good or bad. That depends on the activity of the wiki and how bad the spam problem is for that wiki. Another major factor in the decision is how important being indexed in Google is to the site. If the wiki is just a supplement to a larger website, maybe that doesn't matter, but if it's Wikipedia, you want to stay indexed in Google. It certainly is not good for all wikis: too small and it probably doesn't help, too big and it may hurt too much, but somewhere in the middle it could be a perfect solution.

Joe - 2005-05-07 22:51 UTC