Estimating Google's number of servers using the WWII tank method
posted Sep 16, 2007

Website stats that show referrer information look similar to this:

51	51	 0.38%	http://www.google.com.mx/search

But if you have a Google Gadget that makes requests to your website, you'll see something like this:

...
9	9	0.47%	http://9.gmodules.com/ig/ifr
8	8	0.27%	http://22.gmodules.com/ig/ifr
8	8	0.07%	http://89.gmodules.com/ig/ifr
5	5	0.12%	http://34.gmodules.com/ig/ifr
5	5	0.18%	http://36.gmodules.com/ig/ifr
5	5	1.12%	http://38.gmodules.com/ig/ifr
...

When a Google Gadget requests data, the request goes to a proxy server at Google, which then repeats it to the original destination. The proxy is needed because of the browser's same-origin policy: JavaScript can't make requests to domains other than the domain of the page it's on.

The proxy servers also cache the requests and responses. If a Google Gadget wants a fresh copy for every request, it can add some #randomdata to the end of the request. The gadget mentioned above does this.
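As a rough illustration of the cache-busting trick, here is a minimal Python sketch. A real gadget would do this in JavaScript, and the exact format of the random data isn't shown here, so the function name with_cache_buster, the example URL, and the 16-character hex token are all assumptions made for the example.

```python
import secrets
from urllib.parse import urldefrag


def with_cache_buster(url):
    """Return url with any existing #fragment replaced by a random one.

    Hypothetical helper: the token format is an assumption, not the
    format the gadget actually uses.
    """
    base, _old_fragment = urldefrag(url)
    # 8 random bytes rendered as 16 hex characters.
    return f"{base}#{secrets.token_hex(8)}"


# Hypothetical URL, used only for illustration.
print(with_cache_buster("http://example.com/data.xml"))
```

Each call produces a different string, so a cache keyed on the full request string would treat every request as new.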

Anyway, here's the interesting thing. Because the proxy servers use a unique ID in their referrer strings, we can try to estimate how many of them there are. Why would we do this? Because Google is notably secretive about the number of boxes they employ.
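To get the numbers the estimate needs, the IDs have to be pulled out of the referrer strings. A minimal Python sketch, assuming the stats lines are tab-separated like the listing above and that the ID is always the numeric label in front of gmodules.com (the names proxy_ids and GMODULES_ID are made up for this example):

```python
import re

# The six referrer lines shown in the listing above.
SAMPLE_LINES = [
    "9\t9\t0.47%\thttp://9.gmodules.com/ig/ifr",
    "8\t8\t0.27%\thttp://22.gmodules.com/ig/ifr",
    "8\t8\t0.07%\thttp://89.gmodules.com/ig/ifr",
    "5\t5\t0.12%\thttp://34.gmodules.com/ig/ifr",
    "5\t5\t0.18%\thttp://36.gmodules.com/ig/ifr",
    "5\t5\t1.12%\thttp://38.gmodules.com/ig/ifr",
]

# Assumption: the ID is the numeric subdomain label before gmodules.com.
GMODULES_ID = re.compile(r"https?://(\d+)\.gmodules\.com/")


def proxy_ids(lines):
    """Return the set of numeric gmodules IDs found in the given lines."""
    ids = set()
    for line in lines:
        match = GMODULES_ID.search(line)
        if match:
            ids.add(int(match.group(1)))
    return ids


ids = proxy_ids(SAMPLE_LINES)
k = len(ids)  # number of unique IDs seen
m = max(ids)  # highest ID seen
print(k, m)   # 6 89
```

Those six lines are only an excerpt, so 6 and 89 are not the real figures for any day. The full log is what matters.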

We can use the German Tank method to do this estimation. If I were a real statistician, I'd know the actual name of the formula. But I'm not, so let's call it the German Tank method.

Yesterday's logs showed 62 unique *.gmodules.com subdomains, the highest of them numbered 103. With m as the highest ID seen and k as the number of unique IDs, the formula (m-1)(k+1)/k = (103-1)(62+1)/62 gives 103.645. Let's say 104.
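The arithmetic can be checked with a few lines of Python. The sketch below computes the formula as written in this post and, for comparison, the commonly quoted form of the German tank estimator, m + m/k - 1. That textbook form assumes the IDs run 1..N and that each is equally likely to turn up in the logs. Whether the gmodules IDs behave that way is an assumption. The function names are made up for this example.

```python
def post_estimate(m, k):
    """Formula as written in this post: (m - 1)(k + 1) / k."""
    return (m - 1) * (k + 1) / k


def textbook_estimate(m, k):
    """Commonly quoted German tank estimator: m + m/k - 1.

    Assumes IDs are numbered 1..N and sampled uniformly without
    replacement, which may not hold for the proxy servers.
    """
    return m + m / k - 1


m, k = 103, 62  # highest ID seen, number of unique IDs seen
print(round(post_estimate(m, k), 3))      # 103.645
print(round(textbook_estimate(m, k), 3))  # 103.661
```

The two forms differ by well under one server here, so either way the answer rounds up to 104.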

The same formula applied to three other days gives 104, 101, and 103.

This doesn't really tell us much. I'm sure those gmodules proxy servers (of which there are roughly 103) represent a small fraction of Google's architecture. But it's interesting to know whether there are about 100, 500 or 1000 of them.

I suspect Google will change the proxy request referrer string to remove the unique ID :)

Comment by Walter
posted Sep 17, 2007

Interesting, but I'm pretty sure I read somewhere that Google has a system in place to automatically allocate servers to particular tasks based on load. Not sure if that applies here, of course.

Comment by LPD
posted Sep 17, 2007

Good point. That would make the estimation even less meaningful :)