Tuesday, April 05, 2005

Flickr: Forums: FlickrHelp: cannot reach flickr.com

The explanation for Flickr's meltdown yesterday...
If you understand it, do me a favor and explain it to me, because it went right over my head...

Flickr: Forums: FlickrHelp: cannot reach flickr.com: "So basically the issue folks was the network equivalent of 1 sparkplug of 8 cylinders only firing correctly 87% of the time.

Longer, more detailed version:

One of the two load balancers had one of its eight network processors in a borked state. It continued to pass traffic for sessions that had been started on it, but didn't recycle/clear old sessions. That one processor saw 64512 concurrent sessions and never cleared them, but they also didn't count against the global limit of the balancer, so it didn't fail over to the secondary.

The result was a decent amount of packet mangling (most SYNs never got ACKed), and it showed no pattern, nor a time slice of accepted sessions. Again, this only affected about 12.5% of the requests to the site, and it took a while to actually find a server out there that was seeing the issue.

Once I was actually able to get an ssh session on a box that experienced the issue, troubleshooting got a lot easier. :)

Man, that did suck.

We are now running on the other secondary load balancer, and a hardware post mortem is being done on the bad, bad, naughty one.
Posted 10 hours ago."
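
For anyone else left scratching their head, here is a rough toy model of the failure as I read it. It is only a sketch and not how the real load balancer works. The eight processors and the 64512 stuck sessions come from the post. Everything else is my own assumption: that new connections are spread evenly at random over the processors, that 64512 was a full session table, that the stuck processor's sessions were invisible to the global counter, and the made-up `GLOBAL_LIMIT` failover threshold.

```python
import random

NUM_PROCESSORS = 8               # from the post: eight network processors
STUCK_SESSIONS = 64512           # from the post: sessions the bad processor never cleared
PER_PROCESSOR_CAPACITY = 64512   # assumption: the session table was full at that count
GLOBAL_LIMIT = 400_000           # hypothetical failover threshold, not from the post


class Processor:
    """Toy network processor holding a count of open sessions."""

    def __init__(self, stuck=False):
        self.stuck = stuck
        self.sessions = STUCK_SESSIONS if stuck else 0

    def accept(self):
        """Try to open a new session; return True if there was room."""
        if self.sessions >= PER_PROCESSOR_CAPACITY:
            return False  # table full: in this model the SYN is never ACKed
        self.sessions += 1
        return True

    def close(self):
        """Release one session; the stuck processor never clears anything."""
        if not self.stuck and self.sessions > 0:
            self.sessions -= 1

    def reported_sessions(self):
        """Sessions counted toward the global limit (assumed 0 when stuck)."""
        return 0 if self.stuck else self.sessions


def simulate(requests, seed=0):
    """Return (fraction of failed requests, whether failover triggered)."""
    rng = random.Random(seed)
    procs = [Processor(stuck=(i == 0)) for i in range(NUM_PROCESSORS)]
    failed = 0
    failover = False
    for _ in range(requests):
        proc = rng.choice(procs)  # assumption: uniform spread over processors
        if proc.accept():
            proc.close()          # short-lived session, released right away
        else:
            failed += 1
        if sum(p.reported_sessions() for p in procs) > GLOBAL_LIMIT:
            failover = True
    return failed / requests, failover


if __name__ == "__main__":
    rate, failover = simulate(100_000)
    print(f"failed: {rate:.1%}, failover triggered: {failover}")
```

In this model roughly one request in eight lands on the full processor and is refused, which is the 12.5% from the post, while the global counter stays near zero and so the failover never triggers. The model refuses every new connection on the stuck processor, whereas the post says only that most SYNs never got ACKed.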