On July 15, Eve Online experienced its longest downtime since Incarna was released in 2011, and a recent devblog explains why. Some veterans of the game remember the warnings to set a long skill training before patch day. Back when CCP released expansions every six months, it wasn’t uncommon for these massive patches to take the system down for several hours or even days. That time is long past, though, and the six-week deployments have been uneventful for the most part. Newer Eve players have likely never experienced a downtime longer than an hour.
When the second half of the Aegis patch deployed on July 14, it brought a sweeping change to nullsec sovereignty mechanics, which should already be familiar to readers of this site. That change rolled out with few issues, and the denizens of New Eden rallied behind the call to entosis all the things. This resulted in hundreds of sovereignty event timers. All seemed well, and the developers in Reykjavík likely had no idea that their small hotfix patch the following day would bring Tranquility to its knees.
Eve Online is unique in the MMORPG genre because all the players reside in a single shard where they can interact with each other. Some may mistakenly assume single shard means single server, but this couldn’t be further from the truth. The Tranquility cluster consists of hundreds of server blades and, back in 2013, was reported to have the equivalent of 4 terabytes of memory and 2.5 terahertz of CPU speed. At one time the cluster was regarded as one of the world’s largest supercomputers.
Rebooting Tranquility requires tight coordination between the individual servers as the cluster comes online. Each server transitions through four stages before it is considered ready for use. On July 15, several of these servers were stuck in the final stage, which prevented Tranquility from coming online. The developers worked frantically to determine the cause and get the cluster operational. After some trial and error, they found that deleting the sovereignty event and vulnerability data allowed Tranquility to boot. Perhaps the players were too zealous when waving their entosis wands.
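The startup rule described above can be sketched in a few lines of Python. This is only an illustration of the idea that one stuck server blocks the whole cluster; the stage names, node names, and helper functions are placeholders invented for the example, since the devblog summary here says only that there are four stages.

```python
from enum import IntEnum


class Stage(IntEnum):
    # Placeholder names: the article only says there are four stages.
    STAGE_1 = 1
    STAGE_2 = 2
    STAGE_3 = 3
    STAGE_4 = 4
    READY = 5


def advance(stage):
    """Move a node to the next stage; READY is terminal."""
    return Stage(min(stage + 1, Stage.READY))


def cluster_ready(nodes):
    """The cluster is usable only when every node has finished all stages."""
    return all(stage == Stage.READY for stage in nodes.values())


def stuck_nodes(nodes):
    """Names of nodes that have not yet reached READY."""
    return sorted(name for name, stage in nodes.items() if stage != Stage.READY)


# Hypothetical cluster: one node is still in the final stage.
nodes = {"node-a": Stage.READY, "node-b": Stage.STAGE_4, "node-c": Stage.READY}
print(cluster_ready(nodes))  # False
print(stuck_nodes(nodes))    # ['node-b']
```

A single node sitting in its fourth stage is enough to keep `cluster_ready` false, which is roughly the situation the developers faced on July 15.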
Resetting the prior day’s sovereignty data wasn’t an acceptable solution, so the developers kept searching for the root cause. They narrowed the problem down to the processing of sovereignty events and confirmed that the cluster would boot with that data removed. However, the nodes would fail when the developers manually added the data back after the cluster was online. Next they tried turning off the server log messages, and the cluster came online without any issues. Satisfied that they had a working solution, the developers removed all log messages from the new sovereignty code and brought Tranquility online, ending the nearly 12-hour ordeal.
Log messages are important for finding bugs in a distributed system like Tranquility, so removing them is not a valid long-term fix. The developers are performing experiments on Tranquility during downtime to search for the underlying problem. The new sovereignty code used two log message channels: one for generic messages and one for sovereignty campaign messages. Each worked fine with low volumes of data similar to what would be seen on the test server. However, the campaign channel caused processing to grind to a halt when presented with a large volume of data like that seen on the live server.
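For readers unfamiliar with log channels, here is a rough sketch of the two-channel setup and the workaround, using Python's standard `logging` module. It is not CCP's logging system: the channel names `sov.generic` and `sov.campaign`, the counting handler, and `process_campaigns` are all made up for illustration, and the sketch does not reproduce the slowdown itself, whose cause was still unknown at the time of writing.

```python
import logging


class CountingHandler(logging.Handler):
    """Counts the records it receives instead of writing them anywhere."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def emit(self, record):
        self.count += 1


# Two hypothetical channels, mirroring the generic/campaign split.
generic = logging.getLogger("sov.generic")
campaign = logging.getLogger("sov.campaign")
handler = CountingHandler()
for channel in (generic, campaign):
    channel.setLevel(logging.INFO)
    channel.addHandler(handler)
    channel.propagate = False  # keep the example's output quiet


def process_campaigns(n):
    """Stand-in for sovereignty processing that logs to both channels."""
    for i in range(n):
        generic.info("processed event %d", i)
        campaign.info("campaign %d updated", i)


process_campaigns(3)
both = handler.count
print(both)  # 6

# Workaround: silence only the campaign channel's info messages.
campaign.setLevel(logging.WARNING)
process_campaigns(3)
print(handler.count - both)  # 3
```

Silencing one channel keeps the other available for debugging, but as the article notes, losing the campaign messages is a stopgap and not a fix.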
The developers at CCP continue to search for the root cause of the problems with the logging channel. Since these issues only appear on the Tranquility cluster, they have just a few minutes each day to run tests. It will likely take them a while to narrow down the problem and restore logging.
This article originally appeared on TheMittani.com, written by Turk Fezzik.