An Interview with the Server Gods

2021-10-04

On Wednesday, September 15, I was given the opportunity to interview CCP Tuxford and CCP Explorer – both members of CCP’s infrastructure team – with CCP Swift sitting in to help guide the conversation. This was a rare opportunity for me, because conversations about the inner workings of the infrastructure teams are usually guarded behind a veil of shared experience. What I mean by this is that such conversations tend to be on rails, with a pre-structured set of questions put forward by a current or former CCP staff member and answered out of practice or from an established agenda, as happened on a stream with Carneros in November of last year.

Hidden out of the Way, but not necessarily Obscured

There have been other panels and discussions where the questions come free-form, such as at Fanfest or EVE Vegas, when players have an opportunity to interact with the devs directly and ask questions. Though both devs admitted enjoying Fanfest – because it has happened in Iceland in the past and was close enough for them to attend and share drinks with players – these events are rarely attended by CCP Tuxford or Explorer specifically because of the operational demands of their roles. 

This is perhaps where my interview with these gentlemen yielded the most fruitful answers. Most game companies are black boxes, where the roles or functions of individual developers or managers are at best poorly understood abstractions. Unless you have experience working for a game company, you’re unlikely to have a grasp of what people do or what their titles mean in a practical way. There are reasons to keep it this way, certainly, but some of the lack of clarity comes down to two factors: a general lack of public desire to know, and a dearth of opportunities to ask.

I believe it is safe to say that most INN readers would agree that deep conversations on server architecture and function are neither sexy nor stimulating. Most people don’t need to know how or why the game runs, so long as they can launch their game clients, log in, and have everything perform as expected. That said, I spend my days as an IT systems engineer immersed in the inner workings of servers: keeping the proverbial lights on, monitoring logs and tweaking performance, and putting out fires. To me, a chance to see how the infrastructure team works at CCP is a fascinating look into someone else’s black box.

Slow Gardening and Internal Growth

What I learned from CCP Tuxford and Explorer was that while the ‘slow gardening’ Explorer referred to in November has been happening in the servers, more change has been happening internally at CCP. 

Yes, it’s true that a not-insignificant number of servers from the last hardware uplift (in 2015) are still in active use, either for the SQL cluster in the monolith or for hosting solar systems. However, while SQL servers are resource-hungry and hugely memory-dependent, they’re not particularly prone to excessive wear and tear. Unless they suffer significant faults, and as long as they’re well-maintained, they can simply plug along for years. CCP has already stated they have a plan to phase these older machines out.

However, while the monolith can benefit from a capital boost and have new hosts and blade servers brought in, change happens more slowly where human capital is concerned. 

The roles of the infrastructure team are operationally divided between Tuxford and Explorer, with the other members of the team filling in around them. Tuxford’s role is more tightly tied to evaluating the code which is coming into the pre-production stream (ready for public eyes) and ensuring the teams are delivering code which is ready to push to production. Meanwhile, Explorer is more of a mechanic, keeping the proverbial lights on and easing problems out of the system. Explorer addresses hardware, outages, break-fix and optimization, and spends a lot of time combing logs to identify problems and make changes proactively.

Tuxford, however, also has a hand in internal policy and in access to the tools which enhance or facilitate the different teams’ ability to work independently. In fact, this was a big part of a recent internal reorganization, which focused on giving dev teams the ability to resolve some tension points themselves without having to touch the rest of the development pipeline or involve other teams. New tools have added layers of abstraction, which have in turn helped to build operational independence.

What this means in layman’s terms is that teams spend less time throwing small changes and requests over the wall to other teams or kicking requests and side projects back and forth, and instead have the ability to resolve these things directly within their own teams or with limited intervention. Since the November 2020 stream, Tuxford has spent a lot of time on removing cracks and gaps between the teams, and in some cases helping them to build stronger barriers so they can be more decisive or effective.

As he puts it, “Some of the challenges come from organizational difficulty, not technical difficulty. Focusing on inter-team dependencies, versus intra-team dependencies, can mean being more boundaried.”

Big Moves and Smaller Downtimes

As was highlighted in the November stream, there have been some huge economizations in the amount of traffic and requests handled by the proxy servers by shifting services out of the monolith and onto what CCP calls the ‘EVE domain services’. In essence, these are elastic compute spaces and storage spun up in Amazon Web Services. These are processes and parts of the EVE experience which aren’t dependent on the render portion of the game engine; the market, chat, the activity tracker, and the dynamic bounty system are just some examples.

Keep in mind that we are ultimately dealing with a two-decade-old game initially written in a programming language (Python) which at the time hadn’t yet fully realized all that it can (and can’t) do. There’s a lot I can say about Python – a cluttered namespace, whitespace quibbles, the lack of constants and private instance variables, and the obsessive use of the import statement. Or, as one past EVE developer put it, “Import magic. Sssh! Hush now! Don’t question the magic.”
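To illustrate two of those complaints, here is a minimal sketch of how Python treats "constants" and "private" instance variables. The names (`MAX_PILOTS`, `Ship`) are made up for the example and have nothing to do with EVE's actual code base.

```python
# An uppercase name marks a "constant" by convention only; nothing enforces it.
MAX_PILOTS = 100
MAX_PILOTS = 200  # rebinding the name is perfectly legal


class Ship:
    """Hypothetical class, used only to show attribute visibility."""

    def __init__(self):
        self._hull = 50      # single underscore: "private" by convention only
        self.__shield = 75   # double underscore: name-mangled, not truly hidden


ship = Ship()
hull = ship._hull              # readable from outside the class
shield = ship._Ship__shield    # the mangled name is still reachable
```

Both protections are agreements between programmers rather than rules the interpreter enforces, which is the point the complaint is making.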

The more of the old spaghetti code and truncated development CCP has been able to prune out of the game base, the better and more stable EVE has arguably become. These have been big moves, and more are certainly on the way, but there are barriers yet to be removed. 

CCP Explorer and Tuxford were able to explain that the hinge point for a lot of the reset events in EVE – regeneration of asteroid belts, for one – isn’t actually downtime but startup. It sounds like a minor distinction, but the concept is pretty simple. It’s not shutting down the server that causes those events to kick off, it’s the processes that occur when everything starts up. That these processes are attached to startup is actually rather arbitrary, but this is a decision that was made back in 2001 or 2002, before the launch of EVE, and it’s just never been changed.

I asked what reorienting the startup-dependencies would look like for EVE, and that’s where Tuxford circled back to the internal reorganization and some of the team developments. “That’s where we get into discussions of game development.”

In essence, they can be hooked to anything; a timer, or a series of timers, a scripted trigger, or a cascading series of changes. This is where game developers get to make the decisions. Those reset cycles may find their way into being worked into redistribution.
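As a rough illustration of what hooking a reset to a timer instead of to startup could look like, here is a minimal sketch. It is not CCP's implementation: `ResetTask`, `ResetScheduler`, the `respawn_belts` task name and the 24-hour interval are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ResetTask:
    """A hypothetical reset event, such as regenerating asteroid belts."""
    name: str
    action: Callable[[], None]
    interval_s: float      # assumed: seconds between runs
    next_due: float = 0.0  # set when the task is registered


@dataclass
class ResetScheduler:
    """Runs reset tasks when their timer expires, not when the server starts."""
    tasks: List[ResetTask] = field(default_factory=list)

    def register(self, task: ResetTask, now: float) -> None:
        # The first run is due one full interval after registration.
        task.next_due = now + task.interval_s
        self.tasks.append(task)

    def tick(self, now: float) -> List[str]:
        """Run every task that is due and return the names of those that ran."""
        ran = []
        for task in self.tasks:
            if now >= task.next_due:
                task.action()
                task.next_due = now + task.interval_s
                ran.append(task.name)
        return ran


log: List[str] = []
scheduler = ResetScheduler()
scheduler.register(
    ResetTask("respawn_belts", lambda: log.append("belts"), interval_s=86400.0),
    now=0.0,
)
early = scheduler.tick(now=3600.0)   # one hour in: nothing is due yet
later = scheduler.tick(now=86400.0)  # a day in: the belt reset fires
```

The trigger here is a plain clock value passed into `tick`, so the same structure could just as easily be driven by a scripted event or a chain of other tasks. Which of those to use is the game-design decision Tuxford was pointing at.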

Dreams of Robotic Sheep

I asked if there is conceivably a time when EVE could be moved away from downtimes entirely, or moved off the cluster/monolith and wholesale into AWS.

Both men stated that there are always going to be cases for some aspects of the environment to exist in an on-premise fashion (servers living somewhere physical, owned by CCP), versus living in the cloud. Some services make sense to grow and self-balance, which is another benefit of the scalability offered by AWS. Existing in an elastic, cloud environment allows for the ability to quickly scale and downscale based on demand, but other services need to exist in a physical infrastructure. Big fights, optimizing for best performance, and so on.
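The scale-up and scale-down behaviour described above can be sketched as a simple sizing rule. This is an illustration only, not how CCP or AWS actually decides capacity: the function name, the 25% headroom and the instance limits are assumptions.

```python
import math


def desired_instances(current_load: float,
                      capacity_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 20,
                      headroom: float = 0.25) -> int:
    """Return how many instances to run for the given load.

    All parameters are hypothetical: load and capacity share one unit
    (say, requests per second), and headroom is spare capacity kept in reserve.
    """
    needed = math.ceil(current_load * (1 + headroom) / capacity_per_instance)
    # Clamp to the allowed range so the service never scales to zero
    # and never grows past a fixed ceiling.
    return max(min_instances, min(max_instances, needed))


quiet = desired_instances(current_load=800, capacity_per_instance=1000)
busy = desired_instances(current_load=9000, capacity_per_instance=1000)
surge = desired_instances(current_load=100000, capacity_per_instance=1000)
```

A rule like this suits services whose load can be spread across many interchangeable instances. It does not help a workload that cannot be split that way, and the interview offers big fights as an example of something that still belongs on physical infrastructure.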

I asked if machine learning and automation could take a bigger role in easing the workload for the team, and they both responded that there is an ongoing focus on automation, but that a lot of what they do already happens this way.

While moving away from downtime has business drivers (continuity of experience, etc.), CCP has found that player experience is perhaps the biggest beneficiary of reduced downtimes. Because people have come to expect downtime, they have developed certain habits, logging off from the servers in advance of downtime and not returning until (in some cases) several hours later. Whether this is a conditioned response, or just a convenient break to go and take care of real-life tasks and responsibilities, is uncertain. However, both Tuxford and Explorer stated that the typical downtime tasks currently take around five to seven minutes, and it is rare they need an hour for downtime.

So when EVE finally moves away from downtime, will EVE ever sleep? And if it does, will the Server Gods send tiny robotic sheep? Probably only for the really big changes.

Special thanks to CCP Swift, Tuxford, and Explorer for putting up with my questions and offering me a chance to interview them. I look forward to seeing some of their internal blogs make their way public, and hopefully news of a completed infrastructure uplift in the future.


Comments

  • kwnyupstate .

    You can tell it is another boring CCP shill interview by the lack of comments (interest).

    October 5, 2021 at 10:33 PM
  • sj

    It didn’t really tell us much however:

    other services need to exist in a physical infrastructure. Big fights, optimizing for best performance, and so on.

    I don’t get this bit because unless its to do with security requirements or you need very specific hardware then nowadays on-premise computing is a compromise. Their performance problems seem to stem from a decision made two decades ago which at the time may have been the right one but set them on a road that by now a wholesale rewrite is probably the only way out.

    October 7, 2021 at 5:08 AM
  • Xa1n

    Any update is good for the game, thanks for this.

    October 19, 2021 at 3:59 AM