Ghosts in the machine

Air traffic control

Ancient operating software, which has been tinkered with but not upgraded, is threatening to wreak havoc in European airspace, according to a UK government body that promotes safer critical national infrastructure systems. Gary Mason reports.

It is almost exactly 12 months since a major failure of air traffic control software used by NATS caused significant disruption to UK airspace.

Although this may be considered a “one-off” event that will never happen again, concerns are being expressed outside the commercial aviation industry, in the UK and across Europe, that reliance on old systems, some of them written in the 1960s, and a failure to update them amount to a catastrophe waiting to happen.

The Trustworthy Software Initiative (TSI) is a UK government entity that is funded from the cyber security budget and is sponsored by the Centre for the Protection of National Infrastructure.

Within the initiative there is a steering group comprising academia, professional institutes and private industry. The Civil Aviation Authority is represented on the advisory board and helps keep an eye on infrastructure (such as that operated by smaller local airports) which may not be regarded as “critical” but is nonetheless a very important cog in the wheels of civil aviation.

The TSI has expressed concerns not only about the main systems operated by NATS but also about the back-up systems and software operated by all airports. It says that the key problem, the software ‘lifecycle challenge’, is that during the evolution of software very little information is exchanged between the original software engineers, those who alter the software over the course of its life, and end-users such as airports. This means Air Traffic Control Centres might literally be ‘in the dark’ over the true nature and limitations of the software they rely on to track air traffic.

This ‘software lifecycle challenge’ means that all Air Traffic Control Centres must have complete up-to-the-minute visibility over the limitations of the safety-critical software upon which they depend.

A specific problem arises when the “Composability/Traceability of Assumptions and Assertions” underpinning software design has not been adhered to. This can manifest itself in so-called race conditions, where the end-user (an Air Traffic Control Centre) is unaware that activities must be carried out in a specific sequence when old software from multiple sources is reused and assembled.
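The sequencing problem described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the class `RadarFeed`, its methods and the calibrate-before-read rule are invented for illustration and do not describe any real air traffic control software. The idea is simply that a reused component can state its ordering assumption in code and fail loudly when an integrator calls it out of sequence, instead of quietly returning bad data.

```python
class SequenceError(RuntimeError):
    """Raised when a documented ordering assumption is violated."""


class RadarFeed:
    """Hypothetical reused component (illustration only).

    Documented assumption: calibrate() must be called before read().
    """

    def __init__(self):
        self._calibrated = False
        self._offset = 0.0

    def calibrate(self, offset):
        # Record the calibration so that later reads can rely on it.
        self._offset = offset
        self._calibrated = True

    def read(self, raw_value):
        # Enforce the assumption rather than trusting the caller to know it.
        if not self._calibrated:
            raise SequenceError("read() called before calibrate()")
        return raw_value + self._offset
```

An integrator who assembles this component with others and skips the calibration step gets an immediate, named error that points back to the documented assumption.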

The TSI says it is aware of cases where Air Traffic Control Centres have discovered only through rigorous testing of different scenarios that their software was out of date and could not handle the number of planes it was required to track.

Recent examples of aviation software failures include Paris Orly Airport’s software that had not been updated since 1992 (see box) and the United Airlines software bug that was only uncovered by a security researcher and would have allowed a hacker to remotely cancel flights.

A fundamental problem is that many airports may be unaware their software is out-of-date or may not be capable of doing what it will be required to do because the original software engineers are failing to explicitly document the underlying assumptions and ‘assertions’ behind its design.

As airports such as Heathrow increase in size and capacity, Air Traffic Control centres will be required to track far more aircraft and manage far more data at once. This puts them at risk if their old software was never designed to cope with such demands.

Tony Dyhouse from the Trustworthy Software Initiative told Airport Focus: “Airports seem to have had a budgetary failure to update their systems. This is very largely because we are talking about niche software. Software that we use every day we start to notice faults and errors – this is different because a lot of the software was written a long time ago.”

A recent report revealed for example that some of the software used by NATS was written in the 1960s. “Even patching as a process hadn’t become a major forte then,” he adds.

“What we are seeing now is that the requirements for which the software was originally developed no longer exist. Although it is still there and working today the older it gets the more difficult it is to do anything about it. This is because it will be difficult to find the people who know how it was written in the first place and the constraints placed upon it at that point.”

According to Dyhouse, there is a real danger that software used by airports and Air Navigation Service Providers will fail. The situation has been complicated, he says, because over the years the software has been linked to other types of software it wasn’t originally designed to link with.

“The worst case scenario is that not only will the basic software fail but that it will cause other vital systems to fail as well. The NATS system failure in December 2014 was a perfect example of the original constraints of the software being exceeded. They were quick to point out that there was no danger to safety and they were able to retrench to a manual method but as we go on in time those manual methods can get a bit rusty as well. They are in some cases just gathering dust and this shows the importance of regular testing and exercises to make sure that there is resilience and if the software does fail there is a reasonable back-up.”

Because the software of most concern is by definition very old, the hardware it runs on is also often vulnerable to collapse, the TSI says. “A lot of this hardware is becoming increasingly scarce and some reports have indicated that airports are having to source hardware from eBay,” says Dyhouse. “That is a very similar scenario to what is happening with other industrial control systems around the world. Because they weren’t originally connected to the internet and were for niche purposes as long as they are there doing the job they are largely left alone until something goes seriously wrong.”

Some of the systems have almost been developed in-house as bespoke systems, which goes against the modern trend of buying commercial off-the-shelf (COTS) systems built by designers who know how to build resilience into systems that will be needed over several decades.

“There is a lot of commercial sense in avoiding such systems because there will be a certain amount of patches that can be relied upon to sort problems out,” says Dyhouse. “Whereas if we look at what happened at Orly airport they were running Windows 3.1 and said they also had a lot of Windows XP systems. These systems are no longer even supported by the manufacturer.

“The difficult question to answer is why have not budgets been set aside to upgrade these systems as they should have?”

An emerging trend within other areas of technology is that the development of a software release is driven by functionality, so there is a reduction in the usual testing of the release. This is a crucial weakness for systems which will be relied upon by critical national infrastructure entities such as airports.

“Largely budgets have gone into client facing roles such as airport Wi-Fi and linking commercial retail outlets and systems involved in security,” says Dyhouse. “These are the systems that the public see but they are not the systems that are controlling our air space and are often hidden.”

Another issue, according to the TSI, is that when there is a significant failure of one of these critical “hidden” systems it is underplayed. “It is not in their nature to make too much of it when there is a failure,” he says. “We only became aware of failures such as NATS and the Paris incident because it hit the media. The inquiry into the NATS system failure pointed out it quickly became a cause célèbre because of press interest. If that hadn’t happened people might not have known about it.”

Dyhouse says that it needs to be recognised that software is not a fixed asset but is a living thing that needs to be managed on a day to day basis. “It needs to be managed and changes to it need to be planned as with any other project. If someone comes in and tinkers with it that needs to be a matter of record and those changes need to be clearly logged or else no one else will have any idea how and why those changes have been made,” he adds.

This means that right at the start of the life of a system there should be a document that spells out what the software is supposed to do and, more importantly, what it cannot do. “There should be a clear understanding of the limitations so that if the system is bent to exceed those limits it will break,” he says. “Times change and a management system is essential to ensure that the software is maintained in an appropriate manner and the system requirements for which it was built are still valid. We don’t have to be rocket scientists to know that the number of planes in the sky has increased significantly. With the rate of airport traffic growing as it is the fact that these systems are so old is shocking.”

So is there a simple fix to the problem or are airports and ANSPs all going to have to invest a lot of money in upgrading their systems and hardware? “There is no simple fix to the problem,” says Dyhouse. “But I am afraid they almost need to start again at the point where they are asking what do we need and what do we need it to do? A key part of that process is asking what they expect their business to look like in 10 years’ time. We can design software well but there isn’t a short cut. Orly Airport said they would not be able to get a replacement system up and running until 2017 but it is going on a wing and a prayer to rely on a system that you know is going to fail. What I would advise airports to do is to have a complete audit of their software systems and check if there is a record of how every software system was designed and to ask if it is still valid. They might be shocked at how much of that information is missing.”
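The audit Dyhouse recommends could start from something as simple as an inventory that records, for each system, whether a design record exists and when its assumptions were last checked. The Python sketch below is one hypothetical way to do that; the `SystemRecord` fields, the `audit` function and the one-year threshold are assumptions made for illustration, not anything prescribed by the TSI.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SystemRecord:
    """Hypothetical inventory entry for one software system."""
    name: str
    has_design_record: bool
    last_validated: Optional[date]  # None means never revalidated


def audit(records, today, max_age_days=365):
    """Return names of systems lacking a design record or a recent validation."""
    flagged = []
    for rec in records:
        # A system is stale if it was never validated or was validated too long ago.
        stale = (
            rec.last_validated is None
            or (today - rec.last_validated).days > max_age_days
        )
        if not rec.has_design_record or stale:
            flagged.append(rec.name)
    return flagged
```

The output is only a list of names to investigate; deciding whether the original design assumptions still hold remains an engineering judgement.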

How Orly was brought to a standstill

A computer glitch that brought Orly to a standstill in November has been traced back to the airport’s operating system.

The computer failure had affected a system known as DECOR, which is used by air traffic controllers to communicate weather information to pilots.

DECOR, which is used in takeoffs and landings, runs on Windows 3.1, an operating system that came onto the market in 1992.

DECOR’s breakdown prevented air traffic controllers from providing pilots with Runway Visual Range, or RVR, information — a value that determines the distance a pilot can see down the runway. As fog descended onto the runway and engineers battled to find the origin of the glitch, flights were grounded as a precaution.

“The tools used by Aéroports de Paris controllers run on four different operating systems, that are all between 10 and 20 years old,” explained Alexandre Fiacre, the secretary general of France’s UNSA-IESSA air traffic controller union. ADP is the company that runs both Orly and Paris’ other airport, Charles de Gaulle, one of the busiest in the world.

“Some of ADP’s machines run on UNIX but also Windows XP,” said Fiacre, who works as an aviation security systems engineer.

“The issue with a system that old is that people don’t like to do maintenance work,” explained Fiacre. “Furthermore, we are starting to lose the expertise [to deal] with that type of operating system. In Paris, we have only three specialists who can deal with DECOR-related issues,” said Fiacre.

“One of them is retiring next year, and we haven’t found anyone to replace him,” he added.

Fiacre compared the challenges of running the Windows 3.1-supported DECOR to the issues faced by NASA with its Voyager program, which was launched in 1977.

French aviation systems engineers face their own maintenance challenges, compounded by the unavailability of spare parts for these outdated machines. “Sometimes we have to go rummaging on eBay to replace certain parts,” said Fiacre. “In any case, these machines were not designed to keep working for more than 20 years.”

France’s transport minister has promised that “equipment will be upgraded by 2017.” But Fiacre is not so sure about this timeline. “In my opinion, we’ll upgrade in 2019 at the earliest, perhaps even in 2021,” he said.

Fiacre described the breakdown as a “warning,” but noted that the systems failure had in no way “endangered passengers, since [air traffic] controllers took a number of precautionary measures to eliminate all risk.”

The NATS system crash

Following a failure of some United Kingdom air traffic control (ATC) services on 12 December 2014, the Civil Aviation Authority (CAA) and NATS established an independent inquiry into the cause of the failure.

The incident started with the failure at 14.44 of a computer system used to provide information to Air Traffic Controllers managing the traffic flying at high level over England and Wales. This traffic includes aircraft arriving at and departing from London airports as well as aircraft transiting UK airspace.

At 14.55 all departures were stopped from London Airports and at 15.00 all departures were stopped from European airports that were planned to route through affected UK airspace. The computer system was restored to the Controllers at 15.49, but without its normal level of redundancy. By 19.00, the Engineering staff believed they understood the cause of failure and full redundancy of the computer systems was restored at 20.10. Traffic restrictions were gradually lifted from 15.55 as confidence increased, and the final restriction was lifted at 20.30. The disruption caused by the restrictions affected some airlines, airports and passengers into the following day.

The systems at the NATS Swanwick operations centre entered service in 2002 but were in development during the previous decade. Failure occurred on 12 December 2014 because of a latent software fault that was present from the 1990s. The fault lay in the software’s performance of a check on the maximum permitted number of Controller and Supervisor roles (known as Atomic Functions).
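The sketch below does not reproduce the NATS code. It is a hypothetical Python illustration of the general kind of check described above: a registry that accepts new roles only up to a documented maximum. The names `RoleRegistry`, `CapacityExceeded` and `MAX_ROLES`, and the value 200, are invented for this example. The design point is that the limit is defined in one place and every check refers to it, so the documented capacity and the enforced capacity cannot drift apart unnoticed.

```python
MAX_ROLES = 200  # hypothetical documented capacity, defined in one place


class CapacityExceeded(RuntimeError):
    """Raised when a request would exceed the documented limit."""


class RoleRegistry:
    """Hypothetical registry of active controller and supervisor roles."""

    def __init__(self, limit=MAX_ROLES):
        self.limit = limit
        self._roles = set()

    def add(self, role_id):
        # Reject a new role only if it would push the count past the limit.
        if role_id not in self._roles and len(self._roles) >= self.limit:
            raise CapacityExceeded(f"limit of {self.limit} roles reached")
        self._roles.add(role_id)

    def count(self):
        return len(self._roles)
```

Exceeding the limit raises a named error that can be handled and tested for, which is the sort of boundary scenario the TSI argues should be exercised before it occurs in live operation.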

Following publication of the inquiry team’s report a spokesman for NATS said: “We agree with the panel that it is unrealistic to expect that complex systems such as ours will never fail. To mitigate this we will continue to invest in making sure that failures are extremely rare and the impact of such failures on the travelling public are minimised as far as reasonably practical. And we are pleased that the panel recognised the continued programme of investment to accelerate the deployment of our next generation of systems.”
