That claim that a computer 'passed' the Turing Test was crap - and here's why [Update]

This week, the tech world went ever so slightly crazy with excitement at news that a supercomputer had, for the first time, achieved what some believed might never be possible: it had passed the Turing Test. A remarkable machine had somehow managed to convince humans - real, live people with functioning brains - that it too was a human, through typed conversation. 33% of the judges who assessed its capabilities were apparently convinced. 

Like many other news outlets - from CNET to The Independent and The Washington Post - we found ourselves giddy at the thrilling prospect of artificial intelligence finally coming of age, and we rushed to publish an article to share this news with the world. But now that the rush of excitement has calmed, it is becoming clear that our delight was premature. 

To start off with, as TechDirt notes, the great supercomputer isn't a supercomputer at all. It's a chatbot, and it's not a particularly sophisticated one. Those who have managed to use it - and they are surprisingly few in number, since the site that hosts it has been plagued by near-constant technical difficulties since the announcement - note that its responses to most questions are feeble, at best. 


Most of those trying to interact with 'Eugene Goostman' have seen nothing but error messages

Ask it a simple question, like "how are you?" and you'll get a simple response, but ask it something more complicated, and its replies quickly drift into stiff, canned answers, infused with forced hints of 'personality', or unsophisticated pivots to draw you away from topics that it is unequipped to deal with. Consider, for example, this exchange - an actual transcript of a conversation on which an 'official' judge based his or her assessment that this chatbot was in fact a 13-year-old boy called Eugene Goostman: 

Particularly telling are the latter exchanges in this excerpt. At 16:14:41, the judge misspells a word; 'Eugene' replies by telling the judge to take some typing lessons. This may sound like a sassy response, but it's more likely a crude attempt by the chatbot to disguise the fact that it cannot understand an enquiry containing even a very minor typo. The next response makes this even more obvious - the judge tells 'Eugene' that it was rather rude to suggest typing lessons, to which the chatbot replies that it simply didn't understand the question - confused by the judge's misplaced question mark. 
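This deflection behaviour is a decades-old trick, dating back to Joseph Weizenbaum's ELIZA in the 1960s. Purely as an illustrative sketch - this is not Eugene Goostman's actual code, and the patterns and replies below are invented - a pattern-matching bot of this kind might look like:

```python
import re

# Canned replies keyed to regular-expression patterns. Anything the bot
# 'understands' is really just a pattern it happens to match.
RULES = [
    (re.compile(r"\bhow are you\b", re.I), "I'm fine, thanks! And you?"),
    (re.compile(r"\bwhere (do you live|are you from)\b", re.I),
     "I live in Odessa. Have you ever been to Ukraine?"),
]

# Deflections used when no rule matches - including input with typos,
# which simple pattern matching cannot recognise.
DEFLECTIONS = [
    "Could you rephrase that? And maybe take some typing lessons :-)",
    "Let's talk about something else. Do you like computer games?",
]

def reply(message: str, turn: int = 0) -> str:
    """Return a canned reply if a pattern matches, else change the subject."""
    for pattern, canned in RULES:
        if pattern.search(message):
            return canned
    # Unrecognised input: pivot away rather than admit confusion.
    return DEFLECTIONS[turn % len(DEFLECTIONS)]
```

Note how a single-character typo ("how aer you?") defeats the pattern match entirely and triggers a canned deflection - exactly the kind of behaviour visible in the judge's transcript.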

Frankly, none of the exchange is very convincing, and it's hard to imagine that anyone with two brain cells to rub together could seriously believe that this could be a real human being. 

But one crucial factor may explain why these judges - or at least a third of them - may have been convinced. Eugene Goostman is not intended to be just any 13-year-old boy; he is a Ukrainian 13-year-old boy, speaking English, which is not his mother tongue. It would be unfair to expect someone of that age to speak flawless English as a foreign language, so perhaps, in the minds of the judges, that would account for the slightly unusual phrasing of certain statements, and the occasional odd response, right? 


Among the judges was Robert Llewellyn, who played the mechanoid Kryten
in BBC sitcom 'Red Dwarf' (image via What Culture)

The problem with this assumption is that by telling the judges that they are speaking with a young teenage boy from Ukraine, the test immediately becomes skewed; the judges become more likely to lower their expectations, and make allowances for oddities in the conversation. This is like asking someone to judge your theatrical performance, but making it very clear, in no uncertain terms, that you've only had a limited amount of time to rehearse. It has obvious potential to completely distort perceptions, and thus, the eventual outcome. 

In its coverage, The Verge reported that Vladimir Veselov, one of the computer engineers who developed 'Eugene', seemed to acknowledge that this was a factor in how the test was created and judged. Veselov stated: "Our main idea was that he can claim that he knows anything, but his age also makes it perfectly reasonable that he doesn't know everything." 

But the problems don't end there. Consider the fact that there was no peer review of the claims made. The passing of such a significant milestone surely deserves the scrutiny of the foremost global experts in the field, and yet there has been no such verification. Nor has there been any opportunity to assess the fairness of exactly how the test was carried out - just a press release stating that the 'supercomputer' had been judged, along with a vague claim that the test was "independently verified", with no declaration of who precisely was verifying the process, nor indeed of whether they were qualified to make such a critical assessment.

Computational cognitive scientist Joshua Tenenbaum, from the Massachusetts Institute of Technology (MIT) - an actual expert in the field - told WIRED: "There's nothing in this example to be impressed by." Gary Marcus, lead cognitive scientist at NYU, wrote that "chatterbots like Goostman can hold a short conversation about TV, but only by bluffing... but no existing program - not [IBM's] Watson, not Goostman, not [Apple's] Siri - can currently come close to doing what any bright, real teenager can do: watch an episode of 'The Simpsons', and tell us when to laugh." 

This exposes perhaps the biggest flaw in the claims that the Turing Test has been 'passed'. Even if we assume that the chatbot had fairly managed to fool a third of the judges under reasonable testing conditions, there is a big difference between being able to respond to enquiries based on language assessment, and actual cognition and artificial 'thought' that can interrogate meanings, explore subtexts and understand the world in any meaningful way. 


Professor Kevin Warwick (image via teinteresa.es)

Finally, as TechDirt explains, the most notable indicator that things are far from what they seem - or rather, far from how they have been presented - is the organizer of the test itself, Kevin Warwick, visiting professor at the University of Reading, the institution that issued the press release. Warwick, it seems, has plenty of form when it comes to making outlandish scientific claims.

In 1998, he claimed that he was the world's first cyborg, after getting an RFID chip implanted in his arm, with which he was able to perform tasks such as opening doors and controlling lights. Noted British tech publication The Register quickly dubbed him a "media-obsessed fantasist", and has been cataloguing his exploits ever since. 

It says much that an experienced and respected tech journalist like Mary Branscombe tweeted this as soon as she realised who was behind the Eugene Goostman claims: 

With just the most delicate of scrutiny, then, the house of cards - built around these claims that the Turing Test has been beaten - has completely collapsed. The weight of doubt surrounding the claims is too great to overlook, and in the absence of any meaningful independent verification or any significant substantiation behind them, it is impossible to afford them any real credence. 

One day, perhaps - and maybe even sooner than we might think - the Turing Test may indeed be passed. But that day, it seems, is not yet upon us. 


Update: Professor Warwick commented on Neowin's Facebook page to respond to this article, and to correct one aspect of it. He said that "there is a claim that the judges were told that they would be conversing with a machine posing as a Ukrainian boy. This is completely false. They were told no such thing." He also said that "the article appears to completely miss the point that this was all about the Turing test. This requires a direct comparison with a hidden human in each test, it is not merely about picking holes in a computer's response." 

While we fully acknowledge Professor Warwick's correction - that the judges were not told that the machine was posing as a Ukrainian boy - we maintain that the test was still deeply flawed for all of the other reasons stated, and more.

For example, Professor Warwick asserts that the test specifically compared the responses between the machine and a hidden human; was the hidden human also a 13-year-old Ukrainian boy with comparable English language skills? Or was the hidden human an adult who spoke English as a first language, but who was attempting to emulate a boy of that age? Or something else entirely? The quality of the human responses is as important as those returned by the machine if one is to argue that the test was based on a fair comparison between the two. 

This is precisely why independent peer review - to scrutinise these and other factors - is such a crucial and essential element of making a scientific declaration with a potentially massive impact. Picking and choosing specific statements to respond to on a news outlet's Facebook page is not the same as giving the scientific community the opportunity to properly review the findings, and the means by which they were tested.

Indeed, the fact that Professor Warwick even felt the need to correct us on one aspect of how the test was carried out exposes precisely the lack of transparency that makes the declaration of a result so problematic. It is telling that priority is being given to correcting an inaccuracy in a media report, rather than to opening up the entire experiment to wider scientific scrutiny so that the findings may be properly verified. 

And while Professor Warwick may nonetheless view our assessment as unfair, it is shared not only by the scientists quoted in this article, but by many more besides, as well as by a growing number of news organisations which, like us, have now taken the time to examine the claims a little more rigorously, including New Scientist, Metro, The Huffington Post, Business Insider, BuzzFeed and VICE.

Consider Alan Turing's own words on the subject. Turing never suggested that a machine need only convince a human that it too was human a third of the time; nor did he propose that it should pretend to be not an adult of mature intelligence, but a teenager - and not a teenager of comparable linguistic ability, but one who speaks English as a second language.

Turing asked whether a human speaking with a machine could believe that they were speaking with a human "as often" as when they were actually speaking with a real person. That is a long, long way from convincing a panel of judges 33% of the time, and it is far removed from the caveats and equivocation of deliberately establishing excuses - by making the machine emulate a boy with inferior language skills - for why it might fail to convince humans with any real reliability. To suggest that 'Eugene Goostman' passed the Turing Test under these conditions appears far removed from the spirit of the test that Turing himself defined.

It seems that Professor Warwick is the one missing the point, not us. 
