AI won't be taking over software engineering jobs any time soon, but it's getting there

Last Sunday, we looked at OpenAI's latest work in which the firm trained diffusion models to generate deepfakes and subsequently achieved a new state-of-the-art in multiple image generation tasks. Today, we shift gears and focus on another big and recent development in the field of artificial intelligence—transformer models.

Transformer models came to the forefront with Google's open-source implementation of BERT. By improving on the shortcomings of RNNs and LSTMs, this deep learning architecture revolutionized the field of natural language processing and generation. We first saw the potency of such language models in OpenAI's GPT-2, whose 1.5 billion parameters produced news, stories, lyrics, and other pieces of text that could easily be mistaken for the work of a human. Soon after, GPT-3, the successor to GPT-2, borrowed the best bits from its predecessor and, with 175 billion parameters to back it up, produced work that sounded shockingly cohesive and sophisticated, and often factually correct. Since the training dataset for this language model was drawn from a large share of the public internet, we could ask it to produce pretty much any kind of text that is publicly available online. Stories, lyrics, news pieces, and conversations aside, GPT-3 even wrote valid CSS and HTML code. The last of these, a language model's ability to write code, is what we shall be focusing on today.

A couple of days back, a team of researchers from UC Berkeley, UChicago, UIUC, and Cornell published a paper in which they gauged how well the best language models of today can write code. In the paper, titled Measuring Coding Challenge Competence with APPS, the researchers essentially put these language models in the shoes of a candidate in a programming interview, where the ability to understand a given problem and code its solution is being tested. To do this, the team introduced a new dataset called the Automated Programming Progress Standard (APPS).

APPS evaluates models not only on their ability to understand coding syntax, but also on their ability to understand task descriptions and devise algorithms to solve these tasks. If a model were to perform well on APPS, this would indicate an ability to flexibly use data structures and programming techniques, as well as an ability to correctly interpret diverse task specifications, follow instructions, and understand human intent.

The dataset consists of 10,000 coding problems split into three categories (Introductory, Interview, Competition) and written in the kind of plain English typically used in programming interviews today. These problems were taken from open-access sites like Codewars, AtCoder, Kattis, and Codeforces, where programmers share coding problems with each other. To validate solutions, the dataset contains 131,836 test cases and 232,444 ground-truth solutions written by humans in Python.

1. Introductory Level. Problems that most programmers with 1-2 years of experience can answer without requiring complicated algorithms. Examples of such problems include counting the number of appearances of a substring, or finding if a string is a palindrome. There are 3,639 problems classified as introductory level and 1,000 in the test set.

2. Interview Level. These are problems that are more algorithmic and difficult in nature and would be at the level of questions asked in difficult technical interviews. Examples of such problems might include those involving data structures such as trees or graphs, or problems that require modifying common algorithms. There are 5,000 problems classified as interview level and 3,000 in the test set.

3. Competition Level. These are problems that are even more challenging and are at the level of the most advanced high school and collegiate programming competitions, including USACO, IOI, and ACM. There are 1,361 problems classified as competition level and 1,000 in the test set.
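To make the introductory tier concrete, here is a minimal Python sketch of the two example problems mentioned for that level, together with a toy checker that compares a solution's output against expected outputs, in the spirit of how test cases are used to validate solutions. The function names, the string-in/string-out convention, and the sample cases are illustrative assumptions, not taken from the APPS dataset or its evaluation code.

```python
from typing import Callable, List, Tuple

# Hypothetical test-case shape: (input text, expected output text).
TestCase = Tuple[str, str]


def count_occurrences(text: str, pattern: str) -> int:
    """Count how many times pattern appears in text, overlaps included."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are found too.
        start = text.find(pattern, start + 1)
    return count


def is_palindrome(text: str) -> bool:
    """Return True when text reads the same forwards and backwards."""
    return text == text[::-1]


def palindrome_solution(raw_input: str) -> str:
    """Toy solution: read one line, answer YES or NO."""
    return "YES" if is_palindrome(raw_input.strip()) else "NO"


def run_test_cases(
    solution: Callable[[str], str], cases: List[TestCase]
) -> List[bool]:
    """Run a solution on every case and record pass/fail per case."""
    results = []
    for raw_input, expected in cases:
        try:
            actual = solution(raw_input)
        except Exception:
            # A crashing solution simply fails that test case.
            results.append(False)
            continue
        results.append(actual.strip() == expected.strip())
    return results


if __name__ == "__main__":
    cases = [("racecar\n", "YES"), ("apps\n", "NO"), ("a\n", "YES")]
    print(count_occurrences("banana", "ana"))
    print(run_test_cases(palindrome_solution, cases))
```

Whether overlapping matches should count is exactly the kind of detail a problem statement has to pin down, which is part of why reading the task description correctly matters as much as writing valid syntax.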

The following image shows an excerpt from the dataset:

With the APPS dataset prepared, the researchers trained three of the best language models openly available today: GPT-2, GPT-3, and GPT-Neo (a free alternative to the closed-source GPT-3). Once the training was complete, the models were evaluated and compared against each other.

The researchers found that while there are definite positives, understanding and coding problems is still a notoriously challenging task for even the best language models that we have today.

We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially. Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems, so we find that machine learning models are beginning to learn how to code.
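A figure such as "15% of the test cases" depends on how per-test results are aggregated. The sketch below shows two natural ways to summarize them: the average fraction of test cases passed per problem, and the stricter fraction of problems on which every test case passes. The function names and the nested-list input format are assumptions made for illustration; this is not the researchers' evaluation script.

```python
from typing import List

# Hypothetical input format: one inner list per problem,
# holding True/False for each of that problem's test cases.
Results = List[List[bool]]


def average_pass_rate(results: Results) -> float:
    """Mean, over problems, of the fraction of test cases passed."""
    per_problem = [sum(r) / len(r) for r in results if r]
    if not per_problem:
        return 0.0
    return sum(per_problem) / len(per_problem)


def strict_accuracy(results: Results) -> float:
    """Fraction of problems where every single test case passed."""
    scored = [r for r in results if r]
    if not scored:
        return 0.0
    return sum(all(r) for r in scored) / len(scored)


if __name__ == "__main__":
    sample = [
        [True, True],                 # fully solved
        [True, False, False, False],  # partially solved
        [False, False],               # not solved
    ]
    print(f"average pass rate: {average_pass_rate(sample):.3f}")
    print(f"strict accuracy:   {strict_accuracy(sample):.3f}")
```

The two numbers can diverge sharply: a model may pass a fair share of individual test cases while fully solving very few problems, so it helps to know which one a headline figure refers to.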

On the positive side, the models demonstrated the ability to understand the problem, write import statements, define classes, and form program flow. Here is a sample from GPT-2, the smallest of the three models, on a test problem for which it passed all 18 test cases:

And here's an example of what GPT-3 produced for a separate problem.

On the negative side, the most evident problem was syntax errors. Larger models were more resilient against them, and more fine-tuning reduced their prevalence exponentially. There were also times when a solution given by these models would pass for correct at first glance, only to fail every test case once validated.
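As a rough illustration of how the prevalence of syntax errors could be measured for generated Python programs, the sketch below tries to parse each program with the standard library's ast module and reports the share that fails. The helper names and the sample programs are hypothetical, and this is only one plausible way to do the check, not necessarily how the researchers did it.

```python
import ast
from typing import List


def has_syntax_error(source: str) -> bool:
    """Return True if the source cannot be parsed as Python."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # IndentationError is a SyntaxError subclass; ValueError covers
        # sources with null bytes on some Python versions.
        return True
    return False


def syntax_error_rate(programs: List[str]) -> float:
    """Fraction of programs that fail to parse."""
    if not programs:
        return 0.0
    return sum(has_syntax_error(p) for p in programs) / len(programs)


if __name__ == "__main__":
    # Made-up stand-ins for model outputs.
    generated = [
        "x = 1",
        "x = ",
        "def f():\n    return 1",
        "for",
    ]
    print(f"syntax error rate: {syntax_error_rate(generated):.2f}")
```

Note that parsing only catches malformed code. A program can parse cleanly and still fail every test case, which is the second failure mode described above.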

The team thinks that a possible 'memorization' of code blocks from the training set might be the culprit here. The usual idea for tackling such problems is to add more trainable parameters. Overall, it is clear from the results above that while language models have come a long way in conversational abilities and creative and formal writing, their ability to code is still lackluster. But it's definitely getting there.

We assessed state-of-the-art generative models on our benchmark and found that overall performance was low. However, the prevalence of syntax errors decreased exponentially with fine-tuning models of increased scale, and recent models such as GPT-Neo solved a number of introductory problems.

Moving forward, the team expects that as language models keep growing larger and more capable, concerns about malicious code and automation might arise. When that happens, the APPS dataset proposed here might come in handy. For now, it doesn't seem like language models have a shot at landing a decent software engineering job. More details can be found at this GitHub repository or the pre-print repository arXiv.