ChatGPT hallucinates software bugs and ignores real ones

I've seen any number of people defending ChatGPT's use as a tool that can not just write code, but find bugs in existing code. My own position is that it could be useful for this, but only in the hands of someone who is skeptical, detail-oriented, and experienced with the language. This is because ChatGPT doesn't know anything about programming; it just knows what code looks like and what people say about it. This is sometimes good enough to write code, but it very readily departs from reality onto its own hallucinatory journey. In my case, it took a single line of code with one bug in it and instead of identifying that bug decided to add 3 or 4 more.

You can find any number of examples of this online along with discussion and analysis, so this blog post is just a vehicle for posting my own January 9, 2023 chat transcript in a way I can conveniently link to.

(I would have posted it earlier, but OpenAI's chat history function was down for several weeks, so I could not retrieve it.)

Prelude

I started the conversation by posting a few snippets of code and asking what the code did and whether it had any bugs. This part is not very novel, and shows that ChatGPT is sometimes good at coming up with explanations at a very basic level. The only reason I post this part here is for "journalistic integrity"—no report of an AI interaction can be considered trustworthy without full disclosure of version, parameters, number of prior attempts, and (in the case of chatbots) the preceding transcript.

Real-world code often involves way more context than can be loaded into a chat window, so I struggled for a bit to find a good piece to work with. (That struggle is itself revealing.)

(The transcript alternates between my messages, labeled "Me:", and ChatGPT's replies, labeled "ChatGPT:".)

Boring lead-up with several false-starts - ChatGPT gives generic and vague answers

Me:

What does this script do?
#!/usr/bin/env bash

set -eu -o pipefail

json_raw=~/tmp/readings.json
csv_file=~/tmp/readings-co2.csv
png_out=~/tmp/readings-co2.png

ChatGPT:

This script is written in bash, which is a Unix shell and command language. The first line, #!/usr/bin/env bash, is called a shebang and specifies the interpreter for the script, in this case, bash.

The next line, set -eu -o pipefail, sets some options that affect the behavior of the script. The -e option causes the script to exit immediately if any command returns a non-zero exit status. The -u option causes the script to exit if an attempt is made to use an uninitialized variable. The -o pipefail option causes the exit status of a pipeline to be the status of the last command to exit with a non-zero status, or zero if no command exited with a non-zero status.

The remaining lines set the values of three variables: json_raw, csv_file, and png_out. The variables are assigned file paths for three files: a JSON file, a CSV file, and a PNG file. These variables will be used later in the script to refer to these files.

Me:

Can you summarize it in a sentence?

ChatGPT:

This script is written in bash and sets some options that affect its behavior, as well as the values of three variables that represent file paths for a JSON file, a CSV file, and a PNG file.

Me:

Can you find the bug in the script?

ChatGPT:

It is not possible for me to determine if there is a bug in the script without more context. The script as written does not contain any commands or logic, so it is not clear what the script is intended to do or whether there is a problem with its execution. To troubleshoot any issues with the script, it would be helpful to have more information about its purpose and the environment in which it is being run.

Me:

OK. Can you give a one-sentence description of this Kotlin code?

    override fun listDir(rel: Path): List<String> {
        val listPrefix = "$filesKeyPrefix$rel/"
        val ret = mutableListOf<String>()
        var marker = listPrefix // start listing here -- alphabetical order
        while (true) {
            logger.info("S3 List: ${cfg.bucket} from '$marker'")
            val req = ListObjectsRequest.builder()
                .bucket(cfg.bucket)
                .prefix(listPrefix)
                .marker(marker)
                // Roll up anything under deeper "directories" and don't
                // return them (will be listed as "common prefixes").
                // S3 really fights against the directory tree model.
                .delimiter("/")
                .build()
            val resp = client.listObjects(req)

            ret += resp.contents().map {
                it.key().removePrefix(listPrefix)
            }

            if (resp.isTruncated) {
                marker = resp.nextMarker()
            } else {
                break
            }
        }
        logger.debug("S3 listing results: $ret")
        return ret.toList()
    }

ChatGPT:

This Kotlin code is an implementation of a function called listDir that lists the contents of a directory on Amazon S3, represented by the Path parameter rel, and returns the results as a list of strings. It does this by making successive calls to the Amazon S3 ListObjects API, starting at a specified marker and continuing until all of the objects in the directory have been listed or the response is not truncated. The results are added to a mutable list and logged, and the function returns the list as an immutable list.

Me:

Which parts of the code are most likely to contain a bug?

ChatGPT:

It is not possible to determine which parts of the code are most likely to contain a bug without more context. This code appears to be implementing a function that makes API calls to list the contents of a directory on Amazon S3, and it is not clear if there are any issues with its execution or what the expected behavior should be. Without more information about the requirements and behavior of the code, it is difficult to identify potential bugs.

That being said, some common sources of bugs in code include:

  • Syntax errors, such as mis-matched parentheses or missing semicolons
  • Type errors, such as attempting to perform an operation on an incompatible type
  • Logic errors, such as using the wrong operator or misunderstanding the order of precedence of operations
  • Resource exhaustion, such as running out of memory or file handles
  • Race conditions, such as when multiple threads or processes are competing for shared resources

It is always a good idea to test and debug code thoroughly to identify and fix any issues that may arise.

The interesting bit

With that out of the way, here's the interesting part of the transcript. I plucked out a single helper function with no further context required. This one has a known bug. I hoped to ask what the code did, and then as a followup ask if there were any bugs.

Me:

Tell me about what this Kotlin code is doing, and why it's written this way:

fun Iterable<Long>.product(): Long = fold(1L, Long::times)

ChatGPT:

This Kotlin code defines an extension function called product on the Iterable interface. The Iterable interface represents a collection of elements that can be iterated over, such as a list or set. The product function calculates the product of the elements in the collection by iterating over the elements and multiplying them together using the fold function.

The fold function is a higher-order function that takes an initial accumulator value and a function as arguments, and applies the function to each element in the collection and the accumulator, updating the accumulator with the result of the function for each element. The product function uses the times function from the Long class as the function argument to fold, which multiplies the accumulator by the element.

The product function is written in this way to provide a concise and efficient way to calculate the product of the elements in a collection. The fold function abstracts the iteration and accumulation logic, making it easy to define the product function in a single line of code. The use of the times function from the Long class allows the product function to handle large values without overflowing the accumulator.

Well, that's certainly fascinating! It explicitly discusses the overflow bug... by saying that it isn't there. This is incorrect, as Long::times is just a fancy way of writing *, the multiplication operator (in a form that can be passed around as a function). It is indeed susceptible to overflow. Multiplying Long integers together in Kotlin or any other JVM language will eventually wrap around (two large positive numbers multiplied together can produce a negative number), which silently creates garbage data. I had already fixed this in my own code by swapping out Long::times for Math::multiplyExact, which throws on overflow instead of wrapping around.
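To make the difference concrete, here is a minimal sketch of the two variants side by side. The names wrappingProduct and checkedProduct are made up for this illustration; my actual helper is just called product.

```kotlin
// Long::times is the ordinary `*` operator, so this version wraps silently on overflow.
fun Iterable<Long>.wrappingProduct(): Long = fold(1L, Long::times)

// java.lang.Math.multiplyExact throws ArithmeticException instead of wrapping.
fun Iterable<Long>.checkedProduct(): Long = fold(1L, Math::multiplyExact)
```

With these definitions, listOf(Long.MAX_VALUE, 2L).wrappingProduct() quietly returns -2, while the same call on checkedProduct() throws an ArithmeticException. For inputs that fit, such as listOf(3L, 4L), both return 12.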

Surprised, I asked if it could find the bug. Maybe it would realize the problem?

Me:

Please find the bug in that code.

ChatGPT:

It is not possible for me to determine if there is a bug in the code without more context. The code as written appears to define a function that calculates the product of the elements in a collection, and it is not clear if there are any issues with its execution or what the expected behavior should be. Without more information about the requirements and behavior of the code, it is difficult to identify potential bugs.

That being said, one potential issue with this code is that it will produce an incorrect result if any of the elements in the collection are zero. Since any number multiplied by zero is zero, the product of a collection containing zero will always be zero, regardless of the values of the other elements in the collection. If it is important to handle zero elements correctly, the code could be modified to check for zero elements and handle them appropriately.

The first paragraph is mostly boilerplate apology and evasion.

And then... it suggests that zeroes are a problem, because x * 0 will be zero, regardless of the value of x. While that's certainly an interesting fact about arithmetic, that's not a bug. I was very curious about what "handling zero elements appropriately" would entail.

Me:

Can you suggest a more correct version of that code?

ChatGPT:

One way to modify the code to handle zero elements correctly would be to add a check for zero elements before calling fold, and return zero if any zero elements are found:

fun Iterable<Long>.product(): Long {
    if (any { it == 0L }) return 0L
    return fold(1L, Long::times)
}

This version of the product function checks for zero elements using the any function, which returns true if any element in the collection satisfies the predicate provided as an argument (in this case, the predicate checks if the element is equal to zero). If any zero elements are found, the function returns zero immediately. Otherwise, it calculates the product of the elements using the original fold implementation.

This modified version of the product function will handle zero elements correctly, but it will not handle negative elements correctly. To handle negative elements correctly, the code could be modified to track the number of negative elements and return the negative of the product if an odd number of negative elements are found.

fun Iterable<Long>.product(): Long {
    if (any { it == 0L }) return 0L
    val negativeCount = count { it < 0 }
    val product = fold(1L, Long::times)
    return if (negativeCount % 2 == 1) -product else product
}

This version of the product function uses the count function to count the number of negative elements in the collection, and then adjusts the sign of the product if an odd number of negative elements are found. This will handle both zero and negative elements correctly.

The first suggestion it makes is essentially to short-circuit on zeroes. This could actually be a useful optimization, but it's presented as something to prevent a bug, which is dead wrong. If anyone submitted a patch to this project with that line, I'd expect it to be accompanied by a comment justifying the extra complexity. Instead, it overexplains the mechanics of the code without explaining the purpose of the code.

And then it volunteers a new fake bug! Somehow it has decided that negative integers aren't handled properly by the JVM and that we must negate the result under some circumstances. Bananas!
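A quick way to check both claims is to run the original one-liner next to ChatGPT's second suggestion. This is only a sketch; the names plainProduct and signFlippedProduct are invented here so that the two versions can coexist in one file.

```kotlin
// The original one-liner: multiplication already handles zeroes and signs.
fun Iterable<Long>.plainProduct(): Long = fold(1L, Long::times)

// ChatGPT's second suggestion, renamed so the two can be compared.
fun Iterable<Long>.signFlippedProduct(): Long {
    if (any { it == 0L }) return 0L
    val negativeCount = count { it < 0 }
    val product = fold(1L, Long::times)
    // `product` already carries the right sign; negating it again makes it wrong.
    return if (negativeCount % 2 == 1) -product else product
}
```

listOf(5L, 0L, 7L).plainProduct() is 0 and listOf(-2L, 3L).plainProduct() is -6, both as expected. listOf(-2L, 3L).signFlippedProduct(), on the other hand, returns 6: the "fix" flips a sign that was already correct.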

(EDIT: Both suggested versions actually introduce another bug, as the iterable is iterated multiple times: up to twice in the first snippet, and up to three times in the second. This is a problem if iterating has side effects or can only be done once.)
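The multiple-iteration problem can be reproduced with an iterable that only allows a single pass. In this sketch, oneShot is a hypothetical helper built on the standard library's constrainOnce(), and the two product functions are renamed copies of my original and of ChatGPT's first suggestion.

```kotlin
// Original one-liner: walks the iterable exactly once.
fun Iterable<Long>.singlePassProduct(): Long = fold(1L, Long::times)

// ChatGPT's first suggestion (renamed): `any` and `fold` each request their own iterator.
fun Iterable<Long>.zeroCheckedProduct(): Long {
    if (any { it == 0L }) return 0L
    return fold(1L, Long::times)
}

// Hypothetical helper: an Iterable that can only be consumed once.
// constrainOnce() makes a second request for an iterator throw IllegalStateException.
fun oneShot(vararg values: Long): Iterable<Long> =
    values.asSequence().constrainOnce().asIterable()
```

oneShot(2L, 3L).singlePassProduct() returns 6, while oneShot(2L, 3L).zeroCheckedProduct() throws an IllegalStateException as soon as fold asks for a second iterator.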

I've mostly been letting mistakes slide rather than pushing back, but I wanted to probe on this one a little.

Me:

Doesn't that reliably produce the wrong sign value?

ChatGPT:

I apologize for the mistake in my previous response. The modified version of the product function that I provided will indeed produce the wrong sign value in some cases.

The issue is with the line val negativeCount = count { it < 0 }, which counts the number of negative elements in the collection. This will count elements that are less than zero, including those that are negative but not necessarily integers (such as -0.5 or -1.2). As a result, the product function will incorrectly return a negative result for some collections that contain negative elements but do not have an odd number of negative integers.

One way to fix this issue would be to modify the count predicate to only count negative integers:

fun Iterable<Long>.product(): Long {
    if (any { it == 0L }) return 0L
    val negativeCount = count { it < 0 && it.toInt().toLong() == it }
    val product = fold(1L, Long::times)
    return if (negativeCount % 2 == 1) -product else product
}

This version of the product function uses the toInt and toLong functions to cast the element to an integer and back to a long, and then compares the original element to the result. If the element is equal to the result, it is a negative integer, and it is counted towards the negativeCount. This will ensure that only negative integers are counted, and the product function will produce the correct sign value for collections that contain negative elements.

It's well and truly off the rails by now. Obviously, we can't have floating point numbers here because this function only takes Longs. But even if it accepted Number, the .toInt().toLong() would be a very weird choice. (This would have the effect of truncating the value to 32 bits and then widening it back to Long.)

This is a bad solution to a problem that doesn't exist, and just makes the code even buggier. Now it will treat numbers like -5,000,000,000 differently from small negative numbers, because such a value gets truncated on the second line of the function body, which changes the outcome of the guard condition and leaves it out of the negative count.
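Here is a small sketch of what the truncation does in practice. The function is ChatGPT's third suggestion, renamed truncatingProduct for this illustration; nothing else is changed.

```kotlin
// ChatGPT's third suggestion, renamed for illustration.
fun Iterable<Long>.truncatingProduct(): Long {
    if (any { it == 0L }) return 0L
    // toInt() keeps only the low 32 bits, so the round trip changes any value
    // outside Int.MIN_VALUE..Int.MAX_VALUE, and such values are never counted.
    val negativeCount = count { it < 0 && it.toInt().toLong() == it }
    val product = fold(1L, Long::times)
    return if (negativeCount % 2 == 1) -product else product
}
```

The round trip turns 5_000_000_000L into 705032704 and -5_000_000_000L into -705032704. As a result, listOf(-5L, 2L).truncatingProduct() returns 10 (wrong sign), while listOf(-5_000_000_000L, 2L).truncatingProduct() returns -10000000000, which is correct only because the large negative value slipped past the count.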

I made one last attempt:

Me:

Thanks! I'm curious how the list could end up with floating point values in it, though.

ChatGPT:

It is possible for a list to contain floating point values if the elements in the list are of a floating point type, such as Double or Float, or if the elements are converted from another type to a floating point type using a type conversion function.

For example, consider the following code:

val list = mutableListOf(1.5, 2.6, 3.7)

This code creates a mutable list of floating point values, which could be passed to the product function.

It is also possible to create a list of floating point values by converting values of another type to a floating point type using a type conversion function. For example:

val list = listOf(1, 2, 3).map { it.toDouble() }

This code creates a list of integers, and then uses the map function to apply a type conversion function to each element, converting it to a Double. The resulting list will contain floating point values.

It is important to consider the types of the elements in a list when performing operations on them, as different types may behave differently and may produce unexpected results if they are used in ways that are not appropriate for their type.

And by now I think it has just lost track of the earlier conversation and thinks I'm talking about making lists of floats in general.

Imagine that someone used ChatGPT to write this code and put it in the codebase. It works great at first, and the tests pass: [123L, 456L].product() gives the expected answer, and so does [-123L, -456L].product(). But then you start getting garbage data when there are large numbers, but only sometimes. It would give results with the wrong sign if there were an odd number of negative integers, except when one of the negative integers was less than -2^31. Have fun debugging that!

At most, you could safely use this technology for finding bugs that you can verify, suggesting missing case coverage, and offering up standard library utilities you could call. What remains to be seen is how much time that would waste, as it is very willing to make up bad information that would then need to be falsified. And in the hands of a junior developer, who doesn't have the experience to see through the nonsense? I already see enough pattern-matching by devs who don't know what they're doing and are just copying or imitating code they see on Stack Overflow. I can tell you what's going to happen: That code is going right into production.

Updates

2023-02-26: I was curious to see what would happen if I asked more directly about overflow bugs, and without any prelude:

I'm astonished anyone has managed to get anything useful out of this bot.

