Experiment: Using LLMs to automate user acceptance test creation
A small experiment to satisfy my own curiosity on whether LLMs can generate user acceptance tests
Subjectivity notice:
This is an experiment I conducted, with limited time and effort, to satisfy my own curiosity about the extent to which LLMs might be used to automate application testing during the application development process. It is by no means a rigorous and complete study, especially not in the fast-moving world of Generative AI.
Motivation
User acceptance tests are part and parcel of enterprise application development. Once an application feature is developed, the business user has to go through the application to ensure that the requested feature meets requirements and that the business process the feature is part of can be successfully executed. During this stage, the business user usually interacts with the application via the user interface and executes certain steps, like filling in forms or clicking buttons, that indicate the application is functioning according to business expectations.
It then struck me that this looks a lot like front-end testing, except that instead of testing for particular elements in the user interface, you test the execution of a business process from a user’s perspective. You just need to get, via the user interface, from point A to point B in the application. If we were to automate this, libraries like Selenium and Helium (a higher-level abstraction over Selenium) should be able to help.
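To make this concrete, here is a minimal sketch of what such a Helium-driven journey could look like, assuming Helium is installed (`pip install helium`). The URL and element labels are purely illustrative, and the function is only defined, not called, since executing it requires a live browser:

```python
def run_login_journey():
    """Illustrative sketch: drive a user journey from point A (a login page)
    to point B (the logged-in page) the way a user would. Needs a live browser."""
    from helium import start_chrome, write, click, kill_browser  # deferred import
    start_chrome("https://example.com/login")   # point A: open the login page
    write("testuser", into="Username")          # fill the form like a user would
    write("testPassword", into="Password")
    click("Sign in")                            # point B: the logged-in landing page
    kill_browser()
```

The key appeal of Helium here is that steps are expressed in terms of visible labels ("Username", "Sign in") rather than CSS selectors, which is much closer to how a business user would describe the journey.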
The benefits of being able to automate this are
the development team can quickly test, in an end-to-end fashion, how the application feature would function from the users’ perspective during development
we might be able to skip slow, manual testing by users if we can simultaneously generate artifacts, such as screenshots at each step, that demonstrate that the business process can be successfully executed
However, the obstacle of translating what is in the users’ minds into test code remains. This is a task that involves significant effort, and it is this effort that I am hoping large language models (LLMs) can help overcome.
If we can allow users to simply say what they want to test, then they can have as many tests (positive and negative) as they want!
I want to know whether, given natural language descriptions of user journeys like those from a non-tech user, an LLM can generate test code sufficiently well to perform user acceptance tests.
Experiment Parameters
Test user journey
As a user of github.com, I want to create a private repository on GitHub.
Approach
Use OpenAI’s ChatGPT 3.5
Ask chatGPT to use Helium/Selenium
Start from very specific step-by-step instructions on what needs to be done. For example, “click on button labelled ‘Next’”.
Make the instructions more “vague” or conversational. I.e. more of a description of intent rather than a recipe.
See what works (or doesn’t) and what might make the generated code work better
Results and Discussion
I won’t be listing all the prompts (and their responses) that I’ve tried, only the ones that are most interesting to me.
Step-by-step recipe
Prompt
You are a frontend tester.
Given the following context, write the test code using Helium syntax to test the frontend application accordingly.
This is a code versioning web application.
Open github.com
Login with the following credentials:
- Username: <testuser>
- Password: <testPassword>
Click on “New”
Fill in “test-chatgpt-repo1” as the “Repository name”.
Fill in “This is a test repo” in “Description”
Click on “Private”
Click “Create Repository”

As you can see, I am very specific about what to click and what to fill in each field. A human can follow these instructions quite easily without much thinking.
Response
open("https://github.com")
click("Sign in")
write("testuser", into="Username")
write("testPassword", into="Password")
click("Sign in")
click("New")
write("test-chatgpt-repo1", into="Repository name")
write("This is a test repo", into="Description")
click("Private")
click("Create repository")

First thing to note: open(…) in the first line actually won’t work. It should be something like start_chrome.
Other than that, the code works: it can log in and create the repository. This means that chatGPT actually has some knowledge of Helium and Selenium syntax. Unsurprising, I guess, since code generation is one of the much-touted features of LLMs.
But this is not good enough. If I have to spell out to the LLM every last detail of what to do, I might as well have coded the tests myself.
Becoming slightly more vague
Prompt
You are a frontend tester.
Given the following context, write the test code using Helium and Selenium syntax to test the frontend application accordingly.
This is a code versioning web application hosted at github.com.
Login with the following credentials:
- Username: <testuser>
- Password: <testPassword>
Create a new repository, call it “test-chatgpt-repo2” with a description of “This is a test repo2”.
Set the repository as a private repository before creating it.

Now notice that instead of saying “click this” or “click that”, I simply ask chatGPT to “create a repository” or “call it XXX”.
Response
from helium import *
import time
# Function to login to GitHub
def login(username, password):
    start_chrome("https://github.com/login")
    write(username, into="Username")
    write(password, into="Password")
    click("Sign in")
# Function to create a new repository
def create_repository(repo_name, description):
    click("New")
    write(repo_name, into="Repository name")
    write(description, into="Description")
    click("Private")
    click("Create repository")
# Test scenario
def test_create_private_repository():
    # Login
    login("<testuser>", "<testPassword>")
    time.sleep(2)  # Wait for the page to load
    # Create a new repository
    create_repository("test-chatgpt-repo2", "This is a test repo2")
    time.sleep(2)  # Wait for the page to load
    # Verify repository creation
    if Text("test-chatgpt-repo2").exists():
        print("Repository created successfully.")
    else:
        print("Failed to create repository.")

Notice that it knows to use the into parameter when using write, such as write(repo_name, into="Repository name").
Another thing that surprised me was that chatGPT separated the steps into functions.
What disappointed me slightly was that
it used time.sleep to wait for the next step to load, instead of using .exists as it did in the final statement
the test does not end with a kill_browser() statement, so it just sits there after the repository is created
Nonetheless, by and large, the code worked.
Saving screenshots as artifacts
Now I want to see if I can also instruct chatGPT to generate code to take screenshots that would serve as proof that the feature works end-to-end as expected.
Prompt
You are a frontend tester.
Given the following context, write the test code using Helium and Selenium syntax to test the frontend application accordingly.
This is a code versioning web application hosted at github.com.
Login with the following credentials:
- Username: <testuser>
- Password: <testPassword>
Create a new repository, call it “test-chatgpt-repo2” with a description of “This is a test repo2”.
Set the repository as a private repository before creating it.
Save a screenshot using Selenium at each step.

Response
from helium import *
from selenium import webdriver
import time
# Function to save screenshot
def save_screenshot(step_name):
    screenshot_path = f"{step_name}.png"
    browser.save_screenshot(screenshot_path)
    print(f"Screenshot saved: {screenshot_path}")
# Open browser and navigate to GitHub login page
browser = webdriver.Chrome()
start_chrome("https://github.com/login")
wait_until(Text("Sign in").exists)
# Login
write("<testuser>", into="Username or email address")
write("<testPassword>", into="Password")
click("Sign in")
# Save screenshot for login step
save_screenshot("login")
# Create a new repository
click("New")
wait_until(Text("New repository").exists)
write("test-chatgpt-repo2", into="Repository name")
write("This is a test repo2", into="Description (optional)")
click("Private")
click("Create repository")
# Save screenshot for repository creation step
save_screenshot("repository_creation")
# Wait for repository creation to complete
wait_until(Text("Quick setup").exists)
# Close the browser
kill_browser()

Again, the code works, by and large.
First thing to note: the steps are no longer wrapped in functions. From a code point of view, this might be stylistically inconsistent for the development team.
Next, the generated screenshots actually don’t work. This is because they are taken through a separately created browser instance (webdriver.Chrome()) rather than the browser Helium started, instead of using get_driver().save_screenshot(…) as specified in the Helium documentation.
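A corrected helper, following the Helium docs, would screenshot through the driver Helium itself manages, obtained via get_driver(), instead of constructing a second webdriver.Chrome(). The import is deferred because get_driver() is only meaningful after start_chrome()/start_firefox():

```python
def save_screenshot(step_name):
    """Screenshot the browser Helium is driving, not a separately created one."""
    from helium import get_driver  # deferred: valid only after start_chrome()
    screenshot_path = f"{step_name}.png"
    get_driver().save_screenshot(screenshot_path)
    return screenshot_path
```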
This got me thinking: maybe I can provide chatGPT with instructions on how to use Helium. Thankfully, the Helium author provides a cheatsheet.
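Mechanically, prepending documentation is just prompt assembly. A small helper along these lines captures the ordering I used; the role line and section wording are my own choices, not a fixed API:

```python
def build_prompt(cheatsheet, steps):
    """Assemble a tester prompt: role line, pasted Helium docs, then the scenario."""
    fence = "`" * 3  # markdown code fence wrapping the pasted cheatsheet
    header = "You are a frontend tester who uses Helium and Selenium."
    docs = f"Here are instructions on how to use Helium.\n{fence}\n{cheatsheet}\n{fence}"
    scenario = (
        "Given the following context, write the test code using Helium and "
        "Selenium syntax to test the frontend application accordingly.\n"
        + "\n".join(steps)
    )
    return "\n".join([header, docs, scenario])
```

Keeping the documentation before the scenario means the model has the API reference in context by the time it reads what to automate.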
Teaching chatGPT how to use Helium
Prompt
You are a frontend tester who uses Helium and Selenium.
Here are instructions on how to use Helium.
```
# Helium cheatsheet
This page very quickly teaches you the most important parts of Helium's API.
## Importing
All of Helium's public functions lie directly in the module `helium`.
You can for instance import them as follows:
```python
from helium import *
```
## Starting a browser
Helium currently supports Chrome and Firefox. You can start them with the
following functions:
```python
start_chrome()
start_firefox()
```
... <THE REST OF THE HELIUM CHEATSHEET> ...
## Taking a screenshot
Use Selenium's API:
```python
get_driver().save_screenshot(r'C:\screenshot.png')
```
Note the leading `r`. This is required because the string contains a backslash
`\`.
```
Given the following context, write the test code using Helium and Selenium syntax to test the frontend application accordingly.
This is a code versioning web application hosted at github.com.
Login with the following credentials:
- Username: <testuser>
- Password: <testPassword>
Create a new repository, call it “test-chatgpt-repo2” with a description of “This is a test repo2”.
Set the repository as a private repository before creating it.
Save a screenshot at each step.

Response
from helium import *
from selenium.webdriver.common.keys import Keys
# Start Chrome browser
start_chrome("github.com/login")
# Login
write("testuser", into="Username")
write("testPassword", into="Password")
click("Sign in")
# Save screenshot of login
get_driver().save_screenshot("login.png")
# Create a new repository
click("New")
write("test-chatgpt-repo2", into="Repository name")
write("This is a test repo2", into="Description")
if Button("Public").exists(): # Check if the repository is already set to private
    click("Public")
click("Private")
click("Create repository")
# Save screenshot of repository creation
get_driver().save_screenshot("repo_creation.png")
# Close the browser
kill_browser()

This time we got it right!
What if I just state my intent?
Now, what if I just tell chatGPT what I want to do? Like how a non-tech user would articulate his/her needs/wants.
Prompt
You are a frontend tester who uses Helium and Selenium.
Write a piece of test code that would test the creation of github repositories.
Take screenshots at each step to prove execution.

Response
from helium import *
from selenium.webdriver.common.keys import Keys
# Function to take screenshot
def take_screenshot(step_name):
    screenshot_path = f"screenshot_{step_name}.png"
    screenshot(screenshot_path)
    return screenshot_path
# Open GitHub in browser
start_chrome("https://github.com")
# Click on Sign in button
click("Sign in")
# Input your GitHub username
write("your_username", into="Username")
# Input your GitHub password
write("your_password", into="Password")
# Press Enter to sign in
press(ENTER)
# Wait for the page to load
wait_until(Text("Explore GitHub").exists)
# Click on the '+' button for new repository
click("New")
# Input repository name
write("TestRepository", into="Repository name")
# Input repository description
write("This is a test repository created for automation testing", into="Description (optional)")
# Choose repository visibility
click("Public")
# Initialize README
click("Initialize this repository with a README")
# Click on Create repository button
click("Create repository")
# Wait for the repository to be created
wait_until(Text("Quick setup").exists)
# Take screenshots at each step
screenshot_paths = []
screenshot_paths.append(take_screenshot("step_1_sign_in"))
screenshot_paths.append(take_screenshot("step_2_new_repository"))
# Close the browser
kill_browser()

Unsurprisingly, the screenshot function wouldn’t work. Also, the screenshot steps are placed in the wrong place: screenshots should be taken after each step instead of at the end.
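One way to guarantee the ordering is a tiny wrapper that performs a step and immediately captures it. The save callable is injected here (in a real run it would be get_driver().save_screenshot) so the interleaving logic can be shown without a browser:

```python
def run_step(name, action, save):
    """Run one test step, then immediately save a screenshot for that step."""
    action()                        # perform the step (e.g. a Helium click/write)
    path = f"screenshot_{name}.png"
    save(path)                      # capture right after the step, not at the end
    return path

# Example with stand-in callables that just record the order of events:
events = []
run_step("sign_in", lambda: events.append("clicked"), lambda p: events.append(p))
# events is now ["clicked", "screenshot_sign_in.png"]: the capture follows the step
```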
But what surprised me was that chatGPT actually knows the layout of GitHub. See write("TestRepository", into="Repository name"): it actually knows to write into the “Repository name” field.
Perhaps I shouldn’t be so surprised, since GitHub is such a popular web application and a lot of the code that chatGPT was trained on probably came from GitHub.
This makes me wonder what would happen if the application in question is a custom application that chatGPT has not seen before. Would it guess?
Final thoughts
The whole purpose was to see whether I could go from a natural language description of what business users want to do or test to actual test code that executes the users’ intent. I also wanted to make sure that I wouldn’t have to be too specific with my instructions, otherwise I might as well have coded the tests directly myself.
ChatGPT performed well on some fronts. It was able to generate code that was quite workable and was even more accurate when I was able to provide code documentation (see step with cheatsheet). With minimal code modification, the code can be quite usable. Thus, in terms of reducing developer toil in coding the tests, there might be some benefits there.
Also, if the user does not mind being very specific with their instructions (i.e. step-by-step), we might be able to generate good test code.
However, the last test of simply stating intent showed me that chatGPT might struggle a bit with a bespoke application unless step-by-step instructions are given.
That being said, my tests were very limited. In a multimodal setting, where the foundation model has visual information (i.e. the model can see the application as well), better test code might be generated. On top of that, if we take an agentic approach where the intent is broken down into sub-steps, like what we have in LlamaIndex’s sub question engine, then there might really be a chance that good test code can be generated from non-technical descriptions. Sadly, I have neither the time nor the resources to test this out fully at this point.


