Is ChatGPT or Claude better than Playwright Codegen?

I'm a bit of an AI skeptic. And even though GitHub Copilot is my daily auto-completion on steroids, I always double-check the code generated by LLMs. If you're using AI for coding, you probably know that the results are sometimes surprisingly good and other times shockingly terrible.

Lately, I have seen more and more articles and even docs recommending ChatGPT to generate Playwright tests.

Could this be true? Are ChatGPT and friends really that good at generating test code? And more importantly, are the new AI tools better than Playwright's code generation tools?

There's only one way to find out! 

I spent a few days playing with AI tools to understand what language models can and can't do. If you have the same questions, let's find some answers!

This article and video are part one of a new "Playwright and AI" series. If you want to catch the next one covering Playwright coding with GitHub Copilot and Cursor, subscribe to our YouTube Channel or RSS feed.

The baseline: Generate Playwright code with built-in Codegen

Playwright provides built-in tools to record browser actions. When you call the codegen command, it opens a browser window and the Playwright Inspector in recording mode. As you navigate a site, click around, and fill out forms, every action is recorded into a generated Playwright script.
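
In its simplest form, recording is a single command. Pass a URL to start on a specific page, and -o writes the generated test straight to a file (the path below is only an example); npx playwright codegen --help lists all the other options.

npx playwright codegen
npx playwright codegen -o tests/search-docs.spec.ts https://playwright.dev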

Copy the script, run it, and off you go with your new end-to-end test!

Playwright Codegen — Record your end-to-end tests with native Playwright tooling.

Like with all code generation tools, the codegen results are not perfect, but Playwright is good at creating code for itself to run.

Let's take two scenarios and find out if an LLM like ChatGPT or Claude can beat the native Playwright tooling.

Scenario one: search the Playwright docs

Here's our first end-to-end test action plan:

  • set the viewport to an iPhone
  • open playwright.dev
  • click on the search button
  • search for "locator"
  • click on the first search result
  • expect that the headline is visible
  • expect that the headline contains the word "Locators"

Transforming these actions and assertions into Playwright code should be straightforward regardless of the approach, right?

Search the docs with Playwright-generated code

To kick off Codegen, head to your terminal and run npx playwright codegen.

npx playwright codegen --device "iPhone 11" https://playwright.dev

The --device flag emulates an iPhone viewport, the URL makes Codegen navigate to playwright.dev automatically, and from there you generate your test by interacting with the page. Here's the result of me clicking around in the Playwright docs.

// generated by Playwright Codegen
import { test, expect, devices } from '@playwright/test';

test.use({
  ...devices['iPhone 11'],
});

test('test', async ({ page }) => {
  await page.goto('https://playwright.dev/');
  await page.getByLabel('Search').click();
  await page.getByPlaceholder('Search docs').click();
  await page.getByPlaceholder('Search docs').fill('locator');
  await page.getByRole('link', { name: 'Locators', exact: true }).click();
  await expect(page.getByRole('heading', { name: 'Locators', exact: true })).toBeVisible();
});

Generating a Playwright test that navigates the docs takes about twenty seconds with Codegen. And while the result isn't perfect (you could drop the extra click and extract some variables, as shown in the cleanup below), it follows Playwright's best practices: getBy locators and web-first assertions.

Recording a Playwright test requires almost no effort and results in decent code.
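
For reference, here is one possible cleanup of the recorded test, assuming you only want to drop the redundant click and reuse locators via variables; the behavior stays the same.

// possible cleanup of the Codegen output (redundant click removed, locators reused)
import { test, expect, devices } from '@playwright/test';

test.use({
  ...devices['iPhone 11'],
});

test('search the docs for "locator"', async ({ page }) => {
  // reusable locators for the search input and the target page
  const searchInput = page.getByPlaceholder('Search docs');
  const locatorsLink = page.getByRole('link', { name: 'Locators', exact: true });
  const locatorsHeading = page.getByRole('heading', { name: 'Locators', exact: true });

  await page.goto('https://playwright.dev/');
  await page.getByLabel('Search').click();
  // fill() focuses the input itself, so the extra click isn't needed
  await searchInput.fill('locator');
  await locatorsLink.click();
  await expect(locatorsHeading).toBeVisible();
});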

Let's see how AI performs!

Search the docs with AI-generated code

I went to ChatGPT and tried to perform the same task using the new canvas feature, which is aimed at coding tasks. Here's my first prompt:

Write a Playwright test using test and expect in TypeScript: 

- set viewport to an iPhone
- open playwright.dev
- click on the search button
- search for "locator"
- click on the first result
- expect that the headline is visible
- expect that the headline contains the word "Locator"

And here's the first result.

Generated Playwright code with many issues.

The initial ChatGPT response wasn't great: it hardcoded the iPhone device sizes, used deprecated methods, and didn't follow Playwright's best practices. After a bit of back and forth and telling ChatGPT what to do, I managed to get the LLM to generate decent Playwright test code.
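
To make those issues concrete, here is an illustrative reconstruction of the problematic patterns: a hardcoded viewport, deprecated page-level methods, and fixed timeouts. It's a sketch of the flavor of the first draft, not ChatGPT's verbatim output.

// illustrative sketch of the first-draft problems (reconstructed, not ChatGPT's verbatim output)
import { test } from '@playwright/test';

test('search the docs (first-draft style)', async ({ page }) => {
  // hardcoded iPhone dimensions instead of the built-in devices['iPhone 12'] descriptor
  await page.setViewportSize({ width: 390, height: 844 });
  await page.goto('https://playwright.dev');

  // deprecated page-level methods and raw selectors instead of getByRole / getByPlaceholder
  await page.click('text=Search');
  await page.type('input', 'locator');

  // hardcoded timeout instead of a web-first assertion that waits automatically
  await page.waitForTimeout(2000);
});

The version below is where the conversation eventually landed.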

// generated by ChatGPT
import { test, expect, devices } from '@playwright/test';

// Define a Playwright test using TypeScript
// This script sets the viewport, searches for "locator" on the playwright.dev site,
// and makes assertions to validate the content of the headline

test.use({ ...devices['iPhone 12'] });

test('Search for Locator on Playwright.dev with iPhone viewport', async ({ page }) => {
  // Go to the playwright.dev website
  await page.goto('https://playwright.dev');

  // Click on the search button (using role-based locator)
  await page.getByRole('button', { name: 'Search' }).click();

  // Enter the search term "locator" in the search field (using label-based locator)
  await page.getByPlaceholder('Search docs').fill('locator');

  // Wait for search results and click on the first result (using role-based locator)
  await page.getByRole('link').nth(0).click();

  // Get the headline of the first result page
  const headline = page.getByRole('heading', { level: 1 });

  // Assert that the headline is visible
  await expect(headline).toBeVisible();

  // Assert that the headline contains the word "Locator"
  await expect(headline).toContainText('Locator');
});

Here's the whole conversation if you're curious.

But did searching "locator" in the Playwright docs work with this generated script?

A failing test case shown in Playwright UI.

It didn't. When you ask ChatGPT or Claude to perform actions on a website, they're just making things up. They don't know what they're dealing with. The LLM isn't reaching out to the web to analyze any HTML; it's really just guessing Playwright code.

It puzzles me that people say, "Just use ChatGPT" for writing end-to-end tests, because it won't generate good code by default, and you'll almost always end up in a debugging session right from the start.
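
In practice, that means reaching for Playwright's debugging tools right away, for example stepping through the failing test with the Playwright Inspector:

npx playwright test --debug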

Using vanilla ChatGPT or Claude is not that great compared to Playwright's native Codegen.

Adjust the LLM prompt to get higher-quality code

But what can we do about the problem of ChatGPT spitting out poor code? 

I researched good coding prompts and discovered cursor.directory, a site that lists AI coding prompts for the Cursor editor. If you haven't heard of Cursor, it's a VS Code fork that enriches the editor with AI suggestions and workflows. Some people are pretty hyped about it, and I will look at it more closely soon.

Screenshot of cursor.directory showing various LLM prompts.

It turns out that a few small adjustments to the LLM prompt can improve the resulting code tremendously. All the community prompts include:

  • a role for the LLM to take - "You are an expert in [...]".
  • additional rules and direction - "You write concise, technical code".
  • as much context as possible.

So, I tweaked my Playwright LLM prompt.

// Adjusted Playwright-focused prompt
You are an expert in TypeScript, Frontend development, and Playwright end-to-end testing. 
You write concise, technical TypeScript code with accurate examples and the correct types. 

* Always use the recommended built-in and role-based locators (getByRole, getByLabel, etc.) 
* Prefer to use web-first assertions whenever possible 
* Use built-in config objects like devices whenever possible 
* Avoid hardcoded timeouts 
* Reuse Playwright locators by using variables 
* Follow the guidance and best practices described on playwright.dev 
* Avoid commenting the resulting code 

Here's the task: 

Write a Playwright test using test and expect in TypeScript:

* Set viewport to an iPhone
* Open playwright.dev
* Click on the search button
* Search for "locators"
* Click on the first result
* Expect that the headline is visible
* Expect that the headline contains the word "Locator"

And the resulting code was way better!

ChatGPT generating good Playwright code.

With an extended prompt, ChatGPT and Claude generated good Playwright code, and surprisingly, they sometimes even produced test code that worked for playwright.dev. This was unexpected, though searching the Playwright docs is probably a typical tutorial example.

But it became clear that whatever you do with an LLM, you should always use a customized prompt or a specialized GPT that includes advanced LLM instructions. Context, rules and instructions are essential for good results. If you "just use ChatGPT" to generate your end-to-end tests, you'll end up with poor and outdated code. 

Side note: the same rule applies to asking ChatGPT for Playwright help. Asking "How to click an element with Playwright?" can result in terrible advice, whereas an adjusted prompt ("You are an expert […] How to click an element with Playwright?") drastically increases the quality.
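
If you talk to a model through its API rather than a chat window, the same principle applies: send the rules as a system message with every request. Here's a minimal sketch using the official openai Node SDK; the model name, the helper function, and the trimmed-down rule set are examples, not the exact setup used for this article.

// minimal sketch: reuse the Playwright-focused rules as a system prompt (example setup)
import OpenAI from 'openai';

const openai = new OpenAI(); // expects OPENAI_API_KEY in the environment

const playwrightRules = `You are an expert in TypeScript, Frontend development, and Playwright end-to-end testing.
You write concise, technical TypeScript code with accurate examples and the correct types.
* Always use the recommended built-in and role-based locators (getByRole, getByLabel, etc.)
* Prefer web-first assertions and built-in config objects like devices
* Avoid hardcoded timeouts and avoid commenting the resulting code`;

async function generatePlaywrightTest(task: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o', // example model name
    messages: [
      { role: 'system', content: playwrightRules },
      { role: 'user', content: task },
    ],
  });
  return completion.choices[0].message.content ?? '';
}

// usage:
// const testCode = await generatePlaywrightTest('Write a Playwright test that searches playwright.dev for "locators".');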

But what if we give ChatGPT more context and include the HTML of the site we want to test in the prompt?

Scenario two: test a simple HTML app

I created a quick demo app to keep the complexity low and maybe even see the mighty AI and LLMs succeed.

Two screens showing a button to click and the message "You clicked the button too many times!"

"The app" consists of a single HTML page with some inline JavaScript. You can press a button, the clicks will be counted, and if you click the button three times, you'll be greeted with a sad raccoon. 

There's no magic, and this functionality should be pretty straightforward to test. 

  • Navigate to localhost:8080
  • Click the increase button
  • Check if the counter shows 1
  • Click the increase button two more times
  • Check if "You clicked the button too many times!" is visible on the page
  • Check if a raccoon image is visible on the page
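
One practical detail: the page needs to be served locally so that localhost:8080 responds. Any static file server will do; http-server is just one example.

npx http-server -p 8080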

Let's see which code-generation tool wins this battle!

Test a simple local app with Playwright-generated code

Recording the required actions for this example app again took me about 20 seconds with Playwright Codegen. The generated test code used the recommended locators, Playwright discovered the test id in the HTML, and, most importantly, the test succeeded.

The test code could have been a bit more DRY, but the generated code worked fine!

// generated by Playwright Codegen
test('test', async ({ page }) => {
  await page.goto('http://localhost:8080/');
  await expect(page.getByTestId('count')).toContainText('0');
  await page.getByRole('button', { name: 'Increase' }).click();
  await expect(page.getByTestId('count')).toContainText('1');
  await page.getByRole('button', { name: 'Increase' }).click();
  await expect(page.getByTestId('count')).toContainText('2');
  await page.getByRole('button', { name: 'Increase' }).click();
  await expect(page.getByRole('img', { name: 'A sad raccoon' })).toBeVisible();
  await expect(page.getByRole('heading', { name: 'You clicked the button too' })).toBeVisible();
});

How does AI perform on this task?

Test a simple local app with AI-generated code

I reused the prompt from the previous section and included the app's HTML in the context.

You are an expert in TypeScript, Frontend development and Playwright end-to-end testing. 
You write concise, technical TypeScript code with accurate examples and the correct types. 

* Always use the recommended built-in and role-based locators (getByRole, getByLabel, etc.) 
* Prefer to use web-first assertions whenever possible 
* Use built-in config objects like devices whenever possible 
* Avoid hardcoded timeouts 
* Reuse Playwright locators by using variables 
* Follow guidance and best practices described on playwright.dev 
* Avoid commenting the code 

Here's an HTML page.

[all the HTML] 👈👈👈

Here's the task: 

Write a Playwright test that: 

* Navigates to localhost:8080
* Clicks the increase button
* Checks if the counter shows 1 
* Clicks the increase button two more times
* Checks if "You clicked the button too many times!" is visible on the page
* Checks if a raccoon image is visible on the page

And I must say, after providing the advanced prompt and including the HTML as context, ChatGPT shocked me.

// generated by ChatGPT
import { test, expect } from '@playwright/test';

test('should click the button and verify behavior', async ({ page }) => {
  // Navigate to the page
  await page.goto('http://localhost:8080');

  // Reusable locator for the button
  const increaseButton = page.getByRole('button', { name: 'Increase' });

  // Reusable locator for the count element
  const count = page.getByTestId('count');

  // Click the increase button
  await increaseButton.click();

  // Assert that the count shows 1
  await expect(count).toHaveText('1');

  // Click the increase button two more times
  await increaseButton.click();
  await increaseButton.click();

  // Assert that the "You clicked the button too many times!" text is visible
  await expect(page.getByRole('heading', { name: 'You clicked the button too many times!' })).toBeVisible();

  // Assert that the raccoon image is visible
  await expect(page.getByRole('img', { name: 'A sad raccoon' })).toBeVisible();
});

This was good code! ChatGPT (and Claude!) discovered test ids, used the recommended locators, analyzed HTML attributes, and kept the code well-structured. But would this test code work when we run it?
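
To find out, you run the generated specs like any other Playwright test. UI mode is handy here because it shows every action and assertion as it executes, and it's what the screenshot below shows.

npx playwright test --ui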

After running the AI-generated end-to-end tests, I almost fell off my chair.

Playwright UI showing two passing tests.

ChatGPT (link to conversation) and Claude (link to artifact) generated code that worked and was of higher quality than the Playwright Codegen output. They extracted variables, correctly parsed the HTML, and pulled alt attribute values into their locators.

Against my expectations, the LLMs generated good code and did it faster than I could with Playwright's tooling. I. Am. Mindblown!

Conclusion

What did this exercise tell me about AI and Playwright scripting then?

First, "Just use ChatGPT!" is a myth. Of course, it would be great to open a chat window, enter some tasks, and magically generate high-quality code. In reality, though, this doesn't work.

When talking to the LLM, you must provide as much instruction and context as possible. The LLMs know many outdated Playwright practices; you must tell the AI what methods to use and what code to avoid. A specialized prompt should be the absolute minimum. 

When you generate end-to-end tests, you must either inline or upload as much source code as possible to receive a working test. There's really no benefit in generating a test that leads you straight into a debugging session. No LLM in the world will reliably "guess" the correct actions or locators without context and access to source code.

But then there's the problem: pasting or uploading all the code into a chat window is neither practical nor convenient. There must be a better way!

AI editors like GitHub Copilot and Cursor talk to the LLMs with specialized prompts while providing code context from within your editor. Does this mean that you could open your code base, prompt some actions, and generate good Playwright code? I don't know yet and am still skeptical about LLMs handling real-world application code, but I'll research this topic next! 

If you want to follow along, subscribe to the Checkly YouTube channel or our RSS feed. The next "Playwright and AI" post will be published in 2-3 weeks!

And if you're using Playwright for end-to-end testing, you should check out synthetic monitoring! It lets you take your existing end-to-end tests and run them on a schedule from anywhere worldwide to get alerted when you have production issues. It's pretty cool, trust me. 😉
