Is ChatGPT or Claude better than Playwright Codegen?

I'm a bit of an AI skeptic. And even though GitHub Copilot is my daily auto-completion on steroids, I always double-check the code generated by LLMs. If you're using AI for coding, you probably know that the results are sometimes surprisingly good and other times shockingly terrible.

Lately, I have seen more and more articles and even docs recommending ChatGPT to generate Playwright tests.

Could this be true? Are ChatGPT and friends really that good at generating test code? And more importantly, are the new AI tools better than Playwright's code generation tools?

There's only one way to find out! 

I spent a few days playing with AI tools to understand what language models can and can't do. If you have the same questions, let's find some answers!

This article and video are part one of a new "Playwright and AI" series. If you want to catch the next one covering Playwright coding with GitHub Copilot and Cursor, subscribe to our YouTube Channel or RSS feed.

The baseline: Generate Playwright code with built-in Codegen

Playwright provides built-in tools to record browser actions. When you call the codegen command, it opens a browser window and the Playwright Inspector in recording mode. As you navigate a site, click around, and fill out forms, every action is recorded into a generated Playwright script.
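
In its simplest form, recording is a single command. Pass a URL to start on a specific page, and -o writes the generated test straight to a file (the path below is only an example); npx playwright codegen --help lists all the other options.

npx playwright codegen
npx playwright codegen -o tests/search-docs.spec.ts https://playwright.dev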

Copy the script, run it, and off you go with your new end-to-end test!

Playwright Codegen — Record your end-to-end tests with native Playwright tooling.

Like with all code generation tools, the codegen results are not perfect, but Playwright is good at creating code for itself to run.

Let's take two scenarios and find out if an LLM like ChatGPT or Claude can beat the native Playwright tooling.

Scenario one: search the Playwright docs

Here's our first end-to-end test action plan:

  • set the viewport to an iPhone
  • open playwright.dev
  • click on the search button
  • search for "locator"
  • click on the first search result
  • expect that the headline is visible
  • expect that the headline contains the word "Locators"

Transforming these actions and assertions into Playwright code should be straightforward regardless of the approach, right?

Search the docs with Playwright-generated code

To kick off Codegen, head to your terminal and run npx playwright codegen.

npx playwright codegen --device "iPhone 11" https://playwright.dev

The --device flag emulates an iPhone viewport, the URL makes Codegen navigate to playwright.dev automatically, and from there you generate your test by interacting with the page. Here's the result of me clicking around in the Playwright docs.

// generated by Playwright Codegen
import { test, expect, devices } from '@playwright/test';

test.use({
  ...devices['iPhone 11'],
});

test('test', async ({ page }) => {
  await page.goto('https://playwright.dev/');
  await page.getByLabel('Search').click();
  await page.getByPlaceholder('Search docs').click();
  await page.getByPlaceholder('Search docs').fill('locator');
  await page.getByRole('link', { name: 'Locators', exact: true }).click();
  await expect(page.getByRole('heading', { name: 'Locators', exact: true })).toBeVisible();
});

Generating a Playwright test that navigates the docs takes about twenty seconds with Codegen. And while the result isn't perfect (you could drop the extra click and extract some variables, as shown in the cleanup below), it follows Playwright's best practices: getBy locators and web-first assertions.

Recording a Playwright test requires almost no effort and results in decent code.
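
For reference, here is one possible cleanup of the recorded test, assuming you only want to drop the redundant click and reuse locators via variables; the behavior stays the same.

// possible cleanup of the Codegen output (redundant click removed, locators reused)
import { test, expect, devices } from '@playwright/test';

test.use({
  ...devices['iPhone 11'],
});

test('search the docs for "locator"', async ({ page }) => {
  // reusable locators for the search input and the target page
  const searchInput = page.getByPlaceholder('Search docs');
  const locatorsLink = page.getByRole('link', { name: 'Locators', exact: true });
  const locatorsHeading = page.getByRole('heading', { name: 'Locators', exact: true });

  await page.goto('https://playwright.dev/');
  await page.getByLabel('Search').click();
  // fill() focuses the input itself, so the extra click isn't needed
  await searchInput.fill('locator');
  await locatorsLink.click();
  await expect(locatorsHeading).toBeVisible();
});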

Let's see how AI performs!

Search the docs with AI-generated code

I went to ChatGPT and tried to perform the same task using the new canvas feature, which is aimed at coding tasks. Here's my first prompt:

Write a Playwright test using test and expect in TypeScript: 

- set viewport to an iPhone
- open playwright.dev
- click on the search button
- search for "locator"
- click on the first result
- expect that the headline is visible
- expect that the headline contains the word "Locator"

And here's the first result.

Generated Playwright code with many issues.

The initial ChatGPT response wasn't great: it hardcoded the iPhone device sizes, used deprecated methods, and didn't follow Playwright's best practices. After a bit of back and forth and telling ChatGPT what to do, I managed to get the LLM to generate decent Playwright test code.
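
To make those issues concrete, here is an illustrative reconstruction of the problematic patterns: a hardcoded viewport, deprecated page-level methods, and fixed timeouts. It's a sketch of the flavor of the first draft, not ChatGPT's verbatim output.

// illustrative sketch of the first-draft problems (reconstructed, not ChatGPT's verbatim output)
import { test } from '@playwright/test';

test('search the docs (first-draft style)', async ({ page }) => {
  // hardcoded iPhone dimensions instead of the built-in devices['iPhone 12'] descriptor
  await page.setViewportSize({ width: 390, height: 844 });
  await page.goto('https://playwright.dev');

  // deprecated page-level methods and raw selectors instead of getByRole / getByPlaceholder
  await page.click('text=Search');
  await page.type('input', 'locator');

  // hardcoded timeout instead of a web-first assertion that waits automatically
  await page.waitForTimeout(2000);
});

The version below is where the conversation eventually landed.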

// generated by ChatGPT
import { test, expect, devices } from '@playwright/test';

// Define a Playwright test using TypeScript
// This script sets the viewport, searches for "locator" on the playwright.dev site,
// and makes assertions to validate the content of the headline

test.use({ ...devices['iPhone 12'] });

test('Search for Locator on Playwright.dev with iPhone viewport', async ({ page }) => {
  // Go to the playwright.dev website
  await page.goto('https://playwright.dev');

  // Click on the search button (using role-based locator)
  await page.getByRole('button', { name: 'Search' }).click();

  // Enter the search term "locator" in the search field (using label-based locator)
  await page.getByPlaceholder('Search docs').fill('locator');

  // Wait for search results and click on the first result (using role-based locator)
  await page.getByRole('link').nth(0).click();

  // Get the headline of the first result page
  const headline = page.getByRole('heading', { level: 1 });

  // Assert that the headline is visible
  await expect(headline).toBeVisible();

  // Assert that the headline contains the word "Locator"
  await expect(headline).toContainText('Locator');
});

Here's the whole conversation if you're curious.

But did searching "locator" in the Playwright docs work with this generated script?

A failing test case shown in Playwright UI.

It didn't. When you ask ChatGPT or Claude to perform actions on a website, they're just making things up. They don't know what they're dealing with. The LLM isn't reaching out to the web to analyze any HTML; it's really just guessing Playwright code.

It puzzles me that people say, "Just use ChatGPT" for writing end-to-end tests, because it won't generate good code by default, and you'll almost always end up in a debugging session right from the start.
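
In practice, that means reaching for Playwright's debugging tools right away, for example stepping through the failing test with the Playwright Inspector:

npx playwright test --debug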

Using vanilla ChatGPT or Claude is not that great compared to Playwright's native Codegen.

Adjust the LLM prompt to get higher-quality code

But what can we do about the problem of ChatGPT spitting out poor code? 

I researched good coding prompts and discovered cursor.directory, a site that lists AI coding prompts for the Cursor editor. If you haven't heard of Cursor, it's a VS Code fork that enriches the editor with AI suggestions and workflows. Some people are pretty hyped about it, and I will look at it more closely soon.

Screenshot of cursor.directory showing various LLM prompts.

It turns out that a few small adjustments to the LLM prompt can improve the resulting code tremendously. All the community prompts include:

  • a role for the LLM to take - "You are an expert in [...]".
  • additional rules and direction - "You write concise, technical code".
  • as much context as possible.

So, I tweaked my Playwright LLM prompt.

// Adjusted Playwright-focused prompt
You are an expert in TypeScript, Frontend development, and Playwright end-to-end testing. 
You write concise, technical TypeScript code with accurate examples and the correct types. 

* Always use the recommended built-in and role-based locators (getByRole, getByLabel, etc.) 
* Prefer to use web-first assertions whenever possible 
* Use built-in config objects like devices whenever possible 
* Avoid hardcoded timeouts 
* Reuse Playwright locators by using variables 
* Follow the guidance and best practices described on playwright.dev 
* Avoid commenting the resulting code 

Here's the task: 

Write a Playwright test using test and expect in TypeScript:

* Set viewport to an iPhone
* Open playwright.dev
* Click on the search button
* Search for "locators"
* Click on the first result
* Expect that the headline is visible
* Expect that the headline contains the word "Locator"

And the resulting code was way better!

ChatGPT generating good Playwright code.

With an extended prompt, ChatGPT and Claude generated good Playwright code, and surprisingly, they sometimes even produced test code that worked for playwright.dev. This was unexpected, though searching the Playwright docs is probably a typical tutorial example.

But it became clear that whatever you do with an LLM, you should always use a customized prompt or a specialized GPT that includes advanced LLM instructions. Context, rules and instructions are essential for good results. If you "just use ChatGPT" to generate your end-to-end tests, you'll end up with poor and outdated code. 

Side note: the same rule applies to asking ChatGPT for Playwright help. Asking "How to click an element with Playwright?" can result in terrible advice, whereas an adjusted prompt ("You are an expert […] How to click an element with Playwright?") drastically increases the quality.
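
If you talk to a model through its API rather than a chat window, the same principle applies: send the rules as a system message with every request. Here's a minimal sketch using the official openai Node SDK; the model name, the helper function, and the trimmed-down rule set are examples, not the exact setup used for this article.

// minimal sketch: reuse the Playwright-focused rules as a system prompt (example setup)
import OpenAI from 'openai';

const openai = new OpenAI(); // expects OPENAI_API_KEY in the environment

const playwrightRules = `You are an expert in TypeScript, Frontend development, and Playwright end-to-end testing.
You write concise, technical TypeScript code with accurate examples and the correct types.
* Always use the recommended built-in and role-based locators (getByRole, getByLabel, etc.)
* Prefer web-first assertions and built-in config objects like devices
* Avoid hardcoded timeouts and avoid commenting the resulting code`;

async function generatePlaywrightTest(task: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o', // example model name
    messages: [
      { role: 'system', content: playwrightRules },
      { role: 'user', content: task },
    ],
  });
  return completion.choices[0].message.content ?? '';
}

// usage:
// const testCode = await generatePlaywrightTest('Write a Playwright test that searches playwright.dev for "locators".');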

But what if we give ChatGPT more context and include the HTML of the site we want to test in the prompt?

Scenario two: test a simple HTML app

I created a quick demo app to keep the complexity low and maybe even see the mighty AI and LLMs succeed.

Two screens showing a button to click and the message "You clicked the button too many times!"

"The app" consists of a single HTML page with some inline JavaScript. You can press a button, the clicks will be counted, and if you click the button three times, you'll be greeted with a sad raccoon. 

There's no magic, and this functionality should be pretty straightforward to test. 

  • Navigate to localhost:8080
  • Click the increase button
  • Check if the counter shows 1
  • Click the increase button two more times
  • Check if "You clicked the button too many times!" is visible on the page
  • Check if a raccoon image is visible on the page
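
One practical detail: the page needs to be served locally so that localhost:8080 responds. Any static file server will do; http-server is just one example.

npx http-server -p 8080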

Let's see which code-generation tool wins this battle!

Test a simple local app with Playwright-generated code

Recording the required actions for this example app again took me about 20 seconds with Playwright Codegen. The generated test code used the recommended locators, Playwright discovered the test id in the HTML, and, most importantly, the test succeeded.

The test code could have been a bit more DRY, but the generated code worked fine!

// generated by Playwright Codegen
test('test', async ({ page }) => {
  await page.goto('http://localhost:8080/');
  await expect(page.getByTestId('count')).toContainText('0');
  await page.getByRole('button', { name: 'Increase' }).click();
  await expect(page.getByTestId('count')).toContainText('1');
  await page.getByRole('button', { name: 'Increase' }).click();
  await expect(page.getByTestId('count')).toContainText('2');
  await page.getByRole('button', { name: 'Increase' }).click();
  await expect(page.getByRole('img', { name: 'A sad raccoon' })).toBeVisible();
  await expect(page.getByRole('heading', { name: 'You clicked the button too' })).toBeVisible();
});

How does AI perform on this task?

Test a simple local app with AI-generated code

I reused the prompt from the previous section and included the app's HTML in the context.

You are an expert in TypeScript, Frontend development and Playwright end-to-end testing. 
You write concise, technical TypeScript code with accurate examples and the correct types. 

* Always use the recommended built-in and role-based locators (getByRole, getByLabel, etc.) 
* Prefer to use web-first assertions whenever possible 
* Use built-in config objects like devices whenever possible 
* Avoid hardcoded timeouts 
* Reuse Playwright locators by using variables 
* Follow guidance and best practices described on playwright.dev 
* Avoid commenting the code 

Here's an HTML page.

[all the HTML] 👈👈👈

Here's the task: 

Write a Playwright test that: 

* Navigates to localhost:8080
* Clicks the increase button
* Checks if the counter shows 1 
* Clicks the increase button two more times
* Checks if "You clicked the button too many times!" is visible on the page
* Checks if a raccoon image is visible on the page

And I must say, after providing the advanced prompt and including the HTML as context, ChatGPT shocked me.

// generated by ChatGPT
import { test, expect } from '@playwright/test';

test('should click the button and verify behavior', async ({ page }) => {
  // Navigate to the page
  await page.goto('http://localhost:8080');

  // Reusable locator for the button
  const increaseButton = page.getByRole('button', { name: 'Increase' });

  // Reusable locator for the count element
  const count = page.getByTestId('count');

  // Click the increase button
  await increaseButton.click();

  // Assert that the count shows 1
  await expect(count).toHaveText('1');

  // Click the increase button two more times
  await increaseButton.click();
  await increaseButton.click();

  // Assert that the "You clicked the button too many times!" text is visible
  await expect(page.getByRole('heading', { name: 'You clicked the button too many times!' })).toBeVisible();

  // Assert that the raccoon image is visible
  await expect(page.getByRole('img', { name: 'A sad raccoon' })).toBeVisible();
});

This was good code! ChatGPT (and Claude!) discovered test ids, used the recommended locators, analyzed HTML attributes, and kept the code well-structured. But would this test code work when we run it?
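
To find out, you run the generated specs like any other Playwright test. UI mode is handy here because it shows every action and assertion as it executes, and it's what the screenshot below shows.

npx playwright test --ui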

After running the AI-generated end-to-end tests, I almost fell off my chair.

Playwright UI showing two passing tests.

ChatGPT (link to conversation) and Claude (link to artifact) generated code that worked and was of higher quality than the Playwright Codegen output. They extracted variables, correctly parsed the HTML, and pulled alt attribute values into their locators.

Against my expectations, the LLMs generated good code and did it faster than I could with Playwright's tooling. I. Am. Mindblown!

Conclusion

What did this exercise tell me about AI and Playwright scripting then?

First, "Just use ChatGPT!" is a myth. Of course, it would be great to open a chat window, enter some tasks, and magically generate high-quality code. In reality, though, this doesn't work.

When talking to the LLM, you must provide as much instruction and context as possible. The LLMs know many outdated Playwright practices; you must tell the AI what methods to use and what code to avoid. A specialized prompt should be the absolute minimum. 

When you generate end-to-end tests, you must either inline or upload as much source code as possible to receive a working test. There's really no benefit in generating a test that leads you straight into a debugging session. No LLM in the world will reliably "guess" the correct actions or locators without context and access to source code.

But then there's the problem: pasting or uploading all the code into a chat window is neither practical nor convenient. There must be a better way!

AI editors like GitHub Copilot and Cursor talk to the LLMs with specialized prompts while providing code context from within your editor. Does this mean that you could open your code base, prompt some actions, and generate good Playwright code? I don't know yet and am still skeptical about LLMs handling real-world application code, but I'll research this topic next! 

If you want to follow along, subscribe to the Checkly YouTube channel or our RSS feed. The next "Playwright and AI" post will be published in 2-3 weeks!

And if you're using Playwright for end-to-end testing, you should check out synthetic monitoring! It lets you take your existing end-to-end tests and run them on a schedule from anywhere worldwide to get alerted when you have production issues. It's pretty cool, trust me. 😉
