How good is GitHub Copilot at generating Playwright code?


People keep asking us here at Checkly if and how AI can help create solid and maintainable Playwright tests. To answer these questions, we started by looking at ChatGPT and Claude and concluded that AI tools have the potential to help with test generation, but that "normal AI consumer tools" aren't code-focused enough. High-quality results require prompts that are too complex to be a maintainable solution.

In this post, we'll look at AI-assisted coding with GitHub Copilot to see if it moves the needle for Playwright test generation. Ready? Let's go!

If you want to learn more about Copilot for Playwright scripting, this article is also available as a video on YouTube.

Requirements of a good AI prompt to generate Playwright end-to-end tests

In our previous post, we learned that to generate high-quality code with an LLM, you must understand the end-to-end testing space, use specialized prompts, and, most importantly, evaluate the generated code. 

Creating a well-running end-to-end test suite based purely on a chat conversation would be nice, but I don't think we're there yet.

After much testing and prompting, I learned that a good LLM coding prompt consists of multiple building blocks, regardless of the tool.

Example of a good LLM prompt to generate Playwright tests.

First, you must get familiar with role prompting and set clear code generation boundaries. With a solid foundation and baseline instructions, the quality of the generated testing code will improve. Always remember that Copilot and friends will only be as good as the prompt you provide.

For Playwright scripting, I landed on a "you are an expert in TypeScript and Playwright end-to-end testing" role paired with clearly defined guardrails enforcing Playwright best practices. With these instructions in place, the generated code improved tremendously!

But even then, LLMs still need application code and as much context as possible to know what you're trying to test. Without it, AI tools can only guess how your application works, and you're immediately inviting hallucinations into your project. Inlining the source code into the prompt is another essential building block for AI code generation.

After adding detailed instructions, inlining additional source code, and crafting a verbose prompt, the results were promising, but this approach wasn't maintainable.

Let's switch to GitHub Copilot to see if a code-focused AI solution improves our Playwright workflows.

Advanced end-to-end test generation with GitHub Copilot

At first glance, Copilot doesn't look much different from ChatGPT or Claude. You can open Copilot, ask questions, and generate code in a chat conversation. It's a wrapper around the existing AI APIs, after all.

Copilot chat in VS Code.

But there are two significant advantages when using Copilot for coding:

  1. Copilot Chat can embed source code into your prompts.
  2. Copilot Chat allows you to provide project-specific roles and instructions.

Let's see how these two features help with Playwright scripting.

Automatically apply source code and files as context 

As seen earlier, providing application code was the biggest hurdle when using LLMs to generate good Playwright code. Leaving your favorite editor to copy and paste source code into another chat application is anything but a productivity boost.

Copilot Chat helps out here by automatically applying the context of the currently open file to the chat conversation.

File context in Copilot Chat.

But, of course, to generate test code, you'll need to apply more context than the current file to your prompt. Copilot Chat allows you to provide additional context by dragging files into the conversation or using the #file: shortcut.

Two files added to the Copilot Chat context.

After providing additional file context, Copilot becomes pretty knowledgeable and analyzes the provided source code. You can even pass in your entire project, and while the results aren't perfect, if you provide enough source code and context, Copilot translates it into Playwright instructions surprisingly well.

But even after providing source code, the problem remains that without guiding Copilot towards good Playwright code, it will often use outdated and sometimes even deprecated methods to test your applications. Don't expect an LLM to know about good code and quality standards.

Low quality generated Playwright code.
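
To make this concrete, here's a hand-written illustration (not actual Copilot output; the file name, URL, and element names are made up) of the pattern you'll often see: CSS selectors, hardcoded timeouts, and manual visibility checks instead of role-based locators and web-first assertions.

tests/navigation.spec.ts
import { test, expect } from '@playwright/test';

// The kind of outdated code an unguided LLM tends to produce:
// brittle CSS selectors, hardcoded timeouts, non-retrying assertions
test('opens the navigation (outdated style)', async ({ page }) => {
  await page.goto('https://example.com'); // illustrative URL
  await page.waitForTimeout(2000);        // hardcoded timeout
  await page.click('.nav-toggle');        // brittle CSS selector
  expect(await page.isVisible('.nav-menu')).toBe(true); // assertion doesn't auto-retry
});

// What we actually want: role-based locators and web-first assertions
test('opens the navigation (best practice)', async ({ page }) => {
  await page.goto('https://example.com'); // illustrative URL
  await page.getByRole('button', { name: 'Open navigation' }).click();
  await expect(page.getByRole('navigation')).toBeVisible();
});

The second test is what we want Copilot to produce, and that's exactly where custom instructions come in.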

Automatically apply custom prompt instructions to GitHub Copilot

Of course, you could extend your Copilot prompt and add the role prompt mentioned earlier ("You are an expert…") to the chat conversation, but there's a better way. The fine folks at GitHub extended the Copilot functionality with new preview features.

Let's enter the bleeding edge of Copilot!

When you inspect your VS Code Copilot settings, you'll discover the new "custom code generation instructions" settings. These new features allow you to define role prompts in your repository and automatically apply them whenever you have a quick chat with Copilot.

VS Code copilot setting to add custom instructions to your LLM prompts.

You can either refine your prompt in a .github/copilot-instructions.md file (a preview feature at the time of writing) or go with the experimental feature that allows you to set instructions in JSON or custom file locations.

If you now define the custom instructions…

.github/copilot-instructions.md
You are an expert in TypeScript, Frontend development, and Playwright end-to-end testing.
You write concise, technical TypeScript code with accurate examples and the correct types.

- Always use the recommended built-in and role-based locators (getByRole, getByLabel, etc.)
- Prefer to use web-first assertions whenever possible
- Use built-in config objects like devices whenever possible
- Avoid hardcoded timeouts
- Reuse Playwright locators by using variables
- Follow the guidance and best practices described on playwright.dev
- Avoid commenting the resulting code

… every chat conversation will automatically consider the custom role prompt...

Copilot Chat session with three added files to the context.

… and the quality of the generated code will be much higher. Copilot just became a Playwright expert, and everyone on your team can reuse the same AI configuration to avoid generating poor Playwright code. Win-win!

By relying on the new Copilot features, you can extend your prompts, quickly provide the application code, and rely on best practices.
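
For a rough idea of the style these guardrails steer Copilot toward (again a hand-written sketch rather than verbatim Copilot output; the file name, page, labels, and URL are made up), think reusable role-based locators, built-in device presets, and web-first assertions:

tests/newsletter.spec.ts
import { test, expect, devices } from '@playwright/test';

// Built-in device preset instead of hand-rolled viewport numbers
test.use({ ...devices['iPhone 13'] });

test('newsletter signup shows a confirmation', async ({ page }) => {
  await page.goto('https://example.com'); // illustrative URL

  // Locators defined once and reused
  const emailInput = page.getByLabel('Email address');
  const subscribeButton = page.getByRole('button', { name: 'Subscribe' });

  await emailInput.fill('jane@example.com');
  await subscribeButton.click();

  // Web-first assertion, no hardcoded timeouts
  await expect(page.getByRole('status')).toContainText('Thanks for subscribing');
});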

However, can AI then take over and write our end-to-end tests for us?

Can AI and GitHub Copilot generate end-to-end tests?

After testing multiple projects, prompts, and configurations, I must say that "just using AI" to generate end-to-end tests is an intriguing myth.

AI excels at generating code for simple applications

LLMs do surprisingly well for simple applications. Today's AI tools are great at parsing vanilla JavaScript and transforming simple functionality into code. But when I tried to generate tests for our marketing site here at checklyhq.com, Copilot sometimes got confused by the Next.js / React code, forcing me into an AI conversation of death.

In these cases, starting from scratch was the only way to get Copilot back on track. When AI code generation works, it works shockingly well. But you'll waste a lot of time and energy when it doesn't.

The generated code will only be as good as your prompt

Let me be frank: I'm okay with being surprised when AI code works and can live with the occasional AI back and forth to tell robots what to do. The results can be excellent if you provide enough source code, define a role, and set guardrails. But there's a big "but"…

Whether you're generating end-to-end tests or application code, the effectiveness of AI code generation heavily relies on the quality of your prompts. Detailed, well-thought-out instructions are key to producing good code. You cannot expect a language model to create production-ready code without clearly defining your requirements.

In the context of Playwright scripting, it's best to be very explicit when generating end-to-end tests. Most of the time, crafting a good prompt takes me longer than quickly coding the instructions myself. "Click the 'Open navigation' button" isn't that different from page.getByRole('button', { name: 'Open navigation' }).click(), after all.

And let me be the Playwright Codegen fan that I am: generating a Playwright test with npx playwright codegen takes seconds and has a very high success rate. When you start with Playwright's Codegen, your locators, actions, and assertions will follow best practices and usually just work.

Playwright Codegen is often better and quicker than AI Codegen.
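
If you haven't used the recorder in a while: it's a single command, and a short recorded flow looks roughly like this (recreated by hand with an example URL, so treat it as an approximation of the output rather than verbatim Codegen output):

// Start the recorder against your app (example URL)
//   npx playwright codegen https://example.com
import { test, expect } from '@playwright/test';

test('open navigation and visit the docs', async ({ page }) => {
  await page.goto('https://example.com/');
  // Codegen picks role-based locators out of the box
  await page.getByRole('button', { name: 'Open navigation' }).click();
  await page.getByRole('link', { name: 'Docs' }).click();
  await expect(page.getByRole('heading', { name: 'Documentation' })).toBeVisible();
});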

You still must be the expert and know what you're doing

You can't expect an LLM to generate end-to-end test code for you. Regardless of what you're building, AI-generated code cannot be trusted, and you should always double-check that it matches your quality standards.

This additional quality check is essential for end-to-end testing. Generating a "working" test case with Copilot might be quick, but what if the generated test is a false positive that always passes?

Then, your test suite or synthetic monitoring setup will be worth nothing because you'll miss regressions, bugs, or outages. I can't stress it enough: regardless of whether you, another human, or a machine creates your end-to-end tests, you must always ensure they'll fail when they're supposed to. Otherwise, your Playwright tests will be useless.
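
As a hand-written illustration (not something Copilot produced for me; the URL and element names are made up), here's how subtly a test can turn into a false positive, and what the trustworthy version looks like:

import { test, expect } from '@playwright/test';

// ❌ This test can never fail: the assertion error is swallowed by the catch block
test('navigation opens (false positive)', async ({ page }) => {
  await page.goto('https://example.com'); // illustrative URL
  await page.getByRole('button', { name: 'Open navigation' }).click();
  try {
    await expect(page.getByRole('navigation')).toBeVisible();
  } catch {
    // "it's just flaky", and with that, the check is worthless
  }
});

// ✅ The same scenario, but it actually fails when the navigation doesn't show up
test('navigation opens', async ({ page }) => {
  await page.goto('https://example.com'); // illustrative URL
  await page.getByRole('button', { name: 'Open navigation' }).click();
  await expect(page.getByRole('navigation')).toBeVisible();
});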

Conclusion

The previous paragraphs might make it sound like I'm an AI skeptic, but that's not the case. The new tools are a massive performance boost if you keep your AI expectations in check. AI still isn't a silver bullet for everything.

  • Do you want to refactor some spaghetti Playwright code into a page object model? No problem, Copilot does this wonderfully (see the sketch after this list)!
  • Do you want to generate some test data? Easy-peasy, Copilot will hand it to you in a matter of seconds!
  • Do you want to generate a high-quality and complete end-to-end test suite? I don't think we're there yet. 😅
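
To make the first point concrete, here's the kind of page object Copilot is great at extracting from repetitive test code (a minimal, hand-written sketch; the file, class, and locator names are illustrative):

pages/navigation.page.ts
import { type Locator, type Page, expect } from '@playwright/test';

export class NavigationPage {
  readonly toggleButton: Locator;
  readonly menu: Locator;

  constructor(page: Page) {
    // Role-based locators, defined once and reused by every test
    this.toggleButton = page.getByRole('button', { name: 'Open navigation' });
    this.menu = page.getByRole('navigation');
  }

  async open() {
    await this.toggleButton.click();
    await expect(this.menu).toBeVisible();
  }
}

In a spec, you'd then write const nav = new NavigationPage(page); await nav.open(); and the duplicated locators live in a single place.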

However, after using Copilot to generate Playwright tests, I started to wonder if passing application code to LLMs was the correct approach in the first place. Playwright testing is all about ignoring implementation details. You define your initial UI state, perform actions, and expect a new UI state; that's it.

Most of the time, testing implementation details is a code smell and can be considered an anti-pattern. Does the same apply to AI Playwright code generation?

I don't know yet, but new Playwright tools monitor the DOM and feed HTML snapshots to the LLMs to generate locators and actions. Is this a better approach than prompting source code? We'll find out in the next blog post! 

If you want to follow this Playwright/AI journey, I'll also publish new Playwright/AI content on the Checkly YouTube channel, and we will announce it in our Slack community. Come and say hi; I'll see you there!
