
Universal UI testing based on image and text recognition

The story of how I created Askaiser.Marionette, a universal test automation framework based on image and text recognition for .NET.

At every major release of the ShareGate Migration Tool (SGMT), part of the development team spends a few hours doing regression testing. These tests cannot be automated with UI testing tools such as Cypress, Playwright or Selenium, because SGMT is a desktop application built with WPF.

I started looking for UI testing solutions for desktop applications. I quickly discarded the expensive commercial UI testing suites. WinAppDriver was very interesting, but it required adding some kind of identifier to every UI element we wanted to interact with, which would have been a huge effort for the team.

I wanted a solution that does not require modifying our application. Something easy to learn, where tests could be written quickly and maintained at low cost as the product evolves. This is how I started working on Askaiser.Marionette, a universal UI testing library based on image and text recognition, made in C#.

Before going further, let me show you how I automated the happy path of one of SGMT’s key features: Microsoft Teams migration. It took me approximately one hour, and I even had the luxury of writing the test with reusable page object models. This is the result:

Screenshots, template matching and text recognition

You provide text or partial screenshots of the UI elements that Askaiser.Marionette should interact with. The library then periodically takes screenshots of the screen and tries to find these elements.
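To give a feel for the approach, here is a minimal sketch of what a test could look like. The type and method names (MarionetteDriver, TextElement, WaitForAsync, SingleClickAsync) are written from memory of the library’s samples, so treat them as assumptions and refer to the repository for the exact API:

```csharp
using System;
using Askaiser.Marionette;

// Assumption: the driver takes the screenshots and performs the
// mouse and keyboard interactions.
using var driver = await MarionetteDriver.CreateAsync();

// Wait until the text "Sign in" appears somewhere on the screen
// (text recognition), then click on it. "sign-in" is only a friendly
// name used in error messages.
var signIn = new TextElement("sign-in", "Sign in");
await driver.WaitForAsync(signIn, waitFor: TimeSpan.FromSeconds(10));
await driver.SingleClickAsync(signIn);
```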

To find these images within screenshots, I had to learn the basics of image processing. Locating an image inside another is done with a technique called template matching, and I used an OpenCV wrapper made for .NET. I will not go through the technical details here, but you can find the relevant source code in my GitHub repository.

Template matching with OpenCV
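As an illustration, here is a minimal template matching sketch written with OpenCvSharp, a popular .NET wrapper for OpenCV. The file names and the confidence threshold are placeholders, and this is a simplification of what the library actually does:

```csharp
using System;
using OpenCvSharp;

// Load the full screenshot and the smaller image (the template) to locate in it
using var screenshot = Cv2.ImRead("screenshot.png", ImreadModes.Grayscale);
using var template = Cv2.ImRead("save-button.png", ImreadModes.Grayscale);

// Slide the template over the screenshot and compute a similarity score
// for every possible position (normalized correlation coefficient)
using var result = new Mat();
Cv2.MatchTemplate(screenshot, template, result, TemplateMatchModes.CCoeffNormed);

// The best match is the position with the highest score
Cv2.MinMaxLoc(result, out _, out double bestScore, out _, out Point bestLocation);

if (bestScore >= 0.95) // confidence threshold, tune to your needs
{
    Console.WriteLine($"Found at ({bestLocation.X}, {bestLocation.Y}), score {bestScore:F2}");
}
```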

In order to implement text recognition, I needed another library: Tesseract, an open-source optical character recognition engine sponsored by Google, which also has a .NET wrapper. I had to use a few tricks to properly match arbitrary text with an engine that is best suited for digitized books, including a clever use of page segmentation modes. You will find how I used Tesseract in this C# class.
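Here is a minimal sketch using this wrapper. The sparse text page segmentation mode shown below is one way to match arbitrary words scattered across a screenshot; the tessdata path and file name are placeholders:

```csharp
using System;
using Tesseract;

// The engine needs trained language data (e.g. eng.traineddata) in ./tessdata
using var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default);
using var image = Pix.LoadFromFile("screenshot.png");

// SparseText searches for words in no particular order, which suits UI text
// better than the default mode designed for dense, book-like pages
using var page = engine.Process(image, PageSegMode.SparseText);

Console.WriteLine(page.GetText());
Console.WriteLine($"Mean confidence: {page.GetMeanConfidence():F2}");
```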

Both techniques require some pre-processing, such as rescaling, grayscaling, optional thresholding and binarization, or even negative filters.
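For illustration, each of these operations maps to a one-liner in OpenCvSharp; this is a sketch of the general idea, not the exact pipeline used by the library:

```csharp
using OpenCvSharp;

using var source = Cv2.ImRead("screenshot.png");

// Rescaling: upscaling often improves OCR accuracy on small UI text
using var scaled = new Mat();
Cv2.Resize(source, scaled, new Size(), 2.0, 2.0, InterpolationFlags.Cubic);

// Grayscaling
using var gray = new Mat();
Cv2.CvtColor(scaled, gray, ColorConversionCodes.BGR2GRAY);

// Binarization with Otsu's automatic thresholding
using var binary = new Mat();
Cv2.Threshold(gray, binary, 0, 255, ThresholdTypes.Binary | ThresholdTypes.Otsu);

// Negative filter, useful for light text on a dark background
using var negative = new Mat();
Cv2.BitwiseNot(binary, negative);
```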

Blazing fast development experience using C# source generators

After a few days, I was able to find images and text within screenshots. I also added mouse and keyboard interaction. Things were going well, I was quite happy, and I learned a lot. Then, I started thinking about the developer experience.

What do Cypress and other UI testing frameworks have in common? You must provide identifiers, XPath or CSS selectors. You spend time reading HTML and writing selectors that will eventually change. These selectors can be poorly written, and a small change in the web application can break everything, forcing you to start all over again.

Taking partial screenshots for Askaiser.Marionette was easy. The hard part was integrating these images into the test project. A complex application such as SGMT requires a lot of small images, and developers should focus on writing the tests, not on somehow converting the images for the library.

At that time, Microsoft introduced C# source generators, and I saw a huge opportunity. What if a developer could copy-paste images directly into a C# project and have classes and properties generated automatically, in a matter of milliseconds? These objects could then be interacted with using Askaiser.Marionette.

That’s exactly what I ended up with, and everything worked perfectly. You can see the unit tests of the source generator; they cover many scenarios. You write a partial class with a special attribute that specifies the path of the directory containing the images. Any existing, added or modified image is almost instantly converted into a property of that class. Directories become nested classes, allowing developers to logically organize the images and mirror that structure in the code.

Source images and a part of the test code

Generated code from the images and directories
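Conceptually, the workflow looks like the sketch below. The attribute name and the exact shape of the generated code are illustrative assumptions; see the repository for the real output:

```csharp
// What the developer writes: a partial class with an attribute pointing
// to the directory that contains the images (attribute name is an assumption).
[ImageLibrary("images")]
public partial class MyLibrary
{
}

// Roughly what the generator emits for a file images/menu/save-button.png:
// directories become nested classes and image files become properties.
public partial class MyLibrary
{
    public MyLibraryMenu Menu { get; } = new MyLibraryMenu();
}

public class MyLibraryMenu
{
    // The image bytes are embedded as base64, so no loose image files
    // are needed when the tests run.
    public ImageElement SaveButton { get; } = new ImageElement(
        name: "menu-save-button",
        content: "iVBORw0KGgoAAAANSUhEUg...",
        threshold: 0.95m,
        grayscale: false);
}
```

A nice side effect of this approach is that renaming or deleting an image changes the generated properties, so a stale reference becomes a compilation error instead of a runtime failure.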

A world of possibilities

When you think about it, Askaiser.Marionette can be used in many ways. You can interact with anything displayed on the screen: not just a single application, but the whole operating system. It can even be integrated into a CI build, since Azure Pipelines supports screen emulation and the library is able to detect the emulated monitor. Some might want to create a bot for a game. Others might even want to use it to test web applications!

After all, you only interact with what is displayed on the screen. A test might fail because a button became larger than expected. This is the kind of thing that would be harder to validate with a conventional UI testing framework.

So, give it a try, and take a look at the sample project! It only works on Windows, because screen capture, mouse and keyboard interaction are only implemented for that OS so far.