Advances in visual phishing detection

Tomas Trnka, 27 Nov 2018

Protect yourself from phishing scams with these best practices and our improved phishing detection AI.

Phishing is a long-standing social engineering technique used by cybercriminals to trick people into giving up sensitive information such as credit card details and login credentials. It can come in many forms, including telephone phishing, smishing (SMS phishing), phishing emails, and phishing websites.

Phishing links leading to malicious websites can be delivered in emails that appear to come from legitimate sources. They can also be embedded in messages sent through social networks and apps such as Facebook and WhatsApp, and they can even appear in search engine results.

Phishing websites can be hard to identify. Hoping to trick victims into giving up personal information, many sites look convincingly like the ones they are imitating. Phishing emails are typically less successful thanks to advances in classifying them as spam. Some phishing emails and links, however, still manage to slip into inboxes.

Phishing continues to be a leading attack method because it allows cybercriminals to target people at scale. Usually, bad actors distribute phishing scams under the pretense of being representatives from major companies where the intended targets have accounts.

At Avast, we use artificial intelligence (AI) to detect phishing scams and protect our users from malware and malicious sites.

Phony websites

To enhance our phishing detection capabilities, we had to follow a cybercriminal’s line of thought. The cybercriminal creates phishing websites that look very similar to legitimate sites in order to successfully deceive people. The visual similarities alone are often enough to fool unsuspecting users into entering their credentials and other sensitive data requested by the malicious site.

Theoretically, cybercriminals could use the exact same images for their phishing sites as those used on the sites they are imitating. However, owners of legitimate websites can detect when hoax websites link to images hosted on their servers.

Furthermore, accurately replicating websites requires time and effort. Cybercriminals would have to ensure that the phishing websites they design are properly coded down to every last pixel. This forces them to use creativity to assemble sites that look almost the same as the originals, but with minor differences that may not be noticeable to the average user.

Although our detection engines flag phishing sites based on HTML content, the more sophisticated methods cybercriminals use to build their phishing pages can bypass antivirus detections. Our AI-based approach covers two such techniques that might otherwise evade detection: pages that display images copied from the legitimate website instead of rendering the page as normal, and pages that use heavily obfuscated JavaScript.

Detecting phishing with AI

Avast has a network of hundreds of millions of sensors that feed our AI with data so that we can detect threats more quickly and protect our users better. To do this, we scan every website our users visit, taking a close look at the popularity of the domains hosting the websites. We also assess other factors, such as the website certificate, the age of the domain, and suspicious URL tokens, to determine whether a site should be processed further.

The lifespan of most phishing sites is very short, too short for search engines to index them. This is reflected in a domain’s rating. The popularity and history of a domain can also be initial indicators of whether a page is safe or malicious. By looking into this and comparing the site’s visual characteristics, we can decide whether the website is clean or malicious.
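The pre-filtering described above can be pictured as a simple scoring function. The following is only a minimal sketch under stated assumptions: the `SiteSignals` fields, the token list, the weights, and the threshold are all hypothetical and chosen for illustration; they are not the signals or values our production system uses.

```python
from dataclasses import dataclass

# Hypothetical token list, chosen only for illustration.
SUSPICIOUS_TOKENS = ("login", "signin", "verify", "account", "secure")


@dataclass
class SiteSignals:
    url: str
    popularity_rank: int         # assumed scale: 0 (unknown) to 10 (very popular)
    domain_age_days: int
    has_valid_certificate: bool


def needs_visual_check(site: SiteSignals) -> bool:
    """Return True when cheap signals suggest the page deserves a deeper look."""
    score = 0
    if site.popularity_rank <= 2:      # unpopular domain
        score += 2
    if site.domain_age_days < 30:      # very young domain
        score += 1
    if not site.has_valid_certificate:
        score += 1
    lowered = site.url.lower()
    if any(token in lowered for token in SUSPICIOUS_TOKENS):
        score += 1
    # Assumed threshold: three or more points triggers the visual comparison.
    return score >= 3


# Illustrative inputs only; the values are made up for the example.
suspect = SiteSignals("http://orange-login.example.test/signin", 0, 5, False)
established = SiteSignals("https://www.orange.fr/", 7, 9000, True)
print(needs_visual_check(suspect), needs_visual_check(established))
```

The point of such a filter is to keep the expensive visual comparison for the small share of pages whose cheap signals already look suspicious.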

Let’s take a look at a phishing site that our AI recently detected. It is imitating the login page of French telecommunications company Orange (Orange.fr).  

The phishing version:

[Image: the fake Orange.fr login page]

The actual Orange.fr login page:

[Image: the real Orange.fr login page]

At first glance they look very different. The malicious version uses an old design of the Orange site while the legitimate site uses a more modern and safer design, prompting the user to enter his or her password as a second step, rather than requesting the username and password on one page.

The image below shows the phishing site’s ranking, across several ranking engines, which can be used as a proxy to assess its popularity:

[Image: domain analysis for orangefrance.weebly.com, the phishing version of the website]

From this, we can see that the domain of the phishing website is very unpopular. On the other hand, the legitimate Orange.fr page is ranked 7/10. While the phishing site looks a lot like the former design of the Orange.fr site, it’s neither hosted on Orange.fr nor on any other popular domain. This information indicates that the phony site is potentially not safe, triggering a protocol to dig deeper.

The next step is to check the website’s design. One would think a simple pixel-by-pixel comparison between a fake site and a clean site would be sufficient. It’s not. We tested another approach using image hashes, a method of compressing rich image data into a smaller (but still expressive) space, such as a fixed-size vector of bytes with a simple distance metric. This approach allows the AI to treat two images as similar as long as the distance between their hashes does not exceed a certain threshold. This technique, however, was not as robust as expected.
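To make the image hash idea concrete, here is a minimal sketch of one well-known variant, the average hash, compared with the Hamming distance. It assumes both images have already been converted to grayscale and downscaled to the same small grid; the function names and the `max_distance` threshold are illustrative assumptions, not the hash or metric we tested.

```python
def average_hash(pixels):
    """Hash a small grayscale image (rows of 0-255 ints) into a tuple of bits.

    Each bit records whether a pixel is brighter than the image's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def looks_similar(pixels_a, pixels_b, max_distance=2):
    """Treat two images as similar when their hashes are within the threshold."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
# A uniformly brighter copy keeps the same hash; an inverted copy flips every bit.
brighter = [[value + 20 for value in row] for row in original]
inverted = [[255 - value for value in row] for row in original]
print(looks_similar(original, brighter), looks_similar(original, inverted))
```

A hash like this tolerates global changes such as brightness, but it summarizes the whole image at once, which is one plausible reason why small local edits to a page can push it past the threshold.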

So, we looked at other ways to check the website designs more accurately, and in the end we decided to use classic computer vision methods. These methods help convey the images to our AI by taking a very detailed look at particular pixels, as well as the pixels surrounding them. This is achieved with descriptors, which are vectors of numbers that describe the relative changes of the patch surrounding the pixel. Through this process, we can better understand the changing intensities in the grayscale picture, for example noticing if a gradient is present and how strong it is.
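As a rough illustration of what a descriptor can look like, the sketch below builds a histogram of gradient directions, weighted by gradient strength, for the patch around one pixel of a grayscale image. This is a simplified stand-in for classic descriptors, not the descriptor we use; the function name, patch radius, and number of bins are assumptions made for the example.

```python
import math


def gradient_descriptor(image, row, col, radius=2, bins=8):
    """Describe the patch around (row, col) by its gradient directions.

    `image` is a list of rows of grayscale intensities. The patch must stay
    at least one pixel away from the image border, because gradients are
    estimated with central differences.
    """
    histogram = [0.0] * bins
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            dx = image[r][c + 1] - image[r][c - 1]   # horizontal change
            dy = image[r + 1][c] - image[r - 1][c]   # vertical change
            magnitude = math.hypot(dx, dy)           # gradient strength
            if magnitude == 0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            index = int(angle / (2 * math.pi) * bins) % bins
            histogram[index] += magnitude
    total = sum(histogram)
    # Normalize so the descriptor does not depend on overall contrast.
    return [value / total for value in histogram] if total else histogram


# A horizontal ramp: intensity grows from left to right, so every gradient
# points in the same direction and lands in the first bin.
ramp = [[10 * c for c in range(7)] for _ in range(7)]
flat = [[50] * 7 for _ in range(7)]
print(gradient_descriptor(ramp, 3, 3))
print(gradient_descriptor(flat, 3, 3))
```

Two patches can then be compared by the distance between their descriptor vectors rather than by their raw pixel values.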

From now on, we will refer to the pixels chosen by our algorithm as interesting points. After receiving the image descriptors, we can compare the interesting points against a database of descriptors we maintain. But, as mentioned above, an image containing pixels that are similar to those in another is not sufficient to decide how closely an image resembles one from our target set. Therefore, we introduce another step called spatial verification, a technique used to compare the spatial relations of particular pixels in a picture.

Here is an example of a spatial configuration of pixels that is accepted:

[Image: accepted spatial verification example]

This is a rejected example:

[Image: rejected spatial verification example]

Spatial verification delivers sound results, but to mitigate possible false positives we have built in additional steps, such as the aforementioned image hashes.
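One simple way to picture spatial verification is to check whether the matched points all move by roughly the same offset between the two images. The sketch below does exactly that and nothing more: it assumes a translation-only model, whereas practical systems usually fit a richer transformation. The function name, tolerance, and inlier ratio are hypothetical.

```python
def spatially_consistent(matches, tolerance=3.0, min_inlier_ratio=0.7):
    """Check whether point matches agree on a single shared offset.

    `matches` is a list of ((x, y) in the candidate, (x, y) in the reference)
    pairs. Each match proposes an offset; the best proposal is the one that
    the largest number of matches agree with.
    """
    if len(matches) < 3:
        return False  # too few matches to say anything about geometry
    best_inliers = 0
    for (qx, qy), (tx, ty) in matches:
        dx, dy = tx - qx, ty - qy
        inliers = sum(
            1
            for (ax, ay), (bx, by) in matches
            if abs((bx - ax) - dx) <= tolerance and abs((by - ay) - dy) <= tolerance
        )
        best_inliers = max(best_inliers, inliers)
    return best_inliers / len(matches) >= min_inlier_ratio


# Matches that share roughly the same offset (about +100, +50) are accepted.
accepted = [((10, 10), (110, 60)), ((50, 20), (150, 70)),
            ((30, 80), (131, 129)), ((70, 40), (170, 90))]
# Matches that point in unrelated directions are rejected.
rejected = [((10, 10), (200, 5)), ((50, 20), (20, 190)),
            ((30, 80), (150, 70)), ((70, 40), (60, 10))]
print(spatially_consistent(accepted), spatially_consistent(rejected))
```

In the rejected case each individual point may still have a similar-looking counterpart, but the points do not keep their positions relative to one another, which is the signal this step looks for.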

A common problem with detecting interesting points in a picture occurs when an image contains text. There are a lot of gradients in text and letters which, by design, create a lot of edges. When an image contains a lot of letters, there are many interesting points in a small area, which can result in false positives, despite spatial verification.

For this reason, we created software that is able to classify patches within images and decide whether a patch contains text. When a patch does contain text, our AI avoids using the points from that patch in the matching process.
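The filtering step itself can be sketched in a few lines. The example assumes that a separate classifier (not shown) has already produced bounding boxes for the patches it considers to be text; the function name and the box format are assumptions made for illustration.

```python
def drop_points_in_text(points, text_boxes):
    """Keep only the interesting points that fall outside every text patch.

    `points` are (x, y) tuples; `text_boxes` are (left, top, right, bottom)
    rectangles that a text classifier is assumed to have produced.
    """
    def inside(point, box):
        x, y = point
        left, top, right, bottom = box
        return left <= x <= right and top <= y <= bottom

    return [
        point for point in points
        if not any(inside(point, box) for box in text_boxes)
    ]


points = [(5, 5), (20, 12), (22, 14), (60, 40)]
text_boxes = [(15, 10, 30, 20)]   # hypothetical patch classified as text
print(drop_points_in_text(points, text_boxes))
```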

This entire procedure is automatic and 99 percent of the time will recognize a phishing website in less than ten seconds, which in turn enables us to block that malicious site for our connected users.

Phishing sites revealed

Modern phishing sites are extremely deceptive. Cybercriminals put a lot of effort into making them look like the real thing. In the examples below, you can see how similar a phishing site looks compared to its authentic counterpart.

[Image: Google phishing login screen vs. the real login page]

In the Google example above, we can spot small details that differentiate the phishing site from the actual login page. The phishing version doesn’t include the logos of Google’s apps. It also uses different colors for the user’s account avatar and offers slightly different options in the grey login box.

Here’s another example:

[Image: Apple ID login page, phishing vs. the real login]

The fake Apple login site uses slightly different icons. It also uses a different typeface than the official page. These differences are rather subtle, so in order to spot them users have to know what they are looking for.

Phishing sites have greatly evolved over the years to become convincing counterfeits. Some even use HTTPS, giving users a false sense of security when they see the green padlock.

The minor flaws in a phishing website might appear obvious when positioned alongside a legitimate page, but they are far less noticeable on their own. Think about the last time you saw the login page for a service you frequently use. Chances are you’ll struggle to recall all the details, which is exactly what phishing scammers are hoping for when they design their pages.

How is the threat spread?

Historically, the most common way to spread phishing websites has been via phishing emails, but they are also spread via paid advertisements that appear in search results. Other attack vectors include a technique called clickbait. Cybercriminals typically use clickbait on social media by promising something, such as a free phone, to encourage users to click on malicious links.

What happens when a phisher makes a catch?

Like nearly all cyberattacks, phishing is used for financial gain. When users give up login credentials to a phishing site, cybercriminals can abuse them in a number of different ways, depending on the type of site used to phish. Many phishing attacks imitate financial institutions such as banks or companies like PayPal, targets that can yield significant financial rewards for cybercriminals.

If a cybercriminal tricks a user into giving up their credentials to a shipping website, such as UPS or FedEx, they are unlikely to profit from accessing the account. Instead, they may try to use the same credentials to access other accounts with more valuable information, such as an email account, knowing that people often use the same passwords across multiple services. Another way for the cybercriminal to profit would be to sell the stolen credentials on the dark web.

Phishing is a “spray and pray” attack mechanism. There are many outdated WordPress sites on the web that can be hacked and used for phishing campaigns at a very low cost. Generally, the price to deploy a phishing kit is roughly $26.

How to protect yourself

The time between a successful phishing attack and the moment the cybercriminal uses the stolen credentials varies. The quicker we are able to mitigate the threat, the more potential victims we are able to protect. Once a user’s credentials are stolen, there is not much they can do other than change those credentials as soon as possible.

So far in 2018, we’ve seen malicious emails sent from compromised MailChimp accounts, sextortion scams, and GDPR-related phishing campaigns, among many others. Moving forward, we can expect to see phishing attacks increase in volume and new techniques emerge to camouflage cybercriminals’ efforts to steal sensitive user data.

Below is a short checklist that, if followed, will help to prevent you from falling victim to one of the most successful forms of cyberattack:

  • First and foremost, install an antivirus solution on all of your devices, whether PC, mobile, or Mac. Antivirus software acts as a safety net, protecting online users.

  • Do not click on links or download files from suspicious emails. Avoid replying to them, as well, even if they allegedly came from someone you trust. Instead, contact those entities through a separate channel and ensure that the message actually came from them.

  • Directly enter a website’s URL into your browser whenever possible, so you end up visiting the site you want to visit, rather than a phony version.

  • Do not solely rely on the green HTTPS padlock. While this signifies that the connection is encrypted, the site could still be fake. Cybercriminals encrypt their phishing sites to further deceive users, so it’s important to double check that the site you are visiting is the real deal.
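For readers who want to automate part of the "is this the real site?" check from the list above, the sketch below compares a URL's hostname with the domain you expect. It is only an illustration: the function name is hypothetical, the URLs are examples, and a matching hostname does not prove a page is safe.

```python
from urllib.parse import urlsplit


def is_on_expected_domain(url, expected_domain):
    """Return True when the URL's host is the expected domain or a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    expected = expected_domain.lower()
    return host == expected or host.endswith("." + expected)


# The brand name appearing somewhere in the URL is not enough;
# only the actual hostname counts.
print(is_on_expected_domain("https://www.orange.fr/", "orange.fr"))
print(is_on_expected_domain("https://orangefrance.weebly.com/", "orange.fr"))
print(is_on_expected_domain("https://orange.fr.example.com/login", "orange.fr"))
```

Comparing against the end of the hostname, with a leading dot, is what prevents lookalikes that merely contain the brand name from passing the check.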


Avast is a global leader in cybersecurity, protecting hundreds of millions of users around the world. Protect all of your devices with award-winning free antivirus. Safeguard your privacy and encrypt your online connection with SecureLine VPN.

Learn more about products that protect your digital life at avast.com. And get all the latest news on today's cyberthreats and how to beat them at blog.avast.com.
