The State of the Web: Technology Adoption and Security Issues in the Internet’s Top 1 Million Sites | Wave 12 | Project Resonance

At RedHunt Labs, we regularly perform various internet-wide studies as a part of Project Resonance, to keep up with ever-changing cyberspace and enrich our Attack Surface Management product, NVADR. This blog post is about our recent study in which we analysed the Tranco Top 1 Million websites, resulting in interesting insights. 

DISCLAIMER

No attacks or exploitation were conducted during our analysis. We solely crawled the websites and analyzed their content, along with examining the response headers of the websites. Our intention was purely for research and analysis purposes to better understand the landscape of internet security and compliance. At no point did we engage in any form of malicious activity or attempt to exploit vulnerabilities. Our goal is to contribute to the enhancement of internet security through responsible observation and analysis.

In the vast world of the internet, there are millions of websites, each with their unique identity and purpose. But what lies behind these websites? What technologies and tools power them, and what potential vulnerabilities might exist? 

To understand this better, we’ve decided to take a closer look at the Tranco (https://tranco-list.eu/) Top 1 Million websites on the internet. We have carefully examined and analyzed these websites to uncover the technologies and tools used to build and operate them.

Beyond just observing the surface-level features, we’ll also look into the security of these websites. We have examined their configurations and headers, looking for common issues that can pose risks to the security and privacy of the data being exchanged.

Having just returned from RSA, we observed a plethora of defensive solutions available in the market. This raises an important question: with so many solutions out there, how secure is the internet today? To answer this, we’ll analyze the current state of internet security to see how effectively these solutions address the security issues they claim to solve.

To scan the top 1 million websites on the internet, we started this initiative by obtaining the list of domains from Tranco, as the Alexa ranking has been deprecated and is no longer available. Once we had the list, we divided it into 20 smaller parts of equal length.

Further, we used our own in-house tool for scanning these websites. Our tool first checks if the domain is live on either port 443 (HTTPS) or port 80 (HTTP) and then scans for the technologies being used on the domain. We also looked for common misconfigurations such as weak SSL ciphers, lack of cookie control, absence of HTTP Strict Transport Security (HSTS) enforcement, etc.
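
As a rough sketch of the liveness check described above (our in-house tool is not public, so the function names here are illustrative), the probe tries HTTPS first and falls back to plain HTTP:

```python
# Illustrative liveness probe: HTTPS (443) first, then HTTP (80).
import urllib.request
import urllib.error

def candidate_urls(domain):
    """Return probe URLs in the order we try them: HTTPS first, then HTTP."""
    return [f"https://{domain}/", f"http://{domain}/"]

def probe(domain, timeout=10):
    """Return (url, status) for the first URL that answers, or None."""
    for url in candidate_urls(domain):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return url, resp.status
        except (urllib.error.URLError, OSError):
            continue
    return None
```

A domain is only handed to the technology and misconfiguration checks once this probe succeeds.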

To efficiently carry out this large-scale scanning operation, we used Terraform to deploy our tool and the divided domain lists onto 20 separate machines on the Google Cloud Platform (GCP). Each machine had a cronjob set up to push the scanning results to our Elastic DB every 30 minutes.
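
The periodic push can be sketched as follows. This is an illustrative snippet, not our production pipeline; the index name and record fields are assumptions. Each scan record becomes an action/document pair in Elasticsearch’s NDJSON `_bulk` format:

```python
# Build an Elasticsearch _bulk request body from scan records.
# Index name and record schema are illustrative assumptions.
import json

def to_bulk_ndjson(records, index="web-scan-wave12"):
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(rec))                           # document line
    return "\n".join(lines) + "\n"  # the _bulk API requires a trailing newline
```

The resulting body is POSTed to the cluster’s `_bulk` endpoint with a `Content-Type: application/x-ndjson` header.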

The entire process took around a day to finish, from scanning the websites to sending the data to ElasticSearch using carefully crafted plugins and config files.

After scanning the top 1 million websites on the internet and storing the results in Elasticsearch, we made several observations, some of which are quite concerning. Many issues persist that could be easily resolved, yet they continue to exist.

We’ll begin our observations by examining sites running on port 443 versus those on port 80.

As we mentioned earlier, our tool checks if websites are live on either port 443 (HTTPS) or port 80 (HTTP). The scan results showed that 7.56% of websites were still running on port 80.

Running a website on port 80 isn’t necessarily problematic, as long as the site isn’t handling sensitive activities like logins or payment transactions over an unencrypted connection. Upon analyzing the status codes of all sites running on port 80, we found that approximately 85.53% of those sites were actually serving content, which amounts to roughly 55,000 sites.

We also examined the issuer names from SSL certificates on websites running HTTPS and found that 31.37% of certificates were issued by R3, an intermediate CA operated by Let’s Encrypt. When we also count certificates from E1 (another Let’s Encrypt intermediate), it becomes apparent that Let’s Encrypt issues nearly half of the certificates on the internet. This makes sense, given that Let’s Encrypt is a nonprofit, free, automated, and open certificate authority.
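
The issuer name comes straight from the TLS handshake. A minimal sketch using Python’s standard ssl module (the helper names are ours):

```python
# Extract the certificate issuer from a live TLS handshake.
import ssl
import socket

def issuer_fields(cert):
    """Flatten the 'issuer' RDN tuples from ssl.getpeercert() into a dict."""
    return {k: v for rdn in cert.get("issuer", ()) for (k, v) in rdn}

def fetch_issuer_cn(host, port=443, timeout=10):
    """Connect and return the issuer commonName of the served certificate."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return issuer_fields(tls.getpeercert()).get("commonName")
```

For a site using a current Let’s Encrypt certificate, the returned commonName would typically be "R3" or "E1".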

Our in-house tool can also detect the technologies used to build websites, similar to Wappalyzer and BuiltWith. Our findings yielded some intriguing insights. For instance, 3.31% of sites were found to be running WordPress, while 1.75% were detected as running PHP. It seems the notion that PHP is dead does not hold. Moreover, a notable 6.31% of sites were running HTTP/3, which is commendable given its performance and security improvements over HTTP/2.

Additionally, we also checked for common misconfigurations. Here are the top 20 issues we found on these websites.

Issue | Count of records
Server supports weak SSL ciphers | 765,950
Cookie control not implemented | 721,773
HTTP Strict Transport Security (HSTS) not enforced | 618,224
Secure cookies not used | 220,067
X-Powered-By header exposed | 182,471
TLS v1.3 not supported | 168,438
Server information exposed via header | 146,197
HSTS policy doesn’t include subdomains | 126,273
No base-uri directive specified in CSP | 105,013
No object-src directive specified in CSP | 97,551
No script-src directive specified in CSP | 86,861
No default-src directive specified as fallback mechanism | 84,372
ACAO header allows access from any origin | 40,381
Invalid SSL certificate | 33,764
Usage of ‘unsafe-inline’ in CSP policy | 33,409
Usage of ‘unsafe-eval’ in CSP policy | 27,811
X-XSS-Protection does not enable ‘block’ mode | 18,939
ASP.NET version header exposing specific ASP.NET version | 14,107
SSL certificate is expired | 12,974
SSL certificate will expire soon | 12,601
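
To illustrate how checks like these work, here is a simplified recreation of a few of them. It operates on a plain dict of HTTP response headers and is a sketch, not our actual scanner:

```python
# Simplified header misconfiguration checks mirroring a few table rows.
def header_issues(headers):
    h = {k.lower(): v for k, v in headers.items()}
    issues = []
    if "strict-transport-security" not in h:
        issues.append("HSTS not enforced")
    if "x-powered-by" in h:
        issues.append("X-Powered-By header exposed")
    if "server" in h and any(c.isdigit() for c in h["server"]):
        # Digits in the Server value usually mean a version is exposed.
        issues.append("Server information exposed via header")
    if h.get("access-control-allow-origin") == "*":
        issues.append("ACAO header allows access from any origin")
    csp = h.get("content-security-policy", "")
    # Only audit directives when a CSP is actually present.
    for d in ("base-uri", "object-src", "script-src"):
        if csp and d not in csp:
            issues.append(f"No {d} directive is specified in CSP")
    return issues
```

The real checks are more nuanced (e.g., parsing directive values rather than substring matching), but the shape is the same: inspect headers, emit findings.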

One of the most concerning findings from our analysis is the prevalence of invalid and expired SSL certificates:

  • Invalid Certificates: Found on 33,764 websites, invalid certificates pose significant risks as they can undermine trust and potentially allow for man-in-the-middle attacks.
  • Expired Certificates: Even more alarming, 12,974 websites had expired SSL certificates, and a further 12,601 were due to expire within the next 20 days. An expired certificate triggers browser warnings and voids the trust guarantees of TLS, conditioning users to click through errors and leaving connections exposed to interception by attackers.
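
Both expiry conditions can be derived from the notAfter field that ssl.getpeercert() returns. A hedged sketch, using the 20-day window from above:

```python
# Classify a certificate by its 'notAfter' timestamp, e.g. 'Jun  1 12:00:00 2025 GMT'.
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    now = now or datetime.now(timezone.utc).replace(tzinfo=None)
    return (expiry - now).days

def classify(not_after, now=None, soon=20):
    days = days_until_expiry(not_after, now)
    if days < 0:
        return "expired"
    if days <= soon:
        return "expiring soon"
    return "ok"
```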

Additionally, 168,438 sites did not support TLS 1.3, the latest version of the Transport Layer Security protocol, which offers enhanced security features. This lack of support indicates that many websites are not leveraging the most robust security measures available.
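
One way to test TLS 1.3 support is to offer nothing older during the handshake; if the server cannot negotiate 1.3, the connection simply fails. A sketch using Python’s ssl module:

```python
# Probe TLS 1.3 support by refusing to negotiate anything older.
import ssl
import socket

def supports_tls13(host, port=443, timeout=10):
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # refuse TLS 1.2 and below
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version() == "TLSv1.3"
    except (ssl.SSLError, OSError):
        return False
```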

Furthermore, 765,950 websites were discovered to be employing weak SSL ciphers, including:

  • TLS_RSA_WITH_RC4_128_SHA
  • TLS_RSA_WITH_3DES_EDE_CBC_SHA
  • TLS_RSA_WITH_AES_128_CBC_SHA256
  • TLS_ECDHE_ECDSA_WITH_RC4_128_SHA
  • TLS_ECDHE_RSA_WITH_RC4_128_SHA
  • TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA
  • TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
  • TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256

These weak ciphers leave connections to these websites vulnerable to interception and tampering by potential attackers who can exploit these vulnerabilities. Addressing these issues is crucial to enhancing the overall security of the internet.
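
A simple heuristic that mirrors the list above is to flag suites using RC4, 3DES, or non-AEAD CBC modes. This is a coarse filter, not a full cipher audit:

```python
# Flag cipher suites containing markers of known-weak constructions.
WEAK_MARKERS = ("RC4", "3DES", "CBC")

def weak_ciphers(offered):
    """Return the subset of offered suite names matching a weak marker."""
    return [c for c in offered if any(m in c for m in WEAK_MARKERS)]
```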

Lack of proper cookie control implementation raises significant concerns, particularly regarding compliance with regulations such as GDPR (General Data Protection Regulation). Without proper cookie control, websites may inadvertently collect and store user data without explicit consent, violating GDPR requirements for transparency and user privacy protection. This oversight exposes users to potential privacy risks and leaves website owners liable to hefty fines and legal consequences for non-compliance with data protection regulations.

From our analysis, over 720,000 websites were found to lack proper cookie control implementation. This widespread non-compliance indicates a major gap in the adherence to critical data protection standards. Ensuring proper cookie consent mechanisms is essential for protecting user privacy and maintaining trust in online services. Website owners must prioritize this aspect of compliance to avoid potential legal repercussions and enhance their commitment to user data protection.

There were instances where websites utilized security headers but misconfigured them, which is much like wearing a bulletproof vest with a hole in it.

For example, the “includeSubDomains” directive in the HSTS header is crucial for instructing browsers to apply the same HSTS rules to subdomains, ensuring that they too are reached only over HTTPS. However, some websites neglect to include this directive, leaving subdomains vulnerable to downgrade and interception risks.
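
Detecting this is a matter of parsing the header’s semicolon-separated directives. A minimal check:

```python
# Check a Strict-Transport-Security header value for includeSubDomains.
def hsts_covers_subdomains(header_value):
    directives = [d.strip().lower() for d in header_value.split(";")]
    return "includesubdomains" in directives  # directive names are case-insensitive
```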

Similarly, misconfigured CSP headers pose significant risks. For instance:

– No base-uri directive: This oversight means that the CSP fails to specify a base-uri directive, which is vital for restricting the URLs that can be used in a <base> element.

– No object-src directive: This indicates a lack of an object-src directive in the CSP, which controls the URLs from which the browser can load plugins.

– No script-src directive: This suggests that the CSP does not define a script-src directive, responsible for governing the URLs from which the browser can execute JavaScript.

Each of these directives plays a critical role in defining the security policy of a web application. By omitting them, websites essentially allow potentially harmful resources to be loaded or executed, leaving them vulnerable to various attacks. It’s essential to properly configure these headers to enhance the security posture of your website and protect against potential threats.
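
A simplified version of such a CSP audit might look like this. Note that object-src and script-src fall back to default-src when absent, while base-uri has no fallback and must be set explicitly:

```python
# Report which of the directives discussed above are missing from a CSP.
def csp_missing_directives(policy):
    present = {d.strip().split()[0].lower()
               for d in policy.split(";") if d.strip()}
    missing = []
    if "base-uri" not in present:
        missing.append("base-uri")  # base-uri never falls back to default-src
    for d in ("object-src", "script-src"):
        if d not in present and "default-src" not in present:
            missing.append(d)       # these do fall back to default-src
    return missing
```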

The server HTTP response header contains the server software name. It is a great source for finding the market share of various server software and their versions being used in the wild.

From what we’ve observed, a large number of websites were using Cloudflare in some capacity, either as a Content Delivery Network (CDN) or a Web Application Firewall (WAF). It was quite surprising to see Cloudflare dominating the landscape with a whopping 40.98% share. While we initially expected Cloudflare to have a significant presence, we didn’t anticipate it being this high. It’s worth noting that the percentages for Apache and Nginx might not be entirely accurate, since variations like “nginx” and “nginx/1.18.0” are counted separately.
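
Tallying market share more fairly requires normalising Server header values first, e.g. stripping version strings before counting. An illustrative sketch (not our exact normalisation):

```python
# Fold Server header variants like "nginx/1.18.0 (Ubuntu)" into one product name.
from collections import Counter

def product_name(server_value):
    """'nginx/1.18.0 (Ubuntu)' -> 'nginx'; 'Apache/2.4.41' -> 'apache'."""
    return server_value.split("/")[0].split()[0].strip().lower()

def market_share(server_values):
    return Counter(product_name(v) for v in server_values)
```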

We also identified several hosts running outdated server software with at least one known public CVE. For instance:

Server version | Number of servers | Public vulnerabilities
Apache/2.4.41 | 3,611 | 41
Apache/2.4.52 | 2,437 | 18
nginx/1.14.0 | 13,629 | 12
nginx/1.18.0 | 13,208 | 5

HTTP headers play a crucial role in boosting web security with easy implementation. They help prevent security vulnerabilities like Cross-Site Scripting, Clickjacking, and Information disclosure.

In our recent analysis of 1 million sites, it’s evident that many developers are still unaware of security headers and how they can enhance their website’s security. While security headers don’t guarantee 100% protection against web vulnerabilities (for example, misconfigured CSP headers can still leave a website vulnerable to XSS attacks), they make such flaws considerably harder for attackers to exploit.

After observing a low volume of data from our analysis, we initially suspected issues with our tools or Elasticsearch functionality. However, upon cross-referencing with other research sources, such as crawler.ninja, we found similar trends. The adoption of security headers appears to be notably low, with only approximately 250,000 websites implementing them in some form. 

Not just this, developers are also misspelling response headers. For example, our Elasticsearch dashboard shows many instances of misspelled Content-Security-Policy headers:

Misspelled Response Headers

This lack of implementation and minor mistakes leave websites vulnerable to various attacks, including clickjacking and XSS (Cross-Site Scripting), highlighting the urgent need for improved security measures across the internet.
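
One lightweight way to surface such near-miss header names is fuzzy matching against the canonical spellings. A sketch (the 0.85 cutoff is an arbitrary choice of ours, not a standard threshold):

```python
# Flag header names that are probably typos of well-known security headers.
import difflib

CANONICAL = ["Content-Security-Policy", "Strict-Transport-Security",
             "X-Content-Type-Options", "X-Frame-Options"]

def likely_misspelling(header_name):
    """Return the canonical header this name is probably a typo of,
    or None if it matches exactly or is unrelated."""
    lowered = [c.lower() for c in CANONICAL]
    if header_name.lower() in lowered:
        return None  # spelled correctly
    close = difflib.get_close_matches(header_name.lower(), lowered,
                                      n=1, cutoff=0.85)
    return CANONICAL[lowered.index(close[0])] if close else None
```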

In our ongoing efforts to advance web security, we are excited to release datasets derived from our scan of 1 million websites. These datasets are designed to support researchers, developers, and security professionals in their endeavours to understand and improve the landscape of web technologies and security practices. Below are the datasets available for download, along with a brief description of each:

Content Security Policies (CSP) Dataset

File Name: SH-content-security-policy.txt

Description: This dataset contains Content Security Policies (CSP) implemented across the top 1 million websites. CSPs are critical in protecting websites from various types of attacks, including cross-site scripting (XSS). This file provides insights into how websites configure their CSPs, highlighting common practices and potential areas for improvement.

Unique HTTP Response Headers Dataset

File Name: SH-response-headers-uniq-count.txt

Description: This dataset lists unique HTTP response headers identified during our scan. HTTP headers are essential for communicating between web servers and clients. Understanding the diversity and frequency of these headers can provide valuable insights into how websites manage communication and security measures.

Top 1000 Technology Stacks Dataset

File Name: SH-top-1000-techstack.txt

Description: This file contains the top 1000 technology stacks (tech stacks) websites use. Tech stacks consist of the various software and technologies used to build and run a website. This dataset provides a valuable snapshot of the most popular and emerging technologies driving today’s leading web platforms.

‘X-Powered-By’ Header Counts Dataset

File Name: SH-x-powered-by-uniq-counts.txt

Description: The ‘X-Powered-By’ HTTP header reveals the technology or framework powering a website’s server. This dataset aggregates unique counts of these headers, offering insight into the prevalence and variety of server-side technologies across the web.

You can download the entire collection of the above-mentioned files directly from our datasets repo: Click Here

As we move into a world where everything is connected through the internet, it becomes increasingly important to keep track of how secure the internet actually is and where it remains vulnerable.

Our study examined a large sample of websites for potential security issues. Despite growing awareness of cybersecurity problems, we found that many websites are still poorly secured: expired SSL certificates, weak SSL ciphers, missing or misconfigured security headers, and outdated server software remain common.

As seen from our observations, the internet appears to be very vulnerable, and market solutions seem to be ineffective. There is a clear need for improvement in the products available. At RedHunt Labs, we conduct this kind of research and integrate better tools into our products so that our customers can have peace of mind, knowing their online assets are secure.

In this threat landscape, it is crucial for those who own or operate websites to make security a top priority: audit their security regularly, follow established best practices, and keep learning about new threats.

Companies with internet-facing assets need to be extra careful. They should regularly review every point where an attacker could get in, much like checking whether any doors have been left open. Solutions like RedHunt Labs’ ASM Platform can help with this by providing continuous visibility into your dynamic attack surface and associated risks.

To sum it up, keeping internet-connected assets safe is vital as attackers keep finding new ways to cause harm. By staying vigilant and continuously monitoring for problems, we can remain secure in this vast, connected world.

Let’s Reduce Your Org’s Attack Surface.
