At RedHunt Labs, we regularly perform various internet-wide studies as a part of Project Resonance, to keep up with ever-changing cyberspace as well as to enrich our Attack Surface Management product NVADR. This blog post is about our recent study in which we analyzed 65 million web servers resulting in interesting insights. At the end of this blog post, we are releasing a few datasets from our internet scan results for the community.
Over the decades, the internet became home to not just computers but also other devices like webcams, wireless routers, smart doorbells, and many others. These devices can be accessed by their IP addresses even if they don’t have a DNS record.
Talking about the scale of the internet (not considering IPv6), there are about 3.7 billion IPv4 addresses available for public use. The rest of the IP addresses are reserved for special purposes such as broadcasting, multicasting or experimenting. Each of the public IPs can serve content on any of the 65,353 ports, that’s about 242 trillion possible ports for a service to run on. This vast space is home to everything that is connected to the internet, intentionally or unintentionally and that’s why it is important to analyze the security posture of this space.
Scanning the entire internet itself is an interesting topic that has been talked about in many great blogs and presentations like the following:
- Nmap: Scanning the Internet [Talk by Fyodor]
- Internet Scanning: Current State & Lessons Learned [Talk by Mark Schloesser]
- Mass Scanning the Internet [Talk by Rob Graham, Paul McMillan, and Dan Tentler]
What did we do?
To perform such a study ourselves, we started with scanning ports 80 and 443 on all the (safe to scan) public IPv4 addresses, the ports usually used by web servers and other devices that are supposed to be accessed over the web. The aim of this study is to pave the way for novel research and provide the data to aid them.
Uncovering the unknown
When analyzing the raw internet scan data, it is easy to be intimidated by the huge volume of information. Even though one can come up with hundreds of right ways to start analyzing the data, as a first step, we analyzed – HTML titles, server headers, and x-powered-by headers. These data points are a good source of information to understand what type of content is being served by a particular HTTP server.
Analyzing HTML Titles
HTML titles are simply the titles of HTML pages specified by the <title> tag. They can provide good insights into the nature and content of a webpage. Are you looking at a simple test page / a dashboard / a login page of an IoT device – the HTML title answers the question in most cases and is also a source for identifying asset type.
Open directory listings
A directory listing is a feature that lets a client browse files and directories of a web server. According to PortSwigger, “Web servers can be configured to automatically list the contents of directories that do not have an index page present. This can aid an attacker by enabling them to quickly identify the resources at a given path, and proceed directly to analyzing and attacking those resources”. It is not advised to turn on this feature unless the content is not sensitive and is intended to be accessed that way. We found out that there are 324,433 open directory listings on port 80.
Insecure login pages
SSL/TLS was introduced to protect communications from Man in The Middle attacks and ensure the integrity of the data. Websites can use encrypted SSL/TLS connections with the help of digital certificates. A decade ago, getting these digital certificates was a costly affair, and managing them required expertise. Today, we have Let’s Encrypt and ZeroSSL which issue digital certificates free of cost and a lot of documentation around how to manage them. In spite of this, we found that over 1,021,275 login pages are being served over HTTP.
Information leakage via phpinfo()
phpinfo is a web page that contains information about the PHP configuration and available predefined variables on a system. The information contained in it is usually valuable to attackers. We found that 22,896 hosts had a phpinfo page as their homepage.
Analyzing server headers
The server HTTP response header contains the server software name. It is a great source for finding the market share of various server software and their versions being used in the wild.
We found that out of the 65 million web servers we scanned, 83% of servers returned a server header while 53% also exposed a version number in the header. Here’s a breakdown of top 10 server technologies we identified:
By extracting and analyzing software versions from the server header, we identified a number of hosts running outdated server software with at least one known public vulnerability.
> Note: These statistics don’t take backports into consideration.
HTTP Server Version Stack vs Public Vulnerabilities
|Server version||Number of servers||Public vulnerabilities|
|Server version||Number of servers||Public vulnerabilities|
|Apache/2.3.0 – 2.3.*||1839||~ 40|
|Apache/2.0.0 – 2.2.*||1233721||~ 500|
|Apache/< 2.0.0||60506||~ 80|
Analyzing x-powered-by headers
The x-powered-by header isn’t a standard HTTP header. It can be used by any software running on the server to let clients know of its presence while the server header is strictly used by the server software.
State of PHP
PHP 7 has been a very important upgrade due to various flaws of PHP 5. We found that out of 3,151,787 web servers exposing their PHP version, only 1,197,904 (~38%) were using PHP 7.
x-runtime header & side-channel attacks
Side-channel attacks are a class of attacks that involve gaining valuable information from a system by analyzing its behaviour, and exploiting the system with that information if possible. Timing based attacks are a subset of side-channel attacks where the information is gained by analyzing the time taken by the target system to various operations.
Timing-based side-channel attacks aren’t usually reliable over the internet because of various factors that may affect the time taken by an HTTP request to complete such as bandwidth, server load, etc. However, the x-runtime HTTP header completely strips off this unintended protection as it contains the exact time the server took to process the request e.g. x-runtime: 0.3754
An attacker can analyze the runtime of the server to make assumptions about the logic of the application as well as exfiltrate data from the server. The simplest example would be identifying if a username exists on a web application or not. In a typical scenario, a malicious actor can carefully tweak the input to different variations to find one which makes the server process the request for the longest and then send that request repeatedly to cause Denial of Service.
During our analysis, we found that a total of 350,699 hosts were using x-runtime header.
While analyzing the most common HTML titles, we came across the title Directory listing for /. We found that most of the web pages with this title were served by Python’s SimpleHTTPServer library. This library is well known for its ability to start a web server quickly from the command line. However, it serves the directory it’s been run from which we found is a fact overlooked by a concerning amount of users. Unlike usual directory listing pages, most of these directory listings were root filesystems, leaving everything in the root partition publicly exposed.
The reason for such exposed python servers are usually humans who start a small service to share files, or test a prototype quickly, and then forget about the service exposed. While it is a small mistake, the impact could be as big as a data Breach since these directories contain critical and sensitive information quite often.
A publicly accessible root partition is as bad as it sounds. There are critical credentials such as SSH private keys and AWS access keys that may give attackers access to other assets. Most likely, the users who have committed this mistake do not realize that SimpleHTTPServer exposes the directory it is executed from.
Datasets released to Community for further research
There is always more to be found and always more possibilities, we merely scratched the surface. To facilitate new research being the focus of this study, we are releasing datasets from this internet scan which we think might be interesting/useful for the security researchers and infosec community.
Please feel free to use this for research/free purposes. In case you do something cool with this, give us a shout at @redhuntlabs or email@example.com. We would love to hear about them. Details for these datasets are shared below:
Description: Contains HTML titles sorted by (and containing) their frequency.
Description: Contains HTTP header names sorted by (and containing) their frequency.
HTTP Headers and Values – Complete
Description: Contains all HTTP headers (both name and value) returned from GET requests sent to port 80 and 443.
server & x-powered-by header
Download Link: server header (https://rhl-resonance.s3.eu-west-2.amazonaws.com/waves/2/servers.zip) &
X-Powered By headers (http://rhl-resonance.s3.eu-west-2.amazonaws.com/waves/2/x_powered_by.zip)
Description: Contains values of server and x-powered-by header sorted by (and containing) their frequency.
Size: 3.2M and 0.2M
We are stepping into an age where everything is starting to get connected to the internet, everything has a chip. In such times, it is essential to continuously evaluate the attack surface & security posture of the internet as a whole.
If you own technology that’s connected to the internet, it might be exposing data to the internet when directly accessed over its public IP address. To make sure there’s no unintentional data exposure, get a list of public IPs you own and scan for all ports to see if any port is open. You can do this manually or sign up for an NVADR account to visualize all your domains, subdomains, IPs, docker containers and open ports in a single place.