Octopii – An open-source, PII (Personally Identifiable Information) Scanner for Images

Today, many services collect Personally Identifiable Information (PII): ‘KYC’ (Know Your Customer) processes gather identity documents, bureaus keep records of people, small businesses keep records of their employees, and so on. As a result, PII faces a wide variety of threats. With security breaches on the rise, protecting valuable data such as PII must be a top priority for every organization. To secure PII from leakage and exposure, organizations need to ensure that such assets are handled and disposed of properly. The first step is to identify where such assets are exposed.

Enter Octopii, an AI-powered Personally Identifiable Information scanner that uses Tesseract’s Optical Character Recognition (OCR) and a MobileNet Convolutional Neural Network (CNN) model to detect various forms of government IDs, passports, debit cards, driver’s licenses, photos, signatures, etc. Let’s take a closer look at how Octopii works and why it’s essential to look out for exposed PII throughout your assets.

Existing Solutions – What we found…

During our research, we came across a few solutions, such as Amazon Macie and ManageEngine DataSecurity Plus, each with its own quirks. Macie, for example, only supports S3 buckets and only integrates with Amazon’s own services. DataSecurity Plus doesn’t support S3 and is a paid product. We couldn’t find an extensible open-source tool that could identify potentially exposed PII, so we at RedHunt Labs developed Octopii and open-sourced it to build momentum for it in the open-source community.

PII and its exposure

Personally Identifiable Information, or ‘PII’, is any information that can be used to identify or trace back to an individual. Pieces of data such as a person’s name, address, Social Security Number (SSN), phone number, and email address are deemed PII when they can be used to identify a specific individual. As organizations grow in size, so does the volume of data linked to them. This makes identifying and protecting such sensitive resources at scale quite complex.

Protecting PII is solely the responsibility of the organization that handles the data. Several incidents have emerged in recent years in which organizations failed to implement appropriate security standards, putting customers’ confidential data at risk. In more than half of these cases, the PII was exposed through badly configured Amazon S3 buckets (a form of exposure that Octopii can look for). Strict rules have since been enacted against organizations that put their customers’ data at risk.

How Octopii works

The main objective of Octopii is to identify PII documents, which may be either the original electronic copy of a document or a copy that someone scanned or photographed and uploaded. Manually captured images make the document type harder to determine: the orientation may be crooked, the image may be poorly lit, characters may be unreadable owing to document degradation, the image may be cropped, and so on. As a result, it becomes difficult to classify an image according to the type of document.

Image classification accuracy depends heavily on how well the model is trained. Octopii uses Keras, an open-source deep learning library, for image classification. Let’s look at the broad steps Octopii performs to classify an image:

1. Importing and cleaning images

Octopii can scan for exposed PII in HTML-based open-directory listings, Amazon S3 buckets, or a local path on the web server. Depending on the type of path specified, Octopii imports the images via OpenCV and the Python Imaging Library (PIL). To work around the difficulties caused by manual scanning, images are cleaned and have their subpixels rearranged before scanning. When a directory or URL is provided, the tool recursively traverses its subdirectories and fetches the images. A file is treated as an image if OpenCV is able to read it. Currently, JPEG, PNG, and GIF files are confirmed to be working.
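The recursive traversal for a local path can be sketched as follows. This is a minimal illustration, not Octopii’s actual code: the function names `looks_like_image` and `find_images` are hypothetical, and the check on a file’s leading bytes is a standard-library stand-in for the real test, which is whether OpenCV can read the file.

```python
import os

# Leading "magic" bytes for the three formats the post confirms as working.
# Checking them is a stand-in for Octopii's real test (OpenCV reading the file).
IMAGE_SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def looks_like_image(path):
    """Return the detected format name, or None if the header is unrecognized."""
    try:
        with open(path, "rb") as handle:
            header = handle.read(8)
    except OSError:
        # Unreadable files are skipped rather than aborting the scan.
        return None
    for name, signatures in IMAGE_SIGNATURES.items():
        if header.startswith(signatures):
            return name
    return None


def find_images(root):
    """Recursively walk `root` and return sorted paths of files that look like images."""
    found = []
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(directory, filename)
            if looks_like_image(path) is not None:
                found.append(path)
    return sorted(found)
```

Deciding by content rather than by file extension means a renamed or extension-less image is still picked up, which matches the spirit of letting the image reader decide.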

2. Image classification

To provide accurate results, the neural network needs a trained model to work with. This is where machine learning (ML) comes into the picture. We generated a simple model using Google’s Teachable Machine, which uses a simple, 53-layer-deep MobileNet model. The model is generated from PII we blurred and from some standard-issue templates we generated after research.

This is how Octopii is able to “think” about how likely an image is to be a certain kind of image – because we bias its opinions on purpose via the model, therefore letting the machine tell us what an asset could be. It does not actually understand what these asset types are; it simply assigns an index number to it, which we cross with the labels file that we manually specify to understand the network’s output.
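The step of crossing the network’s output index with the labels file can be sketched like this. The labels format (one `index name` pair per line), the class names, and the scores are assumptions made for illustration; `parse_labels` and `top_prediction` are hypothetical helpers, not Octopii functions.

```python
def parse_labels(text):
    """Parse lines such as '0 Passport' into {0: 'Passport'}."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        index, _, name = line.partition(" ")
        labels[int(index)] = name.strip()
    return labels


def top_prediction(scores, labels):
    """Return (label, score) for the highest-scoring output index."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index], scores[best_index]


# Hypothetical labels file contents and model output, for illustration only.
LABELS_TEXT = "0 Passport\n1 Driver's license\n2 Debit card\n"
scores = [0.08, 0.81, 0.11]

label, confidence = top_prediction(scores, parse_labels(LABELS_TEXT))
print(f"{label} ({confidence:.0%})")  # prints: Driver's license (81%)
```

The network itself only ever produces the list of scores; the human-readable name exists solely in the labels mapping.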

3. Optical Character Recognition (OCR)

Next, a verification step checks the classification produced in the previous step. OCR in Octopii is powered by Tesseract, an open-source OCR engine. Like our image classifier, Tesseract uses a neural network subsystem, one that is optimized solely to recognize lines of text.

Octopii simply calls Tesseract on a copy of the image we feed into the image classifier to look for strings of text. We manually specify the search strings that Octopii uses, which contain unique words that only a certain type of PII may have (for example, a banking document may have the word “Passbook” which a driver’s license doesn’t have). This functionality can also be enhanced with regular expressions, which can not only improve accuracy but can also help users understand how rapidly and easily identity theft can be automated.
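A minimal sketch of this keyword check, assuming the OCR text is already available as a string (for instance from a Tesseract wrapper such as pytesseract, which is an assumption here). The keyword lists, the card-number pattern, and the function names are all hypothetical; Octopii’s real search strings may differ.

```python
import re

# Hypothetical keyword lists, one per document type.
KEYWORDS = {
    "Bank passbook": ["passbook", "account number", "ifsc"],
    "Driver's license": ["driving licence", "driving license", "driver license"],
    "Passport": ["passport", "nationality"],
}

# Hypothetical regular expression: 16 digits in groups of four,
# optionally separated by spaces or hyphens.
CARD_NUMBER = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def score_text(ocr_text):
    """Count keyword hits per document type in OCR output (case-insensitive)."""
    lowered = ocr_text.lower()
    hits = {}
    for doc_type, words in KEYWORDS.items():
        count = sum(1 for word in words if word in lowered)
        if count:
            hits[doc_type] = count
    return hits


def find_card_numbers(ocr_text):
    """Return every substring that matches the card-number pattern."""
    return CARD_NUMBER.findall(ocr_text)
```

For example, `score_text("State Bank PASSBOOK\nAccount Number: 12345")` reports two hits for the passbook type and none for the others, which can then be compared with the classifier’s guess.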

How Octopii can help

There could be many scenarios where Octopii can be super useful:

  • One of your web applications has a browsable directory, and you want to find out whether that exposed directory is also leaking PII.
  • You want to check whether any of your S3 buckets contain PII, in order to stay compliant and reduce risk.
  • You want to go through a heap of images in a directory and make sure no PII is being stored there.

Getting started with Octopii

Since Octopii is a Python-based command-line tool, you need to have your Python environment set up correctly. Your system must have all the required dependencies for Octopii to function properly. You can install the required modules all at once using the requirements.txt file.

pip install -r requirements.txt

Octopii currently supports scanning the local file system, Amazon S3 buckets, and open directory listings via their URLs. To help users test the tool and get a better understanding of it, we have created a dummy environment with PII images hosted in it.

python3 octopii.py <target>

Here is a more comprehensive video regarding installation and execution.

How you can help

Since Octopii is an open-source project, we appreciate contributions from the community. Octopii relies heavily on machine learning, and there’s always room for improvement where trained models are involved. The image classification model can be improved by providing more datasets to train on. Since PII is highly confidential information, large datasets for training the model are hard to obtain. We envision Octopii supporting the classification of international documents, which isn’t possible without your support.

Here’s a walkthrough video explaining how you could contribute to this project by uploading PII documents from your own country to extend the image classification Octopii supports. The video shows how easy it is to generate a model using Google’s Teachable Machine, though you can also use Keras.

Storing sensitive customer, employee, or user data such as government IDs, photos, etc. in a safe environment is extremely important. A lot of small businesses fail to understand this and end up storing such data in insecure environments such as open directories or S3 buckets without proper authorization, prioritizing ease of access over security. Since this area of security is less scrutinized than others, solutions for detecting PII are few and far between. Octopii is RedHunt Labs’ effort to improve on this frontier of open-source intelligence.


We at RedHunt Labs help organizations discover untracked assets, data exposure, and external attack surface with NVADR, an all-in-one attack surface management SaaS solution.

New attack vectors and vulnerabilities emerge frequently and might affect one (or many) assets across your organization. During such times, having a precise external asset inventory makes it easy to scan for systems affected by the newly published vulnerability.

NVADR also ‘continuously’ enumerates and lists all the technologies used across your external attack surface and thus helps identify affected assets right away. Don’t hesitate to get in touch with us to schedule your free trial today.