Password Strength Multinomial Classification

Natural language Processing with Logistic Classifier

In this project we build a model to predict the password strength supplied by a user, through the use of a logistic regression classifier and natural language processing using term frequency-inverse document frequency (TF-IDF) measure.

Data breaches and identity theft are on the rise, and the cause is often compromised passwords. As such, passwords provide the first line of defense against unauthorized access to your computer and personal information. The stronger your password, the more protected your computer will be from hackers and malicious software. You should maintain strong passwords for all accounts on your computer (Importance of Passwords). Interestingly, compromised passwords caused 80 percent of all data breaches in 20192, resulting in financial losses for both businesses and consumers.

There are plenty of website which seem to implement a similar working that is done in this project. For example look at the below sites:

Both of these sites have the user input a password (any password), and it will immediately evaluate the strength of the password. It also provides a rough estimate as to how long it may take to crack proposed password.


Aim of Project

We seek to build a classification model such that a user inputs a password, then we can classify the strength of the proposed password into 3 categories - weak (0), average (1) and strong (2). NLP comes into play for the processing of the text data, where we apply TF-IDF which is a NLP technique to preprocess the text data into vectors for use in our models.


Data Description

The data set is taken from Kaggle and contains over 670 000 different passwords of varying length and formats. Each password has a label associated with it, stipulating the strength of the password:

  • 1 if it is an average strength password
  • 2 if it is strong strength password
  • 0 if it is a weak strength password

The strength of the password is determined on rules such as lowercase english alphabetical letters, digits, upper case english alphabetical letters, special characters, etc.


Setup

We used several modules for this project, namely:

  • NumPy for scientific computing
  • Pandas for data analysis and manipulation
  • Seaborn for drawing attractive and informative statistical graphics
In [1]:
# Load packages
import numpy as np
import pandas as pd
import seaborn as sns
In [2]:
# Adjust for Displayed Warnings to False
import warnings
warnings.filterwarnings('ignore')
In [8]:
# Read in Data
data = pd.read_csv("data.csv")
print("Extract of Data: \n", data.head(), "\n")
print("Shape of Data: \n", data.shape)
Extract of Data: 
       password  strength
0     lkaj8899         1
1     cgsu5858         1
2    xyfy3y1qw         1
3      dyui159         1
4  qwertyuiop0         1 

Shape of Data: 
 (669640, 2)

The beginning 5 entries in the data set are shown. Password is given in one column and the labelled strength is shown in the second column. The first 5 passwords given in the data set have a strength of 1 - indicating they have average strength.

We also see that we have 669 640 passwords and their respective strengths.


Brief Data Exploration

We look at the data types, value counts and unique labels in our data set.

In [9]:
# Initial Data Analysis
print("Password Strenght Labels:" ,data['strength'].unique(), "\n")
print("Data Types: \n", data.dtypes, "\n")
print("Data Summary for Strength Labels: \n", data.describe().transpose(), "\n")
Password Strenght Labels: [1 2 0] 

Data Types: 
 password    object
strength     int64
dtype: object 

Data Summary for Strength Labels: 
              count      mean       std  min  25%  50%  75%  max
strength  669640.0  0.990196  0.507948  0.0  1.0  1.0  1.0  2.0 

We see a variety of information here, namely the unique labels for the data set - 1 for medium strength, 0 for weak strength and 2 for strong strength of password. We also see the data types in our data frame, where labels are numeric and password is a list of characters. This is acceptable and appropriate variable types. Finally, we see a summary of our labels, where the mean password strength is about medium (1). Below we investigate the value counts - the count of passwords for each label.

In [11]:
# Value Counts
print("Value Counts for Password Strengths: \n", data["strength"].value_counts(), "\n")
sns.set(rc = {'figure.figsize':(10,5)})                                 # used to control the theme and configurations of the seaborn plot
sns.countplot(x="strength", data=data, hue="strength");
Value Counts for Password Strengths: 
 1    496801
0     89702
2     83137
Name: strength, dtype: int64 

We find that we have a relatively imbalanced data set, without most passwords (observations) belonging to 1 category - average strenght passwords.

In [12]:
#Check for Missing Data
print("Missing Data: \n", data.isna().sum())

data[data['password'].isnull()]
Missing Data: 
 password    1
strength    0
dtype: int64
Out[12]:
password strength
367579 NaN 0

We have one missing observation - the password is missing, but the label is not. We shall simply remove this observation.

In [13]:
# Drop NA records
data.dropna(inplace=True)
print(data.isnull().sum())                # Check if removed
password    0
strength    0
dtype: int64

We now shall create a numpy array using our data frame from pandas. This will allow us to easily handle the data.

In [14]:
# Create Numpy array
password_tuple=np.array(data)
print(password_tuple)
[['lkaj8899' 1]
 ['cgsu5858' 1]
 ['xyfy3y1qw' 1]
 ...
 ['184520socram' 1]
 ['marken22a' 1]
 ['fxx4pw4g' 1]]

We also shall randomize the arrangement of data points or elements in the array - for robustness.

In [15]:
# Randomise Data
import random
random.shuffle(password_tuple)
password_tuple
Out[15]:
array([['lkaj8899', 1],
       ['lkaj8899', 1],
       ['xyfy3y1qw', 1],
       ...,
       ['65frydtw47', 1],
       ['sample728', 1],
       ['palacioian4', 1]], dtype=object)
In [16]:
# Set X and Y data
x=[passwords[0] for passwords in password_tuple]            # predictor
y=[strength[1] for strength in password_tuple]              # response / label
print("First few values from x: \n", x[1:6], "\n")
print("First few values from y: \n", y[1:6], "\n")
First few values from x: 
 ['lkaj8899', 'xyfy3y1qw', 'cgsu5858', 'cgsu5858', 'AVYq1lDE4MgAZfNt'] 

First few values from y: 
 [1, 1, 1, 1, 2] 


Text Processing

Now we begin this section by creating a function to split the input into characters of list.

In [17]:
# Create function: splits input into character list
def split(inputs):
    character=[]
    for i in inputs:
        character.append(i)
    return character
In [18]:
split('e.g.password123')
Out[18]:
['e', '.', 'g', '.', 'p', 'a', 's', 's', 'w', 'o', 'r', 'd', '1', '2', '3']

We can now import TF-IDF Vectorizer. The goal and idea of using TF-IDF here, is to scale down the impact of tokens that occur very frequently in a given corpus (a collection of written texts) and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

TF or Term Frequency tells us how frquently a term occurs in document. This can be shown below: $\text{TF(t)} = \frac{\text{Number of times term t appears in document}}{\text{Total number of terms in document}}$

Inverse Data Frequency (IDF) essentially tells us the weight of rare words. The words that occur rarely in the corpus have a high IDF score. This is shown here: $\text{IDF(t)} = \text{log}\left(\frac{\text{Total number of documents}}{\text{Number of documents with term t in it}}\right)$

After importing the module. We can instantiating TfidfVectorizer().

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer         # import module
vectorizer = TfidfVectorizer(tokenizer=split)                       #need to specificy tokenizer

We need to specify the argument tokenzier to be the function we have created, split. From here we can apply the TF-IDF vectorizer to the data (x) we have.

In [53]:
# Transform the data, X
Matrix=vectorizer.fit_transform(x)
In [55]:
# Return Dictionary
print(vectorizer.vocabulary_)
{'l': 53, 'k': 52, 'a': 42, 'j': 51, '8': 28, '9': 29, 'c': 44, 'g': 48, 's': 60, 'u': 62, '5': 25, 'd': 45, 'y': 66, 'i': 50, '1': 21, 'x': 65, 'f': 47, '3': 23, 'q': 58, 'w': 64, 'e': 46, 'r': 59, 't': 61, 'o': 56, 'p': 57, '0': 20, '6': 26, '2': 22, 'n': 55, 'v': 63, '4': 24, 'm': 54, 'z': 67, '@': 35, '-': 17, 'h': 49, 'b': 43, '7': 27, '.': 18, '&': 12, '?': 34, '>': 33, '<': 31, '!': 7, '_': 40, '$': 10, ' ': 6, ';': 30, '/': 19, '±': 76, '*': 15, '(': 13, ')': 14, '#': 9, '%': 11, '`': 41, '+': 16, '\\': 37, 'þ': 103, 'ó': 96, '\x1c': 4, '[': 36, ']': 38, 'ú': 100, '=': 32, '{': 68, '}': 70, '^': 39, '¿': 86, '~': 71, '³': 78, 'ô': 97, '\x05': 0, '\x1b': 3, '"': 8, '\x16': 1, 'ò': 95, '·': 82, '\x1e': 5, 'ä': 90, 'ß': 87, '\x19': 2, '´': 79, '°': 75, 'à': 88, 'å': 91, '‚': 105, 'õ': 98, '\x7f': 72, '|': 69, '²': 77, 'ð': 94, 'â': 89, '¡': 73, 'ý': 102, '÷': 99, '¨': 74, 'ÿ': 104, 'í': 93, '¾': 85, 'µ': 80, 'ü': 101, 'ç': 92, 'º': 84, '¹': 83, '¶': 81}

We see returned a dictionary comprised of the tokens and their respective indices in the array. So we see that all the words were made lowercase by default and that the punctuation was ignored (if there was).

In [57]:
Matrix.shape #We have 669639 rows (669639 passwords) and 106 columns (106 unique words).
Out[57]:
(669639, 106)

We have 669639 rows (669639 passwords) and 106 columns (106 unique words).

In [62]:
FirstDocVector=Matrix[0]
df=pd.DataFrame(FirstDocVector.T.todense(),index=vectorizer.get_feature_names(),columns=['TF-IDF'])
df.sort_values(by=['TF-IDF'],ascending=False).head(10)
Out[62]:
TF-IDF
8 0.595507
9 0.568080
j 0.337354
k 0.302192
l 0.281619
... ...
> 0.000000
= 0.000000
< 0.000000
; 0.000000
0.000000

106 rows × 1 columns


Model Training

To begin this section we shall start with construcitng our training and testing data sets. Our training set is essentially a subset of the original data to train our model, whilst the test set is remained (or subset) of orignal data set (not train) used to test the trained model.

In [65]:
from sklearn.model_selection import train_test_split                    #importing train test splitter module
Matrix_train, Matrix_test, y_train, y_test=train_test_split(Matrix,y,test_size=0.3)
Matrix_train.shape
Out[65]:
(468747, 106)

We have a 70/30 split for train to test data. So 70% of the data will be used for training and 30% will be used for testing.

We now seek to implement our logistic regression classification model.

In [66]:
# Import Logistic Regression
from sklearn.linear_model import LogisticRegression

# Multinomial Logistic Regression
clf=LogisticRegression(random_state=0,multi_class='multinomial')
clf.fit(Matrix_train,y_train)
Out[66]:
LogisticRegression(multi_class='multinomial', random_state=0)

We make sure to Set multi_class parameter as multinomial because we have more than 2 categories in the data, i.e. 0, 1 and 2 for password strength classifications. That is, we are considering a case of multinomial Logistic Regression.

Multinomial logistic regression is used when you have a categorical dependent variable with two or more unordered levels (i.e. two or more discrete outcomes). It is practically identical to logistic regression, except that you have multiple possible outcomes instead of just one.

We now look to check the accuracy of our model.

Prediction Using Test Data

We now assess the accuracy of our model by using our testing data set.

In [68]:
y_prediction=clf.predict(Matrix_test)
print(Matrix_test,y_prediction[0:10])
[1 2 1 1 1 1 1 1 1 2]

We can assess the accuracy =using certain measures, specifically a confusion matrix shall be assessed and accuracy score returned. The Confusion Matrix allows us to measure Recall, Precision, Accuracy and AUC-ROC curve are the metrics to measure the performance of the model.

Note that since we have a multinomial problem, our confusion matrix is a $N\times N$ matrix, where $N$ is the number of classes or outputs.

So we shall have a $3 \times 3$ confusion matrix

  • True Positive: the actual value and also the predicted values are the same

  • False Positive: the actual value is negative but the model has predicted it as positive

  • False Negative: the actual value is positive but the model has predicted it as negative

  • True Negative: the actual value and also the predicted values are the same

(Analytics Vidya, Confusion matrix for Multi Class Classification)

In [75]:
139473 + 3667 + 7666 + 16793
Out[75]:
167599
In [72]:
# Import confusion matrix and accuracy score
from sklearn.metrics import confusion_matrix,accuracy_score

# Accuracy Measures
cm=confusion_matrix(y_test,y_prediction)
print("Confusion Matrix: \n", cm, "\n")
print("Model Accuracy using testing Data: \n", accuracy_score(y_test,y_prediction))
Confusion Matrix: 
 [[  8170  19318     16]
 [  5742 139473   3667]
 [    47   7666  16793]] 

Model Accuracy using testing Data: 
 0.8185293590585987

The below analysis and calculations for TP, TN, FP and FN are for the class 0 (weak passwords):

  • TP: Concerning 0 (weak password) class (cell 1), the value of 8170 is the True Positive. 8170 were classified as 0 and were correctly predicted as 0.

  • FN: this shall be the sum of values of corresponding rows except TP value. Sow e shall have FN = (cell 2 + cell 3) $= 19318 + 16 = 19334$

  • FP: will be the sum of values of corresponding column except the TP value. So we shall have FP = (cell 4 + cell 7) $= 5742 + 47 = 5789$

  • TN: sum of values of all columns and row except the values of that class that we are calculating the values for. So we shall have that TN = (cell 5 + cell 6 + cell 8 + cell 9) $= 139473 + 3667 + 7666 + 16793 = 167599$


Closing Remarks

Finally, we now seek to perform predictions on the strenght of passwords which are not part of the data - are outside of the range of training or test data.

In [96]:
dt=np.array(['car'])
prediction=vectorizer.transform(dt)
predz1 = clf.predict(prediction)

dt=np.array(['car12'])
prediction2=vectorizer.transform(dt)
predz2 = clf.predict(prediction2)

dt=np.array(['car&%_12'])
prediction3=vectorizer.transform(dt)
predz3 = clf.predict(prediction3)

print("Password: car \n Password Strength:", predz1)
print("Password: car12 \n Password Strength:", predz2)
print("Password: car&%_12 \n Password Strength:", predz3)
Password: car 
 Password Strength: [0]
Password: car12 
 Password Strength: [1]
Password: car&%_12 
 Password Strength: [2]

Our supplied passwords have strengths of 0, 1 and 2 respectively. We notice that by adding digits or special characters increasing password strength.

How to Verify the Password Strength?

If you get (Array output as "0") that means your password is Weak. If you get (Array output as "1") that means your password is medium. If you get (Array output as "2") that means your password is Strong.

doing prediction on X-Test data

In [81]:
# Final Classifcation Report for Model
from sklearn.metrics import classification_report
print(classification_report(y_test,y_prediction))
              precision    recall  f1-score   support

           0       0.59      0.30      0.39     27504
           1       0.84      0.94      0.88    148882
           2       0.82      0.69      0.75     24506

    accuracy                           0.82    200892
   macro avg       0.75      0.64      0.68    200892
weighted avg       0.80      0.82      0.80    200892