Fraud Detection with Machine Learning using Python (numpy, pandas, matplotlib, and scikit-learn)
Fraud is a pervasive problem in many industries, including finance, insurance, and social media. With the increasing availability of data and the advancement of machine learning algorithms, it has become possible to leverage these tools to detect fraudulent activity more effectively.
In this post, I’ll explore how machine learning can be used for fraud detection, with a tutorial demonstrating how to implement a fraud detection model in Python.
I’ll discuss the key concepts and techniques involved in fraud detection with machine learning, such as preprocessing the data, selecting an appropriate machine learning algorithm, and evaluating the performance of the model.
Sounds cool, right? Let’s dive in!
Step 1. Import the required libraries:
First, you need to import the required libraries, including numpy, pandas, matplotlib, and scikit-learn.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
```
Step 2. Load the data:
Next, you need to load the data that you will use for fraud detection. You can use a publicly available dataset such as the Credit Card Fraud Detection dataset from Kaggle.
```python
df = pd.read_csv('creditcard.csv')
```
Step 3. Explore the data:
Once the data is loaded, you need to explore it to gain a better understanding of its features and distributions.
```python
# Explore the data
print(df.head())
print(df.describe())
print(df.info())
```
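One thing worth checking during exploration is how imbalanced the `Class` column is, since fraud is typically a tiny fraction of all transactions. The snippet below is a minimal, self-contained sketch using a toy stand-in DataFrame (the column names mirror the Kaggle dataset; in practice you would run this on the `df` loaded from `creditcard.csv`):

```python
import pandas as pd

# Toy stand-in for the Kaggle dataset; in practice: df = pd.read_csv('creditcard.csv')
df = pd.DataFrame({
    'Amount': [10.0, 250.0, 3.5, 99.0, 1200.0, 42.0],
    'Class':  [0, 0, 0, 0, 1, 0],  # 1 = fraud, 0 = legitimate
})

# Absolute and relative class counts reveal how imbalanced the data is
print(df['Class'].value_counts())
print(df['Class'].value_counts(normalize=True))
```

On the real Credit Card Fraud Detection dataset, the fraud class makes up well under 1% of transactions, which is why the stratified split and per-class metrics later in this tutorial matter.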
Step 4. Preprocess the data:
Once you have explored the data, you need to preprocess it so that it can be used for training the machine learning model. This involves tasks such as feature engineering, normalization, and splitting the data into training and validation sets.
```python
# Preprocess the data
from sklearn.preprocessing import StandardScaler

# Remove the Time column as it is not useful for classification
df = df.drop('Time', axis=1)

# Normalize the Amount column
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1))

# Split the data into features and labels
X = df.drop('Class', axis=1)
y = df['Class']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
In this preprocessing example, we first remove the `Time` column from the dataset, as it is not useful for classification. We then normalize the `Amount` column using `StandardScaler`, which scales the data to have a mean of 0 and a standard deviation of 1. This is an important preprocessing step: it ensures that all the features have similar scales, which can help improve the performance of the machine learning model.

Next, we split the data into features (`X`) and labels (`y`). The `X` dataframe contains all the columns except the `Class` column, which is the target variable we are trying to predict. The `y` series contains only the `Class` column.

Finally, we split the data into training and validation sets using `train_test_split` from scikit-learn. We use a test size of 0.2, which means that 20% of the data is used for validation. We also use stratified sampling to ensure that the proportion of fraudulent and non-fraudulent transactions is the same in both the training and validation sets. This is important, as it ensures that the machine learning model is trained and evaluated on representative samples of the data.
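To see what stratification buys you, here is a small self-contained sketch on synthetic data (the 2% fraud rate is an assumption chosen for illustration, not a property of the Kaggle dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 20 "fraud" cases out of 1000 (2%)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 20 + [0] * 980)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratification keeps the fraud rate identical in both splits
print(y_train.mean())  # 0.02
print(y_val.mean())    # 0.02
```

Without `stratify=y`, a random 20% slice of such an imbalanced dataset could easily end up with too few (or even zero) fraud cases, making validation metrics unreliable.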
Step 5. Define the model:
Once the data is preprocessed, you need to define the machine learning model. For this example, we will use a random forest classifier.
```python
# Define the model
model = RandomForestClassifier(n_estimators=100)
```
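Because fraud datasets are heavily imbalanced, a common variant (not used in the base tutorial above, so treat it as an optional assumption to experiment with) is to weight classes inversely to their frequency via `class_weight='balanced'`:

```python
from sklearn.ensemble import RandomForestClassifier

# Optional variant: weight each class inversely to its frequency, so the
# rare fraud class contributes more during training
model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
)
```

This can improve recall on the fraud class at the cost of some precision; whether that trade-off is worthwhile depends on the relative cost of missed fraud versus false alarms in your application.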
Step 6. Train the model:
Once the model is defined, you need to train it using the preprocessed data.
```python
# Train the model
model.fit(X_train, y_train)
```
Step 7. Evaluate the model:
After training the model, you need to evaluate its performance on the validation set.
```python
# Evaluate the model on the validation set
y_pred = model.predict(X_val)
print(classification_report(y_val, y_pred))
```
Step 8. Test the model:
Once you are satisfied with the model’s performance on the validation set, you can test it on a new set of data to see how well it generalizes to unseen data.
```python
# Test the model on new data
# Load the new data
new_data = pd.read_csv('new_data.csv')

# Preprocess the new data in the same way as the training data
new_data = new_data.drop('Time', axis=1)
new_data['Amount'] = scaler.transform(new_data['Amount'].values.reshape(-1, 1))
X_new = new_data.drop('Class', axis=1)
y_new = new_data['Class']

# Make predictions on the new data
y_pred = model.predict(X_new)

# Evaluate the performance on the new data
print(classification_report(y_new, y_pred))
```
In this testing example, we first load the new data from a CSV file using `pd.read_csv()`. We then preprocess the new data by dropping the `Time` column and normalizing the `Amount` column using the same `scaler` object that we used for the training data.

Next, we split the new data into features (`X_new`) and labels (`y_new`). We then use the `model.predict()` method to make predictions on the new data. Finally, we evaluate the performance of the model on the new data using `classification_report()` from scikit-learn. This function prints a report that includes metrics such as precision, recall, and F1-score for both the fraudulent and non-fraudulent classes.
Testing on new data gives us a better sense of how well the model generalizes to unseen data and how effective it is at detecting fraudulent activity in real-world scenarios.
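To make precision, recall, and F1-score concrete, here is a small worked example with hypothetical labels (the numbers are invented for illustration only):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels for 10 transactions: 3 actual frauds (class 1)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
# The model flags 3 transactions as fraud: 2 correct, 1 false alarm
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Precision: of the 3 flagged transactions, 2 are true frauds -> 2/3
print(precision_score(y_true, y_pred))
# Recall: of the 3 actual frauds, 2 are caught -> 2/3
print(recall_score(y_true, y_pred))
# F1: harmonic mean of precision and recall -> 2/3 here
print(f1_score(y_true, y_pred))
```

In fraud detection, recall on the fraud class (how many frauds you catch) and precision (how many alerts are real) are usually far more informative than plain accuracy, which can look excellent even for a model that flags nothing.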
That’s it! This basic example should give you an idea of how to use machine learning for fraud detection with Python.