Active Learning: Learning with Limited Labeled Data in Python (Scikit-learn, Active Learning Lib)

Active Learning is a machine learning approach in which the model selects the most informative data points to be labeled by an oracle (typically a human annotator), thereby reducing the number of labeled data points required to train a model. It is useful in scenarios where labeled data is limited or expensive to acquire, because a well-chosen set of labels can reach a given accuracy with fewer labeled points than labeling at random.

Learning with Limited Labeled Data in Python

Python is a popular language for machine learning, and several libraries support Active Learning. In this tutorial, we will use the Scikit-learn library to train a model and modAL, an Active Learning library built on top of Scikit-learn, to select informative data points to be labeled.

Import Libraries

We will start by importing the necessary libraries: Scikit-learn for generating data and training the model, NumPy for numerical computations, and modAL for selecting informative data points to be labeled.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from modAL.uncertainty import uncertainty_sampling

Generate Data

Next, we will generate some random data for training and testing the model.

# Generate random data for training and testing
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_classes=2, random_state=1)

In this example, we generate 1000 data points in two classes. Each point has 10 features, 5 of which are informative. The data is used for both training and testing.
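Before going further, it can help to confirm what the generated arrays look like. The following is a minimal sketch that repeats the call above and inspects the result; the variable name class_counts is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as in the tutorial
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=1)

print(X.shape)  # (1000, 10): one row per data point, one column per feature
print(y.shape)  # (1000,): one class label per data point

# Number of points in each class (index 0 is class 0, index 1 is class 1)
class_counts = np.bincount(y)
print(class_counts)
```

The classes come out roughly balanced, which matters later when a very small labeled set has to contain examples of both.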

Split Data

Next, we will split the data into a training set and a test set.

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In this example, 20% of the data (200 points) is held out as the test set, leaving 800 points for training.
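The sizes of the two sets can be checked directly. This is a small sketch that repeats the data generation and the split so it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=1)

# Same split as in the tutorial: 20% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(X_train.shape, X_test.shape)  # (800, 10) (200, 10)
print(y_train.shape, y_test.shape)  # (800,) (200,)
```

The test set is never shown to the query step; it is only used to score the model at the end.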

Train Initial Model

Next, we will train an initial logistic regression model on the labeled data.

# Train initial model
model = LogisticRegression()
model.fit(X_train[:10], y_train[:10])

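In this example, the first 10 training points serve as the initial labeled set. This only works if those 10 points contain both classes, because LogisticRegression raises an error when it is fitted on a single class.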
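One way to make sure the initial labeled set contains every class is to draw a few points per class instead of taking the first 10 rows. The sketch below is an assumption about how this could be done, not part of the original tutorial; pick_seed_indices is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def pick_seed_indices(y, n_per_class, rng):
    """Hypothetical helper: draw n_per_class row indices from each class in y."""
    per_class = [
        rng.choice(np.flatnonzero(y == label), size=n_per_class, replace=False)
        for label in np.unique(y)
    ]
    return np.concatenate(per_class)

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# 5 points per class gives a 10-point seed set with both classes present
rng = np.random.default_rng(1)
seed_idx = pick_seed_indices(y_train, n_per_class=5, rng=rng)

model = LogisticRegression()
model.fit(X_train[seed_idx], y_train[seed_idx])
print(np.bincount(y_train[seed_idx]))  # [5 5]
```

In a real project the seed labels would also come from the oracle; here the known labels stand in for it.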

Active Learning

Next, we will use Active Learning to select informative data points to be labeled by an oracle.

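# Active Learning
# The first 10 training points are the labeled seed set; the rest form the unlabeled pool
X_labeled, y_labeled = X_train[:10], y_train[:10]
X_pool = X_train[10:]

for i in range(10):
    # Select the pool instance the model is least certain about.
    # Depending on the modAL version, the strategy returns the indices alone
    # or a tuple whose first element is the indices, so handle both.
    result = uncertainty_sampling(model, X_pool)
    query_idx = np.atleast_1d(result[0] if isinstance(result, tuple) else result)

    # Ask the oracle (here, the user) for the label
    y_new = np.array([int(input(f"Enter label for pool instance {j}: ")) for j in query_idx])

    # Move the queried point from the pool to the labeled set
    X_labeled = np.concatenate((X_labeled, X_pool[query_idx]))
    y_labeled = np.concatenate((y_labeled, y_new))
    X_pool = np.delete(X_pool, query_idx, axis=0)

    # Retrain the model on all labeled data
    model.fit(X_labeled, y_labeled)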

In this example, we use the uncertainty_sampling function from modAL to select the data point whose predicted class the model is least sure of. We then ask the user, acting as the oracle, to label that point, add the newly labeled point to the training data, and retrain the model.
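To make the selection step concrete, here is a minimal sketch of least-confidence sampling written with NumPy and Scikit-learn only. It is an illustration of the idea, not modAL's implementation. Two assumptions are made so that it runs unattended: the known labels of the generated data stand in for the oracle, and the seed set takes 5 points per class. The names least_confident_index, X_pool and y_pool are introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def least_confident_index(model, X_pool):
    """Index of the pool row whose most likely class has the lowest probability."""
    proba = model.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)
    return int(np.argmax(uncertainty))

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Seed set (assumption): the first 5 training points of each class
seed_idx = np.concatenate([np.flatnonzero(y_train == c)[:5] for c in (0, 1)])
pool_mask = np.ones(len(X_train), dtype=bool)
pool_mask[seed_idx] = False

X_labeled, y_labeled = X_train[seed_idx], y_train[seed_idx]
X_pool, y_pool = X_train[pool_mask], y_train[pool_mask]

model = LogisticRegression()
model.fit(X_labeled, y_labeled)

for _ in range(10):
    i = least_confident_index(model, X_pool)
    # Simulated oracle: reveal the known label instead of asking a user
    X_labeled = np.vstack([X_labeled, X_pool[i:i + 1]])
    y_labeled = np.append(y_labeled, y_pool[i])
    # Remove the queried point so it cannot be selected again
    X_pool = np.delete(X_pool, i, axis=0)
    y_pool = np.delete(y_pool, i)
    model.fit(X_labeled, y_labeled)

print(len(y_labeled), len(y_pool))  # 20 780
```

Removing each queried point from the pool is what keeps the loop from asking for the same label twice.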

Test Model

Finally, we will test the model on the test data.

# Test model
score = model.score(X_test, y_test)
print(f"Model accuracy: {score}")

In this example, the score method returns the mean accuracy of the model's predictions on the test data, which we then print.
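A single accuracy figure does not show whether choosing points by uncertainty helped. One possible check, sketched below, is to record the test accuracy after every round and to repeat the run with randomly chosen points as a baseline. This is an assumption about how such a comparison could be set up; run_rounds is a hypothetical helper, the known labels again stand in for the oracle, and no particular result is claimed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def run_rounds(X_train, y_train, X_test, y_test, strategy, n_rounds=10, seed=1):
    """Return test accuracy after the seed fit and after each query round."""
    rng = np.random.default_rng(seed)
    # Seed set (assumption): first 5 training points of each class
    seed_idx = np.concatenate([np.flatnonzero(y_train == c)[:5] for c in np.unique(y_train)])
    in_pool = np.ones(len(X_train), dtype=bool)
    in_pool[seed_idx] = False
    labeled = list(seed_idx)

    model = LogisticRegression()
    model.fit(X_train[labeled], y_train[labeled])
    history = [model.score(X_test, y_test)]

    for _ in range(n_rounds):
        pool_idx = np.flatnonzero(in_pool)
        if strategy == "uncertainty":
            # Least confidence: lowest probability for the most likely class
            proba = model.predict_proba(X_train[pool_idx])
            pick = pool_idx[np.argmin(proba.max(axis=1))]
        else:
            # Baseline: pick any unlabeled point at random
            pick = rng.choice(pool_idx)
        in_pool[pick] = False
        labeled.append(pick)  # simulated oracle: the known label is revealed
        model.fit(X_train[labeled], y_train[labeled])
        history.append(model.score(X_test, y_test))
    return history

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

hist_uncertainty = run_rounds(X_train, y_train, X_test, y_test, "uncertainty")
hist_random = run_rounds(X_train, y_train, X_test, y_test, "random")

for round_no, (u, r) in enumerate(zip(hist_uncertainty, hist_random)):
    print(f"round {round_no:2d}  uncertainty={u:.3f}  random={r:.3f}")
```

With only 10 rounds the two curves can be close or noisy, so repeating the comparison over several random seeds gives a fairer picture.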

In this tutorial, we covered the basics of Active Learning and how to use it in Python to train machine learning models with limited labeled data: train on a small labeled set, query the point the model is least certain about, have an oracle label it, and retrain. I hope you found this tutorial useful in understanding Active Learning in Python.
