Quickstart and introduction by example

As discussed in the What is Active Learning? section, the PyRelationAL package decomposes the active learning workflow into five main components: 1) a data manager, 2) a model manager, 3) an acquisition strategy built around an informativeness measure, 4) an oracle, and 5) a pipeline. In this section, we work through an example to illustrate how to instantiate a data manager, a model manager, an acquisition strategy, and an oracle, and how to combine them in a pipeline.

Data Manager

The data manager (pyrelational.data_managers.data_manager.DataManager) wraps around a PyTorch Dataset and handles dataloader instantiation as well as tracking and updating of labelled and unlabelled sample pools. In this example, we consider the digits dataset from scikit-learn.

We first create a PyTorch dataset for it. In order to use this dataset within the DataManager, we need to be aware of a few points:

  • The dataset must contain an attribute which stores the labels. The name of this attribute can then be passed to the label_attr input of the DataManager; by default it is “y”. Some datasets, such as torch.utils.data.TensorDataset from PyTorch, do not have such an attribute and so are not currently supported.

  • The dataset __getitem__() method must return a tuple of tensors. Most strategies and model managers within the package also assume that the features are contained within a single tensor, which is the first item that is returned in the tuple.

import torch
from torch.utils.data import Dataset
from sklearn.datasets import load_digits

class DigitDataset(Dataset):
    """ Sklearn digit dataset
    """
    def __init__(self):
        super(DigitDataset, self).__init__()
        sk_x, sk_y = load_digits(return_X_y=True)
        self.x = torch.FloatTensor(sk_x) # data
        self.y = torch.LongTensor(sk_y) # target

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]
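
As a quick sanity check that the dataset satisfies the tuple-of-tensors contract described above, we can inspect a sample (the scikit-learn digits dataset contains 1797 samples of 64 features each):

ds = DigitDataset()
x, y = ds[0]
print(len(ds))    # 1797 samples
print(x.shape)    # torch.Size([64]) -- all features in a single tensor
print(y.shape)    # torch.Size([]) -- a scalar class label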

We then use this dataset object to instantiate a data manager, providing it with train, validation, and test sets. Note that the train set is further split into labelled and unlabelled pools. The former corresponds to the samples whose labels are available at the start, and the latter to the set of samples whose labels are hidden from the model and that can be queried at each iteration by the active learning strategy.

from pyrelational.data_managers.data_manager import DataManager

def get_digit_data_manager(labelled_size=None):
    ds = DigitDataset()
    train_ds, valid_ds, test_ds = torch.utils.data.random_split(ds, [1400, 200, 197])
    train_indices = train_ds.indices
    valid_indices = valid_ds.indices
    test_indices = test_ds.indices
    labelled_indices = (
        train_indices[:labelled_size] if labelled_size is not None else None
    )

    return DataManager(
                        ds,
                        train_indices=train_indices,
                        validation_indices=valid_indices,
                        test_indices=test_indices,
                        labelled_indices=labelled_indices,
                        loader_batch_size=10,
                    )
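
For example, to start the active learning process with only the first 50 training samples labelled, leaving the rest of the training set in the unlabelled pool, we can call:

data_manager = get_digit_data_manager(labelled_size=50)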

See Using your own datasets with PyRelationAL for more details on how to interface datasets with the PyRelationAL data manager.

Model Manager

Now that our data manager is ready, we demonstrate how to define a machine learning model to interact with it. A PyRelationAL model manager wraps a user-defined ML model (e.g. a PyTorch Module, a PyTorch Lightning module, or a scikit-learn estimator) and handles instantiation, training, and testing, as well as uncertainty quantification (e.g. ensembling, MC-dropout). It is also compatible with ML models that directly estimate their uncertainties, such as Gaussian processes (see the demos in the source repository). Continuing with our example, we define a PyTorch Lightning module to perform digit classification on the dataset defined in the previous section.

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
from lightning.pytorch import LightningModule


class DigitClassifier(LightningModule):
    """Custom module for a simple convnet classifier"""

    def __init__(self, dropout_rate=0):
        super(DigitClassifier, self).__init__()
        self.layer_1 = nn.Linear(8*8, 16)
        self.layer_2 = nn.Linear(16, 32)
        self.dropout = nn.Dropout(dropout_rate)
        self.layer_3 = nn.Linear(32, 10)

    def forward(self, x):
        x = self.layer_1(x)
        x = F.relu(x)
        x = self.layer_2(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.layer_3(x)
        x = F.log_softmax(x, dim=1)
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("loss", loss.item())
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("test_loss", loss)

        # compute accuracy
        y_pred = torch.argmax(logits, dim=1)
        accuracy = accuracy_score(y, y_pred)
        self.log("accuracy", accuracy)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

Once defined, the Lightning model can be wrapped into a PyRelationAL model manager to interact with the active learning strategies. Note that, at the moment, PyRelationAL provides MC-dropout and ensemble wrappers to approximate Bayesian uncertainty for arbitrary models. You can find the existing model managers and templates in pyrelational.model_managers. The code snippet below demonstrates how to integrate the model above with either the MC-dropout or the ensembling model manager.

from pyrelational.model_managers import LightningMCDropoutModelManager
model_manager = LightningMCDropoutModelManager(
            DigitClassifier,
            {"dropout_rate":0.3},
            {"epochs": 4},
            n_estimators=25,
            eval_dropout_prob=0.5,
        )

from pyrelational.model_managers import LightningEnsembleModelManager
model_manager = LightningEnsembleModelManager(
            DigitClassifier,
            {"dropout_rate":0.3},
            {"epochs": 4},
            n_estimators=25,
        )
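
To build some intuition about what the MC-dropout wrapper does, the sketch below shows the general idea in plain PyTorch, independently of the PyRelationAL API (the library's actual implementation may differ): dropout is kept active at inference time and several stochastic forward passes are averaged to obtain a predictive distribution whose spread reflects model uncertainty.

import torch

# Illustrative MC-dropout sketch -- not the PyRelationAL implementation
model = DigitClassifier(dropout_rate=0.3)
model.train()  # keep the dropout layers active so that forward passes are stochastic

ds = DigitDataset()
x, _ = ds[0]
x = x.unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    # average class probabilities over several stochastic forward passes
    samples = torch.stack([model(x).exp() for _ in range(25)])  # the model returns log-probabilities
    mean_probs = samples.mean(dim=0)
print(mean_probs.shape)  # torch.Size([1, 10])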

See Defining learning models compatible with PyRelationAL for more examples on how to create custom models.

Strategy

We now need to choose an informativeness measure to define our strategy. The informativeness measure serves as the basis for selecting the query sent to the oracle for labelling. We provide various strategies in pyrelational.strategies for classification, regression, and task-agnostic scenarios, based on the different measures of informativeness defined in pyrelational.informativeness. For instance, here we use a least-confidence strategy for our digit classification problem.

from pyrelational.strategies.classification import LeastConfidenceStrategy
strategy = LeastConfidenceStrategy()
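
The least-confidence measure scores an unlabelled sample as more informative the lower the model's confidence in its most likely class. As a rough illustration in plain PyTorch (not the pyrelational.informativeness implementation), the score can be computed from predicted class probabilities as one minus the maximum probability:

import torch

# Illustrative least-confidence score: 1 - max class probability per sample
probs = torch.tensor([[0.90, 0.05, 0.05],   # confident prediction -> low score
                      [0.40, 0.35, 0.25]])  # uncertain prediction -> high score
least_confidence = 1.0 - probs.max(dim=1).values
print(least_confidence)  # tensor([0.1000, 0.6000])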

See Creating your own active learning strategies with PyRelationAL for more examples.

Oracle

The oracle (extending pyrelational.oracles.abstract_oracle.Oracle) provides annotations given input observations from the dataset. Users may create custom oracles to utilize bespoke/external labelling tools. We provide a BenchmarkOracle (pyrelational.oracles.benchmark_oracle.BenchmarkOracle) for evaluating strategies in R&D settings, which assumes that all the data points in the dataset have been annotated prior to the AL workflow.

from pyrelational.oracles.benchmark_oracle import BenchmarkOracle
oracle = BenchmarkOracle()
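
As an illustration of how a bespoke labelling tool could be plugged in, the sketch below subclasses the abstract oracle and asks a human annotator for the label on the command line. It assumes that the abstract class expects a query_target_value(data_manager, idx) method returning the annotation for a single sample; check pyrelational.oracles.abstract_oracle.Oracle for the exact interface to implement.

from pyrelational.oracles.abstract_oracle import Oracle

class CommandLineOracle(Oracle):
    """Toy oracle that asks a human to type the label for each queried sample.

    This assumes the abstract Oracle requires a query_target_value method;
    adapt it to the actual abstract interface if it differs.
    """

    def query_target_value(self, data_manager, idx):
        # in a real tool one would display the sample to the annotator here
        label = input(f"Enter the digit label for sample {idx}: ")
        return int(label)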

Pipeline

After setting up the various components required (strategy, data manager, model manager, oracle), we only need to instantiate a pipeline (pyrelational.pipeline.pipeline.Pipeline) to facilitate communication between the components and run the active learning workflow. Here we perform a full active learning run, labelling 250 data points at each iteration until all points in the dataset have been labelled. At the end of the run, we obtain metrics for the performance of the method, e.g. the performance of the model at each iteration.

from pyrelational.pipeline.pipeline import Pipeline
data_manager = get_digit_data_manager()
pipeline = Pipeline(data_manager=data_manager, model_manager=model_manager, strategy=strategy, oracle=oracle)
pipeline.compute_theoretical_performance()
pipeline.run(num_annotate=250)
performance_history = pipeline.summary()

Comparing the performance of different strategies

We can now compare the performance of different strategies on our digit classification problem.

from pyrelational.data_managers.data_manager import DataManager
from pyrelational.strategies.classification import (
    LeastConfidenceStrategy,
    MarginalConfidenceStrategy,
    RatioConfidenceStrategy,
    EntropyClassificationStrategy,
)
from pyrelational.strategies.task_agnostic import RandomAcquisitionStrategy
from pyrelational.pipeline.pipeline import Pipeline
from pyrelational.oracles.benchmark_oracle import BenchmarkOracle

results = dict()
num_annotate = 50

# Least confidence strategy
dm = get_digit_data_manager()
strategy = LeastConfidenceStrategy()
oracle = BenchmarkOracle()
pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
pipeline.compute_theoretical_performance()
pipeline.run(num_annotate=num_annotate)
results['LeastConfidence'] = pipeline.summary()

# Marginal confidence strategy
dm = get_digit_data_manager()
strategy = MarginalConfidenceStrategy()
oracle = BenchmarkOracle()
pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
pipeline.compute_theoretical_performance()
pipeline.run(num_annotate=num_annotate)
results['MarginalConfidence'] = pipeline.summary()

# Ratio confidence strategy
dm = get_digit_data_manager()
strategy = RatioConfidenceStrategy()
oracle = BenchmarkOracle()
pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
pipeline.compute_theoretical_performance()
pipeline.run(num_annotate=num_annotate)
results['RatioConfidence'] = pipeline.summary()

# Entropy classification strategy
dm = get_digit_data_manager()
strategy = EntropyClassificationStrategy()
oracle = BenchmarkOracle()
pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
pipeline.compute_theoretical_performance()
pipeline.run(num_annotate=num_annotate)
results['EntropyClassification'] = pipeline.summary()


# Random acquisition strategy
dm = get_digit_data_manager()
strategy = RandomAcquisitionStrategy()
oracle = BenchmarkOracle()
pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
pipeline.compute_theoretical_performance()
pipeline.run(num_annotate=num_annotate)
results['RandomAcquisition'] = pipeline.summary()

This gives the results in the plot below, where we observe some improvement over the random strategy.
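
To visualise the comparison, we can plot a test metric from each summary against the active learning iteration. The snippet below is a minimal sketch that assumes pipeline.summary() returns a pandas DataFrame with one row per iteration and a column named after the logged test metric ("accuracy" in our DigitClassifier); inspect the returned summary to confirm the exact column names.

import matplotlib.pyplot as plt

metric = "accuracy"  # assumed column name, taken from the metric logged in test_step
for name, summary in results.items():
    plt.plot(summary.index, summary[metric], label=name)
plt.xlabel("Active learning iteration")
plt.ylabel("Test accuracy")
plt.legend()
plt.show()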

Comparison of the performance of different strategies on digit classification.