Quickstart and introduction by example
=======================================

As discussed in the :ref:`whatisal` section, the **PyRelationAL** package decomposes the active learning workflow into five main components: 1) a data manager, 2) a model manager, 3) an acquisition strategy built around an informativeness measure, 4) an oracle and 5) a pipeline.

In this section, we work through an example to illustrate how to instantiate a data manager, a model manager, an acquisition strategy and an oracle, and how to combine them in a pipeline.

Data Manager
-------------

The data manager (:py:class:`pyrelational.data_managers.data_manager.DataManager`) wraps around a PyTorch Dataset and handles dataloader instantiation as well as tracking and updating of labelled and unlabelled sample pools.

In this example, we consider the digit dataset from scikit-learn. We first create a PyTorch dataset for it. In order to use this dataset within the DataManager, we need to be aware of a few points:

* The dataset must contain an attribute which stores the labels. The name of this attribute can then be passed to the ``label_attr`` input of the DataManager. By default this is specified as "y". Some datasets, such as :py:class:`torch.utils.data.TensorDataset` from PyTorch, do not have this property and so are not currently supported.
* The dataset :py:meth:`__getitem__` method must return a tuple of tensors. Most strategies and model managers within the package also assume that the features are contained within a single tensor, which is the first item returned in the tuple.

.. code-block:: python

    import torch
    from torch.utils.data import Dataset
    from sklearn.datasets import load_digits

    class DigitDataset(Dataset):
        """ Sklearn digit dataset
        """
        def __init__(self):
            super(DigitDataset, self).__init__()
            sk_x, sk_y = load_digits(return_X_y=True)
            self.x = torch.FloatTensor(sk_x)  # data
            self.y = torch.LongTensor(sk_y)  # target

        def __len__(self):
            return self.x.shape[0]

        def __getitem__(self, idx):
            return self.x[idx], self.y[idx]

We then use this dataset object to instantiate a data manager, providing it with train, validation, and test sets. Note that the train set is further split into labelled and unlabelled pools. The former corresponds to the samples whose labels are available at the start, and the latter to the samples whose labels are hidden from the model and can be queried at each iteration by the active learning strategy.

.. code-block:: python

    from pyrelational.data_managers.data_manager import DataManager

    def get_digit_data_manager(labelled_size=None):
        ds = DigitDataset()
        # 1400 + 200 + 197 = 1797, the size of the digit dataset
        train_ds, valid_ds, test_ds = torch.utils.data.random_split(ds, [1400, 200, 197])
        train_indices = train_ds.indices
        valid_indices = valid_ds.indices
        test_indices = test_ds.indices
        # labelled_size is the number of train samples labelled at the start;
        # when it is None, no labelled indices are passed to the DataManager
        labelled_indices = (
            train_indices[:labelled_size] if labelled_size is not None else None
        )

        return DataManager(
            ds,
            train_indices=train_indices,
            validation_indices=valid_indices,
            test_indices=test_indices,
            labelled_indices=labelled_indices,
            loader_batch_size=10,
        )

See :ref:`using own data` for more details on how to interface datasets with the **PyRelationAL** data manager.

Model Manager
--------------

Now that our data manager is ready, we demonstrate how to define a machine learning model to interact with it. A **PyRelationAL** model manager wraps a user-defined ML model (e.g. PyTorch Module, PyTorch Lightning Module, or scikit-learn estimator) and handles instantiation, training, testing, as well as uncertainty quantification (e.g. ensembling, MC-dropout).
It is also compatible with ML models that directly estimate their uncertainties, such as Gaussian processes (see the demo in the source repository).

Continuing with our example, we define a PyTorch Lightning module to perform digit classification on the dataset defined in the previous section.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from sklearn.metrics import accuracy_score
    from lightning.pytorch import LightningModule

    class DigitClassifier(LightningModule):
        """Custom module for a simple fully connected classifier"""

        def __init__(self, dropout_rate=0):
            super(DigitClassifier, self).__init__()
            self.layer_1 = nn.Linear(8*8, 16)
            self.layer_2 = nn.Linear(16, 32)
            self.dropout = nn.Dropout(dropout_rate)
            self.layer_3 = nn.Linear(32, 10)

        def forward(self, x):
            x = self.layer_1(x)
            x = F.relu(x)
            x = self.layer_2(x)
            x = F.relu(x)
            x = self.dropout(x)
            x = self.layer_3(x)
            # log-probabilities over the 10 digit classes
            x = F.log_softmax(x, dim=1)
            return x

        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = F.nll_loss(logits, y)
            return loss

        def validation_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = F.nll_loss(logits, y)
            self.log("loss", loss.item())
            return loss

        def test_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = F.nll_loss(logits, y)
            self.log("test_loss", loss)

            # compute accuracy
            _, y_pred = torch.max(logits.data, 1)
            accuracy = accuracy_score(y, y_pred)
            self.log("accuracy", accuracy)

        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
            return optimizer

Once defined, the Lightning model can be wrapped into a **PyRelationAL** model manager to interact with the active learning strategies. Note that at the moment, **PyRelationAL** defines MCDropout and Ensemble wrappers to approximate the Bayesian uncertainty of arbitrary models. You can find the existing models and templates in :mod:`pyrelational.model_managers`.
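
To give an intuition for what such wrappers approximate, the sketch below averages several stochastic predictions for one sample (ensemble members or MC-dropout passes) and uses the entropy of the averaged distribution as an uncertainty score. It is a plain-Python illustration under our own assumptions, not **PyRelationAL**'s implementation: the function names ``mean_prediction`` and ``predictive_entropy`` and the example probabilities are hypothetical.

.. code-block:: python

    import math


    def mean_prediction(member_probs):
        """Average class probabilities over members (hypothetical helper)."""
        n_members = len(member_probs)
        n_classes = len(member_probs[0])
        return [
            sum(probs[c] for probs in member_probs) / n_members
            for c in range(n_classes)
        ]


    def predictive_entropy(probs):
        """Shannon entropy (in nats) of a categorical distribution."""
        return -sum(p * math.log(p) for p in probs if p > 0.0)


    # Three made-up members that agree on class 0 ...
    agreeing = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]
    # ... and three that disagree about the same sample.
    disagreeing = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

    confident = predictive_entropy(mean_prediction(agreeing))
    uncertain = predictive_entropy(mean_prediction(disagreeing))

When the members disagree, the averaged distribution is flatter and its entropy is higher, which is the kind of signal an uncertainty-based strategy can use to rank unlabelled samples.
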
The code snippet below demonstrates how to integrate the ``DigitClassifier`` module above with either the MC-dropout or the ensembling **PyRelationAL** model manager.

.. code-block:: python

    # Option 1: MC-dropout model manager
    from pyrelational.model_managers.mcdropout_model import LightningMCDropoutModelManager
    model_manager = LightningMCDropoutModelManager(
        DigitClassifier,
        {"dropout_rate":0.3},
        {"epochs": 4},
        n_estimators=25,
        eval_dropout_prob=0.5,
    )

    # Option 2: ensemble model manager (replaces option 1 if both are run)
    from pyrelational.model_managers.ensemble_model_manager import LightningEnsembleModelManager
    model_manager = LightningEnsembleModelManager(
        DigitClassifier,
        {"dropout_rate":0.3},
        {"epochs": 4},
        n_estimators=25,
    )

See :ref:`build your own model` for more examples on how to create custom models.

Strategy
---------

We now need to choose an informativeness measure to define our strategy. The informativeness measure serves as the basis for selecting the query sent to the oracle for labelling. We define various strategies in :mod:`pyrelational.strategies` for classification, regression, and task-agnostic scenarios, based on the different measures of informativeness defined in :mod:`pyrelational.informativeness`.

For instance, here we choose a least confidence strategy for our digit classification problem.

.. code-block:: python

    from pyrelational.strategies.classification import (
        LeastConfidenceStrategy,
    )
    strategy = LeastConfidenceStrategy()

See :ref:`using own strategy` for more examples.

Oracle
-------

The oracle (extending :py:class:`pyrelational.oracles.abstract_oracle.Oracle`) provides annotations given input observations from the dataset. Users may create custom oracles to utilize bespoke or external labelling tools. We provide a BenchmarkOracle (:py:class:`pyrelational.oracles.benchmark_oracle.BenchmarkOracle`) for evaluating strategies in R&D settings, which assumes that all the data points in the dataset have been annotated prior to the AL workflow.

.. code-block:: python

    from pyrelational.oracles.benchmark_oracle import (
        BenchmarkOracle,
    )
    oracle = BenchmarkOracle()

Pipeline
---------

After setting up the required components (strategy, data manager, model manager, oracle), we only need to instantiate a pipeline (:py:class:`pyrelational.pipeline.pipeline.Pipeline`) to facilitate communication between the components and run the active learning workflow.

Here we perform a full active learning run, which labels 250 data points at each iteration until all points in the dataset have been labelled. At the end of the run, we obtain metrics for the performance of the method, e.g. the performance of the model at each iteration.

.. code-block:: python

    from pyrelational.pipeline.pipeline import (
        Pipeline,
    )
    data_manager = get_digit_data_manager()
    pipeline = Pipeline(data_manager=data_manager, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=250)
    performance_history = pipeline.summary()

Comparing performances of different strategies
-----------------------------------------------

We can now compare the performances of different strategies on our digit classification problem.

.. code-block:: python

    from pyrelational.data_managers.data_manager import DataManager
    from pyrelational.strategies.classification import (
        LeastConfidenceStrategy,
        MarginalConfidenceStrategy,
        RatioConfidenceStrategy,
        EntropyClassificationStrategy,
    )
    from pyrelational.strategies.task_agnostic import RandomAcquisitionStrategy
    from pyrelational.pipeline.pipeline import Pipeline
    from pyrelational.oracles.benchmark_oracle import BenchmarkOracle

    # performance summary of each strategy, keyed by strategy name
    query = dict()
    num_annotate = 50

    # Each strategy gets a fresh data manager so that every run starts
    # from its own labelled/unlabelled split.

    # Least confidence strategy
    dm = get_digit_data_manager()
    strategy = LeastConfidenceStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['LeastConfidence'] = pipeline.summary()

    # Marginal confidence
    dm = get_digit_data_manager()
    strategy = MarginalConfidenceStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['MarginalConfidence'] = pipeline.summary()

    # Ratio confidence
    dm = get_digit_data_manager()
    strategy = RatioConfidenceStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['RatioConfidence'] = pipeline.summary()

    # Entropy classification
    dm = get_digit_data_manager()
    strategy = EntropyClassificationStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['EntropyClassification'] = pipeline.summary()

    # Random acquisition
    dm = get_digit_data_manager()
    strategy = RandomAcquisitionStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['RandomAcquisition'] = pipeline.summary()

This gives the results in the plot below, where we observe some improvement over a random strategy.

.. image:: performance_comparison.png
   :width: 100%
   :alt: Comparison of strategies performances on digit classification.
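
The uncertainty-based strategies compared above differ mainly in how they turn predicted class probabilities into an informativeness score. The sketch below shows common textbook forms of the four scores and a simple top-k selection. It is an illustration only: the function names and the example pool are hypothetical, and the implementations in :mod:`pyrelational.informativeness` may differ in details such as normalisation.

.. code-block:: python

    import math


    def least_confidence(probs):
        """1 minus the probability of the most likely class."""
        return 1.0 - max(probs)


    def margin_confidence(probs):
        """1 minus the gap between the two most likely classes."""
        top, second = sorted(probs, reverse=True)[:2]
        return 1.0 - (top - second)


    def ratio_confidence(probs):
        """Ratio of the second most likely class to the most likely one."""
        top, second = sorted(probs, reverse=True)[:2]
        return second / top


    def entropy(probs):
        """Shannon entropy (in nats) of the predicted distribution."""
        return -sum(p * math.log(p) for p in probs if p > 0.0)


    def select_queries(pool_probs, score_fn, num_annotate):
        """Indices of the num_annotate highest-scoring unlabelled samples."""
        ranked = sorted(
            range(len(pool_probs)),
            key=lambda i: score_fn(pool_probs[i]),
            reverse=True,
        )
        return ranked[:num_annotate]


    # Made-up predicted probabilities for three unlabelled samples
    pool = [
        [0.98, 0.01, 0.01],  # confident prediction
        [0.40, 0.35, 0.25],  # ambiguous prediction
        [0.70, 0.20, 0.10],  # in between
    ]
    queries = select_queries(pool, least_confidence, num_annotate=2)

On this small pool all four scores rank the ambiguous sample first and the confident one last; on real data the rankings can diverge, which is what a comparison like the one above measures.
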