Quickstart and introduction by example
=======================================

As discussed in the :ref:`whatisal` section, the **PyRelationAL** package decomposes the active learning workflow into five
main components: 1) a data manager, 2) a model manager, 3) an acquisition strategy built around an informativeness measure, 4) an oracle and 5) a pipeline.
In this section, we work through an example to illustrate how to instantiate and combine a data manager, a model manager, an acquisition strategy and an oracle.

Data Manager
-------------

The data manager (:py:class:`pyrelational.data_managers.data_manager.DataManager`) wraps around a PyTorch
Dataset and handles dataloader instantiation as well as tracking and updating of labelled and unlabelled sample pools.
In this example, we consider the `digit dataset <https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html>`_
from scikit-learn.

We first create a PyTorch dataset for it. To use this dataset within the DataManager, we need to be aware of a few points:

* The dataset must contain an attribute which stores the labels. The name of this attribute can then be passed to the ``label_attr`` argument of the DataManager.
  By default this is ``"y"``. Some datasets, such as :py:class:`torch.utils.data.TensorDataset` from PyTorch, do not have such an attribute and so
  are not currently supported.
* The dataset :py:meth:`__getitem__` method must return a tuple of tensors. Most strategies and model managers within the package also assume that the features are contained within a single tensor,
  which is the first item that is returned in the tuple.


.. code-block:: python

    import torch
    from torch.utils.data import Dataset
    from sklearn.datasets import load_digits

    class DigitDataset(Dataset):
        """ Sklearn digit dataset
        """
        def __init__(self):
            super(DigitDataset, self).__init__()
            sk_x, sk_y = load_digits(return_X_y=True)
            self.x = torch.FloatTensor(sk_x) # data
            self.y = torch.LongTensor(sk_y) # target

        def __len__(self):
            return self.x.shape[0]

        def __getitem__(self, idx):
            return self.x[idx], self.y[idx]

We then use this dataset object to instantiate a data manager, providing it with train, validation, and test sets.
Note that the train set is further split into labelled and unlabelled pools. The former corresponds to the samples whose labels
are available at the start, and the latter to the set of samples whose labels are hidden from the model and that can be queried
at each iteration by the active learning strategy.

.. code-block:: python

    from pyrelational.data_managers.data_manager import DataManager

    def get_digit_data_manager(labelled_size=None):
        ds = DigitDataset()
        # The digits dataset has 1797 samples: 1400 train, 200 validation, 197 test
        train_ds, valid_ds, test_ds = torch.utils.data.random_split(ds, [1400, 200, 197])
        train_indices = train_ds.indices
        valid_indices = valid_ds.indices
        test_indices = test_ds.indices
        # If labelled_size is given, the first labelled_size training samples
        # start out labelled; otherwise labelled_indices=None is passed on.
        labelled_indices = (
            train_indices[:labelled_size] if labelled_size is not None else None
        )

        return DataManager(
            ds,
            train_indices=train_indices,
            validation_indices=valid_indices,
            test_indices=test_indices,
            labelled_indices=labelled_indices,
            loader_batch_size=10,
        )

See :ref:`using own data` for more details on how to interface datasets with the **PyRelationAL** data manager.
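
The split of the train set into labelled and unlabelled pools can be pictured with plain Python sets.
The sketch below only illustrates the bookkeeping idea and is not PyRelationAL's implementation; the ``IndexPools`` class and its method names are hypothetical.

.. code-block:: python

    class IndexPools:
        """Toy tracker for labelled and unlabelled training indices (hypothetical helper)."""

        def __init__(self, train_indices, labelled_indices):
            self.labelled = set(labelled_indices)
            # Every training index that is not labelled starts in the unlabelled pool
            self.unlabelled = set(train_indices) - self.labelled

        def mark_labelled(self, indices):
            """Move freshly annotated indices from the unlabelled to the labelled pool."""
            indices = set(indices)
            missing = indices - self.unlabelled
            if missing:
                raise ValueError(f"indices not in the unlabelled pool: {sorted(missing)}")
            self.unlabelled -= indices
            self.labelled |= indices


    pools = IndexPools(train_indices=range(10), labelled_indices=[0, 1, 2])
    pools.mark_labelled([5, 7])  # e.g. labels returned after one query
    print(sorted(pools.labelled))    # [0, 1, 2, 5, 7]
    print(sorted(pools.unlabelled))  # [3, 4, 6, 8, 9]

Validation and test indices stay outside both pools, which is why only the train indices are passed to the toy tracker.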

Model Manager
--------------

Now that our data manager is ready, we demonstrate how to define a machine learning model to interact with it.
A **PyRelationAL** model manager wraps a user-defined ML model (e.g. PyTorch Module, PyTorch Lightning Module, or scikit-learn estimator) and
handles instantiation, training, testing, as well as uncertainty quantification (e.g. ensembling, MC-dropout).
It is also compatible with ML models that directly estimate their uncertainties such as Gaussian Processes
(see `demo <https://github.com/RelationRx/pyrelational/examples/demo/model_gaussianprocesses.py>`_ on source repository).
Continuing with our example, we define a PyTorch Lightning module to perform digit classification on the dataset defined
in the previous section.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from sklearn.metrics import accuracy_score
    from lightning.pytorch import LightningModule


    class DigitClassifier(LightningModule):
        """Custom module for a simple fully connected classifier"""

        def __init__(self, dropout_rate=0):
            super(DigitClassifier, self).__init__()
            self.layer_1 = nn.Linear(8*8, 16)
            self.layer_2 = nn.Linear(16, 32)
            self.dropout = nn.Dropout(dropout_rate)
            self.layer_3 = nn.Linear(32, 10)

        def forward(self, x):
            x = self.layer_1(x)
            x = F.relu(x)
            x = self.layer_2(x)
            x = F.relu(x)
            x = self.dropout(x)
            x = self.layer_3(x)
            x = F.log_softmax(x, dim=1)
            return x

        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = F.nll_loss(logits, y)
            return loss

        def validation_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = F.nll_loss(logits, y)
            self.log("loss", loss.item())
            return loss

        def test_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = F.nll_loss(logits, y)
            self.log("test_loss", loss)

            # compute accuracy
            _, y_pred = torch.max(logits.data, 1)
            accuracy = accuracy_score(y, y_pred)
            self.log("accuracy", accuracy)

        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
            return optimizer

Once defined, the lightning model can then be wrapped into a **PyRelationAL** model manager to interact with the active learning strategies.
Note that at the moment, **PyRelationAL** defines MC-dropout and ensemble wrappers to approximate the Bayesian uncertainty of arbitrary models.
You can find the existing models and templates in :mod:`pyrelational.model_managers`. The code snippet below
demonstrates how to integrate the model above with either the MC-dropout or the ensembling **PyRelationAL** model manager.

.. code-block:: python

    # Option 1: MC-dropout model manager
    from pyrelational.model_managers.mcdropout_model import LightningMCDropoutModelManager
    model_manager = LightningMCDropoutModelManager(
        DigitClassifier,
        {"dropout_rate": 0.3},
        {"epochs": 4},
        n_estimators=25,
        eval_dropout_prob=0.5,
    )

    # Option 2: ensemble model manager (use instead of option 1;
    # running both lines keeps only this second model_manager)
    from pyrelational.model_managers.ensemble_model_manager import LightningEnsembleModelManager
    model_manager = LightningEnsembleModelManager(
        DigitClassifier,
        {"dropout_rate": 0.3},
        {"epochs": 4},
        n_estimators=25,
    )

See :ref:`build your own model` for more examples on how to create custom models.
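
Both wrappers rest on the same idea: collect several predictions for each input, from repeated stochastic forward passes (MC-dropout) or from independently trained models (ensembling), and summarise them.
A minimal, framework-free sketch of the averaging step follows. It uses made-up probabilities and a hypothetical ``mean_prediction`` helper, and does not reproduce PyRelationAL's internals.

.. code-block:: python

    def mean_prediction(member_probs):
        """Average class probabilities over estimators for a single sample."""
        n_estimators = len(member_probs)
        n_classes = len(member_probs[0])
        return [
            sum(probs[c] for probs in member_probs) / n_estimators
            for c in range(n_classes)
        ]


    # Three estimators, three classes (illustrative numbers only)
    member_probs = [
        [0.7, 0.2, 0.1],
        [0.5, 0.3, 0.2],
        [0.6, 0.1, 0.3],
    ]
    mean_probs = mean_prediction(member_probs)
    print([round(p, 2) for p in mean_probs])  # [0.6, 0.2, 0.2]

The spread of the individual predictions around this mean is what uncertainty-based strategies exploit.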

Strategy
---------

We now need to choose an informativeness measure to define our strategy. The informativeness measure serves as the basis for selecting the query sent to the
oracle for labelling. We define various strategies in :mod:`pyrelational.strategies` for classification, regression, and task-agnostic scenarios based on
the different measures of informativeness defined in :mod:`pyrelational.informativeness`.
For instance, here we choose to use a least confidence strategy for our digit classification problem:
.. code-block:: python

    from pyrelational.strategies.classification import (
        LeastConfidenceStrategy,
    )
    strategy = LeastConfidenceStrategy()

See :ref:`using own strategy` for more examples.
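
To make the notion of an informativeness measure concrete, least confidence is commonly defined as one minus the highest predicted class probability, so the samples the model is least sure about score highest.
The sketch below uses plain Python, made-up probabilities, and hypothetical helper names; the implementation in :mod:`pyrelational.informativeness` may differ in detail.

.. code-block:: python

    def least_confidence(probs):
        """Score one sample: 1 - probability of the most likely class."""
        return 1.0 - max(probs)


    def select_queries(prob_by_index, num_annotate):
        """Return the indices of the num_annotate highest-scoring samples."""
        ranked = sorted(
            prob_by_index,
            key=lambda idx: least_confidence(prob_by_index[idx]),
            reverse=True,
        )
        return ranked[:num_annotate]


    # Predicted class probabilities for three unlabelled samples (illustrative)
    predictions = {
        10: [0.90, 0.05, 0.05],  # confident, score 0.10
        11: [0.40, 0.35, 0.25],  # uncertain, score 0.60
        12: [0.60, 0.30, 0.10],  # in between, score 0.40
    }
    print(select_queries(predictions, num_annotate=2))  # [11, 12]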

Oracle
-------
The oracle (extending :py:class:`pyrelational.oracles.abstract_oracle.Oracle`) provides annotations given input observations from the dataset.
Users may create custom oracles to utilize bespoke or external labelling tools. We provide a BenchmarkOracle (:py:class:`pyrelational.oracles.benchmark_oracle.BenchmarkOracle`) for evaluating strategies in R&D settings,
which assumes that all the data points in the dataset have been annotated prior to the AL workflow.

.. code-block:: python

    from pyrelational.oracles.benchmark_oracle import (
        BenchmarkOracle,
    )
    oracle = BenchmarkOracle()
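
The benchmark setting can be pictured as a simple lookup: the labels already exist and are merely revealed on request.
The toy class below is a conceptual sketch with hypothetical names; it does not extend or mirror the actual ``Oracle`` interface.

.. code-block:: python

    class LookupOracle:
        """Toy oracle that answers queries from labels known in advance (hypothetical)."""

        def __init__(self, all_labels):
            # Mapping from sample index to its pre-existing annotation
            self._all_labels = dict(all_labels)

        def query(self, indices):
            """Reveal the stored labels for the requested indices."""
            return {idx: self._all_labels[idx] for idx in indices}


    oracle_sketch = LookupOracle({0: 3, 1: 7, 2: 1, 3: 9})
    print(oracle_sketch.query([1, 3]))  # {1: 7, 3: 9}

A custom oracle for a real labelling tool would replace the dictionary lookup with a call to that tool.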

Pipeline
---------

After setting up the required components (strategy, data manager, model manager, oracle), we only need to instantiate
a pipeline (:py:class:`pyrelational.pipeline.pipeline.Pipeline`) to facilitate communication between the components and run the active learning workflow.
Here we perform a full active learning run, which labels 250 data points at each iteration until all points in the dataset have been labelled.
At the end of the run, we obtain metrics for the performance of the method, e.g. the performance of the model at each iteration.

.. code-block:: python

    from pyrelational.pipeline.pipeline import (
        Pipeline,
    )
    data_manager = get_digit_data_manager()
    # model_manager, strategy and oracle are the objects created in the previous sections
    pipeline = Pipeline(data_manager=data_manager, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=250)
    performance_history = pipeline.summary()
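
The number of iterations in a full run follows directly from the size of the unlabelled pool and ``num_annotate``.
The arithmetic sketch below assumes, purely for illustration, that 140 of the 1400 training samples are labelled at the start; the ``simulate_run`` helper is hypothetical.

.. code-block:: python

    import math


    def simulate_run(num_unlabelled, num_annotate):
        """Return the size of the unlabelled pool after each iteration."""
        sizes = []
        while num_unlabelled > 0:
            # The last iteration may label fewer than num_annotate samples
            num_unlabelled -= min(num_annotate, num_unlabelled)
            sizes.append(num_unlabelled)
        return sizes


    pool_sizes = simulate_run(num_unlabelled=1400 - 140, num_annotate=250)
    print(pool_sizes)       # [1010, 760, 510, 260, 10, 0]
    print(len(pool_sizes))  # 6, i.e. math.ceil(1260 / 250)

A smaller ``num_annotate`` gives a finer-grained performance curve at the cost of more training rounds.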

Comparing performances of different strategies
-----------------------------------------------

We can now compare the performances of different strategies on our digit classification problem:

.. code-block:: python

    from pyrelational.data_managers.data_manager import DataManager
    from pyrelational.strategies.classification import (
        LeastConfidenceStrategy,
        MarginalConfidenceStrategy,
        RatioConfidenceStrategy,
        EntropyClassificationStrategy,
    )
    from pyrelational.strategies.task_agnostic import RandomAcquisitionStrategy
    from pyrelational.pipeline.pipeline import Pipeline
    from pyrelational.oracles.benchmark_oracle import BenchmarkOracle

    # Performance summaries, keyed by strategy name
    query = dict()
    num_annotate = 50

    # Least confidence strategy
    dm = get_digit_data_manager()
    strategy = LeastConfidenceStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['LeastConfidence'] = pipeline.summary()

    # Marginal confidence
    dm = get_digit_data_manager()
    strategy = MarginalConfidenceStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['MarginalConfidence'] = pipeline.summary()

    # Ratio confidence
    dm = get_digit_data_manager()
    strategy = RatioConfidenceStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['RatioConfidence'] = pipeline.summary()

    # Entropy classification
    dm = get_digit_data_manager()
    strategy = EntropyClassificationStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['EntropyClassification'] = pipeline.summary()

    # Random acquisition (baseline)
    dm = get_digit_data_manager()
    strategy = RandomAcquisitionStrategy()
    oracle = BenchmarkOracle()
    pipeline = Pipeline(data_manager=dm, model_manager=model_manager, strategy=strategy, oracle=oracle)
    pipeline.compute_theoretical_performance()
    pipeline.run(num_annotate=num_annotate)
    query['RandomAcquisition'] = pipeline.summary()

This gives the results in the plot below, where we observe some improvement over a random strategy.

.. image:: performance_comparison.png
  :width: 100%
  :alt: Comparison of strategies performances on digit classification.