pyrelational.strategies

Abstract strategy

This module defines the interface for an abstract active learning strategy.

A strategy is defined by its __call__ method, which suggests observations to be labelled. By default, __call__ composes an informativeness function, which assigns a measure of informativeness to each unlabelled observation, with a selection algorithm, which chooses the observations to present to the oracle.

class Strategy(scorer: AbstractScorer | AbstractRegressionScorer | AbstractClassificationScorer, sampler: BatchModeSampler)[source]

Bases: ABC

This class defines an abstract active learning strategy.

Any strategy should be a subclass of this class and override the __call__ method to suggest observations to be labelled. In the general case, __call__ is the composition of an informativeness function, which assigns a measure of informativeness to unlabelled observations, and a selection algorithm, which chooses which observations to present to the oracle.

The user-defined __call__ method must have a “num_annotate” argument.

Initialize the strategy with a scorer and a sampler.

Parameters:
  • scorer – instance of a scorer class

  • sampler – instance of a sampler class
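The composition of a scorer and a sampler described above can be sketched in plain Python. This is only an illustration: ToyStrategy, top_k_sampler and closeness_to_half are hypothetical names, and the sketch does not use or reproduce the pyrelational classes.

```python
from typing import Callable, List, Sequence


def top_k_sampler(scores: Sequence[float], num_annotate: int) -> List[int]:
    """Deterministic selection: positions of the highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_annotate]


def closeness_to_half(predictions: Sequence[float]) -> List[float]:
    """Toy scorer: predictions close to 0.5 are treated as the most informative."""
    return [-abs(p - 0.5) for p in predictions]


class ToyStrategy:
    """Hypothetical stand-in for a strategy: a scorer composed with a sampler."""

    def __init__(
        self,
        scorer: Callable[[Sequence[float]], List[float]],
        sampler: Callable[[Sequence[float], int], List[int]],
    ) -> None:
        self.scorer = scorer
        self.sampler = sampler

    def __call__(self, num_annotate: int, predictions: Sequence[float]) -> List[int]:
        scores = self.scorer(predictions)          # informativeness per observation
        return self.sampler(scores, num_annotate)  # which observations to label


strategy = ToyStrategy(closeness_to_half, top_k_sampler)
suggested = strategy(num_annotate=2, predictions=[0.9, 0.55, 0.1, 0.48])
print(suggested)  # positions of the two predictions nearest to 0.5
```

Keeping the scorer and the sampler as separate callables means either one can be swapped without touching the other.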

suggest(num_annotate: int, **kwargs: Any) List[int][source]

Filter kwargs and feed arguments to the __call__ method.

Parameters:
  • num_annotate – number of samples to annotate

  • kwargs – any kwargs (filtered to match internal suggest inputs)

Returns:

list of indices of samples to query from oracle
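One way to filter keyword arguments before forwarding them is to inspect the signature of the target callable. The sketch below shows that general technique with the standard library; filter_kwargs and toy_call are hypothetical names, and whether pyrelational filters in exactly this way is an assumption.

```python
import inspect
from typing import Any, Callable, Dict


def filter_kwargs(func: Callable[..., Any], **kwargs: Any) -> Dict[str, Any]:
    """Keep only the keyword arguments that `func` declares as parameters."""
    accepted = inspect.signature(func).parameters
    return {name: value for name, value in kwargs.items() if name in accepted}


def toy_call(num_annotate: int, data_manager: str) -> str:
    """Stand-in for a __call__ that does not need a model manager."""
    return f"{num_annotate} from {data_manager}"


all_kwargs = {"data_manager": "dm", "model_manager": "mm"}
kept = filter_kwargs(toy_call, **all_kwargs)   # "model_manager" is dropped
result = toy_call(num_annotate=3, **kept)
print(kept, result)
```

Filtering of this kind lets a caller pass the same set of keyword arguments to every strategy, even though some strategies take fewer inputs than others.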

static train_and_infer(data_manager: DataManager, model_manager: ModelManager[Any, Any]) Any[source]

Train the model on the currently labelled subset of the data.

Return an output that can be used in model uncertainty based strategies.

Parameters:
  • data_manager – reference to the data manager which will supply the data to train the model and the unlabelled observations

  • model_manager – model with a generic model interface that will be trained and used to produce the output of this method

Returns:

output of the model

Strategies for regression tasks

Abstract regression strategy

Regression strategy class implementing __call__ logic.

class RegressionStrategy(scorer: AbstractScorer | AbstractRegressionScorer | AbstractClassificationScorer, sampler: BatchModeSampler)[source]

Bases: Strategy

A base active learning strategy class for regression.

Initialize the strategy with a scorer and a sampler.

Parameters:
  • scorer – instance of a scorer class

  • sampler – instance of a sampler class

__call__(num_annotate: int, data_manager: DataManager, model_manager: ModelManager[Any, Any]) List[int][source]

Identify samples for labelling based on user defined scoring and sampling function.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • model_manager – A pyrelational model manager which wraps a user defined ML model to handle instantiation, training, testing, as well as uncertainty quantification

Returns:

list of indices to annotate

Expected improvement

Implement Expected Improvement Strategy for regression tasks.

class ExpectedImprovementStrategy(xi: float = 0.01, axis: int = 0)[source]

Bases: Strategy

Implement Expected Improvement Strategy.

Unlabelled samples are scored based on the expected improvement scoring function.

Initialize the strategy with the expected improvement scorer and a deterministic sampler for regression.

__call__(num_annotate: int, data_manager: DataManager, model_manager: ModelManager[Any, Any]) List[int][source]

Identify samples which need to be labelled.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • model_manager – A pyrelational model manager which wraps a user defined ML model to handle instantiation, training, testing, as well as uncertainty quantification

Returns:

list of indices to annotate

scorer: ExpectedImprovement
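For orientation, the textbook form of expected improvement under a Gaussian predictive distribution is sketched below for a maximisation problem. It is not taken from the pyrelational source: the function name, the use of the best observed value `best`, and the reading of xi as an exploration margin are assumptions made for illustration.

```python
import math
from typing import List


def expected_improvement(mean: float, std: float, best: float, xi: float = 0.01) -> float:
    """Textbook EI for maximisation, given a Gaussian prediction (mean, std)."""
    gain = mean - best - xi
    if std <= 0.0:
        # No predictive uncertainty: the improvement is known exactly.
        return max(gain, 0.0)
    z = gain / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return gain * cdf + std * pdf


means = [1.0, 1.2, 0.8]
stds = [0.1, 0.5, 0.0]
best_so_far = 1.0

ei_scores: List[float] = [
    expected_improvement(m, s, best_so_far) for m, s in zip(means, stds)
]
# Rank observations from most to least promising.
ranking = sorted(range(len(ei_scores)), key=lambda i: ei_scores[i], reverse=True)
print(ei_scores, ranking)
```

In this form a larger xi lowers the score of points whose predicted mean barely exceeds the current best, which shifts the selection towards more uncertain points.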

Mean Prediction

Least confidence

Thompson sampling

Thompson Sampling Strategy for Regression.

class ThompsonSamplingStrategy(axis: int = 0)[source]

Bases: RegressionStrategy

Implements Thompson Sampling Strategy.

Unlabelled samples are scored and queried based on the Thompson sampling scorer.

Initialize the strategy with the Thompson sampling scorer and a deterministic sampler for regression.
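The general idea of Thompson sampling can be illustrated as follows: draw one plausible model from the posterior and rank observations by that single draw. The sketch assumes the posterior is represented by a small ensemble of predictions; thompson_scores is a hypothetical helper, and the exact sampling scheme used by the pyrelational scorer may differ.

```python
import random
from typing import List, Sequence


def thompson_scores(
    member_predictions: Sequence[Sequence[float]], rng: random.Random
) -> List[float]:
    """Pick one ensemble member at random and use its predictions as scores."""
    member = rng.randrange(len(member_predictions))
    return list(member_predictions[member])


# Two ensemble members, each predicting a value for three unlabelled observations.
member_predictions = [
    [0.1, 0.9, 0.5],
    [0.8, 0.2, 0.4],
]

rng = random.Random(0)  # seeded only to make the example repeatable
scores = thompson_scores(member_predictions, rng)
chosen = scores.index(max(scores))  # observation with the highest sampled value
print(scores, chosen)
```

Because the ranking depends on a random draw, repeated calls can favour different observations, which is how the method balances exploration and exploitation.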

Upper confidence bound

Upper Confidence Bound Strategy.

class UpperConfidenceBoundStrategy(kappa: float = 1.0, axis: int = 0)[source]

Bases: RegressionStrategy

Implements Upper Confidence Bound Strategy.

Unlabelled samples are scored and queried based on the UCB scorer.

Initialize the strategy with the UCB scorer and a deterministic sampler for regression.

Parameters:

kappa – trade-off parameter between exploitation and exploration
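The role of kappa can be seen in the usual form of the upper confidence bound, mean + kappa * standard deviation. The sketch below applies it to a few repeated predictions per observation; the helper name and the use of the population standard deviation are assumptions, not the library's implementation.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def upper_confidence_bound(samples: Sequence[float], kappa: float = 1.0) -> float:
    """Usual UCB form: predicted mean plus kappa times the predictive spread."""
    return mean(samples) + kappa * pstdev(samples)


# Three observations, each with three predictions (e.g. from an ensemble).
predictions = [
    [1.0, 1.0, 1.0],   # confident, moderate value
    [0.0, 1.0, 2.0],   # same mean as above, but very uncertain
    [1.5, 1.5, 1.5],   # confident, highest value
]


def best_index(kappa: float) -> int:
    scores: List[float] = [upper_confidence_bound(p, kappa) for p in predictions]
    return scores.index(max(scores))


explore = best_index(kappa=1.0)   # uncertainty is rewarded
exploit = best_index(kappa=0.0)   # only the predicted mean matters
print(explore, exploit)
```

With kappa set to 0 the score reduces to the predicted mean (pure exploitation); larger values give more weight to uncertain observations.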

Strategies for classification tasks

Abstract Classification strategy

ClassificationStrategy class for active learning in classification tasks.

class ClassificationStrategy(scorer: AbstractScorer | AbstractRegressionScorer | AbstractClassificationScorer, sampler: BatchModeSampler)[source]

Bases: Strategy

A base active learning strategy class for classification.

Initialize the strategy with a scorer and a sampler.

Parameters:
  • scorer – instance of a scorer class

  • sampler – instance of a sampler class

__call__(num_annotate: int, data_manager: DataManager, model_manager: ModelManager[Any, Any]) List[int][source]

Identify samples for labelling based on user defined scoring and sampling function.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • model_manager – A pyrelational model manager which wraps a user defined ML model to handle instantiation, training, testing, as well as uncertainty quantification

Returns:

list of indices to annotate

softmax(scores: Tensor, base: float = 2.718281828459045, axis: int = -1) Tensor[source]

Return softmax array for array of scores.

Converts a set of raw scores from a model (logits) into a probability distribution via softmax.

The resulting probabilities each lie in the range [0, 1] and sum to 1 along the chosen axis.

Assumes the input is a PyTorch tensor, e.g. tensor([1.0, 4.0, 2.0, 3.0])

Parameters:
  • scores – a PyTorch tensor of real numbers (positive or negative)

  • base – the base for the exponential (default e)

  • axis – axis to apply softmax on scores

Returns:

tensor of softmaxed scores
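The computation can be sketched on a plain Python list, which keeps the example free of a torch dependency. softmax_list is a hypothetical helper for a single one-dimensional score vector; the documented function operates on tensors along an axis.

```python
import math
from typing import List, Sequence


def softmax_list(scores: Sequence[float], base: float = math.e) -> List[float]:
    """Softmax of a 1-D list of scores with a configurable base."""
    shift = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    powers = [base ** (s - shift) for s in scores]
    total = sum(powers)
    return [p / total for p in powers]


probs = softmax_list([1.0, 4.0, 2.0, 3.0])
probs_base2 = softmax_list([1.0, 4.0, 2.0, 3.0], base=2.0)
print(probs)
print(probs_base2)
```

A smaller base flattens the distribution: the largest score still receives the highest probability, but by a narrower margin.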

Entropy

Active learning using entropy based confidence uncertainty measure.

The score is computed between classes in the posterior predictive distribution to choose which observations to propose to the oracle.

class EntropyClassificationStrategy(axis: int = -1)[source]

Bases: ClassificationStrategy

Implements Entropy Classification Strategy.

Initialise the strategy with entropy scorer and deterministic sampler.
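As an illustration of the measure, the Shannon entropy of each predicted class distribution can be computed and used to rank observations. The sketch uses the standard definition with natural logarithms; any normalisation or averaging applied by the pyrelational scorer is not shown and is not assumed here.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats); zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Predicted class probabilities for three unlabelled observations.
predictions = [
    [0.9, 0.1],
    [0.5, 0.5],
    [1.0, 0.0],
]

entropies: List[float] = [entropy(p) for p in predictions]
# Most uncertain observation first.
ranking = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
print(entropies, ranking)
```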

Least confidence

Active learning using least confidence uncertainty measure.

class LeastConfidenceStrategy(axis: int = -1)[source]

Bases: ClassificationStrategy

Implements Least Confidence Strategy.

Unlabelled samples are scored and queried based on the least confidence for classification scorer.

Initialize the strategy with the least confidence scorer and a deterministic sampler for classification.
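In its simplest form, least confidence scores an observation as one minus the probability of its most likely class. The sketch below uses that unnormalised definition; the helper name is hypothetical, and the library's scorer may rescale the value, for example by the number of classes.

```python
from typing import List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """1 - max class probability: higher means the model is less sure."""
    return 1.0 - max(probs)


predictions = [
    [0.90, 0.05, 0.05],   # confident prediction
    [0.40, 0.35, 0.25],   # uncertain prediction
]

lc_scores: List[float] = [least_confidence(p) for p in predictions]
most_uncertain = lc_scores.index(max(lc_scores))
print(lc_scores, most_uncertain)
```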

Marginal confidence

Active learning using marginal confidence uncertainty measure.

class MarginalConfidenceStrategy(axis: int = -1)[source]

Bases: ClassificationStrategy

Implements Marginal Confidence Strategy.

Unlabelled samples are scored and queried based on the marginal confidence for classification scorer.

Initialize the strategy with the marginal confidence scorer and a deterministic sampler for classification.
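Margin-based uncertainty is commonly defined from the gap between the two most probable classes. The sketch below uses 1 - (p1 - p2), so that a small gap gives a high score; the helper name is hypothetical and the exact formula used by the library's scorer is an assumption.

```python
from typing import List, Sequence


def margin_confidence(probs: Sequence[float]) -> float:
    """1 - (top probability - runner-up probability)."""
    first, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (first - second)


predictions = [
    [0.50, 0.45, 0.05],   # two classes nearly tied
    [0.70, 0.20, 0.10],   # clear winner
]

margin_scores: List[float] = [margin_confidence(p) for p in predictions]
most_uncertain = margin_scores.index(max(margin_scores))
print(margin_scores, most_uncertain)
```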

Confidence ratio

Active learning using ratio based confidence uncertainty measure.

class RatioConfidenceStrategy(axis: int = -1)[source]

Bases: ClassificationStrategy

Implements Ratio Confidence Strategy.

Unlabelled samples are scored and queried based on the ratio confidence for classification scorer.

Initialize the strategy with the ratio confidence scorer and a deterministic sampler for classification.
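Ratio-based uncertainty is commonly defined as the runner-up class probability divided by the top class probability, so values close to 1 indicate a near tie. The sketch below follows that definition; the helper name is hypothetical and the library's scorer may be defined differently.

```python
from typing import List, Sequence


def ratio_confidence(probs: Sequence[float]) -> float:
    """Runner-up probability divided by the top probability (in [0, 1])."""
    first, second = sorted(probs, reverse=True)[:2]
    return second / first


predictions = [
    [0.50, 0.45, 0.05],   # two classes nearly tied
    [0.70, 0.20, 0.10],   # clear winner
]

ratio_scores: List[float] = [ratio_confidence(p) for p in predictions]
most_uncertain = ratio_scores.index(max(ratio_scores))
print(ratio_scores, most_uncertain)
```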

Task-agnostic strategies

Random acquisition

Defines and implements a random acquisition active learning strategy.

class RandomAcquisitionStrategy[source]

Bases: Strategy

Implements RandomAcquisition, whereby random samples from the unlabelled set are chosen at each step.

Override init method to do nothing. This strategy does not require any initialization.

__call__(num_annotate: int, data_manager: DataManager) List[int][source]

Identify samples for labelling based on random sampling.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

Returns:

list of indices to annotate
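Random acquisition amounts to drawing indices without replacement from the unlabelled pool. A minimal sketch, assuming the pool is available as a list of integer indices (random_acquisition and its seed argument are hypothetical):

```python
import random
from typing import List, Optional, Sequence


def random_acquisition(
    unlabelled_indices: Sequence[int], num_annotate: int, seed: Optional[int] = None
) -> List[int]:
    """Draw up to num_annotate distinct indices uniformly at random."""
    rng = random.Random(seed)
    k = min(num_annotate, len(unlabelled_indices))  # never ask for more than exist
    return rng.sample(list(unlabelled_indices), k)


pool = [3, 7, 8, 12, 15, 21]
picked = random_acquisition(pool, num_annotate=3, seed=0)
print(picked)
```

Random acquisition is a useful baseline: an informativeness-based strategy should outperform it for the extra computation to be worthwhile.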

Relative distance

Relative distance based active learning strategy.

class RelativeDistanceStrategy(metric: str = 'euclidean')[source]

Bases: Strategy

Diversity sampling based active learning strategy.

Initialise the strategy with a distance metric.

Parameters:

metric – Name of distance metric to use. This should be supported by scikit-learn pairwise_distances function.

__call__(num_annotate: int, data_manager: DataManager) List[int][source]

Identify samples which need to be labelled.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

Returns:

list of indices to annotate

scorer: RelativeDistanceScorer
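A distance-based diversity criterion can be sketched as follows: score each unlabelled sample by its distance to the nearest labelled sample, then pick the samples that are farthest away. This is a generic illustration with the Euclidean metric and the standard library; the function name and the data layout are hypothetical, and the details of the pyrelational scorer are assumed rather than reproduced.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def relative_distance_select(
    unlabelled: Dict[int, Point], labelled: Sequence[Point], num_annotate: int
) -> List[int]:
    """Return indices of the unlabelled points farthest from the labelled set."""
    # Distance from each unlabelled point to its nearest labelled point.
    min_dist = {
        idx: min(math.dist(x, ref) for ref in labelled)
        for idx, x in unlabelled.items()
    }
    ranked = sorted(min_dist, key=min_dist.get, reverse=True)
    return ranked[:num_annotate]


labelled = [(0.0, 0.0), (1.0, 0.0)]
unlabelled = {10: (0.1, 0.0), 11: (5.0, 5.0), 12: (2.0, 0.0)}

selected = relative_distance_select(unlabelled, labelled, num_annotate=2)
print(selected)
```

Selecting points far from the labelled set favours regions of the input space the model has not yet seen.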

Representative sampling

Representative sampling based active learning strategy.

class RepresentativeSamplingStrategy(clustering_method: str | ClusterMixin = 'KMeans', **clustering_kwargs: Any)[source]

Bases: Strategy

Representative sampling based active learning strategy.

Initialise the strategy with a clustering method and its arguments.

Parameters:
  • clustering_method – name, or instantiated class, of the clustering method to use

  • clustering_kwargs – arguments to be passed to instantiate clustering class if a string is passed to clustering_method

__call__(data_manager: DataManager, num_annotate: int) List[int][source]

Identify samples for labelling based on representative sampling informativeness measure.

Parameters:
  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • num_annotate – number of samples to annotate

Returns:

list of indices to annotate

representative_sampling(query_set: Tensor | ndarray[Any, dtype[float64]] | List[float] | DataLoader[Any], num_annotate: int, clustering_method: str | ClusterMixin = 'KMeans', **clustering_kwargs: Any) List[int][source]

Select representative samples from the query set using clustering algorithms from scikit-learn.

Parameters:
  • query_set – The query set, either as an array-like object or a PyTorch DataLoader. If a DataLoader, the first element of each batch should be the features of the samples.

  • num_annotate – Number of representative samples to select.

  • clustering_method – The clustering method to use, either as a string (name of the clustering algorithm) or as an instantiated clustering class.

  • clustering_kwargs – Additional arguments for the clustering method, used if clustering_method is a string.

Returns:

A list of indices representing the selected samples.
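The idea behind representative sampling can be sketched as: cluster the query set into num_annotate groups and return, for each cluster, the sample closest to its centre. To stay dependency-free, the sketch uses a small hand-rolled k-means in place of scikit-learn; tiny_kmeans and representative_indices are hypothetical helpers, and picking the sample nearest to each centroid is an assumption about how clusters map to indices.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def tiny_kmeans(points: Sequence[Point], k: int, iterations: int = 10) -> List[Point]:
    """Minimal Lloyd's k-means; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster
            else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
    return centroids


def representative_indices(points: Sequence[Point], num_annotate: int) -> List[int]:
    """For each cluster centre, return the index of the closest sample."""
    centroids = tiny_kmeans(points, num_annotate)
    return [
        min(range(len(points)), key=lambda i: math.dist(points[i], c))
        for c in centroids
    ]


query_set = [
    (0.0, 0.0), (0.2, 0.0), (0.1, 0.1),   # first group
    (5.0, 5.0), (5.2, 5.0), (5.1, 5.1),   # second group
]
selected = representative_indices(query_set, num_annotate=2)
print(selected)
```

One sample per cluster gives a batch that covers the query set rather than concentrating on a single dense region.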