pyrelational.strategies

Abstract strategy

This module defines the interface for an abstract active learning strategy. A strategy is defined by its __call__ method, which suggests observations to be labelled. In the default case, __call__ is the composition of an informativeness function, which assigns a measure of informativeness to unlabelled observations, and a selection algorithm, which chooses which observations to present to the oracle.

class Strategy(*args: Any, **kwargs: Any)

Bases: ABC

This class defines an abstract active learning strategy.

Any strategy should be a subclass of this class and override the __call__ method to suggest observations to be labelled. In the general case, __call__ is the composition of an informativeness function, which assigns a measure of informativeness to unlabelled observations, and a selection algorithm, which chooses which observations to present to the oracle.

The user-defined __call__ method must have a “num_annotate” argument.
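As a rough illustration of the composition described above, the following sketch scores candidates and then keeps the top ones. It uses a hypothetical stand-in base class (SketchStrategy) and plain Python lists rather than pyrelational's Strategy, DataManager, or ModelManager, so every name in it is an assumption made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Sequence


class SketchStrategy(ABC):
    """Hypothetical stand-in for a strategy base class (not pyrelational's Strategy)."""

    @abstractmethod
    def __call__(self, num_annotate: int, **kwargs: Any) -> List[int]:
        """Return the indices of the observations to label next."""


class TopScoreStrategy(SketchStrategy):
    """Compose an informativeness function with a top-k selection step."""

    def informativeness(self, values: Sequence[float]) -> List[float]:
        # Assumption for illustration: larger values are more informative.
        return [float(v) for v in values]

    def __call__(self, num_annotate: int, values: Sequence[float]) -> List[int]:
        scores = self.informativeness(values)
        # Selection algorithm: rank positions by score and keep the best ones.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return ranked[:num_annotate]


picked = TopScoreStrategy()(num_annotate=2, values=[0.1, 0.9, 0.4, 0.7])
```

Here the two highest-scoring positions are returned; a real strategy would map them to dataset indices supplied by the data manager.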

suggest(num_annotate: int, **kwargs: Any) -> List[int]

Filter kwargs and feed arguments to the __call__ method to return unlabelled observations to be labelled as a list of dataset indices.

Parameters:
  • num_annotate – number of samples to annotate

  • kwargs – any kwargs (filtered to match internal suggest inputs)

Returns:

list of indices of samples to query from oracle
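One plausible way to filter keyword arguments before forwarding them is to inspect the target function's signature. The sketch below is an assumption about how such filtering could work, not a copy of the library's code; filter_kwargs and pick_first are hypothetical helpers.

```python
import inspect
from typing import Any, Callable, Dict, List


def filter_kwargs(func: Callable[..., Any], kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keyword arguments that `func` declares as parameters."""
    accepted = inspect.signature(func).parameters
    return {name: value for name, value in kwargs.items() if name in accepted}


def pick_first(num_annotate: int, unlabelled: List[int]) -> List[int]:
    # Hypothetical strategy body: take the first `num_annotate` unlabelled indices.
    return unlabelled[:num_annotate]


def suggest(num_annotate: int, **kwargs: Any) -> List[int]:
    # Drop anything `pick_first` does not accept, then forward the rest.
    return pick_first(num_annotate=num_annotate, **filter_kwargs(pick_first, kwargs))


# `model_manager` is silently dropped because `pick_first` does not accept it.
result = suggest(2, unlabelled=[5, 8, 13], model_manager=None)
```

With this pattern, callers can pass the same set of keyword arguments to every strategy, and each strategy receives only what its __call__ declares.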

static train_and_infer(data_manager: DataManager, model_manager: ModelManager[Any, Any]) -> Any

Train the model on the currently labelled subset of the data and produce an output that can be used in strategies based on model uncertainty.

Parameters:
  • data_manager – reference to the data manager, which supplies the data to train the model and the unlabelled observations

  • model_manager – model manager with a generic model interface, which is trained and used to produce the output of this method

Returns:

output of the model

Strategies for regression tasks

Abstract regression strategy

class RegressionStrategy

Bases: Strategy, ABC

A base active learning strategy class for regression in which the top n indices, according to a user-specified scoring function, are queried at each iteration.

__call__(num_annotate: int, data_manager: DataManager, model_manager: ModelManager[Any, Any]) -> List[int]

Call function which identifies samples which need to be labelled based on a user-defined scoring function.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • model_manager – A pyrelational model manager which wraps a user-defined ML model to handle instantiation, training, testing, as well as uncertainty quantification

Returns:

list of indices to annotate

abstract scoring_function(predictions: Tensor) -> Tensor

Compute score of each sample.

Parameters:

predictions – model predictions for each sample

Returns:

scores for each sample
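The interplay between a scoring function and top-n selection can be sketched as follows. The example assumes predictions arrive as one row per ensemble member and one column per unlabelled sample, and uses plain lists instead of tensors; top_n_indices and spread are hypothetical names, and the spread scorer is only a placeholder for a user-specified scoring function.

```python
from typing import Callable, List, Sequence

Predictions = Sequence[Sequence[float]]


def top_n_indices(
    predictions: Predictions,
    unlabelled_indices: Sequence[int],
    scoring_function: Callable[[Predictions], List[float]],
    num_annotate: int,
) -> List[int]:
    """Score every unlabelled sample and return the dataset indices of the top n."""
    scores = scoring_function(predictions)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Positions in `scores` are mapped back to dataset indices.
    return [unlabelled_indices[i] for i in ranked[:num_annotate]]


def spread(predictions: Predictions) -> List[float]:
    # Placeholder scorer: disagreement between ensemble members per sample.
    return [max(column) - min(column) for column in zip(*predictions)]


preds = [[1.0, 4.0, 2.0], [3.0, 6.0, 2.0]]
chosen = top_n_indices(preds, [10, 20, 30], spread, num_annotate=2)
```

Subclasses would then differ only in the scoring function they supply.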

Expected improvement

class ExpectedImprovementStrategy

Bases: Strategy

Implements the Expected Improvement strategy, whereby each unlabelled sample is scored with the expected improvement scoring function. The top samples according to this score are selected at each step.

__call__(num_annotate: int, data_manager: DataManager, model_manager: ModelManager[Any, Any]) -> List[int]

Call function which identifies samples which need to be labelled.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • model_manager – A pyrelational model manager which wraps a user-defined ML model to handle instantiation, training, testing, as well as uncertainty quantification

Returns:

list of indices to annotate
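For readers unfamiliar with expected improvement, the sketch below computes the standard closed-form score under a Gaussian predictive distribution, assuming a maximisation problem. Whether the library uses exactly this form is not stated here, so treat the function name, its arguments, and the maximisation convention as assumptions.

```python
from statistics import NormalDist
from typing import List, Sequence


def expected_improvement(
    means: Sequence[float], stds: Sequence[float], best_so_far: float
) -> List[float]:
    """Closed-form expected improvement over `best_so_far` (maximisation assumed)."""
    standard_normal = NormalDist()
    scores = []
    for mu, sigma in zip(means, stds):
        if sigma <= 0.0:
            # No predictive uncertainty: the improvement is deterministic.
            scores.append(max(mu - best_so_far, 0.0))
            continue
        z = (mu - best_so_far) / sigma
        scores.append(
            (mu - best_so_far) * standard_normal.cdf(z) + sigma * standard_normal.pdf(z)
        )
    return scores


ei = expected_improvement([1.0, 2.0, 0.5], [0.5, 0.0, 2.0], best_so_far=1.5)
# Rank positions from most to least promising.
ranking = sorted(range(len(ei)), key=lambda i: ei[i], reverse=True)
```

Note how the third sample, despite a low mean, outranks the first because its large uncertainty leaves more room for improvement.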

Mean Prediction

class MeanPredictionStrategy

Bases: RegressionStrategy

Implements the Mean Prediction strategy, whereby unlabelled samples are queried based on the mean value predicted by the model, i.e. samples with the highest predicted mean values are queried.

scoring_function(predictions: Tensor) -> Tensor

Compute score of each sample.

Parameters:

predictions – model predictions for each sample

Returns:

scores for each sample
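A minimal sketch of mean-based scoring, assuming predictions are given as one row per ensemble member and one column per sample (plain lists stand in for tensors, and mean_prediction_scores is a hypothetical name):

```python
from statistics import fmean
from typing import List, Sequence


def mean_prediction_scores(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Average the ensemble predictions for each sample."""
    return [fmean(sample_predictions) for sample_predictions in zip(*predictions)]


ensemble = [[0.2, 1.5, 0.9], [0.4, 1.1, 1.0]]
mean_scores = mean_prediction_scores(ensemble)
# Position of the sample with the highest predicted mean.
best = max(range(len(mean_scores)), key=mean_scores.__getitem__)
```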

Least confidence

class LeastConfidenceStrategy

Bases: RegressionStrategy

Implements the Least Confidence strategy, whereby unlabelled samples are queried based on the variance predicted by the model.

scoring_function(predictions: Tensor) -> Tensor

Compute score of each sample.

Parameters:

predictions – model predictions for each sample

Returns:

scores for each sample
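Variance-based scoring can be sketched in the same style. The example assumes the predictive variance is estimated as the population variance across ensemble members; the library may estimate it differently, and variance_scores is a hypothetical name.

```python
from statistics import pvariance
from typing import List, Sequence


def variance_scores(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Population variance of the ensemble predictions for each sample."""
    return [pvariance(sample_predictions) for sample_predictions in zip(*predictions)]


# Three ensemble members, two samples: members agree on the first sample only.
ensemble = [[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]]
var_scores = variance_scores(ensemble)
```

The second sample receives the higher score because the ensemble members disagree about it.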

Thompson sampling

class ThompsonSamplingStrategy

Bases: RegressionStrategy

Implements the Thompson Sampling strategy, whereby unlabelled samples are scored and queried based on the Thompson sampling scorer.

scoring_function(predictions: Tensor) -> Tensor

Compute score of each sample.

Parameters:

predictions – model predictions for each sample

Returns:

scores for each sample
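One common reading of Thompson sampling with an ensemble is to treat each member as a draw from the posterior, pick one member at random, and rank samples by that member's predictions. The sketch below follows that reading; it is an assumption about the general technique rather than a description of the library's scorer, and thompson_scores is a hypothetical name.

```python
import random
from typing import List, Sequence


def thompson_scores(
    predictions: Sequence[Sequence[float]], rng: random.Random
) -> List[float]:
    """Use the predictions of one randomly drawn ensemble member as scores."""
    member = rng.randrange(len(predictions))
    return list(predictions[member])


ensemble = [[0.2, 0.9, 0.5], [0.6, 0.1, 0.4], [0.3, 0.8, 0.7]]
ts_scores = thompson_scores(ensemble, random.Random(0))
# Position of the sample that the drawn member considers best.
best = max(range(len(ts_scores)), key=ts_scores.__getitem__)
```

Because a different member may be drawn at each iteration, the ranking varies between calls, which is what provides exploration.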

Upper confidence bound

class UpperConfidenceBoundStrategy(kappa: float = 1.0)

Bases: Strategy

Implements the Upper Confidence Bound (UCB) strategy, whereby unlabelled samples are scored and queried based on the UCB scorer.

Parameters:

kappa – trade-off parameter between exploitation and exploration

__call__(num_annotate: int, data_manager: DataManager, model_manager: ModelManager[Any, Any]) -> List[int]

Call function which identifies samples which need to be labelled.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • model_manager – A pyrelational model manager which wraps a user-defined ML model to handle instantiation, training, testing, as well as uncertainty quantification

Returns:

list of indices to annotate
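The role of kappa can be seen in a small sketch of the usual UCB score, mean plus kappa times standard deviation. The example assumes the mean and standard deviation are taken across ensemble members (rows) for each sample (columns); ucb_scores is a hypothetical name, and the exact estimator used by the library is not stated here.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def ucb_scores(predictions: Sequence[Sequence[float]], kappa: float = 1.0) -> List[float]:
    """Score each sample as mean + kappa * standard deviation across the ensemble."""
    scores = []
    for sample_predictions in zip(*predictions):
        scores.append(fmean(sample_predictions) + kappa * pstdev(sample_predictions))
    return scores


# Both samples have the same mean, but only the first has any disagreement.
ensemble = [[1.0, 2.0], [3.0, 2.0]]
exploit_only = ucb_scores(ensemble, kappa=0.0)
balanced = ucb_scores(ensemble, kappa=1.0)
```

With kappa set to 0 the two samples tie on their mean; a positive kappa favours the sample with the larger predictive spread.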

Strategies for classification tasks

Abstract Classification strategy

class ClassificationStrategy

Bases: Strategy, ABC

A base active learning strategy class for classification in which the top n indices, according to a user-specified scoring function, are queried at each iteration.

__call__(num_annotate: int, data_manager: DataManager, model_manager: ModelManager[Any, Any]) -> List[int]

Call function which identifies samples which need to be labelled based on a user-defined scoring function.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • model_manager – A pyrelational model manager which wraps a user-defined ML model to handle instantiation, training, testing, as well as uncertainty quantification

Returns:

list of indices to annotate

abstract scoring_function(predictions: Tensor) -> Tensor

Compute score of each sample.

Parameters:

predictions – model predictions for each sample

Returns:

scores for each sample

Entropy

Active learning using an entropy-based uncertainty measure over the classes in the posterior predictive distribution to choose which observations to propose to the oracle.

class EntropyClassificationStrategy

Bases: ClassificationStrategy

Implements the Entropy Classification strategy, whereby unlabelled samples are scored and queried based on entropy.

scoring_function(predictions: Tensor) -> Tensor

Compute score of each sample.

Parameters:

predictions – model predictions for each sample

Returns:

scores for each sample
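Entropy scoring can be illustrated with the Shannon entropy of each sample's class probabilities. The sketch assumes each row is a probability vector that sums to 1 and uses the natural logarithm; entropy_scores is a hypothetical name, and the library may apply additional steps such as averaging over ensemble members first.

```python
import math
from typing import List, Sequence


def entropy_scores(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Shannon entropy (in nats) of each sample's class distribution."""
    scores = []
    for class_probs in probabilities:
        # Terms with zero probability contribute nothing and are skipped.
        scores.append(-sum(p * math.log(p) for p in class_probs if p > 0.0))
    return scores


probs = [[0.5, 0.5], [1.0, 0.0], [0.7, 0.3]]
ent = entropy_scores(probs)
```

The uniform distribution scores highest and the fully confident prediction scores zero.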

Least confidence

Active learning using the least confidence uncertainty measure over the classes in the posterior predictive distribution to choose which observations to propose to the oracle.

class LeastConfidenceStrategy

Bases: ClassificationStrategy

Implements the Least Confidence strategy, whereby unlabelled samples are scored and queried based on the least confidence scorer for classification.

scoring_function(predictions: Tensor) -> Tensor

Compute score of each sample.

Parameters:

predictions – model predictions for each sample

Returns:

scores for each sample
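In its simplest form, least confidence scores a sample by one minus the probability of its most likely class. The sketch below uses that unnormalised form; some implementations rescale the score by the number of classes, and whether the library does so is not stated here. least_confidence_scores is a hypothetical name.

```python
from typing import List, Sequence


def least_confidence_scores(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """One minus the probability of the most likely class, per sample."""
    return [1.0 - max(class_probs) for class_probs in probabilities]


probs = [[0.5, 0.5], [0.9, 0.1]]
lc = least_confidence_scores(probs)
```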

Marginal confidence

Active learning using the marginal confidence uncertainty measure over the classes in the posterior predictive distribution to choose which observations to propose to the oracle.

class MarginalConfidenceStrategy

Bases: ClassificationStrategy

Implements the Marginal Confidence strategy, whereby unlabelled samples are scored and queried based on the marginal confidence scorer for classification.

scoring_function(predictions: Tensor) -> Tensor

Compute score of each sample.

Parameters:

predictions – model predictions for each sample

Returns:

scores for each sample
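Margin-based uncertainty is commonly defined from the gap between the two most likely classes. The sketch below scores a sample as one minus that gap, so a small gap gives a high score; this convention and the name margin_confidence_scores are assumptions for illustration.

```python
from typing import List, Sequence


def margin_confidence_scores(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """One minus the gap between the two most likely classes, per sample."""
    scores = []
    for class_probs in probabilities:
        top, runner_up = sorted(class_probs, reverse=True)[:2]
        scores.append(1.0 - (top - runner_up))
    return scores


probs = [[0.5, 0.3, 0.2], [0.9, 0.05, 0.05]]
margin = margin_confidence_scores(probs)
```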

Confidence ratio

Active learning using a ratio-based confidence uncertainty measure over the classes in the posterior predictive distribution to choose which observations to propose to the oracle.

class RatioConfidenceStrategy

Bases: ClassificationStrategy

Implements the Ratio Confidence strategy, whereby unlabelled samples are scored and queried based on the ratio confidence scorer for classification.

scoring_function(predictions: Tensor) -> Tensor

Compute score of each sample.

Parameters:

predictions – model predictions for each sample

Returns:

scores for each sample
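A ratio-based score can be sketched as the probability of the second most likely class divided by that of the most likely class, which approaches 1 when the model cannot separate the two. The direction of the ratio and the name ratio_confidence_scores are assumptions for illustration.

```python
from typing import List, Sequence


def ratio_confidence_scores(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Ratio of the second-highest to the highest class probability, per sample."""
    scores = []
    for class_probs in probabilities:
        top, runner_up = sorted(class_probs, reverse=True)[:2]
        scores.append(runner_up / top)
    return scores


probs = [[0.5, 0.25, 0.25], [0.8, 0.1, 0.1]]
ratio = ratio_confidence_scores(probs)
```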

Task-agnostic strategies

Random acquisition

Defines and implements a random acquisition active learning strategy.

class RandomAcquisitionStrategy

Bases: Strategy

Implements random acquisition, whereby random samples from the unlabelled set are chosen at each step.

__call__(num_annotate: int, data_manager: DataManager) -> List[int]

Call function which identifies samples which need to be labelled.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

Returns:

list of indices to annotate
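Random acquisition amounts to sampling without replacement from the unlabelled indices. In the sketch below, the list of unlabelled indices is passed in directly rather than obtained from a data manager, and the seed argument is an addition for reproducibility; random_acquisition is a hypothetical name.

```python
import random
from typing import List, Optional, Sequence


def random_acquisition(
    unlabelled_indices: Sequence[int], num_annotate: int, seed: Optional[int] = None
) -> List[int]:
    """Draw up to `num_annotate` distinct indices uniformly at random."""
    rng = random.Random(seed)
    # Never request more samples than are available.
    k = min(num_annotate, len(unlabelled_indices))
    return rng.sample(list(unlabelled_indices), k)


pool = [3, 5, 8, 13, 21]
picked = random_acquisition(pool, 3, seed=0)
```

Random acquisition is often used as the baseline against which other strategies are compared.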

Relative distance

class RelativeDistanceStrategy(metric: str = 'euclidean')

Bases: Strategy

Diversity sampling based active learning strategy.

Parameters:

metric – Name of the distance metric to use. This should be supported by scikit-learn's pairwise_distances function.

__call__(num_annotate: int, data_manager: DataManager) -> List[int]

Call function which identifies samples which need to be labelled.

Parameters:
  • num_annotate – number of samples to annotate

  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

Returns:

list of indices to annotate
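One way to realise distance-based diversity sampling is to score each unlabelled sample by its distance to the nearest labelled sample and query the farthest ones. The sketch below does this with Euclidean distance from the standard library; the use of the minimum distance and the name relative_distance_selection are assumptions, and the library computes distances through scikit-learn instead.

```python
import math
from typing import Dict, List, Sequence


def relative_distance_selection(
    labelled: Sequence[Sequence[float]],
    unlabelled: Dict[int, Sequence[float]],
    num_annotate: int,
) -> List[int]:
    """Return the unlabelled indices that lie farthest from any labelled sample."""
    min_distance = {
        index: min(math.dist(features, known) for known in labelled)
        for index, features in unlabelled.items()
    }
    ranked = sorted(min_distance, key=min_distance.get, reverse=True)
    return ranked[:num_annotate]


labelled_points = [(0.0, 0.0)]
unlabelled_points = {3: (1.0, 0.0), 7: (5.0, 0.0), 9: (2.0, 2.0)}
far = relative_distance_selection(labelled_points, unlabelled_points, 2)
```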

Representative sampling

Representative sampling based active learning strategy.

class RepresentativeSamplingStrategy(clustering_method: str | ClusterMixin = 'KMeans', **clustering_kwargs: Any)

Bases: Strategy

Representative sampling based active learning strategy.

Parameters:
  • clustering_method – name, or instantiated class, of the clustering method to use

  • clustering_kwargs – arguments to be passed to instantiate clustering class if a string is passed to clustering_method

__call__(data_manager: DataManager, num_annotate: int) -> List[int]

Call function which identifies samples which need to be labelled.

Parameters:
  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • num_annotate – number of samples to annotate

Returns:

list of indices to annotate
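The idea behind representative sampling can be sketched as clustering the unlabelled samples and querying the sample closest to each cluster centre. The example below uses a tiny hand-written k-means with deterministic initialisation so that it runs without dependencies; the library relies on scikit-learn clustering classes instead, and the function names and the closest-to-centroid rule are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, iterations: int = 10) -> List[Point]:
    """Very small Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(point)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster
            else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
    return centroids


def representative_indices(
    points: Sequence[Point], dataset_indices: Sequence[int], num_annotate: int
) -> List[int]:
    """Pick, for each cluster, the dataset index of the point nearest its centroid."""
    chosen = []
    for centroid in kmeans(points, num_annotate):
        nearest = min(range(len(points)), key=lambda i: math.dist(points[i], centroid))
        chosen.append(dataset_indices[nearest])
    # Note: this simple version does not guard against picking a point twice.
    return chosen


features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (1.0, 0.0)]
indices = [4, 8, 15, 16, 23]
representatives = representative_indices(features, indices, num_annotate=2)
```

In this toy data one representative comes from the group near the origin and one from the group near (10, 10), so the queried batch covers both regions.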