pyrelational.informativeness

Informativeness functions for regression tasks

This module contains methods for scoring samples based on model uncertainty in regression tasks

Most of these functions are simple, but giving them a name and a PyTorch implementation is useful for defining the different active learning strategies

regression_bald(x: Tensor, axis: int = 0) Tensor[source]

Implementation of Bayesian Active Learning by Disagreement (BALD) for regression tasks (reference)

Parameters:
  • x – pytorch tensor of repeat by scores (or scores by repeat)

  • axis – index of the axis along which the repeats lie

Returns:

pytorch tensor of scores
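
A minimal usage sketch (shapes and values are illustrative), assuming the repeats, e.g. MC-dropout forward passes, are stacked along axis 0:

    import torch
    from pyrelational.informativeness import regression_bald

    # 25 stochastic forward passes (repeats) over 100 unlabelled samples
    mc_predictions = torch.randn(25, 100)
    scores = regression_bald(mc_predictions, axis=0)  # one score per sample
    to_label = torch.topk(scores, k=10).indices       # 10 most informative samples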

regression_expected_improvement(x: Tensor | Distribution | None = None, mean: Tensor | None = None, std: Tensor | None = None, max_label: float | Tensor = 0.0, axis: int = 0, xi: float = 0.01) Tensor[source]

Implements expected improvement based on max_label in the currently available data (reference). Either x or mean and std should be provided as input.

Parameters:
  • x – pytorch tensor or pytorch Distribution

  • mean – pytorch tensor corresponding to a model’s mean predictions for each sample

  • std – pytorch tensor corresponding to the standard deviation of a model’s predictions for each sample

  • max_label – max label in the labelled dataset

  • axis – index of the axis along which the repeats lie

  • xi – exploration parameter; larger values encourage querying samples with higher uncertainty

Returns:

pytorch tensor of scores
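
A brief sketch of the mean/std calling convention; the tensors and the incumbent best value are illustrative:

    import torch
    from pyrelational.informativeness import regression_expected_improvement

    mean = torch.tensor([0.2, 0.8, 0.5])  # mean prediction per sample
    std = torch.tensor([0.3, 0.1, 0.4])   # predictive standard deviation per sample
    ei = regression_expected_improvement(mean=mean, std=std, max_label=0.9)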

regression_least_confidence(x: Tensor | Distribution | None = None, std: Tensor | None = None, axis: int = 0) Tensor[source]

Implements least confidence scoring based on input x, returning the standard deviation score for each sample across repeats. Either x or std should be provided as input.

Parameters:
  • x – pytorch tensor of repeat by scores (or scores by repeat) or pytorch Distribution

  • std – pytorch tensor corresponding to the standard deviation of a model’s predictions for each sample

  • axis – index of the axis along which the repeats lie

Returns:

pytorch tensor of scores

regression_mean_prediction(x: Tensor | Distribution | None = None, mean: Tensor | None = None, axis: int = 0) Tensor[source]

Returns mean score for each sample across repeats. Either x or mean should be provided as input.

Parameters:
  • x – pytorch tensor of repeat by scores (or scores by repeat) or pytorch Distribution

  • mean – pytorch tensor corresponding to the mean of a model’s predictions for each sample

  • axis – index of the axis along which the repeats lie

Returns:

pytorch tensor of scores
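
The two scorers above share the same calling convention, reducing a repeats-by-samples tensor to one value per sample; a brief sketch with illustrative shapes:

    import torch
    from pyrelational.informativeness import (
        regression_least_confidence,
        regression_mean_prediction,
    )

    mc_predictions = torch.randn(25, 100)  # repeats along axis 0
    uncertainty = regression_least_confidence(mc_predictions, axis=0)  # std per sample
    prediction = regression_mean_prediction(mc_predictions, axis=0)    # mean per sample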

regression_thompson_sampling(x: Tensor, axis: int = 0) Tensor[source]

Implements Thompson sampling scoring (reference).

Parameters:
  • x – pytorch tensor of repeat by scores (or scores by repeat)

  • axis – index of the axis along which the repeats lie

Returns:

pytorch tensor of scores
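
A usage sketch; since Thompson sampling is stochastic, repeated calls may rank samples differently (shapes illustrative):

    import torch
    from pyrelational.informativeness import regression_thompson_sampling

    mc_predictions = torch.randn(25, 100)  # repeats along axis 0
    scores = regression_thompson_sampling(mc_predictions, axis=0)
    next_query = int(torch.argmax(scores))  # index of the sample to query next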

regression_upper_confidence_bound(x: Tensor | Distribution | None = None, mean: Tensor | None = None, std: Tensor | None = None, kappa: float = 1, axis: int = 0) Tensor[source]

Implements Upper Confidence Bound (UCB) scoring (reference). Either x or mean and std should be provided as input.

Parameters:
  • x – pytorch tensor or pytorch Distribution

  • mean – pytorch tensor corresponding to a model’s mean predictions for each sample

  • std – pytorch tensor corresponding to the standard deviation of a model’s predictions for each sample

  • kappa – trade-off parameter between exploitation and exploration

  • axis – index of the axis along which the repeats lie

Returns:

pytorch tensor of scores
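
A sketch showing how kappa trades off exploitation against exploration; UCB-style scores are commonly computed as mean + kappa * std, so larger kappa favours uncertain samples (values illustrative):

    import torch
    from pyrelational.informativeness import regression_upper_confidence_bound

    mean = torch.tensor([0.2, 0.8, 0.5])
    std = torch.tensor([0.3, 0.1, 0.4])
    exploit = regression_upper_confidence_bound(mean=mean, std=std, kappa=0.1)
    explore = regression_upper_confidence_bound(mean=mean, std=std, kappa=2.0)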

Informativeness functions for classification tasks

This module contains methods for scoring samples based on model uncertainty in classification tasks

This module contains functions for computing the informativeness values of a given probability distribution (outputs of a model/mc-dropout prediction, etc.)

classification_bald(prob_dist: Tensor) Tensor[source]

Implementation of Bayesian Active Learning by Disagreement (BALD) for classification tasks (reference)

Parameters:

prob_dist – 3D pytorch tensor of shape n_estimators x n_samples x n_classes

Returns:

1D pytorch tensor of scores
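
A usage sketch, assuming the stacked probabilities come from an ensemble or MC-dropout (shapes illustrative):

    import torch
    from pyrelational.informativeness import classification_bald

    # 10 estimators, 50 samples, 3 classes; probabilities sum to 1 over classes
    probs = torch.softmax(torch.randn(10, 50, 3), dim=-1)
    scores = classification_bald(probs)  # 1D tensor of 50 scores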

classification_entropy(prob_dist: Tensor, axis: int = -1) Tensor[source]

Returns the informativeness score of a probability distribution using entropy

The entropy based uncertainty is defined as

\(- \frac{1}{\log(n)} \sum_{i=1}^{n} p_i \log(p_i)\)

Parameters:
  • prob_dist – real number tensor whose elements add to 1.0 along an axis

  • axis – axis of prob_dist where probabilities add to 1

Returns:

tensor of entropy based uncertainties
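
A short sketch; the 1/log(n) normalisation means a uniform distribution scores 1 and a confident one scores close to 0 (values illustrative):

    import torch
    from pyrelational.informativeness import classification_entropy

    probs = torch.tensor([[0.25, 0.25, 0.25, 0.25],   # uniform: score near 1
                          [0.97, 0.01, 0.01, 0.01]])  # confident: score near 0
    scores = classification_entropy(probs, axis=-1)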

classification_least_confidence(prob_dist: Tensor, axis: int = -1) Tensor[source]

Returns the informativeness score of an array using least confidence sampling in a 0-1 range where 1 is the most uncertain

The least confidence uncertainty is the normalised difference between the most confident prediction and 100 percent confidence

Parameters:
  • prob_dist – real number tensor whose elements add to 1.0 along an axis

  • axis – axis of prob_dist where probabilities add to 1

Returns:

tensor with normalised least confidence scores

classification_margin_confidence(prob_dist: Tensor, axis: int = -1) Tensor[source]

Returns the informativeness score of a probability distribution using margin of confidence sampling in a 0-1 range where 1 is the most uncertain

The margin confidence uncertainty is the difference between the top two most confident predictions

Parameters:
  • prob_dist – real number tensor whose elements add to 1.0 along an axis

  • axis – axis of prob_dist where probabilities add to 1

Returns:

tensor with margin confidence scores

classification_ratio_confidence(prob_dist: Tensor, axis: int = -1) Tensor[source]

Returns the informativeness score of a probability distribution using ratio of confidence sampling in a 0-1 range where 1 is the most uncertain

The ratio confidence uncertainty is the ratio between the top two most confident predictions

Parameters:
  • prob_dist – real number tensor whose elements add to 1.0 along an axis

  • axis – axis of prob_dist where probabilities add to 1

Returns:

tensor of ratio confidence uncertainties
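
The three confidence-based scorers above differ only in how they compare the top predictions; a sketch contrasting them on the same distribution (values illustrative):

    import torch
    from pyrelational.informativeness import (
        classification_least_confidence,
        classification_margin_confidence,
        classification_ratio_confidence,
    )

    probs = torch.tensor([[0.45, 0.40, 0.15]])  # two closely competing classes
    lc = classification_least_confidence(probs)   # driven by 1 - 0.45
    mc = classification_margin_confidence(probs)  # driven by 0.45 - 0.40
    rc = classification_ratio_confidence(probs)   # driven by 0.40 / 0.45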

softmax(scores: Tensor, base: float = 2.718281828459045, axis: int = -1) Tensor[source]

Returns softmax array for array of scores

Converts a set of raw scores from a model (logits) into a probability distribution via softmax.

The probability distribution will be a set of real numbers such that each is in the range 0-1.0 and the sum is 1.0.

Assumes input is a pytorch tensor, e.g. tensor([1.0, 4.0, 2.0, 3.0])

Parameters:
  • scores – pytorch tensor of any positive/negative real numbers

  • base – the base for the exponential (default e)

  • axis – axis of scores along which the softmax is applied

Returns:

tensor of softmaxed scores
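
A sketch using the example tensor above:

    import torch
    from pyrelational.informativeness import softmax

    logits = torch.tensor([1.0, 4.0, 2.0, 3.0])
    probs = softmax(logits)                  # natural exponential by default
    probs_base2 = softmax(logits, base=2.0)  # smaller base gives a flatter distribution
    assert torch.isclose(probs.sum(), torch.tensor(1.0))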

Task agnostic informativeness functions

This module contains methods for scoring samples based on distances between featurisations of samples. These scorers are task-agnostic.

get_closest_query_to_centroids(centroids: ndarray[Any, dtype[float64]], query: ndarray[Any, dtype[float64]], cluster_assignment: ndarray[Any, dtype[int64]]) List[int][source]

Find the closest sample in query to each centroid.

Parameters:
  • centroids – array containing centroids

  • query – array containing query samples

  • cluster_assignment – array indicating which cluster each query sample is associated with

Returns:

list of indices of query samples
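
A toy two-cluster sketch; the arrays, shapes, and assignments are illustrative:

    import numpy as np
    from pyrelational.informativeness import get_closest_query_to_centroids

    centroids = np.array([[0.0, 0.0], [5.0, 5.0]])          # one row per cluster
    query = np.array([[0.1, 0.2], [4.9, 5.1], [1.0, 1.0]])  # candidate samples
    cluster_assignment = np.array([0, 1, 0])                # cluster of each row
    idx = get_closest_query_to_centroids(centroids, query, cluster_assignment)
    # idx holds, per cluster, the query sample closest to that cluster's centroid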

get_random_query_from_cluster(cluster_assignment: ndarray[Any, dtype[int64]]) List[int][source]

Get random indices drawn from each cluster.

Parameters:

cluster_assignment – array indicating what cluster each sample is associated with.

Returns:

list of indices of query samples

relative_distance(query_set: Tensor | ndarray[Any, dtype[Any]] | List[Any] | DataLoader[Any], reference_set: Tensor | ndarray[Any, dtype[Any]] | List[Any] | DataLoader[Any], metric: str | Callable[[...], Any] | None = 'euclidean', axis: int = 1) Tensor[source]

Function that returns the minimum distance, according to the input metric, from each sample in the query_set to the samples in the reference_set.

Parameters:
  • query_set – input containing the features of samples in the queryable pool. query_set should either be an array-like object or a pytorch dataloader whose first element in each batch is a featurisation of the samples in the batch.

  • reference_set – input containing the features of already queried samples against which the distances are computed. reference_set should either be an array-like object or a pytorch dataloader whose first element in each batch is a featurisation of the samples in the batch.

  • metric – defines the metric to be used to compute the distance. This should be supported by scikit-learn pairwise_distances function.

  • axis – integer indicating which dimension the features are

Returns:

pytorch tensor with one entry per sample in query_set, containing the minimum distance from each sample to the reference set
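
A usage sketch favouring samples far from the already labelled set (shapes illustrative):

    import torch
    from pyrelational.informativeness import relative_distance

    query_set = torch.randn(100, 16)     # 100 queryable samples, 16 features
    reference_set = torch.randn(20, 16)  # 20 already labelled samples
    dists = relative_distance(query_set, reference_set, metric="euclidean")
    most_novel = torch.topk(dists, k=5).indices  # far from anything labelled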

representative_sampling(query_set: Tensor | ndarray[Any, dtype[Any]] | List[Any] | DataLoader[Any], num_annotate: int, clustering_method: str | ClusterMixin = 'KMeans', **clustering_kwargs: Any | None) List[int][source]

Function that selects representative samples of the query set. Representative selection relies on clustering algorithms in scikit-learn.

Parameters:
  • query_set – input containing the features of samples in the queryable pool. query_set should either be an array-like object or a pytorch dataloader whose first element in each batch is a featurisation of the samples in the batch

  • num_annotate – number of representative samples to identify

  • clustering_method – name, or instantiated class, of the clustering method to use

  • clustering_kwargs – arguments to be passed to instantiate clustering class if a string is passed to clustering_method

Returns:

list containing the indices of the representative samples identified
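
A sketch passing a clustering keyword through to scikit-learn; n_clusters is an illustrative choice, forwarded via clustering_kwargs when a string method name is given:

    import torch
    from pyrelational.informativeness import representative_sampling

    query_set = torch.randn(200, 16)  # features of the queryable pool
    idx = representative_sampling(
        query_set, num_annotate=10, clustering_method="KMeans", n_clusters=10
    )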