pyrelational.informativeness¶
Informativeness functions for regression tasks¶
This module contains methods for scoring samples based on model uncertainty in regression tasks.
Most of these functions are simple, but giving them a name and a PyTorch implementation is useful for defining the different active learning strategies.
- regression_bald(x: Tensor, axis: int = 0) Tensor [source]¶
Implementation of Bayesian Active Learning by Disagreement (BALD) for regression tasks (reference)
- Parameters:
x – pytorch Tensor
axis – index of the axis along which the repeats lie
- Returns:
pytorch tensor of scores
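BALD scores each sample by the disagreement between stochastic repeats (e.g. MC-dropout passes). A minimal sketch, using the variance across repeats as the disagreement measure; this is illustrative and may differ from the library's exact formula:

```python
import torch

def bald_sketch(x: torch.Tensor, axis: int = 0) -> torch.Tensor:
    # Disagreement across stochastic repeats, approximated here by
    # the variance of the predictions along the repeat axis.
    # (Hypothetical sketch; pyrelational's exact formula may differ.)
    return x.var(dim=axis)

# 10 MC-dropout repeats for 5 samples
x = torch.randn(10, 5)
scores = bald_sketch(x)  # one score per sample, shape (5,)
```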
- regression_expected_improvement(x: Tensor | Distribution | None = None, mean: Tensor | None = None, std: Tensor | None = None, max_label: float | Tensor = 0.0, axis: int = 0, xi: float = 0.01) Tensor [source]¶
Implements expected improvement based on max_label in the currently available data (reference). Either x or mean and std should be provided as input.
- Parameters:
x – pytorch tensor or pytorch Distribution
mean – pytorch tensor corresponding to a model’s mean predictions for each sample
std – pytorch tensor corresponding to the standard deviation of a model’s predictions for each sample
max_label – max label in the labelled dataset
axis – index of the axis along which the repeats lie
xi – small positive value added to the improvement threshold to encourage exploration
- Returns:
pytorch tensor of scores
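When mean and std are supplied, expected improvement can be sketched with the standard closed-form expression for a Gaussian predictive distribution (a sketch under that assumption, not necessarily the library's exact implementation):

```python
import torch
from torch.distributions import Normal

def ei_sketch(mean: torch.Tensor, std: torch.Tensor,
              max_label: float = 0.0, xi: float = 0.01) -> torch.Tensor:
    # Closed-form EI for a Gaussian predictive distribution;
    # xi shifts the improvement threshold to encourage exploration.
    normal = Normal(0.0, 1.0)
    improvement = mean - max_label - xi
    z = improvement / std
    return improvement * normal.cdf(z) + std * normal.log_prob(z).exp()

mean = torch.tensor([0.5, 1.5])
std = torch.tensor([0.1, 0.2])
scores = ei_sketch(mean, std, max_label=1.2)  # sample 1 is far more promising
```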
- regression_least_confidence(x: Tensor | Distribution | None = None, std: Tensor | None = None, axis: int = 0) Tensor [source]¶
Implements least-confidence scoring based on input x: returns the standard deviation score for each sample across repeats. Either x or std should be provided as input.
- Parameters:
x – pytorch tensor of repeat by scores (or scores by repeat) or pytorch Distribution
std – pytorch tensor corresponding to the standard deviation of a model’s predictions for each sample
axis – index of the axis along which the repeats lie
- Returns:
pytorch tensor of scores
- regression_mean_prediction(x: Tensor | Distribution | None = None, mean: Tensor | None = None, axis: int = 0) Tensor [source]¶
Returns mean score for each sample across repeats. Either x or mean should be provided as input.
- Parameters:
x – pytorch tensor of repeat by scores (or scores by repeat) or pytorch Distribution
mean – pytorch tensor corresponding to the mean of a model’s predictions for each sample
axis – index of the axis along which the repeats lie
- Returns:
pytorch tensor of scores
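Both of the last two scorers reduce a matrix of repeated predictions along the repeat axis. A minimal sketch of the two reductions on a repeats-by-samples tensor (the library also accepts a Distribution, which is not shown here):

```python
import torch

# 10 stochastic repeats (e.g. MC-dropout passes) for 4 samples
x = torch.randn(10, 4)

# least-confidence style: score each sample by the spread of its repeats
std_scores = x.std(dim=0)    # shape (4,)

# mean-prediction style: score each sample by its average prediction
mean_scores = x.mean(dim=0)  # shape (4,)
```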
- regression_thompson_sampling(x: Tensor, axis: int = 0) Tensor [source]¶
Implements Thompson sampling scoring (reference).
- Parameters:
x – pytorch tensor
axis – index of the axis along which the repeats lie
- Returns:
pytorch tensor of scores
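Thompson sampling can be sketched as drawing one repeat at random per sample and using that draw as the score. A sketch assuming a 2D repeats-by-samples tensor; the library's implementation may differ:

```python
import torch

def thompson_sketch(x: torch.Tensor) -> torch.Tensor:
    # x: n_repeats x n_samples. For each sample, return the value of
    # one randomly chosen repeat (one posterior draw per sample).
    n_repeats, n_samples = x.shape
    draw = torch.randint(n_repeats, (n_samples,))
    return x[draw, torch.arange(n_samples)]

x = torch.randn(10, 5)
scores = thompson_sketch(x)  # shape (5,)
```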
- regression_upper_confidence_bound(x: Tensor | Distribution | None = None, mean: Tensor | None = None, std: Tensor | None = None, kappa: float = 1, axis: int = 0) Tensor [source]¶
Implements Upper Confidence Bound (UCB) scoring (reference). Either x or mean and std should be provided as input.
- Parameters:
x – pytorch tensor or pytorch Distribution
mean – pytorch tensor corresponding to a model’s mean predictions for each sample
std – pytorch tensor corresponding to the standard deviation of a model’s predictions for each sample
kappa – trade-off parameter between exploitation and exploration
axis – index of the axis along which the repeats lie
- Returns:
pytorch tensor of scores
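The UCB score itself is a one-liner: the mean prediction plus kappa times the uncertainty. A sketch for the mean/std input form:

```python
import torch

def ucb_sketch(mean: torch.Tensor, std: torch.Tensor,
               kappa: float = 1.0) -> torch.Tensor:
    # Optimistic score: larger kappa weights the uncertainty term more
    # heavily, trading exploitation (mean) for exploration (std).
    return mean + kappa * std

mean = torch.tensor([1.0, 0.5])
std = torch.tensor([0.1, 0.9])
scores = ucb_sketch(mean, std, kappa=1.0)  # tensor([1.1, 1.4])
```

With kappa = 0 this reduces to pure exploitation on the mean prediction.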
Informativeness functions for classification tasks¶
This module contains methods for scoring samples based on model uncertainty in classification tasks, i.e. functions for computing the informativeness values of a given probability distribution (outputs of a model, MC-dropout predictions, etc.).
- classification_bald(prob_dist: Tensor) Tensor [source]¶
Implementation of Bayesian Active Learning by Disagreement (BALD) for classification tasks (reference)
- Parameters:
prob_dist – 3D pytorch tensor of shape n_estimators x n_samples x n_classes
- Returns:
1D pytorch tensor of scores
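For classification, BALD is the mutual information between the predicted label and the model: the entropy of the mean predictive distribution minus the mean entropy of the individual estimators. A minimal sketch (the epsilon guard is an assumption to avoid log(0)):

```python
import torch

def bald_cls_sketch(prob_dist: torch.Tensor) -> torch.Tensor:
    # prob_dist: n_estimators x n_samples x n_classes
    eps = 1e-12  # guards against log(0)
    mean_p = prob_dist.mean(dim=0)
    entropy_of_mean = -(mean_p * (mean_p + eps).log()).sum(dim=-1)
    mean_entropy = -(prob_dist * (prob_dist + eps).log()).sum(dim=-1).mean(dim=0)
    return entropy_of_mean - mean_entropy  # mutual information per sample

# two estimators that completely disagree on sample 0 and agree on sample 1
p = torch.tensor([[[1.0, 0.0], [0.5, 0.5]],
                  [[0.0, 1.0], [0.5, 0.5]]])
scores = bald_cls_sketch(p)  # high for sample 0, near zero for sample 1
```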
- classification_entropy(prob_dist: Tensor, axis: int = -1) Tensor [source]¶
Returns the informativeness score of a probability distribution using entropy
The entropy based uncertainty is defined as
\(- \frac{1}{\log(n)} \sum_{i}^{n} p_i \log (p_i)\)
- Parameters:
prob_dist – real number tensor whose elements add to 1.0 along an axis
axis – axis of prob_dist where probabilities add to 1
- Returns:
tensor of entropy based uncertainties
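The normalised entropy above can be sketched directly; dividing by log(n) makes the uniform distribution score exactly 1 (the epsilon guard is an assumption to avoid log(0)):

```python
import math
import torch

def entropy_sketch(prob_dist: torch.Tensor, axis: int = -1) -> torch.Tensor:
    eps = 1e-12  # guards against log(0)
    n = prob_dist.shape[axis]
    h = -(prob_dist * (prob_dist + eps).log()).sum(dim=axis)
    return h / math.log(n)  # normalised so the uniform distribution scores 1

uniform = torch.full((4,), 0.25)
one_hot = torch.tensor([1.0, 0.0, 0.0, 0.0])
```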
- classification_least_confidence(prob_dist: Tensor, axis: int = -1) Tensor [source]¶
Returns the informativeness score of an array using least-confidence sampling, in a 0-1 range where 1 is the most uncertain
The least confidence uncertainty is the normalised difference between the most confident prediction and 100 percent confidence
- Parameters:
prob_dist – real number tensor whose elements add to 1.0 along an axis
axis – axis of prob_dist where probabilities add to 1
- Returns:
tensor with normalised least confidence scores
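One common normalisation (assumed here, not necessarily the library's exact one) rescales the raw complement of the top prediction by n/(n-1), so a uniform distribution scores exactly 1:

```python
import torch

def least_conf_sketch(prob_dist: torch.Tensor, axis: int = -1) -> torch.Tensor:
    n = prob_dist.shape[axis]
    # difference between 100% confidence and the top prediction,
    # rescaled so a uniform distribution scores exactly 1
    return (1.0 - prob_dist.max(dim=axis).values) * n / (n - 1)

uniform = torch.full((4,), 0.25)
one_hot = torch.tensor([1.0, 0.0, 0.0, 0.0])
```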
- classification_margin_confidence(prob_dist: Tensor, axis: int = -1) Tensor [source]¶
Returns the informativeness score of a probability distribution using margin of confidence sampling, in a 0-1 range where 1 is the most uncertain. The margin confidence uncertainty is the difference between the top two most confident predictions.
- Parameters:
prob_dist – real number tensor whose elements add to 1.0 along an axis
axis – axis of prob_dist where probabilities add to 1
- Returns:
tensor with margin confidence scores
- classification_ratio_confidence(prob_dist: Tensor, axis: int = -1) Tensor [source]¶
Returns the informativeness score of a probability distribution using ratio of confidence sampling, in a 0-1 range where 1 is the most uncertain. The ratio confidence uncertainty is the ratio between the top two most confident predictions.
- Parameters:
prob_dist – real number tensor whose elements add to 1.0 along an axis
axis – axis of prob_dist where probabilities add to 1
- Returns:
tensor of ratio confidence uncertainties
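Both of the last two scorers compare the top two class probabilities. A sketch of each, assuming the class axis is the last one:

```python
import torch

def margin_sketch(prob_dist: torch.Tensor) -> torch.Tensor:
    top2 = prob_dist.topk(2, dim=-1).values
    # a small margin between the top two classes means high uncertainty
    return 1.0 - (top2[..., 0] - top2[..., 1])

def ratio_sketch(prob_dist: torch.Tensor) -> torch.Tensor:
    top2 = prob_dist.topk(2, dim=-1).values
    # a ratio close to 1 means the top two classes are nearly tied
    return top2[..., 1] / top2[..., 0]

p = torch.tensor([[0.5, 0.5, 0.0], [0.8, 0.1, 0.1]])
margins = margin_sketch(p)  # tensor([1.0, 0.3])
ratios = ratio_sketch(p)    # tensor([1.0, 0.125])
```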
- softmax(scores: Tensor, base: float = 2.718281828459045, axis: int = -1) Tensor [source]¶
Returns the softmax of an array of scores
Converts a set of raw scores from a model (logits) into a probability distribution via softmax: a set of real numbers, each in the range 0.0-1.0, whose sum is 1.0.
Assumes the input is a pytorch tensor, e.g. tensor([1.0, 4.0, 2.0, 3.0])
- Parameters:
scores – (pytorch tensor) a pytorch tensor of any positive/negative real numbers.
base – the base for the exponential (default e)
axis – the axis of scores along which to apply the softmax
- Returns:
tensor of softmaxed scores
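A base-b softmax follows from the identity b ** x = exp(x * ln(b)); a minimal sketch:

```python
import math
import torch

def softmax_sketch(scores: torch.Tensor, base: float = math.e,
                   axis: int = -1) -> torch.Tensor:
    # b ** x == exp(x * ln(b)), so a base-b softmax is a standard
    # softmax applied to scores scaled by ln(b)
    exps = (scores * math.log(base)).exp()
    return exps / exps.sum(dim=axis, keepdim=True)

probs = softmax_sketch(torch.tensor([1.0, 4.0, 2.0, 3.0]), base=2.0)
# proportional to 2, 16, 4, 8 -> tensor([2/30, 16/30, 4/30, 8/30])
```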
Task agnostic informativeness functions¶
This module contains methods for scoring samples based on distances between featurization of samples. These scorers are task-agnostic.
- get_closest_query_to_centroids(centroids: ndarray[Any, dtype[float64]], query: ndarray[Any, dtype[float64]], cluster_assignment: ndarray[Any, dtype[int64]]) List[int] [source]¶
Find, for each centroid, the closest sample in query.
- Parameters:
centroids – array containing centroids
query – array containing query samples
cluster_assignment – array indicating which cluster each query sample is associated with
- Returns:
list of indices of query samples
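A minimal sketch of the selection logic: for each centroid, restrict to the query samples assigned to its cluster and pick the one with the smallest euclidean distance (this assumes every cluster has at least one member):

```python
import numpy as np

def closest_to_centroids_sketch(centroids, query, cluster_assignment):
    # For each centroid, pick the index of the nearest query sample
    # among those assigned to its cluster.
    picks = []
    for k, centroid in enumerate(centroids):
        members = np.where(cluster_assignment == k)[0]
        dists = np.linalg.norm(query[members] - centroid, axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
query = np.array([[1.0, 1.0], [9.0, 9.0], [0.1, 0.0], [10.0, 11.0]])
assignment = np.array([0, 1, 0, 1])
picks = closest_to_centroids_sketch(centroids, query, assignment)  # [2, 3]
```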
- get_random_query_from_cluster(cluster_assignment: ndarray[Any, dtype[int64]]) List[int] [source]¶
Get random indices drawn from each cluster.
- Parameters:
cluster_assignment – array indicating what cluster each sample is associated with.
- Returns:
list of indices of query samples
- relative_distance(query_set: Tensor | ndarray[Any, dtype[Any]] | List[Any] | DataLoader[Any], reference_set: Tensor | ndarray[Any, dtype[Any]] | List[Any] | DataLoader[Any], metric: str | Callable[[...], Any] | None = 'euclidean', axis: int = 1) Tensor [source]¶
Function that returns the minimum distance, according to the input metric, from each sample in query_set to the samples in reference_set.
- Parameters:
query_set – input containing the features of samples in the queryable pool. The query set should either be an array-like object or a pytorch dataloader whose first element in each batch is a featurisation of the samples in the batch.
reference_set – input containing the features of already queried samples, against which the distances are computed. The reference set should either be an array-like object or a pytorch dataloader whose first element in each batch is a featurisation of the samples in the batch.
metric – defines the metric to be used to compute the distance. This should be supported by scikit-learn pairwise_distances function.
axis – integer indicating which dimension the features are
- Returns:
1D pytorch tensor, with one entry per sample in query_set, containing the minimum distance from that sample to the reference set
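For tensor inputs and the euclidean metric, the computation reduces to a pairwise distance matrix followed by a row-wise minimum. A sketch using torch.cdist as a stand-in for scikit-learn's pairwise_distances:

```python
import torch

def relative_distance_sketch(query_set: torch.Tensor,
                             reference_set: torch.Tensor) -> torch.Tensor:
    # n_query x n_reference euclidean distances, then the minimum
    # over the reference set for each query sample
    dists = torch.cdist(query_set, reference_set)
    return dists.min(dim=1).values

query = torch.tensor([[0.0, 0.0], [3.0, 4.0]])
reference = torch.tensor([[0.0, 1.0], [3.0, 0.0]])
scores = relative_distance_sketch(query, reference)  # tensor([1., 4.])
```

High scores indicate samples far from everything already labelled, which is the diversity signal this scorer provides.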
- representative_sampling(query_set: Tensor | ndarray[Any, dtype[Any]] | List[Any] | DataLoader[Any], num_annotate: int, clustering_method: str | ClusterMixin = 'KMeans', **clustering_kwargs: Any | None) List[int] [source]¶
Function that selects representative samples of the query set. Representative selection relies on clustering algorithms in scikit-learn.
- Parameters:
query_set – input containing the features of samples in the queryable pool. The query set should either be an array-like object or a pytorch dataloader whose first element in each batch is a featurisation of the samples in the batch
num_annotate – number of representative samples to identify
clustering_method – name, or instantiated class, of the clustering method to use
clustering_kwargs – arguments to be passed to instantiate clustering class if a string is passed to clustering_method
- Returns:
list containing the indices of the representative samples identified
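Since the function relies on scikit-learn clustering, one plausible selection rule is to fit KMeans and return, per cluster, the sample closest to its centroid. A sketch under that assumption; the library's actual selection rule may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_sampling_sketch(query_set, num_annotate, **clustering_kwargs):
    # Cluster the pool, then return the index of the sample closest
    # to each centroid. One plausible notion of "representative";
    # pyrelational's selection rule may differ.
    kmeans = KMeans(n_clusters=num_annotate, n_init=10, **clustering_kwargs)
    labels = kmeans.fit_predict(query_set)
    picks = []
    for k, centroid in enumerate(kmeans.cluster_centers_):
        members = np.where(labels == k)[0]
        dists = np.linalg.norm(query_set[members] - centroid, axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks

rng = np.random.default_rng(0)
pool = np.vstack([rng.normal(0.0, 0.1, (20, 2)),   # blob A: indices 0-19
                  rng.normal(5.0, 0.1, (20, 2))])  # blob B: indices 20-39
picks = representative_sampling_sketch(pool, num_annotate=2, random_state=0)
# one representative per blob
```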