pyrelational.pipeline

Pipeline

Active learning pipeline.

class Pipeline(data_manager: DataManager, model_manager: ModelManager[Any, Any], strategy: Strategy, oracle: Oracle | None = None)

Bases: object

Implementation of the active learning pipeline.

The pipeline facilitates communication between the DataManager, the ModelManager, the Strategy, and (optionally) the Oracle to enact a generic active learning cycle.

Instantiate an active learning pipeline.

Parameters:

data_manager – A pyrelational data manager, which keeps track of what has been labelled and creates data loaders for active learning
model_manager – A pyrelational model manager, which handles the instantiation, training, and testing of a machine learning model for the data in the data manager
strategy – A pyrelational active learning strategy, which implements the informativeness measure and the selection algorithm being used
oracle – An oracle instance, which interfaces with a concrete oracle to obtain labels for the observations suggested by the strategy

compute_current_performance(test_loader: DataLoader[Any] | None = None, query: List[int] | None = None) → None

Compute performance of model.

Parameters:

test_loader – PyTorch DataLoader with test data compatible with the model. Optional, as the test loader can often be generated from data_manager; it is accepted here for cases where none has been defined or a new test set is used.
query – List of indices selected for labelling. Used for calculating the hit ratio metric

Returns:

dictionary containing metric results on test set

compute_hit_ratio(result: Dict[str, float], query: List[int] | None = None) → Dict[str, float]

Compute the hit ratio metric.

Used within the current performance and theoretical performance methods.

Parameters:

result – Dict or Dict-like of metrics
query – List of indices selected for labelling. Used for calculating the hit ratio metric

Returns:

updated result dictionary with a “hit_ratio” key holding the hit ratio result
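The reference above does not spell out how the hit ratio is calculated. As a rough sketch only, assuming it is the fraction of queried indices that fall inside some known set of high-value candidates, the update to the result dictionary might look like the following. The `hit_ratio_sketch` helper and its `top_candidates` argument are hypothetical and are not part of pyrelational.

```python
from typing import Dict, List, Optional, Set


def hit_ratio_sketch(
    result: Dict[str, float],
    query: Optional[List[int]],
    top_candidates: Set[int],
) -> Dict[str, float]:
    """Return a copy of result with a "hit_ratio" entry (assumed definition)."""
    updated = dict(result)
    if query:  # leave the metrics untouched when there is no query
        hits = len(set(query) & top_candidates)
        updated["hit_ratio"] = hits / len(query)
    return updated


# Two of the four queried indices (4 and 9) are in the candidate set.
metrics = hit_ratio_sketch(
    {"accuracy": 0.8}, query=[1, 4, 7, 9], top_candidates={4, 9, 12}
)
```

Returning a copy keeps the caller's dictionary unchanged; whether the real method updates in place is not stated in this reference.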

compute_theoretical_performance(test_loader: DataLoader[Any] | None = None) → Dict[str, float]

Return the performance of a model trained on the fully labelled dataset, evaluated on the test data.

Typically used for evaluation, to establish a theoretical benchmark of model performance when all available training data is labelled. This is the “horizontal” line in area-under-learning-curve plots for active learning.

This is not meaningful in a real-world active learning setting, where the full set of labels is unavailable, so it is not computed in __init__.

Parameters:

test_loader – PyTorch DataLoader with test data compatible with the model. Optional, as the test loader can often be generated from data_manager; it is accepted here for cases where none has been defined or a new test set is used.

Returns:

dictionary containing metric results on the test set
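To illustrate how the benchmark serves as the “horizontal” line, the sketch below measures how far each iteration of a learning curve sits below it. The `gap_to_benchmark` helper is hypothetical, and the scores are placeholders rather than real results.

```python
from typing import List


def gap_to_benchmark(curve: List[float], benchmark: float) -> List[float]:
    """Distance between each iteration's score and the full-data benchmark."""
    return [benchmark - score for score in curve]


# Placeholder scores, not real results.
learning_curve = [0.5, 0.75, 0.875]  # one score per active learning iteration
full_data_score = 1.0                # stand-in for the theoretical benchmark
gaps = gap_to_benchmark(learning_curve, full_data_score)
```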

property dataset_size: int

Return the number of total data points.

property l_indices: List[int]

Return the indices of labelled samples.

property l_loader: DataLoader[Any]

Return the dataloader containing labelled data.

log_labelled_by(indices: List[int], tag: str | None = None) → None

Update the dictionary that records what the observation was labelled by.

Default behaviour is to map each observation to the iteration at which it was labelled.

Parameters:

indices – list of indices selected for labelling
tag – string which indicates what the observations were labelled by
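The bookkeeping described above can be sketched with a plain dictionary. This is an illustration only: `log_labelled_by_sketch`, the `labelled_by` dictionary, and the exact value stored per index are assumptions, not the library's internals.

```python
from typing import Dict, List, Optional, Union


def log_labelled_by_sketch(
    labelled_by: Dict[int, Union[int, str]],
    indices: List[int],
    iteration: int,
    tag: Optional[str] = None,
) -> None:
    """Record the tag for each index if given, else the current iteration."""
    for index in indices:
        labelled_by[index] = tag if tag is not None else iteration


record: Dict[int, Union[int, str]] = {}
log_labelled_by_sketch(record, [3, 8], iteration=0)  # default: iteration number
log_labelled_by_sketch(record, [5], iteration=1, tag="expert_review")
```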

property percentage_labelled: float

Return the percentage of total available dataset labelled.
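As a small worked example of the arithmetic, assuming the value is expressed on a 0 to 100 scale and that the dataset size counts every point available for labelling (both are assumptions, and `percentage_labelled_sketch` is a hypothetical helper):

```python
from typing import List


def percentage_labelled_sketch(l_indices: List[int], dataset_size: int) -> float:
    """Share of the dataset that is labelled, on an assumed 0-100 scale."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return 100.0 * len(l_indices) / dataset_size


# 5 labelled points out of 20 in total.
pct = percentage_labelled_sketch([0, 1, 2, 3, 4], dataset_size=20)
```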

query(indices: List[int]) → None

Update labels based on indices selected for labelling.

Parameters:

indices – List of indices selected for labelling

run(num_annotate: int, num_iterations: int | None = None, test_loader: DataLoader[Any] | None = None, *strategy_args: Any, **strategy_kwargs: Any) → None

Run the pipeline.

Given the number of samples to annotate and a test loader, this method runs the full active learning process: it trains the model on the labelled set and records the current performance. It then computes uncertainties for the unlabelled observations, ranks them, and has the top num_annotate observations labelled and added to the next iteration’s labelled dataset L’. This repeats until no observations are left in the unlabelled set or, if num_iterations is given, until that many iterations have been performed.

Parameters:

num_annotate – number of observations to get annotated per iteration
num_iterations – number of active learning loops to perform
test_loader – test data with which to evaluate the current state of the model given the labelled set L
strategy_args – optional additional args for strategy call
strategy_kwargs – optional additional kwargs for strategy call
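The cycle that run describes can be sketched in plain Python, with no pyrelational or PyTorch dependency. Everything here is a stand-in: `run_sketch`, `train_and_evaluate`, and `select` are hypothetical names for the roles played by the model manager, strategy, and oracle, and the stopping rule for num_iterations is an assumption based on the parameter description.

```python
from typing import Callable, Dict, List, Optional, Set


def run_sketch(
    labelled: Set[int],
    unlabelled: Set[int],
    num_annotate: int,
    train_and_evaluate: Callable[[Set[int]], float],
    select: Callable[[Set[int], int], List[int]],
    num_iterations: Optional[int] = None,
) -> Dict[int, float]:
    """Train, evaluate, select, label; repeat until the pool is exhausted."""
    performances: Dict[int, float] = {}
    iteration = 0
    while unlabelled:
        if num_iterations is not None and iteration >= num_iterations:
            break
        # Train on the current labelled set and record the score.
        performances[iteration] = train_and_evaluate(labelled)
        # Ask the stand-in strategy for the next batch to label.
        chosen = select(unlabelled, num_annotate)
        if not chosen:
            break  # avoid looping forever if nothing is selected
        labelled.update(chosen)  # the oracle's labels would arrive here
        unlabelled.difference_update(chosen)
        iteration += 1
    return performances


def toy_score(labelled: Set[int]) -> float:
    return len(labelled) / 10  # toy metric that grows with the labelled set


def pick_smallest(pool: Set[int], k: int) -> List[int]:
    return sorted(pool)[:k]  # toy selection rule, not a real strategy


history = run_sketch({0, 1}, set(range(2, 10)), 3, toy_score, pick_smallest)
```

With 8 unlabelled points and 3 annotated per iteration, the toy run finishes after three iterations; the last batch holds only the 2 remaining points.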

step(num_annotate: int, *args: Any, **kwargs: Any) → List[int]

Ask the strategy to provide indices of unlabelled observations for labelling by the oracle.

Parameters:

num_annotate – Number of points to annotate

Returns:

list of indices to label from the dataset
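A single selection step can be sketched as ranking unlabelled indices by an informativeness score and keeping the top num_annotate. This is only one possible selection rule; the `step_sketch` helper and the `scores` mapping are hypothetical, and a real strategy may select differently.

```python
from typing import Dict, List


def step_sketch(scores: Dict[int, float], num_annotate: int) -> List[int]:
    """Return the num_annotate unlabelled indices with the highest scores."""
    ranked = sorted(scores, key=lambda index: scores[index], reverse=True)
    return ranked[:num_annotate]


# Keys are unlabelled indices; values are made-up informativeness scores.
suggested = step_sketch({10: 0.1, 11: 0.9, 12: 0.4, 13: 0.7}, num_annotate=2)
```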

summary() → DataFrame

Construct a pandas table of performances of the model over the active learning iterations.

property test_loader: DataLoader[Any]

Return the dataloader containing test data.

property train_loader: DataLoader[Any]

Return the dataloader containing train data.

property u_indices: List[int]

Return the indices of unlabelled samples.

property u_loader: DataLoader[Any]

Return the dataloader containing unlabelled data.

property valid_loader: DataLoader[Any] | None

Return the dataloader containing validation data.