pyrelational.pipeline

Pipeline

This module defines the pipeline, which uses the data manager, sampling functions, and model to create acquisition functions and act as the general arbiter of the active learning loop.

class Pipeline(data_manager: DataManager, model_manager: ModelManager[Any, Any], strategy: Strategy, oracle: Oracle | None = None)[source]

Bases: object

The pipeline facilitates the communication between

  • DataManager

  • ModelManager

  • Strategy

  • Oracle (Optional)

to enact a generic active learning cycle.

Parameters:
  • data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning

  • model_manager – A pyrelational model manager which handles the instantiation, training, and testing of a machine learning model for the data in the data manager

  • strategy – A pyrelational active learning strategy, which implements the informativeness measure and the selection algorithm to be used

  • oracle – An oracle instance, which interfaces with a concrete oracle to obtain labels for the observations suggested by the strategy
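
As a rough illustration of how these components fit together, a pipeline can be assembled as in the sketch below. The data_manager, model_manager, strategy, and oracle objects are placeholders for whatever concrete DataManager, ModelManager, Strategy, and Oracle implementations your project uses; their construction is not shown.

    from pyrelational.pipeline import Pipeline

    # Sketch only: the four components below are assumed to have been built
    # elsewhere from the corresponding pyrelational base classes.
    pipeline = Pipeline(
        data_manager=data_manager,    # tracks labelled/unlabelled indices, builds loaders
        model_manager=model_manager,  # instantiates, trains, and tests the model
        strategy=strategy,            # scores and ranks unlabelled observations
        oracle=oracle,                # optional; supplies labels for queried indices
    )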

compute_current_performance(test_loader: DataLoader[Any] | None = None, query: List[int] | None = None) None[source]

Compute the current performance of the model.

Parameters:
  • test_loader – PyTorch DataLoader with test data compatible with the model. Optional, as the test loader can usually be generated from the data_manager; it is provided here for the case where one has not been defined or a new test set is used.

  • query – List of indices selected for labelling. Used for calculating the hit ratio metric

Returns:

None. The metric results on the test set are recorded internally and can be retrieved via summary().

compute_hit_ratio(result: Dict[str, float], query: List[int] | None = None) Dict[str, float][source]

Utility function for computing the hit ratio, as used within the compute_current_performance and compute_theoretical_performance methods.

Parameters:
  • result – Dict or Dict-like of metrics

  • query – List of indices selected for labelling. Used for calculating the hit ratio metric

Returns:

updated result dictionary with a “hit_ratio” key containing the hit ratio result

compute_theoretical_performance(test_loader: DataLoader[Any] | None = None) Dict[str, float][source]

Returns the performance of the model trained on the fully labelled dataset against the test data. Typically used to establish a theoretical benchmark of model performance given that all available training data is labelled; this corresponds to the “horizontal” line in area-under-learning-curve plots for active learning.

This would not be meaningful in a real active learning setting, where the full dataset is not labelled, hence it is not called as part of __init__.

Parameters:

test_loader – PyTorch DataLoader with test data compatible with the model. Optional, as the test loader can usually be generated from the data_manager; it is provided here for the case where one has not been defined or a new test set is used.

Returns:

dictionary containing metric results on test set
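
For example, a theoretical benchmark can be recorded once up front. The sketch below assumes a test_loader compatible with the model; the metric keys in the returned dictionary depend on the model manager.

    # Performance attainable if every training observation were labelled;
    # serves as an upper-bound reference for the active learning curve.
    theoretical = pipeline.compute_theoretical_performance(test_loader=test_loader)
    print(theoretical)  # metric names depend on the model manager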

property dataset_size: int

Total number of data points.

property l_indices: List[int]

Indices of labelled samples.

property l_loader: DataLoader[Any]

Dataloader containing labelled data.

log_labelled_by(indices: List[int], tag: str | None = None) None[source]

Update the dictionary that records what each observation was labelled by. The default behaviour is to map each observation to the iteration at which it was labelled.

Parameters:
  • indices – list of indices selected for labelling

  • tag – string which indicates what the observations were labelled by

property percentage_labelled: float

Percentage of total available dataset labelled.

query(indices: List[int]) None[source]

Updates labels based on the indices selected for labelling.

Parameters:

indices – List of indices selected for labelling

run(num_annotate: int, num_iterations: int | None = None, test_loader: DataLoader[Any] | None = None, *strategy_args: Any, **strategy_kwargs: Any) None[source]

Given the number of samples to annotate per iteration and a test loader, this method runs the full active learning process: it trains the model on the labelled set and records the current performance, then computes uncertainties for the unlabelled observations, ranks them, and has the top num_annotate observations labelled so that they are added to the next iteration’s labelled dataset L’. This process repeats until there are no observations left in the unlabelled set, or until num_iterations iterations have been completed if that argument is given.

Parameters:
  • num_annotate – number of observations to get annotated per iteration

  • num_iterations – number of active learning iterations to perform

  • test_loader – test data with which we evaluate the current state of the model given the labelled set L

  • strategy_args – optional additional args for strategy call

  • strategy_kwargs – optional additional kwargs for strategy call
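
A minimal end-to-end call, assuming the pipeline and test_loader from the earlier sketches, might look as follows (argument values are illustrative):

    # Annotate 10 observations per iteration for 5 iterations; if num_iterations
    # were omitted, the loop would continue until the unlabelled set is exhausted.
    pipeline.run(num_annotate=10, num_iterations=5, test_loader=test_loader)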

step(num_annotate: int, *args: Any, **kwargs: Any) List[int][source]

Ask the strategy to provide indices of unlabelled observations for labelling by the oracle

Parameters:

num_annotate – Number of points to annotate

Returns:

list of indices to label from the dataset
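
When finer control is needed than run() provides, step() and query() can be combined into a manual loop. The sketch below assumes the pipeline and test_loader from before; it mirrors what run() does at a high level, though the exact internal behaviour of run() may differ.

    while len(pipeline.u_indices) > 0:
        to_label = pipeline.step(num_annotate=10)  # strategy proposes indices
        pipeline.query(to_label)                   # obtain labels for those indices
        pipeline.compute_current_performance(test_loader=test_loader, query=to_label)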

summary() DataFrame[source]

Construct a pandas table of performances of the model over the active learning iterations.
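
For instance, after the loop has run, the recorded performances can be inspected as a pandas DataFrame:

    df = pipeline.summary()
    print(df)  # performances recorded over the active learning iterations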

property test_loader: DataLoader[Any]

Dataloader containing test data.

property train_loader: DataLoader[Any]

Dataloader containing train data.

property u_indices: List[int]

Indices of unlabelled samples.

property u_loader: DataLoader[Any]

Dataloader containing unlabelled data.

property valid_loader: DataLoader[Any] | None

Dataloader containing validation data.