pyrelational.pipeline¶
Pipeline¶
This module defines the acquisition manager, which utilises the data manager, sampling functions, and model to create acquisition functions and act as the general arbiter of the active learning pipeline.
- class Pipeline(data_manager: DataManager, model_manager: ModelManager[Any, Any], strategy: Strategy, oracle: Oracle | None = None)[source]¶
Bases:
object
The pipeline facilitates the communication between the
DataManager,
ModelManager,
Strategy, and
Oracle (optional)
to enact a generic active learning cycle.
- Parameters:
data_manager – A pyrelational data manager which keeps track of what has been labelled and creates data loaders for active learning
model_manager – A pyrelational model manager which handles the instantiation, training, testing of a machine learning model for the data in the data manager
strategy – A pyrelational active learning strategy which implements the informativeness measure and the selection algorithm being used
oracle – An oracle instance which interfaces with a concrete oracle to obtain labels for the observations suggested by the strategy
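Schematically, the communication the pipeline arbitrates can be sketched with stand-in stubs (the real pyrelational classes require a dataset, a model class, and a strategy implementation; the stub names and methods below are illustrative, not the library's API):

```python
from typing import List

class StubDataManager:
    """Tracks which indices are labelled vs unlabelled."""
    def __init__(self, num_samples: int, initial_labels: List[int]):
        self.l_indices = list(initial_labels)  # labelled pool
        self.u_indices = [i for i in range(num_samples) if i not in initial_labels]

    def update_train_labels(self, indices: List[int]) -> None:
        # Move freshly labelled indices from the unlabelled to the labelled pool
        self.l_indices.extend(indices)
        self.u_indices = [i for i in self.u_indices if i not in indices]

class StubModelManager:
    def train(self, l_indices: List[int]) -> None:
        self.trained_on = list(l_indices)  # record what the model was fit on

class StubStrategy:
    def suggest(self, num_annotate: int, u_indices: List[int]) -> List[int]:
        # Trivial selection: take the first num_annotate unlabelled indices
        return u_indices[:num_annotate]

# One communication cycle, mirroring the pipeline's role as arbiter
dm = StubDataManager(num_samples=10, initial_labels=[0, 1])
mm = StubModelManager()
st = StubStrategy()

mm.train(dm.l_indices)               # 1. fit model on the labelled set
query = st.suggest(3, dm.u_indices)  # 2. strategy proposes indices
dm.update_train_labels(query)        # 3. oracle labels them; pools update

print(dm.l_indices)  # [0, 1, 2, 3, 4]
```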
- compute_current_performance(test_loader: DataLoader[Any] | None = None, query: List[int] | None = None) None [source]¶
Compute performance of model.
- Parameters:
test_loader – Pytorch DataLoader with test data compatible with the model. Optional, as the test loader can usually be generated from the data_manager; it is provided here for cases where one has not been defined or a new test set is used.
query – List of indices selected for labelling. Used for calculating hit ratio metric
- Returns:
dictionary containing metric results on test set
- compute_hit_ratio(result: Dict[str, float], query: List[int] | None = None) Dict[str, float] [source]¶
Utility function for computing the hit ratio as used within the current performance and theoretical performance methods.
- Parameters:
result – Dict or Dict-like of metrics
query – List of indices selected for labelling. Used for calculating hit ratio metric
- Returns:
updated result dictionary with “hit_ratio” key, corresponding to hit ratio result
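One plausible reading of the hit ratio, sketched in plain Python under the assumption that the data manager designates a set of "top" observations and the metric is the fraction of those that the query recovered (the function signature and `top_unlabelled` argument here are illustrative):

```python
from typing import Dict, List, Set

def compute_hit_ratio(result: Dict[str, float],
                      query: List[int],
                      top_unlabelled: Set[int]) -> Dict[str, float]:
    # Hit ratio: fraction of the designated "top" observations that the
    # strategy's query managed to select.
    hits = len(set(query) & top_unlabelled)
    result = dict(result)  # do not mutate the caller's metrics dict
    result["hit_ratio"] = hits / len(top_unlabelled)
    return result

metrics = compute_hit_ratio(
    {"accuracy": 0.8}, query=[3, 5, 9], top_unlabelled={3, 4, 9, 12}
)
print(metrics["hit_ratio"])  # 0.5
```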
- compute_theoretical_performance(test_loader: DataLoader[Any] | None = None) Dict[str, float] [source]¶
Returns the performance of the model trained on the fully labelled dataset against the test data. Typically used to establish a theoretical benchmark of model performance given that all available training data is labelled; this is the “horizontal” line in area-under-learning-curve plots for active learning.
This would not be available in a real active learning situation, hence it is not computed as part of __init__.
- Parameters:
test_loader – Pytorch DataLoader with test data compatible with the model. Optional, as the test loader can usually be generated from the data_manager; it is provided here for cases where one has not been defined or a new test set is used.
- Returns:
dictionary containing metric results on test set
- property dataset_size: int¶
Number of total data points.
- property l_indices: List[int]¶
Indices of labelled samples.
- property l_loader: DataLoader[Any]¶
Dataloader containing labelled data.
- log_labelled_by(indices: List[int], tag: str | None = None) None [source]¶
Update the dictionary that records what each observation was labelled by. The default behaviour is to map each observation to the iteration at which it was labelled.
- Parameters:
indices – list of indices selected for labelling
tag – string which indicates what the observations were labelled by
- property percentage_labelled: float¶
Percentage of total available dataset labelled.
- query(indices: List[int]) None [source]¶
Updates labels based on indices selected for labelling
- Parameters:
indices – List of indices selected for labelling
- run(num_annotate: int, num_iterations: int | None = None, test_loader: DataLoader[Any] | None = None, *strategy_args: Any, **strategy_kwargs: Any) None [source]¶
Given the number of samples to annotate per iteration and a test loader, this method runs the full active learning process: it trains the model on the labelled set and records the current performance, then computes the informativeness of the unlabelled observations, ranks them, and has the top num_annotate observations labelled and added to the next iteration’s labelled dataset L’. This process repeats until there are no observations left in the unlabelled set or num_iterations is reached.
- Parameters:
num_annotate – number of observations to get annotated per iteration
num_iterations – number of active learning iterations to perform
test_loader – test data with which we evaluate the current state of the model given the labelled set L
strategy_args – optional additional args for strategy call
strategy_kwargs – optional additional kwargs for strategy call
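The loop described above can be sketched in plain Python; the `select` callable stands in for the strategy call, and training/evaluation are left as a comment (all names here are illustrative, not the library's API):

```python
from typing import Callable, List, Optional

def run(num_annotate: int,
        l_indices: List[int],
        u_indices: List[int],
        select: Callable[[int, List[int]], List[int]],
        num_iterations: Optional[int] = None) -> List[dict]:
    """Sketch of the run() loop: train/evaluate, query, relabel, repeat."""
    history = []
    iteration = 0
    while u_indices and (num_iterations is None or iteration < num_iterations):
        # (train the model on l_indices and record test performance here)
        history.append({"iteration": iteration, "labelled": len(l_indices)})
        query = select(num_annotate, u_indices)  # strategy proposes indices
        l_indices = l_indices + query            # oracle labels the query
        u_indices = [i for i in u_indices if i not in query]
        iteration += 1
    return history

# Greedy stand-in strategy: pick the first num_annotate unlabelled indices
hist = run(2, [0], list(range(1, 8)), lambda n, u: u[:n])
print(len(hist))  # 4 iterations to exhaust 7 unlabelled points, 2 at a time
```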
- step(num_annotate: int, *args: Any, **kwargs: Any) List[int] [source]¶
Ask the strategy to provide indices of unlabelled observations for labelling by the oracle
- Parameters:
num_annotate – Number of points to annotate
- Returns:
list of indices to label from the dataset
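Fine-grained control is possible by pairing step() with query() instead of calling run(); a minimal sketch of that pattern, with a toy oracle function standing in for a real labelling interface (all names below are hypothetical):

```python
from typing import Dict, List

unlabelled: List[int] = [10, 11, 12, 13]
labels: Dict[int, int] = {}

def step(num_annotate: int) -> List[int]:
    # Stand-in for the strategy: propose the first num_annotate indices
    return unlabelled[:num_annotate]

def oracle(indices: List[int]) -> Dict[int, int]:
    return {i: i % 2 for i in indices}  # toy labelling rule

def query(indices: List[int]) -> None:
    # Commit the oracle's labels and update the unlabelled pool
    labels.update(oracle(indices))
    unlabelled[:] = [i for i in unlabelled if i not in indices]

proposed = step(2)     # [10, 11]
query(proposed)
print(sorted(labels))  # [10, 11]
```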
- summary() DataFrame [source]¶
Construct a pandas table of performances of the model over the active learning iterations.
- property test_loader: DataLoader[Any]¶
Dataloader containing test data.
- property train_loader: DataLoader[Any]¶
Dataloader containing train data.
- property u_indices: List[int]¶
Indices of unlabelled samples.
- property u_loader: DataLoader[Any]¶
Dataloader containing unlabelled data.
- property valid_loader: DataLoader[Any] | None¶
Dataloader containing validation data.