pyrelational.data_managers
Data Manager
- class DataManager(dataset: Dataset[Tuple[Tensor, ...]], label_attr: str = 'y', train_indices: List[int] | None = None, labelled_indices: List[int] | None = None, unlabelled_indices: List[int] | None = None, validation_indices: List[int] | None = None, test_indices: List[int] | None = None, random_label_size: float | int = 0.1, hit_ratio_at: int | float | None = None, random_seed: int = 1234, loader_batch_size: int | str = 1, loader_shuffle: bool = True, loader_sampler: Sampler[int] | None = None, loader_batch_sampler: Sampler[List[Any]] | Iterable[List[Any]] | None = None, loader_num_workers: int = 0, loader_collate_fn: Callable[[List[T]], Any] | None = None, loader_pin_memory: bool = False, loader_drop_last: bool = False, loader_timeout: float = 0)
Bases: object
DataManager for active learning pipelines.
A diagram showing how the train/test indices are resolved:
Instantiate a DataManager.
- Parameters:
dataset – A PyTorch dataset whose indices refer to individual samples of study. This dataset must have an attribute containing the labels, and the __getitem__ method must return a tuple of tensors. In general, pyrelational assumes that the first item of this tuple is a tensor of features.
label_attr – name of the attribute in the dataset class that holds the tensor of labels/values to be predicted; by default, pyrelational assumes it corresponds to dataset.y
train_indices – An iterable of indices of the training samples in the dataset
labelled_indices – An iterable of indices mapping to labelled training samples
unlabelled_indices – An iterable of indices to unlabelled observations in the dataset
validation_indices – An iterable of indices to observations used for model validation
test_indices – An iterable of indices to observations in the input dataset used for test performance of the model
random_label_size – Only used when labelled and unlabelled indices are not provided. Sets the size of the labelled set, either as a number of samples or as a ratio w.r.t. the train set
hit_ratio_at – optional argument setting the top percentage threshold to compute hit ratio metric
random_seed – random seed used to generate labelled/unlabelled splits when none are provided.
loader_batch_size – batch size for dataloader
loader_shuffle – shuffle flag for labelled dataloader
loader_sampler – a sampler for the dataloaders
loader_batch_sampler – a batch sampler for the dataloaders
loader_num_workers – number of cpu workers for dataloaders
loader_collate_fn – collate fn for dataloaders
loader_pin_memory – pin memory flag for dataloaders
loader_drop_last – drop last flag for dataloaders
loader_timeout – timeout value for dataloaders
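The role of random_label_size and random_seed can be illustrated with a minimal sketch. It is not pyrelational's implementation: the helper name resolve_random_split is hypothetical, and the sketch assumes that a float is read as a ratio of the train set and an int as an absolute number of samples.

```python
import random
from typing import List, Tuple, Union


def resolve_random_split(
    train_indices: List[int],
    random_label_size: Union[int, float] = 0.1,
    random_seed: int = 1234,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: split train indices into labelled and unlabelled."""
    if isinstance(random_label_size, float):
        # Assumption: a float is a ratio w.r.t. the train set.
        n_labelled = int(random_label_size * len(train_indices))
    else:
        # Assumption: an int is an absolute number of samples.
        n_labelled = random_label_size
    # A seeded generator makes the split reproducible.
    rng = random.Random(random_seed)
    labelled = sorted(rng.sample(train_indices, n_labelled))
    labelled_set = set(labelled)
    unlabelled = [i for i in train_indices if i not in labelled_set]
    return labelled, unlabelled


labelled, unlabelled = resolve_random_split(list(range(100)), random_label_size=0.1)
print(len(labelled), len(unlabelled))
```

With the same seed the helper returns the same split, which is the property random_seed is documented to provide.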
- get_labelled_loader() → DataLoader[Any]
Get labelled dataloader.
- Returns:
PyTorch DataLoader containing the labelled subset of the dataset
- get_percentage_labelled() → float
Get percentage of total available dataset labelled.
- Returns:
percentage value
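As a rough illustration of the quantity returned, the sketch below computes a labelled percentage from two index lists. The function name percentage_labelled is hypothetical, and the choice of denominator (labelled plus unlabelled samples) is an assumption rather than something this page states.

```python
from typing import List


def percentage_labelled(
    labelled_indices: List[int], unlabelled_indices: List[int]
) -> float:
    """Hypothetical helper: share of the available samples that are labelled."""
    # Assumption: "total available" means labelled plus unlabelled samples.
    total = len(labelled_indices) + len(unlabelled_indices)
    return 100.0 * len(labelled_indices) / total


print(percentage_labelled([0, 1], [2, 3, 4, 5, 6, 7]))
```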
- get_sample_feature_vector(ds_index: int) → Any
Get feature vector for sample index.
To be reviewed for deprecation (for datasets without tensors).
- get_sample_feature_vectors(ds_indices: List[int]) → List[Tensor]
Get features for sample indices.
To be reviewed for deprecation (for datasets without tensors).
- get_sample_labels(ds_indices: List[int]) → Tensor
Get sample labels.
This assumes that the labels are the last element of each item returned by the dataset.
- Parameters:
ds_indices – collection of indices for accessing samples in the dataset
- Returns:
labels for the provided indices
- get_test_loader() → DataLoader[Any]
Get test dataloader.
- Returns:
PyTorch DataLoader containing the test set
- get_train_loader(full: bool = False) → DataLoader[Any]
Return the full train loader if full is True, else return the labelled loader.
- Parameters:
full – whether to use the full train set, with unlabelled samples included
- Returns:
PyTorch DataLoader containing the training data for the model: the labelled samples only, or the labelled and unlabelled samples if full is True
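The effect of the full flag can be sketched in terms of which indices the loader covers. This is only an illustration under the assumption that the full train set is the union of the labelled and unlabelled indices; train_loader_indices is a hypothetical helper, not part of pyrelational.

```python
from typing import List


def train_loader_indices(
    labelled_indices: List[int],
    unlabelled_indices: List[int],
    full: bool = False,
) -> List[int]:
    """Hypothetical helper: pick the indices a train loader would iterate over."""
    if full:
        # Assumption: the full train set is labelled plus unlabelled samples.
        return list(labelled_indices) + list(unlabelled_indices)
    return list(labelled_indices)


print(train_loader_indices([0, 1], [2, 3]))
print(train_loader_indices([0, 1], [2, 3], full=True))
```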
- get_train_set() → Dataset[Tuple[Tensor, ...]]
Get train set from full dataset and train indices.
- get_unlabelled_loader() → DataLoader[Any]
Get unlabelled dataloader.
- Returns:
PyTorch DataLoader containing the unlabelled subset of the dataset
- get_validation_loader() → DataLoader[Any] | None
Get the validation dataloader if a validation set exists, else return None.
- Returns:
PyTorch DataLoader containing the validation set, or None if there is no validation set
- get_validation_set() → Subset[Tuple[Tensor, ...]] | None
Get validation set from full dataset and validation indices.
- set_target_value(idx: int, value: Any) → None
Set the value of a sample given an index.
Overwrites the y value of the observation denoted by idx in the underlying dataset with the supplied value.
- Parameters:
idx – index of the observation
value – new value for the observation
- update_train_labels(indices: List[int]) → None
Update the labelled and unlabelled sets of the dataset.
The behaviour differs between evaluation mode and real mode. In evaluation mode the dataset already has the labels, so the update only needs to move the observations from the unlabelled set to the labelled set.
- Parameters:
indices – list of indices corresponding to samples which have been labelled
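The bookkeeping described for evaluation mode can be sketched as moving indices between two lists. The helper move_to_labelled below is hypothetical and operates on plain lists; it is not pyrelational's implementation and does not cover real mode, where label values would also have to be written to the dataset.

```python
from typing import List, Tuple


def move_to_labelled(
    labelled_indices: List[int],
    unlabelled_indices: List[int],
    indices: List[int],
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: move newly labelled indices between the two sets."""
    already_labelled = set(labelled_indices)
    newly_labelled = set(indices)
    # Append the new indices to the labelled set, skipping duplicates.
    new_labelled = list(labelled_indices) + [
        i for i in indices if i not in already_labelled
    ]
    # Drop the new indices from the unlabelled set.
    new_unlabelled = [i for i in unlabelled_indices if i not in newly_labelled]
    return new_labelled, new_unlabelled


labelled, unlabelled = move_to_labelled([0, 1], [2, 3, 4], [3])
print(labelled, unlabelled)
```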