pyrelational.data_managers
Data Manager
- class DataManager(dataset: Dataset[Tuple[Tensor, ...]], label_attr: str = 'y', train_indices: List[int] | None = None, labelled_indices: List[int] | None = None, unlabelled_indices: List[int] | None = None, validation_indices: List[int] | None = None, test_indices: List[int] | None = None, random_label_size: float | int = 0.1, hit_ratio_at: int | float | None = None, random_seed: int = 1234, loader_batch_size: int | str = 1, loader_shuffle: bool = True, loader_sampler: Sampler[int] | None = None, loader_batch_sampler: Sampler[List[Any]] | Iterable[List[Any]] | None = None, loader_num_workers: int = 0, loader_collate_fn: Callable[[List[T]], Any] | None = None, loader_pin_memory: bool = False, loader_drop_last: bool = False, loader_timeout: float = 0)
Bases: object
DataManager for active learning pipelines
(Diagram: how the train/test indices are resolved.)
- Parameters:
dataset – A PyTorch dataset whose indices refer to individual samples of study. This dataset must have an attribute containing the labels, and the __getitem__ method must return a tuple of tensors. In general, pyrelational assumes that the first item of this tuple is a tensor of features.
label_attr – string giving the name of the attribute in the dataset class that corresponds to the tensor containing the labels/values to be predicted; by default, pyrelational assumes it corresponds to dataset.y
train_indices – An iterable of indices of the training samples in the dataset
labelled_indices – An iterable of indices mapping to labelled training samples
unlabelled_indices – An iterable of indices to unlabelled observations in the dataset
validation_indices – An iterable of indices to observations used for model validation
test_indices – An iterable of indices to observations in the input dataset used for test performance of the model
random_label_size – Only used when labelled and unlabelled indices are not provided. Sets the size of the labelled set, given either as a number of samples or as a ratio w.r.t. the train set
hit_ratio_at – optional argument setting the top percentage threshold used to compute the hit ratio metric
random_seed – random seed used to generate labelled/unlabelled splits when none are provided.
loader_batch_size – batch size for dataloader
loader_shuffle – shuffle flag for labelled dataloader
loader_sampler – a sampler for the dataloaders
loader_batch_sampler – a batch sampler for the dataloaders
loader_num_workers – number of CPU workers for dataloaders
loader_collate_fn – collate function for dataloaders
loader_pin_memory – pin memory flag for dataloaders
loader_drop_last – drop last flag for dataloaders
loader_timeout – timeout value for dataloaders
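The random_label_size and random_seed parameters above describe a seeded random split of the train set into labelled and unlabelled indices. The following is a minimal standard-library sketch of that idea, not pyrelational's implementation: the helper name resolve_random_split, the rounding rule, and the treatment of a float as a ratio and an int as a count are assumptions made for illustration.

```python
import random
from typing import List, Tuple, Union


def resolve_random_split(
    train_indices: List[int],
    random_label_size: Union[float, int] = 0.1,
    random_seed: int = 1234,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: draw a random labelled subset of the train indices."""
    if isinstance(random_label_size, float):
        # Assumption: a float is a ratio of the train set.
        n_labelled = int(round(random_label_size * len(train_indices)))
    else:
        # Assumption: an int is an absolute number of samples.
        n_labelled = random_label_size
    # A dedicated generator keeps the split reproducible for a given seed.
    rng = random.Random(random_seed)
    labelled = sorted(rng.sample(train_indices, n_labelled))
    labelled_set = set(labelled)
    # Everything in the train set that was not drawn stays unlabelled.
    unlabelled = [i for i in train_indices if i not in labelled_set]
    return labelled, unlabelled


labelled, unlabelled = resolve_random_split(list(range(100)), random_label_size=0.1)
```

With 100 train indices and a ratio of 0.1, this sketch yields 10 labelled and 90 unlabelled indices, and repeating the call with the same seed gives the same split.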
- get_labelled_loader() → DataLoader[Any]
Get labelled dataloader.
- Returns:
PyTorch DataLoader containing the labelled subset of the dataset
- get_percentage_labelled() → float
Percentage of the total available dataset that is labelled.
- Returns:
percentage value
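As a rough illustration of the quantity described above, the sketch below computes a labelled percentage from two counts. It assumes that the "total available dataset" is the labelled set plus the unlabelled pool. The source does not state the denominator, and the function name percentage_labelled is hypothetical.

```python
def percentage_labelled(n_labelled: int, n_unlabelled: int) -> float:
    """Hypothetical helper: share of the available pool that is labelled, in percent."""
    # Assumption: the available pool is labelled + unlabelled samples.
    total = n_labelled + n_unlabelled
    if total == 0:
        # Avoid dividing by zero when there is no data at all.
        return 0.0
    return 100.0 * n_labelled / total


pct = percentage_labelled(10, 90)
```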
- get_sample_feature_vector(ds_index: int) → Any
To be reviewed for deprecation (for datasets without tensors)
- get_sample_feature_vectors(ds_indices: List[int]) → List[Tensor]
To be reviewed for deprecation (for datasets without tensors)
- get_sample_labels(ds_indices: List[int]) → Tensor
Get sample labels. This assumes that the labels are the last element in the output of the dataset.
- Parameters:
ds_indices – collection of indices for accessing samples in dataset.
- Returns:
tensor of labels for the provided indices
- get_test_loader() → DataLoader[Any]
Get test dataloader.
- Returns:
PyTorch DataLoader containing the test set
- get_test_set() → Subset[Tuple[Tensor, ...]]
Get test set from full dataset and test indices.
- get_train_loader(full: bool = False) → DataLoader[Any]
Get train dataloader. If full is True, returns the full train loader; otherwise returns the labelled loader.
- Parameters:
full – whether to use the full train set with unlabelled samples included
- Returns:
PyTorch DataLoader containing training data for the model (labelled samples only unless full is True)
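The effect of the full flag can be pictured as a choice of which indices feed the loader. The sketch below is an assumption-laden illustration in plain Python, not the library's code: select_train_indices is a hypothetical helper, and it assumes the full train set is the union of the labelled and unlabelled indices.

```python
from typing import List


def select_train_indices(
    labelled: List[int], unlabelled: List[int], full: bool = False
) -> List[int]:
    """Hypothetical helper: choose the indices a train loader would iterate over."""
    if full:
        # Assumption: the full train set is labelled plus unlabelled samples.
        return sorted(set(labelled) | set(unlabelled))
    # Default: only the samples that currently have labels.
    return list(labelled)


default_indices = select_train_indices([0, 4], [1, 2, 3])
full_indices = select_train_indices([0, 4], [1, 2, 3], full=True)
```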
- get_train_set() → Dataset[Tuple[Tensor, ...]]
Get train set from full dataset and train indices.
- get_unlabelled_loader() → DataLoader[Any]
Get unlabelled dataloader.
- Returns:
PyTorch DataLoader containing the unlabelled subset of the dataset
- get_validation_loader() → DataLoader[Any] | None
Get the validation dataloader if a validation set exists; otherwise return None.
- Returns:
PyTorch DataLoader containing the validation set, or None if there is no validation set
- get_validation_set() → Subset[Tuple[Tensor, ...]] | None
Get validation set from full dataset and validation indices.
- set_target_value(idx: int, value: Any) → None
Sets the y value of the observation denoted by idx in the underlying dataset to the supplied value.
- Parameters:
idx – index of the observation
value – new value for the observation
- update_train_labels(indices: List[int]) → None
Updates the labelled and unlabelled sets of the dataset.
The behaviour differs between evaluation mode and real mode. In evaluation mode the dataset already has the labels, so the update only needs to move the observations from the unlabelled set to the labelled set.
- Parameters:
indices – list of indices corresponding to samples which have been labelled
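The evaluation-mode bookkeeping described above, where observations move from the unlabelled set to the labelled set, can be sketched with plain lists. This is only an illustration under stated assumptions: move_to_labelled is a hypothetical helper, and how pyrelational stores these sets internally is not shown in this reference.

```python
from typing import List, Tuple


def move_to_labelled(
    labelled: List[int], unlabelled: List[int], indices: List[int]
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: move newly labelled indices out of the unlabelled pool."""
    newly_labelled = set(indices)
    # Drop the newly labelled samples from the unlabelled pool.
    new_unlabelled = [i for i in unlabelled if i not in newly_labelled]
    # Append them to the labelled set, skipping anything already present.
    new_labelled = list(labelled)
    seen = set(labelled)
    for i in indices:
        if i not in seen:
            new_labelled.append(i)
            seen.add(i)
    return new_labelled, new_unlabelled


new_labelled, new_unlabelled = move_to_labelled([0, 1], [2, 3, 4], [3])
```

In this sketch the inputs are left unmodified and new lists are returned, which keeps the example easy to test; an in-place update would be equally valid.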