pyrelational.data_managers

Data Manager

class DataManager(dataset: Dataset[Tuple[Tensor, ...]], label_attr: str = 'y', train_indices: List[int] | None = None, labelled_indices: List[int] | None = None, unlabelled_indices: List[int] | None = None, validation_indices: List[int] | None = None, test_indices: List[int] | None = None, random_label_size: float | int = 0.1, hit_ratio_at: int | float | None = None, random_seed: int = 1234, loader_batch_size: int | str = 1, loader_shuffle: bool = True, loader_sampler: Sampler[int] | None = None, loader_batch_sampler: Sampler[List[Any]] | Iterable[List[Any]] | None = None, loader_num_workers: int = 0, loader_collate_fn: Callable[[List[T]], Any] | None = None, loader_pin_memory: bool = False, loader_drop_last: bool = False, loader_timeout: float = 0)

Bases: object

DataManager for active learning pipelines

A diagram showing how the train/test indices are resolved:

../_images/data_indices_diagram.png

Parameters:
  • dataset – A PyTorch dataset whose indices refer to individual samples of study. This dataset must have an attribute containing the labels, and the __getitem__ method must return a tuple of tensors. In general, pyrelational assumes that the first item of this tuple is a tensor of features.

  • label_attr – name of the attribute of the dataset class that holds the tensor of labels/values to be predicted; by default, pyrelational assumes this is dataset.y

  • train_indices – An iterable of indices to the training samples in the dataset

  • labelled_indices – An iterable of indices mapping to labelled training samples

  • unlabelled_indices – An iterable of indices to unlabelled observations in the dataset

  • validation_indices – An iterable of indices to observations used for model validation

  • test_indices – An iterable of indices to observations in the input dataset used for test performance of the model

  • random_label_size – Only used when labelled and unlabelled indices are not provided. Sets the size of the labelled set, given either as a number of samples (int) or as a ratio of the train set (float)

  • hit_ratio_at – optional argument setting the top percentage threshold to compute hit ratio metric

  • random_seed – random seed used to generate labelled/unlabelled splits when none are provided.

  • loader_batch_size – batch size for dataloader

  • loader_shuffle – shuffle flag for labelled dataloader

  • loader_sampler – a sampler for the dataloaders

  • loader_batch_sampler – a batch sampler for the dataloaders

  • loader_num_workers – number of cpu workers for dataloaders

  • loader_collate_fn – collate fn for dataloaders

  • loader_pin_memory – pin memory flag for dataloaders

  • loader_drop_last – drop last flag for dataloaders

  • loader_timeout – timeout value for dataloaders
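
As an illustration of how random_label_size and random_seed could interact when no labelled/unlabelled indices are supplied, here is a minimal pure-Python sketch. The helper name random_labelled_split and the rounding rule are assumptions made for this example; this is not pyrelational's implementation.

```python
import random
from typing import List, Sequence, Tuple, Union


def random_labelled_split(
    train_indices: Sequence[int],
    random_label_size: Union[float, int] = 0.1,
    random_seed: int = 1234,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: split train indices into (labelled, unlabelled)."""
    # Assumption: a float is a ratio of the train set, an int is a sample count.
    if isinstance(random_label_size, float):
        n_labelled = round(random_label_size * len(train_indices))
    else:
        n_labelled = random_label_size
    # Clamp so the request never exceeds the available samples.
    n_labelled = max(0, min(n_labelled, len(train_indices)))

    # A dedicated generator keeps the split reproducible for a given seed.
    rng = random.Random(random_seed)
    labelled = sorted(rng.sample(list(train_indices), n_labelled))
    labelled_set = set(labelled)
    unlabelled = [i for i in train_indices if i not in labelled_set]
    return labelled, unlabelled


labelled, unlabelled = random_labelled_split(range(100), random_label_size=0.1)
```

With these assumptions the two returned lists are disjoint, together cover the train indices, and are identical across runs that use the same seed.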

get_labelled_loader() -> DataLoader[Any]

Get labelled dataloader.

Returns:

PyTorch DataLoader containing the labelled subset of the dataset

get_percentage_labelled() -> float

Percentage of total available dataset labelled.

Returns:

percentage value
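
The computation can be pictured with the sketch below. It assumes the percentage is taken over the labelled plus unlabelled pool and is expressed on a 0 to 100 scale; both are assumptions for illustration, and the helper name percentage_labelled is hypothetical.

```python
from typing import Sequence


def percentage_labelled(labelled: Sequence[int], unlabelled: Sequence[int]) -> float:
    """Hypothetical helper: share of the labelled + unlabelled pool that is labelled."""
    total = len(labelled) + len(unlabelled)
    if total == 0:
        # Nothing to label, so report zero rather than dividing by zero.
        return 0.0
    # Assumption: the value is expressed on a 0-100 scale.
    return 100.0 * len(labelled) / total


pct = percentage_labelled(labelled=[0, 1, 2], unlabelled=[3, 4, 5, 6, 7, 8, 9, 10, 11])
```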

get_sample_feature_vector(ds_index: int) -> Any

To be reviewed for deprecation (for datasets without tensors)

get_sample_feature_vectors(ds_indices: List[int]) -> List[Tensor]

To be reviewed for deprecation (for datasets without tensors)

get_sample_labels(ds_indices: List[int]) -> Tensor

Get sample labels. This assumes that the label is the last element of each item returned by the dataset.

Parameters:

ds_indices – collection of indices for accessing samples in dataset.

Returns:

labels for the provided indices
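
The "label is the last element" convention can be sketched with a toy dataset made of plain tuples. The names toy_dataset and sample_labels are hypothetical, and the sketch returns a Python list rather than a Tensor to stay dependency-free.

```python
from typing import Any, List, Sequence, Tuple

# Toy stand-in for a dataset whose __getitem__ returns (features, label).
toy_dataset: List[Tuple[List[float], int]] = [
    ([0.1, 0.2], 0),
    ([0.3, 0.4], 1),
    ([0.5, 0.6], 1),
    ([0.7, 0.8], 0),
]


def sample_labels(dataset: Sequence[Tuple[Any, ...]], ds_indices: Sequence[int]) -> List[Any]:
    """Hypothetical helper: take the last element of each selected sample as its label."""
    return [dataset[i][-1] for i in ds_indices]


labels = sample_labels(toy_dataset, [0, 2, 3])
```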

get_test_loader() -> DataLoader[Any]

Get test dataloader.

Returns:

PyTorch DataLoader containing the test set

get_test_set() -> Subset[Tuple[Tensor, ...]]

Get test set from full dataset and test indices.

get_train_loader(full: bool = False) -> DataLoader[Any]

Get train dataloader. If full is True, returns a loader over the full train set; otherwise returns the labelled loader.

Parameters:

full – whether to use full dataset with unlabelled included

Returns:

PyTorch DataLoader containing the labelled training data for the model, or the full train set when full is True
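
The effect of the full flag on which samples are served can be sketched as follows. The helper train_loader_indices is hypothetical, and treating the full train set as labelled plus unlabelled indices is an assumption made for this example.

```python
from typing import List, Sequence


def train_loader_indices(
    labelled: Sequence[int], unlabelled: Sequence[int], full: bool = False
) -> List[int]:
    """Hypothetical helper: choose which dataset indices a train loader would cover."""
    if full:
        # Assumption: the full train set is the labelled and unlabelled pools together.
        return list(labelled) + list(unlabelled)
    # Default: only samples that already have labels are used for training.
    return list(labelled)


default_indices = train_loader_indices([0, 1], [2, 3])
full_indices = train_loader_indices([0, 1], [2, 3], full=True)
```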

get_train_set() -> Dataset[Tuple[Tensor, ...]]

Get train set from full dataset and train indices.

get_unlabelled_loader() -> DataLoader[Any]

Get unlabelled dataloader.

Returns:

PyTorch DataLoader containing the unlabelled subset of the dataset

get_validation_loader() -> DataLoader[Any] | None

Get the validation dataloader if a validation set exists; otherwise return None.

Returns:

PyTorch DataLoader containing the validation set, or None if there is no validation set

get_validation_set() -> Subset[Tuple[Tensor, ...]] | None

Get validation set from full dataset and validation indices.

set_target_value(idx: int, value: Any) -> None

Sets the y value of the observation at index idx in the underlying dataset to the supplied value.

Parameters:
  • idx – index value to the observation

  • value – new value for the observation
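
A minimal sketch of the idea, assuming the labels live in a mutable sequence reachable through the attribute named by label_attr. ToyDataset and the free function set_target_value below are hypothetical stand-ins, not the library's code.

```python
from typing import Any, List, Tuple


class ToyDataset:
    """Hypothetical dataset exposing its labels through the attribute `y`."""

    def __init__(self, features: List[List[float]], labels: List[Any]) -> None:
        self.x = features
        self.y = labels

    def __len__(self) -> int:
        return len(self.x)

    def __getitem__(self, idx: int) -> Tuple[List[float], Any]:
        return self.x[idx], self.y[idx]


def set_target_value(dataset: Any, idx: int, value: Any, label_attr: str = "y") -> None:
    """Overwrite the label of observation `idx` in place."""
    # Look up the label container by name, then assign into it.
    getattr(dataset, label_attr)[idx] = value


toy = ToyDataset(features=[[0.0], [1.0], [2.0]], labels=[None, None, None])
set_target_value(toy, idx=1, value=7)
```

After the call, indexing the dataset at position 1 yields the new label alongside the unchanged features.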

update_train_labels(indices: List[int]) -> None

Updates the labelled and unlabelled sets of the dataset.

Different behaviour based on whether this is done in evaluation mode or real mode. The difference is that in evaluation mode the dataset already has the label, so it is a matter of making sure the observations are moved from the unlabelled set to the labelled set.

Parameters:

indices – list of indices corresponding to samples which have been labelled
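
The evaluation-mode bookkeeping described above, moving observations from the unlabelled set to the labelled set, can be sketched in pure Python. The helper move_to_labelled is hypothetical, and silently ignoring indices that are not in the unlabelled set is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple


def move_to_labelled(
    labelled: Sequence[int], unlabelled: Sequence[int], indices: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: return updated (labelled, unlabelled) index lists."""
    newly_labelled = set(indices)
    # Only indices that are currently unlabelled can be moved.
    moved = [i for i in unlabelled if i in newly_labelled]
    new_unlabelled = [i for i in unlabelled if i not in newly_labelled]
    new_labelled = list(labelled) + moved
    return new_labelled, new_unlabelled


new_labelled, new_unlabelled = move_to_labelled(
    labelled=[0, 1], unlabelled=[2, 3, 4, 5], indices=[3, 5]
)
```

In a real (non-evaluation) setting the new label values would also need to be written to the dataset, for example through set_target_value, before the indices are moved.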