pyrelational.datasets

Classification datasets

The following classes contain a variety of classic classification datasets that have been used in different active learning papers. Each behaves the same as a PyTorch Dataset.

Regression datasets

The following classes contain a variety of classic regression datasets that have been used in different active learning papers. Each behaves the same as a PyTorch Dataset.

Benchmark DataManager

The following functions accept the datasets defined in this package to produce DataManagers containing labelling initialisations that correspond to cold and warm start active learning tasks. These can be used for benchmarking strategies quickly.

Utility to create datamanagers corresponding to different AL tasks

create_classification_cold_start(dataset: Dataset[Any], train_indices: List[int], test_indices: List[Any], **dm_args: Any) → DataManager

Returns an AL task for benchmarking classification datasets. The AL task will sample an example from each of the classes in the training subset of the data.

Please note that the current iteration does not utilise a validation set as described in the paper.

Parameters:
  • dataset – A PyTorch dataset in the style described in pyrelational.datasets

  • train_indices – [int] indices of the observations in dataset used for the training set

  • test_indices – [int] indices of the observations in dataset used for the holdout test set

  • dm_args – any additional keyword arguments to be passed into the initialisation of the datamanager.

create_regression_cold_start(dataset: Dataset[Any], train_indices: List[int], test_indices: List[Any], **dm_args: Any) → DataManager

Creates a datamanager with 2 labelled data samples, where the labelled samples are the pair with the largest distance between them.

Please note that the current iteration does not utilise a validation set as described in the paper.

Parameters:
  • dataset – A PyTorch dataset in the style described in pyrelational.datasets

  • train_indices – [int] indices of the observations in dataset used for the training set

  • test_indices – [int] indices of the observations in dataset used for the holdout test set

  • dm_args – any additional keyword arguments to be passed into the initialisation of the datamanager.
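The "largest distance" initialisation can be illustrated with a small standalone sketch. It is not the library's implementation: the helper name `farthest_pair`, the use of Euclidean distance on the feature vectors, and the brute-force search over all pairs are assumptions made for illustration only.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def farthest_pair(
    features: Sequence[Sequence[float]], train_indices: List[int]
) -> Tuple[int, int]:
    """Hypothetical helper: return the two training indices whose
    feature vectors are farthest apart (Euclidean distance assumed)."""
    best_pair = (train_indices[0], train_indices[1])
    best_dist = -1.0
    # Brute-force search over every pair of training indices.
    for i, j in combinations(train_indices, 2):
        dist = math.dist(features[i], features[j])
        if dist > best_dist:
            best_dist = dist
            best_pair = (i, j)
    return best_pair


# Toy data: five 2-D observations; index 4 is held out for testing.
features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [2.0, 0.0], [9.0, 9.0]]
train_indices = [0, 1, 2, 3]

labelled = list(farthest_pair(features, train_indices))
unlabelled = [i for i in train_indices if i not in labelled]
```

The pairwise search is quadratic in the number of training observations, so on larger datasets a vectorised distance computation would likely be preferable.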

create_warm_start(dataset: Dataset[Any], **dm_args: Any) → DataManager

Returns a datamanager in which a random 10% of the train indices are labelled. The remaining observations in the training set form the unlabelled set. We call this initialisation a ‘warm start’ AL task, inspired by Konyushkova et al. (2017).

This can be used both for classification and regression type datasets.

Reference: Ksenia Konyushkova, Raphael Sznitman, Pascal Fua, ‘Learning Active Learning from Data’, NIPS 2017.

Parameters:
  • dataset – A PyTorch dataset in the style described in pyrelational.datasets

  • dm_args – any additional keyword arguments to be passed into the initialisation of the datamanager.
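The warm-start split described above can be sketched in plain Python. This is an illustration rather than the library's code: the helper `warm_start_split`, its `fraction` and `seed` parameters, and passing the train indices in directly are all assumptions.

```python
import random
from typing import List, Tuple


def warm_start_split(
    train_indices: List[int], fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: randomly label `fraction` of the training
    indices and leave the remainder unlabelled."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    n_labelled = max(1, int(fraction * len(train_indices)))
    labelled = rng.sample(train_indices, n_labelled)
    labelled_set = set(labelled)
    unlabelled = [i for i in train_indices if i not in labelled_set]
    return labelled, unlabelled


# 50 training observations: 5 become labelled, 45 stay unlabelled.
labelled, unlabelled = warm_start_split(list(range(50)))
```

Fixing the seed keeps the initial labelled set identical across runs, which helps when comparing strategies on the same starting point.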

pick_one_sample_per_class(dataset: Dataset[Any], train_indices: List[int]) → List[int]

Utility function that randomly picks one sample per class in the training subset of dataset and returns their indices in the dataset. This is used to define the initial state of the labelled subset in the active learning task.

Parameters:
  • dataset – input dataset

  • train_indices – list or iterable with the indices corresponding to the training samples in the dataset
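A minimal sketch of the one-sample-per-class idea follows. It operates on a plain list of labels rather than a dataset object, and the helper name `one_index_per_class` and its `seed` parameter are assumptions for illustration, not part of the package.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def one_index_per_class(
    labels: Sequence[int], train_indices: List[int], seed: int = 0
) -> List[int]:
    """Hypothetical helper: pick one random training index for each
    class that appears in the training subset."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    # Group the training indices by their class label.
    for idx in train_indices:
        by_class[labels[idx]].append(idx)
    # Draw one index per class, iterating classes in sorted order.
    return [rng.choice(members) for _, members in sorted(by_class.items())]


labels = [0, 1, 1, 2, 0, 2, 1]
train_indices = [0, 1, 2, 3, 4, 5]  # index 6 is not part of training
picked = one_index_per_class(labels, train_indices)
```

Classes that occur only outside the training indices are never picked, so the initial labelled set covers exactly the classes present in the training subset.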