pyrelational.datasets
Classification datasets
The following classes contain a variety of classic classification datasets that have been used in different active learning papers. Each behaves the same as a PyTorch Dataset.
Classification datasets that can be used for benchmarking AL strategies
- class BreastCancerDataset(n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
UCI ML Breast Cancer Wisconsin (Diagnostic) dataset
- Parameters:
n_splits – an int describing the number of class stratified splits to compute
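The n_splits parameter recurs on every dataset class in this module. As a rough illustration of what "class stratified splits" means, the sketch below builds n_splits train/test index pairs in plain Python so that each test fold keeps roughly the class proportions of the full dataset. The function name stratified_splits and its seed argument are hypothetical; this is not pyrelational's own implementation, only one way such splits could be computed.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_splits(
    labels: Sequence[int], n_splits: int = 5, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return n_splits (train_indices, test_indices) pairs, stratified by class.

    Hypothetical helper for illustration; not part of pyrelational.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin into the folds, so every
    # fold receives a near-equal share of every class.
    folds: List[List[int]] = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    # Fold k is the test set of split k; the remaining folds form the train set.
    splits = []
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, test))
    return splits


# Toy labels: 20 samples of class 0 and 10 samples of class 1.
labels = [0] * 20 + [1] * 10
splits = stratified_splits(labels, n_splits=5)
train_indices, test_indices = splits[0]
```

Each (train_indices, test_indices) pair has the same shape as the index lists that the benchmark DataManager utilities documented later on this page accept.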
- class Checkerboard2x2Dataset(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
Checkerboard 2x2 dataset from Konyushkova et al. 2017
From Ksenia Konyushkova, Raphael Sznitman, Pascal Fua ‘Learning Active Learning from Data’, NIPS 2017
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class Checkerboard4x4Dataset(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
Checkerboard 4x4 dataset from Konyushkova et al. 2017
From Ksenia Konyushkova, Raphael Sznitman, Pascal Fua ‘Learning Active Learning from Data’, NIPS 2017
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class CreditCardDataset(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
Credit card fraud dataset, highly unbalanced and challenging.
From Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi. Calibrating probability with undersampling for unbalanced classification. In 2015 IEEE Symposium Series on Computational Intelligence, pages 159–166, 2015.
We use the original data from http://www.ulb.ac.be/di/map/adalpozz/data/creditcard.Rdata, processed using pyreadr.
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class DigitDataset(n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
UCI ML hand-written digits dataset
From C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their Applications to Handwritten Digit Recognition, MSc Thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University.
- Parameters:
n_splits – an int describing the number of class stratified splits to compute
- class FashionMNIST(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
Fashion MNIST dataset
From Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class GaussianCloudsDataset(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
GaussianClouds dataset from Konyushkova et al. 2017: an imbalanced binary classification task created from multivariate Gaussian blobs
From Ksenia Konyushkova, Raphael Sznitman, Pascal Fua ‘Learning Active Learning from Data’, NIPS 2017
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class StriatumDataset(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
Striatum dataset as used in Konyushkova et al. 2017
From Ksenia Konyushkova, Raphael Sznitman, Pascal Fua ‘Learning Active Learning from Data’, NIPS 2017
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class SynthClass1(n_splits: int = 5, size: int = 500, random_seed: int = 1234)
Bases: Dataset[Tuple[Tensor, Tensor]]
Synth1 dataset as described in Yang and Loog
A binary classification task in which the positive and negative class samples are drawn from multivariate Gaussian distributions centred at [1,1] and [-1,-1] respectively.
- Parameters:
n_splits – an int describing the number of class stratified splits to compute
size – an int describing the number of observations the dataset is to have
random_seed – random seed for reproducibility on splits
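To make the SynthClass1 description concrete, the sketch below draws a two-class dataset from Gaussians centred at [1,1] and [-1,-1] using only the standard library. The helper name make_two_gaussians is hypothetical, and the identity covariance and the 50/50 class balance are assumptions made for illustration; the documentation above does not state either, so the library's actual generator may differ.

```python
import random
from typing import List, Tuple


def make_two_gaussians(
    size: int = 500, random_seed: int = 1234
) -> Tuple[List[List[float]], List[int]]:
    """Sample a 2D binary classification task from two Gaussian blobs.

    Hypothetical helper for illustration; not part of pyrelational.
    """
    rng = random.Random(random_seed)
    n_positive = size // 2  # assumption: balanced classes
    features: List[List[float]] = []
    labels: List[int] = []
    for i in range(size):
        label = 1 if i < n_positive else 0
        # Positive class is centred at [1, 1], negative class at [-1, -1].
        centre = 1.0 if label == 1 else -1.0
        # Assumption: identity covariance, so each coordinate is drawn
        # independently with unit standard deviation.
        features.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        labels.append(label)
    return features, labels


X, y = make_two_gaussians()
```

Fixing random_seed makes repeated calls return identical samples, which mirrors the reproducibility role the random_seed parameter plays above.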
- class SynthClass2(n_splits: int = 5, size: int = 500, random_seed: int = 1234)
Bases: Dataset[Tuple[Tensor, Tensor]]
Synth2 dataset as described in Yang and Loog
Originally proposed by Huang et al. in ‘Active learning by querying informative and representative examples’
- Parameters:
n_splits – an int describing the number of class stratified splits to compute
size – an int describing the number of observations the dataset is to have
random_seed – random seed for reproducibility on splits
- class SynthClass3(n_splits: int = 5, size: int = 500, random_seed: int = 1234)
Bases: Dataset[Tuple[Tensor, Tensor]]
SynthClass3 dataset as described in Yang and Loog
- Parameters:
n_splits – an int describing the number of class stratified splits to compute
size – an int describing the number of observations the dataset is to have
random_seed – random seed for reproducibility on splits
- class UCIClassification(name: str, data_dir: str = '/tmp/', n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
UCI classification abstract class
- Parameters:
name – string identifying the dataset to download, as specified in uci_datasets.UCIDatasets
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class UCIGlass(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: UCIClassification
UCI Glass dataset
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class UCIParkinsons(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: UCIClassification
UCI Parkinsons dataset
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class UCISeeds(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: UCIClassification
UCI Seeds dataset
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
Regression datasets
The following classes contain a variety of classic regression datasets that have been used in different active learning papers. Each behaves the same as a PyTorch Dataset.
Regression datasets that can be used for benchmarking AL strategies
- class DiabetesDataset(n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
A small regression dataset for examples
From Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) “Least Angle Regression,” Annals of Statistics (with discussion), 407-499.
- Parameters:
n_splits – an int describing the number of class stratified splits to compute
- class SynthReg1(n_splits: int = 5, size: int = 1000, random_seed: int = 1234)
Bases: Dataset[Tuple[Tensor, Tensor]]
Synthetic dataset for active learning on a regression-based task
Simple regression problem with one degree of freedom that can be placed into two types of AL situations, as described in the module docstring
- Parameters:
n_splits – an int describing the number of class stratified splits to compute
size – an int describing the number of observations the dataset is to have
random_seed – random seed for reproducibility on splits
- class SynthReg2(n_splits: int = 5, size: int = 1000, random_seed: int = 1234)
Bases: Dataset[Tuple[Tensor, Tensor]]
Synthetic dataset for active learning on a regression-based task
A more challenging dataset than SynthReg1, showing a periodic pattern with 2 degrees of freedom.
- Parameters:
n_splits – an int describing the number of class stratified splits to compute
size – an int describing the number of observations the dataset is to have
random_seed – random seed for reproducibility on splits
- class UCIAirfoil(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: UCIRegression
UCI Airfoil dataset
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class UCIConcrete(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: UCIRegression
UCI Concrete dataset
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class UCIEnergy(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: UCIRegression
UCI Energy dataset
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class UCIPower(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: UCIRegression
UCI Power dataset
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class UCIRegression(name: str, data_dir: str = '/tmp/', n_splits: int = 5)
Bases: Dataset[Tuple[Tensor, Tensor]]
UCI regression dataset base class
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class UCIWine(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: UCIRegression
UCI Wine dataset
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
- class UCIYacht(data_dir: str = '/tmp/', n_splits: int = 5)
Bases: UCIRegression
UCI Yacht dataset
- Parameters:
data_dir – path where the raw data is saved; defaults to /tmp/
n_splits – an int describing the number of class stratified splits to compute
Benchmark DataManager
The following functions accept the datasets defined in this package to produce DataManagers containing labelling initialisations that correspond to cold and warm start active learning tasks. These can be used for benchmarking strategies quickly.
Utility to create datamanagers corresponding to different AL tasks
- create_classification_cold_start(dataset: Dataset[Any], train_indices: List[int], test_indices: List[Any], **dm_args: Any) → DataManager
Returns an AL task for benchmarking classification datasets. The initial labelled set contains one example sampled from each of the classes in the training subset of the data.
Please note that the current iteration does not utilise a validation set as described in the paper.
- Parameters:
dataset – a PyTorch dataset in the style described in pyrelational.datasets
train_indices – [int] indices corresponding to observations of dataset used for training set
test_indices – [int] indices corresponding to observations of dataset used for holdout test set
dm_args – kwargs for any additional keyword arguments to be passed into the initialisation of the datamanager.
- create_regression_cold_start(dataset: Dataset[Any], train_indices: List[int], test_indices: List[Any], **dm_args: Any) → DataManager
Creates a DataManager with 2 labelled data samples, chosen as the pair of samples with the largest distance between them.
Please note that the current iteration does not utilise a validation set as described in the paper.
- Parameters:
dataset – a PyTorch dataset in the style described in pyrelational.datasets
train_indices – [int] indices corresponding to observations of dataset used for training set
test_indices – [int] indices corresponding to observations of dataset used for holdout test set
dm_args – kwargs for any additional keyword arguments to be passed into the initialisation of the datamanager.
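The regression cold start labels the two training samples that lie farthest apart. A minimal sketch of that selection step follows. It assumes Euclidean distance measured in feature space, which the description above does not specify, and the helper name farthest_pair is hypothetical rather than a pyrelational function.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple


def farthest_pair(
    features: Sequence[Sequence[float]], train_indices: Sequence[int]
) -> Tuple[int, int]:
    """Return the two training indices whose feature vectors are farthest apart.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    # Brute-force search over every pair of training indices: O(n^2) distance
    # evaluations, which is acceptable for small benchmark datasets.
    return max(
        combinations(train_indices, 2),
        key=lambda pair: math.dist(features[pair[0]], features[pair[1]]),
    )


features = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [2.0, 2.0], [-3.0, -4.0]]
train_indices = [0, 1, 2, 4]  # index 3 is held out, e.g. for testing
labelled_pair = farthest_pair(features, train_indices)  # (2, 4)
```

Starting from two extreme points gives the initial model the widest possible span of the input space from only two labels, which is presumably the motivation for this initialisation.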
- create_warm_start(dataset: Dataset[Any], **dm_args: Any) → DataManager
Returns a DataManager in which a randomly chosen 10% of the observations at the train indices are labelled. The rest of the observations in the training set comprise the unlabelled set. We call this initialisation a ‘warm start’ AL task, inspired by Konyushkova et al. (2017).
This can be used for both classification and regression datasets.
From Ksenia Konyushkova, Raphael Sznitman, Pascal Fua ‘Learning Active Learning from Data’, NIPS 2017
- Parameters:
dataset – a PyTorch dataset in the style described in pyrelational.datasets
dm_args – kwargs for any additional keyword arguments to be passed into the initialisation of the datamanager.
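The warm start partitions the training indices into a randomly chosen 10% that is labelled and a remainder that is unlabelled. The sketch below shows that partition in plain Python. The helper name warm_start_split, its fraction and seed arguments, and the rounding rule are all assumptions for illustration, not the library's code.

```python
import random
from typing import List, Sequence, Tuple


def warm_start_split(
    train_indices: Sequence[int], fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly mark a fraction of the training indices as labelled.

    Hypothetical helper for illustration; not part of pyrelational.
    """
    rng = random.Random(seed)
    pool = list(train_indices)
    # Assumption: round to the nearest integer and label at least one sample.
    n_labelled = max(1, round(len(pool) * fraction))
    labelled = set(rng.sample(pool, n_labelled))
    # Everything not picked stays in the unlabelled pool, in original order.
    unlabelled = [i for i in pool if i not in labelled]
    return sorted(labelled), unlabelled


labelled_indices, unlabelled_indices = warm_start_split(range(200))
```

With 200 training indices this yields 20 labelled and 180 unlabelled observations, and the two lists never overlap.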
- pick_one_sample_per_class(dataset: Dataset[Any], train_indices: List[int]) → List[int]
Utility function to randomly pick one sample per class in the training subset of dataset and return their indices in the dataset. This is used for defining an initial state of the labelled subset in the active learning task.
- Parameters:
dataset – input dataset
train_indices – list or iterable with the indices corresponding to the training samples in the dataset
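The idea behind picking one sample per class can be sketched without any dependencies, as below. For simplicity the sketch takes a plain list of labels instead of a PyTorch dataset; the name pick_one_per_class and the seed argument are hypothetical, and this is not the library's implementation.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def pick_one_per_class(
    labels: Sequence[int], train_indices: Sequence[int], seed: int = 0
) -> List[int]:
    """Randomly choose one training index for every class seen in training.

    Hypothetical helper for illustration; not part of pyrelational.
    """
    rng = random.Random(seed)
    # Collect the training indices that belong to each class.
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx in train_indices:
        by_class[labels[idx]].append(idx)
    # Draw one index per class, visiting classes in sorted order so the
    # result is deterministic for a fixed seed.
    return [rng.choice(candidates) for _, candidates in sorted(by_class.items())]


labels = [0, 1, 2, 0, 1, 2, 0, 1]
train_indices = [0, 1, 2, 3, 4, 6]  # indices 5 and 7 are held out
labelled_indices = pick_one_per_class(labels, train_indices)
```

The returned indices refer to positions in the full dataset, not positions within train_indices, matching the return value described above.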