.. _using own data:

Using your own datasets with PyRelationAL
=========================================

The :py:class:`pyrelational.data_managers.data_manager.DataManager` class lets users integrate any PyTorch ``Dataset``
into PyRelationAL. It expects the full dataset, i.e. the union of the labelled, unlabelled,
validation (optional), and test sets. The indices of each set are passed to the class constructor, which
then builds the corresponding subset datasets under the hood. Throughout the experiment, the data manager
keeps track of these indices and handles updates to the labelled and unlabelled pools of samples. For instance, using the MNIST dataset:

.. code-block:: python

    import torch
    from torchvision import datasets, transforms
    from pyrelational.data_managers.data_manager import DataManager

    mnist_dataset = datasets.MNIST(
        "mnist_data",
        download=True,
        train=True,
        transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]),
    )
    train_ds, val_ds, test_ds = torch.utils.data.random_split(mnist_dataset, [50000, 5000, 5000])
    train_indices = train_ds.indices
    validation_indices = val_ds.indices
    test_indices = test_ds.indices
    labelled_indices = train_indices[:10000]

    data_manager = DataManager(
        mnist_dataset,
        train_indices=train_indices,
        labelled_indices=labelled_indices,
        validation_indices=validation_indices,
        test_indices=test_indices,
    )
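
The index bookkeeping described above can be pictured with a small, library-independent sketch. It is not
PyRelationAL's implementation: the class ``IndexPools`` and its ``label`` method are hypothetical names, and the sketch
assumes that the unlabelled pool is simply every training index that has not been labelled yet.

.. code-block:: python

    class IndexPools:
        """Hypothetical illustration of labelled/unlabelled index bookkeeping."""

        def __init__(self, train_indices, labelled_indices):
            labelled = set(labelled_indices)
            if not labelled.issubset(train_indices):
                raise ValueError("labelled indices must be a subset of the training indices")
            self.labelled = sorted(labelled)
            # Assumption: every training index that is not labelled is unlabelled.
            self.unlabelled = sorted(set(train_indices) - labelled)

        def label(self, indices):
            """Move the given indices from the unlabelled pool to the labelled pool."""
            to_move = set(indices)
            if not to_move.issubset(self.unlabelled):
                raise ValueError("can only label indices that are currently unlabelled")
            self.labelled = sorted(set(self.labelled) | to_move)
            self.unlabelled = sorted(set(self.unlabelled) - to_move)


    pools = IndexPools(train_indices=range(10), labelled_indices=[0, 1, 2])
    pools.label([5, 7])
    print(pools.labelled)    # [0, 1, 2, 5, 7]
    print(pools.unlabelled)  # [3, 4, 6, 8, 9]

Validation and test indices are left out of the sketch because, as described above, only the labelled and unlabelled
pools change during an experiment.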

Customizing the dataloader
__________________________

Users can customize the dataloaders in the same way as any PyTorch dataloader by passing PyTorch ``DataLoader`` arguments,
prefixed with ``loader_``, to the data manager constructor, for example:

.. code-block:: python
    :emphasize-lines: 7,8,9

    data_manager = DataManager(
        mnist_dataset,
        train_indices=train_indices,
        labelled_indices=labelled_indices,
        validation_indices=validation_indices,
        test_indices=test_indices,
        loader_batch_size=10000,
        loader_num_workers=2,
        loader_shuffle=True,
    )

Interacting with non-pytorch estimators
_______________________________________

Importantly, the same mechanism lets PyTorch ``Dataset`` and ``DataLoader`` objects interact with other libraries
through the collate function. For instance, the following collate function converts each batch to NumPy arrays:

.. code-block:: python
    :emphasize-lines: 13

    import numpy as np

    def numpy_collate(batch):
        """Stack each field of the batch (features, labels, ...) into a NumPy array."""
        return [np.stack(el) for el in zip(*batch)]

    data_manager = DataManager(
        mnist_dataset,
        train_indices=train_indices,
        labelled_indices=labelled_indices,
        validation_indices=validation_indices,
        test_indices=test_indices,
        loader_collate_fn=numpy_collate,
    )
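
To see what this collate function produces, the sketch below applies it to a toy batch without any ``DataLoader``.
The batch contents are made up for illustration, and the sketch assumes that each dataset item is a
``(features, label)`` pair, as is the case for MNIST.

.. code-block:: python

    import numpy as np


    def numpy_collate(batch):
        """Stack each field of the batch (features, labels, ...) into a NumPy array."""
        return [np.stack(el) for el in zip(*batch)]


    # A toy batch of three (features, label) pairs standing in for dataset items.
    batch = [
        (np.zeros((1, 28, 28)), 3),
        (np.ones((1, 28, 28)), 7),
        (np.ones((1, 28, 28)), 1),
    ]
    features, labels = numpy_collate(batch)
    print(features.shape)  # (3, 1, 28, 28)
    print(labels)          # [3 7 1]

``zip(*batch)`` regroups the per-sample tuples by field, so the features and the labels are each stacked along a new
leading batch dimension.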


Returning a single batch
________________________

In some cases, for example when using Gaussian processes or scikit-learn estimators, the dataloader should return the
entire underlying dataset as a single batch. This can be specified as follows:

.. code-block:: python
    :emphasize-lines: 7

    data_manager = DataManager(
        mnist_dataset,
        train_indices=train_indices,
        labelled_indices=labelled_indices,
        validation_indices=validation_indices,
        test_indices=test_indices,
        loader_batch_size="full",
    )
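
For comparison, a plain PyTorch ``DataLoader`` gives the same effect when its batch size is set to the dataset length.
This sketch only illustrates the idea: whether ``loader_batch_size="full"`` is implemented this way inside PyRelationAL
is an assumption, and the toy tensors are made up.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Toy dataset: 100 samples with 4 features each and a binary label.
    features = torch.randn(100, 4)
    labels = torch.randint(0, 2, (100,))
    dataset = TensorDataset(features, labels)

    # A batch size equal to the dataset length yields exactly one batch.
    loader = DataLoader(dataset, batch_size=len(dataset))
    x, y = next(iter(loader))

    # Estimators with a fit(X, y) interface can consume the arrays directly.
    X_np, y_np = x.numpy(), y.numpy()
    print(len(loader), X_np.shape, y_np.shape)  # 1 (100, 4) (100,)

Loading everything at once is only practical when the whole dataset fits in memory.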