# Tutorial 3: Datasets In this note, we present how to use the out-of-the-box datasets to simulate different federated learning (FL) scenarios. Besides, we introduce how to use the customized dataset in COALA. We currently provide four out-of-the-box datasets: FEMNIST, Shakespeare, CIFAR-10, and CIFAR-100. FEMNIST and Shakespeare are adopted from [LEAF benchmark](https://leaf.cmu.edu/). We plan to integrate and provide more out-of-the-box datasets in the future. ## Out-of-the-box Datasets The simulation of different FL scenarios is configured in the configurations. You can refer to the other [tutorial](config.md) to learn more about how to modify configs. In this note, we focus on how to config the datasets with different simulations. The following are dataset configurations. ```yaml data: # The root directory where datasets are stored. root: "./data/" # The name of the dataset, support: femnist, shakespeare, cifar10, and cifar100. dataset: femnist # The data distribution of each client, support: iid, niid (for femnist and shakespeare), and dir and class (for cifar datasets). # `iid` means independent and identically distributed data. # `niid` means non-independent and identically distributed data for FEMNIST and Shakespeare. # `dir` means using Dirichlet process to simulate non-iid data, for CIFAR-10 and CIFAR-100 datasets. # `class` means partitioning the dataset by label classes, for datasets like CIFAR-10, CIFAR-100. split_type: "iid" # The minimal number of samples in each client. It is applicable for LEAF datasets and dir simulation of CIFAR-10 and CIFAR-100. min_size: 10 # The fraction of data sampled for LEAF datasets. e.g., 10% means that only 10% of the total dataset size is used. data_amount: 0.05 # The fraction of the number of clients used when the split_type is 'iid'. iid_fraction: 0.1 # Whether partition users of the dataset into train-test groups. Only applicable to femnist and shakespeare datasets. # True means partitioning users of the dataset into train-test groups. # False means partitioning each users' samples into train-test groups. user: False # The fraction of data for training; the rest are for testing. train_test_split: 0.9 # The number of classes in each client. Only applicable when the split_type is 'class'. class_per_client: 1 # The targeted number of clients to construct.used in non-leaf dataset, number of clients split into. for leaf dataset, only used when split type class. num_of_clients: 100 # The parameter for Dirichlet distribution simulation, applicable only when split_type is `dir` for CIFAR datasets. alpha: 0.5 # The targeted distribution of quantities to simulate data quantity heterogeneity. # The values should sum up to 1. e.g., [0.1, 0.2, 0.7]. # The `num_of_clients` should be divisible by `len(weights)`. # None means clients are simulated with the same data quantity. weights: NULL ``` Among them, `root` is applicable to all datasets. It specifies the directory to store datasets. COALA automatically downloads a dataset if it is not exist in the root directory. Next, we introduce the simulation and configuration for specific datasets. ### FEMNIST and Shakespeare Datasets The following are basic stats of these two datasets. FEMNIST * Overview: Image Dataset * Details: 3500 users, 62 different classes (10 digits, 26 lowercase, 26 uppercase), images are 28 by 28 pixels (with option to make them all 128 by 128 pixels) * Task: Image Classification Shakespeare * Overview: Text Dataset of Shakespeare Dialogues * Details: 1129 users (reduced to 660 with our choice of sequence length.) * Task: Next-Character Prediction The datasets are non-IID (independent and identically distributed) in nature. `split_type`: There are two options for these two datasets: `iid` and `niid`, representing IID data simulation and non-IID data simulation. Five hyper-parameters determine the simulated dataset: `min_size`, `data_amount`, `iid_fraction`, `tran_test_split`, and `user`. `user` is a boolean that determines whether to partition the dataset to train test group by user or samples. `user: True` means partitioning users of the dataset into train-test groups, i.e. some users are for training, some users are for testing. `user: False` means partitioning each users' samples into train-test groups, i.e. data in each client is partitioned into training set and testing set. Note: we normally use `test_mode: test_in_clients` for these two datasets. #### IID Simulation In IID simulation, data are randomly partitioned into multiple clients. The number of clients is determined by `data_amount` and `iid_fraction`. #### Non-IID Simulation Since FEMNIST and Shakespeare are non-IID in nature, each user of the dataset is regarded as a client. `data_amount` determine the number of clients participate in training. ### CIFAR-10 and CIFAR-100 Datasets > The **CIFAR-10** dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. > The **CIFAR-100** dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class. There are 50000 training images and 10000 test images. `split_type`: There are three options for CIFAR datasets: `iid`, `dir`, and `class`. Three hyper-parameters determine the simulated dataset: `num_of_clients`, `class_per_client`, and `alpha`. #### IID Simulation In IID simulation, the training images of the datasets are randomly partitioned into `num_of_clients` clients. #### Non-IID Simulation We can simulate non-IID CIFAR datasets by Dirichlet process (`dir`) or by label class (`class`). `alpha` controls the level of heterogeneity for `dir` simulation. `class_per_client` determines the number of classes in each client. ## Customize Datasets COALA also supports integrating with customized dataset to simulate federated learning. You can use the following classes to integrate customized dataset: [FederatedImageDataset](../api.html#coala.datasets.FederatedImageDataset), [FederatedTensorDataset](../api.html#coala.datasets.FederatedTensorDataset), and [FederatedTorchDataset](../api.html#coala.datasets.FederatedTorchDataset). The simplest way is to use [FederatedTorchDataset](../api.html#coala.datasets.FederatedTorchDataset). Here is the pseudo code for constructing new datasets with this class. ```python # Define client ids clients = ["client_1", "client_2"] # Construct the dataloader for each client. # The dataloader is the default PyTorch DataLoader type. train_sets = {} for client in clients: train_loader = DataLoader(training_data, batch_size=64, shuffle=True) train_sets[client] = train_loader # Suppose there is only only one test data on the server test_set = DataLoader(test_data, batch_size=64, shuffle=False) # Construct federated training datasets. train_data = FederatedTorchDataset(train_sets, clients, is_loaded=False) test_data = FederatedTorchDataset(test_set, clients, is_loaded=False) # Then you can use the train_data and test_data by registering them via high-level apis. coala.register_dataset(train_data, test_data) ``` You can refer to [application folder](https://github.com/SonyResearch/COALA/tree/main/application) under the root directory for specific examples. ### Create Your Own Federated Dataset In case that the provided federated dataset class is not enough, you can implement your own federated dataset by inherit and implement [FederatedDataset](../api.html#coala.datasets.FederatedDataset). You can refer to [FederatedImageDataset](../api.html#coala.datasets.FederatedImageDataset), [FederatedTensorDataset](../api.html#coala.datasets.FederatedTensorDataset), and [FederatedTorchDataset](../api.html#coala.datasets.FederatedTorchDataset) on how to implement.