Deep learning models require large amounts of data for training, and organizing this data can be a complex task. PyTorch provides the Dataset and DataLoader classes to help with this task. In this blog post, we will go over how to use these classes to build custom datasets and dataloaders for deep learning models.
Dataset
The Dataset class (torch.utils.data.Dataset) is PyTorch's base class for representing a dataset. It defines the interface for accessing individual samples: a subclass implements __len__ to report the number of samples and __getitem__ to return the sample at a given index. You can subclass it to create a custom Dataset for your own data.
Here is an example implementation of a custom Dataset class:
import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, data):
        # data: a list of (input, target) pairs
        self.data = data

    def __len__(self):
        # Number of samples in the dataset
        return len(self.data)

    def __getitem__(self, idx):
        # Look up one sample and convert both parts to tensors
        sample = self.data[idx]
        x = torch.tensor(sample[0])
        y = torch.tensor(sample[1])
        return x, y
In this example, MyDataset is a custom Dataset class that takes a list of data samples as input, where each sample is an (input, target) pair. The __init__ method stores the provided data, and the __len__ method returns the number of samples. The __getitem__ method returns the single sample at the given index as two tensors: x and y.
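To make the roles of __len__ and __getitem__ concrete, here is a minimal sketch that indexes the dataset directly, without a DataLoader. The names toy_data and toy_dataset and the three sample pairs are made up for illustration only.

```python
import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    """Same dataset as above: wraps a list of (input, target) pairs."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x, y = self.data[idx]
        return torch.tensor(x), torch.tensor(y)


# Hypothetical toy data: three (input, target) pairs.
toy_data = [(1, 10), (2, 20), (3, 30)]
toy_dataset = MyDataset(toy_data)

num_samples = len(toy_dataset)  # calls MyDataset.__len__
x0, y0 = toy_dataset[0]         # calls MyDataset.__getitem__(0)

print(num_samples)
print(x0, y0)
```

Calling len() and using square-bracket indexing are all a DataLoader needs from a dataset like this one, which is why these two methods are enough.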
DataLoader
The DataLoader class wraps a dataset and provides an iterable over it, yielding the samples in batches. It can also load samples in parallel using worker processes (controlled by the num_workers argument) while a deep learning model trains. The DataLoader takes a Dataset object as input.
Here is an example of creating a DataLoader:
from torch.utils.data import DataLoader

batch_size = 32
# `data` is a list of (input, target) pairs, such as the one built in the Usage section
dataset = MyDataset(data)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
In this example, MyDataset is the custom dataset that we created earlier. The DataLoader is created by passing the dataset as input along with the batch_size and shuffle parameters. The batch_size parameter sets the number of samples in each batch, while shuffle=True reshuffles the samples at the start of every epoch, so the batches differ from one pass to the next.
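As a rough illustration of how batch_size determines the number of batches, the sketch below uses 100 toy samples (an assumption chosen for this example) with a batch size of 32. Because 100 is not a multiple of 32, the last batch is smaller. The drop_last argument, which is not used elsewhere in this post, is shown only to illustrate how that partial batch can be discarded.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    """Same dataset as above: wraps a list of (input, target) pairs."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x, y = self.data[idx]
        return torch.tensor(x), torch.tensor(y)


# Toy data assumed for illustration: 100 pairs (i, 2*i).
data = [(i, 2 * i) for i in range(100)]
dataset = MyDataset(data)

loader = DataLoader(dataset, batch_size=32, shuffle=True)
# Collect the size of every batch produced in one pass over the data.
batch_sizes = [len(x_batch) for x_batch, _ in loader]

print(len(loader))   # number of batches: 4
print(batch_sizes)   # [32, 32, 32, 4]

# With drop_last=True the incomplete final batch is skipped.
loader_drop = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
print(len(loader_drop))  # 3
```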
Usage
Let's create a sample dataset and use it with the DataLoader.
from torch.utils.data import DataLoader

data = [(i, 2*i) for i in range(100)]
dataset = MyDataset(data)
dataloader = DataLoader(dataset, batch_size=10, shuffle=True)

for batch_idx, (x_batch, y_batch) in enumerate(dataloader):
    print("Batch", batch_idx)
    print("X:", x_batch)
    print("Y:", y_batch)
In this example, we create a dataset of 100 samples, where each sample is a pair of integers (i, 2*i). We then create a DataLoader with a batch size of 10 and shuffle the data. Finally, we iterate over the DataLoader and print each batch of samples.
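One way to convince yourself that shuffling changes only the order of the samples is to check each batch as it arrives. The sketch below is an illustrative sanity check, not part of the original example: it confirms that every batch holds 10 samples, that each y still equals 2*x, and that every sample appears exactly once per pass. The names seen and all_seen_once are introduced here for the check.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    """Same dataset as above: wraps a list of (input, target) pairs."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x, y = self.data[idx]
        return torch.tensor(x), torch.tensor(y)


data = [(i, 2 * i) for i in range(100)]
dataloader = DataLoader(MyDataset(data), batch_size=10, shuffle=True)

seen = []
for x_batch, y_batch in dataloader:
    # Each batch stacks 10 scalar tensors into a 1-D tensor of length 10.
    assert x_batch.shape == (10,)
    # Shuffling reorders samples but keeps each (x, y) pair together.
    assert torch.equal(y_batch, 2 * x_batch)
    seen.extend(x_batch.tolist())

# Every input value 0..99 shows up exactly once in one pass.
all_seen_once = sorted(seen) == list(range(100))
print(all_seen_once)
```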
Conclusion
In this blog post, we went over how to use the Dataset and DataLoader classes in PyTorch to build custom datasets and dataloaders for deep learning models. By using these classes, we can easily organize and load large datasets for training our models.