HuggingFace datasets

A usage guide for the Hugging Face Datasets library.

Install

Start by installing 🤗 Datasets:

pip install datasets

🤗 Datasets also supports audio and image data through optional extras:

pip install "datasets[audio]"
pip install "datasets[vision]"

Load Dataset

Import load_dataset:

from datasets import load_dataset

Load remote dataset:

dataset = load_dataset("glue", "mrpc", split="train")

Load local CSV files:

# A single file
dataset1 = load_dataset('csv', data_files='data.csv')
# Several files, listed explicitly
dataset2 = load_dataset('csv', data_files=['train.csv', 'test.csv'])
# All files matching a glob pattern
dataset3 = load_dataset('csv', data_files='/path/to/directory/*.csv')

Load local JSON files:

dataset = load_dataset('json', data_files='data.json')

Load local TXT files:

dataset = load_dataset('text', data_files='data.txt')

Load from disk

A dataset that was saved to local storage with save_to_disk() can be reloaded with the load_from_disk() function, without downloading or processing it again from the Hugging Face Hub. Note that load_from_disk() expects a directory written by save_to_disk(); it is not meant for the download cache, which load_dataset() manages and reuses automatically.

from datasets import load_from_disk

# The directory must have been written earlier by save_to_disk()
local_dataset = load_from_disk("./local_imdb_dataset")

print(local_dataset)
Author: Breynald

Posted on: 2024-10-14

Updated on: 2025-05-28