About Datapluck

Learn more about Datapluck, a tool for exporting and importing datasets with ease.

What is Datapluck?

`datapluck` is a command-line tool and Python library for exporting datasets from the Hugging Face Hub to various file formats and importing datasets back to the Hugging Face Hub. Supported formats include CSV, TSV, JSON, JSON Lines (jsonl), Microsoft Excel's XLSX, Parquet, SQLite, and Google Sheets.

Typical uses include previewing a dataset in your preferred format, annotating it in an external editor, simplifying dataset management in CLI and CI/CD workflows, and backing up datasets.

Features

Export datasets from the Hugging Face Hub
Import datasets to the Hugging Face Hub
Support multiple formats: CSV, TSV, JSON, JSON Lines (jsonl), Microsoft Excel's XLSX, Parquet, SQLite, and Google Sheets
Handle different dataset splits and subsets
Connect to Google Sheets for import/export operations
Filter columns during import
Support private datasets on Hugging Face
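
To make the format-conversion and column-filtering features concrete, here is a minimal sketch of the underlying idea using only the Python standard library. It does not use or reproduce `datapluck`'s own API: the function names `filter_columns` and `csv_to_jsonl` and the sample data are hypothetical and exist only for illustration.

```python
import csv
import io
import json


def filter_columns(rows, columns):
    """Keep only the requested columns, in the requested order."""
    missing = [c for c in columns if rows and c not in rows[0]]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return [{c: row[c] for c in columns} for row in rows]


def csv_to_jsonl(csv_text, columns=None):
    """Convert CSV text to JSON Lines, optionally keeping a subset of columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if columns is not None:
        rows = filter_columns(rows, columns)
    # One JSON object per line; CSV values stay strings because CSV is untyped.
    return "\n".join(json.dumps(row) for row in rows)


# Hypothetical sample data with three columns; only two are kept.
sample = "text,label,source\ngreat film,1,web\ndull plot,0,print\n"
print(csv_to_jsonl(sample, columns=["text", "label"]))
```

Note that values read from CSV remain strings in this sketch. A real tool would also need to decide how to map column types when moving between untyped formats such as CSV and typed ones such as Parquet.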

Installation

Install `datapluck` from PyPI:

pip install datapluck

Authentication

Before using `datapluck`, make sure you are logged in to the Hugging Face Hub. Authentication is required to access private datasets and to update your own datasets. You can log in using the Hugging Face CLI:

huggingface-cli login

This will prompt you to enter your Hugging Face access token. Once logged in, `datapluck` will use your credentials for operations that require authentication.
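
In scripted or CI/CD contexts it can help to check that a token is available before running any command that needs one. The sketch below is one way to do that with the standard library alone. It assumes the usual `huggingface_hub` conventions (the `HF_TOKEN` environment variable, and a token file at `~/.cache/huggingface/token` that `HF_HOME` or `HF_TOKEN_PATH` can relocate). How `datapluck` itself looks up credentials is not described here, and `find_hf_token` is a hypothetical helper name.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return a Hugging Face token if one can be found, else None.

    Assumed lookup order: the HF_TOKEN environment variable, then a token
    file (HF_TOKEN_PATH, or <HF_HOME>/token, or ~/.cache/huggingface/token).
    Other overrides such as XDG_CACHE_HOME are ignored in this sketch.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    token = (env.get("HF_TOKEN") or "").strip()
    if token:
        return token

    hf_home = Path(env.get("HF_HOME") or home / ".cache" / "huggingface")
    token_path = Path(env.get("HF_TOKEN_PATH") or hf_home / "token")
    if token_path.is_file():
        return token_path.read_text().strip() or None
    return None


# Report only whether a token exists; never print the token itself.
if find_hf_token():
    print("Hugging Face token found")
else:
    print("No token found; run `huggingface-cli login` first")
```

Passing `env` and `home` explicitly keeps the helper easy to test and avoids reading real credentials during a dry run.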

Learn More

View Datapluck on PyPI