About Datapluck
Learn more about Datapluck, a tool for exporting and importing datasets with ease.
What is Datapluck?
`datapluck` is a command-line tool and Python library for exporting datasets from the Hugging Face Hub to various file formats and importing datasets back to the Hugging Face Hub. Supported formats include CSV, TSV, JSON, JSON Lines (jsonl), Microsoft Excel's XLSX, Parquet, SQLite, and Google Sheets.
It serves several purposes, such as previewing datasets in your preferred format, annotating datasets using external editors, simplifying dataset management in CLI and CI/CD contexts, and backing up datasets.
Features
- Export datasets from the Hugging Face Hub
- Import datasets to the Hugging Face Hub
- Support multiple output formats: CSV, TSV, JSON, JSON Lines (jsonl), Microsoft Excel's XLSX, Parquet, SQLite, and Google Sheets
- Handle different dataset splits and subsets
- Connect to Google Sheets for import/export operations
- Filter columns during import
- Support private datasets on Hugging Face
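To make the format and column-filtering features concrete, here is a minimal sketch that writes the same rows to three of the formats listed above (CSV, JSON Lines, and SQLite) using only the Python standard library. It is not `datapluck`'s implementation or API: the `export_rows` helper, the `ROWS` sample data, the `data` table name, and the output file names are all hypothetical and exist only for illustration.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical sample rows standing in for one split of a dataset.
ROWS = [
    {"id": 1, "text": "hello", "label": "greeting"},
    {"id": 2, "text": "bye", "label": "farewell"},
]


def export_rows(rows, out_dir, columns=None):
    """Write rows as CSV, JSON Lines, and SQLite; return the three paths."""
    out_dir = Path(out_dir)
    # Keep only the requested columns (all columns if none are given).
    columns = list(columns or rows[0].keys())
    rows = [{c: r[c] for c in columns} for r in rows]

    csv_path = out_dir / "data.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

    jsonl_path = out_dir / "data.jsonl"
    with jsonl_path.open("w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")  # one JSON object per line

    db_path = out_dir / "data.sqlite"
    conn = sqlite3.connect(str(db_path))
    try:
        col_sql = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(f"CREATE TABLE data ({col_sql})")
        conn.executemany(
            f"INSERT INTO data VALUES ({placeholders})",
            [tuple(r[c] for c in columns) for r in rows],
        )
        conn.commit()
    finally:
        conn.close()
    return csv_path, jsonl_path, db_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in export_rows(ROWS, tmp, columns=["id", "text"]):
            print(path.name, path.stat().st_size, "bytes")
```

Passing `columns=["id", "text"]` drops the `label` column from every output, which is the same idea as filtering columns during a transfer.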
Installation
Install `datapluck` from PyPI:
pip install datapluck
Authentication
Before using `datapluck`, log in to the Hugging Face Hub. Authentication is required to access private datasets or to update your own datasets. You can log in with the Hugging Face CLI:
huggingface-cli login
This command prompts you for your Hugging Face access token. Once you are logged in, `datapluck` uses your stored credentials for operations that require authentication.
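In CI/CD jobs it can help to confirm that a token is available before running anything that needs authentication. The sketch below is a hedged illustration, not part of `datapluck`: the `find_hf_token` helper is hypothetical, and it assumes the defaults used by recent `huggingface_hub` releases, namely the `HF_TOKEN` environment variable and a token file stored at `token` under `HF_HOME` (by default `~/.cache/huggingface`). Other configurations, such as a custom token path, are not covered.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return a Hugging Face token from the environment or token file, else None.

    Assumes huggingface_hub's default conventions; this is an illustrative
    helper, not an API of datapluck or huggingface_hub.
    """
    env = os.environ if env is None else env

    # 1. An explicit environment variable takes precedence.
    token = env.get("HF_TOKEN")
    if token and token.strip():
        return token.strip()

    # 2. Otherwise look for the token file written at login (assumed location).
    hf_home = env.get("HF_HOME")
    if hf_home:
        root = Path(hf_home)
    else:
        root = Path(home or Path.home()) / ".cache" / "huggingface"
    token_file = root / "token"
    if token_file.is_file():
        return token_file.read_text(encoding="utf-8").strip() or None
    return None


if __name__ == "__main__":
    if find_hf_token() is None:
        print("No Hugging Face token found; log in before accessing private datasets.")
    else:
        print("Hugging Face token found.")
```

A check like this only confirms that a token is present; it does not verify that the token is valid or has write access.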