
Croissant: A Metadata Format for ML-Ready Datasets


Bibliographic Details
Published in: arXiv.org, 2024-12
Main Authors: Akhtar, Mubashara; Benjelloun, Omar; Conforti, Costanza; Foschini, Luca; Giner-Miguelez, Joan; Gijsbers, Pieter; Goswami, Sujata; Jain, Nitisha; Karamousadakis, Michalis; Kuchnik, Michael; Krishna, Satyapriya; Lesage, Sylvain; Lhoest, Quentin; Marcenac, Pierre; Maskey, Manil; Mattson, Peter; Oala, Luis; Oderinwale, Hamidah; Ruyssen, Pierre; Santos, Tim; Shinde, Rajat; Simperl, Elena; Suresh, Arjun; Goeffry, Thomas; Tykhonov, Slava; Vanschoren, Joaquin; Varma, Susheel; van der Velde, Jos; Vogler, Steffen; Wu, Carole-Jean; Zhang, Luyao
Format: Article
Language: English
Description
Summary: Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, enabling easy loading into the most commonly used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.
ISSN: 2331-8422
DOI: 10.48550/arxiv.2403.19546