PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages


Conference Paper MSR 2023 Dataset

Authors

Wenxin Jiang (Co-First Author)

Nicholas M. Synovic (Co-First Author)

Purvish Jajal

Taylor R. Schorlemmer

Arav Tewari

Bhavesh Pareek

George K. Thiruvathukal

James C. Davis

Abstract

Due to the cost of developing and training deep learning models from scratch, machine learning engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks. PTM registries known as “model hubs” support engineers in distributing and reusing deep learning models. PTM packages include pre-trained weights, documentation, model architectures, datasets, and metadata. Mining the information in PTM packages will enable the discovery of engineering phenomena and tools to support software engineers. However, accessing this information is difficult — there are many PTM registries, and both the registries and the individual packages may have rate limiting for accessing the data.

We present an open-source dataset, PTMTorrent, to facilitate the evaluation and understanding of PTM packages. This paper describes the creation, structure, usage, and limitations of the dataset. The dataset includes a snapshot of 5 model hubs and a total of 15,913 PTM packages. These packages are represented in a uniform data schema for cross-hub mining. We describe prior uses of this data and suggest research opportunities for mining using our dataset.

The PTMTorrent dataset (v1) is available at: https://app.globus.org/file-manager?origin_id=55e17a6e-9d8f-11ed-a2a2-8383522b48d9&origin_path=%2F%7E%2F.

Our dataset generation tools are available on GitHub: https://doi.org/10.5281/zenodo.7570357.

Artifacts

Todo

Add the paper preprint
Add the poster
Add link to the source code

Paper Preprint

Published Paper

Poster

Source Code

BibTeX

@inproceedings{jiang_ptmtorrent_2023,
   title = {{PTMTorrent}: {A} {Dataset} for {Mining} {Open}-source {Pre}-trained {Model} {Packages}},
   copyright = {All rights reserved},
   shorttitle = {{PTMTorrent}},
   doi = {10.1109/MSR59073.2023.00021},
   abstract = {Due to the cost of developing and training deep learning models from scratch, machine learning engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks. PTM registries known as ``model hubs'' support engineers in distributing and reusing deep learning models. PTM packages include pre-trained weights, documentation, model architectures, datasets, and metadata. Mining the information in PTM packages will enable the discovery of engineering phenomena and tools to support software engineers. However, accessing this information is difficult --- there are many PTM registries, and both the registries and the individual packages may have rate limiting for accessing the data. We present an open-source dataset, PTMTorrent, to facilitate the evaluation and understanding of PTM packages. This paper describes the creation, structure, usage, and limitations of the dataset. The dataset includes a snapshot of 5 model hubs and a total of 15,913 PTM packages. These packages are represented in a uniform data schema for cross-hub mining. We describe prior uses of this data and suggest research opportunities for mining using our dataset. The PTMTorrent dataset (v1) is available at: https://app.globus.org/file-manager?origin\_id=55e17a6e-9d8f-11ed-a2a2-8383522b48d9\&origin\_path=\%2F\%7E\%2F. Our dataset generation tools are available on GitHub: https://doi.org/10.5281/zenodo.7570357},
   booktitle = {2023 {IEEE}/{ACM} 20th {International} {Conference} on {Mining} {Software} {Repositories} ({MSR})},
   author = {Jiang, Wenxin and Synovic, Nicholas and Jajal, Purvish and Schorlemmer, Taylor R. and Tewari, Arav and Pareek, Bhavesh and Thiruvathukal, George K. and Davis, James C.},
   month = may,
   year = {2023},
   note = {ISSN: 2574-3864},
   keywords = {Software, Metadata, Data mining, Training, Deep learning, Empirical software engineering, Machine learning, Documentation, Limiting, Data Mining, Open-Source Software},
   pages = {57--61},
}