PeaTMOSS: Mining Pre-Trained Models in Open-Source Software

Preprint Manuscript arXiv 2023 Dataset

Authors

Wenxin Jiang (Co-First Author)
Jason Jones (Co-First Author)
Jerin Yasmin (Co-First Author)
Nicholas M. Synovic
Rajeev Sashti
Sophie Chen
George K. Thiruvathukal
Yuan Tian
James C. Davis

Abstract

Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the widespread use of PTMs, we know little about the corresponding software engineering behaviors and challenges.

To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos.
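
To make the three-part structure described above concrete, here is a minimal sketch of how PTMs, repositories, and the mapping between them could be modeled and queried. It is an illustration only: the table names (`ptm`, `repo`, `ptm_repo`), the column names, and the toy rows are all hypothetical and are not taken from the actual PeaTMOSS schema, which should be checked in the demo repository.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with three parts: PTMs, repos, and a mapping.

    All table names, column names, and rows are hypothetical placeholders,
    not the real PeaTMOSS schema or data.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ptm  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE repo (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE ptm_repo (
            ptm_id  INTEGER NOT NULL REFERENCES ptm(id),
            repo_id INTEGER NOT NULL REFERENCES repo(id),
            PRIMARY KEY (ptm_id, repo_id)
        );
        """
    )
    # Toy rows standing in for (1) PTMs, (2) repositories, (3) the mapping.
    conn.executemany(
        "INSERT INTO ptm VALUES (?, ?)",
        [(1, "example/model-a"), (2, "example/model-b")],
    )
    conn.executemany(
        "INSERT INTO repo VALUES (?, ?)",
        [(1, "org/app-x"), (2, "org/app-y"), (3, "org/app-z")],
    )
    conn.executemany(
        "INSERT INTO ptm_repo VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 2)],
    )
    conn.commit()
    return conn


def reuse_counts(conn: sqlite3.Connection) -> list:
    """Return (ptm_name, number_of_repos) pairs, most-reused PTM first."""
    return conn.execute(
        """
        SELECT ptm.name, COUNT(ptm_repo.repo_id) AS n_repos
        FROM ptm
        LEFT JOIN ptm_repo ON ptm_repo.ptm_id = ptm.id
        GROUP BY ptm.id
        ORDER BY n_repos DESC, ptm.name
        """
    ).fetchall()


if __name__ == "__main__":
    for name, n_repos in reuse_counts(build_toy_db()):
        print(f"{name}: used by {n_repos} repositories")
```

A join table is used for the mapping because one PTM can be reused by many repositories and one repository can depend on many PTMs. Mining questions such as "which PTMs are reused most often?" then reduce to a join and an aggregate.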

Artifacts

Todo

  • Add the paper preprint

  • Add the poster

  • Add link to the source code

Paper Preprint

Download

Published Paper

View

Poster

Download

Source Code

View

BibTeX
@misc{jiang_peatmoss_2023,
   title = {{PeaTMOSS}: {Mining} {Pre}-{Trained} {Models} in {Open}-{Source} {Software}},
   copyright = {All rights reserved},
   shorttitle = {{PeaTMOSS}},
   url = {http://arxiv.org/abs/2310.03620},
   doi = {10.48550/arXiv.2310.03620},
   abstract = {Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the wide-spread use of PTMs, we know little about the corresponding software engineering behaviors and challenges. To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos.},
   urldate = {2024-01-29},
   publisher = {arXiv},
   author = {Jiang, Wenxin and Jones, Jason and Yasmin, Jerin and Synovic, Nicholas and Sashti, Rajeev and Chen, Sophie and Thiruvathukal, George K. and Tian, Yuan and Davis, James C.},
   month = oct,
   year = {2023},
   note = {arXiv:2310.03620 [cs]},
   keywords = {Computer Science - Software Engineering, Computer Science - Artificial Intelligence}
}

Video