What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims

Conference Paper | ESEM 2024 | Literature Review

Authors

Jason Jones
Wenxin Jiang
Nicholas M. Synovic
George K. Thiruvathukal
James C. Davis

Abstract

Background: Software Package Registries (SPRs) are an integral part of the software supply chain. These collaborative platforms unite contributors, users, and code for streamlined package management. Prior work has characterized the SPRs associated with traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. A growing body of empirical research has examined PTM registries from various angles, such as vulnerabilities, reuse processes, and evolution. However, no synthesis provides a systematic understanding of current knowledge. Furthermore, much of the existing research includes non-quantified qualitative observations.

Aims: First, we aim to provide a systematic knowledge synthesis. Second, we quantify qualitative claims.

Methods: We conducted a systematic literature review (SLR). We then observed that some of the claims are qualitative, lacking quantitative evidence. We identify quantifiable metrics associated with those claims, and measure in order to substantiate these claims.

Results: We identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative support. We tested 3 of these claims through a quantitative analysis, and directly compare the fourth with traditional software. Our most notable findings are: (1) PTMs have a significantly higher turnover rate than traditional software, indicating more rapid evolution; and (2) There is a strong correlation between documentation quality and PTM popularity.

Conclusions: Our findings validate several qualitative research claims with concrete metrics, confirming prior research. Our measures motivate further research on the dynamics of PTM reuse.

Artifacts

Published Paper: https://dl.acm.org/doi/10.1145/3674805.3686665

BibTeX
@inproceedings{jones_what_2024,
  address = {New York, NY, USA},
  series = {{ESEM} '24},
  title = {What do we know about {Hugging} {Face}? {A} systematic literature review and quantitative validation of qualitative claims},
  isbn = {979-8-4007-1047-6},
  shorttitle = {What do we know about {Hugging} {Face}?},
  url = {https://dl.acm.org/doi/10.1145/3674805.3686665},
  doi = {10.1145/3674805.3686665},
  abstract = {Background: Software Package Registries (SPRs) are an integral part of the software supply chain. These collaborative platforms unite contributors, users, and code for streamlined package management. Prior work has characterized the SPRs associated with traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. A growing body of empirical research has examined PTM registries from various angles, such as vulnerabilities, reuse processes, and evolution. However, no synthesis provides a systematic understanding of current knowledge. Furthermore, much of the existing research includes non-quantified qualitative observations. Aims: First, we aim to provide a systematic knowledge synthesis. Second, we quantify qualitative claims. Methods: We conducted a systematic literature review (SLR). We then observed that some of the claims are qualitative, lacking quantitative evidence. We identify quantifiable metrics associated with those claims, and measure in order to substantiate these claims. Results: We identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative support. We tested 3 of these claims through a quantitative analysis, and directly compare the fourth with traditional software. Our most notable findings are: (1) PTMs have a significantly higher turnover rate than traditional software, indicating more rapid evolution; and (2) There is a strong correlation between documentation quality and PTM popularity. Conclusions: Our findings validate several qualitative research claims with concrete metrics, confirming prior research. Our measures motivate further research on the dynamics of PTM reuse.},
  urldate = {2024-10-31},
  booktitle = {Proceedings of the 18th {ACM}/{IEEE} {International} {Symposium} on {Empirical} {Software} {Engineering} and {Measurement}},
  publisher = {Association for Computing Machinery},
  author = {Jones, Jason and Jiang, Wenxin and Synovic, Nicholas and Thiruvathukal, George and Davis, James},
  month = oct,
  year = {2024},
  pages = {13--24},
}

Video