What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims
Conference Paper ESEM 2024 Literature Review
Abstract
Background: Software Package Registries (SPRs) are an integral part of the software supply chain. These collaborative platforms unite contributors, users, and code for streamlined package management. Prior work has characterized the SPRs associated with traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. A growing body of empirical research has examined PTM registries from various angles, such as vulnerabilities, reuse processes, and evolution. However, no synthesis provides a systematic understanding of current knowledge. Furthermore, much of the existing research includes non-quantified qualitative observations.
Aims: First, we aim to provide a systematic knowledge synthesis. Second, we quantify qualitative claims.
Methods: We conducted a systematic literature review (SLR). We then observed that some of the claims are qualitative, lacking quantitative evidence. We identify quantifiable metrics associated with those claims, and measure in order to substantiate these claims.
Results: We identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative support. We tested 3 of these claims through a quantitative analysis, and directly compare the fourth with traditional software. Our most notable findings are: (1) PTMs have a significantly higher turnover rate than traditional software, indicating more rapid evolution; and (2) There is a strong correlation between documentation quality and PTM popularity.
Conclusions: Our findings validate several qualitative research claims with concrete metrics, confirming prior research. Our measures motivate further research on the dynamics of PTM reuse.
Artifacts
Todo
Add the paper preprint
Add the poster
Add link to the source code
Update the bibtex
BibTeX
@inproceedings{jones_what_2024,
  address = {New York, NY, USA},
  series = {{ESEM} '24},
  title = {What do we know about {Hugging} {Face}? {A} systematic literature review and quantitative validation of qualitative claims},
  isbn = {979-8-4007-1047-6},
  shorttitle = {What do we know about {Hugging} {Face}?},
  url = {https://dl.acm.org/doi/10.1145/3674805.3686665},
  doi = {10.1145/3674805.3686665},
  abstract = {Background: Software Package Registries (SPRs) are an integral part of the software supply chain. These collaborative platforms unite contributors, users, and code for streamlined package management. Prior work has characterized the SPRs associated with traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. A growing body of empirical research has examined PTM registries from various angles, such as vulnerabilities, reuse processes, and evolution. However, no synthesis provides a systematic understanding of current knowledge. Furthermore, much of the existing research includes non-quantified qualitative observations. Aims: First, we aim to provide a systematic knowledge synthesis. Second, we quantify qualitative claims. Methods: We conducted a systematic literature review (SLR). We then observed that some of the claims are qualitative, lacking quantitative evidence. We identify quantifiable metrics associated with those claims, and measure in order to substantiate these claims. Results: We identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative support. We tested 3 of these claims through a quantitative analysis, and directly compare the fourth with traditional software. Our most notable findings are: (1) PTMs have a significantly higher turnover rate than traditional software, indicating more rapid evolution; and (2) There is a strong correlation between documentation quality and PTM popularity. Conclusions: Our findings validate several qualitative research claims with concrete metrics, confirming prior research. Our measures motivate further research on the dynamics of PTM reuse.},
  urldate = {2024-10-31},
  booktitle = {Proceedings of the 18th {ACM}/{IEEE} {International} {Symposium} on {Empirical} {Software} {Engineering} and {Measurement}},
  publisher = {Association for Computing Machinery},
  author = {Jones, Jason and Jiang, Wenxin and Synovic, Nicholas and Thiruvathukal, George and Davis, James},
  month = oct,
  year = {2024},
  pages = {13--24},
}