Git Mining Basics
Git Mining Basics
Working with the
git
version control system to process revision logs
Table of contents
Prerequisites
I recommend reading my previous article Version Control Mining Basics which explains what is version control mining, the definitions of base and derived metrics, and why we care about this with respect to software engineering.
Additionally, we will be computing longitudinal metrics in this series from the torvalds/linux repository.
Finally, we will be using git
, python3.13
, and gitpython
to interface with
the software repository. However, we will be developing tooling that should be
easily extensible and modular to support different git
implementations and
libraries, or even different version control systems entirely.
The idea
For this post, we are going to extract all of the git
revision log information
into a SQLite3 relational database. This is to showcase how to extract data from
a git
repository and showcase how much faster it is to store revision log
infomration outside of git
for future processing.
The VersionControlSystem
interface
To start this project off, we are going to creat a generic
VersionControlSystem
class that our version control system implementations can
inherit from. This generic will have methods to:
- Get the total number of revisions in a repository,
- Get revision log objects, and
- Parse revision logs,
Since we are using python
for this work, we can implement this class as
follows:
from abc import ABC, abstractmethod
from collections.abc import Iterator
class VersionControlSystem(ABC):
def __init__(self) -> None:
pass
@abstractmethod
def count_repository_revisions(self) -> int: ...
@abstractmethod
def get_revision_logs(self) -> list | Iterator: ...
@abstractmethod
def parse_revision_logs(self) -> list[dict]: ...
Implementing Git
Now that we have an interface, we can implement it to support git
. To do so,
we will be utilizing the gitpython
library as it provides performant python
tooling for working with git
repositories.
The __init__
function
Our __init__
is fairly simple:
from vcs import VersionControlSystem
from pathlib import Path
from git import Repo, Commit
from collections.abc import Iterator
class Git(VersionControlSystem):
def __init__(self, repository_path: Path) -> None:
self.repository_path: Path = repository_path.resolve()
self.repository: Repo = Repo(path=self.repository_path)
We start by importing our VersionControlSystem
interface we created in the
previous section, as well as the pathlib.Path
and git.Repo
objects.
pathlib.Path
enables us to resolve relative file paths to absolute file paths.
git.Repo
provides gitpython
functionality. We then connect to a git
repository using a filesystem path, and create the git.Repo
object.
Counting the number of revisions
While our __init__
method was fairly simple, our count_repository_revisions
implementation is even simpler.
...
def count_repository_revisions(self) -> int:
return int(self.repository.git.rev_list("--count", "--all"))
This method utilizes a feature of gitpython
to readily call git
commands.
Here we leverage git rev-list --all
which returns a list of all git
revisions. Adding the --count
flag then counts all of the revisions for us. We
then wrap the output of the command in an int
method to type-cast to an
integer value.
Extracting revision logs
gitpython
once again makes our lives easy. One of the libraries methods,
Repo.iter_commits
provides trivial access to a git
repository’s revision
logs.
...
def get_revision_logs(self) -> Iterator[Commit]:
return self.repository.iter_commits(rev=self.repository.head.commit)
iter_commits
enables the application to yield
commit information from a
repository. This enables both concise application writing (in our case), but
also memory effeciancy as we do not have to buffer all of the revision logs
prior to reading them.
Parsing revision logs
Here is where we need to do some work. Once we get a revision loaded into
memory, we now need to parse the metadata. However, this is also pretty simple
to do thanks to iter_commits
.
...
def parse_revision_logs(self, revision_logs: list | Iterator[Commit]) -> list[dict]:
data: list[dict] = []
revision_log: Commit
for revision_log in revision_logs:
data.append(
{
"author": revision_log.author,
"authored_date": revision_log.authored_date,
"author_tz_offset": revision_log.author_tz_offset,
"committer": revision_log.committer,
"committed_date": revision_log.committed_date,
"committer_tz_offset": revision_log.committer_tz_offset,
"message": revision_log.message,
"encoding": revision_log.encoding,
"parents": revision_log.parents,
"sha": revision_log.hexsha,
}
)
return data
Results
Here is an example output from the torvalds/linux
project using the code we
wrote:
{
'author': <git.Actor "Linus Torvalds <torvalds@linux-foundation.org>">,
'author_tz_offset': 25200,
'authored_date': 1755270154,
'committed_date': 1755270154,
'committer': <git.Actor "Linus Torvalds <torvalds@linux-foundation.org>">,
'committer_tz_offset': 25200,
'encoding': 'UTF-8',
'message': "Merge tag 'io_uring-6.17-20250815' of git://git.kernel.dk/linux\n"
'\n'
'Pull io_uring fixes from Jens Axboe:\n'
'\n'
' - Tweak for the fairly recent changes of minimizing io-wq '
'worker\n'
" creations when it's pointless to create them.\n"
'\n'
' - Fix for an issue with ring provided buffers, which could '
'cause issues\n'
' with reuse or corrupt application data.\n'
'\n'
"* tag 'io_uring-6.17-20250815' of git://git.kernel.dk/linux:\n"
' io_uring/io-wq: add check free worker before create new '
'worker\n'
' io_uring/net: commit partial buffers on retry\n',
'parents': (<git.Commit "8d084337a32fde0ffa59d5f70d07a54987911ba1">,
<git.Commit "9d83e1f05c98bab5de350bef89177e2be8b34db0">),
'sha': '4ad976b0e8ea3247c607cd37abb09440806f898d'}
While this output does not stringify gitpython
captured objects (e.g.,
git.Commit
, git.Actor
), it is perfectly usable for future contexts.
Wrapping up
We’ve successfully leveraged gitpython
to mine revision log information from a
git
repository. Additionally, we created a generic VersionControlSystem
class that we will extend later on.
Hopefully this blog post provides a practical introduction to mining version
control systems. The next post will focus on extracting metrics from git
repositories as well