How to shrink down a git(hub) repository

Starting point

With my Vulkan C++ example github repository approaching 200 MB in size, I decided it was about time to shrink it down to a reasonable size again. Shrinking a git(hub) repository isn’t just about deleting locally present files. It requires cleaning up the history, because files that have been removed are still present in the repository’s history and therefore still contribute to its size.

A big chunk of the repo’s size is caused by binary assets like textures and 3D models. When I started out with my Vulkan examples there were only a few assets, so I just added them to the repository. In hindsight this was the wrong decision, so one of my primary goals was to remove all those assets from the repository and its history. I had already stopped adding assets when I did some examples using HDR textures, and moved those into a separate asset pack that needs to be downloaded to actually run these examples. After removing the assets I’ll no longer add any of them to the repo but rather put them into the separate asset pack.

So in this article I’ll try to describe how to shrink down a long running repository without having to recreate it. For my Vulkan examples this resulted in a much smaller repository that’s a lot faster to clone.

One of the most important things I wasn’t sure about before starting with this: Will github reflect my repository size changes? Yes!

They seem to run housekeeping tasks (git gc) at a pretty quick rate, so pushing after removing files from history will also shrink the repository on the github server.

Before:

After:

(I’m using this Chrome extension to get the size of a github repository displayed on its landing page)

Important note

This process involves rewriting the history of your repository, so everyone that is collaborating needs to rebase or (better) do a fresh clone before doing pull requests again!
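
For collaborators who would rather not re-clone, the following is a minimal sketch of re-syncing an existing clone to the rewritten history. It assumes the remote is called origin and that there is no local work worth keeping, since git reset --hard throws it away. The function name resync_branch is made up for illustration.

```shell
# Sketch: point a local branch at the rewritten remote history.
# WARNING: discards local commits and uncommitted changes on that branch.
# Assumes the remote is named "origin"; resync_branch is a hypothetical helper.
resync_branch() {
    branch="${1:-master}"
    git fetch origin &&
        git checkout "$branch" &&
        git reset --hard "origin/$branch"
}

# Usage (inside an existing clone):
#   resync_branch master
```

Anything that should survive the reset needs to be saved elsewhere first, for example as a patch file.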

Preparations

Once you’re ready to do this clean up on your actual repository consider the following:

Clean up your branches (and tags)

The fewer branches there are, the faster the clean up processes will run. So it’s a good idea to remove all branches that are no longer active and see which branches can be merged into master (and removed). In my case I finished work on the develop branch, merged it into master and removed it. The same goes for tags: remove those that you don’t need anymore.
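
To find candidates for removal, git can list the local branches that are already fully merged. This is only a sketch: merged_branches is a hypothetical helper name, master as the default base is taken from my setup, and the --format option needs a reasonably recent git.

```shell
# Sketch: print local branches that are fully merged into a base branch
# (default: master), excluding the base branch itself.
# merged_branches is a hypothetical helper name.
merged_branches() {
    base="${1:-master}"
    git branch --format='%(refname:short)' --merged "$base" |
        grep -v -x -F -- "$base"
}

# Usage (review the list before deleting anything):
#   merged_branches master
#   git branch -d some-merged-branch
```

git branch -d refuses to delete a branch that isn’t merged, so it acts as a second safety net.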

Take care of open pull requests

As we are rewriting history you should take care of all open pull requests. Either merge before starting to clean up or close those that you don’t want to merge. For PRs you’re unsure about drop a note that you’ll be rebasing and ask the author to resubmit after the rebase.

Test run

As the changes we’re going to make can’t easily be reverted, it may be a good idea to test this on a copy of the repository first and then apply the changes to the live repository in one single run at a later point. Creating a copy of your repository (with a different name) on github is pretty easy using the import function, which also works with github repositories as the source.

Tools used

I’m going to use rtyley’s BFG Repo-Cleaner to remove the files from git history. The other option would be git filter-branch, but BFG is much faster and easier to use, especially on larger repositories. It also adds some safety checks and outputs detailed log files.

Setup

For the cleaning process we’ll be working with two versions of the repository we want to clean up. For this I created a separate folder with only these two repositories.

Clone the bare repository

Cleanup will be run on a bare repository that doesn’t contain the actual files but rather only the administrative and control files normally hidden in the .git sub folder of your full repository:

$ git clone --mirror repository_url

This will create a folder named repository.git.

Clone the full repository

We also clone a full copy of our repository so we can remove files still present and push changes to the remotes:

$ git clone repository_url

In my case:

$ git clone --mirror https://github.com/SaschaWillems/vulkan_slim.git
Cloning into bare repository 'vulkan_slim.git'...

$ git clone https://github.com/SaschaWillems/vulkan_slim.git
Cloning into 'vulkan_slim'...

We also use the bare repository to check progress on shrinking the repository:

$ cd vulkan_slim.git
$ du -sh .
199M    . 

This gives us an initial size of 199M to start with.

The two clones give the following structure for my Vulkan cleanup test run:

cleanup/
    vulkan_slim/ 
    vulkan_slim.git/
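
As a cross-check on du, git can report the size of its packed objects directly. The following is a small sketch. packed_size_kib is a hypothetical helper name, and the value is in KiB as printed in the size-pack line of git count-objects -v.

```shell
# Sketch: print the size of all packed objects of a repository in KiB.
# $1 is the path to the repository (default: current directory).
# packed_size_kib is a hypothetical helper name.
packed_size_kib() {
    git -C "${1:-.}" count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Usage:
#   packed_size_kib vulkan_slim.git
```

Unlike du, this ignores hooks, config files and loose objects, so the two numbers will not match exactly.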

Step 1: Removing files still present

Textures and 3D models currently make up a huge chunk of the repository size, so removing them is the first step in getting the size down. BFG will only remove files that are no longer present in the latest commit. Files that are still present there are treated as protected and left alone.

Before we can run BFG to remove them from the history we need to remove them locally on the full clone and push the changes to the remote:

$ cd ../vulkan_slim
$ rm -rf data/textures
$ rm -rf data/models
$ git commit -am "Removed textures and models from assets"
[master 5b6dac7] Removed textures and models from assets
    167 files changed, 109777 deletions(-)
    delete mode 100644 data/models/angryteapot.3ds
    delete mode 100644 data/models/armor/armor.dae
    ...
$ git push
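
A quick way to check that no asset files are left in the latest commit (where BFG would protect them) could look like the sketch below. The extension list is the one passed to BFG in the next step. list_asset_paths is a hypothetical helper, and the shell pattern only approximates BFG’s own matching, which works on file names rather than paths.

```shell
# Sketch: read paths from stdin and print those whose file name ends in one
# of the asset extensions used in this article.
# list_asset_paths is a hypothetical helper, not part of BFG or git.
list_asset_paths() {
    while IFS= read -r file_path; do
        # Match on the file name only, ignoring the directory part
        case "${file_path##*/}" in
            *.dds|*.DDS|*.ktx|*.KTX|*.dae|*.x|*.X|*.obj|*.3ds|*.fbx)
                printf '%s\n' "$file_path" ;;
        esac
    done
}

# Usage (in the full clone); no output means nothing is left in HEAD:
#   git ls-tree -r --name-only HEAD | list_asset_paths
```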

Now we move over to the bare repository and fetch the changes we just pushed from remote:

$ cd ../vulkan_slim.git
$ git fetch
$ git log
commit 5b6dac7a98a0bc097271fba7feee90cc262d4afd (HEAD -> master)
Author: saschawillems <xxx>
Date:   Sat Sep 9 15:12:35 2017 +0200

    Removed textures and models from assets

Checking the size of the bare repo still returns the same size, as the files are still present in the history. So our next step is to remove them from the history using BFG. The current version of BFG doesn’t support removal by folder name but works fine with wildcard masks. As a positive side effect this will also remove assets deleted at an earlier point:

$ cd ..
$ java -jar v:/bfg.jar --delete-files "*.{dds,DDS,ktx,KTX,dae,x,X,obj,3ds,fbx}" vulkan_slim.git

BFG will now clean up and update all commits (including branches and tags). If a file with one of the above extensions were still present in the latest commit, it wouldn’t get removed. When done, BFG outputs a small summary and also saves a full report to disk. As a result you should get a (partial) list of deleted files that includes the assets we just removed from the full repository:

...
Deleted files
-------------

        Filename                                  Git id
        ------------------------------------------------------------------------------------------------------
        angryteapot.3ds           | 609205f6 (155,1 KB)
        angryteapot.X             | 58280a85 (320,6 KB), 9bdb2864 (320,6 KB)
        armor.dae                 | 55b8d7cb (1,4 MB)
...
        KAMEN-stup.ktx            | fae86aa3 (170,8 KB)

        ...
...
In total, 2693 object ids were changed. Full details are logged here:

        V:\cleanup\vulkan_slim.git.bfg-report\2017-09-09\15-29-19

If you now check the repository size you may notice that it hasn’t really changed. That’s because BFG doesn’t delete anything when cleaning the commits. To strip the no longer needed files from the repository we’ll use git’s gc command:

$ cd vulkan_slim.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

The git reflog expire command expires all reflog entries older than the current time (that is, all of them), while git gc removes the now unreachable objects and recompresses the repository.

Checking the size of the bare repo:

$ du -sh .
103M    . 

Removing the assets reduced the size by ~48%, cutting the size almost in half!
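
The saving can be double-checked from the two du readings. The sketch below uses awk for the floating point arithmetic. percent_saved is a hypothetical helper name.

```shell
# Sketch: print the size reduction from "before" to "after" as a percentage
# with one decimal place. percent_saved is a hypothetical helper name.
percent_saved() {
    LC_ALL=C awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Usage with the sizes measured above (in MB):
#   percent_saved 199 103
```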

If you’re sure about the changes, push them to the remote repository via git push. Because the bare repository is a mirror clone, this will force all refs (branches and tags included) to be updated, so it may take a while.

Step 2: Removing deleted files

With a long running repository chances are that you deleted files months or years ago. Even though these files are no longer present in your local repository they are still stored in the git database (as is the case with other version control systems too), adding to the repository’s size. Having binary files like textures, DLLs and static libraries stored in the repo isn’t of much value, so we want to get rid of those too.

So what we need is a list of file deletions in our repository, which can easily be generated with git’s log feature:

$ git log --diff-filter=D --summary | grep "delete mode" > deleted_files.txt

The --diff-filter=D option restricts the log to commits with file deletions. grep then keeps only the lines that contain deleted file names, stripping away commit messages. We redirect the output to a text file that we can then search for files we want removed.

Without the grep:

$ git log --diff-filter=D --summary
Author: saschawillems <xxx>
Date:   Sat Sep 9 15:12:35 2017 +0200

    Removed textures and models from assets

    delete mode 100644 data/models/armor/license.txt
    delete mode 100644 data/models/retroufo_license.txt

With grep:

$ git log --diff-filter=D --summary | grep "delete mode"
    delete mode 100644 libs/assimp/armeabi-v7a/libassimp.a
    delete mode 100644 data/models/armor/license.txt
    delete mode 100644 data/models/retroufo_license.txt
    delete mode 100644 data/models/sibenik/copyright.txt
    delete mode 100644 data/models/voyager/copyright.txt
    ...
    delete mode 100644 libs/assimp/libassimp.dll.a

We can now go through that list, find the files we want to be permanently removed from our history and tell BFG to remove them. In my case I want all those pre-built libraries to be removed due to their file size. So I went through that list and made up a file name filter for BFG:

$ java -jar v:/bfg.jar --delete-files "{libassimp,libzlibstatic}.*" vulkan_slim.git

Which results in the following:

Deleted files
-------------

        Filename          Git id
        -------------------------------------------------------------
        libassimp.a     | 66d2f0f2 (56,5 MB), 0b6abf20 (93,8 MB), ...
        libassimp.dll.a | d9e1961f (16,3 MB)
        libassimp.dylib | 308054fe (14,9 MB)

Checking the size after running git gc to clean-up and compress the repo again shows another huge change in repository size:

$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ du -sh .
50M     .

Down to ~25% of the initial repository size!

If you want to scrape off a few more megabytes, walk through the list of deleted files and remove them using the same commands as above, either with single BFG runs or by putting them into grouped file name filters. In my case I got my repo down to 35 MB, which is about 18% of the initial repository size.
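
To decide which remaining files are worth the effort, it helps to know how large the blobs in the history actually are. The following sketch uses git plumbing commands to list the largest ones. largest_blobs is a hypothetical helper name, and it has to be run inside the repository.

```shell
# Sketch: print the N largest blobs (default: 10) reachable from any ref,
# one per line as "<size in bytes> <path>", largest first.
# largest_blobs is a hypothetical helper name.
largest_blobs() {
    count="${1:-10}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        sort -k2,2nr |
        head -n "$count" |
        cut -d' ' -f2-
}

# Usage (inside the bare repository):
#   largest_blobs 20
```

The reported sizes are uncompressed, so they only indicate which files are likely to matter most.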

Wrapping it up

If you ran these commands on a separate copy of your repository, like I did, the next step is applying these changes to your actual repository. While doing this on a copy I saved all the commands I ran into a single script, so I can now run all this on my actual Vulkan repository once I’ve taken care of all the open pull requests.
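
Such a script could be built around a small helper like the sketch below, which chains the BFG run with the reflog and gc commands from the steps above. This is not my actual script. The names run and purge_from_history, the DRY_RUN switch and the BFG_JAR variable (defaulting to bfg.jar) are all assumptions made for illustration.

```shell
# Sketch: run one BFG pass followed by reflog expiry and gc.
# run, purge_from_history, DRY_RUN and BFG_JAR are hypothetical names.

# Execute a command, or only print it when DRY_RUN=1 is set
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: path to the bare mirror clone, $2: file name glob for BFG
purge_from_history() {
    run java -jar "${BFG_JAR:-bfg.jar}" --delete-files "$2" "$1" &&
        run git -C "$1" reflog expire --expire=now --all &&
        run git -C "$1" gc --prune=now --aggressive
}

# Usage: preview the commands first, then run them for real
#   DRY_RUN=1 purge_from_history vulkan_slim.git "*.{dds,DDS,ktx,KTX}"
#   purge_from_history vulkan_slim.git "*.{dds,DDS,ktx,KTX}"
```

The dry run makes it easy to review the exact commands before they touch the live repository.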

Once that’s done it’s time to put the binary files that are still required somewhere else. There are multiple options here so go with the one that suits you best:

I did try git lfs, but aside from the technical troubles it gave me (like download errors, etc.) most hosters have tight limits and quotas on the large file storage that limit its usefulness unless you pay for it.

Another option would be storing the assets in a separate repository that is included as a submodule in the main repository.

In the end I just went with a separate asset pack hosted on my own web space and included a simple Python download (and install) script that makes it easy to fetch the assets required to run the examples.