How to delete old files in Git while keeping history?
I'm far from an expert Git user. I'm usually the only one working in a repository at a time, and I know how to do the basics like committing snapshots. I still have to look up how to create and merge branches on the relatively rare occasions I need to use them.

We have some repositories that have grown large, past the limit of the free hosting service we are using. Early on, some large files that don't really need to be tracked (like .EXEs) were accidentally included in the files Git tracks. There are also a lot of old versions of some large files we'll never go back to. It would be nice to delete both these kinds of files, but still keep the commit history.

Of course we could copy the Git repository, clean it out properly, make sure only the files we really need to be tracked are tracked, create a whole new repository and delete the old one (after archiving it on long-term media, of course). However, that loses the convenience of having the history easily accessible in one place.
Is there a way to effectively fully delete files with all their versions from a GIT repository while still retaining the history and old versions of all other files?
2 answers
History of text changes is highly compressible
This is why Git is good for source code and other text. You can make hundreds of changes to one file and `git gc` will compact that down by an astonishing amount. Usually you don't need extra flags on it.

For other readers mostly storing text and seeing heavy disk usage with `du`, `git gc` is the answer. It costs some CPU and temporary disk space for a while, then you have the same data in less space.
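For example, a minimal sketch (sizes will of course vary by repository):

```
# measure, repack with default settings, measure again
du -hs .git/
git gc
du -hs .git/     # text-heavy histories often shrink dramatically
```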
Large files in branches and tags
If you had put your larger files (e.g. binary releases) on a branch, and then decided you didn't want them, you could delete the branch and run `git gc` to drop the underlying data. You may need flags if you haven't cleared out the reflogs or waited the relevant number of days.
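A hedged sketch of that cleanup; the branch name is illustrative, and the expiry flags are the ones alluded to above:

```
git branch -D old-binary-releases      # delete the unwanted branch (hypothetical name)
git reflog expire --expire=now --all   # stop the reflog keeping the old commits alive
git gc --prune=now                     # prune the now-unreachable objects immediately
```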
Where to put large, non-text files?
For next time, consider `git annex` or one of the similar tools listed on the git-annex page "what git-annex is not".
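As a hedged sketch of the git-annex workflow (the filename is hypothetical):

```
# large content is stored outside the normal Git object database,
# with only a small pointer committed to the repository
git annex init
git annex add release-v1.0.exe      # hypothetical large binary
git commit -m "Add release binary via git-annex"
```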
Shallow clone
You may find the `git clone --depth ...` or `git clone --shallow-since ...` options useful. Keep the original repository somewhere cheaper, and use a shallow clone as your working copy on the free hosting service.
Here's an example. This repository is huge because I've been committing a JSON file to it every two seconds for days. There are also 339,963 - 339,925 = 38 other commits of code I've written.
```
proj$ du -hs original/
246M original/
proj$ (cd original; git log --oneline) | grep -c 'data fetch'
339925
proj$ (cd original; git log --oneline) | wc
339963 1019990 7479802
```
`git clone` takes shortcuts on local filesystems, for efficiency of storage and time, so we have to tell it not to do that. In addition there is some other restriction on shallow local clones, so I'm going via ssh to localhost.
```
proj$ rm -rf partial/
proj$ git clone -q --shallow-since 2025-03-15 localhost:$PWD/original/ partial/
mcast@localhost's password:
proj$ du -hs partial/
432K partial/
proj$ (cd partial; git log --oneline) | wc
395 1194 7555
```
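If ssh to localhost is awkward, a `file://` URL should also bypass the local-clone shortcut; this is a hedged alternative (Git itself suggests it when shallow options are ignored for plain local paths):

```
# hedged alternative to the ssh route above
git clone -q --shallow-since 2025-03-15 "file://$PWD/original" partial/
```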
The history in both is the same - so far as it goes.
```
proj$ (cd partial; git log -5 --oneline)
c8453c3 (HEAD -> main, origin/main, origin/HEAD) data fetch
94e7b04 fix (?) bitrot in login, after switch from Foo to Bar
808a171 data fetch
d967985 data fetch
96b4f33 data fetch
proj$ (cd original; git log -5 --oneline)
c8453c380e (HEAD -> main) data fetch
94e7b04382 fix (?) bitrot in login, after switch from Foo to Bar
808a171204 data fetch
d96798598a data fetch
96b4f33733 data fetch
```
Looking at the early history works too,
```
proj$ (cd original; git log --oneline) | tail -n5
ecf012dfcb data fetch
fe78378c1c data fetch
284f0f48af data fetch
2a7c39fcc1 current quality 37~51, bimodal, peak at 40
1e928135f7 initial empty commit
proj$ (cd partial; git log --oneline) | tail -n5
99ab1c1 data fetch
95519da data fetch
b15e3ee data fetch
73d06a0 new session
fcdf48a note bug
```
but you can't reach the missing history from the shallow clone,
```
proj$ (cd partial; git log 284f0f48af)
fatal: ambiguous argument '284f0f48af': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
proj$ (cd partial; git log 284f0f48af -- )
fatal: bad revision '284f0f48af'
```
Shallow is destructive. How to avoid?
```
proj$ rm -rf might-be-partial/
proj$ git clone -q localhost:$PWD/partial/ might-be-partial
mcast@localhost's password:
proj$ du -hs might-be-partial/
432K might-be-partial/
```
Yes, lots of data is missing.
```
proj$ rm -rf whole-again
proj$ git clone -q --reject-shallow localhost:$PWD/partial/ whole-again
mcast@localhost's password:
fatal: source repository is shallow, reject to clone.
proj$ echo $?
128
proj$ ls whole-again
ls: cannot access 'whole-again': No such file or directory
```
That is good - I end up with an error, not a quietly broken repository. That is a safe default to set and probably a good idea when you start using shallow copies anywhere.
```
proj$ git config --global clone.rejectShallow 1
proj$ git clone -q localhost:$PWD/partial/ whole-again
mcast@localhost's password:
fatal: source repository is shallow, reject to clone.
```
The converse is `git clone --no-reject-shallow ...`.
Purging large/binary files from history
Michael's answer about this is great. Here is another way to think about the same facts.

The history is implicit in the commit id of the head of your branch. This is because the HEAD commit id is a hash of the current project state and the parent commit(s). You cannot remove data from the repository without damaging the history and risking breaking the repository.
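You can see this structure directly; `git cat-file -p` prints the fields the commit id is hashed over (output shape sketched in comments):

```
# inspect the raw commit object; its id is a hash over exactly these fields,
# so changing any ancestor changes every descendant's id too
git cat-file -p HEAD
# tree   <hash of the whole project state>
# parent <hash of the previous commit>
# author ...
# committer ...
# (commit message follows)
```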
However, you can proceed with shallow history as above. Or you can omit branches which you don't need. Use the supplied tools and you will get sensible error messages at the right times.
Redacting secrets vs saving storage vs commit history errors
The three common reasons for rewriting history are:

- Redacting secrets. Normally you would need to track down and delete all previous clones which contained the secrets. It is much safer to just change the affected passwords and be more careful next time.
- Saving storage. You can delete the large clones when you're happy with the small ones, or you can have the satisfaction of seeing `git gc` do it.
- "I made a mistake in my earlier commits and I want to fix it". This is usually unnecessary - it is quicker and often better to accept the mistakes as part of history. In the future, you will make better mistakes!
Purging hybrid
I haven't done it or seen it done, but there is another possibility.
- `git branch -m master large-master`
- Generate rewritten history with `git filter-branch` or whatever, which omits the large data. Call it `small-master`.
- Clone `small-master` onto the free site hosting. In a traditional rewrite, especially one for redacting secrets, you would stop here.
- Instead, maintain parallel branches where you have more storage, and merge `small-master` into `large-master` when needed.
This is probably a recipe for great confusion, but it would allow a less destructive history rewrite and might assist your understanding of the process.
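A hedged sketch of those steps; the `*.exe` pattern and the remote name are hypothetical:

```
git branch -m master large-master
git branch small-master large-master
# rewrite only small-master, dropping the large files from every commit
git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch "*.exe"' \
    -- small-master
git push free-host small-master:master          # hypothetical remote
# later: carry the rewritten work back into the full-history branch
git checkout large-master
git merge --allow-unrelated-histories small-master
```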
Depending on some semantics, it is possible to do what you want, for a sufficiently motivated cohort of "we." The tricky part is that you're rewriting history, and everyone has to agree on that new history.
Specifically, when you say
> It would be nice to delete both these kinds of files, but still keep the commit history.
I'm interpreting that as "otherwise keep the commit messages with the diffs of files not to be removed." If you want to get technical, there is no way to purge the files from history without rewriting that history.
Background
The commit SHA, the basic identifier of a specific commit, is dependent on the state of the files in the repository.[1] The implication here is that every single commit forward of your oldest one with an unwanted file will have a different SHA than you use now.
When you're the only consumer of the repository,[2] this isn't too much of a concern. Just make the change, `git push --force`, and move on.
If you are not the only user, everyone else will need to reset all of their local and remote branches to stem from your changed history instead of the common commit.
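A hedged sketch of what each collaborator runs after the rewrite (`main` is an illustrative branch name):

```
# discard the old local history and adopt the rewritten one
git fetch origin
git checkout main
git reset --hard origin/main
```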
Rewriting history
Interactive rebase
Recent commits[3] can be fixed with an interactive rebase, as a commenter mentioned.
I personally like to tag the current head (say `git tag tmp/master`), find the commit that made the change (say it's `deadbeef`), and interactive rebase against the next older one: `git rebase --interactive deadbeef^`.
You'll see a list of commits in your editor with `pick` next to each one. Change `pick` to `edit` next to the commit(s) you need to modify, save, and quit. At each pause, Git will ask you to set the files to the state you wish. Then `git add` them, commit, and `git rebase --continue`.
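A hedged sketch of that flow (`deadbeef` and `big.exe` are illustrative names):

```
git tag tmp/master                    # safety tag so the old head stays reachable
git rebase --interactive deadbeef^    # mark the offending commit(s) as "edit"
# ...at each pause:
git rm big.exe                        # set the files to the state you want
git commit --amend --no-edit
git rebase --continue
```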
Bigger tools for older commits
Usually when someone grabs a big hammer to rewrite early history, it's because they put an important credential into Git history that needs to be excised in all its forms.
Then, you'll need to employ a tool like `git-filter-repo` or `git-filter-branch` to purge unwanted files from all of history.
Consider practicing on a clone first!! These tools have extreme destructive power.
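For instance, a hedged `git-filter-repo` sketch on a throwaway clone (the path and size threshold are hypothetical):

```
git clone original practice    # practice on a clone, per the warning above
cd practice
git filter-repo --invert-paths --path tools/setup.exe   # purge one path from all commits
git filter-repo --strip-blobs-bigger-than 10M           # or drop all blobs over a size
```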