How to delete old files in GIT while keeping history?

+3
−0

I'm far from an expert GIT user. I'm usually the only one working in a repository at a time, and know how to do the basics like committing snapshots. I still have to look up how to create and merge branches on the relatively rare occasions I need to use them.

We have some repositories that have gotten large, past the limit of the free hosting service we are using. Early on, some large files that don't really need to be tracked (like .EXEs) were accidentally included in the files GIT tracks. There are also a lot of old versions of some large files we'll never get back to. It would be nice to delete both these kinds of files, but still keep the commit history.

Of course we could copy the GIT repository, clean it out properly, make sure only the files we really need to be tracked are tracked, create a whole new repository and delete the old one (after archiving it on long-term media, of course). However, that loses the convenience of having the history easily accessible in one place.

Is there a way to effectively fully delete files with all their versions from a GIT repository while still retaining the history and old versions of all other files?


1 comment thread

sounds like a use case for shallow clones (1 comment)

2 answers

+0
−0

History of text changes is highly compressible

This is why Git is good for source code and other text. You can make hundreds of changes to one file and git gc will compact that down by an astonishing amount. Usually you don't need extra flags on it.

For other readers mostly storing text and seeing heavy disk usage with du, git gc is the answer. Some CPU and temporary disk space usage for a while, then you have the same data in less space.
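To see the effect for yourself, here is a minimal sketch on a throwaway repository. The file name notes.txt, the demo identity and the count of 100 edits are made up for illustration; the exact sizes you see will vary.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.invalid"

# 100 small edits to one text file, each committed separately.
i=1
while [ "$i" -le 100 ]; do
    echo "line $i" >> notes.txt
    git add notes.txt
    git commit -q -m "edit $i"
    i=$((i + 1))
done

# Before gc: every blob, tree and commit is a separate loose object.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
git count-objects -vH

git gc -q

# After gc: the same objects live delta-compressed in a pack file.
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
git count-objects -vH
echo "loose objects: $loose_before before gc, $loose_after after; $packed_after now packed"
```

Compare the "size" line of the first report with the "size-pack" line of the second.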

Large files in branches and tags

If you had put your larger files (e.g. binary releases) on a branch, and then decided you didn't want them, you could delete the branch and run git gc to drop the underlying data. You may need flags if you haven't cleared out the reflogs or waited the relevant number of days.
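A rough sketch of that sequence, on a scratch repository with made-up names (a releases branch holding release.bin). The flags shown are the ones I would reach for to skip the waiting period; treat them as one way of doing it, not the only one.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.invalid"
echo "source" > main.txt
git add main.txt
git commit -q -m "initial"
base=$(git rev-parse --abbrev-ref HEAD)   # whatever the default branch is called

# Put a large, incompressible file on a side branch.
git checkout -q -b releases
head -c 1048576 /dev/urandom > release.bin
git add release.bin
git commit -q -m "binary release"
blob=$(git rev-parse HEAD:release.bin)

# Decide we don't want it after all.
git checkout -q "$base"
git branch -q -D releases

# The reflog still remembers the deleted branch's commit, so the blob survives.
git cat-file -e "$blob" && echo "blob still present after branch delete"

# Clear the reflogs and prune immediately instead of waiting for the grace period.
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then status=present; else status=gone; fi
echo "blob is $status after gc"
```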

Where to put large, non-text files?

For next time, consider git-annex, or one of the similar tools listed on its "what git-annex is not" page.

Shallow clone

You may find the git clone --depth ... or git clone --shallow-since ... options useful. Keep the original repository somewhere cheaper, and use a shallow clone as your working copy on the free hosting service.

Here's an example. This repository is huge because I've been committing a JSON file to it every two seconds for days. There are also 339,963 - 339,925 = 38 other commits of code I've written.

proj$ du -hs original/
246M	original/
proj$ (cd original; git log --oneline) | grep -c 'data fetch'
339925
proj$ (cd original; git log --oneline) | wc
 339963 1019990 7479802

git clone takes shortcuts on local filesystems, for efficiency of storage and time, so we have to tell it not to do that. In addition there is some other restriction on shallow local clones, so I'm going via ssh to localhost.

proj$ rm -rf partial/
proj$ git clone -q --shallow-since 2025-03-15 localhost:$PWD/original/ partial/
mcast@localhost's password: 
proj$ du -hs partial/
432K	partial/
proj$ (cd partial; git log --oneline) | wc
    395    1194    7555
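If you don't have ssh to localhost handy, a file:// URL also makes git clone use its normal transport instead of the local shortcut. Here is a small sketch of that; it uses --depth instead of --shallow-since so it doesn't depend on today's date, and the 20 generated commits are only stand-ins for my data fetches.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q original
(
    cd original
    git config user.name "Demo"
    git config user.email "demo@example.invalid"
    i=1
    while [ "$i" -le 20 ]; do
        echo "$i" > data.json
        git add data.json
        git commit -q -m "data fetch $i"
        i=$((i + 1))
    done
)

# A plain path would take the local shortcut and ignore --depth;
# the file:// URL goes through the regular fetch machinery.
git clone -q --depth 5 "file://$demo/original" partial

full=$(cd original && git rev-list --count HEAD)
shallow=$(cd partial && git rev-list --count HEAD)
echo "original: $full commits, partial: $shallow commits"
```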

The history in both is the same - so far as it goes.

proj$ (cd partial; git log -5 --oneline)
c8453c3 (HEAD -> main, origin/main, origin/HEAD) data fetch
94e7b04 fix (?) bitrot in login, after switch from Foo to Bar
808a171 data fetch
d967985 data fetch
96b4f33 data fetch
proj$ (cd original; git log -5 --oneline)
c8453c380e (HEAD -> main) data fetch
94e7b04382 fix (?) bitrot in login, after switch from Foo to Bar
808a171204 data fetch
d96798598a data fetch
96b4f33733 data fetch

Looking at the early history works too,

proj$ (cd original; git log --oneline) | tail -n5
ecf012dfcb data fetch
fe78378c1c data fetch
284f0f48af data fetch
2a7c39fcc1 current quality 37~51, bimodal, peak at 40
1e928135f7 initial empty commit
proj$ (cd partial; git log --oneline) | tail -n5
99ab1c1 data fetch
95519da data fetch
b15e3ee data fetch
73d06a0 new session
fcdf48a note bug

but you can't reach the missing history from the shallow clone,

proj$ (cd partial; git log 284f0f48af)
fatal: ambiguous argument '284f0f48af': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
proj$ (cd partial; git log 284f0f48af -- )
fatal: bad revision '284f0f48af'

Shallow is destructive. How to avoid?

proj$ rm -rf might-be-partial/
proj$ git clone -q localhost:$PWD/partial/ might-be-partial
mcast@localhost's password: 
proj$ du -hs might-be-partial/
432K	might-be-partial/

Yes, lots of data is missing.

proj$ rm -rf whole-again
proj$ git clone -q --reject-shallow localhost:$PWD/partial/ whole-again
mcast@localhost's password: 
fatal: source repository is shallow, reject to clone.
proj$ echo $?
128
proj$ ls whole-again
ls: cannot access 'whole-again': No such file or directory

That is good - I end up with an error, not a quietly broken repository. That is a safe default to set and probably a good idea when you start using shallow copies anywhere.

proj$ git config --global clone.rejectShallow 1
proj$ git clone -q localhost:$PWD/partial/ whole-again
mcast@localhost's password: 
fatal: source repository is shallow, reject to clone.

The converse is git clone --no-reject-shallow ... .

Purging large/binary files from history

Michael's answer about this is great. Here is another way to think about the same facts:

The history is implicit in the commit ID at the head of your branch. This is because that commit ID is a hash of the current project state and the parent commit(s), whose own IDs in turn cover their parents, all the way back. You cannot remove data from the repository without damaging the history and risking breaking the repository.

However, you can proceed with shallow history as above. Or you can omit branches which you don't need. Use the supplied tools and you will get sensible error messages at the right times.

Redacting secrets vs saving storage vs commit history errors

The three common reasons for rewriting history are

  1. Redacting secrets. Normally you would need to track down and delete all previous clones which contained the secrets. It is much safer to just change the affected passwords and be more careful next time.
  2. Saving storage. You can delete the large clones when you're happy with the small ones, or you can have the satisfaction of seeing git gc do it.
  3. "I made a mistake in my earlier commits and I want to fix it". This is usually unnecessary - it is quicker and often better to accept the mistakes as part of history. In the future, you will make better mistakes!

Purging hybrid

I haven't done it or seen it done, but there is another possibility.

  1. git branch -m master large-master
  2. Generate rewritten history with git filter-branch or whatever, which omits the large data. Call it small-master.
  3. Clone small-master onto the free site hosting. In a traditional rewrite, especially for redacting secrets
  4. Maintain parallel branches where you have more storage, and merge small-master into large-master when needed.

This is probably a recipe for great confusion, but it would allow a less destructive history rewrite and might assist your understanding of the process.


+4
−0

Depending on some semantics, it is possible to do what you want, for a sufficiently motivated cohort of "we." The tricky part is that you're rewriting history, and everyone has to agree on that new history.

Specifically, when you say

It would be nice to delete both these kinds of files, but still keep the commit history.

I'm interpreting that as "otherwise keep the commit messages with the diffs of files not to be removed." If you want to get technical, there is no way to purge the files from history without rewriting that history.

Background

The commit SHA, the basic identifier of a specific commit, is dependent on the state of the files in the repository.[1] The implication here is that every single commit forward of your oldest one with an unwanted file will have a different SHA than you use now.
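You can check this dependency directly. The sketch below, on a throwaway repository with made-up content, prints a raw commit object and then recomputes its ID by hashing that same text.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.invalid"
echo "one" > file.txt
git add file.txt
git commit -q -m "first"
echo "two" >> file.txt
git commit -q -a -m "second"

# The raw commit: tree (state of the files), parent, author, committer, message.
git cat-file -p HEAD

# Hashing exactly that text as a commit object reproduces the commit ID.
actual=$(git rev-parse HEAD)
recomputed=$(git cat-file commit HEAD | git hash-object -t commit --stdin)
echo "rev-parse:  $actual"
echo "recomputed: $recomputed"
```

Change any input, including the tree or the parent line, and the ID changes, which is why every descendant commit gets a new SHA too.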

When you're the only consumer of the repository,[2] this isn't too much of a concern. Just make the change, git push --force, and move on.

If you are not the only user, everyone else will need to reset all of their local and remote branches to stem from your changed history instead of the common commit.

Rewriting history

Interactive rebase

Recent commits[3] can be fixed with an interactive rebase, as a commenter mentioned.

I personally like to tag the current head (say git tag tmp/master), find the commit that made the change (say it's deadbeef), and interactive rebase against the next older one: git rebase --interactive deadbeef^.

You'll see a list of commits in your editor with pick next to each one. Change pick to edit next to the commit(s) you need to modify, save, and quit. At each pause, Git will have just applied that commit and will wait for you to set the files to the state you wish. Then git add (or git rm --cached) the changes, fold them into the paused commit with git commit --amend, and run git rebase --continue.
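For practice, the whole round trip can be scripted on a scratch repository. This is only a sketch: the file names (main.c, app.exe, README) are invented, and GIT_SEQUENCE_EDITOR with sed stands in for the editor step you would normally do by hand.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo "code" > main.c
git add main.c
git commit -q -m "initial"

# The commit that accidentally includes a binary.
echo "more code" >> main.c
head -c 100000 /dev/urandom > app.exe
git add main.c app.exe
git commit -q -m "add feature"
bad=$(git rev-parse HEAD)

echo "docs" > README
git add README
git commit -q -m "add docs"

git tag tmp/before-rewrite     # safety net pointing at the old history

# Stand-in for the editor: change the first "pick" in the todo list to "edit".
GIT_SEQUENCE_EDITOR="sed -i.bak '1s/^pick/edit/'" git rebase -q -i "$bad^"

# The rebase is now paused on the bad commit: untrack the binary and amend.
git rm -q --cached app.exe
git commit -q --amend --no-edit
git rebase --continue

git log --oneline --stat
```

The old commits stay reachable through the tag until you delete it, so the repository does not shrink until the tag is gone and git gc has pruned them.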

Bigger tools for older commits

Usually when someone grabs a big hammer to rewrite early history, it's because they put an important credential into Git history that needs to be excised in all its forms.

Then, you'll need to employ a tool like git-filter-repo or git-filter-branch to purge unwanted files from all of history.

Consider practicing on a clone first!! These tools have extreme destructive power.
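Here is one way such a practice run might look, on a scratch repository with a made-up app.exe. It uses git filter-branch only because that ships with Git; its own manual page recommends git-filter-repo instead, which is a separate install. The cleanup steps at the end are what actually free the space.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.invalid"

# Three commits; the first one drags in a large binary.
head -c 1048576 /dev/urandom > app.exe
echo "v1" > main.c
git add app.exe main.c
git commit -q -m "initial"
echo "v2" > main.c
git commit -q -a -m "second"
echo "v3" > main.c
git commit -q -a -m "third"
blob=$(git rev-parse HEAD:app.exe)
before=$(git rev-list --count HEAD)

# Rewrite every commit on every ref, dropping app.exe from each snapshot.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --index-filter 'git rm -q --cached --ignore-unmatch app.exe' \
    -- --all >/dev/null

# filter-branch keeps backups under refs/original/; remove them,
# expire the reflogs, and prune the now-unreachable objects.
git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc -q --prune=now

after=$(git rev-list --count HEAD)
if git cat-file -e "$blob" 2>/dev/null; then purged=no; else purged=yes; fi
echo "commits before: $before, after: $after, binary purged: $purged"
```

The commit messages and the changes to main.c survive, but every commit has a new SHA, so this is exactly the situation where everyone else has to reset onto the rewritten history.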


  1. Dependent on the files, the parent, some dates, authorship, etc. ↩︎

  2. You're the only consumer or your unwanted file hasn't yet been pushed where others can see it ↩︎

  3. If it was your most recent commit, you can even stage the changes away and git commit --amend. ↩︎


1 comment thread

The point of the exercise is to reduce the storage size of the repository. When I said it would be n... (2 comments)
