Post History
#1: Initial revision
# History of text changes is highly compressible

This is why Git is good for source code and other text. You can make hundreds of changes to one file and [`git gc`](https://git-scm.com/docs/git-gc) will compact that down by an astonishing amount. Usually you don't need extra flags on it.

For other readers mostly storing text and seeing heavy disk usage with `du`, `git gc` is the answer. Some CPU and temporary disk space usage for a while, then you have the same data in less space.

## Large files in branches and tags

If you had put your larger files (e.g. binary releases) on a branch, and then decided you didn't want them, you could delete the branch and run `git gc` to drop the underlying data. You may need flags if you haven't cleared out the reflogs or waited the relevant number of days.

## Where to put large, non-text files?

For next time, consider [`git annex`](https://git-annex.branchable.com/) or some of the similar tools on [what git-annex is not](https://git-annex.branchable.com/not/).

# Shallow clone

You may find the `git clone --depth ...` or `git clone --shallow-since ...` options useful. Keep the original repository somewhere cheaper, and use a shallow clone as your working copy on the free hosting service.

Here's an example. This repository is huge because I've been committing a JSON file to it every two seconds for days. There are also 339,963 - 339,925 = 38 other commits of code I've written.

```
proj$ du -hs original/
246M    original/
proj$ (cd original; git log --oneline) | grep -c 'data fetch'
339925
proj$ (cd original; git log --oneline) | wc
 339963 1019990 7479802
```

`git clone` takes shortcuts on local filesystems, for efficiency of storage and time, so we have to tell it not to do that. In addition there is some other restriction on shallow local clones, so I'm going via ssh to localhost.
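If ssh to localhost is not set up, a `file://` URL is another way to make `git clone` use its regular transport, which is what the shallow options need. The following is a minimal sketch on a throwaway repository, not the repository from this answer: the directory names, the commit messages and the `--depth 2` value are made up for illustration.

```shell
#!/bin/sh
# Sketch: shallow clone of a local repository through a file:// URL.
# Everything lives in a fresh temporary directory; all names are hypothetical.
set -eu

work=$(mktemp -d)
cd "$work"

git init -q original
cd original
git config user.name "Demo User"
git config user.email "demo@example.invalid"

# Five small commits stand in for the long "data fetch" history.
for i in 1 2 3 4 5; do
    echo "{\"sample\": $i}" > data.json
    git add data.json
    git commit -q -m "data fetch $i"
done
cd "$work"

# A plain path would be cloned with the local shortcuts; a file:// URL goes
# through the regular transport, so --depth takes effect.
git clone -q --depth 2 "file://$work/original" partial

full_count=$(git -C original rev-list --count HEAD)
shallow_count=$(git -C partial rev-list --count HEAD)
is_shallow=$(git -C partial rev-parse --is-shallow-repository)

echo "original commits: $full_count"
echo "partial commits:  $shallow_count"
echo "partial shallow?  $is_shallow"
```

`git rev-parse --is-shallow-repository` is a quick way to check whether a given working copy is missing history. The transcript that follows sticks with ssh to localhost.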
```
proj$ rm -rf partial/
proj$ git clone -q --shallow-since 2025-03-15 localhost:$PWD/original/ partial/
mcast@localhost's password: 
proj$ du -hs partial/
432K    partial/
proj$ (cd partial; git log --oneline) | wc
    395    1194    7555
```

The history in both is the same - so far as it goes.

```
proj$ (cd partial; git log -5 --oneline)
c8453c3 (HEAD -> main, origin/main, origin/HEAD) data fetch
94e7b04 fix (?) bitrot in login, after switch from Foo to Bar
808a171 data fetch
d967985 data fetch
96b4f33 data fetch
proj$ (cd original; git log -5 --oneline)
c8453c380e (HEAD -> main) data fetch
94e7b04382 fix (?) bitrot in login, after switch from Foo to Bar
808a171204 data fetch
d96798598a data fetch
96b4f33733 data fetch
```

Looking at the early history works too,

```
proj$ (cd original; git log --oneline) | tail -n5
ecf012dfcb data fetch
fe78378c1c data fetch
284f0f48af data fetch
2a7c39fcc1 current quality 37~51, bimodal, peak at 40
1e928135f7 initial empty commit
proj$ (cd partial; git log --oneline) | tail -n5
99ab1c1 data fetch
95519da data fetch
b15e3ee data fetch
73d06a0 new session
fcdf48a note bug
```

but you can't reach the missing history from the shallow clone,

```
proj$ (cd partial; git log 284f0f48af)
fatal: ambiguous argument '284f0f48af': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
proj$ (cd partial; git log 284f0f48af -- )
fatal: bad revision '284f0f48af'
```

## Shallow is destructive. How to avoid?

```
proj$ rm -rf might-be-partial/
proj$ git clone -q localhost:$PWD/partial/ might-be-partial
mcast@localhost's password: 
proj$ du -hs might-be-partial/
432K    might-be-partial/
```

Yes, lots of data is missing.

```
proj$ rm -rf whole-again
proj$ git clone -q --reject-shallow localhost:$PWD/partial/ whole-again
mcast@localhost's password: 
fatal: source repository is shallow, reject to clone.
proj$ echo $?
128
proj$ ls whole-again
ls: cannot access 'whole-again': No such file or directory
```

That is good - I end up with an error, not a quietly broken repository.

**That is a safe default to set** and probably a good idea when you start using shallow copies anywhere.

```
proj$ git config --global clone.rejectShallow 1
proj$ git clone -q localhost:$PWD/partial/ whole-again
mcast@localhost's password: 
fatal: source repository is shallow, reject to clone.
```

The converse is `git clone --no-reject-shallow ...`.

# Purging large/binary files from history

Michael's answer about this is great. Here is another way to think about the same facts.

The history is implicit in the commit ID of the head of your branch. This is because the HEAD commit ID is a hash of the current project state and the parent commit(s).

You cannot remove data from the repository without damaging the history and risking breaking the repository. However, you can proceed with shallow history as above. Or you can omit branches which you don't need.

Use the supplied tools and you will get sensible error messages at the right times.

## Redacting secrets _vs_ saving storage _vs_ commit history errors

The three common reasons for rewriting history are

1. Redacting secrets. Normally you would need to track down and delete all previous clones which contained the secrets. It is **much safer to just change the affected passwords** and be more careful next time.
2. Saving storage. You can delete the large clones when you're happy with the small ones, or you can have the satisfaction of seeing `git gc` do it.
3. "I made a mistake in my earlier commits and I want to fix it". This is usually unnecessary - it is quicker and often better to accept the mistakes as part of history. In the future, you will make better mistakes!

## Purging hybrid

I haven't done it or seen it done, but there is another possibility.

1. `git branch -m master large-master`
2. Generate rewritten history with `git filter-branch` or whatever, which omits the large data. Call it `small-master`.
3. Clone `small-master` onto the free hosting service. In a traditional rewrite, especially for redacting secrets
4. Maintain parallel branches where you have more storage, and merge `small-master` into `large-master` when needed.

This is **probably a recipe for great confusion**, but it would allow a less destructive history rewrite and might assist your understanding of the process.
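The four steps above might look roughly like the following on a scratch repository. This is only a sketch under assumptions: the file names (`code.txt`, `big.bin`, `notes.txt`), the `hosted` directory standing in for the free hosting service, and the choice of `git filter-branch --index-filter` with `--prune-empty` are all illustrative, and `FILTER_BRANCH_SQUELCH_WARNING` is set only to skip the pause that newer versions of `git filter-branch` add.

```shell
#!/bin/sh
# Sketch of the hybrid idea on a scratch repository (all names hypothetical).
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's warning pause

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.invalid"

# History: some code, then a large binary, then more code.
echo 'code v1' > code.txt
git add code.txt
git commit -q -m 'add code'
head -c 100000 /dev/urandom > big.bin
git add big.bin
git commit -q -m 'add large binary'
echo 'code v2' > code.txt
git commit -q -am 'update code'

# 1. Keep the full history under a new name (renames the current branch).
git branch -m large-master

# 2. Rewrite a copy of that branch without the large file.
git branch small-master large-master
git filter-branch --prune-empty \
    --index-filter 'git rm -q --cached --ignore-unmatch big.bin' \
    -- small-master >/dev/null

# 3. Clone only the small branch; "hosted" stands in for the hosting service.
git clone -q --single-branch --branch small-master \
    "file://$work/demo" "$work/hosted"

# 4. New work lands on small-master and is merged into large-master.
git checkout -q small-master
echo 'notes' > notes.txt
git add notes.txt
git commit -q -m 'add notes'
git checkout -q large-master
git merge -q --no-edit small-master

hosted_commits=$(git -C "$work/hosted" rev-list --count HEAD)
hosted_big=$(git -C "$work/hosted" rev-list --count HEAD -- big.bin)
echo "hosted commits: $hosted_commits, of which touching big.bin: $hosted_big"
```

In this sketch the merge in step 4 is clean because the new commit on `small-master` touches a file that `large-master` has not changed. If both branches later change the same lines differently, that merge will conflict, which is one form the confusion can take.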