I store nearly all files of even moderate importance in Git (including this blog post). These are usually plain-text files, but sometimes it's necessary to put binary files under version control. Unfortunately, those are typically difficult to diff and merge, but I recently discovered some features of Git that make this less painful. This blog post focuses on SQLite database files, but at least some of it applies to other binary file types.
My problem specifically involved managing changes to an SQLite database that contained results for a research study. The database was changing as new results arrived and were processed, and it was important to me to track its changes in case of manual or programming errors.
SQLite stores its database in a pretty complex format (described here). While diffing two SQLite databases can sometimes be human-readable, this entirely depends on the binary that happens to fall right around the modified values. It's doable but sometimes requires a lot of annoying horizontal scrolling past screenfuls of control characters. Life's too short for that.
SQLite can dump entire databases out as SQL statements, and Git can be
configured to do this when generating diffs. In a
.gitattributes
or .git/info/attributes
file, give
Git a filename pattern and the name of a diff driver, which we'll define
next. In my case, I added:
db.sqlite3 diff=sqlite3
Then in .git/config
or $HOME/.gitconfig
, define the
diff driver. Mine looks like:
[diff "sqlite3"] textconv = dumpsqlite3
I chose to define an external dumpsqlite3
script, since this can
be useful elsewhere. It just dumps SQL to stdout for the filename given by
its first argument:
#!/bin/sh sqlite3 $1 .dump
At this point, git diff
should show you plain-text diffs, as
should browsing Git commits. There's still one problem left: sometimes
SQLite's binary database will change, but the actual database contents remain
the same. This results in a git status
that says the database
has changed but a git diff
that says it hasn't.
I don't know enough about SQLite to know why this happens. I thought it was because SQLite doesn't compact free space right away in its database files, but I ran into a case where even if I vacuum two database files with identical contents, they still have different binaries.
One brute solution would be to dump the database contents to SQL and read them back into a "fresh" SQLite database. This should result in a canonical binary database, since SQLite doesn't seem to store anything like a timestamp in there. I suspect you could have your diff driver do this automatically every time it runs, but I haven't tried it yet.