Migration

At Algeo we used Mercurial for version control, hosted on Bitbucket. We chose Hg seven years ago, when the popularity contest between Git and Hg wasn’t over yet. Mercurial has the advantage that it is written in Python, so it is very easy to extend. At the same time, this also makes it slower. Since our repo was relatively small, speed was never a concern. Back then we simply decided based on the available hosting services (GitHub was not big and did not have free private repos). Much has changed since: Git won the DVCS war and Mercurial is losing popularity. No wonder that Bitbucket decided to stop supporting Mercurial repositories from February 2020. Thus, we had to migrate to Git eventually.

In this post I summarize how the migration was done using Hg-Git. Hg-Git is a nice Hg extension that lets you push to and pull from remote Git repositories. For a one-way migration we only needed pushing. The migration was also the perfect time to clean up the repo and remove some large files that were accidentally committed to the repository. By the end of our migration, a 600M repository was reduced to a 140M one!

Setting up Hg-Git

First and foremost we have to set up Hg-Git. It is very easy: just clone the Hg-Git repository somewhere and add the following lines to your mercurial.ini:

[extensions]
hggit = path/to/hg-git/hggit

[git]
branch_bookmark_suffix=_migrate

You just have to specify the path to Hg-Git. We will get back to branch_bookmark_suffix later. Since Mercurial only stores the author name and Git needs the email as well, we have to set up a mapping between the two. We use an authors.txt file:

hg_user1 = git_user1 <user1@email.com>
hg_user2 = git_user2 <user2@email.com>

This tells Hg-Git to map hg_user1 to git_user1 and hg_user2 to git_user2. Note the format of the email after the username: Git requires exactly this structure! Now we just have to point Hg-Git to this file. The authors setting belongs to the [git] section of mercurial.ini, next to branch_bookmark_suffix:

[git]
authors = absolute/path/to/authors.txt
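
For a repository with many contributors, typing the mapping by hand gets tedious. The sketch below shows one way a starting point for authors.txt could be generated from the author strings recorded in Mercurial. The function name authors_template, the way the login is derived and the example.com domain are made-up placeholders, not part of Hg-Git. Authors that already look like "Name <email>" are skipped on the assumption that they need no mapping. The generated lines still have to be reviewed and given real email addresses.

```python
import re

# Matches author strings that already carry an email, e.g. "Jane Doe <jane@example.com>"
AUTHOR_WITH_EMAIL = re.compile(r'^.+ <[^<>@\s]+@[^<>\s]+>$')


def authors_template(hg_authors, domain='example.com'):
    """Return draft authors.txt lines for authors that lack an email.

    `domain` is a placeholder; replace the generated addresses by hand.
    """
    lines = []
    unique = sorted(set(a.strip() for a in hg_authors if a.strip()))
    for author in unique:
        if AUTHOR_WITH_EMAIL.match(author):
            continue  # assumed to need no mapping
        login = re.sub(r'\s+', '.', author.lower())
        lines.append('{0} = {0} <{1}@{2}>'.format(author, login, domain))
    return lines


if __name__ == '__main__':
    # Hypothetical author list; in practice read it from `hg log`
    for line in authors_template(['hg_user1', 'hg_user2', 'hg_user1']):
        print(line)
```

The input can be collected with hg log -T "{author}\n", which prints one author string per changeset.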

This concludes our setup. Clone your mercurial repository somewhere, and let’s start migrating!

Setting up branches

Branches in Hg and Git are slightly different. In Git, they are just simple references to commits. In Hg, the branch name is recorded in each commit, so a branch is the set of all commits that carry its name. Mercurial’s bookmarks are more similar to Git’s branches: they are (mutable) references to commits. To work around this difference, Hg-Git translates Git branches to Hg bookmarks and back. There are two caveats with this: a bookmark cannot have the same name as a branch, and what if you use bookmarks that you don’t want to push to Git? Here comes the trick: mark bookmarks with a special suffix, and Hg-Git will only convert these bookmarks. This is what the branch_bookmark_suffix setting above does: only bookmarks ending with _migrate are converted. The suffix is stripped on the Git side, so a bookmark called feature_migrate becomes the Git branch feature.

So our next step is to create a bookmark for each branch, with the _migrate suffix appended to its name. I’m using Windows so I don’t have xargs available. Instead, I wrote a short Python script that does the same job:

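import subprocess

# List all named branches, one per line (hg itself expands the \n in the template)
output = subprocess.check_output(['hg', 'branches', '-T', '{branch}\\n'])
# check_output returns bytes on Python 3; decode so names can be joined with str
branches = output.decode().splitlines()

for branch in branches:
    # Bookmark the head of each branch as <branch>_migrate
    subprocess.check_call(['hg', 'bookmark', '-r', branch, branch + '_migrate'])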

Migrating the Repo

Now we just have to push all the contents of the Hg repository to a brand new Git repo:

> git init ..\git-repo
> hg push ..\git-repo
> cd ..\git-repo
> git checkout -b master default

The last command was needed to move the master branch in Git to the newly migrated default branch (Hg uses default for the main branch name, while Git prefers master). And that’s it! We have successfully converted the Mercurial repository to Git, with all branches and tags kept.

If you just wanted to copy the repository, you can stop reading here. For us, the old repository had some garbage in it, so we decided to clean it up while we were at it. The next section tells you how to remove large files and fix special characters.

Cleaning the Repo

Renaming special characters

We had some files whose names contained special characters, like é. These worked correctly in Mercurial but came out incorrectly encoded after using Hg-Git. Instead of looking into where the conversion went wrong, we simply decided to fix the problem for good and convert all special characters to ASCII. It’s a much more stable solution, since there could be many other tools used in the future that can’t handle non-ASCII letters.

Of course, the simple solution would be to rename the files and create a new commit. However, this does not fix the problem for old commits. If you were to check out an old commit from history, the encoding error would reappear and Git wouldn’t detect the files with special characters!

The proper solution is to rename the file in the whole history. Git’s filter-branch command helps rewrite the history. To make it fast, we have to use an index-filter: this only modifies the index in each commit, without checking out any files. We just have to write a command that renames the file only in the index. With the help of Stack Overflow we came up with the following script (it works fine under Windows in Git bash):

$ git filter-branch -f --index-filter rename.sh --tag-name-filter cat -- --all

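and rename.sh is the following script (it has to be on your PATH or be given as an absolute path, because filter-branch runs the filter from a temporary directory):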

#!/bin/bash

git ls-files -s | \
sed 's-\(\t\"*\)sp\\351cial.txt-\1special.txt-'  | \
GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info && \
mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"

This renames files called spécial.txt to special.txt. How does this script work? A detailed explanation can be found here. In short, it uses git ls-files to list the filenames, renames the files using sed then saves the changes back to the index.

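One interesting thing to notice is how Git represents special characters. In the sed command, spécial is written as sp\\351cial (the backslash is doubled to escape it for sed). In fact, ls-files prints non-ASCII file names in a quoted form: the name is wrapped in double quotes and each non-ASCII byte is replaced by a backslash followed by its octal value, in this case \351. Octal 351 is 0xE9, which is é in Latin-1. A UTF-8 encoded é would show up as two bytes, \303\251, which suggests that the file name ended up in the Git repository in the wrong encoding.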
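
To check what a quoted name really contains, the escapes can be decoded back to raw bytes. The following is a minimal sketch of such a decoder, assuming the three-digit octal escapes and the usual C-style escapes that ls-files prints. The function name unquote_git_path is made up for this example and is not a Git API.

```python
# C-style single-character escapes that may appear in a quoted path
_ESCAPES = {'a': 7, 'b': 8, 't': 9, 'n': 10, 'v': 11, 'f': 12, 'r': 13,
            '"': 34, '\\': 92}


def unquote_git_path(quoted):
    """Convert a path as printed by `git ls-files` into raw bytes."""
    if not (quoted.startswith('"') and quoted.endswith('"')):
        # Unquoted paths are printed verbatim
        return quoted.encode('utf-8')
    body = quoted[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != '\\':
            out.extend(ch.encode('utf-8'))
            i += 1
        elif body[i + 1] in '01234567':
            # Backslash followed by three octal digits encodes one byte
            out.append(int(body[i + 1:i + 4], 8))
            i += 4
        else:
            out.append(_ESCAPES[body[i + 1]])
            i += 2
    return bytes(out)


if __name__ == '__main__':
    raw = unquote_git_path(r'"sp\351cial.txt"')
    print(raw)                    # b'sp\xe9cial.txt'
    print(raw.decode('latin-1'))  # readable only when decoded as Latin-1
```

Trying raw.decode('utf-8') on the same bytes raises UnicodeDecodeError, which is a quick way to tell whether a name in the history is valid UTF-8.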

Removing large files

In the last couple of years, some large files were accidentally committed and pushed to our repository, bringing its total size to 650M. This is far from ideal, so we decided to remove all those big files.

Finding large files

First, we have to identify the files we want to remove. The command below will list the top 10 offending blobs:

git rev-list --objects --all \
  | grep "$(git verify-pack -v .git/objects/pack/*.idx \
           | sort -k 3 -n \
           | tail -10 \
           | awk '{print$1}')"

Note that it lists blobs, not files: if a large file was modified several times, each version is listed separately. It is good enough for our purposes though. You can increase the limit of 10 if you expect there are other large files.
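
On a machine without a Unix shell, the sorting part of the pipeline can be done in Python as well. The sketch below parses the text printed by git verify-pack -v and returns the largest blobs. The function name largest_blobs is made up for this example, and unlike the shell version it only looks at lines whose object type is blob.

```python
def largest_blobs(verify_pack_output, limit=10):
    """Return (size, object id) pairs of the biggest blobs, largest first.

    Expects the text printed by `git verify-pack -v`, where each object
    line reads: <id> <type> <size> <size-in-pack> <offset> [...]
    """
    blobs = []
    for line in verify_pack_output.splitlines():
        fields = line.split()
        # Summary lines such as "non delta: 2 objects" are skipped here
        if len(fields) >= 5 and fields[1] == 'blob':
            blobs.append((int(fields[2]), fields[0]))
    blobs.sort(reverse=True)
    return blobs[:limit]


if __name__ == '__main__':
    # Shortened, made-up sample of verify-pack output
    sample = (
        'aaa blob 100 50 12\n'
        'bbb commit 200 120 62\n'
        'ccc blob 3000 900 182 1 aaa\n'
        'non delta: 2 objects\n'
    )
    for size, object_id in largest_blobs(sample):
        print(size, object_id)
```

The object ids can then be matched against the output of git rev-list --objects --all to get the file names, just like the grep in the pipeline above does.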

Removing the files

To remove a file from history, use the following command:

git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch file_to_remove" --tag-name-filter cat -- --all

This index filter goes through the history, and runs git rm for each commit. -f turns on forced mode: without this Git does not start the filtering, warning you that there is already a backup branch resulting from the previous filter operation. We can safely ignore this warning. In the git rm command --cached makes sure that the deletion only happens in the index – an index-filter can only modify the index, not the actual files.
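
Each filter-branch run walks the whole history, so with several files to remove it is worth passing all of them to a single git rm. Here is a small sketch that assembles such a command from a list of paths. The helper name filter_branch_command and the file names are made up for this example. shlex.quote is used because the index filter is evaluated by a shell.

```python
import shlex


def filter_branch_command(paths):
    """Build the filter-branch argument list that removes `paths` from history."""
    quoted = ' '.join(shlex.quote(p) for p in paths)
    index_filter = 'git rm -rf --cached --ignore-unmatch ' + quoted
    return ['git', 'filter-branch', '-f',
            '--index-filter', index_filter,
            '--tag-name-filter', 'cat', '--', '--all']


if __name__ == '__main__':
    # Hypothetical file names; use the ones found in the previous step
    command = filter_branch_command(['big.zip', 'data/old dump.bin'])
    print(command[4])
    # To actually run it: subprocess.check_call(command)
```

The returned list can be handed to subprocess.check_call as is, which avoids a second layer of quoting on the Python side.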

Cleaning up

After all these operations, there is a lot of garbage left around by Git:

  • filter-branch creates a backup every time
  • The default branch from Hg is still around
  • Git does not actually delete the old commits when using filter-branch, it just creates copies. The original copies (and the large files in them!) are still in the repo

Let’s fix the issues above and clean the repository!

First remove the backup references. The backups are under refs/original, so you just have to delete every ref under that prefix:

git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

Now remove the default branch:

git branch -D default

And finally, remove the commits lying around:

git reflog expire --expire-unreachable=all --all
git gc --aggressive --prune=now

This is a bit trickier: you can clean up only those commits that are not referenced. While we just deleted the backup references above, the reflog still points to the old commits! So the first line expires every reflog entry that is no longer reachable from the current branch tips, regardless of its age, and then we call git gc with the --prune option to remove the unneeded commits.

Note that Git usually does these housekeeping operations automatically. By default, reflog entries expire after 90 days (30 days if they are unreachable), and unreferenced objects are pruned once they are older than two weeks. Git also runs gc to compact the database every now and then. Normally, you don’t need to call git gc, especially not with the aggressive option that tries to compact the database very hard. When using the common Git commands (such as add, rm, push), Git keeps the local repo clean and fast. However, we just did a bunch of heavy history modifications. In these cases, it is a good idea to fix the database with a gc.
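
To see what the cleanup gained, the packed size can be read from git count-objects -v before and after. The sketch below parses that output. The function names are made up for this example, and it assumes the plain -v output (without -H), where size-pack is reported in KiB.

```python
def parse_count_objects(output):
    """Parse the "key: value" lines printed by `git count-objects -v`."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(':')
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats


def packed_size_mib(output):
    """Return the size of all packs in MiB (size-pack is given in KiB)."""
    return parse_count_objects(output).get('size-pack', 0) / 1024


if __name__ == '__main__':
    # Made-up sample output; in practice capture it with subprocess
    sample = 'count: 0\nsize: 0\nin-pack: 5120\npacks: 1\nsize-pack: 143360\n'
    print(packed_size_mib(sample))
```

Running it once before the filter-branch steps and once after git gc gives a direct before and after comparison.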
