Git Notes

The most popular site for hosting open-source projects appears to be github.com. (Perhaps ironically, the software at github.com is not open-source.)

Github derives from git, a version-control system developed in 2005 by Linus Torvalds, who later claimed:

I'm an egotistical bastard, and I name all my projects after myself. First 'Linux', now 'git'.

The Monty Python Mr and Mrs Git video appears to have been scrubbed from the Internet by the copyright holders, but here's a text version: montypython.net/scripts/mrgit.php. The idea of the sketch was that Mr and Mrs Git appear resigned to the fact that their last name is Git, but are oblivious to the social implications of their first names: "A Snivelling Little Rat-Faced" and "Dreary Fat Boring Old".

Typical github users don't necessarily make routine use of git's ability to maintain multiple independent "branches" simultaneously, though that ability can in fact be quite useful.

Here's a rather spare intro to git, starting with cloning: maryrosecook.com/blog/post/git-in-six-hundred-words. Another overview, with more focus on branching, is eagain.net/articles/git-for-computer-scientists.

A popular alternative to git is subversion, an Apache project started in 2000. An earlier alternative is the Concurrent Versions System, or CVS.

Git was perhaps the first source-code manager to adopt the file as the basic unit, rather than maintaining the original plus a series of "patch" files. That said, git can merge files like the following, which necessitates at least some awareness of line structure:

version A:
    here is the file
    second line
    new line 2.5 here
    line three
    line four
    that's all, folks!

version B:
    here is the file
    second line
    line three
    line four
    new line 4.5 here
    that's all, folks!
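
To make the merge concrete, here is a minimal Python sketch of a line-based three-way merge applied to the two versions above. It is not git's actual merge code: it leans on Python's difflib to find each side's changes, and it assumes the common ancestor (BASE) is the five-line file without either new line, which the example implies but does not state. The names changes, merge3 and MergeConflict are made up for this illustration.

```python
import difflib


class MergeConflict(Exception):
    """Raised when both sides change the same (or touching) base lines."""


def changes(base, side):
    """Return (lo, hi, new_lines) triples: side replaces base[lo:hi] with new_lines."""
    matcher = difflib.SequenceMatcher(a=base, b=side, autojunk=False)
    return [(i1, i2, side[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def merge3(base, ours, theirs):
    """Merge two edited copies of base; raise MergeConflict if the edits collide."""
    edits = sorted(changes(base, ours) + changes(base, theirs),
                   key=lambda e: (e[0], e[1]))
    merged, pos, prev = [], 0, None
    for edit in edits:
        lo, hi, new_lines = edit
        if prev is not None:
            if edit == prev:
                continue  # both sides made the identical change; apply it once
            if lo <= prev[1]:
                # Conservative rule: overlapping or touching edits are conflicts.
                raise MergeConflict(f"both sides changed base near line {lo + 1}")
        merged.extend(base[pos:lo])   # unchanged base lines before this edit
        merged.extend(new_lines)      # the edit's replacement lines
        pos = hi
        prev = edit
    merged.extend(base[pos:])
    return merged


# Assumed common ancestor of the two versions shown above.
BASE = ["here is the file", "second line", "line three", "line four",
        "that's all, folks!"]
OURS = BASE[:2] + ["new line 2.5 here"] + BASE[2:]
THEIRS = BASE[:4] + ["new line 4.5 here"] + BASE[4:]

if __name__ == "__main__":
    for line in merge3(BASE, OURS, THEIRS):
        print(line)
```

The two insertions land in different parts of the base file, so the sketch can apply both. If both sides had rewritten "line three" differently, merge3 would raise MergeConflict instead of guessing.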

Git has, embedded within it, at least four file-diff algorithms; that is, algorithms to identify differences in line-oriented text files. The goal of a file-diff algorithm is to identify, in effect, the minimal changes needed to get from one file to the next, and then whether those minimal changes conflict with another, different, set of changes. The default algorithm is known as Myers, but a newer algorithm known as Histogram may perform better. Alternative file-diff algorithms can be specified either with individual flags (eg --histogram), or with --diff-algorithm=histogram, etc.
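
As a rough illustration of what "minimal changes" means, here is a small Python sketch of a line diff based on the longest common subsequence. This is neither Myers nor Histogram; it is a simpler quadratic-time dynamic-programming approach that also produces a shortest insert/delete script. The function names lcs_table and line_diff, and the choice of base file, are assumptions made for this example.

```python
def lcs_table(a, b):
    """Build t where t[i][j] is the length of the LCS of a[i:] and b[j:]."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = t[i + 1][j + 1] + 1
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    return t


def line_diff(a, b):
    """Return (tag, line) pairs; tag is ' ' (kept), '-' (deleted) or '+' (added)."""
    t = lcs_table(a, b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append((" ", a[i]))
            i += 1
            j += 1
        elif t[i + 1][j] >= t[i][j + 1]:
            out.append(("-", a[i]))   # dropping a[i] loses nothing from the LCS
            i += 1
        else:
            out.append(("+", b[j]))   # b[j] is new relative to a
            j += 1
    out.extend(("-", line) for line in a[i:])
    out.extend(("+", line) for line in b[j:])
    return out


if __name__ == "__main__":
    old = ["here is the file", "second line", "line three", "line four",
           "that's all, folks!"]
    new = old[:2] + ["new line 2.5 here"] + old[2:]
    for tag, line in line_diff(old, new):
        print(tag, line)
```

Every line that appears in both files is kept, so the only lines tagged '+' or '-' are the ones that genuinely differ.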

For a discussion of git's file-diff algorithms, see link.springer.com/article/10.1007/s10664-019-09772-z.

The git merge-file command (not normally executed directly by the user) is where text-file merging takes place. Note that merging involves a three-way diff: of a base file and two change files.

While the file-diff algorithms are not overtly language-aware, there are differences in how well they deal with common programming-language constructs, eg a closing '}' on a line by itself. There is also some research on file-diff algorithms that are language-specific, though those introduce their own issues. So far, they have not been widely adopted.

The file-based orientation of git means that it can be much faster than its predecessors. Except for merging, no operations on file contents are necessary. Earlier systems had to apply each patch in turn to get the current end-result. CVS, in particular, was very slow at merging in large projects.
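
The contrast between the two storage models can be sketched in a few lines of Python. This is a toy model only: the class names PatchChainStore and SnapshotStore and the (start, delete_count, new_lines) patch format are invented here, and real systems (git included) have many storage details that are not modeled.

```python
class PatchChainStore:
    """Stores version 0 in full, then one patch per later version."""

    def __init__(self, original):
        self.original = list(original)
        self.patches = []  # each patch is (start, delete_count, new_lines)

    def commit(self, start, delete_count, new_lines):
        self.patches.append((start, delete_count, list(new_lines)))

    def checkout(self, version):
        """Rebuild a version by replaying every patch up to it, in order."""
        lines = list(self.original)
        for start, delete_count, new_lines in self.patches[:version]:
            lines[start:start + delete_count] = new_lines
        return lines


class SnapshotStore:
    """Stores every version of the file in full."""

    def __init__(self):
        self.versions = []

    def commit(self, lines):
        self.versions.append(list(lines))

    def checkout(self, version):
        """Return a version directly; no patches need to be applied."""
        return list(self.versions[version])


if __name__ == "__main__":
    base = ["here is the file", "second line", "line three", "line four",
            "that's all, folks!"]
    chain = PatchChainStore(base)
    chain.commit(2, 0, ["new line 2.5 here"])   # version 1
    chain.commit(5, 0, ["new line 4.5 here"])   # version 2

    snaps = SnapshotStore()
    for v in range(3):
        snaps.commit(chain.checkout(v))

    print(chain.checkout(2) == snaps.checkout(2))
```

Both stores return the same content, but PatchChainStore.checkout does work proportional to the number of patches, while SnapshotStore.checkout is a single lookup.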

Git was also among the first widely used fully-decentralized version-control systems. If Bob clones Alice's repository, then Alice can decide later that Bob's repo is the master, and merge it into her own. For Torvalds, this went a long way to smooth over unpleasant political debates about whose was the master repo. It also meant that arbitrarily complex work could be done offline. (An earlier model for versioning was that a programmer would "check out" part of the code to work on; the programmer would make changes and then "check back in". During the interval, that part of the code might be held as read-only. Git fixed this.)

In 2002, the Linux team adopted the commercial product BitKeeper, by BitMover, for source-control management. They got a deal to use it for free, but it was not free software in general. Andrew Tridgell attempted to reverse-engineer the network BitKeeper protocol, following his earlier efforts to reverse-engineer Microsoft SMB which led to Samba. This led to BitMover's withdrawal of the free-use offer in April 2005, claiming license violations, though Tridgell had never actually used BitKeeper and so never agreed to the "click-wrap" license. This, in turn, led to the (rapid) development of Git as a replacement, with Torvalds writing at least the first draft.

Having your vendor withdraw your license due to some dispute about usage is, to some, exactly why we should all use open source in the first place.

Here's one quote from Torvalds about git versus CVS (from here):

I credit CVS in a very very negative way. Because I, in many ways, when I designed git, it's "what would Jesus do" except that it's "what would CVS never ever do"-kind of approach to source control management. I've never actually used CVS for the kernel. For the first 10 years of kernel maintenance, we literally used tarballs and patches, which is a much superior source control management system than CVS is, but I did end up using CVS for 7 years at a commercial company, and I hate it with a passion.

Here's another quote attributed to Torvalds on slide 9 of slideshare.net/odimulescu/git-presentatio (and other places):

Take CVS as an example of what not to do; if in doubt, make the exact opposite decision.
Aka WWCVSND (What would CVS never do)

But the slides claim this was said in Torvalds' Google talk on git, at youtube.com/watch?v=4XpnKHJAok8, around the 2:30 mark. He doesn't quite say this. He probably said it at another talk; it does seem that he did say it somewhere.

In 2016, BitMover made BitKeeper open-source (in this case, this was a recognition that it wasn't selling).


Git demo

Command-line git needs to be preconfigured with your identity:

git config --global user.name "Peter Dordal"
git config --global user.email pld@cs.luc.edu

Now suppose this is done (you can check with git config --list). Let's start a project (pld's example is in ~/412/projects, and project1/project2)

mkdir project1
cd project1
git init
    # this sets up git for this folder and subfolders

This directory project1 is the home directory for this git project. It contains a subdirectory ".git" (normally hidden) that contains the history data. Most git commands need to be run in this project directory.

Now let's edit two files in the project1 directory

vi one.py two.py
git add *.py

Here's what I put into the files:

one.py:
    import sys
    sys.path.append('.')
    import two
    two.foo()

two.py:
    def foo():
        print('here I am in two.foo()')

git status

git commit -m master

In the last command, -m specifies a message identifying the commit. Don't forget these.

We can modify a file, and repeat the git add and git commit, as necessary.

Branching

Now let's create a new branch:

git branch changes1
git checkout changes1    # or git checkout -b changes1

Branches allow separate "versions" of the project to coexist. You can be working on a complicated new feature, in branch feature37, and be called back to fix a major bug in branch master. But the branch feature37 work is still there, suspended. You may or may not use branches in your project, because you and your teammates will presumably be working off separate clones (below).

To get a list of what branches exist, use git branch.

Now we'll edit the files by adding a print() statement in one.py, print('about to call two.foo()'), and in two.py, print('at end of two.foo()'). We have to add the modified files to the repository (or use git commit with the -a option).

git add *.py
git commit -m changes1

or: git commit -a -m changes1

Now for the fun part:

git checkout master

And all our changes have disappeared! Even if we access from another window!

But we can get them back with

git checkout changes1

Now we combine them:

git checkout master
git merge changes1

At this point we could delete the changes1 branch with git branch -d changes1.
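
For anyone who wants to replay the branch-and-merge steps in a throwaway directory, here is a hedged Python sketch that drives the same git commands through subprocess. It assumes git is installed and on the PATH. The helper names git and branch_and_merge_demo, the "Demo User" identity, and the symbolic-ref step (which forces the first branch to be named master regardless of local configuration) are additions for this example, not part of the demo above. Only two.py is modified on the branch, to keep the script short.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside repo and return its standard output."""
    cmd = ["git", "-c", "user.name=Demo User",
           "-c", "user.email=demo@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout


def branch_and_merge_demo(repo):
    """Replay the demo in an empty directory; return two.py before and after the merge."""
    repo = Path(repo)
    git(repo, "init")
    git(repo, "symbolic-ref", "HEAD", "refs/heads/master")  # name the first branch master
    (repo / "one.py").write_text(
        "import sys\nsys.path.append('.')\nimport two\ntwo.foo()\n")
    (repo / "two.py").write_text(
        "def foo():\n    print('here I am in two.foo()')\n")
    git(repo, "add", "one.py", "two.py")
    git(repo, "commit", "-m", "master")

    git(repo, "branch", "changes1")
    git(repo, "checkout", "changes1")
    with open(repo / "two.py", "a") as f:
        f.write("    print('at end of two.foo()')\n")
    git(repo, "commit", "-a", "-m", "changes1")

    git(repo, "checkout", "master")
    on_master = (repo / "two.py").read_text()   # the branch's change is not here

    git(repo, "merge", "changes1")
    after_merge = (repo / "two.py").read_text()
    return on_master, after_merge


if __name__ == "__main__":
    if shutil.which("git") is None:
        print("git not found; skipping demo")
    else:
        with tempfile.TemporaryDirectory() as tmp:
            before, after = branch_and_merge_demo(tmp)
            print("on master before merge:\n" + before)
            print("on master after merge:\n" + after)
```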

Merging issues

Auto-merging works here because we simply added to the changed files, and didn't commit corresponding changes to the master branch. Had we done that, we'd be forced to merge manually. For example, suppose we do the above, but add and commit the following to master, with a new line 4:

one.py:
    import sys
    sys.path.append('.')
    import two
    print('getting ready to call two.foo()')
    two.foo()

And this to changes1:

one.py:
    import sys
    sys.path.append('.')
    import two
    print('about to call two.foo()')
    two.foo()

Now when we try git merge changes1 (from within master), we get

Auto-merging one.py
CONFLICT (content): Merge conflict in one.py
Automatic merge failed; fix conflicts and then commit the result.

If we try git diff, we get this (note that git is keeping track of the fact that we're attempting a merge):

diff --cc one.py
index 8be88c0,040e69a..0000000
--- a/one.py
+++ b/one.py
@@@ -1,5 -1,5 +1,9 @@@
  import sys
  sys.path.append('.')
  import two
++<<<<<<< HEAD
 +print('getting ready to call two.foo()')
++=======
+ print('about to call two.foo()')
++>>>>>>> changes1
  two.foo()

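This is a form of the diff command; it attempts a line-by-line file comparison. It is successful, in that it does illustrate what lines are different. One way to fix this is to modify one.py in the changes1 branch until it's consistent with the version in master; you may need git merge --abort to be able to go back to the changes1 branch. Alternatively, edit one.py in place, keeping the line you want and deleting the conflict markers, then run git add one.py and git commit.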
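
The conflict markers that git leaves in the working copy of one.py are plain text, so resolving them can be illustrated with a small Python sketch. The function resolve below is hypothetical, not a git feature; it assumes the default two-section marker style seen above and simply keeps one side of every conflict.

```python
def resolve(lines, keep):
    """Drop conflict markers, keeping either the 'ours' or the 'theirs' side."""
    if keep not in ("ours", "theirs"):
        raise ValueError("keep must be 'ours' or 'theirs'")
    out = []
    state = "normal"
    for line in lines:
        if state == "normal" and line.startswith("<<<<<<< "):
            state = "ours"        # lines from HEAD follow
        elif state == "ours" and line.startswith("======="):
            state = "theirs"      # lines from the branch being merged follow
        elif state == "theirs" and line.startswith(">>>>>>> "):
            state = "normal"      # end of this conflict
        elif state == "normal" or state == keep:
            out.append(line)
    return out


# What one.py looks like in the working directory after the failed merge.
CONFLICTED = [
    "import sys",
    "sys.path.append('.')",
    "import two",
    "<<<<<<< HEAD",
    "print('getting ready to call two.foo()')",
    "=======",
    "print('about to call two.foo()')",
    ">>>>>>> changes1",
    "two.foo()",
]

if __name__ == "__main__":
    print("\n".join(resolve(CONFLICTED, "theirs")))
```

After writing the resolved text back to one.py, the merge would still need to be completed with git add one.py and git commit.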

Cloning

Now let's try cloning (sometimes called forking, though forking is really a social construct meaning that you are starting a new project; cloning typically means that you intend to merge your changes back into the original, but need to work on a copy for the time being):

cd ..
mkdir project2
cd project2
git clone ../project1

Now we have an entirely separate directory (the actual files are in project2/project1). The project2 team can, if allowed, push their changes back into project1, or ask the project1 team to pull the changes from project2 into project1.

Let's have project2 make some changes. We'll add this line to the end of one.py, and then commit the change as before:

print('done with two.foo()')

Now, let's go back to project1, and enter the following, where p2 is being used as the name, within project1, for the project2 clone:

git remote add p2 ../project2/project1

And then for the pull, run from project1 to merge the project2/p2 changes into project1:

git pull p2 master

Now our project2 changes are merged into project1!
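
The clone-and-pull sequence can also be replayed end to end from Python, again as a sketch that assumes git is installed. The helper names git and clone_and_pull_demo, the placeholder identity, and the symbolic-ref step (forcing the branch name master) are assumptions for this example. The one.py content is shortened, since the file is never executed here.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside repo and return its standard output."""
    cmd = ["git", "-c", "user.name=Demo User",
           "-c", "user.email=demo@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout


def clone_and_pull_demo(workdir):
    """Create project1, clone it under project2, change the clone, pull back."""
    workdir = Path(workdir)
    p1 = workdir / "project1"
    p1.mkdir()
    git(p1, "init")
    git(p1, "symbolic-ref", "HEAD", "refs/heads/master")  # name the first branch master
    (p1 / "one.py").write_text("import two\ntwo.foo()\n")
    git(p1, "add", "one.py")
    git(p1, "commit", "-m", "master")

    p2 = workdir / "project2"
    p2.mkdir()
    git(p2, "clone", "../project1")      # files land in project2/project1
    clone = p2 / "project1"
    with open(clone / "one.py", "a") as f:
        f.write("print('done with two.foo()')\n")
    git(clone, "commit", "-a", "-m", "project2 change")

    git(p1, "remote", "add", "p2", "../project2/project1")
    git(p1, "pull", "p2", "master")      # merge the clone's master into project1
    return (p1 / "one.py").read_text()


if __name__ == "__main__":
    if shutil.which("git") is None:
        print("git not found; skipping demo")
    else:
        with tempfile.TemporaryDirectory() as tmp:
            print(clone_and_pull_demo(tmp))
```

Because project1 has no new commits of its own in this sketch, the pull is a simple fast-forward and no merge conflict can arise.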

If we wanted project2 to make a pull request to project1, we might use git request-pull, which generates a summary of the proposed changes suitable for sending by email. But a local pull request doesn't really make sense. Github has its own mechanism for making pull requests.

If we wanted the clone to be at the top level of project2, we could have done this, from within project1:

git clone . ../project2

Here's an example of a remote clone:

git clone https://github.com/OrgName/ProjectName.git

This clones the remote repository. Github is set up to make this easy.