Introduction to the basic principles of Git

Original: https://www.escapelife.site/posts/da89563c.html

Simply put, what kind of system is Git? This part matters: once you understand Git's ideas and basic working principles, you will know why it behaves the way it does and can use it with ease. When learning Git, try to set aside what you already know about other version control systems such as CVS, Subversion, or Perforce, so that their concepts do not get mixed up with Git's. Although Git looks very similar to other version control systems in everyday use, it stores and models information in a very different way, and understanding these differences helps avoid confusion.

picture

Git initializes the code repository

What does the git init command actually do?

After running the commands below, we get the content shown in the figure. The right-hand pane shows the code repository Git created for us, which contains everything needed for version management.

# execute on the left
$ mkdir git-demo
$ cd git-demo && git init
$ rm -rf .git/hooks/*.sample

# execute on the right
$ watch -n 1 -d find .

picture

Here we can take a look at the structure of the generated .git directory:

➜ tree .git
.git
├── HEAD
├── config
├── description
├── hooks
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

.git/config - the local configuration file of the current code repository

  • Local configuration file (.git/config) and global configuration file (~/.gitconfig)
  • By executing the following command, the user configuration can be recorded in the configuration file of the local code repository
  • git config user.name "demo"
  • git config user.email "demo@demo.com"
➜ cat .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
    precomposeunicode = true

[user]
    name = demo
    email = demo@demo.com

.git/objects - where the current repository code is stored

  • blob type
  • commit type
  • tree type
# no content
➜ ll .git/objects
total 0
drwxr-xr-x  2 escape  staff    64B Nov 23 20:39 info
drwxr-xr-x  2 escape  staff    64B Nov 23 20:39 pack

➜ ll .git/objects/info
➜ ll .git/objects/pack

.git/info - Exclusions and other information for the current repository

➜ cat ./.git/info/exclude
# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~

.git/hooks - the sample hook scripts shipped with the current repository

./.git/hooks/commit-msg.sample
./.git/hooks/pre-rebase.sample
./.git/hooks/pre-commit.sample
./.git/hooks/applypatch-msg.sample
./.git/hooks/fsmonitor-watchman.sample
./.git/hooks/pre-receive.sample
./.git/hooks/prepare-commit-msg.sample
./.git/hooks/post-update.sample
./.git/hooks/pre-merge-commit.sample
./.git/hooks/pre-applypatch.sample
./.git/hooks/pre-push.sample
./.git/hooks/update.sample

.git/HEAD - the branch pointer of the current repository

➜ cat .git/HEAD
ref: refs/heads/master

.git/refs - the branch (heads) and tag pointers of the current repository

# no content
➜ ll .git/refs
total 0
drwxr-xr-x  2 escape  staff    64B Nov 23 20:39 heads
drwxr-xr-x  2 escape  staff    64B Nov 23 20:39 tags

➜ ll .git/refs/heads
➜ ll .git/refs/tags

.git/description - a description of the current repository

➜ cat .git/description
Unnamed repository; edit this file 'description' to name the repository.

What happens after add

What does the git add command actually do?

After running the commands below, we get the content shown in the figure. A new file appears on the right, but nothing inside the .git directory changes. That is because the change so far exists only in the workspace, and changes in the workspace are not yet managed by Git.

Yet when we run git status, Git recognizes that a new file has been added to the workspace. How is this done? - See the [Understanding blob objects and SHA1] section for details

When we run git add so that Git starts managing the file, one new directory and two new files appear on the right: the 8d directory, the index file, and the 0e41... object file.

# execute on the left
$ echo "hello git" > hello.txt
$ git status
$ git add hello.txt

# execute on the right
$ watch -n 1 -d find .

picture

picture

Let's focus on the generated 8d directory and the file inside it. The names come from SHA1, the hash algorithm Git applies to the object's content: the first two characters of the resulting 40-character hexadecimal digest become the directory name, and the remaining 38 become the file name. Note that this is a hash, not encryption.

# View the file type of objects
$ git cat-file -t 8d0e41
blob

# View the file contents of objects
$ git cat-file -p 8d0e41
hello git

# View the file size of objects
$ git cat-file -s 8d0e41
10

# assembled
blob 10\0hello git

Now we know that running git add moves the file from the workspace to the staging area, and that Git creates an object for it. The object stores the object type, the size, and the content of the file, but not the file name.

To verify this, we can write the same content to another file, add it, and watch the .git directory. No new directory or file appears under objects on the right. This shows that a blob object stores only file content: if two files have the same content, a single object is enough.

So why doesn't the object store the file name? Because the file name is not part of the input when the SHA1 hash is calculated, the name makes no difference to the object. The question then is, where is the file name stored? - See the [Understanding blob objects and SHA1] section for details

# execute on the left
$ echo "hello git" > tmp.txt
$ git add tmp.txt

# execute on the right
$ watch -n 1 -d find .

picture

Understanding blob objects and SHA1

Learn how Git's blob objects relate to SHA1 and how the hash is calculated!

A hash algorithm turns input of any length into output of a fixed length; the output length depends on the algorithm.

Hash algorithm:

  • MD5 - 128 bits - Insecure - File check
  • SHA1 - 160 bits (40 hexadecimal characters) - Insecure - Git storage
  • SHA256 - 256 bits - Secure - Docker images
  • SHA512 - 512 bits - Secure
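
The fixed-length property in the list above is easy to check with Python's standard hashlib module. This is only a small illustration; the helper name hex_length is made up for this sketch.

```python
import hashlib

def hex_length(name: str, payload: bytes) -> int:
    """Number of hexadecimal characters in the digest (hypothetical helper)."""
    return len(hashlib.new(name, payload).hexdigest())

data = b"hello git\n"

# Each hexadecimal character encodes 4 bits, so SHA1's 160 bits become 40 characters.
for name in ("md5", "sha1", "sha256", "sha512"):
    chars = hex_length(name, data)
    print(f"{name:7s} {chars * 4:4d} bits  {chars:3d} hex chars")
```

The digest length stays the same no matter how long the input is, which is why every Git object id has the same width.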

However, when we use the tool to calculate the SHA1 of the above file content, we will find that it is not what we see in the .git directory. Why is this?

➜ echo "hello git" | shasum
d6a96ae3b442218a91512b9e1c57b9578b487a0b  -

The difference comes from how Git computes the hash: it hashes the string "<type> <length>\0<content>" rather than the content alone. The text "hello git" is only nine characters, but the length is ten bytes because echo appends a line break. The git cat-file output from earlier gives us every piece needed to assemble exactly what Git hashes.

➜ ls -lh hello.txt
-rw-r--r--  1 escape  staff    10B Nov 23 21:12 hello.txt

➜ printf "blob 10\0hello git\n" | shasum
8d0e41234f24b6da002d962a26c2495ea16a425f  -

# assembled
blob 10\0hello git

picture
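
The same calculation can be sketched in Python. The function name git_blob_id is made up for this example; the "blob <size>\0<content>" layout is the one described above, and the result should match the 8d0e41... id shown earlier.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Hash content as 'blob <size>\\0<content>' (hypothetical helper name)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

content = b"hello git\n"  # what echo writes, line break included

print(hashlib.sha1(content).hexdigest())  # plain SHA1 of the content, as shasum computes it
oid = git_blob_id(content)                # Git's object id for the same content
print(oid)
print(oid[:2], oid[2:])                   # directory name and file name under .git/objects
```

Because the header is part of the input, the plain SHA1 of a file never equals its Git object id.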

When we use cat to view the object file, it looks like a string of garbled characters. That is because Git compresses the content before storing it in the object file. Strangely, the compressed file (26 bytes) is larger than the original content (10 bytes)!

This is because the stored data includes the "blob 10\0" header plus the overhead of the compression format itself, and content this small leaves almost nothing to compress. For larger files, the compressed object is usually much smaller than the original.

➜ cat .git/objects/8d/0e41234f24b6da002d962a26c2495ea16a425f
xKOR04`HWH,6A%

➜ ls -lh .git/objects/8d/0e41234f24b6da002d962a26c2495ea16a425f
-r--r--r--  1 escape  staff    26B Nov 23 21:36 .git/objects/8d/0e41234f24b6da002d962a26c2495ea16a425f

➜ file .git/objects/8d/0e41234f24b6da002d962a26c2495ea16a425f
.git/objects/8d/0e41234f24b6da002d962a26c2495ea16a425f: VAX COFF executable not stripped - version 16694

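In fact, we can also read the content of the binary object file with a few lines of Python.

import zlib

# Run from the repository root; the object file is zlib-compressed
with open('.git/objects/8d/0e41234f24b6da002d962a26c2495ea16a425f', 'rb') as f:
    contents = f.read()
print(zlib.decompress(contents))  # the header followed by the file content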

picture
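
To see why a tiny object grows while a large one shrinks, here is a rough sketch that compresses the header plus content with zlib, the same library used above. The function name loose_object_bytes and the repeated test content are assumptions made for illustration.

```python
import zlib

def loose_object_bytes(content: bytes) -> bytes:
    """Compress 'blob <size>\\0<content>' the way an object file is stored."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    return zlib.compress(store)

small = b"hello git\n"
large = b"hello git\n" * 1000  # made-up, highly repetitive content

for label, content in (("small", small), ("large", large)):
    packed = loose_object_bytes(content)
    print(label, len(content), "->", len(packed), "bytes")
    # Decompressing gives back the header plus the original content.
    assert zlib.decompress(packed).endswith(content)
```

For the ten-byte file the header and the zlib framing outweigh any savings, while the repetitive content compresses to a small fraction of its size.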

Chat about workspaces and staging areas

Let's talk about the workspace and the staging area, and how files are synced between them.

In the previous section we asked two questions: when we run git status, how does Git know that a file is untracked, and where is the file name stored?

The answers start with the workspace and the index. Git divides its storage into three areas according to the state of the content: the workspace (working directory), the staging area (also called the index), and the repository. See the figure below for a concrete example.

picture

Looking deeper: running git add generates the corresponding object, but that object stores only the type, size, and content of the file, not its name. The file name is recorded in the index file (.git/index).

If we print the index file directly, most of it is unreadable binary data, although the file names are visible in the output. To inspect the index properly, use the commands Git provides.

# execute on the left
$ echo "file1" > file1.txt
$ git add file1.txt
$ cat .git/index

$ git ls-files     # List the files in the staging area
$ git ls-files -s  # List staged files with their mode, object id, and stage number

# execute on the right
$ watch -n 1 -d tree .git

picture
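
The division of labor between objects and the index can be modeled with two dictionaries. This is a toy model, not Git's real on-disk format: the names objects, index, and add are invented for the sketch.

```python
import hashlib

def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

objects = {}  # object id -> content    (stands in for .git/objects)
index = {}    # file name -> object id  (stands in for .git/index)

def add(name: str, content: bytes) -> str:
    """Toy version of git add: store the content, record the name in the index."""
    oid = blob_id(content)
    objects[oid] = content  # identical content always lands on the same key
    index[name] = oid       # the file name lives only in the index
    return oid

add("hello.txt", b"hello git\n")
add("tmp.txt", b"hello git\n")
print(len(index), "index entries,", len(objects), "object(s)")
```

Two names end up pointing at one stored object, which is the behavior observed earlier when tmp.txt was added with the same content as hello.txt.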

Adding a file moves it from the workspace into the staging area, while other operations make the workspace and the staging area diverge. git status reports exactly this difference.

After the operations below, the workspace and the staging area become inconsistent, and we can inspect the difference with commands. Once we use git add to stage the new file, the two are consistent again.

# execute on the left
$ git status
$ echo "file2" > file2.txt
$ git ls-files -s
$ git status
$ git add file2.txt
$ git ls-files -s
$ git status

# execute on the right
$ watch -n 1 -d tree .git

picture

If we now modify a file, the workspace and the staging area are inconsistent again. git status reports that a file has been modified, but how does Git know? It looks up the file name in the index, finds the blob object that the entry refers to, and compares it with the file content in the workspace.

# execute on the left
$ git ls-files -s
$ echo "file.txt" > file1.txt
$ git status

# execute on the right
$ watch -n 1 -d tree .git

picture
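
That comparison can be sketched as follows. It is a simplification under stated assumptions: the index and the workspace are plain dictionaries, and every file is rehashed, whereas real Git also caches file metadata in the index so it can skip unchanged files.

```python
import hashlib

def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def status(index: dict, worktree: dict) -> dict:
    """index: name -> object id; worktree: name -> current file content."""
    result = {}
    for name, content in worktree.items():
        if name not in index:
            result[name] = "untracked"  # no index entry for this name
        elif index[name] != blob_id(content):
            result[name] = "modified"   # content no longer matches the staged blob
    for name in index:
        if name not in worktree:
            result[name] = "deleted"
    return result

index = {"file1.txt": blob_id(b"file1\n"), "file2.txt": blob_id(b"file2\n")}
worktree = {"file1.txt": b"file.txt\n", "file2.txt": b"file2\n", "new.txt": b"x\n"}
print(status(index, worktree))
```

Here file1.txt is reported as modified and new.txt as untracked, while file2.txt does not appear because its content still matches the index.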

If we now run git add to stage the modification, the blob id recorded in the index for that file changes. There are now three objects under the objects directory even though there are only two files, because file1.txt accounts for two of them (its old and its new content). Printing the two blob objects with git cat-file confirms that their contents differ.

# execute on the left
$ git ls-files -s
$ git add file1.txt
$ git ls-files -s

# execute on the right
$ watch -n 1 -d tree .git

picture

Understand how commit works

What does the git commit command actually do?

A commit in a Git repository records a snapshot of all the tracked files in your directory, much like copying the entire directory and pasting it elsewhere, but far more elegant! Git wants commits to be as lightweight as possible, so it does not blindly copy the whole directory every time: files that have not changed simply reuse the objects that already exist, and only new content produces new objects. Git also keeps the history of commits, which is why most commits have a parent above them.

When we use git add to move changes from the workspace to the staging area, the staging area holds the current state of the files: which directories and files exist, and which content each one has. To record that state in the (local) repository, we run git commit.

picture

What exactly happens when we run git commit? After the commit, two new objects appear under .git/objects, and new files appear in the logs and refs directories. With the commands below we can inspect the type and content of each new object.

# execute on the left
$ git commit -m "1st commit"

$ git cat-file -t 6e4a700  # View the type of commit object
$ git cat-file -p 6e4a700  # View the contents of the commit object

$ git cat-file -t 64d6ef5  # Check the type of tree object
$ git cat-file -p 64d6ef5  # View the contents of the tree object

# execute on the right
$ watch -n 1 -d tree .git

picture

So running git commit generates a commit object and a tree object. The commit object contains a reference to the tree object plus the commit metadata (author, committer, message), and the tree object records the file state of this version (each file name and its blob object), which is how Git knows what this commit contains.

picture
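
The layout of the two objects can be sketched in Python. The helper names, the identity string, and the zero timestamp are placeholders chosen for this example, so the resulting ids will not match 6e4a700 or 64d6ef5 from the article. The entry ordering is also simplified to a plain sort by name, which is enough for a flat list of files.

```python
import hashlib

def object_id(kind: str, body: bytes) -> str:
    """Every object is hashed as '<type> <size>\\0<body>'."""
    header = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries) -> bytes:
    """entries: (mode, file name, object id in hex); ids are stored as 20 raw bytes."""
    body = b""
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return body

def commit_body(tree_id, message, who="demo <demo@demo.com> 0 +0000", parent=None) -> bytes:
    lines = ["tree " + tree_id]
    if parent:
        lines.append("parent " + parent)  # absent on the very first commit
    lines += ["author " + who, "committer " + who, "", message, ""]
    return "\n".join(lines).encode()

blob = object_id("blob", b"file1\n")
tree = object_id("tree", tree_body([("100644", "file1.txt", blob)]))
commit = commit_body(tree, "1st commit")
print(commit.decode())
print(object_id("commit", commit))
```

The printed commit text has the same shape as the output of git cat-file -p on a commit: a tree line, the author and committer lines, a blank line, and the message.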

Besides the new objects, the commit changed a few other things, namely the logs and refs directories. Looking inside refs, we find that refs/heads/master holds the commit id 6e4a70, meaning the latest commit on the current master branch is 6e4a70.

HEAD, in turn, leads to the 6e4a70 commit through the HEAD file in the .git directory. It is essentially a pointer that always points to the branch we are currently working on, here the master branch. When we switch branches, the content of this file changes accordingly.

# execute on the left
$ cat .git/refs/heads/master
$ cat .git/HEAD

# execute on the right
$ watch -n 1 -d tree .git

picture

A deeper look at commit

What does the git commit command actually do?

When we modify file2.txt, add it, and commit again, the new commit object contains a parent line that refers to the previous commit. The commit flow chart below helps illustrate this.

# execute on the left
$ echo "file2.txt" > file2.txt
$ git status
$ git add file2.txt
$ git ls-files -s
$ git cat-file -p 0ac9638
$ git commit -m "2nd commit"
$ git cat-file -p bab53ff
$ git cat-file -p 2f07720

# execute on the right
$ watch -n 1 -d tree .git

picture

picture

Git does not track empty folders, and creating a folder by itself does not create any object. When a file inside a folder is added, the index records the file name together with its relative path.

When we run git commit, three objects are generated. The commit operation itself never creates blob objects (git add already did that), so the three are one commit object and two tree objects. The root tree contains an entry for the directory that points to a second tree, and that second tree lists the file inside the directory.

The figure below shows what a version means in Git: the commit object points to the root tree of the file hierarchy for that version, and each tree points to blob objects (files) and further tree objects (directories), nesting as deep as needed to form a complete version.

# execute on the left
$ mkdir folder1
$ echo "file3" > folder1/file3.txt
$ git add folder1
$ git ls-files -s
$ git commit -m "3rd commit"
$ git cat-file -p 1711e01
$ git cat-file -p 9ab67f8

# execute on the right
$ watch -n 1 -d tree .git

picture
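
The commit, tree, and blob chain can be walked recursively. The sketch below uses a tiny in-memory object table with invented ids such as "c3" and "t-root" (real ids are SHA1 hashes), just to show how a commit expands into a list of file paths.

```python
# A toy object database keyed by made-up ids.
objects = {
    "c3": ("commit", {"tree": "t-root", "parent": "c2"}),
    "t-root": ("tree", {"file1.txt": "b1", "file2.txt": "b2", "folder1": "t-sub"}),
    "t-sub": ("tree", {"file3.txt": "b3"}),
    "b1": ("blob", b"file.txt\n"),
    "b2": ("blob", b"file2.txt\n"),
    "b3": ("blob", b"file3\n"),
}

def list_files(oid: str, prefix: str = "") -> list:
    """Follow commit -> tree -> (tree | blob) and collect every file path."""
    kind, data = objects[oid]
    if kind == "commit":
        return list_files(data["tree"], prefix)
    if kind == "tree":
        paths = []
        for name, child in sorted(data.items()):
            paths += list_files(child, prefix + name + "/")
        return paths
    return [prefix.rstrip("/")]  # a blob: the accumulated prefix is its path

print(list_files("c3"))
```

A directory never appears in the result on its own; it only shows up as part of the path of a file beneath it, which fits the observation that empty folders are not tracked.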

Lifecycle state of the file

To sum up: the states a file can be in and how it moves between them.

We now have a basic understanding of how files are tracked and synchronized between the workspace, the staging area, and the repository. Let's summarize the states a file can be in during Git operations and how it switches between them!

picture

picture
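
One way to read the lifecycle is as a small state machine. The state and action names below are labels chosen for this sketch, not Git commands or output, and the table only covers the common transitions.

```python
# (current state, action) -> next state
TRANSITIONS = {
    ("untracked", "add"): "staged",
    ("unmodified", "edit"): "modified",
    ("modified", "add"): "staged",
    ("staged", "commit"): "unmodified",
    ("unmodified", "remove"): "untracked",
}

def step(state: str, action: str) -> str:
    """Return the next state, or stay put if the action does not apply."""
    return TRANSITIONS.get((state, action), state)

state = "untracked"
for action in ("add", "commit", "edit", "add", "commit"):
    state = step(state, action)
    print(f"{action:7s} -> {state}")
```

A new file goes from untracked to staged to unmodified, and each later edit repeats the modified, staged, unmodified cycle.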

The meaning of Branch and HEAD

After executing the git branch command, what did it do?

What exactly is a branch? What about branch switching? By looking at Git's official documentation, we can see that a branch is a named (master/dev) pointer to a commit object.

When we initialize a repository, Git gives us a default branch called master (newer Git versions let you configure the default name, and hosting providers such as GitHub now use main), and the master branch points to the latest commit. Why do branches need names? For convenience and ease of memory; a branch name can be thought of as something like a shell alias.

picture

With that foundation in place, consider how branching is implemented. A branch implementation has to solve two problems: first, storing the commit that each branch points to; second, identifying the current branch when we switch branches.

Git has a very special HEAD file. It is a pointer that always points to the current branch, and through it to that branch's latest commit object. Together with the files under refs/heads, which store the commit each branch points to, the HEAD file neatly solves both problems above.

When we switch from master to dev, the HEAD file is updated immediately so that it points to dev. It is a beautiful design, the work of a brilliant mind.

picture

# execute on the left
$ cat .git/HEAD
$ cat .git/refs/heads/master
$ git cat-file -t 1711e01

# execute on the right
$ git log   # "glo" in the figure is a shell alias for git log

picture
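
Following HEAD to a commit takes two file reads. The sketch below builds a throwaway directory that mimics just the two files involved; the function name resolve_head and the padded commit id are made up. It ignores the packed-refs file that real repositories may also use, and it returns the content unchanged when HEAD already holds a commit id, a case the article discusses later as detached HEAD.

```python
import os
import tempfile

def resolve_head(git_dir: str) -> str:
    """Return the commit id that HEAD currently leads to."""
    with open(os.path.join(git_dir, "HEAD")) as fh:
        head = fh.read().strip()
    if head.startswith("ref: "):
        # HEAD names a branch; the branch file holds the commit id.
        ref_path = os.path.join(git_dir, *head[len("ref: "):].split("/"))
        with open(ref_path) as fh:
            return fh.read().strip()
    return head  # HEAD already holds a commit id

demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, "refs", "heads"))
with open(os.path.join(demo, "HEAD"), "w") as fh:
    fh.write("ref: refs/heads/master\n")
with open(os.path.join(demo, "refs", "heads", "master"), "w") as fh:
    fh.write("1711e01" + "0" * 33 + "\n")  # placeholder id, not a real hash

print(resolve_head(demo))
```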

The logic behind the branch operation

After executing the git branch command, what did it do?

Here we can see that creating a branch only adds a file under refs/heads, and that after switching branches the content of HEAD has changed.

# execute on the left
$ git branch
$ git branch dev
$ ll .git/refs/heads
$ cat .git/refs/heads/master
$ cat .git/refs/heads/dev
$ cat .git/HEAD
$ git checkout dev
$ cat .git/HEAD

# execute on the right
$ git log   # "glo" in the figure is a shell alias for git log

picture
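
The branch operations above boil down to a few small writes. Here is a toy model in which a dictionary stands in for the files under .git; the ids "commit-A" and "commit-B" are placeholders and the function names are invented for the sketch.

```python
repo = {
    "HEAD": "ref: refs/heads/master",
    "refs/heads/master": "commit-A",  # placeholder id
}

def current_ref() -> str:
    return repo["HEAD"][len("ref: "):]

def branch(name: str) -> None:
    # A new branch is a new ref holding the commit the current branch points to.
    repo["refs/heads/" + name] = repo[current_ref()]

def checkout(name: str) -> None:
    # Switching branches rewrites HEAD and nothing else.
    repo["HEAD"] = "ref: refs/heads/" + name

def commit(new_id: str) -> None:
    # Only the current branch's ref moves forward.
    repo[current_ref()] = new_id

branch("dev")
checkout("dev")
commit("commit-B")
for key, value in repo.items():
    print(key, "=", value)
```

After these three steps master still points at the old commit while dev has moved on, which is the situation shown by the cat commands above.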

Note that even if we delete a branch, the objects unique to that branch are not deleted. They become what is commonly called garbage objects; running git add repeatedly on changing content also leaves garbage objects behind. How do we clean up and reclaim them? We'll get to that later.

# execute on the left
$ echo "dev" > dev.txt
$ git add dev.txt
$ git commit -m "1st commit from dev branch"
$ git checkout master
$ git branch -d dev   # refused, because dev has not been merged
$ git branch -D dev   # force the deletion
$ git cat-file -t 861832c
$ git cat-file -p 861832c
$ git cat-file -p 680f6e9
$ git cat-file -p 38f8e88

# execute on the right
$ git log   # "glo" in the figure is a shell alias for git log

picture
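
Which objects count as garbage can be worked out by a reachability walk. The commit graph and ids below are invented for illustration, and the sketch looks only at branch pointers; real Git also takes other references into account, such as entries in the reflog.

```python
# commit id -> parent ids (made-up ids; d1 was created on the deleted dev branch)
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "d1": ["c3"]}
refs = {"master": "c3"}  # dev used to point at d1 before it was deleted

def reachable(refs: dict, parents: dict) -> set:
    """Collect every commit that can be reached from some branch pointer."""
    seen, stack = set(), list(refs.values())
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

garbage = set(parents) - reachable(refs, parents)
print(sorted(garbage))
```

Deleting the branch removed nothing from the graph; d1 is simply no longer reachable from any branch.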

checkout and commit operations

Let's talk about checkout and commit operations!

The checkout command can switch not only to a branch but also to a specific commit, in which case the HEAD file contains a commit id directly. In Git, the state in which HEAD points at a commit instead of a branch is called detached HEAD.

Whether the HEAD file points to a branch name or to a commit object, the essence is the same, because a branch name also points to a commit object.

picture

# execute on the left
$ git checkout 6e4a700
$ git log

# execute on the right
$ git log   # "glo" in the figure is a shell alias for git log

picture

After switching to a specific commit, if we want to keep modifying the code and commit on top of it, we can use the git switch command mentioned in the figure above to create a new branch first. More commonly, though, we simply create the new branch with the checkout command.

$ git checkout -b tmp
$ git log

Even though this is possible, we rarely need it. Remember the dev branch we created in the previous section? We created the branch and made a new commit on it, but deleted it without merging it into master. If you run git log now, that commit no longer shows up.

But is it really gone? Remember that an operation such as deleting a branch only removes the pointer to a particular commit. The commit itself is not deleted, so the commit from the dev branch is still there.

So how do we find this commit? After finding it, we can continue to work on it, or find the previous file data, etc.

The first method:

  • [Laborious, a last resort]
  • Go through the files under the objects directory one by one, then check out the commit you find.

The second method:

  • [Recommended]
  • Use the dedicated git reflog command that Git provides.
  • This command lists the recorded history of where HEAD has pointed, so past commits can be found again.
# execute on the left
$ git reflog
$ git checkout 9fb7a14
$ git checkout -b dev

# execute on the right
$ git log   # "glo" in the figure is a shell alias for git log

picture

picture
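
The data behind git reflog lives in plain text files under .git/logs, one line per update. The sketch below parses one such line: the old id, the new id, the identity with a timestamp, then a tab and the message. The sample line is made up (the padded ids, the timestamp, and the message are placeholders), and parse_reflog_line is an invented helper name.

```python
def parse_reflog_line(line: str) -> dict:
    """Split one line of .git/logs/HEAD into its parts."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, who = meta.split(" ", 2)
    return {"old": old, "new": new, "who": who, "message": message}

# Hypothetical entry: ids padded with zeros, timestamp chosen arbitrarily.
sample = (
    "1711e01" + "0" * 33 + " " + "9fb7a14" + "0" * 33
    + " demo <demo@demo.com> 1606138800 +0800"
    + "\tcommit: 1st commit from dev branch\n"
)

entry = parse_reflog_line(sample)
print(entry["new"][:7], entry["message"])
```

Since every move of HEAD appends a line like this, the id of a commit that no branch points to any more can still be read back from the log.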

Talk about the execution logic of diff

When we run the diff command, how does Git compare the contents?

In this section we reuse the repository from the previous section. After modifying the content of a file, let's see what the diff command outputs and study it.

$ echo "hello" > file1.txt
$ git diff
$ git cat-file -p 42d9955
$ git cat-file -p ce01362

# The same comparison logic applies to other pairs of snapshots
$ git diff --cached   # staging area vs. the last commit
$ git diff HEAD       # workspace vs. the last commit

picture
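
Plain git diff compares the blob recorded in the staging area with the file in the workspace. As a rough sketch of that comparison, the example below feeds two versions of a file to Python's difflib; the contents are taken from the commands used in this article, and the helper name unified is invented. Git's own diff output has extra header lines that are not reproduced here.

```python
import difflib

def unified(old: bytes, new: bytes, name: str) -> list:
    """Line-based unified diff between two versions of one file."""
    return list(difflib.unified_diff(
        old.decode().splitlines(keepends=True),
        new.decode().splitlines(keepends=True),
        fromfile="a/" + name,
        tofile="b/" + name,
    ))

staged = b"file.txt\n"  # content of the blob referenced by the index
working = b"hello\n"    # content currently in the workspace

print("".join(unified(staged, working, "file1.txt")))
```

Swapping the inputs gives the other two variants: the committed blob against the staged blob for --cached, and the committed blob against the workspace file for diff HEAD.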

How to add a remote repository in Git

How do we associate our local repository with a repository on a remote server?

Initialize the repository

$ git init
$ git add README.md
$ git commit -m "first commit"

Associate a remote repository

When we use the command below to associate a remote repository, our local .git directory changes too. Viewing the .git/config file shows that a [remote "origin"] section has appeared in the configuration file.

# Associate a remote repository
$ git remote add origin git@github.com:escapelife/git-demo.git

➜ cat .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
    precomposeunicode = true

[remote "origin"]
    url = git@github.com:escapelife/git-demo.git
    fetch = +refs/heads/*:refs/remotes/origin/*

Push the local branch

Running the command below pushes the local master branch to the master branch of the remote origin repository. Afterwards, we can log in to GitHub and see the pushed files and directories.

When pushing a branch, Git enumerates the objects to push, compresses them, sends them to the remote GitHub repository, and creates the master branch on the remote (origin).

# push local branch
$ git push -u origin master

After pushing, some new files and directories appear in the local .git directory. What are they? As shown below, four new directories and two new files are added, all holding information about the remote repository. Printing refs/remotes/origin/master shows that it, too, holds a commit id, the same one our local master branch points to. It records the last known state of the remote repository, so that Git can tell it apart from the local branch and compare the two.

➜ tree .git
├── logs
│   ├── HEAD
│   └── refs
│       ├── heads
│       │   ├── dev
│       │   ├── master
│       │   └── tmp
│       └── remotes     # Add directory
│           └── origin  # Add directory
│               └── master  # new file
└── refs
    ├── heads
    │   ├── dev
    │   ├── master
    │   └── tmp
    ├── remotes     # Add directory
    │   └── origin  # Add directory
    │       └── master  # new file
    └── tags
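
The path refs/remotes/origin/master follows from the fetch line in the [remote "origin"] section shown earlier: branch names on the remote are mapped to local names by replacing the * wildcard. A minimal sketch of that mapping, with an invented function name, handling only the simple patterns used here:

```python
def map_ref(refspec: str, remote_ref: str):
    """Map a ref name on the remote to its local tracking name, or None."""
    spec = refspec.lstrip("+")  # a leading '+' permits non-fast-forward updates
    src, dst = spec.split(":")
    if src.endswith("*") and remote_ref.startswith(src[:-1]):
        # Substitute whatever the wildcard matched into the destination.
        return dst.replace("*", remote_ref[len(src) - 1:], 1)
    return dst if src == remote_ref else None

spec = "+refs/heads/*:refs/remotes/origin/*"
print(map_ref(spec, "refs/heads/master"))
print(map_ref(spec, "refs/tags/v1"))
```

The master branch on the remote maps to refs/remotes/origin/master, while a tag falls outside the pattern and yields None.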

How the remote repository stores code

Use GitLab to understand how the remote repository server stores our code!

After we write code and push it to a remote server, the storage structure there is exactly the same as in our local repository. On reflection, it would be surprising if it were different.

Git is a distributed version control system with no central node; every node holds the full repository, so the storage layout is the same everywhere. If content is lost or damaged on one node, it can be recovered from another. A Git server is simply a node that is always online and reachable, that's all.

Posted by imderek on Fri, 03 Jun 2022 06:31:23 +0530