Create a git object module

Good to know before start

What are objects?

What is a Git object? At its core, Git is a “content-addressed filesystem”. That means that unlike regular filesystems, where the name of a file is arbitrary and unrelated to that file’s contents, the names of files as stored by Git are mathematically derived from their contents. This has a very important implication: if a single byte of, say, a text file, changes, its internal name will change, too. To put it simply: you don’t modify a file, you create a new file in a different location. Objects are just that: files in the git repository, whose paths are determined by their contents. source

Git uses objects to store quite a lot of things: first and foremost, the actual files it keeps in version control — source code, for example. Commit are objects, too, as well as tags. With a few notable exceptions (which we’ll see later!), almost everything, in Git, is stored as an object. source

The path is computed by calculating the SHA-1 hash of its contents. More precisely, Git renders the hash as a lowercase hexadecimal string, and splits it in two parts: the first two characters, and the rest. It uses the first part as a directory name, the rest as the file name (this is because most filesystems hate having too many files in a single directory and would slow down to a crawl. Git’s method creates 256 possible intermediate directories, hence dividing the average number of files per directory by 256)

Object format

Before we start implementing the object storage system, we must understand their exact storage format. An object starts with a header that specifies its type: blob, commit, tag or tree (more on that in a second). This header is followed by an ASCII space (0x20), then the size of the object in bytes as an ASCII number, then null (0x00) (the null byte), then the contents of the object. The first 48 bytes of a commit object in Wyag’s repo look like this:

00000000  63 6f 6d 6d 69 74 20 31  30 38 36 00 74 72 65 65  |commit 1086.tree|
00000010  20 32 39 66 66 31 36 63  39 63 31 34 65 32 36 35  | 29ff16c9c14e265|
00000020  32 62 32 32 66 38 62 37  38 62 62 30 38 61 35 61  |2b22f8b78bb08a5a|

The objects (headers and contents) are stored compressed with zlib.

Implementation

Add a simple auxiliary method to DirectoryManager

// src/directory_manager.rs

    pub fn sha_to_file_path(&self, sha: &str) -> PathBuf {
        self.objects_path.join(&sha[0..2]).join(&sha[2..])
    }

This function is supposed to return the absolute path to an object file for a given hash.

Let's add a new unit test for this function:

// src/directory_manager.rs

   #[test]
    fn sha_to_file_path_should_return_correct_path() {
        let dir_manager = DirectoryManager::new(PROJECT_DIR);

        let file_path = dir_manager.sha_to_file_path("e673d1b7eaa0aa01b5bc2442d570a765bdaae751");
        assert_eq!(
            file_path,
            PathBuf::from(format!(
                "{}/.git/objects/e6/73d1b7eaa0aa01b5bc2442d570a765bdaae751",
                PROJECT_DIR
            ))
        );
    }

Create a new module, git_object

Add a new file to src folder named git_object.rs and don't forget to add this new module to lib.rs as well.

// src/lib.rs

pub mod git_object;

Let's add some new data types to this module. Add a new enum named GitObjectType:

#[derive(PartialEq, PartialOrd, Debug, Clone, Copy)]
pub enum GitObjectType {
    Commit,
    Tree,
    Tag,
    Blob,
}

impl FromStr for GitObjectType {
    type Err = ObjectParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "commit" => Ok(GitObjectType::Commit),
            "tree" => Ok(GitObjectType::Tree),
            "tag" => Ok(GitObjectType::Tag),
            "blob" => Ok(GitObjectType::Blob),
            _ => Err(ObjectParseError::InvalidObjectType),
        }
    }
}

As you can see we implemented FromStr for this type too. Why? Because we'll get the type as a user input. We should check if it's a valid type.

We also need a struct named GitObjectHeader:

#[derive(Debug)]
struct GitObjectHeader {
    object_type: GitObjectType,
    object_size: usize,
}

Implement read function

We are now going to add a read function to the module to read object files with this signature:

pub fn read(repo: &GitRepository, sha: String) -> Result<Box<dyn GitObject>, ObjectParseError> {

The return type is important. First, we're going to return a trait object. Why? Because this function may read and return any of the object types including Commit, Blob, etc.

GitObject is the general look of an object, defined as a trait.

pub trait GitObject {
    fn get_type() -> GitObjectType
    where
        Self: Sized;

    fn serialize(&self) -> String;
}

All of the objects are supposed to have a type and a serialize function.

The simplest object type is Blob. Because they have no actual format. Blobs are user data: the content of every file you put in Git (main.c, logo.png, README.md) is stored as a blob. That makes them easy to manipulate because they have no actual syntax or constraints beyond the basic object storage mechanism: they’re just unspecified data. [source]

Let's add them in the next step!

Last updated