fim init
File Integrity Manager 1.2.3
* New Icon proposed by Raiden
Etienne Vrignaud
With Fim you can:
Check the integrity of files that can be large and binary, and therefore hard to compare (.doc, .jpg, .png, .mp3, .mp4, …)
Quickly list the files that have been renamed, moved, deleted or corrupted, which can happen accidentally
Keep a log of the changes made to these files and display a detailed log of these changes later
Search for duplicate files and optionally delete them
More details in the use cases that follow…
A tool like Git is able to manage large binary files
But with Git the local workspace occupies at least double the space, because the whole content of the repository is also stored in .git
And with Git you cannot permanently delete a file: it will always exist in the Git history
Moreover, the full contents of all modified files are retained, which takes up a huge amount of space over time
Some tools only work on specific file systems (example: btrfs-scrub)
The goal was to have a tool that can quickly show the status of the local workspace
Fim creates the .fim directory in the workspace root. It is its repository.
Inside, it stores an index of the files contained in the workspace.
The index contains for each file:
— Name and size
— Attributes (dates, permissions)
— 3 small blocks hash
— 3 medium blocks hash
— Complete hash of the file
The index is called the State, and the hashes are computed using SHA-512
A new State is created at each commit
The file contents are not retained. Fim can thus be put in the category of UVCS (UnVersioned Control System)
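As a rough illustration of what one entry of such an index might hold, here is a minimal Python sketch. The `FileEntry` and `build_entry` names and the field layout are assumptions made for this example, not Fim's actual State format, and only the complete SHA-512 hash is computed.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical index entry: metadata and a hash, never the file content."""
    name: str
    size: int
    modified: float   # modification time, seconds since the epoch
    mode: int         # permission bits
    full_hash: str    # SHA-512 of the whole file, hex-encoded


def build_entry(path: str) -> FileEntry:
    """Read the file once, in chunks, and record its metadata and SHA-512."""
    info = os.stat(path)
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return FileEntry(
        name=os.path.basename(path),
        size=info.st_size,
        modified=info.st_mtime,
        mode=info.st_mode & 0o777,
        full_hash=digest.hexdigest(),
    )
```

Because an entry holds only metadata and a fixed-size hash, its size does not depend on the size of the file it describes.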
1/ Create the Fim repository
fim init
2/ Work on the files, modify
...
3/ Know the state of the workspace
fim st # 'status' command
4/ Commit to save the current state
fim ci -m "My commit comment"
5/ Display the history
fim log
To get a faster status, Fim can hash only 3 blocks of each file: at the beginning, in the middle and at the end.
You can use the options:
-f: Fast mode. Hashes 3 blocks of 1 MB
-s: Super-fast mode. Hashes 3 blocks of 4 KB
fim st -f
fim st -s
With the -n option (no hash), Fim only checks the file names and attributes
fim st -n
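The idea behind the fast and super-fast modes can be sketched in Python as follows. This is only an illustration: the exact block offsets, the handling of small files and the `three_block_hash` name are assumptions, not Fim's actual implementation.

```python
import hashlib
import os

FAST_BLOCK = 1024 * 1024      # 1 MB, the block size given for -f
SUPER_FAST_BLOCK = 4 * 1024   # 4 KB, the block size given for -s


def three_block_hash(path: str, block_size: int) -> str:
    """SHA-512 over up to three blocks: start, middle and end of the file."""
    size = os.path.getsize(path)
    if size <= 3 * block_size:
        # Small file (assumed behaviour): hash everything
        ranges = [(0, size)]
    else:
        middle = (size - block_size) // 2
        ranges = [
            (0, block_size),
            (middle, block_size),
            (size - block_size, block_size),
        ]
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for offset, length in ranges:
            handle.seek(offset)
            digest.update(handle.read(length))
    return digest.hexdigest()
```

The trade-off is visible in the sketch: a change located outside the three sampled blocks does not alter the result, which is why a full hash is still needed for a complete check.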
$ mkdir test1
$ cd test1
$ for i in 01 02 03 04 05 06 07 08 09 10 ;
do echo "New File $i" > file${i} ; done
$ fim init -m "First State"
# Modifications
$ fim st
$ fim ci -m "My modifications"
$ fim log
Case 1 - Workspace management
Case 2 - Detect and remove duplicates inside a workspace
Case 3 - Duplicate files in another workspace
Case 4 - Backup integrity
Manage directories filled with binaries.
For example: photos, music or movies
Know the state of a workspace in which we work episodically
Follow the developments over time
fim st
Changes made
fim ci -m "My commit comment"
Fim does not back up the file contents.
Using backup software alongside it is recommended
With the super-fast mode, a commit first checks the current status in super-fast mode, which makes the commit faster.
The files detected as modified are then hashed again in full mode
fim ci -s -y -m "Commit very quickly using super-fast commit"
Some commands are faster when they are executed from a sub-directory, because they have fewer files to process:
st (status): Know the status of the sub-directory
ci (commit): Commit changes
fdup (find-duplicates): Find duplicate files
rdup (remove-duplicates): Delete duplicate files
rfa (reset-file-attrs): Reset the file attributes
All other commands work from a sub-directory too, but are not faster
You can add a .fimignore file at any level of the repository, and also globally in the user home directory
Each line of the file contains the name of a file or directory to ignore.
You can also use the following expressions:
An asterisk to match several names (example: *.mp3)
**/ at the beginning of the line: the rest of the line is ignored in all sub-directories.
For example, to ignore all the mp3 files, even those located in sub-directories:
**/*.mp3
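One possible reading of these rules, sketched in Python. The `is_ignored` function is hypothetical, and the exact matching semantics (in particular that a plain pattern only applies to entries directly in the directory holding the .fimignore file) are assumptions, not a description of Fim's actual matcher.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable


def is_ignored(relative_path: str, patterns: Iterable[str]) -> bool:
    """Check a path, relative to the directory holding the ignore file."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return False
    for pattern in patterns:
        if pattern.startswith("**/"):
            # Applies at any depth: compare against every path component
            if any(fnmatchcase(part, pattern[3:]) for part in parts):
                return True
        elif fnmatchcase(parts[0], pattern):
            # Plain pattern: only entries directly in this directory
            return True
    return False
```

With this reading, `*.mp3` ignores `song.mp3` but not `albums/song.mp3`, while `**/*.mp3` ignores both.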
Limit the display with -i to ignore the changes on:
attrs: File attributes
dates: Modification and creation dates
renamed: Renamed files
fim st -i attrs,dates,renamed
To ignore all kinds of changes, specify all
fim st -i all
It is important to ensure that the files keep their permissions.
Fim saves them and is able to restore them.
It stores for each file the following information:
You can restore the permissions using the rfa (reset-file-attrs) command:
fim rfa
Fim can display the duplicates contained in a workspace using the fdup (find-duplicates) command:
fim fdup
If the current state is already committed, you can skip the workspace scan phase by using the -l option:
fim fdup -l
It is possible to remove duplicate files.
• Either interactively:
fim rdup
• Or automatically, keeping the first file of each duplicate set:
fim rdup -y
In both cases, it is possible to use the last committed state, as with fdup, by adding the -l option:
fim rdup -l
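Once every file has a hash in the index, finding duplicates amounts to grouping paths by hash. The Python sketch below illustrates this, together with the "keep the first file" policy of the automatic mode. The function names and the `hash_by_path` mapping are hypothetical, and sorting paths to decide which file comes first is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicates(hash_by_path: Dict[str, str]) -> List[List[str]]:
    """Group paths that share the same content hash into duplicate sets."""
    groups = defaultdict(list)
    for path, file_hash in sorted(hash_by_path.items()):
        groups[file_hash].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


def duplicates_to_remove(hash_by_path: Dict[str, str]) -> List[str]:
    """Keep the first file of each duplicate set and list the others."""
    to_remove = []
    for paths in find_duplicates(hash_by_path):
        to_remove.extend(paths[1:])
    return to_remove
```

For example, with `{"a.jpg": "h1", "copy/a.jpg": "h1", "b.jpg": "h2"}` the only duplicate set is `["a.jpg", "copy/a.jpg"]`, and `copy/a.jpg` is the file listed for removal.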
Fim can delete duplicate files contained in another workspace.
For example, an old out-of-sync backup in which you want to keep only the files that you do not have elsewhere.
It removes locally all the files already contained in the master workspace.
For example, with backup, which is a copy of the repository named source:
$ cd backup
$ fim rdup -M ../source
When the workspace to clean is remote, you can just copy the .fim directory of the master workspace into an empty directory and pass that directory to the -M option of the rdup command
Fim allows you to check the integrity of files stored on any kind of file system, in particular offline backups.
To do so, add the corresponding Fim repository (.fim) to the backup
It will then be possible to check the integrity of the data in the backup with Fim
For example, for a DVD that contains a backup and the Fim repository, you can go to its root and get the status:
fim st
Fim diagnoses a hardware corruption when the content of a file has changed while its creation and modification dates have not
This is done with the dcor (detect-corruption) command:
fim dcor
Fim can produce false positives if the content has changed and the dates have been reset, for example with the rfa command
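The corruption rule described above can be written as a small predicate. The Python sketch below is an illustration only: the `Snapshot` record and the `looks_corrupted` function are hypothetical names, not part of Fim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical view of one file as recorded in a State."""
    full_hash: str
    created: float
    modified: float


def looks_corrupted(previous: Snapshot, current: Snapshot) -> bool:
    """Content changed although neither date moved: suspect the hardware."""
    content_changed = previous.full_hash != current.full_hash
    dates_unchanged = (previous.created, previous.modified) == (
        current.created,
        current.modified,
    )
    return content_changed and dates_unchanged
```

The false positive mentioned above follows directly from the rule: a legitimate edit whose dates are later reset to their old values satisfies both conditions.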
dign (display-ignored): Displays the files or directories ignored in the last State
rbk (rollback): Removes the last committed State
pst (purge-states): Purges old States and keeps only the last one
You can download a Fim distribution (Last release)
or build a version from the master (Details)
Fim is written in Java. It requires Java 8
Works on Linux, Windows and Mac OS X
Tested on those three platforms
For Linux, if you do not have Java, or do not have the correct version, a Docker image is available on Docker Hub
• Retrieve the fim-docker script
curl https://raw.githubusercontent.com/evrignaud/fim/master/fim-docker -L -o fim-docker && chmod a+rx fim-docker
• Launch Fim
The script retrieves the Fim Docker image, then starts Fim.
It takes the same arguments as the fim command
./fim-docker -h
• Update the image used
docker pull evrignaud/fim
Ideas to try it yourself:
.fim content
The .fim directory contains:
A file with the repository settings: settings.json
One file for each State, stored as compressed JSON
Each State file contains a global hash which ensures its integrity.
Fim refuses to use a State that has been modified
The content of a State is standardized and can be used on all the supported operating systems
During the Fim repository initialization, you can ask not to use some of the hashes. This gives a repository where commits are faster, but where collisions could occur:
• -f: Max. hash mode: fast. Afterwards you can use -f, -s or -n
• -s: Max. hash mode: super-fast. Afterwards you can use -s or -n
• -n: Max. hash mode: no-hash. Afterwards you can only use -n
Example:
fim init -f
After that, the status command works by default at the 'fast' level.
fim st # Runs using '-f'
The hash algorithm is SHA-512
SHA-512 is about twice as slow as MD5, but it is a cryptographic algorithm, which reduces the risk of collisions
The size of the hash produced (512 bits / 64 bytes) minimizes the collision risk on large files
The disk is slower than the hash algorithm
When the status is checked in full mode, the 3 hashes are used, which decreases the risk of collision
Fim can manage at least 1 million files
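To give an order of magnitude for the collision risk, here is a rough estimate under the assumption that SHA-512 outputs behave like uniformly random 512-bit values. The union bound over all pairs of n files gives the result below. Note that it depends on the number of files and not on their size.

```latex
\[
P_{\text{collision}} \;\le\; \binom{n}{2}\, 2^{-512} \;<\; n^{2} \cdot 2^{-513}
\]
\[
n = 10^{6} < 2^{20}
\quad\Longrightarrow\quad
P_{\text{collision}} \;<\; 2^{40} \cdot 2^{-513} \;=\; 2^{-473}
\]
```

Under that assumption, an accidental collision between complete hashes in a repository of one million files is negligible. The faster modes are a different matter, since they hash only part of each file.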
Hashing is multi-threaded in order to take advantage of the available resources
In many cases, performance is limited by the speed of the disk
By default, the number of threads is dynamic and depends on the disk throughput
Use the -t option to change the number of threads used (Details)
One thread scans the workspace and fills the queue of files to hash
Several threads each hash one file at a time:
— Calculation of the size and the location of the next block
— Block mapped into memory using a NIO FileChannel
— Hashing, or not, by the 3 hashers that produce the 3 hashes, using MessageDigest
The hashers receive the same blocks, so each block is read only once, which reduces the I/O
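The pipeline described above (one scanner thread, a queue, several hasher threads, each block read once and shared between hashers) can be sketched in Python. This is an illustration under assumptions, not Fim's Java code: it uses plain reads instead of memory-mapped blocks, only two hashers, and every name in it is hypothetical.

```python
import hashlib
import os
import queue
import threading

BLOCK_SIZE = 1024 * 1024  # illustrative read size, not Fim's actual value


def hash_file(path):
    """Read each block once and hand it to every hasher that wants it."""
    first_block = hashlib.sha512()  # stands in for a partial hasher
    full = hashlib.sha512()         # hashes the complete file
    with open(path, "rb") as handle:
        index = 0
        while True:
            block = handle.read(BLOCK_SIZE)
            if not block:
                break
            if index == 0:
                first_block.update(block)
            full.update(block)
            index += 1
    return {"first_block": first_block.hexdigest(), "full": full.hexdigest()}


def scan(root, todo, worker_count):
    """Scanner thread: queue every file, then one stop marker per worker."""
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            todo.put(os.path.join(directory, name))
    for _ in range(worker_count):
        todo.put(None)


def worker(todo, results, lock):
    """Hasher thread: hash queued files until the stop marker arrives."""
    while True:
        path = todo.get()
        if path is None:
            return
        hashes = hash_file(path)
        with lock:
            results[path] = hashes


def hash_workspace(root, worker_count=4):
    """Run one scanner and several hashers, and return hashes by path."""
    todo, results, lock = queue.Queue(), {}, threading.Lock()
    threads = [threading.Thread(target=scan, args=(root, todo, worker_count))]
    threads += [
        threading.Thread(target=worker, args=(todo, results, lock))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Feeding the same block to all hashers means the disk, which is usually the bottleneck, is read only once per file however many hashes are produced.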
Purposely using btrfs RAID1 in degraded mode ?
https://www.spinics.net/lists/linux-btrfs/msg50990.html
For offline long term backups I also used to work with hashdeep to perform and store a hash of all the files and recently started playing with Fim which is similar but with a git backend for storing history. Don’t get fooled by fim being a java application.
It easily outperformed hashdeep on large datasets.
Articles that I have written to promote Fim on LinuxFr.org:
1.2.0 - Focus on performance with Fim 1.2.0
Support of repositories with at least 1 million files
1.1.0 - Fim 1.1.0
Hash algorithm rewrite to hash a block at the beginning,
one in the middle and one at the end
1.0.2 - Fim release 1.0.2, that verifies the integrity of your file
First public version of Fim
Korben - Vérifier l’intégrité de très nombreux fichiers (Checking the integrity of a very large number of files)
Fim is a really excellent tool that lets you manage the integrity of many files …
01net.com - For Linux - For Windows
... lets you check the integrity of all your files after handling them in batches …
Pirate Informatique n°26 page 41 - Vérifiez l’intégrité de vos fichiers (Check the integrity of your files)
If you have a bunch of files to transfer, you would no doubt like to be absolutely sure that the data has not been damaged …
linux-btrfs - Purposely using btrfs RAID1 in degraded mode ? or here
... Don’t get fooled by fim being a java application. It easily outperformed hashdeep on large datasets.
Stack Overflow - Signing every file created in a folder
Fim is published as open source so that it benefits everyone.
That also makes it possible to benefit from everyone's ideas.
Here are the ideas and merge requests that have been submitted:
Do not hesitate to open issues with your ideas on GitHub
git-lfs - Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git
git-annex - Managing files with git, without checking the file contents into git
doc - Manage your files and documents
DIRT (DIRectory Tracker) - Another little Linux command line utility to track changed files in a directory tree.
MD5 Summer - Application for Windows 9x, NT, ME, 2000 and XP which generates and verifies md5 checksums
btrfs-scrub - Reads all data from the disk and verifies checksums
None of these tools has a super-fast mode for status or commit
Asciidoctor - Ruby implementation of AsciiDoc, used to generate the Fim documentation and these presentation slides