The duperemove utility finds duplicate filesystem extents and optionally schedules them for deduplication. An “extent” is a contiguous region of a file's data as stored on the filesystem. On some filesystems, a single extent can be referenced by several files when their contents are identical. This is known as “extent sharing” or “deduplication”, and it saves disk space by eliminating the need to store multiple copies of the same data.
The duperemove command compares the contents of files to identify identical extents. Run without the -d option, it only reports the duplicates it finds. With -d, it also schedules them for deduplication, so that the files share a single copy of the data and the space taken by the redundant copies is freed.
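The idea of finding identical regions by content can be illustrated with a small shell sketch. This is not duperemove's implementation, only a toy model of the comparison step: it splits each file into fixed-size blocks, hashes every block with sha256sum, and lists the blocks whose hash occurs more than once. The function name dup_blocks and its arguments are hypothetical, and nothing is deduplicated.

```shell
#!/bin/sh
# Hypothetical helper (not part of duperemove): list fixed-size blocks
# whose content hash occurs more than once across the given files.
# usage: dup_blocks BLOCK_SIZE_BYTES FILE...
# output: one line per duplicated block: "<sha256> <file> <byte offset>"
dup_blocks() {
    bs=$1
    shift
    for f in "$@"; do
        size=$(wc -c < "$f")
        # Number of blocks, rounding up so a short final block is included.
        n=$(( (size + bs - 1) / bs ))
        i=0
        while [ "$i" -lt "$n" ]; do
            # Read block i of the file and hash its contents.
            h=$(dd if="$f" bs="$bs" skip="$i" count=1 2>/dev/null |
                sha256sum | cut -d' ' -f1)
            printf '%s %s %s\n' "$h" "$f" "$((i * bs))"
            i=$((i + 1))
        done
    done | sort | awk '
        # Remember every line and count how often each hash appears.
        { count[$1]++; line[NR] = $0; hash[NR] = $1 }
        # Print only the lines whose hash was seen at least twice.
        END { for (k = 1; k <= NR; k++) if (count[hash[k]] > 1) print line[k] }
    '
}

# Example (assumed file names): dup_blocks 131072 file1 file2
```

Blocks that share a hash are candidates for sharing one stored copy. A real tool would still have to ask the filesystem to perform the sharing, which this sketch does not attempt.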
duperemove Command Examples
1. Search for duplicate extents in a directory and show them:
# duperemove -r path/to/directory
2. Deduplicate the duplicate extents found on a Btrfs or XFS (experimental) filesystem:
# duperemove -r -d path/to/directory
3. Use a hash file to store extent hashes (this uses less memory, and the file can be reused on subsequent runs):
# duperemove -r -d --hashfile=path/to/hashfile path/to/directory
4. Limit the number of I/O threads (used in the hashing and dedupe stages) and CPU threads (used in the duplicate-extent-finding stage):
# duperemove -r -d --hashfile=path/to/hashfile --io-threads=N --cpu-threads=N path/to/directory
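When these options are combined regularly, it can help to assemble the command in a small wrapper and inspect it before running it. The sketch below only prints the command line and does not execute duperemove. The function name duperemove_cmd is hypothetical, and defaulting both thread counts to the CPU count reported by nproc is an assumption made for illustration, not a recommendation from duperemove itself.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a duperemove command line for review.
# usage: duperemove_cmd DIRECTORY HASHFILE [IO_THREADS] [CPU_THREADS]
duperemove_cmd() {
    dir=$1
    hashfile=$2
    # Assumption: default both thread counts to the number of online CPUs,
    # falling back to 1 when nproc is unavailable.
    cpus=$(nproc 2>/dev/null || echo 1)
    io=${3:-$cpus}
    cpu=${4:-$cpus}
    printf 'duperemove -r -d --hashfile=%s --io-threads=%s --cpu-threads=%s %s\n' \
        "$hashfile" "$io" "$cpu" "$dir"
}

# Example: duperemove_cmd path/to/directory path/to/hashfile 2 4
```

Printing the command first gives a chance to check the paths and thread counts before the deduplication pass touches any data.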