本次想要记录使用shell或者python对指定目录下的文件进行md5完整性检验
md5文件一般都会保存在对应文件的相同目录下,需要针对多个子目录挨个进行md5sum -c
的话未免有点麻烦。
需要cd到子文件夹下进行
md5sum -c
所以函数式样的编程就很方便。
主体思考
对每个文件生成md5,并和已经存在的md5进行比较,如果相同就可以print
Pass
shell工作记录
#!/bin/bash
# Main folder containing subfolders
base_folder="Dunwill_PDX_RNA-seq"
# Output file for results
output_file="file_integrity_check_results.csv"
# Initialize the output file with headers
echo "File,Status" > "$output_file"
# Function to calculate and verify MD5 checksum
verify_md5() {
local file_path=$1
local md5_file_path="${file_path}.md5"
if [ -f "$md5_file_path" ]; then
# Calculate the MD5 checksum of the file
calculated_md5=$(md5sum "$file_path" | awk '{ print $1 }')
# Read the stored MD5 checksum from the .md5 file
stored_md5=$(awk '{ print $1 }' "$md5_file_path")
# Compare the calculated MD5 with the stored MD5
if [ "$calculated_md5" == "$stored_md5" ]; then
echo "$file_path,Pass" >> "$output_file"
else
echo "$file_path,Fail" >> "$output_file"
fi
else
echo "$file_path,MD5 file missing" >> "$output_file"
fi
}
# Export the function to be used by find command
export -f verify_md5
export output_file
# Find all files (excluding .md5 files) and verify their checksums
find "$base_folder" -type f ! -name "*.md5" -exec bash -c 'verify_md5 "$0"' {} \;
# Find all files (excluding .md5 files) and verify their checksums in parallel
# find "$base_folder" -type f ! -name "*.md5" | parallel -j 14 verify_md5 {} >> "$output_file"
echo "File integrity check completed. Results saved to $output_file."
随后保存成sh文件转化为可执行文件跑一下就行了。chmod +x check_md5_integrity_parallel.sh
或者直接bash check_md5_integrity_parallel.sh
python 工作记录
import os
import hashlib
import pandas as pd
def calculate_md5(file_path):
"""Calculate MD5 checksum of the given file."""
hash_md5 = hashlib.md5()
with open(file_path, "rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
hash_md5.update(chunk)
return hash_md5.hexdigest()
def verify_md5(file_path, md5_file_path):
"""Verify the MD5 checksum of the given file against the value in the md5 file."""
calculated_md5 = calculate_md5(file_path)
with open(md5_file_path, 'r') as f:
stored_md5 = f.read().strip().split()[0]
return calculated_md5 == stored_md5
def check_integrity(base_folder):
results = []
for subdir, _, files in os.walk(base_folder):
for file in files:
if file.endswith(".md5"):
continue
file_path = os.path.join(subdir, file)
md5_file_path = file_path + ".md5"
if os.path.exists(md5_file_path):
status = verify_md5(file_path, md5_file_path)
results.append({
"File": file_path,
"Status": "Pass" if status else "Fail"
})
else:
results.append({
"File": file_path,
"Status": "MD5 file missing"
})
return results
# Main folder containing subfolders
base_folder = "Dunwill_PDX_RNA-seq"
# Run the integrity check
results = check_integrity(base_folder)
# Convert results to a DataFrame for better readability
df_results = pd.DataFrame(results)
print(df_results)
# Optionally, save the results to a CSV file
df_results.to_csv("file_integrity_check_results.csv", index=False)
也是使用命令行运行python check_md5_integrity.py