A utility for splitting files into chunks with optional overlap.
chunkfile is a Python script that splits a file into multiple chunks. It supports three modes of operation:
- Split by number of chunks (-n)
- Split by chunk size (-s)
- Split by number of lines (-l)
Each mode supports overlapping content between chunks to preserve context across chunk boundaries.
chunkfile filename [-n NUM_CHUNKS] [-s SIZE] [-l LINES] [-o OVERLAP]
- filename: The file to split
- -n, --num_chunks: Number of chunks to create
- -s, --size: Size of each chunk in bytes
- -l, --lines: Number of lines per chunk
- -o, --overlap: Number of bytes/lines to overlap between chunks (default: 0)
You must specify exactly one of -n, -s, or -l.
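The option rules above map naturally onto argparse. What follows is a minimal sketch, not the script's actual parser: the function name build_parser is hypothetical, and it assumes the "exactly one of" rule is enforced with a required mutually exclusive group.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(
        prog="chunkfile",
        description="Split a file into chunks with optional overlap.",
    )
    parser.add_argument("filename", help="The file to split")

    # Assumption: exactly one mode is required, so the three mode flags
    # live in a required mutually exclusive group.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-n", "--num_chunks", type=int,
                      help="Number of chunks to create")
    mode.add_argument("-s", "--size", type=int,
                      help="Size of each chunk in bytes")
    mode.add_argument("-l", "--lines", type=int,
                      help="Number of lines per chunk")

    parser.add_argument("-o", "--overlap", type=int, default=0,
                        help="Number of bytes/lines to overlap between chunks")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["large_file.txt", "-n", "10", "-o", "100"])
print(args.num_chunks, args.overlap)  # prints "10 100"
```

With this arrangement, passing two mode flags at once (or none) makes argparse print a usage error and exit, which matches the rule stated above.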
- Split a file into 10 chunks with 100-byte overlap:
  chunkfile large_file.txt -n 10 -o 100
- Split a file into 2048-byte chunks with 50-byte overlap:
  chunkfile large_file.txt -s 2048 -o 50
- Split a file into chunks of 1000 lines with 10-line overlap:
  chunkfile large_file.txt -l 1000 -o 10
The script creates numbered chunks with the following naming pattern:
{original_name}_{number:02}{extension}
For example, splitting input.txt would create:
- input_01.txt
- input_02.txt
- etc.
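The naming pattern can be reproduced with a format string. This is a small sketch under the assumption that the name is split at the last extension with os.path.splitext; the helper chunk_name is hypothetical and not taken from the script.

```python
import os


def chunk_name(path: str, number: int) -> str:
    """Build the output name for chunk `number` (hypothetical helper)."""
    # splitext separates "input.txt" into ("input", ".txt").
    root, ext = os.path.splitext(path)
    # {number:02} zero-pads to at least two digits.
    return f"{root}_{number:02}{ext}"


print(chunk_name("input.txt", 1))  # prints "input_01.txt"
print(chunk_name("input.txt", 2))  # prints "input_02.txt"
```

Note that the 02 width is a minimum, so chunk 100 would be written as input_100.txt rather than being truncated.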
In line mode, the script:
- Splits the file into chunks with an exact number of lines
- Preserves complete lines
- Supports line overlap between chunks
- Removes trailing newlines to prevent extra blank lines
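The line-mode behavior listed above can be sketched as follows. This is an illustration, not the script's code: the names split_lines and chunk_text are hypothetical, and it assumes the overlap counts toward the chunk length, so each chunk starts (lines per chunk minus overlap) lines after the previous one.

```python
from typing import List


def split_lines(lines: List[str], per_chunk: int, overlap: int = 0) -> List[List[str]]:
    """Group complete lines into chunks that share `overlap` lines."""
    if per_chunk <= 0:
        raise ValueError("per_chunk must be positive")
    if not 0 <= overlap < per_chunk:
        raise ValueError("overlap must be in the range [0, per_chunk)")

    step = per_chunk - overlap  # assumption: overlap is part of the chunk
    chunks = []
    start = 0
    while start < len(lines):
        chunks.append(lines[start:start + per_chunk])
        if start + per_chunk >= len(lines):
            break  # this chunk already reached the end of the file
        start += step
    return chunks


def chunk_text(chunk: List[str]) -> str:
    """Join a chunk and drop trailing newlines to avoid extra blank lines."""
    return "".join(chunk).rstrip("\n")


lines = [f"line{i}\n" for i in range(1, 8)]  # seven lines
for chunk in split_lines(lines, per_chunk=3, overlap=1):
    print(repr(chunk_text(chunk)))
```

With seven input lines, three lines per chunk, and a one-line overlap, this yields three chunks: lines 1-3, 3-5, and 5-7.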
In byte mode, the script:
- Splits the file into chunks of an exact byte size
- Supports byte overlap between chunks
- Handles partial reads and end-of-file conditions
- Preserves binary content
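A byte-mode splitter along these lines could look like the sketch below. It is an illustration under stated assumptions, not the script's implementation: iter_byte_chunks is a hypothetical name, and the overlap is assumed to count toward the chunk size, so each chunk begins with the last `overlap` bytes of the previous one.

```python
import io
from typing import BinaryIO, Iterator


def iter_byte_chunks(stream: BinaryIO, size: int, overlap: int = 0) -> Iterator[bytes]:
    """Yield chunks of at most `size` bytes that share `overlap` bytes."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")

    tail = b""  # overlap buffer: the end of the previous chunk
    while True:
        # Read only the bytes needed to fill the chunk after the overlap.
        fresh = stream.read(size - len(tail))
        if not fresh:
            break  # end of file: do not emit a chunk made only of overlap
        chunk = tail + fresh  # a short read near EOF gives a shorter chunk
        yield chunk
        tail = chunk[-overlap:] if overlap else b""


# The stream must be opened in binary mode so content is preserved exactly.
data = io.BytesIO(b"abcdefghij")
print(list(iter_byte_chunks(data, size=4, overlap=1)))
# prints [b'abcd', b'defg', b'ghij']
```

Because the generator reads one block at a time, memory use stays near the chunk size regardless of how large the input file is.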
Overlap (-o):
- Maintains context between chunks
- Configurable overlap size
- Works with both line and byte modes
- Ensures consistent chunk sizes
The script validates inputs and provides clear error messages for:
- Missing required arguments
- Invalid chunk sizes
- Overlap larger than chunk size
- Non-positive numbers
- File access issues
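The numeric checks in this list can be sketched as a single function. This is a hedged illustration: validate_args and its error messages are hypothetical, file access checks are left out, and the sketch also rejects an overlap equal to the chunk size, which is an assumption beyond the "larger than" wording above.

```python
from typing import Optional


def validate_args(num_chunks: Optional[int] = None,
                  size: Optional[int] = None,
                  lines: Optional[int] = None,
                  overlap: int = 0) -> None:
    """Raise ValueError with a descriptive message for invalid input."""
    chosen = [v for v in (num_chunks, size, lines) if v is not None]
    if len(chosen) != 1:
        raise ValueError("specify exactly one of -n, -s, or -l")
    if chosen[0] <= 0:
        raise ValueError("chunk count, size, and line count must be positive")
    if overlap < 0:
        raise ValueError("overlap must not be negative")

    # In -s and -l modes the chunk size is known up front, so the overlap
    # can be compared against it directly. In -n mode the size depends on
    # the file, so that check is skipped here.
    fixed = size if size is not None else lines
    if fixed is not None and overlap >= fixed:
        raise ValueError("overlap must be smaller than the chunk size")


validate_args(size=2048, overlap=50)  # valid: returns without raising
try:
    validate_args(lines=10, overlap=10)
except ValueError as err:
    print(f"error: {err}")  # prints "error: overlap must be smaller than the chunk size"
```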
Line mode implementation:
- Reads the file line by line
- Maintains overlap buffer
- Handles UTF-8 and binary files
- Preserves line endings except for trailing newline
Byte mode implementation:
- Uses efficient block reading
- Maintains overlap buffer
- Handles partial reads
- Preserves binary content exactly
Requirements:
- Python 3.6+
- Standard library only (no external dependencies)
Related tools:
- genmd: Generate markdown documentation
- filetree: Display directory structure