Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zip-cli binary crate #235

Draft
wants to merge 31 commits into
base: master
Choose a base branch
from
Draft

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Aug 12, 2024

Problem

We often want to generate zip files for use in testing and benchmarking, as in #208 and #233. In #233 I added a python script in tests/data/ which uses the built-in zipfile stdlib module to create zip files. While this works, it also seems a little silly when we've done most of the work already in this crate to create a highly flexible interface to zip files. Separately, it seems like having a CLI to work on would be a good way to learn more about ways we can make this crate work better with filesystems (e.g. in a TODO I note that we should probably modify .add_symlink_from_path() to accept an OsStr target).

Prior Investigations

I previously created https://github.com/cosmicexplorer/medusa-zip as a sandbox to mess around with high-performance zip merging and splitting. That was partially because the zip crate was unmaintained at the time, and partially because I was under the impression that some of the techniques would be difficult to retrofit into a standard zip library. It was also an excuse to play around with pyo3 and learn how to build for many python platforms with cibuildwheels.

Since then, I've found that:

  1. The use of async as in medusa-zip tends to heavily slow down programs during purely filesystem-based I/O, as they tend not to wait for very long (this is why linux does not offer very many nonblocking file APIs). async tends to be harmful to latency and more useful for very high throughput, especially when waiting on network latencies.
  2. High levels of parallelism do not require egregious usage of temporarily output files as in medusa-zip, but can be implemented with pread() and pipes on #[cfg(unix)] targets as in parallel/pipelined extraction #208.

Solution

Create a zip-cli crate in the cli/ subdir. This isn't published anywhere yet, as the main purpose right now is to make it easier to generate test data for benchmarks. This is planned to provide three subcommands:

  • compress: accept a specification of entries and options and generate a zip file output, like zip.
  • info: write out info about a zip file, like zipinfo.
  • extract: extract individual entries or an entire archive to stdout or the filesystem, like unzip.

Result

Currently, only compress is implemented, but it seems to work great:

> cargo run -p zip-cli -- compress -h
   Compiling zip-cli v0.0.0 (/home/cosmicexplorer/tools/zip/cli)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.32s
     Running `target/debug/zip-cli compress -h`
do a compress

Usage: zip-cli compress [OPTIONS] [positional-path]...

Options:
  -o, --output-file <output-file>  Output zip file path to write.
      --stdout                     Allow writing output to stdout even if stdout is a tty.
  -h, --help                       Print help (see more with '--help')

ATTRIBUTES:
  -c, --compression-method <compression-method>
          Which compression technique to use.
          Defaults to deflate if not specified. [possible values: stored, deflate, deflate64, bzip2, zstd]
  -l, --compression-level <compression-level>
          How much compression to perform, from 0..=24.
  -m, --mode <mode>
          Unix permissions to apply to the file, in octal (like chmod).
      --large-file <large-file>
          Whether to enable large file support. [possible values: true, false]

ENTRIES:
  -n, --name <name>
          The name to apply to the entry.
  -d, --dir
          Create a directory entry.
  -s, --symlink
          Make the next entry into a symlink entry.
  -i, --imm <immediate>
          Write an entry containing this data.
  -f, --file <file>
          Write an entry with the contents of this file path.
  -r, --recursive-dir <recursive-directory>
          Write all the recursive contents of this directory path.
  [positional-path]...
          Write the file or recursive directory contents, relativizing the path.
> cargo run -p zip-cli -- -v compress -o out.zip -n a -d -n b -s -i c -m755 -f tmp/file.txt 
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/zip-cli -v compress -o out.zip -n a -d -n b -s -i c -m755 -f tmp/file.txt`
writing compressed zip to output file path "out.zip"
default zip entry options: FileOptions { compression_method: Deflated, compression_level: None, last_modified_time: DateTime::from_date_and_time(2024, 8, 12, 16, 42, 29)?, permissions: None, large_file: false, encrypt_with: None, extended_options: (), alignment: 1, zopfli_buffer_size: Some(32768) }
setting name of next entry to "a"
writing dir entry
setting name of next entry to "b"
setting symlink flag for next entry
writing immediate symlink entry with name "b" and target "c"
setting file mode 0o755
writing file entry from path "tmp/file.txt" with name "tmp/file.txt"
> unzip -l out.zip
Archive:  out.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2024-08-12 16:42   a/
        1  2024-08-12 16:42   b
        4  2024-08-12 16:42   tmp/file.txt
---------                     -------
        5                     3 files

TODO

  • Implement info and extract, and gate them by feature flags (zip-cli binary crate #235 (comment)).
  • Implement a WrapErr trait and Error impl that doesn't call into eyre or anyhow, since we're not making a large server application which may error out in unexpected ways (zip-cli binary crate #235 (comment)).
    • Also make CLI arg parsing errors use this error framework instead of calling process::exit() by hand.
  • Make errors of the same type produce the same error message.
    • e.g. for -f args and positional file args that don't exist, don't write out different error messages, but make them produce the same case of e.g. a CompressError struct, which is converted into a standard stderr output + exit code at the top-level main() method.

Longer-term: optimize compress via a "merge" operation

medusa-zip accepts a JSON interface for specifying inputs to compress, which is less useful for the shell but much more useful for execution e.g. from build tools. Having all the inputs planned out in advance also makes it possible to perform extremely simple and powerful optimizations by creating "sub-zips" in parallel and merging them without decompressing. This is something I plan to propose after I spend more time on #208.

cli/src/compress.rs Fixed Show fixed Hide fixed
cli/src/compress.rs Fixed Show fixed Hide fixed
Copy link
Member

@Pr0methean Pr0methean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a partial review; I'll review most of the code changes tomorrow or Wednesday.

Cargo.toml Outdated
[workspace]
members = [
".",
"cli",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should fuzz also be a member?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea! Unfortunately, I get this incredibly strange error during linking:

error: linking with `cc` failed: exit status: 1
  |
  = note: LC_ALL="C" PATH="/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin:/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin/self-contained:/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin:/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin/self-contained:/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin:/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin/self-contained:/home/cosmicexplorer/bazel-bullshit/:/usr/lib/jvm/default-runtime/bin:/home/cosmicexplorer/.pyenv/shims:/home/cosmicexplorer/tools/emacs/rex-install/bin:/home/cosmicexplorer/.local/bin:/home/cosmicexplorer/go/bin:/home/cosmicexplorer/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/rustup/bin:/usr/bin/core_perl:/usr/bin/vendor_perl:/home/cosmicexplorer/.zsh/snippets/bash:/home/cosmicexplorer/go/bin:/home/cosmicexplorer/.cargo/bin" VSLANG="1033" "cc" "-m64" "/tmp/rustcgcg83u/symbols.o" "-Wl,-Bstatic" "-Wl,--whole-archive" "-Wl,/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc-nightly_rt.asan.a" "-Wl,--no-whole-archive" "/home/cosmicexplorer/tools/zip/target/x86_64-unknown-linux-gnu/release/deps/fuzz_read-73bbb0616ed32380.fuzz_read.775828843f50ae8f-cgu.0.rcgu.o" "-Wl,--as-needed" "-L" "/home/cosmicexplorer/tools/zip/target/x86_64-unknown-linux-gnu/release/deps" "-L" "/home/cosmicexplorer/tools/zip/target/release/deps" "-L" "/home/cosmicexplorer/tools/zip/target/x86_64-unknown-linux-gnu/release/build/libfuzzer-sys-394e55403fa44217/out" "-L" "/home/cosmicexplorer/tools/zip/target/x86_64-unknown-linux-gnu/release/build/tikv-jemalloc-sys-9cd5b3f80b1e0b3c/out/build/lib" "-L" "/usr/lib" "-L" "/home/cosmicexplorer/tools/zip/target/x86_64-unknown-linux-gnu/release/build/zstd-sys-28dcdb0894e5b147/out" "-L" "/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "/tmp/rustcgcg83u/libzstd_sys-f8986c36e00f9a6f.rlib" "/tmp/rustcgcg83u/libtikv_jemalloc_sys-b8ade72b5ff520e1.rlib" "/tmp/rustcgcg83u/liblibfuzzer_sys-64607a53ace2a9b6.rlib" "/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcompiler_builtins-4d7d16bbf0636a40.rlib" "-Wl,-Bdynamic" "-lbz2" "-lstdc++" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-B/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin/gcc-ld" "-B/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin/gcc-ld" "-B/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin/gcc-ld" "-fuse-ld=lld" "-Wl,--eh-frame-hdr" "-Wl,-z,noexecstack" "-L" "/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-L" "/home/cosmicexplorer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/self-contained" "-o" "/home/cosmicexplorer/tools/zip/target/x86_64-unknown-linux-gnu/release/deps/fuzz_read-73bbb0616ed32380" "-pie" "-Wl,-z,relro,-z,now" "-Wl,--strip-all" "-nodefaultlibs"
# ... (snip)
          rust-lld: error: undefined symbol: __sancov_gen_.51
          >>> referenced by fuzz_read.775828843f50ae8f-cgu.0
          >>>               /home/cosmicexplorer/tools/zip/target/x86_64-unknown-linux-gnu/release/deps/fuzz_read-73bbb0616ed32380.fuzz_read.775828843f50ae8f-cgu.0.rcgu.o:(asan.module_dtor.10819)
          
          rust-lld: error: undefined symbol: __sancov_gen_.1086
          >>> referenced by fuzz_read.775828843f50ae8f-cgu.0
          >>>               /home/cosmicexplorer/tools/zip/target/x86_64-unknown-linux-gnu/release/deps/fuzz_read-73bbb0616ed32380.fuzz_read.775828843f50ae8f-cgu.0.rcgu.o:(asan.module_dtor.10892)
          
          rust-lld: error: too many errors emitted, stopping now (use --error-limit=0 to see all errors)
          collect2: error: ld returned 1 exit status
          
error: could not compile `zip-fuzz` (bin "fuzz_read") due to 1 previous error
Error: failed to build fuzz script: ASAN_OPTIONS="detect_odr_violation=0" RUSTFLAGS="-Cpasses=sancov-module -Cllvm-args=-sanitizer-coverage-level=4 -Cllvm-args=-sanitizer-coverage-inline-8bit-counters -Cllvm-args=-sanitizer-coverage-pc-table -Cllvm-args=-sanitizer-coverage-trace-compares --cfg fuzzing -Clink-dead-code -Zsanitizer=address -Cllvm-args=-sanitizer-coverage-stack-depth -Cdebug-assertions -C codegen-units=1" "cargo" "build" "--manifest-path" "/home/cosmicexplorer/tools/zip/fuzz/Cargo.toml" "--target" "x86_64-unknown-linux-gnu" "--release" "--config" "profile.release.debug=true" "--all-features" "--bin" "fuzz_read"

This is from:

  1. putting "fuzz" in Cargo.toml's workspace.members array.
  2. commenting out the workspace.members array in fuzz/Cargo.toml.

This seems to happen pretty reliably, and I honestly have no clue how to debug it. It would very much be nice to be able to reuse the arbitrary dependency declaration. If I dug more into how cargo fuzz is executed and configured I could probably figure this out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhhh, this is because of our strip and other stuff in profile.release! I think I can turn that off for fuzz targets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, ok. It looks like cargo fuzz is strictly only able to use release (by default) or dev (with a command-line flag). Custom Profiles might be appropriate. I don't really know how people distribute rust binaries, though--do they just run cargo install zip-cli and it builds from source? What profile would that use? I'll figure that out.

Copy link
Contributor Author

@cosmicexplorer cosmicexplorer Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it looks like we could e.g. tell users "run cargo install --profile release-lto zip-cli" for a smaller binary, maybe (see https://doc.rust-lang.org/cargo/commands/cargo-install.html#compilation-options). I suspect there are best practices already established for this sort of thing that we should look to, since we're surely not the only person who has wanted to put a fuzz crate and a binary crate in the same workspace. I will now take a look at your other comment regarding splitting up the binary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, to clarify:

  1. cargo fuzz is only able to use the dev or release profiles (release by default)
    • I assume we want to use release because fuzzing is slow, so we don't really have an option to change anything there -- we're just stuck with fuzzing using profile.release.
  2. turning on lto = false in profile.release breaks cargo fuzz (some linker error about ASan, linked above)
    • lto = true converts our current 1.6MB binary to 1.3MB, which is good but not as good as strip, which does work just fine.
  3. Custom profiles can inherit from profile.release and additionally set e.g. lto = true, without interfering with other profiles.
    • For local development, we can run cargo build --profile release-lto.
  4. cargo install also uses profile.release by default, and users generally expect to be able to just run cargo install zip-cli without additional flags.
    • However, cargo install at least does support custom profiles. So we could tell users "please run cargo install --profile release-lto for a smaller binary."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I suspect that disabling aes-crypto and compression algos other than deflate will make a bigger difference to the binary's size and memory footprint than any combination of lto and strip, and still cover the most common use cases.

cli/Cargo.toml Outdated Show resolved Hide resolved
cli/Cargo.toml Outdated Show resolved Hide resolved
cli/Cargo.toml Show resolved Hide resolved
cli/src/args.rs Outdated Show resolved Hide resolved
@cosmicexplorer cosmicexplorer force-pushed the cli branch 4 times, most recently from 1f9d6ce to 78c659b Compare August 13, 2024 13:45
@cosmicexplorer
Copy link
Contributor Author

Ok, after all those very long replies, the solution is actually quite simple:

  • make zip-cli expose a [lib] target as well as a [[bin]] target
    • forward all features from zip in its own [features] table
  • make another crate zip-clite (in its own workspace) that depends on zip-cli as a library (to perform clap argument parsing and implement each subcommand)

Result:

> cd cli
> cargo build --release
> ls target/release/zip-cli
-rwxr-xr-x 2 cosmicexplorer wheel 1.8M Aug 13 09:34 target/release/zip-cli*
> cd clite
> cargo build --release
> ls target/release/zip-clite
-rwxr-xr-x 2 cosmicexplorer wheel 825k Aug 13 09:35 target/release/zip-clite*

So we end up having to publish two separate crates, but (1) this allows us to achieve the precise separation you described in #235 (comment), and (2) this also means users can actually just run cargo install zip-cli and get the thing they want, or cargo install zip-clite if they want the smaller version.

Currently zip-clite does not expose any features of its own (it hardcodes deflate-flate2 and deflate-zlib), but zip-cli does, and it can be narrowed down to a smaller binary as desired:

> cd cli
> cargo build --release --no-default-features --features deflate-zlib,bzip2,lzma,xz
> ls target/release/zip-cli                                                          
-rwxr-xr-x 2 cosmicexplorer wheel 1.2M Aug 13 09:40 target/release/zip-cli*

Note that this 1.2M is smaller than the previous 1.8M with default features enabled.

I like this solution a lot and I'm really glad we didn't have to delve into anything more complex!

@cosmicexplorer
Copy link
Contributor Author

If we really wanted to care about artifact size for clite (do we?), we could consider removing clap, especially since the compress subcommand parses its order-dependent arguments in a relatively idiosyncratic way--we essentially have to work around clap a lot (see this incredibly lengthy bit of boilerplate we have to do to express our order-dependent args in clap:

zip2/cli/src/args.rs

Lines 83 to 317 in 78c659b

fn from_arg_matches(matches: &ArgMatches) -> Result<Self, clap::Error> {
let allow_stdout = matches.get_flag("stdout");
let output_path = matches.get_one("output-file").cloned();
assert!(!(allow_stdout && output_path.is_some()));
/* (1) Extract each arg type and associated indices. This is extremely boilerplate. */
let methods: Vec<CompressionMethodArg> = matches
.get_many("compression-method")
.map(|vs| vs.copied().collect())
.unwrap_or_default();
let methods_indices: Vec<usize> = matches
.indices_of("compression-method")
.map(|is| is.collect())
.unwrap_or_default();
let mut methods: VecDeque<(CompressionMethodArg, usize)> =
methods.into_iter().zip(methods_indices).collect();
let levels: Vec<CompressionLevel> = matches
.get_many("compression-level")
.map(|vs| vs.copied().map(|i: i64| CompressionLevel(i)).collect())
.unwrap_or_default();
let levels_indices: Vec<usize> = matches
.indices_of("compression-level")
.map(|is| is.collect())
.unwrap_or_default();
let mut levels: VecDeque<(CompressionLevel, usize)> =
levels.into_iter().zip(levels_indices).collect();
let modes: Vec<Mode> = matches
.get_many("mode")
.map(|vs| vs.copied().collect())
.unwrap_or_default();
let modes_indices: Vec<usize> = matches
.indices_of("mode")
.map(|is| is.collect())
.unwrap_or_default();
let mut modes: VecDeque<(Mode, usize)> = modes.into_iter().zip(modes_indices).collect();
let large_files: Vec<bool> = matches
.get_many("large-file")
.map(|vs| vs.copied().collect())
.unwrap_or_default();
let large_files_indices: Vec<usize> = matches
.indices_of("large-file")
.map(|is| is.collect())
.unwrap_or_default();
let mut large_files: VecDeque<(bool, usize)> =
large_files.into_iter().zip(large_files_indices).collect();
let names: Vec<String> = matches
.get_many("name")
.map(|vs| vs.cloned().collect())
.unwrap_or_default();
let names_indices: Vec<usize> = matches
.indices_of("name")
.map(|is| is.collect())
.unwrap_or_default();
let mut names: VecDeque<(String, usize)> = names.into_iter().zip(names_indices).collect();
let mut directories_indices: VecDeque<usize> = matches
.indices_of("directory")
.map(|is| is.collect())
.unwrap_or_default();
let mut symlinks_indices: VecDeque<usize> = matches
.indices_of("symlink")
.map(|is| is.collect())
.unwrap_or_default();
let immediates: Vec<OsString> = matches
.get_many("immediate")
.map(|vs| vs.cloned().collect())
.unwrap_or_default();
let immediates_indices: Vec<usize> = matches
.indices_of("immediate")
.map(|is| is.collect())
.unwrap_or_default();
let mut immediates: VecDeque<(OsString, usize)> =
immediates.into_iter().zip(immediates_indices).collect();
let file_paths: Vec<PathBuf> = matches
.get_many("file")
.map(|vs| vs.cloned().collect())
.unwrap_or_default();
let file_paths_indices: Vec<usize> = matches
.indices_of("file")
.map(|is| is.collect())
.unwrap_or_default();
let mut file_paths: VecDeque<(PathBuf, usize)> =
file_paths.into_iter().zip(file_paths_indices).collect();
let recursive_dir_paths: Vec<PathBuf> = matches
.get_many("recursive-directory")
.map(|vs| vs.cloned().collect())
.unwrap_or_default();
let recursive_dir_paths_indices: Vec<usize> = matches
.indices_of("recursive-directory")
.map(|is| is.collect())
.unwrap_or_default();
let mut recursive_dir_paths: VecDeque<(PathBuf, usize)> = recursive_dir_paths
.into_iter()
.zip(recursive_dir_paths_indices)
.collect();
/* (2) Map each arg back to its original order by popping the minimum index entry off its
* queue. */
let mut args: Vec<CompressionArg> = Vec::new();
enum ArgType {
Method,
Level,
Mode,
LargeFile,
Name,
Dir,
Sym,
Imm,
FilePath,
RecDirPath,
}
let mut min_index: usize = usize::MAX;
let mut min_arg_type: Option<ArgType> = None;
while !methods.is_empty()
|| !levels.is_empty()
|| !modes.is_empty()
|| !large_files.is_empty()
|| !names.is_empty()
|| !directories_indices.is_empty()
|| !symlinks_indices.is_empty()
|| !immediates.is_empty()
|| !file_paths.is_empty()
|| !recursive_dir_paths.is_empty()
{
if let Some((_, i)) = methods.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::Method);
}
}
if let Some((_, i)) = levels.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::Level);
}
}
if let Some((_, i)) = modes.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::Mode);
}
}
if let Some((_, i)) = large_files.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::LargeFile);
}
}
if let Some((_, i)) = names.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::Name);
}
}
if let Some(i) = directories_indices.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::Dir);
}
}
if let Some(i) = symlinks_indices.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::Sym);
}
}
if let Some((_, i)) = immediates.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::Imm);
}
}
if let Some((_, i)) = file_paths.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::FilePath);
}
}
if let Some((_, i)) = recursive_dir_paths.front() {
if *i < min_index {
min_index = *i;
min_arg_type = Some(ArgType::RecDirPath);
}
}
assert_ne!(min_index, usize::MAX);
let new_arg = match min_arg_type.take().unwrap() {
ArgType::Method => {
CompressionArg::CompressionMethod(methods.pop_front().unwrap().0)
}
ArgType::Level => CompressionArg::Level(levels.pop_front().unwrap().0),
ArgType::Mode => CompressionArg::Mode(modes.pop_front().unwrap().0),
ArgType::LargeFile => CompressionArg::LargeFile(large_files.pop_front().unwrap().0),
ArgType::Name => CompressionArg::Name(names.pop_front().unwrap().0),
ArgType::Dir => {
directories_indices.pop_front().unwrap();
CompressionArg::Dir
}
ArgType::Sym => {
symlinks_indices.pop_front().unwrap();
CompressionArg::Symlink
}
ArgType::Imm => CompressionArg::Immediate(immediates.pop_front().unwrap().0),
ArgType::FilePath => CompressionArg::FilePath(file_paths.pop_front().unwrap().0),
ArgType::RecDirPath => {
CompressionArg::RecursiveDirPath(recursive_dir_paths.pop_front().unwrap().0)
}
};
args.push(new_arg);
min_index = usize::MAX;
}
/* (3) Collect positional arguments. */
let positional_paths: Vec<PathBuf> = matches
.get_many("positional-path")
.map(|vs| vs.cloned().collect())
.unwrap_or_default();
Ok(Self {
allow_stdout,
output_path,
args,
positional_paths,
})
}
; that is a 200-line function).

Especially if our goal here is more about providing an interface to the zip library rather than maintaining a carefully-crafted CLI that users employ constantly in day-to-day work, I think we may be within our rights to consider parsing args outside of clap. I may try to see how easy it is to drop clap from this and whether it feels like the right idea.

@cosmicexplorer
Copy link
Contributor Author

Ugh, I actually really like the help text generation mechanism for clap and I don't think binary size is that much of a concern for us at all. I think we definitely have an excuse to go outside of clap if we feel like it but I would prefer to work on other things right now.

@Pr0methean
Copy link
Member

Pr0methean commented Aug 14, 2024

Have you looked at the other parsers listed here? https://rust-cli-recommendations.sunshowers.io/cli-parser.html#alternatives-to-clap (I'm not sure what they mean when they say argh "follows Fuchsia OS conventions"; does --help on a Fuchsia OS command delegate to man the way the Git subcommands do?)

@Pr0methean
Copy link
Member

Pr0methean commented Aug 14, 2024

It also occurs to me that it might be helpful to gate the dependency on clap on a default-but-optional feature; for example, if clite was being deployed for shell-script use to a server that nobody could SSH into without beeping someone's pager and getting the thirty-third degree -- as is true for most prod servers at AWS -- then the help text wouldn't matter.

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Aug 14, 2024

Thanks so much for that discussion of use cases for this binary!! It is a new day and I think it is actually not a bad idea to try parsing args by hand! Will report back ^_^!

EDIT: a clap feature is a great idea, I'll try that so we can more easily compare the UX of the result!

cli/src/compress.rs Fixed Show fixed Hide fixed
cli/src/compress.rs Fixed Show fixed Hide fixed
@cosmicexplorer
Copy link
Contributor Author

Seems pretty clear that clap is unnecessary, and I've removed it (the parsing is also much easier to follow, since clap requires immense circumlocutions to support...an ordered sequence of args).

I'd also like to note: I'm currently depending on color-eyre in zip-cli and just plain eyre in zip-clite. Right now, the sizes are 1.4M and 477k in --release mode:

> cd cli
> ls target/release/zip-cli
-rwxr-xr-x 2 cosmicexplorer wheel 1.4M Aug 14 14:07 target/release/zip-cli*
> cd clite
> ls target/release/zip-clite 
-rwxr-xr-x 2 cosmicexplorer wheel 477k Aug 14 14:07 target/release/zip-clite*

For iterating on this, color-eyre is vaguely useful, but I don't think we'll actually need a backtrace here, since the error cases are either going to be argument parsing or i/o errors, where we should be providing a formatted error message and exiting with a specific error code instead of printing a backtrace. As with the decision to drop clap, we're not making a big hulking command-line application with so many flags and failure modes we can't remember them all. I think that for every usage of ? in the app right now, we should be able to provide a specific and useful error message. For i/o errors especially, we don't actually get any benefit from eyre/anyhow, since they don't capture the path of the file that produced the error for us. So we need to manually handle those errors anyway to produce a good error message. I don't think this will be incredibly difficult or time-consuming.

@cosmicexplorer
Copy link
Contributor Author

Hm, I tried removing eyre and color-eyre respectively, and actually only got 1.4M and 465k respectively for cargo build --release (so zip-cli stayed the same size, and zip-clite only became 12kb smaller). So what I might do then is to keep them in, and make use of WrapErr to provide context along with the error messages, so we can still get that context we need to make good error messages--but no reason to remove the nice error handling/propagation crates if it doesn't help our binary size (and better error reporting upon failure seems useful enough to keep, and eyre seems safe and widely used).

err,
"writing immediate symlink entry with name {name:?} and target {target:?}"
)?;
/* TODO: .add_symlink() should support OsString targets! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Aug 14, 2024

Ok, after doing a lot of .wrap_err_with() calls, we're now up to 1.7M for zip-cli and 780k for zip-clite. Error messages look like this:

> cargo run -- -v compress -o out.zip -n a -d -n b -s -i c -m 755 -f tmp/file2.txt
writing compressed zip to output file path "out.zip"
default zip entry options: FileOptions { compression_method: Deflated, compression_level: None, last_modified_time: DateTime::default(), permissions: None, large_file: false, encrypt_with: None, extended_options: (), alignment: 1, zopfli_buffer_size: Some(32768) }
setting name of next entry to "a"
writing dir entry
setting name of next entry to "b"
setting symlink flag for next entry
writing immediate symlink entry with name "b" and target "c"
setting file mode 0o755
writing file entry from path "tmp/file2.txt" with name "tmp/file2.txt"
Error: 
   0: tmp/file2.txt
   1: No such file or directory (os error 2)

Location:
   /home/cosmicexplorer/tools/zip/cli/src/compress.rs:286

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

I was able to mimic this somewhat without the eyre crate, and it did work (although it required a lot more boilerplate, which is somewhat of a risk in itself due to poor readability), so let me know how you feel about error crates like eyre.

Note that invalid cli args are processed without any error crate, and we explicitly call process::exit() ourselves:

> cargo run -- -v compress -o out.zip -n a -d -n b -s -s -i c -m 755 -f tmp/file2.txt
writing compressed zip to output file path "out.zip"
default zip entry options: FileOptions { compression_method: Deflated, compression_level: None, last_modified_time: DateTime::from_date_and_time(2024, 8, 14, 20, 2, 54)?, permissions: None, large_file: false, encrypt_with: None, extended_options: (), alignment: 1, zopfli_buffer_size: Some(32768) }
setting name of next entry to "a"
writing dir entry
setting name of next entry to "b"
setting symlink flag for next entry
setting symlink flag for next entry
error: symlink flag provided twice before entry

Usage: target/debug/zip-cli compress [-h|--help] [OUTPUT-FLAG] [ENTRIES]... [--] [PATH]...

This is what the compress command help output looks like. Note that the possible values for -c/--compression-method are gated based on the selected features:

> cargo run -- compress --help
do a compress

Usage: target/debug/zip-cli compress [-h|--help] [OUTPUT-FLAG] [ENTRIES]... [--] [PATH]...

  -h, --help    Print help

Output flags:
Where and how to write the generated zip archive.
  -o, --output-file <file>
          Output zip file path to write.
          The output file is currently always truncated if it already exists.
          If not provided, output is written to stdout.

      --stdout
          Allow writing output to stdout even if stdout is a tty.

ENTRIES:
After at most one output flag is provided, the rest of the command line is attributes and
entry data. Attributes modify later entries.

Sticky attributes:
These flags apply to everything that comes after them until reset by another instance of the
same attribute. Sticky attributes continue to apply to positional arguments received after
processing all flags.

  -c, --compression-method <method-name>
          Which compression technique to use.
          Defaults to deflate if not specified.

          Possible values:
          - stored:    uncompressed
          - deflate:   with deflate (default)
          - deflate64: with deflate64
          - bzip2:     with bzip2
          - zstd:      with zstd

  -l, --compression-level <level>
          How much compression to perform, from 0..=24.
          The accepted range of values differs for each technique.

  -m, --mode <mode>
          Unix permissions to apply to the file, in octal (like chmod).

      --large-file [true|false]
          Whether to enable large file support.
          This may take up more space for records, but allows files over 32 bits in length to be
          written, up to 64 bit sizes.

Non-sticky attributes:
These flags only apply to the next entry after them, and may not be repeated.

  -n, --name <name>
          The name to apply to the entry.

  -s, --symlink
          Make the next entry into a symlink entry.
          A symlink entry may be immediate with -i, or it may read the symlink value from the
          filesystem with -f.

Entry data:
Each of these flags creates an entry in the output zip archive.

  -d, --dir
          Create a directory entry.
          A name must be provided beforehand with -n.

  -i, --imm <immediate>
          Write an entry containing this data.
          A name must be provided beforehand with -n.

  -f, --file <path>
          Write an entry with the contents of this file path.
          A name may be provided beforehand with -n, otherwise the name will be inferred from
          relativizing the given path to the working directory.

  -r, --recursive-dir <dir>
          Write all the recursive contents of this directory path.
          A name may be provided beforehand with -n, which will be used as the prefix for all
          recursive contents of this directory. Otherwise, the name will be inferred from
          relativizing the given path to the working directory.

Positional entries:
  [PATH]...
          Write the file or recursive directory contents, relativizing the path.
          If the given path points to a file, then a single file entry will be written.
          If the given path is a symlink, then a single symlink entry will be written.
          If the given path refers to a directory, then the recursive contents will be written.

@Pr0methean
Copy link
Member

Pr0methean commented Aug 14, 2024

Cool! Can --large-file possibly be set automatically when the input isn't a pipe or socket or open for writing?

@cosmicexplorer
Copy link
Contributor Author

Cool! Can --large-file possibly be set automatically when the input isn't a pipe or socket or open for writing?

Yes!!! (In fact it currently doesn't support anything except regular files, directories, and symlinks, although it definitely could). We can keep the existing "sticky" behavior for explicit --large-file [true|false] flags.

src/write.rs Dismissed Show dismissed Hide dismissed
src/write.rs Dismissed Show dismissed Hide dismissed
cli/src/args.rs Fixed Show fixed Hide fixed
cli/src/args.rs Fixed Show fixed Hide fixed
cli/src/args.rs Fixed Show fixed Hide fixed
@cosmicexplorer
Copy link
Contributor Author

Hmmmm. I took another look over this and I actually think what's here is very useful, but that it's difficult to understand how to correctly use any of the commands because the --help output provides exactly one view of the help text instead of having a brief vs detailed output. I also think structured (json) input would be very very useful for programmatic invocation, instead of having to translate to a command line. I have an idea of how to achieve this, and am working on that now.

The result should be that:

  • simple stuff is simple
  • command input is decoupled from argument parsing
  • --help output has a simple and a detailed view

#[derive(Debug)]
pub enum ArgParseError {
StdoutMessage(String),
/* FIXME: give these errors much more structure!! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
const ZSTD_HELP_LINE: &'static str = "";
}

/* TODO: add support for entry comments! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
}

/* TODO: add support for entry comments! */
/* TODO: add support for merging/transforming other zips!! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
How much compression to perform, from 0..=24.

The accepted range of values differs for each technique.
TODO: how much???

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
CompressionArg::Symlink => {
/* writeln!(err, "setting symlink flag for next entry").unwrap(); */
if symlink_flag {
/* TODO: make this a warning? */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
ZipCommand::Info(info) => info.do_main(err),
ZipCommand::Extract(extract) => extract.do_main(err),
ZipCommand::Compress(compress) => compress.do_main(err),
/* TODO: ZipCommand::Crawl! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants