Summarization Pipeline CLI (#2)
* 🔊 add logging setup util

Signed-off-by: Peter <[email protected]>

* 🔥 rm skeleton

Signed-off-by: Peter <[email protected]>

* add MWE

Signed-off-by: Peter <[email protected]>

* ✨ add GPU util fns

Signed-off-by: Peter <[email protected]>

* 🚧 base sumtext fn

Signed-off-by: Peter <[email protected]>

* ✨ add booksum postprocessor

Signed-off-by: Peter <[email protected]>

* ✨ mem comp fn

Signed-off-by: Peter <[email protected]>

* 🚧 wip imports

Signed-off-by: Peter <[email protected]>

* 🚧 📝 docs & color print

Signed-off-by: Peter <[email protected]>

* 🚧 basic argparse

Signed-off-by: Peter <[email protected]>

* ✨ 🚧 filesum fns

Signed-off-by: Peter <[email protected]>

* 📝 re-init

Signed-off-by: Peter <[email protected]>

* 🚧 update model loading

Signed-off-by: Peter <[email protected]>

* ✨ util to save parms

Signed-off-by: Peter <[email protected]>

* 🚧 ✨ build out basic cli script

Signed-off-by: Peter <[email protected]>

* 🚚rename

Signed-off-by: Peter <[email protected]>

* wip debug

* 🚨 🎨 fix format and reimports

Signed-off-by: peter szemraj <[email protected]>

* ⚰️ rm dead code

Signed-off-by: peter szemraj <[email protected]>

* 🚧 adjust imports

Signed-off-by: peter szemraj <[email protected]>

* 🐛 fix CLI fn name bug

Signed-off-by: peter szemraj <[email protected]>

* 🐛 fix argparse bugs

Signed-off-by: peter szemraj <[email protected]>

* 🚧 📝 improve argparse, CLI

Signed-off-by: peter szemraj <[email protected]>

* 🚸 standardize CLI (basic)

Signed-off-by: peter szemraj <[email protected]>

* 🎨 update api

Signed-off-by: peter szemraj <[email protected]>

* 🐛 naming bug

Signed-off-by: peter szemraj <[email protected]>

* 🔇 update/quieten logs

Signed-off-by: peter szemraj <[email protected]>

* 📝 docs

Signed-off-by: peter szemraj <[email protected]>

* details

Signed-off-by: peter szemraj <[email protected]>

* 📝 easier install

Signed-off-by: peter szemraj <[email protected]>

Signed-off-by: Peter <[email protected]>
Signed-off-by: peter szemraj <[email protected]>
pszemraj authored Jan 16, 2023
1 parent 0beb6c5 commit aa63119
Showing 10 changed files with 668 additions and 236 deletions.
6 changes: 0 additions & 6 deletions CHANGELOG.md
@@ -1,7 +1 @@
# Changelog
-
-## Version 0.1 (development)
-
-- Feature A added
-- FIX: nasty bug #1729 fixed
-- add your changes here!
54 changes: 47 additions & 7 deletions README.md
@@ -14,37 +14,77 @@

> utility for using transformers summarization models on text docs
-A continuation of the [document summarization](<https://huggingface.co/spaces/pszemraj/document-summarization>) space on huggingface.
+An extension/generalization of the [document summarization](<https://huggingface.co/spaces/pszemraj/document-summarization>) space on huggingface. The purpose of this package is to provide a simple interface for using summarization models on text documents of arbitrary length.

⚠️ **WARNING**: _This package is a WIP and is not ready for production use. Some things may not work yet._ ⚠️

## Installation

Install the package using pip:

```bash
-pip install -e .
+# create a virtual environment (optional)
+pip install git+https://github.com/pszemraj/textsum.git
```

The textsum package is now installed in your virtual environment. You can now use the CLI or UI demo (see [Usage](#usage)).

### Full Installation _(PDF OCR, gradio UI demo)_

To install all the dependencies _(includes PDF OCR, gradio UI demo)_, run:

```bash
git clone https://github.com/pszemraj/textsum.git
cd textsum
# create a virtual environment (optional)
pip install -e .[all]
```

## Usage

### CLI

To summarize a directory of text files, run the following command:

```bash
textsum-dir /path/to/dir
```

The following options are available:

```
usage: textsum-dir [-h] [-o OUTPUT_DIR] [-m MODEL_NAME] [-batch BATCH_LENGTH] [-stride BATCH_STRIDE] [-nb NUM_BEAMS]
[-l2 LENGTH_PENALTY] [-r2 REPETITION_PENALTY] [--no_cuda] [-length_ratio MAX_LENGTH_RATIO] [-ml MIN_LENGTH]
[-enc_ngram ENCODER_NO_REPEAT_NGRAM_SIZE] [-dec_ngram NO_REPEAT_NGRAM_SIZE] [--no_early_stopping] [--shuffle]
[--lowercase] [-v] [-vv] [-lf LOGFILE]
input_dir
```

For more information, run:

```bash
textsum-dir --help
```

### UI Demo

-Simply run the following command to start the UI demo:
+For convenience, a UI demo is provided using [gradio](https://gradio.app/). To run the demo, run the following command:

```bash
-ts-ui
+textsum-ui
```

-Other args to be added soon
+This is currently a minimal demo, but it will be expanded in the future to accept other arguments and options.

---

## Roadmap

- [ ] add argparse CLI for UI demo
-- [ ] add CLI for summarization of all text files in a directory
-- [ ] API for summarization of text docs
+- [x] add CLI for summarization of all text files in a directory
+- [ ] python API for summarization of text docs
- [ ] optimum inference integration
- [ ] better documentation, details on improving performance (speed, quality, memory usage, etc.)

and other things I haven't thought of yet

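The usage string in the README hunk above lists the flags `textsum-dir` accepts. As a minimal sketch of driving the CLI from Python, the helper below assembles an argument list using only flags that appear in that usage string (`-o`, `-m`, `-nb`, `--no_cuda`). The helper name `build_textsum_dir_cmd` and the example values are assumptions for illustration; they are not part of the package.

```python
import shlex
from typing import List, Optional


def build_textsum_dir_cmd(
    input_dir: str,
    output_dir: Optional[str] = None,
    model_name: Optional[str] = None,
    num_beams: Optional[int] = None,
    no_cuda: bool = False,
) -> List[str]:
    """Assemble an argv list for `textsum-dir` (hypothetical helper)."""
    cmd = ["textsum-dir"]
    if output_dir is not None:
        cmd += ["-o", output_dir]  # -o OUTPUT_DIR
    if model_name is not None:
        cmd += ["-m", model_name]  # -m MODEL_NAME
    if num_beams is not None:
        cmd += ["-nb", str(num_beams)]  # -nb NUM_BEAMS
    if no_cuda:
        cmd.append("--no_cuda")
    cmd.append(input_dir)  # positional input_dir goes last
    return cmd


cmd = build_textsum_dir_cmd(
    "/path/to/dir", output_dir="summaries", num_beams=4, no_cuda=True
)
# Show the shell-quoted command line that would be executed
print(shlex.join(cmd))

# To actually run it (requires textsum to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell-quoting problems when directory names contain spaces.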
3 changes: 2 additions & 1 deletion setup.cfg
@@ -85,7 +85,8 @@ testing =
[options.entry_points]
# Add here console scripts like:
console_scripts =
-    ts-ui = textsum.app:run
+    textsum-ui = textsum.app:run
+    textsum-dir = textsum.cli:run
# For example:
# console_scripts =
# fibonacci = textsum.skeleton:run
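The `console_scripts` entries in this hunk use the `name = module:function` form: installing the package creates an executable (for example `textsum-dir`) that imports the module and calls the named function. A minimal sketch of that resolution step, using only the standard library, follows. `resolve_entry_point` is a hypothetical helper, not part of textsum or setuptools, and `json:dumps` stands in for `textsum.cli:run` so the example runs without the package installed.

```python
from importlib import import_module


def resolve_entry_point(spec: str):
    """Resolve a 'module:attr' string to the object it names (hypothetical helper)."""
    module_name, _, attr = spec.partition(":")
    obj = import_module(module_name)
    if attr:
        # Support dotted attributes such as 'pkg.mod:Class.method'
        for part in attr.split("."):
            obj = getattr(obj, part)
    return obj


# Stand-in for resolve_entry_point("textsum.cli:run")
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```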
6 changes: 6 additions & 0 deletions src/textsum/__init__.py
@@ -1,5 +1,11 @@
+"""
+textsum - a package for summarizing text
+"""
import sys

+from . import cli, utils

if sys.version_info[:2] >= (3, 8):
    # TODO: Import directly (no need for conditional) when `python_requires = >= 3.8`
    from importlib.metadata import PackageNotFoundError, version  # pragma: no cover
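The hunk is cut off after the conditional import. A conditional like this is commonly followed by a fallback import and a version lookup; the sketch below shows that pattern under stated assumptions. The `importlib_metadata` backport for Python < 3.8, the `get_version` helper, and the `"unknown"` fallback value are illustrative choices and may not match what the file actually contains.

```python
import sys

if sys.version_info[:2] >= (3, 8):
    from importlib.metadata import PackageNotFoundError, version
else:
    # Assumption: the third-party backport is installed on older interpreters
    from importlib_metadata import PackageNotFoundError, version


def get_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or 'unknown' (hypothetical helper)."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Package is not installed, e.g. running from a source checkout
        return "unknown"


print(get_version("textsum"))
```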
8 changes: 6 additions & 2 deletions src/textsum/app.py
@@ -1,3 +1,6 @@
+"""
+app.py - a module to run the text summarization app (gradio interface)
+"""
import contextlib
import logging
import os
@@ -19,7 +22,7 @@

from textsum.pdf2text import convert_PDF_to_Text
from textsum.summarize import load_model_and_tokenizer, summarize_via_tokenbatches
-from textsum.utils import load_example_filenames, saves_summary, truncate_word_count
+from textsum.utils import save_summary, truncate_word_count

_here = Path(__file__).parent

@@ -137,7 +140,7 @@ def proc_submission(
html += ""

    # save to file
-    saved_file = saves_summary(_summaries)
+    saved_file = save_summary(_summaries)

    return html, sum_text_out, scores_out, saved_file

@@ -156,6 +159,7 @@ def load_uploaded_file(file_obj, max_pages=20):
    # file_path = Path(file_obj[0].name)

    # check if mysterious file object is a list
+    global ocr_model
    if isinstance(file_obj, list):
        file_obj = file_obj[0]
    file_path = Path(file_obj.name)
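The last hunk normalizes an upload that may arrive either as a single file object or as a list of them. A minimal standalone sketch of that step is below. `uploaded_path` is a hypothetical helper, and `SimpleNamespace` stands in for the upload object, which is assumed only to expose a `.name` attribute as the diff implies.

```python
from pathlib import Path
from types import SimpleNamespace


def uploaded_path(file_obj) -> Path:
    """Return the path of an upload, unwrapping a list if needed (hypothetical helper)."""
    if isinstance(file_obj, list):
        # Only the first uploaded file is used, as in the hunk above
        file_obj = file_obj[0]
    return Path(file_obj.name)


single = SimpleNamespace(name="/tmp/report.pdf")
print(uploaded_path(single).suffix)
print(uploaded_path([single]).name)
```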