docs: add bricks training notebook (#211)
* added bricks notebook

* more unicode quotes; isd dataframe column fix

* fix remove_punctuation docs

* typo fixes

* put staging bricks in code
MthwRobinson authored Feb 10, 2023
1 parent d0c6d50 commit f890972
Showing 8 changed files with 767 additions and 8 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,10 +1,11 @@
-## 0.4.7-dev3
+## 0.4.7-dev4

 * Added the ability to pull an HTML document from a url in `partition_html`.
 * Added the ability to get file summary info from lists of filenames and lists
   of file contents.
 * Added optional page break to `partition` for `.pptx`, `.pdf`, images, and `.html` files.
 * Added `to_dict` method to document elements.
+* Include more unicode quotes in `replace_unicode_quotes`.

 ## 0.4.6

5 changes: 1 addition & 4 deletions docs/source/bricks.rst
@@ -587,10 +587,7 @@ Examples:
 from unstructured.cleaners.core import remove_punctuation
 # Returns "A lovely quote"
-replace_unicode_characters("“A lovely quote!”")
-# Returns ""
-replace_unicode_characters("'()[]{};:'\",.?/\\-_")
+remove_punctuation("“A lovely quote!”")
 ``clean_prefix``
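
Review note: the corrected `remove_punctuation` docs example in this hunk can be approximated without installing the library. The sketch below is not the `unstructured` implementation. It assumes that "punctuation" means any character whose Unicode category starts with `P`, and the name `strip_punctuation` is hypothetical.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with "P".

    Hypothetical stand-in for remove_punctuation; the real function may differ.
    """
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


# Curly quotes (categories Pi/Pf) and "!" (category Po) all count as punctuation.
cleaned = strip_punctuation("“A lovely quote!”")
print(cleaned)  # A lovely quote
```

Under that assumption the docs comment `# Returns "A lovely quote"` holds, since both curly quotes and the exclamation mark are removed.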
Binary file added example-docs/layout-parser-paper-fast.jpg
741 changes: 741 additions & 0 deletions examples/training/1-Intro to Bricks.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion test_unstructured/staging/test_base_staging.py
@@ -48,7 +48,7 @@ def test_convert_to_isd_csv(output_csv_file):
         isd_csv_string = base.convert_to_isd_csv(elements)
         csv_file.write(isd_csv_string)

-    fieldnames = ["type", "text", "coordinates", "element_id"]
+    fieldnames = ["type", "text"]
     with open(output_csv_file, "r") as csv_file:
         csv_rows = csv.DictReader(csv_file)
         assert all(set(row.keys()) == set(fieldnames) for row in csv_rows)
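
Review note: the assertion in this test relies on `csv.DictReader` taking its keys from the header row. A minimal sketch of that check follows, using an in-memory string instead of the `output_csv_file` fixture. The sample CSV content is made up for illustration.

```python
import csv
import io

# Hypothetical CSV content with the two-column header the updated test expects.
sample_csv = "type,text\r\nTitle,Title 1\r\nNarrativeText,Narrative 1\r\n"

fieldnames = ["type", "text"]
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Each row is a dict keyed by the header row, so its keys must equal fieldnames.
header_matches = all(set(row.keys()) == set(fieldnames) for row in rows)
print(header_matches)  # True
```

Note that `all(...)` over an empty reader is vacuously true, so a stricter test might also assert that at least one row was read.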
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.4.7-dev3" # pragma: no cover
+__version__ = "0.4.7-dev4" # pragma: no cover
20 changes: 20 additions & 0 deletions unstructured/cleaners/core.py
@@ -55,11 +55,31 @@ def replace_unicode_quotes(text) -> str:
     -------
     \x93What a lovely quote!\x94 -> “What a lovely quote!”
     """
+    # NOTE(robinson) - We should probably make this something more sane like a regex
+    # instead of a whole big series of replaces
     text = text.replace("\x91", "‘")
     text = text.replace("\x92", "’")
     text = text.replace("\x93", "“")
     text = text.replace("\x94", "”")
     text = text.replace("&apos;", "'")
+    text = text.replace("â\x80\x99", "'")
+    text = text.replace("â\x80“", "—")
+    text = text.replace("â\x80”", "–")
+    text = text.replace("â\x80˜", "‘")
+    text = text.replace("â\x80¦", "…")
+    text = text.replace("â\x80™", "’")
+    text = text.replace("â\x80œ", "“")
+    text = text.replace("â\x80?", "”")
+    text = text.replace("â\x80ť", "”")
+    text = text.replace("â\x80ś", "“")
+    text = text.replace("â\x80¨", "—")
+    text = text.replace("â\x80ł", "″")
+    text = text.replace("â\x80Ž", "")
+    text = text.replace("â\x80‚", "")
+    text = text.replace("â\x80‰", "")
+    text = text.replace("â\x80‹", "")
+    text = text.replace("â\x80", "")
+    text = text.replace("â\x80s'", "")
     return text
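
Review note: the `NOTE(robinson)` comment in this hunk suggests a regex instead of a long series of replaces. One possible shape, as a sketch only, is a lookup table plus a single compiled alternation. The names `QUOTE_REPLACEMENTS` and `replace_unicode_quotes_regex` are hypothetical, the table is a small subset of the pairs in the hunk, and a single pass is not guaranteed to match the output of sequential replaces in every edge case.

```python
import re

# Hypothetical subset of the replacement pairs from the hunk above.
QUOTE_REPLACEMENTS = {
    "\x91": "‘",
    "\x92": "’",
    "\x93": "“",
    "\x94": "”",
    "&apos;": "'",
    "â\x80\x99": "'",
    "â\x80¦": "…",
    "â\x80": "",
}

# Sort longest keys first so "â\x80\x99" is tried before its prefix "â\x80".
_PATTERN = re.compile(
    "|".join(
        re.escape(key)
        for key in sorted(QUOTE_REPLACEMENTS, key=len, reverse=True)
    )
)


def replace_unicode_quotes_regex(text: str) -> str:
    """Replace every known sequence in one pass over the text."""
    return _PATTERN.sub(lambda match: QUOTE_REPLACEMENTS[match.group(0)], text)


print(replace_unicode_quotes_regex("\x93What a lovely quote!\x94"))
```

The longest-first ordering matters because regex alternation takes the first branch that matches, and the bare `"â\x80"` entry is a prefix of the other mojibake sequences.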


2 changes: 1 addition & 1 deletion unstructured/staging/base.py
@@ -38,7 +38,7 @@ def convert_to_isd_csv(elements: List[Text]) -> str:
     Returns the representation of document elements as an Initial Structured Document (ISD)
     in CSV Format.
     """
-    csv_fieldnames: List[str] = ["type", "text", "coordinates", "element_id"]
+    csv_fieldnames: List[str] = ["type", "text"]
     rows: List[Dict[str, str]] = convert_to_isd(elements)
     with io.StringIO() as buffer:
         csv_writer = csv.DictWriter(buffer, fieldnames=csv_fieldnames)
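
Review note: `csv.DictWriter` writes only the columns named in `fieldnames`, and by default it raises `ValueError` when a row carries a key outside that list. A minimal sketch of that behavior with made-up rows follows. `rows_to_csv` is a hypothetical helper, not part of `unstructured`, and whether the real function writes a header the same way is an assumption.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize dict rows to a CSV string with a header row."""
    with io.StringIO() as buffer:
        writer = csv.DictWriter(buffer, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()


# Hypothetical rows shaped like the two-column ISD dicts in the hunk above.
rows = [
    {"type": "Title", "text": "Title 1"},
    {"type": "NarrativeText", "text": "Narrative 1"},
]
csv_string = rows_to_csv(rows, ["type", "text"])
print(csv_string)

# A row with a key outside fieldnames is rejected by default.
try:
    rows_to_csv([{"type": "Title", "text": "x", "element_id": "abc"}], ["type", "text"])
    extra_key_rejected = False
except ValueError:
    extra_key_rejected = True
```

Passing `extrasaction="ignore"` to `csv.DictWriter` would drop unknown keys silently instead of raising, which is a different trade-off from keeping `fieldnames` in sync with the row dicts.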