ADD script to create a simplified version of hocr-files #152

Open
wants to merge 15 commits into
base: master
FIX char encoding, ADD remove-empty-contents which removes empty contents or only containing whitespaces,.
JKamlah authored and JKamlah committed Aug 6, 2019
commit be4bb771f2bdf0852bbf26f1fc9ddf482375fe59
23 changes: 13 additions & 10 deletions hocr-simplify
@@ -15,29 +15,30 @@ parser = argparse.ArgumentParser(
     description=('change level of typesetting and/or'
                  'remove properties to create'
                  'a simplified hocr-version'))
-properties = ['baseline', 'bbox', 'cflow', 'cuts', 'hardbreak', 'image',
+properties = {'baseline', 'bbox', 'cflow', 'cuts', 'hardbreak', 'image',
               'imagemd5', 'lpageno', 'ppageno', 'nlp', 'order', 'poly',
               'scan_res', 'textangle', 'x_booxes', 'x_font', 'x_fsize',
-              'x_confs', 'x_scanner', 'x_source', 'x_wconf']
+              'x_confs', 'x_scanner', 'x_source', 'x_wconf'}

Collaborator

It would also be nice to have an option to delete the id and/or dir attributes, though these are standalone attributes rather than properties.

Author

Removing attributes is now implemented.
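
As a rough illustration of what removing attributes such as id or dir can look like, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than the lxml API the script itself relies on, and the helper name remove_attributes and the sample markup are made up for this example; the actual implementation in the PR may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a tiny hOCR-like fragment (not taken from the PR).
SAMPLE = (
    '<div class="ocr_page" id="page_1" dir="ltr" title="bbox 0 0 100 100">'
    '<span class="ocr_line" id="line_1" title="bbox 0 0 100 20">Hello</span>'
    '</div>'
)


def remove_attributes(root, names):
    """Delete the given attribute names from root and all of its descendants."""
    for node in root.iter():
        for name in names:
            # pop() with a default avoids a KeyError when the attribute is absent
            node.attrib.pop(name, None)
    return root


root = ET.fromstring(SAMPLE)
remove_attributes(root, ["id", "dir"])
print(ET.tostring(root, encoding="unicode"))
```

The title attribute, which carries the hOCR properties, is left untouched here; only the attributes named explicitly are dropped.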

 parser.add_argument('file', nargs='?', default=sys.stdin)
 parser.add_argument('-t', '--typesetting', type=str,
                     choices=['glyph', 'word', 'line', 'par', 'carea', 'page'],
Collaborator

Does the glyph choice do anything for simplification? I haven't seen an hOCR example with an element inside an ocr-glyph.

Author

I thought I would need it to remove char choices, but I've implemented that elsewhere, so I removed the "glyph" typesetting option.

                     help='Maximum level of typesetting')
 parser.add_argument('-a', '--remove-attributes', nargs='+',
                     help='Removes attributes, e.g. id')
+parser.add_argument('-e', '--remove-empty-contents', action='store_true',
+                    help='Removes contents which are empty or contains whitespaces only')
 parser.add_argument('-p', '--remove-properties', nargs='+',
                     help='List of properties: {}'.format(','.join(properties)))
 parser.add_argument('-c', '--remove-choices', action='store_true',
                     help='Removes alternatives (only for tesseract outputs)')
 parser.add_argument('fileout', nargs='?',
                     help="Output path, default: print to terminal")
 parser.add_argument('-v', '--verbose',
                     action='store_true', help='Verbose, default: %(default)s')

 args = parser.parse_args()

-doc = html.parse(args.file)
+with open(args.file,"r",encoding="utf-8") as f:
+    doc = html.parse(f)

 # change level of typesetting
 if args.typesetting:
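
One thing to watch in the hunk above: the positional file argument still defaults to sys.stdin, and open() does not accept a stream object, so the new open(args.file, ...) call would presumably fail when no file is given. Below is a hedged sketch of one way to accept both a path and a stream. The helper name read_input is hypothetical and not part of the PR.

```python
import io


def read_input(source, encoding="utf-8"):
    """Return the text of `source`, which is either a path or an open stream."""
    if hasattr(source, "read"):
        # Already a stream (e.g. sys.stdin): read from it directly.
        return source.read()
    # Otherwise treat it as a filesystem path and decode explicitly.
    with open(source, "r", encoding=encoding) as f:
        return f.read()


# Usage with an in-memory stream standing in for sys.stdin
text = read_input(io.StringIO("<html><body>hOCR</body></html>"))
print(text)
```

The returned string could then be handed to whichever parser entry point accepts text; that step is left out because it depends on the lxml call the script ends up using.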
@@ -50,6 +51,7 @@ if args.typesetting:
     # update meta content
     for node in doc.xpath("//*[@name='ocr-capabilities']"):
         content = node.get("content")
+        if content is None: continue
         if args.typesetting in content:
             node.set("content", content.split(args.typesetting)[0] + args.typesetting)
             if args.verbose:
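
To make the effect of the split on ocr-capabilities concrete, here is a small sketch. The helper name truncate_capabilities and the sample capability string are assumptions for illustration; the function only mirrors the expression used in the hunk above, including the new guard for a missing content attribute.

```python
def truncate_capabilities(content, level):
    """Keep everything up to and including the first occurrence of `level`."""
    if content is None or level not in content:
        # Nothing to do: missing attribute or level not listed.
        return content
    return content.split(level)[0] + level


# Hypothetical capability list, ordered from coarse to fine
caps = "ocr_page ocr_carea ocr_par ocr_line ocrx_word"
print(truncate_capabilities(caps, "ocr_line"))
```

Because this is a plain substring split, it relies on the capabilities being listed from coarse to fine; anything after the chosen level is discarded regardless of what it is.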
@@ -59,10 +61,11 @@ if args.typesetting:
     for node in doc.xpath("//*[@class='{}']".format(args.typesetting)):
         if args.verbose:
             print(re.sub(r'\s+', '\x20', node.text_content()).strip())
-        if args.remove_choices or "glyph" in args.typesetting:
-            node.text = node.text_content().split(" ")[0].strip()
-        else:
-            node.text = node.text_content().strip()
+        text_content = node.text_content()
+        if args.remove_empty and text_content.strip() == "":
+            node.getparent().remove(node)
+            continue
+        node.text = "\n".join([text.strip() for text in text_content.splitlines() if text.strip() != ""])
         for child in list(node):
             node.remove(child)
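
A minimal, self-contained sketch of the empty-content filter and line normalization from the hunk above. It uses xml.etree.ElementTree in place of lxml, so text_content() and getparent() are only emulated, and the helper names normalize and simplify are hypothetical. Note also that argparse stores --remove-empty-contents as args.remove_empty_contents, so the args.remove_empty lookup in the hunk would likely raise an AttributeError unless a dest is set elsewhere.

```python
import argparse
import xml.etree.ElementTree as ET

parser = argparse.ArgumentParser()
parser.add_argument('-e', '--remove-empty-contents', action='store_true')
args = parser.parse_args(['-e'])  # argv passed explicitly for the example


def normalize(text):
    """Strip each line and drop lines that are empty after stripping."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def simplify(root, cls, remove_empty):
    """Flatten every element with class `cls` to normalized text."""
    # ElementTree has no getparent(), so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for node in list(root.iter()):
        if node.get("class") != cls:
            continue
        text = "".join(node.itertext())  # rough stand-in for lxml's text_content()
        if remove_empty and text.strip() == "" and node in parents:
            parents[node].remove(node)
            continue
        node.text = normalize(text)
        for child in list(node):
            node.remove(child)
    return root


root = ET.fromstring(
    '<div><span class="ocr_line"><b> Hello </b>\n  world </span>'
    '<span class="ocr_line">   </span></div>'
)
simplify(root, "ocr_line", args.remove_empty_contents)
print(ET.tostring(root, encoding="unicode"))
```

In this sample the whitespace-only span is dropped and the remaining span keeps one stripped line per original line of text.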

@@ -93,5 +96,5 @@ else:
os.makedirs(os.path.dirname(args.fileout))

# write new hocr file
-with open(args.fileout, "w") as f:
+with open(args.fileout, "w", encoding="utf-8") as f:
     f.writelines(etree.tostring(doc, pretty_print=True,encoding=str))
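
As a side note on the write step: etree.tostring(..., encoding=str) returns a single string, so writelines ends up iterating over it character by character, and write does the same job more directly. Below is a hedged sketch with a hypothetical helper name write_output, which also tolerates an output path without a directory part; it is not code from the PR.

```python
import os
import tempfile


def write_output(path, text, encoding="utf-8"):
    """Write `text` to `path`, creating the parent directory if needed."""
    directory = os.path.dirname(path)
    if directory:
        # exist_ok avoids an error when the directory is already there;
        # an empty dirname (bare filename) must not be passed to makedirs.
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


# Usage: write into a not-yet-existing subdirectory of a temp dir
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out", "page.hocr")
    write_output(target, "<html>hOCR</html>\n")
    print(os.path.exists(target))
```

Passing the encoding explicitly, as the commit does, keeps the output independent of the platform's default locale.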
Empty file added test/testdata/kraken.hocr
Empty file.