Document htmlbox #2895

Merged
merged 1 commit into from
Dec 14, 2023
Binary file removed docs/images/img-encoding.jpg
Binary file not shown.
Binary file added docs/images/img-htmlbox1.png
Binary file added docs/images/img-htmlbox2.png
Binary file added docs/images/img-htmlbox3.png
Binary file modified docs/images/img-rotate.png
Binary file removed docs/images/img-textbox.jpg
Binary file not shown.
63 changes: 62 additions & 1 deletion docs/page.rst
@@ -96,6 +96,7 @@ In a nutshell, this is what you can do with PyMuPDF:
:meth:`Page.insert_image` PDF only: insert an image
:meth:`Page.insert_link` PDF only: insert a link
:meth:`Page.insert_text` PDF only: insert text
:meth:`Page.insert_htmlbox` PDF only: insert html text in a rectangle
:meth:`Page.insert_textbox` PDF only: insert a text box
:meth:`Page.links` return a generator of the links on the page
:meth:`Page.load_annot` PDF only: load a specific annotation
@@ -407,7 +408,7 @@ In a nutshell, this is what you can do with PyMuPDF:
* This can be used to create watermark images: on a temporary PDF page create a stamp annotation with a low opacity value, make a pixmap from it with *alpha=True* (and potentially also rotate it), discard the temporary PDF page and use the pixmap with :meth:`insert_image` for your target PDF.


.. image :: images/img-stampannot.*
.. image:: images/img-stampannot.*
:scale: 80

.. method:: add_widget(widget)
@@ -625,6 +626,66 @@ In a nutshell, this is what you can do with PyMuPDF:

PDF only: Insert text into the specified :data:`rect_like` *rect*. See :meth:`Shape.insert_textbox`.

.. index::
pair: rect; insert_htmlbox
pair: text; insert_htmlbox
pair: css; insert_htmlbox
pair: adjust; insert_htmlbox
pair: archive; insert_htmlbox
pair: overlay; insert_htmlbox
pair: rotate; insert_htmlbox
pair: oc; insert_htmlbox
pair: morph; insert_htmlbox

.. method:: insert_htmlbox(rect, text, *, css=None, scale_low=0, archive=None, rotate=0, oc=0, overlay=True)

* New in v1.23.8

PDF only. Insert text into the specified rectangle. The method has similarities with methods :meth:`Page.insert_textbox` and :meth:`TextWriter.fill_textbox`, but is **much more powerful**. This is achieved by letting a :ref:`Story` object do all the required processing.

* Parameter `text` may be a string as in the other methods. But it will be **interpreted as HTML source** and may therefore also contain HTML language elements -- including styling. The `css` parameter may be used to pass in additional styling instructions.

* Automatic line breaks are inserted at word boundaries. The "soft hyphen" character `"&#173;"` can be used to cause hyphenation and thus also cause line breaks. **Forced** line breaks however are only achievable via the HTML tag `<br>` - `"\\n"` is ignored and will be treated like a space.

* With this method the following can be achieved:

- Styling effects like bold, italic, text color, text alignment, font size or font switching.
- The text may include arbitrary languages -- **including right-to-left** languages.
- Scripts like `Devanagari <https://en.wikipedia.org/wiki/Devanagari>`_ and several others in Asia have a highly complex system of ligatures, where two or more Unicode characters together yield one glyph. The Story uses the software package `HarfBuzz <https://harfbuzz.github.io/>`_ to deal with these things and produce correct output.
- One can also **include images** via HTML tag `<img>` -- the Story will take care of the appropriate layout. This is an alternative option to insert images, compared to :meth:`Page.insert_image`.
- HTML tables (tag `<table>`) may be included in the text and will be handled appropriately.
- Links are automatically generated when present.

* If content does not fit in the rectangle, the developer has two choices:

- **either** be just informed (and accept a no-op) by setting `scale_low=1`,
- **or** (`scale_low=0` - the default) scale down the content until it fits.
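
Because `"\\n"` is treated like a space, plain text that relies on newlines has to be converted before it is passed as `text`. The following is a minimal sketch of one way to do that, assuming the input is plain text that should appear literally. The helper names `plain_to_html` and `insert_plain` are hypothetical and not part of PyMuPDF:

```python
import html


def plain_to_html(text):
    """Escape HTML special characters and turn newlines into <br> tags."""
    escaped = html.escape(text, quote=False)  # "&", "<", ">" become entities
    return escaped.replace("\n", "<br>")


def insert_plain(page, rect, text, **kwargs):
    """Insert plain text via Page.insert_htmlbox, keeping its line breaks.

    `page` is expected to be a PDF page of a PyMuPDF version that offers
    insert_htmlbox (v1.23.8 or later according to this documentation).
    """
    return page.insert_htmlbox(rect, plain_to_html(text), **kwargs)
```

Escaping first matters: otherwise a literal `<` in the plain text would be read as the start of an HTML tag.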

:arg rect_like rect: rectangle on page to receive the text.
:arg str,Story text: the text to be written. Can contain plain text and HTML tags with styling instructions. Alternatively, a :ref:`Story` object may be specified (in which case the internal Story generation step will be omitted). A Story must have been generated with all required styling and Archive information.
:arg str css: optional string containing additional CSS instructions. Ignored if `text` is a Story.
:arg float scale_low: if necessary scale down the content until it fits in the target rectangle. This sets the down scaling limit. Default is 0, no limit. A value of 1 means no down-scaling. A value of e.g. 0.2 means maximum down-scaling by 80%.
:arg Archive archive: an Archive object that points to locations where to find images or non-standard fonts. If `text` refers to images, this parameter is always required. Ignored if `text` is a Story.
:arg int rotate: one of the values 0, 90, 180, 270. Depending on this, text will be filled:

- 0: top-left to bottom-right.
- 90: bottom-left to top-right.
- 180: bottom-right to top-left.
- 270: top-right to bottom-left.

.. image:: images/img-rotate.*

:arg int oc: the xref of an :data:`OCG` / :data:`OCMD` or 0. Please refer to :meth:`Page.show_pdf_page` for details.
:arg bool overlay: put the text in front of other content. Please refer to :meth:`Page.show_pdf_page` for details.

:returns: A tuple of floats (spare_height, scale).

- `spare_height`: -1 if content did not fit, else >= 0. It is the height of the unused (still available) rectangle stripe. Positive only if scale = 1 (no down-scaling happened).
- `scale`: down-scaling factor, 0 < scale <= 1.

Please refer to examples in this section of the recipes: :ref:`RecipesText_I_c`.
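
The returned tuple can be evaluated along the following lines. This is only a sketch based on the description of `spare_height` and `scale` above; the function names `describe_fit` and `insert_with_limit` are hypothetical, and the default limit of 0.5 is an arbitrary choice for illustration:

```python
def describe_fit(result):
    """Turn the (spare_height, scale) tuple into a short message."""
    spare_height, scale = result
    if spare_height < 0:
        # -1 signals that the content did not fit (nothing was written)
        return "content did not fit"
    if scale < 1:
        return f"content fits after scaling to {scale:.0%}"
    return f"content fits unscaled, {spare_height:.1f} pt left"


def insert_with_limit(page, rect, html_text, scale_low=0.5):
    """Insert HTML text, allowing down-scaling only as far as scale_low."""
    result = page.insert_htmlbox(rect, html_text, scale_low=scale_low)
    return describe_fit(result)
```

A caller could use the message (or the raw tuple) to decide whether to retry with a larger rectangle or a lower `scale_low`.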


.. index::
pair: closePath; draw_line
pair: color; draw_line
215 changes: 156 additions & 59 deletions docs/recipes-text.rst
@@ -380,89 +380,186 @@ How to Fill a Text Box
This script fills 4 different rectangles with text, each time choosing a different rotation value::

import fitz
doc = fitz.open(...) # new or existing PDF

doc = fitz.open() # new or existing PDF
page = doc.new_page() # new page, or choose doc[n]
r1 = fitz.Rect(50,100,100,150) # a 50x50 rectangle
disp = fitz.Rect(55, 0, 55, 0) # add this to get more rects
r2 = r1 + disp # 2nd rect
r3 = r1 + disp * 2 # 3rd rect
r4 = r1 + disp * 3 # 4th rect
t1 = "text with rotate = 0." # the texts we will put in

# write in this overall area
rect = fitz.Rect(100, 100, 300, 150)

# partition the area in 4 equal sub-rectangles
CELLS = fitz.make_table(rect, cols=4, rows=1)

t1 = "text with rotate = 0." # the texts we will write
t2 = "text with rotate = 90."
t3 = "text with rotate = -90."
t4 = "text with rotate = 180."
red = (1,0,0) # some colors
gold = (1,1,0)
blue = (0,0,1)
"""We use a Shape object (something like a canvas) to output the text and
t3 = "text with rotate = 180."
t4 = "text with rotate = 270."
text = [t1, t2, t3, t4]
red = fitz.pdfcolor["red"] # some colors
gold = fitz.pdfcolor["gold"]
blue = fitz.pdfcolor["blue"]
"""
We use a Shape object (something like a canvas) to output the text and
the rectangles surrounding it for demonstration.
"""
shape = page.new_shape() # create Shape
shape.draw_rect(r1) # draw rectangles
shape.draw_rect(r2) # giving them
shape.draw_rect(r3) # a yellow background
shape.draw_rect(r4) # and a red border
shape.finish(width = 0.3, color = red, fill = gold)
# Now insert text in the rectangles. Font "Helvetica" will be used
# by default. A return code rc < 0 indicates insufficient space (not checked here).
rc = shape.insert_textbox(r1, t1, color = blue)
rc = shape.insert_textbox(r2, t2, color = blue, rotate = 90)
rc = shape.insert_textbox(r3, t3, color = blue, rotate = -90)
rc = shape.insert_textbox(r4, t4, color = blue, rotate = 180)
shape.commit() # write all stuff to page /Contents
doc.save("...")

Several default values were used above: font "Helvetica", font size 11 and text alignment "left". The result will look like this:

.. image:: images/img-textbox.*
for i in range(len(CELLS[0])):
shape.draw_rect(CELLS[0][i]) # draw rectangle
shape.insert_textbox(
CELLS[0][i], text[i], fontname="hebo", color=blue, rotate=90 * i
)

shape.finish(width=0.3, color=red, fill=gold)

shape.commit() # write all stuff to the page
doc.ez_save(__file__.replace(".py", ".pdf"))

Some default values were used above: font size 11 and text alignment "left". The result will look like this:

.. image:: images/img-rotate.*
:scale: 50
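
To make the geometry of the script above more tangible, here is a plain-Python sketch of what partitioning an area into equal columns amounts to, together with the rotation value each cell receives. It only illustrates the idea and is not how `fitz.make_table` is implemented; the function name `split_columns` is hypothetical and rectangles are modelled as simple tuples:

```python
def split_columns(rect, cols):
    """Split (x0, y0, x1, y1) into `cols` equally wide cells, left to right."""
    x0, y0, x1, y1 = rect
    width = (x1 - x0) / cols
    return [(x0 + i * width, y0, x0 + (i + 1) * width, y1) for i in range(cols)]


# same overall area as in the script above
cells = split_columns((100, 100, 300, 150), 4)

# one rotation per cell, matching "rotate=90 * i"
rotations = [90 * i for i in range(len(cells))]
```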

------------------------------------------

.. _RecipesText_I_c:

How to Use Non-Standard Encoding
How to Fill a Box with HTML Text
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Since v1.14, MuPDF allows Greek and Russian encoding variants for the :data:`Base14_Fonts`. In PyMuPDF this is supported via an additional *encoding* argument. Effectively, this is relevant for Helvetica, Times-Roman and Courier (and their bold / italic forms) and **characters outside the ASCII code range only**. ASCII characters remain Latin!
Method :meth:`Page.insert_htmlbox` offers a **much more powerful** way to insert text in a rectangle. Instead of simple, plain text, this method accepts HTML source, which may not only contain HTML tags but also styling instructions to influence things like font, font weight (bold) and style (italic), color and much more. It is also possible to mix multiple fonts and languages, output HTML tables and insert images. Any URI links are also supported.

.. note:: Please keep in mind that the Base-14 fonts only support characters with `ord(c) < 256`. The `encoding` parameter does not change that. So only characters with `ord(c) > 128` are under the influence of `encoding`.

To avoid these restrictions, we strongly recommend using the file-based font variants, which are available via the :ref:`Font` class. These fonts do not require (and ignore) the encoding parameter. Your text can also be any mixture of standard Latin, Cyrillic, Greek and other characters. `fitz.Font("helv")` for example supports 654 glyphs - not just 256. The only consideration is that your PDF file size will grow because now a font file will be embedded.

Choosing any font from `pymupdf-fonts <https://pypi.org/project/pymupdf-fonts/>`_ will provide you with the best of all worlds: nice and rich fonts that are also subsettable via :meth:`Document.subset_fonts()`. This limits your file sizes significantly. `fitz.Font("figo")` for example supports 4577 glyphs. But still, after using :meth:`Document.subset_fonts()`, the file size increase will probably be something like 10 or 12 KB -- and not 43 KB as with `fitz.Font("helv")`.
For even more styling flexibility, an additional CSS source may also be given.

Here is how to request Russian encoding with the standard font Helvetica::
The method is based on the :ref:`Story` class. Therefore, complex script systems like Devanagari, Nepali, Tamil and many others are supported and written correctly thanks to the HarfBuzz library, which provides this feature, called *"text shaping"*.

page.insert_text(point, russian_text, encoding=fitz.TEXT_ENCODING_CYRILLIC)
Any required fonts to output characters are automatically pulled in from the Google NOTO font library - as a fallback when the optionally supplied user font(s) do not contain some glyphs.

The valid encoding values are TEXT_ENCODING_LATIN (0), TEXT_ENCODING_GREEK (1), and TEXT_ENCODING_CYRILLIC (2, Russian) with Latin being the default. Encoding can be specified by all relevant font and text insertion methods.
As a small glimpse into the features offered here, we will output the following HTML-enriched text::

By the above statement, the fontname *helv* is automatically connected to the Russian font variant of Helvetica. Any subsequent text insertion with **this fontname** will use the Russian Helvetica encoding.
import fitz

If you change the fontname just slightly, you can also achieve an **encoding "mixture"** for the **same base font** on the same page::

import fitz
doc=fitz.open()
rect = fitz.Rect(100, 100, 400, 300)

text = """Lorem ipsum dolor sit amet, consectetur adipisici elit, sed
eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation <b>ullamco <i>laboris</i></b>
nisi ut aliquid ex ea commodi consequat. Quis aute iure
<span style="color: #f00;">reprehenderit</span>
in <span style="color: #0f0;font-weight:bold;">voluptate</span> velit
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat
cupiditat non proident, sunt in culpa qui
<a href="https://www.artifex.com">officia</a> deserunt mollit anim id
est laborum."""

doc = fitz.Document()

page = doc.new_page()
shape = page.new_shape()
t="Sômé tèxt wìth nöñ-Lâtîn characterß."
shape.insert_text((50,70), t, fontname="helv", encoding=fitz.TEXT_ENCODING_LATIN)
shape.insert_text((50,90), t, fontname="HElv", encoding=fitz.TEXT_ENCODING_GREEK)
shape.insert_text((50,110), t, fontname="HELV", encoding=fitz.TEXT_ENCODING_CYRILLIC)
shape.commit()
doc.save("t.pdf")
page.insert_htmlbox(rect, text, css="* {font-family: sans-serif;font-size:14px;}")

doc.ez_save(__file__.replace(".py", ".pdf"))

Please note how the "css" parameter is used to globally select the default "sans-serif" font and a font size of 14.

The result will look like this:

.. image:: images/img-htmlbox1.*

Here is another example that outputs a table with this method. This time, we are including all the styling in the HTML source itself. Note also how an image can be included, even within a table cell::

import fitz
import os

filedir = os.path.dirname(__file__)


text = """
<style>
body {
font-family: sans-serif;
}

td,
th {
border: 1px solid blue;
border-right: none;
border-bottom: none;
padding: 5px;
text-align: center;
}

table {
border-right: 1px solid blue;
border-bottom: 1px solid blue;
border-spacing: 0;
}
</style>

<body>
<p><b>Some Colors</b></p>
<table>
<tr>
<th>Lime</th>
<th>Lemon</th>
<th>Image</th>
<th>Mauve</th>
</tr>
<tr>
<td>Green</td>
<td>Yellow</td>
<td><img src="img-cake.png" width=50></td>
<td>Between<br>Gray and Purple</td>
</tr>
</table>
</body>
"""

The result:
doc = fitz.Document()

.. image:: images/img-encoding.*
:scale: 50
page = doc.new_page()
rect = page.rect + (36, 36, -36, -36)

# we must specify an Archive because of the image
page.insert_htmlbox(rect, text, archive=fitz.Archive("."))

doc.ez_save(__file__.replace(".py", ".pdf"))



The result will look like this:

.. image:: images/img-htmlbox2.*
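
When the table data comes from a program rather than a hand-written string, the HTML source can be generated first and then handed to :meth:`Page.insert_htmlbox`. A minimal sketch, assuming all cells are plain text; the function name `html_table` is hypothetical and not part of PyMuPDF:

```python
import html


def html_table(header, rows):
    """Build a <table> string; every cell value is escaped as plain text."""

    def row(cells, tag):
        inner = "".join(f"<{tag}>{html.escape(str(c))}</{tag}>" for c in cells)
        return f"<tr>{inner}</tr>"

    body = [row(header, "th")] + [row(r, "td") for r in rows]
    return "<table>" + "".join(body) + "</table>"


table = html_table(["Lime", "Lemon"], [["Green", "Yellow"]])
```

Because every cell is escaped, this sketch cannot place an `<img>` tag inside a cell as the example above does; such cells would need to be passed through unescaped.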


Our third example will demonstrate the automatic multi-language support that also includes text shaping for complex scripting systems like Devanagari and right-to-left languages::

import fitz

greetings = (
"Hello, World!", # english
"Hallo, Welt!", # german
"سلام دنیا!", # persian
"வணக்கம், உலகம்!", # tamil
"สวัสดีชาวโลก!", # thai
"Привіт Світ!", # ukrainian
"שלום עולם!", # hebrew
"ওহে বিশ্ব!", # bengali
"你好世界!", # chinese
"こんにちは世界!", # japanese
"안녕하세요, 월드!", # korean
"नमस्कार, विश्व !", # sanskrit
"हैलो वर्ल्ड!", # hindi
)
doc = fitz.open()
page = doc.new_page()
rect = (50, 50, 200, 500)
text = " ... ".join(greetings)

The snippet above indeed leads to three different copies of the Helvetica font in the PDF. Each copy is uniquely identified (and referenceable) by using the correct upper-lower case spelling of the reserved word "helv"::
# writing all of the above takes a single call:
page.insert_htmlbox(rect, text)
doc.save(__file__.replace(".py", ".pdf"))

for f in doc.get_page_fonts(0): print(f)
And this is the output:

[6, 'n/a', 'Type1', 'Helvetica', 'helv', 'WinAnsiEncoding']
[7, 'n/a', 'Type1', 'Helvetica', 'HElv', 'WinAnsiEncoding']
[8, 'n/a', 'Type1', 'Helvetica', 'HELV', 'WinAnsiEncoding']
.. image:: images/img-htmlbox3.*
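
The example joins all greetings into one running text. If each greeting should instead start on its own line, the strings can be wrapped in HTML paragraphs, since newline characters alone would be treated like spaces. A small sketch under that assumption; the function name `as_paragraphs` is hypothetical:

```python
import html


def as_paragraphs(lines):
    """Wrap each string in its own <p> element, escaping special characters."""
    return "".join(f"<p>{html.escape(line)}</p>" for line in lines)


text = as_paragraphs(["Hello, World!", "Hallo, Welt!"])
```

The resulting string can be passed as the `text` argument in the same way as in the example above.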

.. include:: footer.rst