TextPage#

This class represents text and images shown on a document page. All MuPDF document types are supported.

The usual ways to create a textpage are DisplayList.GetTextPage() and Page.GetTextPage(). Because there is a limited set of methods in this class, there exist wrappers in Page which are handier to use. The last column of this table shows these corresponding Page methods.

For a description of what this class is all about, see Appendix 2.

Method              Description                            Page GetText or Search method
ExtractText()       Extract plain text                     “text”
ExtractBlocks()     Plain text grouped in blocks           “blocks”
ExtractWords()      All words with their bbox              “words”
ExtractHtml()       Page content in HTML format            “html”
ExtractXHtml()      Page content in XHTML format           “xhtml”
ExtractXML()        Page text in XML format                “xml”
ExtractDict()       Page content in PageInfo format        “dict”
ExtractJSON()       Page content in JSON format            “json”
ExtractRAWDict()    Page content in PageInfo format        “rawdict”
ExtractRawJSON()    Page content in JSON format            “rawjson”
Search()            Search for a string in the page        Page.SearchFor()
ExtractSelection()  Extract selection in format of string

Class API

class TextPage#
ExtractText(bool sort: false)#

Return a string of the page’s complete text. The text is Unicode and appears in the same sequence in which it was specified when the document was created.

Parameters:

sort (bool) – Sort the output by vertical, then horizontal coordinates. In many cases, this should suffice to generate a “natural” reading order.

Return type:

string

ExtractBlocks()#

Textpage content as a list of text lines grouped by block. Each list item looks like this:

(x0, y0, x1, y1, "lines in the block", block_no, block_type)

The first four entries are the block’s bbox coordinates, block_type is 1 for an image block, 0 for text. block_no is the block sequence number. Multiple text lines are joined via line breaks.

For an image block, its bbox and a text line with some image meta information is included – not the image content.

This is a high-speed method with just enough information to output plain text in desired reading sequence.

Return type:

list of TextBlock
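
As an illustration of how the block tuples can be put into a reading sequence, here is a minimal sketch. It assumes a hypothetical BlockEntry record that mirrors the tuple layout shown above (it is not the library's TextBlock type), and it assumes that ordering by the top edge (y0) and then the left edge (x0) is an acceptable approximation of "vertical, then horizontal" sorting.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one entry of the block list (not the library's TextBlock type).
// Field order follows the tuple layout: bbox, text, block number, block type.
public record BlockEntry(float X0, float Y0, float X1, float Y1, string Text, int BlockNo, int BlockType);

public static class BlockSorting
{
    // Orders text blocks top-to-bottom, then left-to-right, and joins their text
    // with line breaks. Image blocks (BlockType == 1) are skipped.
    public static string ToReadingOrder(IEnumerable<BlockEntry> blocks)
    {
        var ordered = blocks
            .Where(b => b.BlockType == 0)
            .OrderBy(b => b.Y0)      // assumption: sort key is the top edge
            .ThenBy(b => b.X0)       // ties are broken by the left edge
            .Select(b => b.Text);
        return string.Join("\n", ordered);
    }
}
```

Multi-column layouts usually need a more elaborate ordering than this simple two-key sort.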

ExtractWords(char[] delimiters: null)#

Textpage content as a list of single words with bbox information. An item of this list looks like this:

(x0, y0, x1, y1, "word", block_no, line_no, word_no)
Parameters:

delimiters (char[]) – Use these characters as additional word separators. By default, all white spaces (including the non-breaking space 0xA0) mark the start and end of a word; this parameter lets you specify further characters that do the same. For instance, the default will return "john.doe@outlook.com" as one word. If you specify '@' and '.' as delimiters, the four words "john", "doe", "outlook", "com" will be returned. Another possible use is ignoring punctuation by passing the punctuation characters as delimiters. The “word” strings will not contain any delimiting character.

This is a high-speed method which e.g. allows extracting text from within given areas or recovering the text reading sequence.

Return type:

list of WordBlock
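
The effect of the delimiters parameter can be illustrated on a plain string. The following is only a sketch of the splitting rule described above, not the library's implementation; the WordSplitting class and its Split method are hypothetical names, and the sketch uses a params array for convenience.

```csharp
using System.Collections.Generic;
using System.Text;

public static class WordSplitting
{
    // Splits text at white space (char.IsWhiteSpace also covers the
    // non-breaking space 0xA0) and at any additional delimiter characters.
    // The returned words never contain a delimiting character.
    public static List<string> Split(string text, params char[] delimiters)
    {
        var extra = new HashSet<char>(delimiters);
        var words = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c) || extra.Contains(c))
            {
                // A separator ends the current word, if there is one.
                if (current.Length > 0)
                {
                    words.Add(current.ToString());
                    current.Clear();
                }
            }
            else
            {
                current.Append(c);
            }
        }
        if (current.Length > 0) words.Add(current.ToString());
        return words;
    }
}
```

With this sketch, WordSplitting.Split("john.doe@outlook.com") yields one word, while WordSplitting.Split("john.doe@outlook.com", '@', '.') yields "john", "doe", "outlook" and "com".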

ExtractHtml()#

Textpage content as a string in HTML format. This version contains complete formatting and positioning information. Images are included (encoded as base64 strings). You need an HTML package to interpret the output. Your internet browser should be able to adequately display this information, but see HTMLQuality.

Return type:

string

ExtractDict(bool sort: false)#

Textpage content as a dictionary. Provides same information detail as HTML. See below for the structure.

Parameters:

sort (bool) – Sort the output by vertical, then horizontal coordinates. In many cases, this should suffice to generate a “natural” reading order.

Return type:

PageInfo

ExtractJSON(bool sort: false)#

Textpage content as a JSON string. Created by JsonConvert.SerializeObject(TextPage.ExtractDict()). It is included for backward compatibility. You will probably only ever use this method to write the result to a file. The method detects binary image data and converts it to base64-encoded strings.

Parameters:

sort (bool) – Sort the output by vertical, then horizontal coordinates. In many cases, this should suffice to generate a “natural” reading order.

Return type:

string

ExtractXHtml()#

Textpage content as a string in XHTML format. Text information detail is comparable with ExtractText(), but also contains images (base64 encoded). This method makes no attempt to re-create the original visual appearance.

Return type:

string

ExtractXML()#

Textpage content as a string in XML format. This contains complete formatting information about every single character on the page: font, size, line, paragraph, location, color, etc. Contains no images. You need an XML package to interpret the output.

Return type:

string

ExtractRAWDict(bool sort: false)#

Textpage content as a dictionary – technically similar to ExtractDict(), and it contains that information as a subset (including any images). It provides additional detail down to each character, which makes using XML obsolete in many cases. See below for the structure.

Parameters:

sort (bool) – Sort the output by vertical, then horizontal coordinates. In many cases, this should suffice to generate a “natural” reading order.

Return type:

PageInfo

ExtractRawJSON(bool sort: false)#

Textpage content as a JSON string. Created by JsonConvert.SerializeObject(TextPage.ExtractRAWDict()). You will probably only ever use this method to write the result to a file. The method detects binary image data and converts it to base64-encoded strings.

Parameters:

sort (bool) – Sort the output by vertical, then horizontal coordinates. In many cases, this should suffice to generate a “natural” reading order.

Return type:

string

Search(string needle, bool quads: false)#

Search for string and return a list of found locations.

Parameters:
  • needle (string) – the string to search for. Upper and lower cases will all match if needle consists of ASCII letters only – it does not yet work for “Ä” versus “ä”, etc.

  • quads (bool) – return quadrilaterals instead of rectangles.

Return type:

list of Quad

Returns:

a list of Rect or Quad objects, each surrounding a found needle occurrence. As the search string may contain spaces, its parts may be found on different lines. In this case, more than one rectangle (or quadrilateral) is returned. The method supports dehyphenation, so it will find e.g. “method” even if it was hyphenated into the two parts “meth-” and “od” across two lines. The two returned rectangles will contain “meth” (no hyphen) and “od”.

Note

Overview of changes in v1.18.2:

  1. The hitMax parameter has been removed: all hits are always returned.

  2. The Rect parameter of the TextPage is now respected: only text inside this area is examined. Only characters with fully contained bboxes are considered. The wrapper method Page.SearchFor() correspondingly supports a clip parameter.

  3. Hyphenated words are now found.

  4. Overlapping rectangles in the same line are now automatically joined. We assume that such separations are an artifact created by multiple marked content groups, containing parts of the same search needle.

Example of Quad versus Rect: when searching for the needle “pymupdf”, the corresponding entry will be either the blue rectangle or, if quads was specified, the quad Quad(ul, ur, ll, lr).

../_images/img-quads.jpg
ExtractSelection(Point a, Point b)#

Extract the text of the selection bounded by two points.

Parameters:
  • a (Point) – Start point.

  • b (Point) – End point.

Return type:

string

Rect#

The rectangle associated with the text page. This either equals the rectangle of the creating page or the clip parameter of Page.GetTextPage() and text extraction / searching methods.

Note

The output of text searching and most text extractions is restricted to this rectangle. (X)HTML and XML output will however always extract the full page.

Structure of Outputs#

Methods TextPage.ExtractDict() and TextPage.ExtractRAWDict() return dictionaries containing the page’s text and image content; TextPage.ExtractJSON() and TextPage.ExtractRawJSON() return the same content serialized as JSON strings. The structures of all four outputs are nearly identical. They strive to map the text page’s information hierarchy of blocks, lines, spans and characters as precisely as possible, by representing each of these by its own sub-dictionary:

  • A page consists of a list of blocks.

  • A (text) block consists of a list of line dictionaries.

  • A line consists of a list of span dictionaries.

  • A span either consists of the text itself or, for the RAW variants, a list of character dictionaries.

  • RAW variants: a character is a dictionary of its origin, bbox and unicode.
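
To make the hierarchy concrete, the following sketch walks blocks, lines and spans to rebuild plain text. The Sketch* classes are hypothetical stand-ins that carry only the members needed here; they are not the library's own types, whose exact shape should be checked against the structure tables in this section.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins mirroring the hierarchy page > blocks > lines > spans.
public class SketchSpan { public string Text = ""; }
public class SketchLine { public List<SketchSpan> Spans = new(); }
public class SketchBlock { public int Type; public List<SketchLine> Lines = new(); }
public class SketchPage { public List<SketchBlock> Blocks = new(); }

public static class HierarchyWalk
{
    // Concatenates the span texts of each line and emits one output line per
    // text line. Image blocks (Type == 1) have no lines and are skipped.
    public static string PlainText(SketchPage page)
    {
        var lines = page.Blocks
            .Where(b => b.Type == 0)
            .SelectMany(b => b.Lines)
            .Select(l => string.Concat(l.Spans.Select(s => s.Text)));
        return string.Join("\n", lines);
    }
}
```

The same nesting applies to the RAW variants, with one more level (characters) below each span.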

Please note that only bboxes (= Rect 4-tuples) are returned, whereas a TextPage actually has the full position information in Quad format. The reason for this decision is memory consumption: a Quad needs 488 bytes (3 times the size of a Rect). Given the number of bboxes generated, returning Quad information would have a significant impact.

In the vast majority of cases, we are dealing with horizontal text only, where bboxes provide entirely sufficient information.

Quad information is only ever needed if the text is not written horizontally, that is, if the line’s Dir differs from (1, 0), and you need the quad for text marker annotations (Page.AddHighlightAnnot() and friends).

../_images/img-textpage.png

PageInfo Structure#

Key      Value
Width    width of the clip rectangle (float)
Height   height of the clip rectangle (float)
Blocks   list of Block structure

Block Structure#

Block dictionaries come in two different formats for image blocks and for text blocks.

Image block:

Key         Value
Type        1 = image (int)
Bbox        image bbox on page (Rect)
Number      block count (int)
Ext         image type (string), as file extension
Width       original image width (int)
Height      original image height (int)
ColorSpace  colorspace component count (int)
Xres        resolution in x-direction (int)
Yres        resolution in y-direction (int)
Bpc         bits per component (int)
Transform   matrix transforming image rect to bbox (Matrix)
Size        size of the image in bytes (int)
Image       image content (byte[])

Possible values of the “Ext” key are “bmp”, “gif”, “jpeg”, “jpx” (JPEG 2000), “jxr” (JPEG XR), “png”, “pnm”, and “tiff”.

Note

  1. An image block is generated for each and every image occurrence on the page. Hence there may be duplicates if an image is shown at different locations.

  2. TextPage and corresponding method Page.GetText() are available for all document types. Only for PDF documents, methods Document.GetPageImages() / Page.GetImages() offer some overlapping functionality as far as image lists are concerned. But both lists may or may not contain the same items. Any differences are most probably caused by one of the following:

  3. The image’s “transformation matrix” is defined as the matrix for which the expression bbox / transform == Rect(0, 0, 1, 1) is true. For details, see ImageTransformation.

Text Block:

Key      Value
Type     0 = text (int)
Bbox     block rectangle, Rect
Number   block count (int)
Lines    list of text line structure

Line Structure#

Key      Value
Bbox     line rectangle, Rect
WMode    writing mode (int): 0 = horizontal, 1 = vertical
Dir      writing direction, Point
Spans    list of span dictionaries

The value of key “Dir” is the unit vector dir = (cosine, -sine) of the angle that the text has relative to the x-axis [2]. See the following picture: the word in each quadrant (counter-clockwise from top-right to bottom-right) is rotated by 30, 120, 210 and 300 degrees, respectively.

../_images/img-line-dir.png
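
The relation between the angle and the direction vector can be sketched as follows. This is plain trigonometry based on the definition dir = (cosine, -sine); the LineDirection class and its member names are hypothetical helpers, not part of the library.

```csharp
using System;

public static class LineDirection
{
    // Unit vector (cosine, -sine) for text at the given angle relative to the x-axis.
    public static (double X, double Y) FromAngle(double degrees)
    {
        double rad = degrees * Math.PI / 180.0;
        return (Math.Cos(rad), -Math.Sin(rad));
    }

    // Inverse operation: recovers the angle in degrees, normalized to [0, 360).
    public static double ToAngle(double dirX, double dirY)
    {
        double deg = Math.Atan2(-dirY, dirX) * 180.0 / Math.PI;
        return (deg % 360.0 + 360.0) % 360.0;
    }

    // Horizontal text has the direction (1, 0), within a small tolerance.
    public static bool IsHorizontal(double dirX, double dirY, double eps = 1e-6)
        => Math.Abs(dirX - 1.0) < eps && Math.Abs(dirY) < eps;
}
```

For a rotation of 30 degrees, FromAngle returns approximately (0.866, -0.5), and ToAngle maps that vector back to 30.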

Span Structure#

Spans contain the actual text. A line contains more than one span only if it contains text with different font properties.

Key      Value
Bbox     span rectangle, Rect
Origin   the first character’s origin, Point
Font     font name (string)
Asc      ascender of the font (float)
Desc     descender of the font (float)
Size     font size (float)
Flags    font characteristics (int)
Color    text color in sRGB format (int)
Text     (only for ExtractDict()) text (string)
Chars    (only for ExtractRAWDict()) list of character dictionaries

../_images/img-asc-desc.png

These numbers may be used to compute the minimum height of a character (or span) – as opposed to the standard height provided in the “bbox” values (which actually represents the line height). The following code recalculates the span bbox to have a height of fontSize exactly fitting the text inside:

    float a = span.Asc;                  // ascender of the font
    float d = span.Desc;                 // descender of the font
    Rect r = new Rect(span.Bbox);        // copy of the span rectangle
    Point o = new Point(span.Origin);    // its y-value is the baseline
    r.y1 = o.y - span.Size * d / (a - d);
    r.y0 = r.y1 - span.Size;

Caution

The above calculation may deliver a larger height! This may happen, for example, for OCRed documents, where the risk of text artifacts is high. MuPDF tries to come up with a reasonable bbox height, independently of the fontSize found in the PDF. So please ensure that the height of span.Bbox is larger than span.Size.

The following shows the original span rectangle in red and the rectangle with re-computed height in blue.

../_images/img-span-rect.png

“Flags” is an integer whose bits represent font properties, except for bit 0. They are to be interpreted like this:

  • bit 0: superscripted (2^0) – not a font property, detected by MuPDF code.

  • bit 1: italic (2^1)

  • bit 2: serifed (2^2)

  • bit 3: monospaced (2^3)

  • bit 4: bold (2^4)

Bits 1 through 4 are font properties, i.e. encoded in the font program. Please note that this information is not necessarily correct or complete: fonts quite often contain wrong data here.
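
The bit layout listed above can be decoded with a flags enum. This is a minimal sketch; the SpanFontFlags enum and the SpanFlags helper are hypothetical names introduced for illustration and are not provided by the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical enum mirroring the bit positions listed above.
[Flags]
public enum SpanFontFlags
{
    None = 0,
    Superscript = 1 << 0,  // detected by MuPDF, not a font property
    Italic = 1 << 1,
    Serifed = 1 << 2,
    Monospaced = 1 << 3,
    Bold = 1 << 4,
}

public static class SpanFlags
{
    // Returns the names of all set bits, comma-separated, in bit order.
    public static string Describe(int flags)
    {
        var f = (SpanFontFlags)flags;
        var parts = new List<string>();
        if (f.HasFlag(SpanFontFlags.Superscript)) parts.Add("superscript");
        if (f.HasFlag(SpanFontFlags.Italic)) parts.Add("italic");
        if (f.HasFlag(SpanFontFlags.Serifed)) parts.Add("serifed");
        if (f.HasFlag(SpanFontFlags.Monospaced)) parts.Add("monospaced");
        if (f.HasFlag(SpanFontFlags.Bold)) parts.Add("bold");
        return string.Join(", ", parts);
    }
}
```

For example, a flags value of 20 (16 + 4) has bits 4 and 2 set, so SpanFlags.Describe(20) reports "serifed, bold".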

Character Structure for ExtractRAWDict()#

Key      Value
Origin   character’s left baseline point, Point
Bbox     character rectangle, Rect
C        the character (unicode)

This image shows the relationship between a character’s bbox and its quad: textpagechar

Footnotes