pdf2docx.text.TextBlock module

Text block objects based on PDF raw dict extracted with PyMuPDF.

Data structure based on this link:

{
    # raw dict
    # --------------------------------
    'type': 0,
    'bbox': (x0,y0,x1,y1),
    'lines': [ lines ],

    # introduced dict
    # --------------------------------
    'before_space': bs,
    'after_space': as,
    'line_space': ls,

    'alignment': 0,
    'left_space': 10.0,
    'right_space': 0.0,

    'tab_stops': [15.4, 35.0]
}
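
As an illustration of the structure above, the following sketch fills the placeholders with made-up values and separates the raw keys from the introduced ones. Only the key names come from the structure; the values and the helper `split_block_dict` are hypothetical and not part of pdf2docx.

```python
# Hypothetical example values; only the key names are taken from the structure above.
raw_block = {
    # raw dict (as extracted with PyMuPDF)
    'type': 0,                            # PyMuPDF uses 0 for text blocks
    'bbox': (72.0, 90.0, 523.0, 130.0),   # (x0, y0, x1, y1)
    'lines': [],                          # line dicts would go here
    # introduced dict (layout attributes added during parsing)
    'before_space': 6.0,
    'after_space': 0.0,
    'line_space': 1.05,
    'alignment': 0,
    'left_space': 10.0,
    'right_space': 0.0,
    'tab_stops': [15.4, 35.0],
}

RAW_KEYS = ('type', 'bbox', 'lines')


def split_block_dict(block):
    """Split a block dict into its raw part and its introduced part."""
    raw = {k: v for k, v in block.items() if k in RAW_KEYS}
    introduced = {k: v for k, v in block.items() if k not in RAW_KEYS}
    return raw, introduced


raw, introduced = split_block_dict(raw_block)
print(sorted(introduced))
```
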
class pdf2docx.text.TextBlock.TextBlock(raw: Optional[dict] = None)

Bases: Block

Text block.

add(line_or_lines)

Add line or lines to TextBlock.

property average_row_gap

Average distance between two adjacent physical rows.
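
One way to picture this value, as a minimal sketch for horizontal text: subtract the summed row heights from the block height and divide by the number of gaps. The function name and its arguments are hypothetical, and the actual property may compute the gap differently.

```python
def average_row_gap(block_bbox, row_heights):
    """Average vertical gap between adjacent rows of a horizontal text block.

    block_bbox  -- (x0, y0, x1, y1) of the whole block
    row_heights -- height of each physical row, top to bottom
    """
    if len(row_heights) < 2:
        return None  # a single row has no gap
    block_height = block_bbox[3] - block_bbox[1]
    return (block_height - sum(row_heights)) / (len(row_heights) - 1)


# Three rows of height 12 in a block of height 50 leave 14 units for two gaps.
gap = average_row_gap((0.0, 0.0, 200.0, 50.0), [12.0, 12.0, 12.0])
print(gap)  # 7.0
```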

bbox: fitz.Rect

make_docx(p)

Create paragraph for a text block.

Refer to python-docx doc for details on text format:

Args:

p (Paragraph): python-docx paragraph instance.

Note

The left position of the paragraph is set by the paragraph indent rather than by a TAB stop.

parse_exact_line_spacing()

Calculate exact line spacing, e.g. spacing = Pt(12).

A PDF text block is laid out as line-space-line-space-line, whereas a docx paragraph places the space before each line, including the first, i.e. space-line-space-line. So, the average line height is space+line. The height of the first line can then be adjusted by updating the paragraph's before-spacing.

Note

Compared with relative spacing mode, exact spacing reproduces the layout more precisely but makes the document less flexible to edit, especially when changing the font size.
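
The arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the method's actual code: the function name and arguments are hypothetical, horizontal text is assumed, and all lengths are in points.

```python
def exact_line_spacing(block_height, first_line_height, row_count, before_space=0.0):
    """Return (line_space, before_space) for exact spacing mode.

    The block spans: first line + (row_count - 1) * (space + line),
    so the average line height space+line follows directly.
    """
    if row_count > 1:
        line_space = (block_height - first_line_height) / (row_count - 1)
    else:
        line_space = block_height
    # In docx the first row also occupies line_space, so compensate the
    # difference to the real first line height through the before-spacing.
    before_space += first_line_height - line_space
    return line_space, before_space


line_space, before_space = exact_line_spacing(40.0, 12.0, 3, before_space=6.0)
print(line_space, before_space)  # 14.0 4.0
```

A negative before-spacing can result when the first line is much shorter than the average; a real implementation would have to handle that case.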

parse_horizontal_spacing(bbox, line_separate_threshold: float, line_break_width_ratio: float, line_break_free_space_ratio: float, lines_left_aligned_threshold: float, lines_right_aligned_threshold: float, lines_center_aligned_threshold: float)

Set horizontal spacing based on lines layout and page bbox.

  • The general spacing is determined by paragraph alignment and indentation.

  • The detailed spacing of block lines is determined by tab stops.

Multiple alignment modes may exist in a block (due to improperly organized lines from PyMuPDF), e.g. some lines align left and others right. In this case, LEFT alignment is set and TAB stops are used to position each line.
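
A toy version of the alignment decision, to make the fallback concrete. It assumes each line is reduced to its horizontal extent `(x0, x1)` and that the three thresholds are maximum deviations in points; the function, its return values, and the default thresholds are hypothetical and much simpler than what the method has to consider.

```python
def detect_alignment(lines, left_threshold=1.0, right_threshold=1.0, center_threshold=2.0):
    """Return (alignment, needs_tab_stops) for a list of (x0, x1) line extents."""
    if len(lines) < 2:
        return 'left', False  # nothing to compare against

    def aligned(values, threshold):
        return max(values) - min(values) <= threshold

    left = aligned([x0 for x0, _ in lines], left_threshold)
    right = aligned([x1 for _, x1 in lines], right_threshold)
    center = aligned([(x0 + x1) / 2 for x0, x1 in lines], center_threshold)

    if left and right:
        return 'justify', False
    if left:
        return 'left', False
    if right:
        return 'right', False
    if center:
        return 'center', False
    # Mixed modes: fall back to LEFT and position each line with TAB stops.
    return 'left', True


print(detect_alignment([(10, 100), (30, 100)]))  # ('right', False)
print(detect_alignment([(10, 50), (30, 100)]))   # ('left', True)
```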

parse_relative_line_spacing()

Calculate relative line spacing, e.g. spacing = 1.02. Relative line spacing is based on standard single line height, which is font-related.

Note

The line spacing updates automatically when the font size changes, whereas in exact spacing mode the layout may break, e.g. lines may overlap.
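
The difference between the two modes can be shown with a small calculation. The sketch assumes a font whose standard single line height is 1.2 times the font size; that factor and both function names are hypothetical, since the real value depends on the font.

```python
def standard_line_height(font_size, factor=1.2):
    # Hypothetical font metric: single line height as a multiple of font size.
    return font_size * factor


def relative_line_spacing(average_line_height, font_size):
    """Line spacing as a multiple of the standard single line height."""
    return average_line_height / standard_line_height(font_size)


spacing = relative_line_spacing(14.688, 12.0)  # roughly 1.02

# Doubling the font size: relative mode scales the row height with the font,
relative_height = spacing * standard_line_height(24.0)
# while exact mode keeps the original row height, so 24 pt glyphs overlap.
exact_height = 14.688
print(relative_height > 24.0, exact_height < 24.0)  # True True
```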

parse_text_format(shapes)

Parse text format with style represented by rectangles.

Args:

shapes (Shapes): Shapes representing potential styles applied on blocks.

plot(page)

Plot block/line/span area for debugging purposes.

Args:

page (fitz.Page): pdf page.

property raw_text

Raw text content in block without considering images.

property row_count

Count of physical rows.

store()

Store attributes in JSON format.

property text

Text content in block. Note that an image is counted as the placeholder <image>.
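
The difference between text and raw_text can be sketched as below. The span representation (a string for text, `None` for an image) and the helper `block_text` are hypothetical simplifications; only the `<image>` placeholder comes from the description above.

```python
IMAGE_PLACEHOLDER = '<image>'


def block_text(spans, with_images=True):
    """Join span contents; spans are str (text) or None (image), in reading order."""
    parts = []
    for span in spans:
        if span is None:
            if with_images:
                parts.append(IMAGE_PLACEHOLDER)
        else:
            parts.append(span)
    return ''.join(parts)


spans = ['Figure ', None, ' caption']
print(block_text(spans))                     # Figure <image> caption
print(block_text(spans, with_images=False))  # raw text, image skipped
```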

property text_direction

All lines contained in a text block must have the same text direction. Otherwise, the normal direction is set.
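
A minimal sketch of this rule, assuming each line reports its direction as a plain string. The direction names and the function are hypothetical stand-ins for whatever values the library uses.

```python
def block_text_direction(line_directions, normal='left-right'):
    """Return the shared direction of all lines, or the normal one if they differ."""
    unique = set(line_directions)
    if len(unique) == 1:
        return unique.pop()
    return normal  # mixed (or no) directions: fall back to normal


print(block_text_direction(['bottom-top', 'bottom-top']))  # bottom-top
print(block_text_direction(['bottom-top', 'left-right']))  # left-right
```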

property white_space_only

Whether this block contains only white space. If True, this block is safe to remove.