pdf2docx.table.TablesConstructor module

Parsing table blocks.

  • lattice table: explicit borders represented by strokes.

  • stream table: borderless table recognized from the layout of text blocks.

Term definitions:

  • In terms of appearance, we speak of stroke and fill: the former looks like a line, the latter like an area.

  • In terms of semantics, we speak of border (cell border) and shading (cell shading).

  • An explicit border is determined by a stroke, but a stroke may also represent a text underline.

  • An explicit shading is determined by a fill, but a fill may also represent a text highlight.

  • The Border object is introduced to determine the borders of a stream table. A Border instance is a virtual border that can adapt within a certain range; once finalized, it is converted to a stroke, which is then used to detect the table border.
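The appearance-level distinction between stroke and fill can be illustrated with a minimal sketch. It assumes that a rectangle counts as a stroke when its shorter side does not exceed a maximum border width; the function name `classify_shape` and the bounding-box tuple layout are hypothetical and are not part of the pdf2docx API.

```python
from typing import Tuple

# Hypothetical bounding box layout: (x0, y0, x1, y1), in points
BBox = Tuple[float, float, float, float]


def classify_shape(bbox: BBox, max_border_width: float) -> str:
    """Return 'stroke' for a thin, line-like rectangle, otherwise 'fill'.

    Illustrative assumption only: the shorter side of the rectangle is
    compared against max_border_width.
    """
    x0, y0, x1, y1 = bbox
    width, height = abs(x1 - x0), abs(y1 - y0)
    # A rectangle whose shorter side fits within the border width looks like a line
    return "stroke" if min(width, height) <= max_border_width else "fill"


thin_bar = (72.0, 100.0, 300.0, 101.0)      # 1 pt high: looks like a line
shaded_area = (72.0, 100.0, 300.0, 130.0)   # 30 pt high: looks like an area
print(classify_shape(thin_bar, max_border_width=6.0))     # stroke
print(classify_shape(shaded_area, max_border_width=6.0))  # fill
```

Whether a stroke is then treated as a cell border or a text underline (and a fill as shading or highlight) is a separate, semantic decision that this sketch does not attempt.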

class pdf2docx.table.TablesConstructor.TablesConstructor(parent)

Bases: object

Object that parses TableBlock instances for the specified Layout.

lattice_tables(connected_border_tolerance: float, min_border_clearance: float, max_border_width: float)

Parse table with explicit borders/shadings represented by rectangle shapes.

Args:

  • connected_border_tolerance (float): Two borders are considered intersected if the gap between them is lower than this value.

  • min_border_clearance (float): The minimum allowable clearance between two borders.

  • max_border_width (float): The maximum border width.
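To make the role of connected_border_tolerance concrete, here is a minimal sketch of a gap test between two border rectangles. It assumes axis-aligned bounding boxes and a strict "lower than" comparison on both axes; the helper `borders_connected` is hypothetical and may differ from what pdf2docx does internally.

```python
from typing import Tuple

# Hypothetical bounding box layout: (x0, y0, x1, y1), in points
BBox = Tuple[float, float, float, float]


def borders_connected(a: BBox, b: BBox, connected_border_tolerance: float) -> bool:
    """Return True if the gap between two border rectangles is below the tolerance."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Gap along each axis; zero when the projections overlap
    gap_x = max(bx0 - ax1, ax0 - bx1, 0.0)
    gap_y = max(by0 - ay1, ay0 - by1, 0.0)
    return gap_x < connected_border_tolerance and gap_y < connected_border_tolerance


horizontal = (72.0, 100.0, 200.0, 100.5)     # top border of a cell
vertical_near = (200.3, 100.0, 200.8, 180.0)  # 0.3 pt to the right of it
vertical_far = (203.0, 100.0, 203.5, 180.0)   # 3 pt to the right of it

print(borders_connected(horizontal, vertical_near, 0.5))  # True
print(borders_connected(horizontal, vertical_far, 0.5))   # False
```

A small positive tolerance lets borders that were drawn slightly short of each other still be treated as meeting at a corner.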

stream_tables(min_border_clearance: float, max_border_width: float, line_separate_threshold: float)

Parse tables from the layout of text/image blocks, and update their borders with explicit borders represented by rectangle shapes.

Refer to lattice_tables for the descriptions of min_border_clearance and max_border_width.
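The idea of a virtual border that adapts within a range and is later fixed can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the `VirtualBorder` class, the `column_borders` helper and the `min_gap` parameter are hypothetical names. Candidate borders are placed in the horizontal gaps between text blocks, and each one snaps to an explicit stroke if one lies inside its range, falling back to the middle of the range otherwise.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class VirtualBorder:
    """Hypothetical vertical border that may sit anywhere in [lower, upper]."""
    lower: float
    upper: float

    def finalize(self, stroke_positions: Sequence[float]) -> float:
        """Fix the border position, preferring an explicit stroke in range."""
        for x in stroke_positions:
            if self.lower <= x <= self.upper:
                return x
        # No explicit stroke available: fall back to the middle of the range
        return (self.lower + self.upper) / 2.0


def column_borders(spans: Sequence[Tuple[float, float]], min_gap: float) -> List[VirtualBorder]:
    """Derive inner vertical borders from (x0, x1) extents of text blocks in a row."""
    ordered = sorted(spans)
    if not ordered:
        return []
    borders: List[VirtualBorder] = []
    right = ordered[0][1]
    for x0, x1 in ordered[1:]:
        # A wide enough horizontal gap leaves room for a border between blocks
        if x0 - right >= min_gap:
            borders.append(VirtualBorder(right, x0))
        right = max(right, x1)
    return borders


row = [(72.0, 150.0), (180.0, 260.0), (300.0, 380.0)]
explicit_strokes = [165.0]  # x-position of one explicit vertical stroke
borders = column_borders(row, min_gap=5.0)
print([b.finalize(explicit_strokes) for b in borders])  # [165.0, 280.0]
```

The first border snaps to the explicit stroke at 165.0, while the second has no stroke in its range and settles at the midpoint of the gap.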