pdf2docx.layout.Layout module

Document layout depends on Blocks and Shapes.

Layout here refers to the content and position of text, images and tables. The goal is to convert source blocks and shapes into a flow layout that can be re-created as docx elements such as paragraphs and tables. In addition to Section and Column, TableBlock is used to maintain the page layout, so detecting and parsing table blocks are the principal steps.

The following prerequisite work is done before this step:

  1. Clean up source blocks and shapes at Page level, e.g. convert source blocks to Line level, because the block structure determined by PyMuPDF may not be reasonable.

  2. Parse structure at document level, e.g. page header/footer.

  3. Parse Section and Column layout at Page level.

The page layout is parsed as follows:

  1. Parse table layout at Column level.
    1. Detect explicit tables first, based on shapes.

    2. Then detect stream tables, based on the original text blocks and the parsed explicit tables.

    3. Move blocks contained in a table (lines or explicit tables) to the associated cell layout.

  2. Parse paragraphs at Column level.
    1. Detect text blocks by combining related lines.

    2. Parse paragraph style, e.g. text format, alignment.

  3. Calculate vertical spacing based on the parsed tables and paragraphs.

  4. Repeat the above steps for each cell layout at the parsed table level.
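
The ordering of these steps, including the recursion into table cells, can be sketched as below. This is a minimal illustration under stated assumptions, not the pdf2docx implementation: `ToyLayout` and `ToyTable` are hypothetical stand-ins, table detection is skipped (tables are supplied up front), and "paragraph parsing" just joins the lines.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical table block: each cell owns a nested layout."""
    cells: list = field(default_factory=list)  # list of ToyLayout


@dataclass
class ToyLayout:
    """Hypothetical stand-in for a layout (a column or a table cell)."""
    name: str
    lines: list = field(default_factory=list)       # raw text lines
    tables: list = field(default_factory=list)      # ToyTable instances
    paragraphs: list = field(default_factory=list)  # filled in by parse()

    def parse(self, log=None):
        """Run the four steps in order and record them in `log`."""
        log = [] if log is None else log
        # Step 1: table layout (detection omitted in this sketch).
        log.append(f"{self.name}: tables")
        # Step 2: paragraphs, here simply the lines joined together.
        log.append(f"{self.name}: paragraphs")
        self.paragraphs = [" ".join(self.lines)] if self.lines else []
        # Step 3: vertical spacing (no-op placeholder).
        log.append(f"{self.name}: spacing")
        # Step 4: repeat the same steps for every cell layout.
        for table in self.tables:
            for cell in table.cells:
                cell.parse(log)
        return log


cell = ToyLayout("cell", lines=["a", "b"])
column = ToyLayout("column", lines=["x"], tables=[ToyTable(cells=[cell])])
steps = column.parse()
```

The point of the sketch is only that a cell is parsed with the same routine as its enclosing column, after the column's own tables, paragraphs and spacing have been handled.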

class pdf2docx.layout.Layout.Layout(bbox=None)

Bases: Element, ABC

Blocks and shapes structure and formats.

assign_blocks(blocks: list)

Add blocks (text lines or table blocks) to this layout.

Args:

blocks (list): a list of text lines or table blocks to add.

Note

If a text line is only partly contained, the check must go deeper, to span and then char level.
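
The note above can be illustrated with a simplified sketch. It is not the library's code: `Char`, `contained` and `clip_line` are hypothetical names, the geometry is reduced to one dimension, and the intermediate span level is skipped so that a partly contained line is filtered character by character.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical character with a horizontal extent."""
    text: str
    x0: float
    x1: float


def contained(x0, x1, box):
    """True if the interval [x0, x1] lies inside box = (left, right)."""
    return box[0] <= x0 and x1 <= box[1]


def clip_line(chars, box):
    """Return the text of a line that falls inside box, or None.

    A fully contained line is kept whole; a partly contained line is
    reduced to the characters that fit.
    """
    if not chars:
        return None
    if contained(chars[0].x0, chars[-1].x1, box):
        return "".join(c.text for c in chars)
    kept = [c.text for c in chars if contained(c.x0, c.x1, box)]
    return "".join(kept) or None


# Six characters, each 10 units wide, starting at x = 0.
line = [Char(ch, i * 10.0, i * 10.0 + 10.0) for i, ch in enumerate("abcdef")]
whole = clip_line(line, (0.0, 60.0))
partial = clip_line(line, (0.0, 30.0))
```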

assign_shapes(shapes: list)

Add shapes to this layout.

Args:

shapes (list): a list of Shape instances to add.

bbox: fitz.Rect

parse(**settings)

Parse layout.

Args:

settings (dict): Layout parsing parameters.

restore(data: dict)

Restore Layout from parsed results.

store()

Store parsed layout in dict format.
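
A store/restore pair of this kind is typically used as a round trip through a plain dict. The sketch below shows the idea only: `ToyLayout` and its `"bbox"` and `"blocks"` keys are hypothetical and do not reflect the actual schema produced by pdf2docx.

```python
import json


class ToyLayout:
    """Hypothetical layout holding a bbox and a list of text blocks."""

    def __init__(self, bbox=None):
        self.bbox = tuple(bbox) if bbox else (0.0, 0.0, 0.0, 0.0)
        self.blocks = []

    def store(self):
        # Key names are illustrative, not pdf2docx's real format.
        return {"bbox": list(self.bbox), "blocks": list(self.blocks)}

    def restore(self, data):
        # Rebuild state from a dict previously returned by store().
        self.bbox = tuple(data.get("bbox", self.bbox))
        self.blocks = list(data.get("blocks", []))
        return self


layout = ToyLayout((0.0, 0.0, 100.0, 50.0))
layout.blocks.append("hello")

snapshot = json.dumps(layout.store())  # a plain dict serializes to JSON
clone = ToyLayout().restore(json.loads(snapshot))
```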

abstract property working_bbox

Working bbox of current Layout.
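
Because the property is abstract, a concrete subclass has to provide it before it can be instantiated. The sketch below shows that pattern with the standard `abc` module; `BaseLayout` and `MarginLayout` are hypothetical classes, and the margin rule is an assumption made for illustration, not how pdf2docx computes the working bbox.

```python
from abc import ABC, abstractmethod


class BaseLayout(ABC):
    """Hypothetical base class mirroring an abstract working_bbox."""

    def __init__(self, bbox=None):
        self.bbox = bbox

    @property
    @abstractmethod
    def working_bbox(self):
        """Region in which this layout places its content."""


class MarginLayout(BaseLayout):
    """Hypothetical subclass: the bbox shrunk by a uniform margin."""

    def __init__(self, bbox, margin):
        super().__init__(bbox)
        self.margin = margin

    @property
    def working_bbox(self):
        x0, y0, x1, y1 = self.bbox
        m = self.margin
        return (x0 + m, y0 + m, x1 - m, y1 - m)


inner = MarginLayout((0, 0, 100, 50), 5).working_bbox
```

Instantiating `BaseLayout` directly raises `TypeError`, since the abstract property has no implementation there.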