pdf2docx.layout.Layout module
Document layout depends on Blocks and Shapes.
Layout here refers to the content and position of text, images and tables. The goal is to convert
source blocks and shapes to a flow layout that can be re-created as docx elements such as paragraphs and
tables. In addition to Section and Column, TableBlock is used to maintain the page layout,
so detecting and parsing table blocks is the principal step.
The following prerequisite work is done before this step:
- Clean up source blocks and shapes at Page level, e.g. convert source blocks to Line level, because the block structure determined by PyMuPDF might not be reasonable.
- Parse structure at document level, e.g. page header/footer.
- Parse Section and Column layout at Page level.
The page layout parsing idea:
- Parse table layout at Column level.
  - Detect explicit tables first, based on shapes.
  - Then detect stream tables, based on the original text blocks and the parsed explicit tables.
  - Move blocks contained in a table (lines or explicit tables) to the associated cell-layout.
- Parse paragraphs at Column level.
  - Detect text blocks by combining related lines.
  - Parse paragraph style, e.g. text format, alignment.
  - Calculate vertical spacing based on parsed tables and paragraphs.
- Repeat the above steps for the cell-layout at parsed table level.
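The order of the parsing idea above (tables, then paragraphs, then the same procedure inside each cell-layout) can be illustrated with a small, self-contained sketch. It does not use pdf2docx: `ToyLine`, `ToyLayout`, `contains` and the `cells` attribute are hypothetical stand-ins, table detection is stubbed out (the cells are given up front), and paragraph parsing is reduced to joining lines from top to bottom.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def contains(outer: BBox, inner: BBox) -> bool:
    """True if `inner` lies completely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


@dataclass
class ToyLine:
    """Hypothetical text line: a bbox plus its text."""
    bbox: BBox
    text: str


@dataclass
class ToyLayout:
    """Hypothetical stand-in for a column layout or a cell-layout."""
    bbox: BBox
    lines: List[ToyLine] = field(default_factory=list)
    # Cells of already detected tables; each cell is itself a layout.
    cells: List["ToyLayout"] = field(default_factory=list)
    paragraphs: List[str] = field(default_factory=list)

    def parse(self) -> None:
        # Table step (detection stubbed): move each line that is fully
        # contained in a cell into that cell's own layout.
        remaining = []
        for line in self.lines:
            owner = next(
                (c for c in self.cells if contains(c.bbox, line.bbox)), None)
            if owner is not None:
                owner.lines.append(line)
            else:
                remaining.append(line)
        self.lines = remaining

        # Paragraph step (simplified): join remaining lines top to bottom.
        ordered = sorted(self.lines, key=lambda ln: ln.bbox[1])
        self.paragraphs = [" ".join(ln.text for ln in ordered)] if ordered else []

        # Recursion step: repeat the same procedure for every cell-layout.
        for cell in self.cells:
            cell.parse()
```

Because a cell is modelled as just another layout, nested tables need no special handling in this sketch: the recursion stops when a layout has no cells.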
- class pdf2docx.layout.Layout.Layout(bbox=None)
Bases: Element, ABC
Blocks and shapes structure and formats.
- assign_blocks(blocks: list)
Add blocks (line or table block) to this layout.
- Args:
blocks (list): a list of text lines or table blocks to add.
Note
If a text line is only partly contained in the layout, the containment check must go deeper: first to span level, then to char level.
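A minimal sketch of what descending from line to span to char could look like, assuming a simple "fully inside" test. The data shapes (`Char`, `Span`, `Line` as nested lists of tuples) and the helpers `inside`, `union` and `clip_line` are hypothetical and are not the pdf2docx implementation, which may use a different containment criterion.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Char = Tuple[str, BBox]                   # (character, bbox)
Span = List[Char]
Line = List[Span]


def inside(outer: BBox, inner: BBox) -> bool:
    """True if `inner` lies completely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def union(boxes: Iterable[BBox]) -> BBox:
    """Smallest bbox covering all given boxes (expects at least one)."""
    boxes = list(boxes)
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def clip_line(line: Line, bbox: BBox) -> str:
    """Return the text of `line` that falls inside `bbox`.

    A span that fits entirely is kept as a whole; a span that only
    partly fits is checked character by character.
    """
    kept: List[str] = []
    for span in line:
        if not span:
            continue  # nothing to check in an empty span
        if inside(bbox, union(b for _, b in span)):
            kept.extend(c for c, _ in span)
        else:
            kept.extend(c for c, b in span if inside(bbox, b))
    return "".join(kept)
```

Checking the span before its characters keeps the common case cheap: per-character tests run only for spans that straddle the layout boundary.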
- assign_shapes(shapes: list)
Add shapes to this cell.
- Args:
shapes (list): a list of Shape instances to add.
- bbox: fitz.Rect
- parse(**settings)
Parse layout.
- Args:
settings (dict): Layout parsing parameters.
- restore(data: dict)
Restore Layout from parsed results.
- store()
Store parsed layout in dict format.
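The store() and restore() pair suggests a dict-based round trip. The sketch below shows that pattern on a hypothetical `ToyLayout`; the dictionary keys (`bbox`, `blocks`, `shapes`) and the attribute names are assumptions made for illustration and are not taken from pdf2docx.

```python
import json


class ToyLayout:
    """Hypothetical layout holding a bbox plus plain-dict blocks and shapes."""

    def __init__(self, bbox=None):
        self.bbox = tuple(bbox) if bbox else None
        self.blocks = []
        self.shapes = []

    def store(self) -> dict:
        """Dump the layout to a JSON-friendly dict (key names are assumed)."""
        return {
            "bbox": list(self.bbox) if self.bbox else None,
            "blocks": list(self.blocks),
            "shapes": list(self.shapes),
        }

    def restore(self, data: dict) -> None:
        """Rebuild the layout state from a dict produced by store()."""
        bbox = data.get("bbox")
        self.bbox = tuple(bbox) if bbox else None
        self.blocks = list(data.get("blocks", []))
        self.shapes = list(data.get("shapes", []))


# Round trip through JSON text, as one might do when caching parsed results.
source = ToyLayout(bbox=(0, 0, 100, 50))
source.blocks.append({"text": "hello"})
payload = json.dumps(source.store())

copy = ToyLayout()
copy.restore(json.loads(payload))
```

Keeping the stored form limited to lists, dicts, numbers and strings is what makes the JSON round trip lossless in this sketch.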
- abstract property working_bbox
Working bbox of current Layout.
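Since working_bbox is an abstract property, a concrete layout class has to provide it before it can be instantiated. The sketch below shows that mechanism with Python's `abc` module only; `ToyLayout`, `ToyCell` and the `padding` parameter are hypothetical, and how the real pdf2docx subclasses compute their working bbox is not covered here.

```python
from abc import ABC, abstractmethod


class ToyLayout(ABC):
    """Hypothetical base class mirroring the abstract-property pattern."""

    def __init__(self, bbox=None):
        self.bbox = bbox  # (x0, y0, x1, y1) or None

    @property
    @abstractmethod
    def working_bbox(self):
        """Region in which this layout's content may be placed."""


class ToyCell(ToyLayout):
    """Hypothetical subclass: the working bbox is the bbox minus a padding."""

    def __init__(self, bbox, padding=2.0):
        super().__init__(bbox)
        self.padding = padding  # assumed parameter, for illustration only

    @property
    def working_bbox(self):
        x0, y0, x1, y1 = self.bbox
        p = self.padding
        return (x0 + p, y0 + p, x1 - p, y1 - p)
```

Trying to instantiate `ToyLayout` directly raises `TypeError`, which is how the abstract base forces each concrete layout to define its own working area.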