pdf2docx.table.TableStructure module#
Parsing table structure based on strokes and fills.
- class pdf2docx.table.TableStructure.CellStructure(bbox: list)#
Bases: object
Cell structure with properties bbox, borders, shading, etc.
- property is_merged#
- property is_merging#
- parse_borders(h_strokes: dict, v_strokes: dict)#
Parse cell borders from strokes.
- Args:
- h_strokes (dict): A dict mapping each y-coordinate to the horizontal strokes at that coordinate, e.g.
{y0: [h1, h2, ...], y1: [h3, h4, ...]}
- v_strokes (dict): A dict mapping each x-coordinate to the vertical strokes at that coordinate, e.g.
{x0: [v1, v2, ...], x1: [v3, v4, ...]}
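The two dicts described above can be pictured with a small standalone sketch. It does not use pdf2docx objects: strokes are assumed to be plain `(x0, y0, x1, y1)` tuples, and `group_strokes` and its `ndigits` rounding are hypothetical names chosen for illustration only.

```python
from collections import defaultdict

def group_strokes(segments, ndigits=1):
    """Group axis-aligned segments by their shared coordinate.

    Assumption: each segment is an (x0, y0, x1, y1) tuple. The library
    itself works with stroke shapes, not tuples.
    """
    h_strokes, v_strokes = defaultdict(list), defaultdict(list)
    for x0, y0, x1, y1 in segments:
        if abs(y1 - y0) <= abs(x1 - x0):
            # Wider than tall: treat as horizontal, keyed by its y-coordinate.
            h_strokes[round((y0 + y1) / 2, ndigits)].append((x0, y0, x1, y1))
        else:
            # Taller than wide: treat as vertical, keyed by its x-coordinate.
            v_strokes[round((x0 + x1) / 2, ndigits)].append((x0, y0, x1, y1))
    return dict(h_strokes), dict(v_strokes)

segments = [
    (0, 0, 10, 0), (10, 0, 20, 0),   # two horizontal pieces at y=0
    (0, 10, 20, 10),                 # one horizontal piece at y=10
    (0, 0, 0, 10), (20, 0, 20, 10),  # vertical pieces at x=0 and x=20
]
h, v = group_strokes(segments)
```

Rounding the key is one simple way to let nearly collinear strokes share a group; the tolerance the library actually applies may differ.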
- class pdf2docx.table.TableStructure.TableStructure(strokes: Shapes, **settings)#
Bases: object
Parsing table structure based on strokes/fills.
Steps to parse table structure:
        x0        x1       x2        x3
    y0  +----h1---+---h2---+----h3---+
        |         |        |         |
        v1        v2       v3        v4
        |         |        |         |
    y1  +----h4------------+----h5---+
        |                  |         |
        v5                 v6        v7
        |                  |         |
    y2  +--------h6--------+----h7---+
1. Group horizontal and vertical strokes:
    self.h_strokes = {
        y0 : [h1, h2, h3],
        y1 : [h4, h5],
        y2 : [h6, h7]
    }
Vertical strokes are grouped by x-coordinate in the same way.
These [x0, x1, x2, x3] x [y0, y1, y2] form the table lattice, i.e. 2 rows x 3 cols.
2. Check merged cells in row/column direction.
Let the horizontal line y=(y0+y1)/2 cross the table: it intersects v1, v2 and v3, indicating that no cells are merged in the first row.
When y=(y1+y2)/2, there is no intersection with a vertical stroke at x=x1, i.e. the merging status is [1, 0, 1], indicating that Cell(2,2) is merged into Cell(2,1).
So the final merging status in this case is:
    [
        [(1,1), (1,1), (1,1)],
        [(1,2), (0,0), (1,1)]
    ]
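The merge check can be sketched in a few lines of standalone Python. This is a simplified illustration under stated assumptions, not the library's implementation: strokes are reduced to `(start, end)` intervals keyed by coordinate, the helper names `covers` and `merging_status` are hypothetical, and each result tuple is read as (row span, column span) with `(0, 0)` marking a cell merged into a neighbour, which is an interpretation of the example above.

```python
def covers(intervals, value):
    """True if any (start, end) interval contains value."""
    return any(lo <= value <= hi for lo, hi in intervals)

def run_of_zeros(values):
    """Count leading zeros in an iterable of 0/1 flags."""
    n = 0
    for flag in values:
        if flag:
            break
        n += 1
    return n

def merging_status(xs, ys, h_strokes, v_strokes):
    n_rows, n_cols = len(ys) - 1, len(xs) - 1
    # left[i][j] is 1 when a vertical stroke at xs[j] crosses the mid-height of row i.
    left = [[int(covers(v_strokes.get(xs[j], []), (ys[i] + ys[i + 1]) / 2))
             for j in range(n_cols)] for i in range(n_rows)]
    # top[i][j] is 1 when a horizontal stroke at ys[i] crosses the mid-width of column j.
    top = [[int(covers(h_strokes.get(ys[i], []), (xs[j] + xs[j + 1]) / 2))
            for j in range(n_cols)] for i in range(n_rows)]
    status = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            if not (left[i][j] and top[i][j]):
                row.append((0, 0))  # missing left or top border: merged away
            else:
                n_row = 1 + run_of_zeros(top[k][j] for k in range(i + 1, n_rows))
                n_col = 1 + run_of_zeros(left[i][k] for k in range(j + 1, n_cols))
                row.append((n_row, n_col))
        status.append(row)
    return status

# The lattice from the diagram, with x0..x3 = 0, 10, 20, 30 and y0..y2 = 0, 10, 20.
xs, ys = [0, 10, 20, 30], [0, 10, 20]
h_strokes = {0: [(0, 10), (10, 20), (20, 30)], 10: [(0, 20), (20, 30)], 20: [(0, 20), (20, 30)]}
v_strokes = {0: [(0, 10), (10, 20)], 10: [(0, 10)], 20: [(0, 10), (10, 20)], 30: [(0, 10), (10, 20)]}
status = merging_status(xs, ys, h_strokes, v_strokes)
```

Testing each cell at the midpoint of its row or column avoids ambiguity at lattice corners, where several strokes meet.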
- property bbox#
Table boundary bbox.
- Returns:
fitz.Rect: bbox of table.
- property num_cols#
- property num_rows#
- parse(fills: Shapes)#
Parse table structure.
- Args:
fills (Shapes): Fill shapes representing cell shading.
- to_table_block()#
Convert parsed table structure to TableBlock instance.
- Returns:
TableBlock: Parsed table block instance.
- property x_cols#
Left x-coordinate x0 of each column.
- Returns:
list: x-coordinates of each column.
- property y_rows#
Top y-coordinate y0 of each row.
- Returns:
list: y-coordinates of each row.
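As a rough illustration of how column and row origins relate to the grouped strokes, the sketch below derives them from the dict keys alone. It assumes every lattice line appears as a key in the stroke dicts; `lattice_origins` is a hypothetical helper, not a pdf2docx function.

```python
def lattice_origins(h_strokes, v_strokes):
    """Return (left x of each column, top y of each row).

    Assumption: the sorted dict keys are exactly the lattice lines, so the
    last x and last y only close the final column and row.
    """
    xs, ys = sorted(v_strokes), sorted(h_strokes)
    return xs[:-1], ys[:-1]

# Keys only; the stroke lists themselves are not needed for this step.
x_cols, y_rows = lattice_origins(
    {0: [], 10: [], 20: []},          # three horizontal lattice lines
    {0: [], 10: [], 20: [], 30: []},  # four vertical lattice lines
)
```

With four vertical and three horizontal lattice lines this yields three column origins and two row origins, matching the 2 rows x 3 cols of the example table.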