pdf2docx.table.TableStructure module

Parsing table structure based on strokes and fills.

class pdf2docx.table.TableStructure.CellStructure(bbox: list)

Bases: object

Cell structure with properties bbox, borders, shading, etc.

property is_merged
property is_merging
parse_borders(h_strokes: dict, v_strokes: dict)

Parse cell borders from strokes.

Args:
h_strokes (dict): A dict mapping each y-coordinate to the horizontal strokes at that coordinate, e.g.

{y0: [h1,h2,..], y1: [h3,h4,...]}

v_strokes (dict): A dict mapping each x-coordinate to the vertical strokes at that coordinate, e.g.

{x0: [v1,v2,..], x1: [v3,v4,...]}
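
As a rough illustration of how two dicts of this shape could be consulted, the sketch below looks up the stroke covering each side of a cell bbox. It is not pdf2docx's implementation: strokes are simplified to (start, end) extents along their own axis, coordinates are matched exactly, and `find_border` and `cell_borders` are hypothetical helper names.

```python
def find_border(strokes_at, lo, hi):
    """Return the first (start, end) stroke that spans [lo, hi], else None."""
    for start, end in strokes_at:
        if start <= lo and end >= hi:
            return (start, end)
    return None


def cell_borders(bbox, h_strokes, v_strokes):
    """Pick one candidate stroke per side of the cell (hypothetical helper)."""
    x0, y0, x1, y1 = bbox
    return {
        "top": find_border(h_strokes.get(y0, []), x0, x1),
        "bottom": find_border(h_strokes.get(y1, []), x0, x1),
        "left": find_border(v_strokes.get(x0, []), y0, y1),
        "right": find_border(v_strokes.get(x1, []), y0, y1),
    }


# Assumed sample data: horizontal strokes keyed by y, vertical strokes keyed by x.
h_strokes = {0: [(0, 10), (10, 20)], 5: [(0, 20)]}
v_strokes = {0: [(0, 5)], 20: [(0, 5)]}

borders = cell_borders((0, 0, 10, 5), h_strokes, v_strokes)
print(borders)
```

In this sample there is no vertical stroke at x=10, so the right side of the cell has no border.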

parse_shading(fills: Shapes)

Parse cell shading from fills.

Args:

fills (Shapes): Fill shapes representing cell shading.

class pdf2docx.table.TableStructure.TableStructure(strokes: Shapes, **settings)

Bases: object

Parsing table structure based on strokes/fills.

Steps to parse table structure:

    x0        x1       x2        x3
y0  +----h1---+---h2---+----h3---+
    |         |        |         |
    v1        v2       v3        v4
    |         |        |         |
y1  +----h4------------+----h5---+
    |                  |         |
    v5                 v6        v7
    |                  |         |
y2  +--------h6--------+----h7---+
  1. Group horizontal and vertical strokes:

    self.h_strokes = {
        y0 : [h1, h2, h3],
        y1 : [h4, h5],
        y2 : [h6, h7]
    }
    self.v_strokes = {
        x0 : [v1, v5], x1 : [v2],
        x2 : [v3, v6], x3 : [v4, v7]
    }
    

The coordinates [x0, x1, x2, x3] x [y0, y1, y2] form the table lattice, i.e. 2 rows x 3 cols.
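
The grouping step can be sketched as follows, using numeric stand-ins for the diagram (x = 0, 10, 20, 30 and y = 0, 5, 10). This is only an illustration under simplifying assumptions: strokes are plain (x0, y0, x1, y1) tuples, coordinates are compared for exact equality, and `group_strokes` is a hypothetical helper rather than a pdf2docx function.

```python
from collections import defaultdict


def group_strokes(strokes):
    """Group (x0, y0, x1, y1) strokes by their y (horizontal) or x (vertical)."""
    h_strokes, v_strokes = defaultdict(list), defaultdict(list)
    for stroke in strokes:
        x0, y0, x1, y1 = stroke
        if y0 == y1:        # horizontal stroke
            h_strokes[y0].append(stroke)
        elif x0 == x1:      # vertical stroke
            v_strokes[x0].append(stroke)
    # Sort by coordinate so the keys read left-to-right / top-to-bottom.
    return dict(sorted(h_strokes.items())), dict(sorted(v_strokes.items()))


strokes = [
    # h1..h7
    (0, 0, 10, 0), (10, 0, 20, 0), (20, 0, 30, 0),
    (0, 5, 20, 5), (20, 5, 30, 5),
    (0, 10, 20, 10), (20, 10, 30, 10),
    # v1..v7
    (0, 0, 0, 5), (10, 0, 10, 5), (20, 0, 20, 5), (30, 0, 30, 5),
    (0, 5, 0, 10), (20, 5, 20, 10), (30, 5, 30, 10),
]

h_strokes, v_strokes = group_strokes(strokes)
y_lines, x_lines = list(h_strokes), list(v_strokes)
num_rows, num_cols = len(y_lines) - 1, len(x_lines) - 1
print(num_rows, num_cols)
```

The number of lattice lines in each direction, minus one, gives the 2 rows x 3 cols stated above.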

  2. Check merged cells in row/column direction.

Let the horizontal line y=(y0+y1)/2 cross the table: it intersects v1, v2 and v3, which indicates that no cells are merged in the first row.

When y=(y1+y2)/2, the line does not intersect any vertical stroke at x=x1, so the merging status is [1, 0, 1], indicating that Cell(2,2) is merged into Cell(2,1).

The final merging status in this case is:

[
    [(1,1), (1,1), (1,1)],
    [(1,2), (0,0), (1,1)]
]
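
The check described above can be sketched roughly as below. The sketch makes several assumptions: each stroke is reduced to its (start, end) extent along its own axis, only the left/top lattice line of each cell is tested (the last line is skipped), and the tuples are read as (row span, column span), which is inferred from the example rather than stated in the text. `merging_status` and `count_span` are hypothetical names, and the span logic is a simplification that only handles merges like the one in the diagram.

```python
def merging_status(ref, strokes_by_coord):
    """1 per lattice line (last one excluded) if a stroke there spans ref, else 0."""
    status = []
    for coord in sorted(strokes_by_coord)[:-1]:
        crossed = any(lo < ref < hi for lo, hi in strokes_by_coord[coord])
        status.append(1 if crossed else 0)
    return status


def count_span(status, start):
    """1 plus the number of consecutive zeros right after index start."""
    span = 1
    for flag in status[start + 1:]:
        if flag:
            break
        span += 1
    return span


# Numeric stand-ins for the diagram: x = 0, 10, 20, 30 and y = 0, 5, 10.
v_strokes = {0: [(0, 5), (5, 10)], 10: [(0, 5)],
             20: [(0, 5), (5, 10)], 30: [(0, 5), (5, 10)]}
h_strokes = {0: [(0, 10), (10, 20), (20, 30)],
             5: [(0, 20), (20, 30)], 10: [(0, 20), (20, 30)]}

xs, ys = sorted(v_strokes), sorted(h_strokes)
# One status list per row (line through the row's middle) and per column.
row_status = [merging_status((ys[i] + ys[i + 1]) / 2, v_strokes)
              for i in range(len(ys) - 1)]
col_status = [merging_status((xs[j] + xs[j + 1]) / 2, h_strokes)
              for j in range(len(xs) - 1)]

merged = []
for i in range(len(row_status)):
    row = []
    for j in range(len(col_status)):
        if row_status[i][j] == 0 or col_status[j][i] == 0:
            row.append((0, 0))  # cell is merged into a neighbour
        else:
            row.append((count_span(col_status[j], i),
                        count_span(row_status[i], j)))
    merged.append(row)
print(merged)
```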
property bbox

Table boundary bbox.

Returns:

fitz.Rect: bbox of table.

property num_cols
property num_rows
parse(fills: Shapes)

Parse table structure.

Args:

fills (Shapes): Fill shapes representing cell shading.

to_table_block()

Convert parsed table structure to TableBlock instance.

Returns:

TableBlock: Parsed table block instance.

property x_cols

Left x-coordinate x0 of each column.

Returns:

list: x-coordinates of each column.

property y_rows

Top y-coordinate y0 of each row.

Returns:

list: y-coordinates of each row.