pdf2docx.shape.Path module#

Objects representing PDF paths (strokes and fills) extracted from PDF drawings and annotations.

The data structure is based on the results of page.get_drawings():

{
    'color': (x,x,x) or None,  # stroke color
    'fill' : (x,x,x) or None,  # fill color
    'width': float,            # line width
    'closePath': bool,         # whether to connect last and first point
    'rect' : rect,             # page area covered by this path
    'items': [                 # list of draw commands: lines, curves, rectangles or quads.
        ("l", p1, p2),         # a line from p1 to p2
        ("c", p1, p2, p3, p4), # cubic Bézier curve from p1 to p4, p2 and p3 are the control points
        ("re", rect),          # a rect represented with two diagonal points
        ("qu", quad)           # a quad represented with four corner points
    ],
    ...
}

Note

The coordinates extracted by page.get_drawings() are based on the real page coordinate system (CS), i.e. with rotation considered. This differs from page.get_text('rawdict').
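As a rough illustration of the raw dict layout described above, the sketch below walks the 'items' list and dispatches on the command tag. The sample dict is built by hand, with plain tuples standing in for PyMuPDF point, rect and quad objects, and describe_items is a hypothetical helper that is not part of pdf2docx.

```python
# Hand-built stand-in for one entry of page.get_drawings(); plain tuples
# replace PyMuPDF's point/rect/quad objects (an assumption for illustration).
raw_path = {
    'color': (0.0, 0.0, 0.0),   # stroke color
    'fill': None,               # no fill
    'width': 1.0,
    'closePath': False,
    'rect': (10, 10, 110, 60),
    'items': [
        ('l', (10, 10), (110, 10)),
        ('c', (110, 10), (120, 20), (120, 50), (110, 60)),
        ('re', (10, 10, 110, 60)),
    ],
}

def describe_items(raw):
    """Return one readable string per draw command (hypothetical helper)."""
    names = {'l': 'line', 'c': 'curve', 're': 'rect', 'qu': 'quad'}
    out = []
    for item in raw['items']:
        cmd, args = item[0], item[1:]   # first element is the command tag
        out.append(f"{names.get(cmd, 'unknown')} with {len(args)} argument(s)")
    return out

descriptions = describe_items(raw_path)
```

Each command tag would map to one of the segment classes documented below (L, C, R, Q).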

class pdf2docx.shape.Path.C(item)#

Bases: Segment

Bezier curve path with source ("c", p1, p2, p3, p4).

class pdf2docx.shape.Path.L(item)#

Bases: Segment

Line path with source ("l", p1, p2).

property length#
to_strokes(width: float, color: list)#

Convert to stroke dict.

Args:

width (float): Specify width for the stroke.
color (list): Specify color for the stroke.

Returns:

list: A list of Stroke dicts.

Note

A line corresponds to exactly one stroke. For consistency with the other segment types, the stroke dict is still returned in a list, so the length of the list is always 1.
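A minimal sketch of this "one line, one-element list" convention follows. The function name line_to_strokes and the dict keys 'start', 'end', 'width' and 'color' are assumptions made for illustration; the actual Stroke dict layout is not specified in this section.

```python
def line_to_strokes(p1, p2, width, color):
    """Wrap a single line in a one-element list of stroke dicts.

    The key names used here are assumed, not taken from pdf2docx.
    """
    return [{'start': p1, 'end': p2, 'width': width, 'color': color}]

strokes = line_to_strokes((0, 0), (100, 0), 0.5, [0, 0, 0])
```

Returning a list even for a single stroke lets callers extend one result list without checking the segment type.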

class pdf2docx.shape.Path.Path(raw: dict)#

Bases: object

A path extracted from PDF, consisting of one or more Segments.

property is_fill#
property is_iso_oriented#

It is iso-oriented when all contained segments are iso-oriented.

property is_stroke#
plot(canvas)#

Plot the path for debugging purposes.

Args:

canvas: PyMuPDF drawing canvas created by page.new_shape().

to_shapes()#

Convert path to Shape raw dicts.

Returns:

list: A list of Shape dicts.

class pdf2docx.shape.Path.Q(item)#

Bases: R

Quad path with source ("qu", quad).

class pdf2docx.shape.Path.R(item)#

Bases: Segment

Rect path with source ("re", rect).

to_strokes(width: float, color: list)#

Convert each edge to stroke dict.

Args:

width (float): Specify width for the stroke.
color (list): Specify color for the stroke.

Returns:

list: A list of Stroke dicts.

Note

One Rect path is converted to a list of 4 stroke dicts.
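The rect-to-four-strokes conversion could look roughly like the sketch below. It assumes the rect is given by two diagonal points (x0, y0) and (x1, y1); the function name rect_to_strokes and the stroke dict keys are hypothetical and chosen only for illustration.

```python
def rect_to_strokes(x0, y0, x1, y1, width, color):
    """Split a rectangle into four edge strokes (illustrative sketch).

    The key names 'start', 'end', 'width', 'color' are assumed.
    """
    # Corners in drawing order; the last edge closes back to the first corner.
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    strokes = []
    for i, start in enumerate(corners):
        end = corners[(i + 1) % 4]
        strokes.append({'start': start, 'end': end,
                        'width': width, 'color': color})
    return strokes

edges = rect_to_strokes(10, 10, 110, 60, 1.0, [0, 0, 0])
```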

class pdf2docx.shape.Path.Segment(item)#

Bases: object

A segment of a path, e.g. a line, a rectangle or a curve.

to_strokes(width: float, color: list)#
class pdf2docx.shape.Path.Segments(items: list, close_path=False)#

Bases: object

A sub-path composed of one or more segments.

property area#

Calculate the area enclosed by the segments using Green's formula. Note that the boundary of a Bezier curve is approximated by its control points.
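For a closed polygon, Green's formula reduces to the well-known shoelace sum. The sketch below shows that calculation on a plain list of (x, y) tuples; polygon_area is a hypothetical name, and the way pdf2docx collects the points (including Bezier control points) is not reproduced here.

```python
def polygon_area(points):
    """Signed area of a closed polygon via the shoelace form of Green's formula.

    The sign depends on the orientation of the points and on the direction
    of the y axis; take abs() when only the magnitude matters.
    """
    area = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]   # wrap around to close the polygon
        area += x0 * y1 - x1 * y0
    return area / 2.0

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
square_area = polygon_area(square)
reversed_area = polygon_area(square[::-1])
```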

property bbox#

Calculate segments bbox.

property is_iso_oriented#

ISO-oriented criterion: the ratio of the real area to the bbox area exceeds 0.9.
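The criterion can be sketched as follows, again on a plain list of (x, y) tuples. The helper names are hypothetical, and treating a zero-area bbox (a purely horizontal or vertical line) as iso-oriented is an assumption of this sketch rather than documented behavior.

```python
def polygon_area(points):
    """Signed polygon area (shoelace formula)."""
    n = len(points)
    total = sum(points[i][0] * points[(i + 1) % n][1]
                - points[(i + 1) % n][0] * points[i][1] for i in range(n))
    return total / 2.0

def bbox_of(points):
    """Axis-aligned bounding box as (x0, y0, x1, y1)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def is_iso_oriented(points, threshold=0.9):
    """True when the enclosed area fills more than `threshold` of the bbox."""
    x0, y0, x1, y1 = bbox_of(points)
    bbox_area = (x1 - x0) * (y1 - y0)
    if bbox_area == 0:
        return True   # assumption: degenerate (line-like) shapes count as iso-oriented
    return abs(polygon_area(points)) / bbox_area > threshold

rectangle = [(0, 0), (10, 0), (10, 5), (0, 5)]
triangle = [(0, 0), (10, 0), (0, 10)]
```

An axis-aligned rectangle fills its bbox completely, while a right triangle fills only half of it and so fails the test.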

property points#

Connected points of segments.

to_fill(color: list)#

Convert the closed area of the segments to a Fill dict.

Args:

color (list): Specify fill color.

Returns:

dict: Fill dict.

to_strokes(width: float, color: list)#

Convert each segment to a Stroke dict.

Args:

width (float): Specify stroke width.
color (list): Specify stroke color.

Returns:

list: A list of Stroke dicts.