pdf2docx.shape.Paths module

Objects representing PDF paths (strokes and fills) extracted by page.get_drawings().

This method has been available since PyMuPDF 1.18.0 and covers both raw PDF paths and annotations such as Line, Square and Highlight.

class pdf2docx.shape.Paths.Paths(instances: Optional[list] = None, parent=None)

Bases: Collection

A collection of paths.

bbox: Calculated only once; the property value is then cached.
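The compute-once behaviour can be illustrated with a minimal sketch. `PathCollection` and its `bbox_computations` counter are hypothetical stand-ins for illustration, not the pdf2docx class, and `functools.cached_property` is just one way to get this caching; the library may do it differently.

```python
from functools import cached_property


class PathCollection:
    """Hypothetical stand-in for a collection of paths (not the pdf2docx class)."""

    def __init__(self, boxes):
        # Each box is a plain (x0, y0, x1, y1) tuple.
        self._boxes = list(boxes)
        self.bbox_computations = 0  # counts how often the union is computed

    @cached_property
    def bbox(self):
        # Runs on first access only; later accesses reuse the stored value.
        self.bbox_computations += 1
        if not self._boxes:
            return (0.0, 0.0, 0.0, 0.0)
        return (
            min(b[0] for b in self._boxes),
            min(b[1] for b in self._boxes),
            max(b[2] for b in self._boxes),
            max(b[3] for b in self._boxes),
        )


paths = PathCollection([(0, 0, 10, 5), (4, 2, 20, 8)])
first = paths.bbox
second = paths.bbox  # served from the cache, no recomputation
```

A cached value like this is only valid while the underlying paths stay unchanged.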

property is_iso_oriented: It is iso-oriented when all contained segments are iso-oriented.
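As a rough illustration, assuming "iso-oriented" means axis-aligned (horizontal or vertical), the check can be sketched as follows. The helper names and the tolerance `tol` are hypothetical and not part of pdf2docx.

```python
def is_iso_oriented_segment(start, end, tol=1e-3):
    """Return True if the segment is horizontal or vertical within `tol`."""
    (x0, y0), (x1, y1) = start, end
    return abs(x1 - x0) <= tol or abs(y1 - y0) <= tol


def is_iso_oriented_path(segments, tol=1e-3):
    """A path is iso-oriented when every one of its segments is."""
    return all(is_iso_oriented_segment(s, e, tol) for s, e in segments)


# An axis-aligned rectangle outline: four horizontal/vertical segments.
rectangle = [
    ((0, 0), (10, 0)),
    ((10, 0), (10, 5)),
    ((10, 5), (0, 5)),
    ((0, 5), (0, 0)),
]

# A triangle: two of its three segments are slanted.
triangle = [
    ((0, 0), (10, 0)),
    ((10, 0), (5, 8)),
    ((5, 8), (0, 0)),
]

rectangle_is_iso = is_iso_oriented_path(rectangle)
triangle_is_iso = is_iso_oriented_path(triangle)
```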

plot(page)

Plot paths for debugging purposes.

Args:
    page (fitz.Page): PyMuPDF page.

restore(raws: list): Initialize paths from the raw data obtained from page.get_drawings().
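For orientation, each entry returned by page.get_drawings() is a dict describing one path, and its "items" list holds drawing commands such as ("l", p1, p2) for a line or ("re", rect) for a rectangle. The sketch below walks a hand-written entry of that shape. It uses plain tuples where PyMuPDF would supply Point and Rect objects, shows only a subset of the keys, and `summarize_raw_path` is a hypothetical helper rather than anything provided by pdf2docx.

```python
from collections import Counter

# Hand-written example in the style of one page.get_drawings() entry.
# Assumption: plain tuples stand in for PyMuPDF Point/Rect objects.
raw_path = {
    "type": "s",  # "s" = stroke, "f" = fill, "fs" = fill and stroke
    "color": (0.0, 0.0, 0.0),
    "fill": None,
    "width": 1.0,
    "rect": (10.0, 10.0, 110.0, 60.0),
    "items": [
        ("l", (10.0, 10.0), (110.0, 10.0)),   # line segment
        ("l", (110.0, 10.0), (110.0, 60.0)),  # line segment
        ("re", (10.0, 10.0, 110.0, 60.0)),    # rectangle
    ],
}


def summarize_raw_path(raw):
    """Count the drawing commands in one raw path dict (hypothetical helper)."""
    kinds = Counter(item[0] for item in raw["items"])
    return {"type": raw["type"], "commands": dict(kinds)}


summary = summarize_raw_path(raw_path)
```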

to_shapes()

Convert contained paths to ISO strokes or rectangular fills.

Returns:
    list: A list of Shape raw dicts.
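One way to picture the stroke case is to treat an axis-aligned segment of a given line width as a thin rectangle centred on the segment. The sketch below assumes that reading; `stroke_to_rect` is a hypothetical helper and does not reproduce the library's actual conversion or its raw dict layout.

```python
def stroke_to_rect(start, end, width):
    """Return (x0, y0, x1, y1) for an axis-aligned stroke of the given width.

    Hypothetical helper: the rectangle is centred on the segment and is
    `width` thick perpendicular to it.
    """
    (x0, y0), (x1, y1) = start, end
    half = width / 2.0
    if y0 == y1:  # horizontal stroke
        return (min(x0, x1), y0 - half, max(x0, x1), y0 + half)
    if x0 == x1:  # vertical stroke
        return (x0 - half, min(y0, y1), x0 + half, max(y0, y1))
    raise ValueError("segment is not iso-oriented")


horizontal = stroke_to_rect((10.0, 20.0), (110.0, 20.0), 2.0)
vertical = stroke_to_rect((50.0, 0.0), (50.0, 30.0), 1.0)
```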

to_shapes_and_images(min_svg_gap_dx: float = 15, min_svg_gap_dy: float = 15, min_w: float = 2, min_h: float = 2, clip_image_res_ratio: float = 3.0)

Convert paths to iso-oriented shapes or images. The semantic type of a path is either table/text style or vector graphic. This method:

* detects svg regions, i.e. regions containing at least one non-iso-oriented path;
* converts each svg region to a bitmap by clipping the page;
* converts the remaining paths to iso-oriented shapes for further table/text style parsing.

Args:
    min_svg_gap_dx (float): Merge svg if the horizontal gap is less than this value.
    min_svg_gap_dy (float): Merge svg if the vertical gap is less than this value.
    min_w (float): Ignore contours if the bbox width is less than this value.
    min_h (float): Ignore contours if the bbox height is less than this value.
    clip_image_res_ratio (float, optional): Resolution ratio of clipped bitmap. Defaults to 3.0.

Returns:
    tuple: (list of shape raw dict, list of image raw dict).
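The role of the gap and size thresholds can be sketched with plain bounding boxes. This is an illustration under stated assumptions, not the library's algorithm: `merge_regions`, `gap` and `union` are hypothetical helpers, small boxes are dropped before merging, and two regions are merged only when both the horizontal and the vertical gap fall below their thresholds.

```python
def gap(a, b):
    """Horizontal and vertical gap between two boxes (0 when they overlap on that axis)."""
    dx = max(a[0], b[0]) - min(a[2], b[2])
    dy = max(a[1], b[1]) - min(a[3], b[3])
    return max(dx, 0.0), max(dy, 0.0)


def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_regions(boxes, min_gap_dx=15, min_gap_dy=15, min_w=2, min_h=2):
    """Drop tiny boxes, then repeatedly merge boxes that lie close together."""
    regions = [b for b in boxes if b[2] - b[0] >= min_w and b[3] - b[1] >= min_h]
    merged = True
    while merged:
        merged = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                dx, dy = gap(regions[i], regions[j])
                if dx < min_gap_dx and dy < min_gap_dy:
                    regions[i] = union(regions[i], regions[j])
                    del regions[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return regions


candidates = [
    (0, 0, 50, 50),
    (60, 0, 100, 50),      # 10 units to the right of the first box
    (300, 300, 350, 350),  # far away from everything else
    (200, 200, 201, 201),  # narrower than min_w, so ignored
]
regions = merge_regions(candidates)
```

With the default thresholds, the first two boxes are merged into one region, the distant box stays separate, and the tiny box is discarded.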