pdf2docx.shape.Shape module#

Objects representing PDF strokes and fills extracted from paths.

  • Stroke: only the horizontal or vertical segments of a path are considered

  • Fill: bbox of the filled area of a closed path

A hyperlink in PyMuPDF is represented by its uri and its rectangular area (hot area), while the text it applies to isn’t extracted explicitly. To reuse the process that identifies the text a text style shape (e.g. underline or highlight) applies to, a hyperlink is also abstracted as a Shape.

Note

The main difference between a hyperlink shape and a text style shape is that the type of a hyperlink shape is known in advance, while the type of a text style shape has to be identified from its position relative to the associated text blocks.

In summary, the semantic meaning of a shape instance may be:

  • strike-through line of text

  • underline of text

  • highlight area of text

  • table border

  • cell shading

  • hyperlink

Data structure:

{
    'type': int,
    'bbox': (x0, y0, x1, y1),
    'color': srgb_value,

    # for Stroke
    'start': (x0, y0),
    'end': (x1, y1),
    'width': float,

    # for Hyperlink
    'uri': str
}
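As an illustration of the layout above, the following is a minimal sketch of what two raw dicts might look like. All values are made up, the 'type' placeholder and the packed-integer colour are assumptions, and the helper is_horizontal is hypothetical rather than part of pdf2docx.

```python
# Hypothetical raw dicts following the layout above; every value is invented.
stroke_raw = {
    'type': 0,                            # placeholder value (assumption)
    'bbox': (72.0, 99.5, 200.0, 100.5),   # (x0, y0, x1, y1)
    'color': 0x0000FF,                    # sRGB packed into an int (assumption)
    # Stroke-specific keys
    'start': (72.0, 100.0),
    'end': (200.0, 100.0),
    'width': 1.0,
}

hyperlink_raw = {
    'type': 0,                            # placeholder value (assumption)
    'bbox': (72.0, 120.0, 180.0, 132.0),  # hot area of the link
    'color': 0x000000,
    # Hyperlink-specific key
    'uri': 'https://example.com',
}


def is_horizontal(raw: dict, tol: float = 1e-3) -> bool:
    """Hypothetical helper: a stroke is horizontal if start and end share y."""
    (_, y0), (_, y1) = raw['start'], raw['end']
    return abs(y0 - y1) <= tol
```

Keys that do not apply to a given shape kind are simply absent in this sketch; whether the library stores them that way is not stated here.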

Note

These coordinates are relative to the real page coordinate system (CS), since they are extracted from page.get_drawings(), which works in the real page CS. There is therefore no need to multiply by Element.ROTATION_MATRIX when initializing from the source dict.

class pdf2docx.shape.Shape.Fill(raw: Optional[dict] = None)#

Bases: Shape

Rectangular (bbox) filling area of a closed path. The semantic meaning may be table shading, or text style like highlight.

bbox: Rect#
property default_type#

Default semantic type for a Fill shape: table shading or text highlight.

to_stroke(max_border_width: float)#

Convert to Stroke instance based on width criterion.

Args:

max_border_width (float): The stroke width must be less than this value.

Returns:

Stroke: Stroke instance.

Note

A shape that is geometrically a Fill may act as a Stroke in terms of content, for example a thin filled rectangle drawn as a border line. The criterion here is whether its width is smaller than the given max_border_width.
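The width criterion can be sketched as follows. This is a standalone illustration, not the library's implementation: the function fill_to_stroke, the use of the shorter bbox side as the "width", and the returned dict layout are all assumptions.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]


def fill_to_stroke(bbox: BBox, max_border_width: float) -> Optional[dict]:
    """Hypothetical sketch: treat a thin filled rectangle as a stroke.

    Returns a dict with 'start', 'end' and 'width', or None when the
    rectangle is too thick to be considered a border line.
    """
    x0, y0, x1, y1 = bbox
    w, h = x1 - x0, y1 - y0

    # Assumption: the shorter side plays the role of the stroke width.
    if min(w, h) >= max_border_width:
        return None

    if w >= h:
        # Horizontal stroke running along the vertical centre of the bbox.
        yc = (y0 + y1) / 2.0
        return {'start': (x0, yc), 'end': (x1, yc), 'width': h}

    # Vertical stroke running along the horizontal centre of the bbox.
    xc = (x0 + x1) / 2.0
    return {'start': (xc, y0), 'end': (xc, y1), 'width': w}
```

Under these assumptions, a 100 x 2 rectangle with max_border_width=6.0 becomes a horizontal stroke of width 2, while a 100 x 50 rectangle stays a fill.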

class pdf2docx.shape.Shape.Hyperlink(raw: Optional[dict] = None)#

Bases: Shape

Rectangular area, i.e. hot area for a hyperlink.

A hyperlink in PyMuPDF is represented by its uri and its hot area, while the text it applies to isn’t extracted explicitly. To reuse the process that identifies the text a text style shape (e.g. underline or highlight) applies to, a hyperlink is also abstracted as a Shape.

bbox: Rect#
property default_type#

Default semantic type for a Hyperlink: always hyperlink.

parse_semantic_type(blocks: Optional[list] = None)#

The semantic type of a Hyperlink shape is already determined: it is always RectType.HYPERLINK.

store()#

Store properties in raw dict.

class pdf2docx.shape.Shape.Shape(raw: Optional[dict] = None)#

Bases: Element

Shape object.

bbox: Rect#
property default_type#

Default semantic type for a shape.

equal_to_type(rect_type: RectType)#

Whether the shape type is equal to the specified one.

has_potential_type(rect_type: RectType)#

Whether the shape type could turn out to be the specified one.

property is_determined#

Whether the shape type has been resolved to a single basic item of RectType.
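One way to read equal_to_type, has_potential_type and is_determined together is as operations on combinable flags. The sketch below assumes that reading: RectTypeSketch and its values are hypothetical stand-ins, not the library's RectType, and the three functions are free-standing illustrations rather than the actual methods.

```python
from enum import IntFlag


class RectTypeSketch(IntFlag):
    """Hypothetical stand-in for RectType; names and values are assumptions."""
    HIGHLIGHT = 1
    UNDERLINE = 2
    STRIKE = 4
    HYPERLINK = 8
    BORDER = 16
    SHADING = 32


def equal_to_type(shape_type: RectTypeSketch, rect_type: RectTypeSketch) -> bool:
    # Exact match: the shape is this type and nothing else.
    return shape_type == rect_type


def has_potential_type(shape_type: RectTypeSketch, rect_type: RectTypeSketch) -> bool:
    # The candidate set still contains the requested type.
    return bool(shape_type & rect_type)


def is_determined(shape_type: RectTypeSketch) -> bool:
    # Determined means exactly one flag is set (a power of two).
    value = int(shape_type)
    return value > 0 and (value & (value - 1)) == 0


# A stroke may start out as "border, underline or strike-through".
stroke_default = (RectTypeSketch.BORDER
                  | RectTypeSketch.UNDERLINE
                  | RectTypeSketch.STRIKE)
```

With this representation, narrowing a shape's type amounts to clearing flags until a single one remains.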

parse_semantic_type(blocks: list)#

Determine the semantic type based on the shape's position relative to text blocks. Note that the result might be a combination of raw types, e.g. the semantic type of a stroke can be text strike-through, underline or table border.

Args:

blocks (list): A list of Line instances, sorted in reading order in advance.

plot(page, color)#

Plot rectangle shapes with PyMuPDF.

store()#

Store properties in raw dict.

property type#
class pdf2docx.shape.Shape.Stroke(raw: Optional[dict] = None)#

Bases: Shape

Horizontal or vertical stroke of a path. The semantic meaning may be table border, or text style line like underline and strike-through.

bbox: Rect#
property default_type#

Default semantic type for a Stroke shape: table border, underline or strike-through.

property horizontal#
store()#

Store properties in raw dict.

update_bbox(rect)#

Update stroke bbox (relative to real page CS).

  • Update start/end points if rect.area==0.

  • Update bbox directly if rect.area!=0.

Args:

rect (fitz.Rect, tuple): (x0, y0, x1, y1) like data.

Returns:

Stroke: self
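The two branches described above can be sketched with plain tuples. StrokeSketch is a hypothetical stand-in for illustration only: how the bbox is rebuilt from the start and end points (padding by half the width on every side) is an assumption, and the real class works with fitz.Rect rather than tuples.

```python
class StrokeSketch:
    """Hypothetical, simplified stand-in for a stroke; not the library class."""

    def __init__(self, start, end, width):
        self.start, self.end, self.width = start, end, width
        self.bbox = self._bbox_from_line()

    def _bbox_from_line(self):
        # Assumption: pad the centre line by half the width on every side.
        (x0, y0), (x1, y1) = self.start, self.end
        h = self.width / 2.0
        return (min(x0, x1) - h, min(y0, y1) - h,
                max(x0, x1) + h, max(y0, y1) + h)

    def update_bbox(self, rect):
        x0, y0, x1, y1 = rect
        if (x1 - x0) * (y1 - y0) == 0:
            # Zero-area rect: interpret it as the new centre line.
            self.start, self.end = (x0, y0), (x1, y1)
            self.bbox = self._bbox_from_line()
        else:
            # Non-zero area: take the rect as the bbox directly.
            self.bbox = (x0, y0, x1, y1)
        return self  # returning self allows call chaining


s = StrokeSketch(start=(0.0, 0.0), end=(10.0, 0.0), width=2.0)
s.update_bbox((0.0, 5.0, 20.0, 5.0))   # zero area: moves the centre line
```

In this sketch the non-zero-area branch leaves the start and end points untouched; whether the library also recomputes them in that case is not covered by the description above.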

property vertical#
property x0#
property x1#
property y0#
property y1#