pdf2docx.shape.Shape module#
Objects representing PDF strokes and fills extracted from a Path.
Stroke: only the horizontal or vertical path segments are considered
Fill: bbox of the filled area of a closed path
A hyperlink in PyMuPDF is represented as a URI and its rectangular area (hot area), while the text it applies to isn't extracted explicitly. To reuse the process that identifies the text a text-style shape applies to (e.g. underline and highlight), a hyperlink is also abstracted as a Shape.
Note
The evident difference between a hyperlink shape and a text-style shape is that the type of a hyperlink shape is determined in advance, while a text-style shape must be identified from its position relative to the associated text blocks.
In summary, the semantic meaning of a shape instance may be:
strike-through line of text
underline of text
highlight area of text
table border
cell shading
hyperlink
Data structure:
{
'type': int,
'bbox': (x0, y0, x1, y1),
'color': srgb_value,
# for Stroke
'start': (x0, y0),
'end': (x1, y1),
'width': float,
# for Hyperlink
'uri': str
}
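As an illustration of the layout above, here is a minimal sketch of two raw dicts, one for a stroke and one for a hyperlink. All values are made up, the `type` value of 0 is only a placeholder, and packing the sRGB color into a single integer is an assumption rather than something this page states.

```python
# Illustrative raw dicts following the documented keys; every value is made up.
stroke_raw = {
    "type": 0,                          # placeholder semantic type code
    "bbox": (72.0, 99.5, 300.0, 100.5),
    "color": 0x000000,                  # assumed: sRGB packed as one integer
    # Stroke-specific keys
    "start": (72.0, 100.0),
    "end": (300.0, 100.0),
    "width": 1.0,
}

hyperlink_raw = {
    "type": 0,                          # placeholder semantic type code
    "bbox": (72.0, 120.0, 180.0, 134.0),
    "color": 0x0000FF,                  # assumed: sRGB packed as one integer
    # Hyperlink-specific key
    "uri": "https://example.com",
}

# For this horizontal stroke, the bbox height matches the stroke width.
stroke_bbox_height = stroke_raw["bbox"][3] - stroke_raw["bbox"][1]
```

Here the stroke is described both by its center line (`start`, `end`, `width`) and by the `bbox` that encloses it.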
Note
These coordinates are relative to the real page CS since they’re extracted from page.get_drawings(), which is based on the real page CS. So there is no need to multiply by Element.ROTATION_MATRIX when initializing from the source dict.
- class pdf2docx.shape.Shape.Fill(raw: Optional[dict] = None)#
Bases: Shape
Rectangular (bbox) filling area of a closed path. The semantic meaning may be table shading, or text style like highlight.
- bbox: Rect#
- property default_type#
Default semantic type for a Fill shape: table shading or text highlight.
- to_stroke(max_border_width: float)#
Convert to Stroke instance based on width criterion.
- Args:
max_border_width (float): Stroke width must be less than this value.
- Returns:
Stroke: Stroke instance.
Note
A Fill from the shape point of view may be a Stroke from the content point of view. The criterion here is whether the width is smaller than the defined max_border_width.
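The width criterion can be sketched without the library. The following is a minimal illustration, not pdf2docx's implementation: `fill_as_stroke` is a hypothetical helper, and returning `None` when the fill is too thick is an assumption, since this page does not say what `to_stroke` does in that case.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]


def fill_as_stroke(bbox: BBox, max_border_width: float) -> Optional[dict]:
    """Hypothetical helper: treat a thin filled rectangle as a stroke.

    Returns a dict with the stroke's center line and width, or None when
    the shorter side is not below max_border_width (assumed behavior).
    """
    x0, y0, x1, y1 = bbox
    w, h = x1 - x0, y1 - y0
    if min(w, h) >= max_border_width:
        return None  # too thick to be a border or text-style line
    if w >= h:
        # Horizontal: the center line runs along the middle of the height.
        yc = (y0 + y1) / 2
        return {"start": (x0, yc), "end": (x1, yc), "width": h}
    # Vertical: the center line runs along the middle of the width.
    xc = (x0 + x1) / 2
    return {"start": (xc, y0), "end": (xc, y1), "width": w}


thin = fill_as_stroke((0, 0, 100, 2), max_border_width=6.0)
thick = fill_as_stroke((0, 0, 100, 50), max_border_width=6.0)
```

In this sketch `thin` becomes a horizontal stroke of width 2, while `thick` stays a fill.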
- class pdf2docx.shape.Shape.Hyperlink(raw: Optional[dict] = None)#
Bases: Shape
Rectangular area, i.e. the hot area, for a hyperlink.
A hyperlink in PyMuPDF is represented as a URI and its hot area, while the text it applies to isn’t extracted explicitly. To reuse the process that identifies the text a text-style shape applies to (e.g. underline and highlight), a hyperlink is also abstracted as a Shape.
- bbox: Rect#
- property default_type#
Default semantic type for a Hyperlink: always hyperlink.
- parse_semantic_type(blocks: Optional[list] = None)#
The semantic type of a Hyperlink shape is already determined, i.e. RectType.HYPERLINK.
- store()#
Store properties in raw dict.
- class pdf2docx.shape.Shape.Shape(raw: Optional[dict] = None)#
Bases: Element
Shape object.
- bbox: Rect#
- property default_type#
Default semantic type for a shape.
- property is_determined#
Whether the shape type is determined, i.e. resolved to a basic item of RectType.
- parse_semantic_type(blocks: list)#
Determine the semantic type based on the position relative to text blocks. Note that the result might be a combination of raw types, e.g. the semantic type of a stroke can be either text strike, underline or table border.
- Args:
blocks (list): A list of Line instances, sorted in reading order in advance.
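One way to picture a type that is "a combination of raw types" versus one that is determined is with bit flags. The sketch below is an assumption-laden illustration: `ShapeTypeSketch`, its member values, and the `is_determined` function are hypothetical and are not taken from `RectType`; the page only states that the type is an int and may combine raw types.

```python
from enum import IntFlag


class ShapeTypeSketch(IntFlag):
    """Hypothetical flags; names and bit values are assumptions."""
    HIGHLIGHT = 1
    UNDERLINE = 2
    STRIKE = 4
    HYPERLINK = 8
    BORDER = 16
    SHADING = 32


def is_determined(type_code: int) -> bool:
    """True when exactly one flag bit is set, i.e. a single basic type."""
    return type_code > 0 and (type_code & (type_code - 1)) == 0


# A stroke may start out as any of border, underline or strike-through.
stroke_candidates = (
    ShapeTypeSketch.BORDER | ShapeTypeSketch.UNDERLINE | ShapeTypeSketch.STRIKE
)

# A hyperlink's type is known in advance.
hyperlink_type = ShapeTypeSketch.HYPERLINK
```

Under these assumptions, `stroke_candidates` is undetermined until positional analysis narrows it to one flag, while `hyperlink_type` is determined from the start.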
- plot(page, color)#
Plot rectangle shapes with PyMuPDF.
- store()#
Store properties in raw dict.
- property type#
- class pdf2docx.shape.Shape.Stroke(raw: Optional[dict] = None)#
Bases: Shape
Horizontal or vertical stroke of a path. The semantic meaning may be table border, or text style line like underline and strike-through.
- bbox: Rect#
- property default_type#
Default semantic type for a Stroke shape: table border, underline or strike-through.
- property horizontal#
- store()#
Store properties in raw dict.
- update_bbox(rect)#
Update stroke bbox (relative to real page CS).
Update start/end points if rect.area==0.
Update bbox directly if rect.area!=0.
- Args:
rect (fitz.Rect, tuple): (x0, y0, x1, y1)-like data.
- Returns:
Stroke: self
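The two branches of `update_bbox` can be sketched with a small stand-in class. This is a minimal illustration under stated assumptions, not the library's code: `StrokeSketch` is hypothetical, plain tuples replace `fitz.Rect`, and deriving the bbox by padding the center line by half the stroke width is an assumption.

```python
class StrokeSketch:
    """Hypothetical stand-in for a stroke with a center line and a bbox."""

    def __init__(self, start, end, width):
        self.start, self.end, self.width = start, end, width
        self.bbox = self._bbox_from_center_line()

    def _bbox_from_center_line(self):
        # Assumed rule: pad the center line by half the width on every side.
        (x0, y0), (x1, y1) = self.start, self.end
        half = self.width / 2
        return (min(x0, x1) - half, min(y0, y1) - half,
                max(x0, x1) + half, max(y0, y1) + half)

    def update_bbox(self, rect):
        x0, y0, x1, y1 = rect
        if (x1 - x0) * (y1 - y0) == 0:
            # Zero-area rect: read it as a new center line (start/end points).
            self.start, self.end = (x0, y0), (x1, y1)
            self.bbox = self._bbox_from_center_line()
        else:
            # Non-zero area: take the rect as the bbox directly.
            self.bbox = (x0, y0, x1, y1)
        return self  # returning self allows call chaining


s = StrokeSketch(start=(0, 10), end=(100, 10), width=2)
moved = s.update_bbox((0, 20, 50, 20))  # zero area: updates start/end
```

Returning `self` mirrors the documented return value and lets calls be chained.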
- property vertical#
- property x0#
- property x1#
- property y0#
- property y1#