pdf2docx.common.Element module#

Object with a bounding box, e.g. Block, Line, Span.

PyMuPDF generally provides coordinates (e.g. the bbox from page.get_text('rawdict')) relative to the un-rotated page, whereas pdf2docx works in the real page coordinate system (CS), i.e. with rotation considered. Any instance created by this class therefore has a rotation matrix applied automatically.

Therefore, the bbox parameter used to create an Element instance MUST be relative to the un-rotated CS. If final coordinates are provided instead, create an empty object first and then update its bbox:

Element().update_bbox(final_bbox)

Note

An exception is page.get_drawings(), whose coordinates are already converted to the real page CS.
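
To illustrate why the distinction matters, the sketch below maps a bbox from un-rotated coordinates into a rotated page CS using a plain-Python affine transform. The helper transform_rect and the 90-degree matrix are illustrative assumptions, not pdf2docx code; the matrix only follows the (a, b, c, d, e, f) layout of fitz.Matrix.

```python
def transform_rect(rect, matrix):
    """Apply an affine matrix (a, b, c, d, e, f) to a bbox (x0, y0, x1, y1).

    Hypothetical helper for illustration: each corner is mapped as
    x' = a*x + c*y + e and y' = b*x + d*y + f, and the result is the
    axis-aligned box enclosing the four transformed corners.
    """
    a, b, c, d, e, f = matrix
    x0, y0, x1, y1 = rect
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Assumed example: a page with un-rotated height 800 displayed rotated by
# 90 degrees, so that a point (x, y) maps to (800 - y, x).
rotate_90 = (0, 1, -1, 0, 800, 0)
raw_bbox = (10, 20, 110, 40)   # relative to the un-rotated page
final_bbox = transform_rect(raw_bbox, rotate_90)
print(final_bbox)
```

The wide, flat raw box becomes a tall, narrow one after rotation, which is why passing already-rotated coordinates to the constructor would rotate them a second time.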

class pdf2docx.common.Element.Element(raw: Optional[dict] = None, parent=None)#

Bases: IText

Bounding box stored as an attribute of type fitz.Rect.

ROTATION_MATRIX = Matrix(1.0, 0.0, -0.0, 1.0, 0.0, 0.0)#

contains(e: Element, threshold: float = 1.0)#

Check whether the given element is contained in this instance, with margin considered.

Args:

e (Element): Target element.

threshold (float, optional): Intersection rate. Defaults to 1.0. The larger, the stricter.

Returns:

bool: True if e is contained in this instance.
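
One way to read the threshold is as the fraction of the target's area that must lie inside this instance. The sketch below follows that reading with plain tuples; the function names and the area-ratio definition are assumptions for illustration, and the library may apply further checks.

```python
def area(rect):
    """Area of a bbox (x0, y0, x1, y1); zero if the box is empty or inverted."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersect(a, b):
    """Intersection of two bboxes; the result may have zero area."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def contains_sketch(outer, inner, threshold=1.0):
    """True if at least `threshold` of `inner`'s area lies inside `outer`."""
    inner_area = area(inner)
    if inner_area == 0:
        return False
    return area(intersect(outer, inner)) / inner_area >= threshold


outer = (0, 0, 100, 50)
inside = (10, 10, 20, 20)     # fully inside
overhang = (90, 10, 110, 30)  # half of this box sticks out to the right

print(contains_sketch(outer, inside))
print(contains_sketch(outer, overhang))
print(contains_sketch(outer, overhang, threshold=0.5))
```

With the default threshold of 1.0 only the fully enclosed box passes; relaxing it to 0.5 also accepts the box that overhangs by half.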

copy()#

Make a deep copy.

get_expand_bbox(dt: float)#

Get the bbox expanded by a margin in both the x- and y-directions.

Args:

dt (float): Expanding margin.

Returns:

fitz.Rect: Expanded bbox.

Note

This method creates a new bbox, rather than changing the bbox of itself.
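
A minimal sketch of the expansion on a plain tuple, assuming the margin dt is applied to all four sides; expand_bbox is a hypothetical helper, not the library method.

```python
def expand_bbox(rect, dt):
    """Return a new bbox grown by dt on every side; `rect` is left untouched."""
    x0, y0, x1, y1 = rect
    return (x0 - dt, y0 - dt, x1 + dt, y1 + dt)


bbox = (10.0, 20.0, 30.0, 40.0)
expanded = expand_bbox(bbox, 2.0)
print(expanded)  # the grown copy
print(bbox)      # the original values are unchanged
```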

get_main_bbox(e, threshold: float = 0.95)#

If the intersection with e exceeds the threshold, return the union of these two elements; else return None.

Args:

e (Element): Target element.

threshold (float, optional): Intersection rate. Defaults to 0.95.

Returns:

fitz.Rect: Union bbox or None.
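
The rule can be sketched with plain tuples as follows. The sketch assumes that the intersection rate is the intersection area divided by the area of the smaller box; that definition and the name main_bbox_sketch are assumptions for illustration.

```python
def main_bbox_sketch(a, b, threshold=0.95):
    """Return the union of bboxes a and b if they overlap enough, else None."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    if ix1 <= ix0 or iy1 <= iy0:
        return None  # the boxes do not overlap
    inter = (ix1 - ix0) * (iy1 - iy0)
    # Assumed definition: compare against the smaller of the two areas.
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    if smaller <= 0 or inter / smaller < threshold:
        return None
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


big = (0, 0, 100, 100)
small = (10, 10, 20, 20)     # lies entirely inside `big`
shifted = (95, 0, 195, 100)  # shares only a thin strip with `big`

print(main_bbox_sketch(big, small))
print(main_bbox_sketch(big, shifted))
```

The fully covered box yields the union, while the barely touching one yields None under the default threshold.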

horizontally_align_with(e, factor: float = 0.0, text_direction: bool = True)#

Check whether two Element instances have enough intersection in horizontal direction, i.e. along the reading direction.

Args:

e (Element): Element to check with.

factor (float, optional): Threshold of overlap ratio. The larger it is, the higher the probability that the two bboxes are aligned.

text_direction (bool, optional): Consider text direction or not. True by default.

Examples:

+--------------+
|              | L1  +--------------------+
+--------------+     |                    | L2
                     +--------------------+

Sufficient intersection is defined based on the minimum width of the two boxes:

L1+L2-L>factor*min(L1,L2)
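
The inequality can be checked on two 1-D ranges as in the sketch below. It assumes that L is the total length spanned by both boxes together, so that L1+L2-L is the length of their overlap; the text above does not define L, and enough_overlap is a hypothetical name.

```python
def enough_overlap(range_1, range_2, factor=0.0):
    """Check L1 + L2 - L > factor * min(L1, L2) for two ranges (start, end)."""
    l1 = range_1[1] - range_1[0]
    l2 = range_2[1] - range_2[0]
    # Assumed meaning of L: distance from the first start to the last end.
    total = max(range_1[1], range_2[1]) - min(range_1[0], range_2[0])
    return l1 + l2 - total > factor * min(l1, l2)


first = (0, 10)
second = (6, 20)   # overlaps `first` by 4 units
apart = (15, 20)   # does not touch `first`

print(enough_overlap(first, second, factor=0.3))  # 4 > 0.3 * 10
print(enough_overlap(first, second, factor=0.5))  # 4 > 0.5 * 10 fails
print(enough_overlap(first, apart))               # negative overlap
```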

in_same_row(e)#

Check whether this instance is in the same row/line as the specified Element instance, with text direction considered.

Taking horizontal text as an example:

  • yes: the bottom edge of each box is lower than the centerline of the other one;

  • otherwise, not in same row.

Args:

e (Element): Target object.

Note

The difference from the method horizontally_align_with: two elements may be aligned horizontally and still not be in the same line.
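
The horizontal-text rule above can be sketched on plain tuples. The sketch assumes bboxes of the form (x0, y0, x1, y1) with y growing downwards, so "lower" means a larger y value; the inclusive comparison and the name in_same_row_sketch are assumptions for illustration.

```python
def in_same_row_sketch(a, b):
    """Horizontal text only: each bottom edge must reach the other's centerline."""
    center_a = (a[1] + a[3]) / 2
    center_b = (b[1] + b[3]) / 2
    # y grows downwards, so "lower than" translates to ">=" on y values.
    return a[3] >= center_b and b[3] >= center_a


word = (0, 100, 50, 112)
neighbour = (60, 102, 120, 114)  # slightly shifted, same text line
next_line = (0, 120, 50, 132)    # one line further down
tall = (130, 100, 180, 200)      # shares a y-range with `word` but is much taller

print(in_same_row_sketch(word, neighbour))
print(in_same_row_sketch(word, next_line))
print(in_same_row_sketch(word, tall))
```

The last case shows the situation described in the note: the two boxes overlap in the y-direction, yet the short one ends well above the centerline of the tall one, so they are not treated as one row.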

property parent#

plot(page, stroke: tuple = (0, 0, 0), width: float = 0.5, fill: Optional[tuple] = None, dashes: Optional[str] = None)#

Plot the bbox on a PDF page for debugging purposes.

classmethod pure_rotation_matrix()#

Pure rotation matrix used for calculating text direction after rotation.

classmethod set_rotation_matrix(rotation_matrix)#

Set global rotation matrix.

Args:

rotation_matrix (fitz.Matrix): Target matrix.

store()#

Store properties in raw dict.

union_bbox(e)#

Update current bbox to the union with specified Element.

Args:

e (Element): The element to form the union with.

Returns:

Element: self
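
Because the method returns self, calls can be chained. The sketch below shows the pattern with a hypothetical stand-in class Box that only stores a bbox tuple; it is not the Element class.

```python
class Box:
    """Hypothetical stand-in for Element that only stores a bbox tuple."""

    def __init__(self, bbox):
        self.bbox = tuple(bbox)

    def union_bbox(self, other):
        """Grow this box in place to cover `other`, then return self."""
        a, b = self.bbox, other.bbox
        self.bbox = (min(a[0], b[0]), min(a[1], b[1]),
                     max(a[2], b[2]), max(a[3], b[3]))
        return self


line = Box((0, 0, 10, 5))
# Each call updates `line` in place and hands it back, so calls chain.
merged = line.union_bbox(Box((8, 0, 30, 6))).union_bbox(Box((28, 1, 40, 5)))
print(merged.bbox)
print(merged is line)
```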

update_bbox(rect)#

Update current bbox to specified rect.

Args:

rect (fitz.Rect or list): bbox-like (x0, y0, x1, y1), in real page CS (with rotation considered).

vertically_align_with(e, factor: float = 0.0, text_direction: bool = True)#

Check whether two Element instances have enough intersection in vertical direction, i.e. perpendicular to reading direction.

Args:

e (Element): Object to check with.

factor (float, optional): Threshold of overlap ratio. The larger it is, the higher the probability that the two bboxes are aligned.

text_direction (bool, optional): Consider text direction or not. True by default.

Returns:

bool: True if the two elements have enough intersection in the vertical direction.

Examples:

+--------------+
|              |
+--------------+
        L1
        +-------------------+
        |                   |
        +-------------------+
                L2

Sufficient intersection is defined based on the minimum width of the two boxes:

L1+L2-L>factor*min(L1,L2)
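
For the stacked boxes in the sketch, the check can be written directly on bbox tuples. The example assumes horizontal text, so that L1 and L2 are the box widths, and assumes that L is the combined x-extent of both boxes; x_overlap_enough is a hypothetical name.

```python
def x_overlap_enough(bbox_1, bbox_2, factor=0.0):
    """Apply L1 + L2 - L > factor * min(L1, L2) to the x-ranges of two bboxes."""
    l1 = bbox_1[2] - bbox_1[0]
    l2 = bbox_2[2] - bbox_2[0]
    # Assumed meaning of L: leftmost x0 to rightmost x1 of the two boxes.
    total = max(bbox_1[2], bbox_2[2]) - min(bbox_1[0], bbox_2[0])
    return l1 + l2 - total > factor * min(l1, l2)


upper = (0, 0, 100, 20)
lower = (60, 30, 220, 50)  # shifted right, shares 40 units of x-range

print(x_overlap_enough(upper, lower, factor=0.3))  # 40 > 0.3 * 100
print(x_overlap_enough(upper, lower, factor=0.5))  # 40 > 0.5 * 100 fails
```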