pdf2docx.common.Collection module#

A group of instances, e.g. Blocks, Lines, Spans, Shapes.

class pdf2docx.common.Collection.BaseCollection(instances: Optional[list] = None, parent=None)#

Bases: object

Base collection representing a list of instances.

append(instance)#
property bbox#

bbox of combined collection.
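As a rough illustration of what a combined bbox means, the sketch below takes the union of plain `(x0, y0, x1, y1)` tuples. The helper name `union_bbox`, the tuple representation, and the `None` result for an empty input are assumptions made for this example, not the library's implementation.

```python
def union_bbox(rects):
    """Return the smallest (x0, y0, x1, y1) box enclosing all input boxes.

    Hypothetical helper: boxes are plain tuples, not fitz.Rect objects.
    """
    if not rects:
        return None  # assumption: no meaningful bbox for an empty group
    x0 = min(r[0] for r in rects)
    y0 = min(r[1] for r in rects)
    x1 = max(r[2] for r in rects)
    y1 = max(r[3] for r in rects)
    return (x0, y0, x1, y1)
```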

extend(instances: list)#
property parent#
reset(instances: Optional[list] = None)#

Reset instances list.

Args:

instances (list, optional): reset to target instances. Defaults to None.

Returns:

BaseCollection: self

restore(*args, **kwargs)#

Construct Collection from a list of dicts.

store()#

Store attributes in JSON format.

class pdf2docx.common.Collection.Collection(instances: Optional[list] = None, parent=None)#

Bases: BaseCollection, IText

Collection of instances, focused on grouping and sorting elements.

group(fun)#

Group instances according to user defined criterion.

Args:

fun (function): takes two instances (Element) as arguments and returns a bool.

Returns:

list: a list of grouped Collection instances.

Example 1:

# group instances that intersect each other
fun = lambda a,b: a.bbox & b.bbox

Example 2:

# group instances aligned horizontally
fun = lambda a,b: a.horizontally_aligned_with(b)

Note

This is equivalent to a graph search problem: build an adjacency list, then search the graph to find all connected components.
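The idea in the note can be sketched without any pdf2docx objects. The function below is a minimal illustration, not the library's code: `group_indices` is a hypothetical name, items can be any objects, and the result is a list of index groups rather than Collection instances.

```python
from collections import defaultdict

def group_indices(items, fun):
    """Group item indices into connected components under a pairwise predicate."""
    n = len(items)

    # Build the adjacency list by testing every unordered pair once.
    adjacent = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):
            if fun(items[i], items[j]):
                adjacent[i].add(j)
                adjacent[j].add(i)

    # Depth-first search collects one connected component per unseen start.
    seen = set()
    groups = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            k = stack.pop()
            component.append(k)
            for m in adjacent[k]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        groups.append(sorted(component))
    return groups
```

Note that grouping is transitive: two items that do not satisfy the criterion directly still end up in the same group when a chain of matching items links them.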

group_by_columns(factor: float = 0.0, sorted: bool = True, text_direction: bool = False)#

Group elements into columns based on the bbox.

group_by_connectivity(dx: float, dy: float)#

Collect connected instances into the same group.

Args:

dx (float): x-tolerance used to define connectivity.

dy (float): y-tolerance used to define connectivity.

Returns:

list: a list of grouped Collection instances.

Note

  • This is equivalent to a graph traversal problem, where the critical step is building the adjacency list, especially when there is a large number of vertices (paths).

  • Checking intersections between paths is a rectangle-intersection problem, which has been studied extensively in the literature.
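A small sketch of tolerance-based connectivity follows. It assumes boxes are `(x0, y0, x1, y1)` tuples and that `dx` and `dy` are the largest gaps still counted as connected; the library may define the tolerances differently. The names `is_connected` and `group_by_tolerance` are hypothetical, and the brute-force pair check stands in for the more efficient rectangle-intersection algorithms the note alludes to.

```python
def is_connected(a, b, dx, dy):
    """True if boxes a and b overlap, or are separated by at most dx / dy."""
    return (a[0] - dx <= b[2] and b[0] - dx <= a[2] and
            a[1] - dy <= b[3] and b[1] - dy <= a[3])

def group_by_tolerance(rects, dx, dy):
    """Return sorted index groups of mutually connected boxes (union-find)."""
    parent = list(range(len(rects)))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) pair check; fine for a sketch, slow for many paths.
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if is_connected(rects[i], rects[j], dx, dy):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(rects)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```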

group_by_physical_rows(sorted: bool = False, text_direction: bool = False)#

Group lines into physical rows.

group_by_rows(factor: float = 0.0, sorted: bool = True, text_direction: bool = False)#

Group elements into rows based on the bbox.

sort_in_line_order()#

Sort collection instances in physical line order, taking text direction into account, e.g. from left to right for normal reading direction.

sort_in_reading_order()#

Sort collection instances in reading order (considering text direction), e.g. for normal reading direction: from top to bottom, from left to right.

sort_in_reading_order_plus()#

Sort instances in reading order, paying particular attention to instances in the same row. Taking the natural reading direction as an example: rows follow reading order, and instances within a row run from left to right. In the following example, A comes before B:

             +-----------+
+---------+  |           |
|   A     |  |     B     |
+---------+  +-----------+

Steps:

  • Sort elements in reading order, i.e. from top to bottom, from left to right.

  • Group elements into rows.

  • Sort the elements in each row from left to right.
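The three steps can be sketched on plain `(x0, y0, x1, y1)` tuples with y increasing downward. This is an illustrative approximation only: `sort_reading_order_plus` is a hypothetical free function, and it treats any vertical overlap with the current row as "same row", which is simpler than whatever row criterion the library applies.

```python
def sort_reading_order_plus(boxes):
    """Sort boxes row by row, left to right within each row."""
    # Step 1: plain reading order, top to bottom and then left to right.
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))

    # Step 2: group into rows; a box joins the current row when its top
    # edge lies above the lowest bottom edge seen in that row so far.
    rows = []
    for box in ordered:
        if rows and box[1] < rows[-1]["bottom"]:
            rows[-1]["items"].append(box)
            rows[-1]["bottom"] = max(rows[-1]["bottom"], box[3])
        else:
            rows.append({"bottom": box[3], "items": [box]})

    # Step 3: within each row, sort from left to right.
    result = []
    for row in rows:
        result.extend(sorted(row["items"], key=lambda b: b[0]))
    return result
```

With the boxes from the diagram, a plain top-to-bottom sort would put B first because its top edge is higher; grouping into rows before the left-to-right sort is what places A ahead of B.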

property text_direction#

Get text direction. All instances must have the same text direction.

class pdf2docx.common.Collection.ElementCollection(instances: Optional[list] = None, parent=None)#

Bases: Collection

Collection of Element instances.

append(e: Element)#

Append an instance, update parent’s bbox accordingly and set the parent of the added instance.

Args:

e (Element): instance to append.

contained_in_bbox(bbox)#

Filter instances contained in target bbox.

Args:

bbox (fitz.Rect): target bounding box.

insert(nth: int, e: Element)#

Insert an Element and update the parent’s bbox accordingly.

Args:

nth (int): the position to insert.

e (Element): the instance to insert.

is_flow_layout(line_separate_threshold: float, cell_layout=False)#

Whether contained elements are in flow layout or not.

pop(nth: int)#

Delete the nth instance.

Args:

nth (int): the position to remove.

Returns:

Collection: the removed instance.

split_with_intersection(bbox: Rect, threshold: float = 0.001)#

Split instances into two groups: one that intersects with bbox and one that does not.

Args:

bbox (fitz.Rect): target rect box.

threshold (float): an instance counts as intersecting when its overlap rate exceeds this threshold. Defaults to 0.001.

Returns:

tuple: two groups, each of the original class type.
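The split can be illustrated with a short sketch on `(x0, y0, x1, y1)` tuples. The overlap rate is assumed here to be the intersection area divided by the instance's own area; the library may compute it differently. `split_by_overlap` is a hypothetical name, and it returns plain lists rather than collections of the original class.

```python
def split_by_overlap(rects, bbox, threshold=0.001):
    """Split boxes into (intersecting, not intersecting) relative to bbox."""
    def area(r):
        # Negative extents mean the box is empty, so clamp them to zero.
        return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

    hit, miss = [], []
    for r in rects:
        inter = (max(r[0], bbox[0]), max(r[1], bbox[1]),
                 min(r[2], bbox[2]), min(r[3], bbox[3]))
        own = area(r)
        # Assumed definition: share of the instance covered by bbox.
        rate = area(inter) / own if own else 0.0
        (hit if rate > threshold else miss).append(r)
    return hit, miss
```

A non-zero threshold keeps boxes that merely touch or barely graze the target out of the intersecting group.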