pdf2docx.common.Collection module#
A group of instances, e.g. Blocks, Lines, Spans, Shapes.
- class pdf2docx.common.Collection.BaseCollection(instances: Optional[list] = None, parent=None)#
Bases: object
Base collection representing a list of instances.
- append(instance)#
- property bbox#
bbox of combined collection.
- extend(instances: list)#
- property parent#
- reset(instances: Optional[list] = None)#
Reset instances list.
- Args:
instances (list, optional): reset to target instances. Defaults to None.
- Returns:
BaseCollection: self
- restore(*args, **kwargs)#
Construct a Collection from a list of dicts.
- store()#
Store attributes in json format.
- class pdf2docx.common.Collection.Collection(instances: Optional[list] = None, parent=None)#
Bases: BaseCollection, IText
Collection of instances, focusing on grouping and sorting elements.
- group(fun)#
Group instances according to user defined criterion.
- Args:
fun (function): a function that takes two instances (Element) and returns a bool.
- Returns:
list: a list of grouped Collection instances.
Example 1:
# group instances intersected with each other
fun = lambda a,b: a.bbox & b.bbox
Example 2:
# group instances aligned horizontally
fun = lambda a,b: a.horizontally_aligned_with(b)
Note
This is equivalent to a graph search problem: build an adjacency list, then search the graph to find all connected components.
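The graph formulation in the note can be illustrated with a minimal, standalone sketch. It is not the library's implementation: the helper names `group_connected` and `intersects` are hypothetical, and plain `(x0, y0, x1, y1)` tuples stand in for Element instances.

```python
from collections import deque


def group_connected(instances, fun):
    """Group items into connected components under the pairwise predicate `fun`.

    Hypothetical helper for illustration; not part of pdf2docx.
    """
    n = len(instances)

    # Build the adjacency list: i and j are neighbours when fun(a, b) is truthy.
    adjacent = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if fun(instances[i], instances[j]):
                adjacent[i].append(j)
                adjacent[j].append(i)

    # A breadth-first search from each unvisited vertex yields one component.
    visited = [False] * n
    groups = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        component = []
        while queue:
            k = queue.popleft()
            component.append(k)
            for m in adjacent[k]:
                if not visited[m]:
                    visited[m] = True
                    queue.append(m)
        # Keep the original input order inside each group.
        groups.append([instances[k] for k in sorted(component)])
    return groups


def intersects(a, b):
    """True when two (x0, y0, x1, y1) boxes overlap with positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12), (2.5, 2.5, 4, 4)]
groups = group_connected(boxes, intersects)
```

The first and last boxes do not overlap directly, yet they land in the same group because both overlap the second box. That transitive behaviour is what the connected-components view provides.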
- group_by_columns(factor: float = 0.0, sorted: bool = True, text_direction: bool = False)#
Group elements into columns based on the bbox.
- group_by_connectivity(dx: float, dy: float)#
Collect connected instances into same group.
- Args:
dx (float): x-tolerance to define connectivity.
dy (float): y-tolerance to define connectivity.
- Returns:
list: a list of grouped Collection instances.
Note
This is equivalent to a graph traversal problem, where the critical step is building the adjacency list, especially for a large number of vertices (paths).
Checking intersections between paths is a rectangle-intersection problem, which is well studied in the literature.
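As a rough illustration of why building the adjacency list is the expensive part, the sketch below sorts boxes by their left edge so that the inner loop can stop early. It is an assumption about one possible approach, not the library's code: `connected` and `build_adjacency` are hypothetical names, boxes are `(x0, y0, x1, y1)` tuples, and two boxes are treated as connected when the gap between them is at most `dx` horizontally and `dy` vertically.

```python
def connected(a, b, dx, dy):
    """True when boxes a and b overlap or lie within dx / dy of each other."""
    # Gap along each axis; zero or negative means the projections touch or overlap.
    gap_x = max(a[0], b[0]) - min(a[2], b[2])
    gap_y = max(a[1], b[1]) - min(a[3], b[3])
    return gap_x <= dx and gap_y <= dy


def build_adjacency(boxes, dx, dy):
    """Return {index: set of connected indices}, pruning pairs by left edge."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    adjacent = {i: set() for i in range(len(boxes))}
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            # Later boxes start even further right, so none of them can be
            # within dx of box i once this one is out of reach.
            if boxes[j][0] - boxes[i][2] > dx:
                break
            if connected(boxes[i], boxes[j], dx, dy):
                adjacent[i].add(j)
                adjacent[j].add(i)
    return adjacent


boxes = [(0, 0, 10, 10), (11, 0, 20, 10), (40, 0, 50, 10), (0, 12, 10, 20)]
adjacency = build_adjacency(boxes, dx=1.5, dy=1.0)
```

The resulting adjacency mapping can then be searched for connected components. The early `break` avoids comparing every pair when boxes are spread out horizontally, although the worst case remains quadratic.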
- group_by_physical_rows(sorted: bool = False, text_direction: bool = False)#
Group lines into physical rows.
- group_by_rows(factor: float = 0.0, sorted: bool = True, text_direction: bool = False)#
Group elements into rows based on the bbox.
- sort_in_line_order()#
Sort collection instances within a physical line, taking text direction into account, e.g. from left to right for normal reading direction.
- sort_in_reading_order()#
Sort collection instances in reading order (considering text direction), e.g. for normal reading direction: from top to bottom, from left to right.
- sort_in_reading_order_plus()#
Sort instances in reading order, paying particular attention to instances in the same row. Taking the natural reading direction as an example: rows follow reading order, and instances within a row go from left to right. In the following example, A comes before B:
             +-----------+
+---------+  |           |
|   A     |  |     B     |
+---------+  +-----------+
Steps:
Sort elements in reading order, i.e. from top to bottom, from left to right.
Group elements in row.
Sort elements in row: from left to right.
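The three steps above can be sketched on plain `(x0, y0, x1, y1)` tuples. This is a simplified illustration under stated assumptions, not the library's implementation: it handles only the normal left-to-right, top-to-bottom direction, and it assumes a box joins the current row whenever its top edge lies above the row's bottom edge. The function name `reading_order_plus` is hypothetical.

```python
def reading_order_plus(boxes):
    """Sort boxes row by row, then left to right inside each row."""
    # Step 1: plain reading order, top to bottom and then left to right.
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))

    # Step 2: group into rows. Boxes arrive sorted by top edge, so a box
    # overlaps the current row exactly when it starts above the row's bottom.
    rows = []
    for box in ordered:
        if rows and box[1] < rows[-1]["bottom"]:
            rows[-1]["items"].append(box)
            rows[-1]["bottom"] = max(rows[-1]["bottom"], box[3])
        else:
            rows.append({"bottom": box[3], "items": [box]})

    # Step 3: left to right inside each row.
    result = []
    for row in rows:
        result.extend(sorted(row["items"], key=lambda b: b[0]))
    return result


# A starts lower than B but sits to its left, as in the diagram above.
A = (0, 10, 10, 20)
B = (12, 0, 24, 20)
C = (0, 30, 24, 40)

plain = sorted([A, B, C], key=lambda b: (b[1], b[0]))
plus = reading_order_plus([A, B, C])
```

With plain top-to-bottom sorting, B precedes A because its top edge is higher. Grouping by row first places A before B, and C follows as a separate row.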
- property text_direction#
Get text direction. All instances must have the same text direction.
- class pdf2docx.common.Collection.ElementCollection(instances: Optional[list] = None, parent=None)#
Bases:
Collection
Collection of Element instances.
- append(e: Element)#
Append an instance, update parent’s bbox accordingly and set the parent of the added instance.
- Args:
e (Element): instance to append.
- contained_in_bbox(bbox)#
Filter instances contained in target bbox.
- Args:
bbox (fitz.Rect): target boundary box.
- insert(nth: int, e: Element)#
Insert an Element and update parent’s bbox accordingly.
- Args:
nth (int): the position to insert.
e (Element): the instance to insert.
- is_flow_layout(line_separate_threshold: float, cell_layout=False)#
Whether contained elements are in flow layout or not.
- pop(nth: int)#
Delete the nth instance.
- Args:
nth (int): the position to remove.
- Returns:
Collection: the removed instance.
- split_with_intersection(bbox: Rect, threshold: float = 0.001)#
Split instances into two groups: one intersects with bbox, the other not.
- Args:
bbox (fitz.Rect): target rect box.
threshold (float): an instance counts as intersected when its overlap rate exceeds this threshold. Defaults to 0.001.
- Returns:
tuple: two groups, each in the original class type.
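A minimal sketch of threshold-based splitting follows. The documentation does not define the overlap rate, so this example assumes it is the intersection area divided by the instance's own area. The helper names `area` and `split_by_overlap` are hypothetical, boxes are `(x0, y0, x1, y1)` tuples rather than `fitz.Rect` objects, and plain lists are returned rather than collections of the original class type.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; zero for empty or inverted boxes."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def split_by_overlap(boxes, bbox, threshold=0.001):
    """Split boxes into (intersecting, not intersecting) relative to bbox.

    Assumption: overlap rate = intersection area / area of the box itself.
    """
    hit, miss = [], []
    for box in boxes:
        inter = (
            max(box[0], bbox[0]),
            max(box[1], bbox[1]),
            min(box[2], bbox[2]),
            min(box[3], bbox[3]),
        )
        own = area(box)
        rate = area(inter) / own if own > 0 else 0.0
        (hit if rate > threshold else miss).append(box)
    return hit, miss


boxes = [(0, 0, 10, 10), (8, 8, 20, 20), (30, 30, 40, 40)]
target = (0, 0, 9, 9)

loose_hit, loose_miss = split_by_overlap(boxes, target)            # default threshold
strict_hit, strict_miss = split_by_overlap(boxes, target, 0.01)    # stricter threshold
```

The second box overlaps the target by only a sliver, about 0.7% of its own area. It counts as intersecting at the default threshold but moves to the other group once the threshold is raised to 0.01.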