pdf2docx.common.algorithm module
- pdf2docx.common.algorithm.get_area(bbox_1: tuple, bbox_2: tuple)
- pdf2docx.common.algorithm.graph_bfs(graph)
Breadth-first search of a graph (which may be disconnected).
- Args:
graph (list): Graph represented by an adjacency list, [set(1,2,3), set(…), …]
- Returns:
list: A list of connected components
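To illustrate the idea, here is a minimal sketch of finding connected components with breadth-first search over an adjacency list of sets. It is not the library's implementation: the name `connected_components` is hypothetical, and returning each component as a sorted list of node indexes is an assumption made for readability.

```python
from collections import deque


def connected_components(graph):
    """Group node indexes into connected components (hypothetical helper).

    `graph` is an adjacency list: graph[i] is the set of neighbours of node i.
    """
    seen = set()
    components = []
    for start in range(len(graph)):
        if start in seen:
            continue
        # One breadth-first search from an unvisited node collects one component.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Nodes 0-1-2 form a chain, 3-4 form a pair, and 5 is isolated.
graph = [{1}, {0, 2}, {1}, {4}, {3}, set()]
components = connected_components(graph)
```

Starting a new search from every unvisited node is what makes the traversal work on a disconnected graph.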
- pdf2docx.common.algorithm.inner_contours(img_binary: array, bbox: tuple, min_w: float, min_h: float)
Inner contours of the current region, especially level 2 contours of the default OpenCV tree hierarchy.
- Args:
img_binary (np.array): Binarized image with region of interest (255) and empty region (0).
bbox (tuple): The external bbox.
min_w (float): Ignore contours if the bbox width is less than this value.
min_h (float): Ignore contours if the bbox height is less than this value.
- Returns:
list: A list of bbox-es of inner contours.
- pdf2docx.common.algorithm.recursive_xy_cut(img_binary: array, min_w: float = 0.0, min_h: float = 0.0, min_dx: float = 15.0, min_dy: float = 15.0)
Split image with recursive xy-cut algorithm.
- Args:
img_binary (np.array): Binarized image with region of interest (255) and empty region (0).
min_w (float): Ignore bbox if the width is less than this value.
min_h (float): Ignore bbox if the height is less than this value.
min_dx (float): Merge two bbox-es if the x-gap is less than this value.
min_dy (float): Merge two bbox-es if the y-gap is less than this value.
- Returns:
list: bbox (x0, y0, x1, y1) of split blocks.
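The following is a simplified sketch of a recursive xy-cut, intended only to show the principle: project the region onto the x axis, cut at empty gaps, project each strip onto the y axis, cut again, and recurse until a block can no longer be split. It is not the library's code. The names `runs`, `_cut` and `recursive_xy_cut_sketch` are hypothetical, nested lists stand in for `np.array` to keep the example dependency-free, bbox upper bounds are assumed to be exclusive, and the `min_w` / `min_h` filtering is omitted.

```python
def runs(profile, min_gap):
    """Return [start, end) ranges of non-zero runs in a projection profile,
    merging neighbouring runs whose gap is smaller than min_gap."""
    ranges, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            ranges.append([start, i])
            start = None
    if start is not None:
        ranges.append([start, len(profile)])
    merged = []
    for r in ranges:
        if merged and r[0] - merged[-1][1] < min_gap:
            merged[-1][1] = r[1]  # gap too small: extend the previous run
        else:
            merged.append(r)
    return merged


def _cut(img, box, min_dx, min_dy, out):
    x0, y0, x1, y1 = box
    # Projection onto the x axis: one sum per column of the region.
    cols = [sum(img[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
    for a, b in runs(cols, min_dx):
        xa, xb = x0 + a, x0 + b
        # Projection onto the y axis, restricted to the current vertical strip.
        rows = [sum(img[y][x] for x in range(xa, xb)) for y in range(y0, y1)]
        for c, d in runs(rows, min_dy):
            sub = (xa, y0 + c, xb, y0 + d)
            if sub == box:
                out.append(sub)  # nothing was trimmed or split: a final block
            else:
                _cut(img, sub, min_dx, min_dy, out)


def recursive_xy_cut_sketch(img, min_dx=15.0, min_dy=15.0):
    out = []
    if img and img[0]:
        _cut(img, (0, 0, len(img[0]), len(img)), min_dx, min_dy, out)
    return out


# 8x6 image with one block in the top-left and one in the bottom-right corner.
img = [[0] * 8 for _ in range(6)]
for y in range(0, 2):
    for x in range(0, 3):
        img[y][x] = 255
for y in range(4, 6):
    for x in range(5, 8):
        img[y][x] = 255
blocks = recursive_xy_cut_sketch(img, min_dx=2, min_dy=2)
```

Raising `min_dx` above the 2-pixel gap in this example (for instance to 3) keeps the two column runs merged, which is the effect the `min_dx` / `min_dy` parameters describe.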
- pdf2docx.common.algorithm.solve_rects_intersection(V: list, num: int, index_groups: list)
Implementation of solving the Rectangle-Intersection Problem.
Performance:
O(n log n + k) time and O(n) space, where k is the count of intersection pairs.
- Args:
V (list): Rectangle-related x-edges data, [(index, Rect, x), (…), …].
num (int): Count of V instances, equal to len(V).
index_groups (list): Target adjacency list for connectivity between rects.
Procedure detect(V, H, m):
if m < 2 then return, else:
- let V1 be the first ⌊m/2⌋ and let V2 be the rest of the vertical edges in V in the sorted order;
- let S11 and S22 be the set of rectangles represented only in V1 and V2 but not spanning V2 and V1, respectively;
- let S12 be the set of rectangles represented only in V1 and spanning V2;
- let S21 be the set of rectangles represented only in V2 and spanning V1;
- let H1 and H2 be the list of y-intervals corresponding to the elements of V1 and V2, respectively;
- stab(S12, S22); stab(S21, S11); stab(S12, S21);
- detect(V1, H1, ⌊m/2⌋); detect(V2, H2, m − ⌊m/2⌋)
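For orientation, the sketch below fills an adjacency list with a plain O(n²) brute-force comparison. This is deliberately not the divide-and-conquer procedure described above; it only shows the kind of result such a procedure is expected to produce and can serve as a reference when checking a faster implementation on small inputs. The names `intersects` and `brute_force_index_groups` are hypothetical, rectangles are assumed to be `(x0, y0, x1, y1)` tuples, and rectangles that merely touch along an edge are assumed not to intersect.

```python
def intersects(a, b):
    """True if two (x0, y0, x1, y1) rectangles overlap with positive area.

    Assumption: rectangles sharing only an edge or a corner do not count.
    """
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def brute_force_index_groups(rects):
    """Build an adjacency list: index_groups[i] holds the indexes of all
    rectangles that intersect rectangle i (hypothetical reference helper)."""
    index_groups = [set() for _ in rects]
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if intersects(rects[i], rects[j]):
                index_groups[i].add(j)
                index_groups[j].add(i)
    return index_groups


rects = [
    (0, 0, 2, 2),      # overlaps rect 1
    (1, 1, 3, 3),      # overlaps rects 0 and 3
    (5, 5, 6, 6),      # isolated
    (2.5, 0, 4, 1.5),  # overlaps rect 1 only
]
index_groups = brute_force_index_groups(rects)
```

An adjacency list of this shape is also the input format that `graph_bfs` documents, so the two can be chained to group mutually connected rectangles.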
- pdf2docx.common.algorithm.xy_project_profile(img_source: array, img_binary: array, gap: int = 5, dw: Optional[int] = None, dh: Optional[int] = None)
Projection profile along x and y direction.
```
       ┌────────────────┐
    dh │                │
       └────────────────┘
             gap
       ┌────────────────┐ ┌───┐
       │                │ │   │
     h │     image      │ │   │
       │                │ │   │
       └────────────────┘ └───┘
               w           dw
```
- Args:
img_source (np.array): Source image, e.g. RGB mode.
img_binary (np.array): Binarized image.
gap (int, optional): Gap between sub-graph. Defaults to 5.
dw (int, optional): Graph height of x projection profile. Defaults to None.
dh (int, optional): Graph height of y projection profile. Defaults to None.
- Returns:
np.array: The combined graph data.
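As a rough illustration of what a projection profile is, the sketch below counts foreground pixels per column and per row of a binarized image. It only computes the numbers and does not draw the combined graph that this function returns. The name `projection_profiles` is hypothetical, and nested lists are used in place of `np.array` so the example has no dependencies.

```python
def projection_profiles(img_binary):
    """Return (per_column, per_row) counts of non-zero pixels.

    Hypothetical helper: per_column has one entry per x position,
    per_row has one entry per y position.
    """
    height = len(img_binary)
    width = len(img_binary[0]) if height else 0
    per_column = [
        sum(1 for y in range(height) if img_binary[y][x]) for x in range(width)
    ]
    per_row = [sum(1 for value in row if value) for row in img_binary]
    return per_column, per_row


img = [
    [0, 255, 255, 0],
    [0, 255, 0, 0],
    [0, 0, 0, 0],
]
per_column, per_row = projection_profiles(img)
```

Zero entries in either profile mark empty columns or rows, which are the candidate cut positions for an xy-cut style layout split.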