pdf2docx.common.algorithm module

pdf2docx.common.algorithm.get_area(bbox_1: tuple, bbox_2: tuple)

pdf2docx.common.algorithm.graph_bfs(graph)

Breadth-first search over a graph, which may be disconnected.

Args:

graph (list): Graph represented as an adjacency list, [{1, 2, 3}, {…}, …]

Returns:

list: A list of connected components
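
As an illustration of the idea, here is a minimal sketch of breadth-first search over an adjacency list that collects connected components. It is not the library's implementation: the name `connected_components` is hypothetical, and returning each component as a list of node indices is an assumption, since the reference above does not state the element type.

```python
from collections import deque


def connected_components(graph):
    """graph[i] is the set of node indices adjacent to node i."""
    seen = set()
    components = []
    for start in range(len(graph)):
        if start in seen:
            continue  # already reached from an earlier start node
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(graph[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Two linked groups and one isolated node.
graph = [{1}, {0, 2}, {1}, {4}, {3}, set()]
components = connected_components(graph)
```

Restarting the search from every unvisited node is what lets the traversal cover a disconnected graph.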

pdf2docx.common.algorithm.inner_contours(img_binary: array, bbox: tuple, min_w: float, min_h: float)

Inner contours of the current region, in particular the level-2 contours of the default OpenCV tree hierarchy.

Args:

img_binary (np.array): Binarized image with interesting region (255) and empty region (0).
bbox (tuple): The external bbox.
min_w (float): Ignore contours if the bbox width is less than this value.
min_h (float): Ignore contours if the bbox height is less than this value.

Returns:

list: A list of bbox-es of inner contours.
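
To make "level" concrete, the sketch below computes the nesting depth of each contour from a hierarchy table in the layout OpenCV uses for tree retrieval, where each row is `[next, previous, first_child, parent]` and `-1` means "none". The helper `contour_levels` and the hand-written table are hypothetical, and counting the outermost contour as level 0 is an assumption about how this function numbers levels.

```python
def contour_levels(hierarchy):
    """Return the nesting depth of every contour (outermost contour = 0)."""
    levels = []
    for row in hierarchy:
        depth = 0
        parent = row[3]
        while parent != -1:  # walk up the parent links to the root
            depth += 1
            parent = hierarchy[parent][3]
        levels.append(depth)
    return levels


# Hand-written example: contour 0 encloses 1 and 2, and contour 3 sits inside 1.
hierarchy = [
    [-1, -1, 1, -1],
    [2, -1, 3, 0],
    [-1, 1, -1, 0],
    [-1, -1, -1, 1],
]
levels = contour_levels(hierarchy)
level_2 = [i for i, level in enumerate(levels) if level == 2]
```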

pdf2docx.common.algorithm.recursive_xy_cut(img_binary: array, min_w: float = 0.0, min_h: float = 0.0, min_dx: float = 15.0, min_dy: float = 15.0)

Split an image with the recursive XY-cut algorithm.

Args:

img_binary (np.array): Binarized image with interesting region (255) and empty region (0).
min_w (float): Ignore bbox if the width is less than this value.
min_h (float): Ignore bbox if the height is less than this value.
min_dx (float): Merge two bbox-es if the x-gap is less than this value.
min_dy (float): Merge two bbox-es if the y-gap is less than this value.

Returns:

list: bbox (x0, y0, x1, y1) of split blocks.
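
The general technique can be sketched in pure Python on a nested list of 0/255 values. This is a simplified illustration, not the library's code: the names `merged_runs` and `xy_cut` are hypothetical, the `min_w`/`min_h` filtering is left out, and starting with a horizontal cut is an assumption.

```python
def merged_runs(profile, min_gap):
    """Return [start, end) runs of truthy entries, merging runs closer than min_gap."""
    runs, start = [], None
    for i, filled in enumerate(profile):
        if filled and start is None:
            start = i
        elif not filled and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(profile)])
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1][1] = run[1]  # gap too small: extend the previous run
        else:
            merged.append(run)
    return merged


def xy_cut(img, bbox, min_dx, min_dy, cut_rows=True, other_axis_failed=False):
    """Split bbox (x0, y0, x1, y1) along alternating axes until no cut is possible."""
    x0, y0, x1, y1 = bbox
    if cut_rows:
        profile = [any(img[y][x0:x1]) for y in range(y0, y1)]
        parts = [(x0, y0 + a, x1, y0 + b) for a, b in merged_runs(profile, min_dy)]
    else:
        profile = [any(img[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
        parts = [(x0 + a, y0, x0 + b, y1) for a, b in merged_runs(profile, min_dx)]
    if not parts:
        return []  # region is empty
    if len(parts) == 1:
        if other_axis_failed:
            return parts  # neither axis can split this block any further
        return xy_cut(img, parts[0], min_dx, min_dy, not cut_rows, True)
    blocks = []
    for part in parts:
        blocks.extend(xy_cut(img, part, min_dx, min_dy, not cut_rows, False))
    return blocks


# 10 x 8 image with two blocks side by side and one wide block below them.
img = [[0] * 10 for _ in range(8)]
for bx0, by0, bx1, by1 in [(0, 0, 3, 2), (5, 0, 8, 2), (0, 5, 8, 8)]:
    for y in range(by0, by1):
        for x in range(bx0, bx1):
            img[y][x] = 255

blocks = xy_cut(img, (0, 0, 10, 8), min_dx=2, min_dy=2)
```

In this sketch, raising `min_dx` to 3 makes the two-pixel gap between the upper blocks too small to cut, so they come back as a single block.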

pdf2docx.common.algorithm.solve_rects_intersection(V: list, num: int, index_groups: list)

Solve the rectangle-intersection problem.

Performance:

O(n log n + k) time and O(n) space, where k is the number of intersecting pairs.

Args:

V (list): Rectangle-related x-edges data, [(index, Rect, x), (…), …].
num (int): Count of V instances, equal to len(V).
index_groups (list): Target adjacency list for connectivity between rects.

Procedure detect(V, H, m):

if m < 2 then return, else:
- let V1 be the first ⌊m/2⌋ and let V2 be the rest of the vertical edges in V in the sorted order;
- let S11 and S22 be the sets of rectangles represented only in V1 and V2 but not spanning V2 and V1, respectively;
- let S12 be the set of rectangles represented only in V1 and spanning V2;
- let S21 be the set of rectangles represented only in V2 and spanning V1;
- let H1 and H2 be the lists of y-intervals corresponding to the elements of V1 and V2, respectively;
- stab(S12, S22); stab(S21, S11); stab(S12, S21);
- detect(V1, H1, ⌊m/2⌋); detect(V2, H2, m − ⌊m/2⌋).
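
For checking results on small inputs, a brute-force O(n²) baseline is easy to write. The sketch below is not the divide-and-conquer procedure described above. The name `brute_force_intersections` is hypothetical, rectangles are assumed to be plain `(x0, y0, x1, y1)` tuples rather than `Rect` objects, and rectangles that only touch along an edge are assumed not to intersect.

```python
def brute_force_intersections(rects):
    """Return an adjacency list: index_groups[i] holds the indices of rects overlapping rect i."""
    index_groups = [set() for _ in rects]
    for i, (ax0, ay0, ax1, ay1) in enumerate(rects):
        for j in range(i + 1, len(rects)):
            bx0, by0, bx1, by1 = rects[j]
            # Two rectangles overlap when their x-ranges and y-ranges both overlap.
            if ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1:
                index_groups[i].add(j)
                index_groups[j].add(i)
    return index_groups


rects = [(0, 0, 2, 2), (1, 1, 3, 3), (2.5, 2.5, 4, 4), (10, 10, 11, 11)]
index_groups = brute_force_intersections(rects)
```

The result is an adjacency list of sets, the same shape of graph that a connected-components search can consume to group overlapping rectangles.
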

pdf2docx.common.algorithm.xy_project_profile(img_source: array, img_binary: array, gap: int = 5, dw: Optional[int] = None, dh: Optional[int] = None)

Projection profiles along the x and y directions.

```
       ┌────────────────┐
    dh │                │
       └────────────────┘
              gap
       ┌────────────────┐ ┌───┐
       │                │ │   │
     h │     image      │ │   │
       │                │ │   │
       └────────────────┘ └───┘
               w           dw
```

Args:

img_source (np.array): Source image, e.g. RGB mode.
img_binary (np.array): Binarized image.
gap (int, optional): Gap between sub-graph. Defaults to 5.
dw (int, optional): Graph height of x projection profile. Defaults to None.
dh (int, optional): Graph height of y projection profile. Defaults to None.

Returns:

np.array: The combined graph data.
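
The underlying quantity is simple to compute by hand. The following minimal sketch counts foreground pixels per column and per row of a binarized image stored as a nested list. The name `projection_profiles` is hypothetical, the sketch only produces the counts (not the combined graph image this function returns), and indexing the x profile by column is an assumption about naming.

```python
def projection_profiles(img_binary):
    """Return (x_profile, y_profile): foreground pixel counts per column and per row."""
    height = len(img_binary)
    width = len(img_binary[0]) if height else 0
    x_profile = [
        sum(1 for y in range(height) if img_binary[y][x]) for x in range(width)
    ]
    y_profile = [sum(1 for value in row if value) for row in img_binary]
    return x_profile, y_profile


img = [
    [0, 255, 255, 0],
    [0, 255, 0, 0],
    [0, 0, 0, 0],
]
x_profile, y_profile = projection_profiles(img)
```

Runs of zeros in either profile mark empty bands, which is where a cut-based layout analysis can separate blocks.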