pdf2docx.text.Lines module#

A group of Line objects.

class pdf2docx.text.Lines.Lines(instances: Optional[list] = None, parent=None)#

Bases: ElementCollection

Collection of text lines.

adjust_last_word(delete_end_line_hyphen: bool)#

Adjust word at the end of line: # - it might miss blank between words from adjacent lines # - it’s optional to delete hyphen since it might not at the the end

of line after conversion

property image_spans#

Get all ImageSpan instances.

parse_line_break(bbox, line_break_width_ratio: float, line_break_free_space_ratio: float)#

Whether hard break each line.

Args:

bbox (Rect): bbox of parent layout, e.g. page or cell. line_break_width_ratio (float): user defined threshold, break line if smaller than this value. line_break_free_space_ratio (float): user defined threshold, break line if exceeds this value.

Hard line break helps ensure paragraph structure, but pdf-based layout calculation may change in docx due to different rendering mechanism like font, spacing. For instance, when one paragraph row can’t accommodate a Line, the hard break leads to an unnecessary empty row. Since we can’t 100% ensure a same structure, it’s better to focus on the content - add line break only when it’s necessary to, e.g. short lines.

parse_tab_stop(line_separate_threshold: float)#

Calculate tab stops for parent block and whether add TAB stop before each line.

Args:

line_separate_threshold (float): Don’t need a tab stop if the line gap less than this value.

parse_text_format(shape)#

Parse text format with style represented by rectangle shape.

Args:

shape (Shape): Potential style shape applied on blocks.

Returns:

bool: Whether a valid text style.

restore(raws: list)#

Construct lines from raw dicts list.

split_vertically_by_text(line_break_free_space_ratio: float, new_paragraph_free_space_ratio: float)#

Split lines into separate paragraph by checking text. The parent text block consists of lines with similar line spacing, while lines in other paragraph might be counted when the paragraph spacing is relatively small. So, it’s necessary to split those lines by checking the text contents.

Note

Considered only normal reading direction, from left to right, from top to bottom.

property unique_parent#

Whether all contained lines have same parent.