pdf2docx.layout.Blocks module

A group of text-based elements, as distinguished from Shape elements. The collection holds Line instances at the beginning of parsing, a combination of Line and TableBlock instances during parsing, and TextBlock, ImageBlock or TableBlock instances after parsing.

class pdf2docx.layout.Blocks.Blocks(instances: Optional[list] = None, parent=None)

Bases: ElementCollection

Block collections.

assign_to_tables(tables: list)

Add blocks (lines or sub-tables) to the associated cells of the given tables.

Args:: tables (list): A list of TableBlock instances.
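The idea of assigning blocks to cells can be illustrated with a minimal sketch. It assumes that a block belongs to a cell when the cell's bounding box fully contains the block's bounding box. The helpers `contains` and `assign_to_cells` and the plain `(x0, y0, x1, y1)` tuples are hypothetical stand-ins, not part of pdf2docx, and the library itself may use a different criterion.

```python
def contains(outer, inner):
    """Return True if bbox `outer` fully contains bbox `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def assign_to_cells(block_bboxes, cell_bboxes):
    """Assign each block to the first cell containing it (assumed criterion).

    Returns a dict mapping cell index to its blocks, plus the list of
    blocks that no cell contains.
    """
    assigned = {i: [] for i in range(len(cell_bboxes))}
    unassigned = []
    for block in block_bboxes:
        for i, cell in enumerate(cell_bboxes):
            if contains(cell, block):
                assigned[i].append(block)
                break
        else:
            # no cell contains this block, so it stays outside the tables
            unassigned.append(block)
    return assigned, unassigned
```

Blocks left unassigned in this sketch correspond to content that stays outside every table.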

clean_up(float_image_ignorable_gap: float, line_overlap_threshold: float)

Clean up blocks at page level:

convert to lines
remove lines out of page
remove transformed text: text direction is not (1, 0) or (0, -1)
remove empty lines

Args::
float_image_ignorable_gap (float): An image is regarded as a floating image if the intersection exceeds this value.
line_overlap_threshold (float): A line is removed if the intersection exceeds this value.

Note

The block structure extracted from PyMuPDF might be unreasonable, e.g.

one real paragraph is split into multiple blocks; or
one block consists of multiple real paragraphs
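The clean-up steps listed for `clean_up` can be sketched as a simple filter. This is only an illustration under stated assumptions: `SimpleLine`, `intersects` and `clean_lines` are hypothetical names, not pdf2docx classes or functions, and "out of page" is interpreted here as "does not intersect the page rectangle". The overlap-based removal controlled by `line_overlap_threshold` is not covered.

```python
from dataclasses import dataclass


@dataclass
class SimpleLine:
    """Hypothetical stand-in for a parsed line; not the pdf2docx Line class."""
    text: str
    bbox: tuple       # (x0, y0, x1, y1)
    direction: tuple  # text direction vector


def intersects(a, b):
    """Return True if two bboxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def clean_lines(lines, page_bbox):
    """Keep lines that are on the page, untransformed and non-empty."""
    allowed_directions = {(1, 0), (0, -1)}
    kept = []
    for line in lines:
        if not intersects(line.bbox, page_bbox):
            continue  # line is out of the page
        if tuple(line.direction) not in allowed_directions:
            continue  # transformed (e.g. rotated) text
        if not line.text.strip():
            continue  # empty line
        kept.append(line)
    return kept
```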

collect_stream_lines(potential_shadings: list, line_separate_threshold: float)

Collect elements at Line level (line or table bbox) that may be contained in a stream table region.

A table may exist under the following conditions:

blocks in a row don’t follow flow layout; or
block is contained in potential shading

Args::
potential_shadings (list): a group of shapes representing potential cell shading
line_separate_threshold (float): two separate lines if the x-distance exceeds this value
Returns:: list: A list of Lines. Each group of Lines represents a potential table.

Note

PyMuPDF may group multiple lines in a row into one text block even though each line belongs to a different cell, so it is necessary to go down to line level.
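The first condition (lines in a row that do not follow flow layout) can be sketched as follows. This is a simplified illustration, not the library's algorithm: it assumes lines are plain `(x0, y0, x1, y1)` tuples, that lines overlapping vertically form a row, and that a row is a table candidate when the horizontal gap between two neighbouring lines exceeds `line_separate_threshold`. The function name `rows_with_separate_lines` is hypothetical, and shading is ignored.

```python
def rows_with_separate_lines(bboxes, line_separate_threshold):
    """Return rows holding neighbouring lines separated by a large x-gap."""
    rows = []
    for bbox in sorted(bboxes, key=lambda b: (b[1], b[0])):
        # join the current row when the line overlaps it vertically
        if rows and bbox[1] < rows[-1]["bottom"]:
            rows[-1]["items"].append(bbox)
            rows[-1]["bottom"] = max(rows[-1]["bottom"], bbox[3])
        else:
            rows.append({"bottom": bbox[3], "items": [bbox]})

    candidates = []
    for row in rows:
        items = sorted(row["items"], key=lambda b: b[0])
        # horizontal gap between each pair of neighbouring lines
        gaps = [right[0] - left[2] for left, right in zip(items, items[1:])]
        if any(gap > line_separate_threshold for gap in gaps):
            candidates.append(items)
    return candidates
```

A row with a single line, or with lines closer together than the threshold, is treated as ordinary flow text in this sketch.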

property floating_image_blocks

property inline_image_blocks: Get inline image blocks contained in this Collection.

property lattice_table_blocks: Get lattice table blocks contained in this Collection.

make_docx(doc)

Create page based on parsed block structure.

Args:: doc (Document, _Cell): The container to make docx content.

parse_block(max_line_spacing_ratio: float, line_break_free_space_ratio: float, new_paragraph_free_space_ratio: float): Group lines into text block.

parse_spacing(*args)

Calculate external and internal space for text blocks:

vertical distance between blocks, i.e. paragraph before/after spacing
horizontal distance to left/right border, i.e. paragraph left/right indent
vertical distance between lines, i.e. paragraph line spacing
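The three kinds of spacing above can be illustrated with a small geometric sketch. It is not the library's implementation: `block_spacing` and `line_spacing` are hypothetical helpers, bounding boxes are plain `(x0, y0, x1, y1)` tuples, blocks are assumed to be sorted top to bottom, and line spacing is approximated as the mean distance between the tops of consecutive lines.

```python
def block_spacing(block_bboxes, container_bbox):
    """Compute before-spacing and left/right indent for each block."""
    cx0, cy0, cx1, _ = container_bbox
    spacing = []
    previous_bottom = cy0
    for x0, y0, x1, y1 in block_bboxes:
        spacing.append({
            # vertical distance to the previous block (or container top)
            "before": max(y0 - previous_bottom, 0.0),
            # horizontal distance to the left/right border
            "left_indent": max(x0 - cx0, 0.0),
            "right_indent": max(cx1 - x1, 0.0),
        })
        previous_bottom = y1
    return spacing


def line_spacing(line_bboxes):
    """Mean distance between the tops of consecutive lines (assumed measure)."""
    tops = [bbox[1] for bbox in line_bboxes]
    if len(tops) < 2:
        return 0.0
    diffs = [b - a for a, b in zip(tops, tops[1:])]
    return sum(diffs) / len(diffs)
```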

parse_text_format(rects, delete_end_line_hyphen: bool)

Parse text format with style represented by stroke/fill shapes.

Args::
rects (Shapes): Potential styles applied on blocks.
delete_end_line_hyphen (bool): delete hyphen at the end of a line if True.
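The effect of `delete_end_line_hyphen` can be shown on plain strings. The sketch below is an assumption about the intended behaviour, not code from pdf2docx: the hypothetical helper `join_lines` drops a trailing hyphen from every line except the last and joins that line directly to the next one.

```python
def join_lines(lines, delete_end_line_hyphen):
    """Join text lines into one paragraph string."""
    parts = []
    last_index = len(lines) - 1
    for i, line in enumerate(lines):
        text = line.rstrip()
        if i == last_index:
            parts.append(text)
        elif delete_end_line_hyphen and text.endswith("-"):
            # drop the hyphen so the word continues on the next line
            parts.append(text[:-1])
        else:
            parts.append(text + " ")
    return "".join(parts)
```

With the flag set, `["exam-", "ple"]` becomes `"example"`. Without it, the hyphen and a separating space are kept. A hyphen that is a genuine part of a compound word would also be removed in this sketch.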

plot(page): Plot blocks in the PDF page for debugging purposes.

restore(raws: list)

Clear the current instances and restore them from the source dicts. Each ImageBlock is converted to an ImageSpan contained in a TextBlock.

Args:: raws (list): A list of raw dicts representing text/image/table blocks.
Returns:: Blocks: self

property stream_table_blocks: Get stream table blocks contained in this Collection.

property table_blocks: Get table blocks contained in this Collection.

property text_blocks: Get text/image blocks contained in this Collection.