pdf2docx.layout.Blocks module

A collection of text-related elements, as distinguished from Shape elements. Its contents depend on the parsing stage: Line instances at the beginning, a combination of Line and TableBlock instances during parsing, and TextBlock, ImageBlock or TableBlock instances once parsing is complete.

class pdf2docx.layout.Blocks.Blocks(instances: Optional[list] = None, parent=None)

Bases: ElementCollection

Block collections.

assign_to_tables(tables: list)

Add blocks (lines or sub-tables) to the associated cells of the given tables.

Args:

tables (list): A list of TableBlock instances.
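How a block is matched to a cell is not described here. The following is only a minimal sketch of one plausible approach, assuming each block goes to the first cell whose bounding box contains the block's center. The names `assign_blocks_to_cells`, `center` and `contains_point` are hypothetical and are not part of pdf2docx.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def center(bbox: BBox) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains_point(bbox: BBox, point: Tuple[float, float]) -> bool:
    """True if the point lies inside the bounding box (borders included)."""
    x0, y0, x1, y1 = bbox
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def assign_blocks_to_cells(blocks: Dict[str, BBox], cells: Dict[str, BBox]):
    """Assumed rule: a block belongs to the first cell containing its center."""
    assigned: Dict[str, List[str]] = {name: [] for name in cells}
    unassigned: List[str] = []
    for block_name, block_bbox in blocks.items():
        point = center(block_bbox)
        for cell_name, cell_bbox in cells.items():
            if contains_point(cell_bbox, point):
                assigned[cell_name].append(block_name)
                break
        else:
            # No cell contains this block; it stays outside the tables.
            unassigned.append(block_name)
    return assigned, unassigned
```

Blocks that match no cell are returned separately, so they can remain in the page-level collection.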

clean_up(float_image_ignorable_gap: float, line_overlap_threshold: float)

Clean up blocks at page level:

  • convert to lines

  • remove lines out of page

  • remove transformed text: text direction is not (1, 0) or (0, -1)

  • remove empty lines

Args:

float_image_ignorable_gap (float): An image is regarded as a floating image if the intersection exceeds this value.

line_overlap_threshold (float): Remove a line if the intersection exceeds this value.

Note

The block structure extracted from PyMuPDF might be unreasonable, e.g.

  • one real paragraph is split into multiple blocks; or

  • one block consists of multiple real paragraphs
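The filtering steps listed for clean_up can be illustrated with a minimal, standalone sketch. `SketchLine`, `intersects`, `is_allowed_direction` and `clean_lines` are hypothetical names, not pdf2docx classes or functions. The sketch assumes that "out of page" means no overlap with the page rectangle, and it does not model the two threshold arguments.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

# Text directions kept by the clean-up step, as stated in the list above.
ALLOWED_DIRECTIONS = [(1, 0), (0, -1)]


@dataclass
class SketchLine:
    """Hypothetical stand-in for a parsed line."""
    bbox: BBox
    direction: Tuple[float, float]
    text: str


def intersects(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap with a positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def is_allowed_direction(direction: Tuple[float, float], tol: float = 1e-3) -> bool:
    """True if the direction matches (1, 0) or (0, -1) within a tolerance."""
    return any(
        abs(direction[0] - dx) <= tol and abs(direction[1] - dy) <= tol
        for dx, dy in ALLOWED_DIRECTIONS
    )


def clean_lines(lines: List[SketchLine], page_bbox: BBox) -> List[SketchLine]:
    """Drop lines that are off the page, transformed, or empty."""
    kept = []
    for line in lines:
        if not intersects(line.bbox, page_bbox):
            continue  # line is outside the page
        if not is_allowed_direction(line.direction):
            continue  # transformed (e.g. rotated) text
        if not line.text.strip():
            continue  # empty line
        kept.append(line)
    return kept
```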

collect_stream_lines(potential_shadings: list, line_separate_threshold: float)

Collect elements at Line level (line or table bbox) that may be contained in a stream table region.

A table may exist under either of the following conditions:

  • blocks in a row don’t follow flow layout; or

  • block is contained in potential shading

Args:

potential_shadings (list): A group of shapes representing potential cell shading.

line_separate_threshold (float): Two lines are treated as separate if the x-distance between them exceeds this value.

Returns:

list: A list of Lines. Each group of Lines represents a potential table.

Note

PyMuPDF may group multiple lines in a row into one text block even though each line belongs to a different cell, so it is necessary to work at line level.
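The role of line_separate_threshold can be sketched as follows. This is an illustrative simplification, not the library's algorithm: each line is reduced to a horizontal span (x0, x1), and a row is considered a table candidate when a horizontal gap larger than the threshold splits it into more than one group. `split_row_by_gap` and `looks_like_table_row` are hypothetical names.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # horizontal extent (x0, x1) of a line


def split_row_by_gap(row: List[Span], line_separate_threshold: float) -> List[List[Span]]:
    """Split the lines of one row wherever the x-gap exceeds the threshold."""
    groups: List[List[Span]] = []
    for span in sorted(row):
        if groups and span[0] - groups[-1][-1][1] <= line_separate_threshold:
            groups[-1].append(span)  # close enough: same flow of text
        else:
            groups.append([span])    # large gap: start a new group
    return groups


def looks_like_table_row(row: List[Span], line_separate_threshold: float) -> bool:
    """Assumed rule: a row with more than one group does not follow flow layout."""
    return len(split_row_by_gap(row, line_separate_threshold)) > 1
```

With a threshold of 5, the spans (10, 40) and (42, 80) stay together, while a third span starting at x = 150 forms its own group and marks the row as a table candidate.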

property floating_image_blocks

property inline_image_blocks

Get inline image blocks contained in this Collection.

property lattice_table_blocks

Get lattice table blocks contained in this Collection.

make_docx(doc)

Create the page based on the parsed block structure.

Args:

doc (Document, _Cell): The container in which the docx content is created.

parse_block(max_line_spacing_ratio: float, line_break_free_space_ratio: float, new_paragraph_free_space_ratio: float)

Group lines into text blocks.

parse_spacing(*args)

Calculate external and internal space for text blocks:

  • vertical distance between blocks, i.e. paragraph before/after spacing

  • horizontal distance to left/right border, i.e. paragraph left/right indent

  • vertical distance between lines, i.e. paragraph line spacing
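The three quantities above can be sketched with plain bounding-box arithmetic. This is a simplified illustration under stated assumptions, not the library's implementation: blocks are sorted top to bottom, y grows downward, and line spacing is taken as the mean distance between consecutive line tops. `Box`, `block_spacing` and `mean_line_spacing` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Box:
    """Hypothetical bounding box; y grows downward as in PDF page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float


def block_spacing(blocks: List[Box], container: Box) -> List[Dict[str, float]]:
    """Compute before-spacing and left/right indent for blocks sorted top to bottom."""
    results = []
    prev_bottom = container.y0
    for block in blocks:
        results.append({
            "before_space": max(block.y0 - prev_bottom, 0.0),
            "left_indent": max(block.x0 - container.x0, 0.0),
            "right_indent": max(container.x1 - block.x1, 0.0),
        })
        prev_bottom = block.y1
    return results


def mean_line_spacing(line_tops: List[float]) -> float:
    """Average vertical distance between the tops of consecutive lines."""
    if len(line_tops) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(line_tops, line_tops[1:])]
    return sum(gaps) / len(gaps)
```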

parse_text_format(rects, delete_end_line_hyphen: bool)

Parse text format with style represented by stroke/fill shapes.

Args:

rects (Shapes): Potential styles applied on blocks.

delete_end_line_hyphen (bool): Delete the hyphen at the end of a line if True.
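The effect of delete_end_line_hyphen can be illustrated on plain strings. The sketch below is an assumption about the intended behaviour, not the library's code: when the flag is set, a trailing hyphen on a non-final line is dropped and the line is joined directly to the next one. `join_lines` is a hypothetical helper.

```python
from typing import List


def join_lines(lines: List[str], delete_end_line_hyphen: bool) -> str:
    """Join the lines of a paragraph, optionally removing end-of-line hyphens."""
    parts = []
    last_index = len(lines) - 1
    for i, line in enumerate(lines):
        text = line.rstrip()
        if i == last_index:
            parts.append(text)
        elif delete_end_line_hyphen and text.endswith("-"):
            parts.append(text[:-1])   # drop the hyphen, join without a space
        else:
            parts.append(text + " ")  # ordinary line break becomes a space
    return "".join(parts)
```

Note that this simple rule would also merge genuinely hyphenated compounds that happen to break at a line end, which is one reason the behaviour is optional.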

plot(page)

Plot blocks in the PDF page for debugging purposes.

restore(raws: list)

Clear the current instances and restore them from source dicts. Each ImageBlock is converted to an ImageSpan contained in a TextBlock.

Args:

raws (list): A list of raw dicts representing text/image/table blocks.

Returns:

Blocks: self

property stream_table_blocks

Get stream table blocks contained in this Collection.

property table_blocks

Get table blocks contained in this Collection.

property text_blocks

Get text/image blocks contained in this Collection.
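The block-type properties all follow the same pattern: each returns the subset of instances of a given kind. A minimal standalone sketch of that pattern is shown below. `SketchBlock` and `SketchBlocks` are hypothetical stand-ins, and the string `kind` field is an assumption made for illustration; it says nothing about how the real classes identify block types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchBlock:
    """Hypothetical block; kind is "text", "image", "lattice_table" or "stream_table"."""
    name: str
    kind: str


class SketchBlocks:
    """Hypothetical collection exposing filtered views as properties."""

    def __init__(self, instances: Optional[list] = None):
        self._instances = list(instances or [])

    def _of_kind(self, *kinds: str) -> List[SketchBlock]:
        # Preserve the original order of the instances.
        return [b for b in self._instances if b.kind in kinds]

    @property
    def text_blocks(self) -> List[SketchBlock]:
        return self._of_kind("text", "image")

    @property
    def lattice_table_blocks(self) -> List[SketchBlock]:
        return self._of_kind("lattice_table")

    @property
    def stream_table_blocks(self) -> List[SketchBlock]:
        return self._of_kind("stream_table")

    @property
    def table_blocks(self) -> List[SketchBlock]:
        return self._of_kind("lattice_table", "stream_table")
```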