pdf2docx.layout.Blocks module
A group of text elements, as distinguished from Shape elements. The collection holds TextBlock, ImageBlock or TableBlock instances after parsing, Line instances at the beginning, and a combination of Line and TableBlock instances during the parsing process.
- class pdf2docx.layout.Blocks.Blocks(instances: Optional[list] = None, parent=None)
Bases: ElementCollection
Block collections.
- assign_to_tables(tables: list)
Add blocks (line or sub-table) to associated cells of given tables.
- Args:
tables (list): A list of TableBlock instances.
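The assignment idea can be sketched with plain bounding boxes. This is a minimal illustration, assuming a block belongs to the first cell whose box fully contains it; `contains` and `assign_to_cells` are hypothetical helpers, not part of pdf2docx, and the real method works on Line/TableBlock and TableBlock objects rather than tuples.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def contains(outer: BBox, inner: BBox) -> bool:
    """Return True if `inner` lies completely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def assign_to_cells(blocks: Dict[str, BBox],
                    cells: Dict[str, BBox]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Hypothetical helper: map each block id to the first cell containing it.

    Returns the per-cell assignment and the ids of blocks outside every cell.
    """
    assigned: Dict[str, List[str]] = {cell_id: [] for cell_id in cells}
    unassigned: List[str] = []
    for block_id, block_bbox in blocks.items():
        for cell_id, cell_bbox in cells.items():
            if contains(cell_bbox, block_bbox):
                assigned[cell_id].append(block_id)
                break
        else:
            # No cell contains this block, so it stays at page level.
            unassigned.append(block_id)
    return assigned, unassigned


cells = {"A1": (0, 0, 100, 50), "B1": (100, 0, 200, 50)}
blocks = {
    "line-1": (10, 10, 90, 30),
    "line-2": (110, 10, 190, 30),
    "line-3": (10, 80, 90, 100),
}
assigned, unassigned = assign_to_cells(blocks, cells)
```

Blocks left in `unassigned` would remain in the page-level collection in this sketch.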
- clean_up(float_image_ignorable_gap: float, line_overlap_threshold: float)
Clean up blocks at page level:
* convert blocks to lines
* remove lines outside the page
* remove transformed text, i.e. text whose direction is neither (1, 0) nor (0, -1)
* remove empty lines
- Args:
float_image_ignorable_gap (float): An image is regarded as a floating image if the intersection exceeds this value.
line_overlap_threshold (float): Remove a line if the intersection exceeds this value.
Note
The block structure extracted from PyMuPDF might be unreasonable, e.g.
* one real paragraph is split into multiple blocks; or
* one block consists of multiple real paragraphs.
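The filtering steps listed for clean_up can be sketched as follows. This is an illustration only: `RawLine`, `is_allowed_direction`, `intersects_page` and `clean_lines` are hypothetical names, "out of page" is assumed to mean that the line's box does not intersect the page box, and the overlap thresholds are not modelled.

```python
from typing import List, NamedTuple, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


class RawLine(NamedTuple):
    """Hypothetical stand-in for a parsed line; not a pdf2docx class."""
    text: str
    bbox: BBox
    direction: Tuple[float, float]  # writing direction as (cos, sin)


ALLOWED_DIRECTIONS = ((1, 0), (0, -1))


def is_allowed_direction(direction: Tuple[float, float], tol: float = 1e-3) -> bool:
    """True if the direction matches (1, 0) or (0, -1) within a tolerance."""
    dx, dy = direction
    return any(abs(dx - ax) <= tol and abs(dy - ay) <= tol
               for ax, ay in ALLOWED_DIRECTIONS)


def intersects_page(bbox: BBox, page_bbox: BBox) -> bool:
    """True if the line box overlaps the page box (assumed meaning of 'in page')."""
    return (bbox[0] < page_bbox[2] and bbox[2] > page_bbox[0]
            and bbox[1] < page_bbox[3] and bbox[3] > page_bbox[1])


def clean_lines(lines: List[RawLine], page_bbox: BBox) -> List[RawLine]:
    """Drop lines outside the page, transformed text and empty lines."""
    return [
        line for line in lines
        if intersects_page(line.bbox, page_bbox)
        and is_allowed_direction(line.direction)
        and line.text.strip()
    ]
```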
- collect_stream_lines(potential_shadings: list, line_separate_threshold: float)
Collect elements at Line level (line or table bbox) that may be contained in a stream table region.
A table may exist under either of the following conditions:
* blocks in a row don’t follow flow layout; or
* a block is contained in a potential shading.
- Args:
potential_shadings (list): A group of shapes representing potential cell shading.
line_separate_threshold (float): Two lines are treated as separate if the x-distance between them exceeds this value.
- Returns:
list: A list of Lines. Each group of Lines represents a potential table.
Note
PyMuPDF may group multiple lines in a row into one text block even though each line belongs to a different cell, so it is necessary to go down to the line level.
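The role of line_separate_threshold can be illustrated with a small sketch. It assumes each line in a row is reduced to an `(x0, x1, text)` tuple and that a horizontal gap larger than the threshold starts a new group; `split_row_by_gap` is a hypothetical helper, not the library's implementation.

```python
from typing import List, Tuple

RowLine = Tuple[float, float, str]  # (x0, x1, text) of one line in a row


def split_row_by_gap(lines: List[RowLine],
                     line_separate_threshold: float) -> List[List[RowLine]]:
    """Group the lines of one row; a large x-gap starts a new group."""
    groups: List[List[RowLine]] = []
    for line in sorted(lines):  # left to right by x0
        x0 = line[0]
        if groups and x0 - groups[-1][-1][1] <= line_separate_threshold:
            # Close enough to the previous line: same flow of text.
            groups[-1].append(line)
        else:
            groups.append([line])
    return groups


row = [(200, 260, "Price"), (10, 60, "Name"), (62, 100, "and")]
groups = split_row_by_gap(row, line_separate_threshold=5.0)
# A row that splits into several groups does not follow flow layout,
# which in this sketch marks it as a candidate stream-table row.
is_table_candidate = len(groups) > 1
```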
- property floating_image_blocks
- property inline_image_blocks
Get inline image blocks contained in this Collection.
- property lattice_table_blocks
Get lattice table blocks contained in this Collection.
- make_docx(doc)
Create the page based on the parsed block structure.
- Args:
doc (Document, _Cell): The container in which to create the docx content.
- parse_block(max_line_spacing_ratio: float, line_break_free_space_ratio: float, new_paragraph_free_space_ratio: float)
Group lines into text blocks.
- parse_spacing(*args)
Calculate external and internal space for text blocks:
* vertical distance between blocks, i.e. paragraph before/after spacing
* horizontal distance to left/right border, i.e. paragraph left/right indent
* vertical distance between lines, i.e. paragraph line spacing
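The three quantities above can be sketched with plain bounding boxes. This is a simplified illustration, assuming blocks are sorted top to bottom inside a rectangular container; `block_spacing` and `mean_line_pitch` are hypothetical helpers, and the real method may apply additional rules.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def block_spacing(blocks: List[BBox], container: BBox) -> List[Dict[str, float]]:
    """Space before each block and its left/right indent within the container."""
    spacing = []
    previous_bottom = container[1]  # start from the container's top edge
    for x0, y0, x1, y1 in blocks:
        spacing.append({
            "space_before": max(y0 - previous_bottom, 0.0),
            "left_indent": max(x0 - container[0], 0.0),
            "right_indent": max(container[2] - x1, 0.0),
        })
        previous_bottom = y1
    return spacing


def mean_line_pitch(line_tops: List[float]) -> float:
    """Average vertical distance between the tops of consecutive lines."""
    gaps = [b - a for a, b in zip(line_tops, line_tops[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


container = (72, 72, 523, 770)
blocks = [(72, 100, 500, 150), (90, 170, 500, 220)]
spacing = block_spacing(blocks, container)
pitch = mean_line_pitch([100, 112, 124])
```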
- parse_text_format(rects, delete_end_line_hyphen: bool)
Parse text format with style represented by stroke/fill shapes.
- Args:
rects (Shapes): Potential styles applied on blocks.
delete_end_line_hyphen (bool): Delete the hyphen at the end of a line if True.
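The effect of delete_end_line_hyphen can be pictured with a plain-string sketch. It assumes lines are joined with a single space and that a trailing "-" on any line except the last marks a word split across lines; `join_lines` is a hypothetical helper and does not reflect how pdf2docx stores or joins spans.

```python
from typing import List


def join_lines(lines: List[str], delete_end_line_hyphen: bool) -> str:
    """Join the lines of a paragraph, optionally removing end-of-line hyphens."""
    parts = []
    last_index = len(lines) - 1
    for index, line in enumerate(lines):
        text = line.rstrip()
        if index == last_index:
            parts.append(text)
        elif delete_end_line_hyphen and text.endswith("-"):
            # Drop the hyphen and glue the word to the next line.
            parts.append(text[:-1])
        else:
            parts.append(text + " ")
    return "".join(parts)


paragraph = ["This is an implemen-", "tation detail."]
joined = join_lines(paragraph, delete_end_line_hyphen=True)
kept = join_lines(paragraph, delete_end_line_hyphen=False)
```

Note that this naive rule would also merge genuinely hyphenated compounds that happen to break at the hyphen, which is one reason to keep the behaviour behind a flag.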
- plot(page)
Plot blocks on the PDF page for debugging purposes.
- restore(raws: list)
Clear the current instances and restore them from the source dicts. ImageBlock is converted to ImageSpan contained in TextBlock.
- Args:
raws (list): A list of raw dicts representing text/image/table blocks.
- Returns:
Blocks: self
- property stream_table_blocks
Get stream table blocks contained in this Collection.
- property table_blocks
Get table blocks contained in this Collection.
- property text_blocks
Get text/image blocks contained in this Collection.