pdf2docx.page.RawPage module

A wrapper of the PDF page engine (e.g. PyMuPDF, pdfminer) that extracts source data, calculates the page margin, and detects page sections.

class pdf2docx.page.RawPage.RawPage(page_engine=None)

Bases: BasePage, ABC

A wrapper of page engine.

calculate_margin(**settings)

Calculate page margin.

Note

Run this method right after cleaning up the layout, so that the page margin is calculated from a valid layout and stays constant.
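The note above is easier to see with a small illustration. The sketch below is not the pdf2docx implementation; it assumes, for illustration only, that a margin is the gap between the page edge and the union of all layout bounding boxes. The function name `estimate_margin` and the `(x0, y0, x1, y1)` box format are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical bounding box: (x0, y0, x1, y1), origin at the top-left corner.
BBox = Tuple[float, float, float, float]


def estimate_margin(
    page_width: float, page_height: float, bboxes: Iterable[BBox]
) -> Tuple[float, float, float, float]:
    """Return (left, right, top, bottom) margins from the union of all boxes.

    Illustrative assumption only: the margin is the distance between each
    page edge and the outermost layout element on that side.
    """
    boxes = list(bboxes)
    if not boxes:
        # No layout elements: nothing to measure against.
        return (0.0, 0.0, 0.0, 0.0)

    # Union of all bounding boxes.
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)

    # Clamp at zero so content touching or crossing an edge gives no margin.
    left = max(x0, 0.0)
    right = max(page_width - x1, 0.0)
    top = max(y0, 0.0)
    bottom = max(page_height - y1, 0.0)
    return (left, right, top, bottom)


# Two text blocks on an A4-sized page (595 x 842 pt).
margins = estimate_margin(
    595.0, 842.0, [(72.0, 90.0, 500.0, 120.0), (80.0, 130.0, 523.0, 700.0)]
)
print(margins)
```

Under this assumption, a single stray element outside the page would pull the union box outward and collapse a margin to zero, which is one reason to measure only after invalid elements have been cleaned up.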

abstract extract_raw_dict(**settings)

Extract source data with page engine. Return a dict with the following structure:

```
{
    "width" : w,
    "height": h,
    "blocks": [{…}, {…}, …],
    "shapes": [{…}, {…}, …]
}
```
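Because `extract_raw_dict` is abstract, each page engine supplies its own implementation. The following is a minimal sketch of that pattern using stand-in classes rather than pdf2docx itself: `SketchRawPage`, `InMemoryRawPage`, and the dict-based "engine" are all hypothetical, and the contents of individual blocks and shapes are left empty because their structure is not specified here.

```python
from abc import ABC, abstractmethod


class SketchRawPage(ABC):
    """Hypothetical stand-in for RawPage; not the pdf2docx class."""

    def __init__(self, page_engine=None):
        self.page_engine = page_engine

    @abstractmethod
    def extract_raw_dict(self, **settings):
        """Return a dict with "width", "height", "blocks" and "shapes"."""


class InMemoryRawPage(SketchRawPage):
    """Toy subclass whose "engine" is a plain dict instead of a PDF library."""

    def extract_raw_dict(self, **settings):
        engine = self.page_engine
        return {
            "width": engine["width"],
            "height": engine["height"],
            # Copy the lists so callers cannot mutate the engine's data.
            "blocks": list(engine.get("blocks", [])),
            "shapes": list(engine.get("shapes", [])),
        }


# A fake engine holding one (empty) block and no shapes.
fake_engine = {"width": 595.0, "height": 842.0, "blocks": [{}], "shapes": []}
raw = InMemoryRawPage(fake_engine).extract_raw_dict()
print(sorted(raw))
```

Instantiating `SketchRawPage` directly raises `TypeError`, which is how `ABC` enforces that a concrete engine wrapper provides the method.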

parse_section(**settings)

Detect and create page sections.

Note

process_font(fonts: Fonts)

Update the font properties of each TextSpan, e.g. font name and font line height ratio.

property raw_text

Extracted raw text in the current page. Should be accessed only after restore() has been called to restore the data.

property text

All extracted text in this page, with each image represented as <image>. Should be accessed only after restore() has been called to restore the data.