pdf2docx.page.RawPage module

A wrapper around a PDF page engine (e.g. PyMuPDF, pdfminer) that does the following work:

  • extract source contents

  • clean up blocks/shapes, e.g. remove elements that fall outside the page

  • calculate page margin

  • roughly parse the page structure, i.e. sections and columns

class pdf2docx.page.RawPage.RawPage(page_engine=None)

Bases: BasePage, ABC

A wrapper around a page engine.

calculate_margin(**settings)

Calculate page margin.

Note

Run this method right after cleaning up the layout, so that the page margin is calculated from a valid layout and stays constant.
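The following is a minimal sketch of why the ordering matters, not the library's actual implementation. It assumes the margin is the distance from each page edge to the union of all element bounding boxes; the function name `sketch_margin` and the `(x0, y0, x1, y1)` tuple layout are hypothetical and introduced only for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical bbox layout for this sketch: (x0, y0, x1, y1) in page coordinates.
BBox = Tuple[float, float, float, float]


def sketch_margin(page_width: float, page_height: float,
                  bboxes: Iterable[BBox]) -> Tuple[float, float, float, float]:
    """Return (left, right, top, bottom) margins from the union of bboxes.

    Illustrative assumption only: the margin is the gap between each page
    edge and the outermost element on that side, clamped at zero.
    """
    boxes = list(bboxes)
    if not boxes:
        # Nothing on the page: no content to measure against.
        return (0.0, 0.0, 0.0, 0.0)

    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)

    left = max(x0, 0.0)
    right = max(page_width - x1, 0.0)
    top = max(y0, 0.0)
    bottom = max(page_height - y1, 0.0)
    return (left, right, top, bottom)
```

Under this assumption, a single stray element that extends past the page edge would collapse the margin on that side to zero, which is why the cleanup step has to come first.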

clean_up(**kwargs)

abstract extract_raw_dict(**settings)

Extract source data with page engine. Return a dict with the following structure:

```
{
    "width" : w,
    "height": h,
    "blocks": [{...}, {...}, ...],
    "shapes": [{...}, {...}, ...]
}
```
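As a rough sketch, the top-level shape of the returned dict can be built and checked with plain Python. The helper names `make_raw_dict` and `has_raw_dict_shape` are hypothetical and not part of pdf2docx; the contents of each block and shape dict are left opaque because the documentation above does not specify them.

```python
from typing import Any, Dict, List


def make_raw_dict(width: float, height: float,
                  blocks: List[Dict[str, Any]],
                  shapes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a dict with the four documented top-level keys."""
    return {
        "width": width,
        "height": height,
        "blocks": list(blocks),   # per-block contents are not specified here
        "shapes": list(shapes),   # per-shape contents are not specified here
    }


def has_raw_dict_shape(raw: Dict[str, Any]) -> bool:
    """Check only the top-level keys and their container types."""
    return (
        isinstance(raw.get("width"), (int, float))
        and isinstance(raw.get("height"), (int, float))
        and isinstance(raw.get("blocks"), list)
        and isinstance(raw.get("shapes"), list)
    )
```

A subclass implementing `extract_raw_dict` could run a check of this kind on its result before handing the data on, assuming the structure shown above is all that downstream code relies on.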

parse_section(**settings)

Detect and create page sections.

Note

  • Only two-column sections are considered for now.

  • Page margin must be parsed before this step.
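To illustrate what detecting a two-column section can look like, here is a simplified sketch that is not the algorithm pdf2docx uses. It assumes elements are given as hypothetical `(x0, y0, x1, y1)` tuples and looks for exactly one vertical gap between their horizontal extents; the function name `split_two_columns` and the `min_gap` parameter are introduced only for this example.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical bbox layout for this sketch: (x0, y0, x1, y1).
BBox = Tuple[float, float, float, float]


def split_two_columns(bboxes: Iterable[BBox],
                      min_gap: float = 5.0) -> Optional[float]:
    """Return the x position separating two columns, or None.

    Horizontal extents are merged whenever they overlap or sit closer
    than `min_gap`. Exactly two merged groups are read as two columns.
    """
    spans = sorted((b[0], b[2]) for b in bboxes)
    merged: List[List[float]] = []
    for x0, x1 in spans:
        if merged and x0 - merged[-1][1] < min_gap:
            # Overlapping or nearly touching: extend the current group.
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])

    if len(merged) != 2:
        return None
    # Place the separator in the middle of the gap between the two groups.
    return (merged[0][1] + merged[1][0]) / 2.0
```

A real implementation would also need the page margin to bound the content area, which is consistent with the note that the margin must be parsed before this step.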

process_font(fonts: Fonts)

Update the font properties of each TextSpan, e.g. font name and font line height ratio.

Args:

fonts (Fonts): Fonts parsed by fonttools.

property raw_text

Extracted raw text in the current page. Should be accessed after the data has been restored with restore().

restore(**kwargs)

property text

All extracted text in this page, with images represented as <image>. Should be accessed after the data has been restored with restore().
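The image placeholder behaviour can be sketched as follows. This is an illustration under stated assumptions, not the library's code: page content is modelled as a hypothetical list in which a string stands for a text block and `None` stands for an image block, and joining with a newline is likewise an assumption.

```python
from typing import Iterable, Optional


def join_page_text(items: Iterable[Optional[str]]) -> str:
    """Join text blocks, substituting "<image>" for image blocks.

    Hypothetical model: a str is a text block, None is an image block.
    """
    return "\n".join("<image>" if item is None else item for item in items)
```

For example, `join_page_text(["Title", None, "Body"])` yields three lines with `<image>` in the middle.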