pdf2docx.page.RawPage module

A wrapper around a PDF page engine (e.g. PyMuPDF, pdfminer) that does the following:

  • extract source contents

  • clean up blocks/shapes, e.g. elements out of page

  • calculate page margin

  • parse page structure roughly, i.e. section and column

class pdf2docx.page.RawPage.RawPage(page_engine=None)

Bases: BasePage, ABC

A wrapper around the page engine.


Calculate page margin.


Run this method right after cleaning up the layout, so that the page margin is calculated from a valid layout and stays constant afterwards.
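As an illustration of the idea only, the margin can be thought of as the distance between the page edges and the bounding box of all remaining content. The sketch below is a simplified assumption, not the pdf2docx implementation: the function name `estimate_margin`, the `(x0, y0, x1, y1)` box convention, and the zero-margin fallback are all hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical box convention: (x0, y0, x1, y1) with the origin at the top-left.
BBox = Tuple[float, float, float, float]


def estimate_margin(
    width: float, height: float, bboxes: Iterable[BBox]
) -> Tuple[float, float, float, float]:
    """Return (left, right, top, bottom) margins from the union of content boxes."""
    boxes = list(bboxes)
    if not boxes:
        # No content left after clean-up: assumed fallback of zero margins.
        return (0.0, 0.0, 0.0, 0.0)
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    # Clamp at zero in case content touches or exceeds a page edge.
    return (
        max(x0, 0.0),
        max(width - x1, 0.0),
        max(y0, 0.0),
        max(height - y1, 0.0),
    )
```

Because the result depends on which boxes are passed in, running such a calculation before out-of-page elements are removed would skew the margin, which is why the ordering above matters.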

abstract extract_raw_dict(**settings)

Extract source data with the page engine. Returns a dict with the following structure:

```
{
    "width": w,
    "height": h,
    "blocks": [{…}, {…}, …],
    "shapes": [{…}, {…}, …]
}
```
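Since the method is abstract, each concrete page engine wrapper has to supply it. The sketch below shows the general shape of such a subclass using a stand-in base class, so it runs without pdf2docx installed. `PageSketch`, `FixedSizePage`, and `check_raw_dict` are hypothetical names, and the contents of the individual block and shape dicts are left out because they are not specified here.

```python
from abc import ABC, abstractmethod


class PageSketch(ABC):
    """Hypothetical stand-in for RawPage; not the pdf2docx class."""

    def __init__(self, page_engine=None):
        self.page_engine = page_engine

    @abstractmethod
    def extract_raw_dict(self, **settings) -> dict:
        """Return a dict with width, height, blocks and shapes."""


class FixedSizePage(PageSketch):
    """Hypothetical subclass that reports an empty A4-sized page."""

    def extract_raw_dict(self, **settings) -> dict:
        return {
            "width": 595.0,
            "height": 842.0,
            "blocks": [],  # block dicts would be filled in by a real engine
            "shapes": [],  # shape dicts would be filled in by a real engine
        }


def check_raw_dict(raw: dict) -> bool:
    """Check that the four top-level keys exist and the lists are lists."""
    required = {"width", "height", "blocks", "shapes"}
    return (
        required <= raw.keys()
        and isinstance(raw["blocks"], list)
        and isinstance(raw["shapes"], list)
    )
```

A quick check such as `check_raw_dict(FixedSizePage().extract_raw_dict())` can catch a missing key before the later clean-up and parsing steps run.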




Detect and create page sections.


  • Only two-column sections are considered for now.

  • Page margin must be parsed before this step.
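To give a rough feel for two-column detection, the sketch below projects content boxes onto the x-axis, merges overlapping or nearly touching intervals, and reports a gap only when exactly two groups remain. This is an assumed, simplified heuristic and not the algorithm pdf2docx uses. The name `find_column_gap`, the `min_gap` threshold, and the `(x0, y0, x1, y1)` box convention are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical box convention: (x0, y0, x1, y1).
BBox = Tuple[float, float, float, float]


def find_column_gap(
    bboxes: List[BBox], min_gap: float = 10.0
) -> Optional[Tuple[float, float]]:
    """Return (gap_start, gap_end) if the boxes form exactly two columns, else None."""
    # Horizontal extents of all boxes, sorted by left edge.
    intervals = sorted((b[0], b[2]) for b in bboxes)
    merged: List[List[float]] = []
    for start, end in intervals:
        if merged and start - merged[-1][1] < min_gap:
            # Overlapping or closer than min_gap: same column.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    if len(merged) != 2:
        return None
    return (merged[0][1], merged[1][0])
```

For example, boxes spanning x = 50 to 280 and x = 320 to 550 would yield the gap `(280, 320)`, while a single full-width box yields `None`.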

process_font(fonts: Fonts)

Update the font properties of each TextSpan, e.g. font name and font line height ratio.


fonts (Fonts): Fonts parsed by fonttools.

property raw_text

Raw text extracted from the current page. Access it only after the page data has been loaded with restore().

property text

All text extracted from this page, with each image represented as <image>. Access it only after the page data has been loaded with restore().