pdf2docx.page.RawPage module#
A wrapper around a PDF page engine (e.g. PyMuPDF, pdfminer) that does the following work:
- extract source contents
- clean up blocks/shapes, e.g. elements outside the page
- calculate the page margin
- parse the page structure roughly, i.e. sections and columns
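The order of these steps matters: the notes further down say the margin is calculated after clean-up, and sections are parsed after the margin. The following is a minimal sketch of that order only. `ToyRawPage`, `DemoPage`, `run()` and the `steps` list are hypothetical names made up for illustration, and having `restore()` call `extract_raw_dict()` is an assumption; none of this is pdf2docx's actual implementation.

```python
from abc import ABC, abstractmethod


class ToyRawPage(ABC):
    """Hypothetical stand-in for a raw page wrapper (not pdf2docx's RawPage)."""

    def __init__(self):
        self.steps = []   # records the order in which the steps ran
        self.raw = None   # filled in by restore()

    @abstractmethod
    def extract_raw_dict(self, **settings):
        """Engine-specific extraction, supplied by a concrete subclass."""

    def restore(self, **settings):
        # Assumption: restoring means loading the dict produced by the engine.
        self.raw = self.extract_raw_dict(**settings)
        self.steps.append("restore")

    def clean_up(self, **settings):
        self.steps.append("clean_up")

    def calculate_margin(self, **settings):
        self.steps.append("calculate_margin")

    def parse_section(self, **settings):
        self.steps.append("parse_section")

    def run(self, **settings):
        # Hypothetical driver: clean up first, then margin, then sections.
        self.restore(**settings)
        self.clean_up(**settings)
        self.calculate_margin(**settings)
        self.parse_section(**settings)
        return self.steps


class DemoPage(ToyRawPage):
    """Toy engine that returns an empty page."""

    def extract_raw_dict(self, **settings):
        return {"width": 595.0, "height": 842.0, "blocks": [], "shapes": []}


page = DemoPage()
page.run()
```

Because `extract_raw_dict()` is abstract, `ToyRawPage` itself cannot be instantiated; only a subclass that supplies the engine-specific extraction can.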
- class pdf2docx.page.RawPage.RawPage(page_engine=None)#
Bases: BasePage, ABC
A wrapper of page engine.
- calculate_margin(**settings)#
Calculate page margin.
Note
Run this method right after cleaning up the layout, so that the page margin is calculated from a valid layout and stays constant afterwards.
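To illustrate why the margin should be calculated only after clean-up, here is a minimal sketch that derives margins from element bounding boxes. The function name `estimate_margin` and the `(x0, y0, x1, y1)` bbox convention with a top-left origin are assumptions for illustration; pdf2docx's own calculation may differ.

```python
def estimate_margin(page_width, page_height, bboxes):
    """Return (left, right, top, bottom) margins from element bounding boxes.

    Hypothetical helper: each bbox is (x0, y0, x1, y1), origin at top-left.
    """
    if not bboxes:
        return (0.0, 0.0, 0.0, 0.0)

    # The union of all element boxes is the occupied content area.
    x0 = min(b[0] for b in bboxes)
    y0 = min(b[1] for b in bboxes)
    x1 = max(b[2] for b in bboxes)
    y1 = max(b[3] for b in bboxes)

    # Clamp at zero so content touching or crossing the edge gives no margin.
    left = max(x0, 0.0)
    right = max(page_width - x1, 0.0)
    top = max(y0, 0.0)
    bottom = max(page_height - y1, 0.0)
    return (left, right, top, bottom)


valid = [(72, 72, 523, 770)]
stray = valid + [(-50, 100, -10, 120)]  # an element lying outside the page

clean_margin = estimate_margin(595, 842, valid)
skewed_margin = estimate_margin(595, 842, stray)
```

With the stray element included, the left margin collapses to zero, which is the kind of distortion that cleaning up the layout first is meant to avoid.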
- clean_up(**kwargs)#
- abstract extract_raw_dict(**settings)#
Extract source data with page engine. Return a dict with the following structure:
```
{
    "width": w,
    "height": h,
    "blocks": [{…}, {…}, …],
    "shapes": [{…}, {…}, …]
}
```
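As a rough illustration of the structure above, the sketch below checks that a dict has the four top-level keys. `check_raw_dict` and the `sample` data are hypothetical; the contents of each block and shape are not specified here, so the `"bbox"` field is only a placeholder assumption.

```python
def check_raw_dict(raw):
    """Raise ValueError if `raw` lacks the documented top-level keys.

    Hypothetical helper, not part of pdf2docx.
    """
    required = {"width", "height", "blocks", "shapes"}
    missing = required - set(raw)
    if missing:
        raise ValueError(f"raw dict is missing keys: {sorted(missing)}")

    # "blocks" and "shapes" are documented as lists of dicts.
    for key in ("blocks", "shapes"):
        if not isinstance(raw[key], list):
            raise ValueError(f"'{key}' must be a list")
    return True


# Placeholder data: the "bbox" field is an assumption for illustration only.
sample = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [{"bbox": (72, 72, 523, 100)}],
    "shapes": [],
}
ok = check_raw_dict(sample)
```

A concrete subclass implementing `extract_raw_dict` could run a check like this before handing the dict on, though whether pdf2docx validates the dict at all is not stated here.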
- parse_section(**settings)#
Detect and create page sections.
Note
Only two-column sections are considered for now.
Page margin must be parsed before this step.
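One simple way to picture two-column detection is to look for a vertical gap that no element crosses. The sketch below does this by merging horizontal extents; `split_two_columns` and its `min_gap` parameter are hypothetical, and this is not necessarily the technique pdf2docx uses.

```python
def split_two_columns(bboxes, min_gap=10.0):
    """Return (left, right) bbox lists if a vertical gap separates them, else None.

    Hypothetical helper: each bbox is (x0, y0, x1, y1).
    """
    if len(bboxes) < 2:
        return None

    # Merge the horizontal extents (x0, x1) of all elements into disjoint intervals.
    intervals = sorted((b[0], b[2]) for b in bboxes)
    merged = [list(intervals[0])]
    for x0, x1 in intervals[1:]:
        if x0 <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])

    # Exactly two intervals separated by a wide enough gap means two columns.
    if len(merged) != 2 or merged[1][0] - merged[0][1] < min_gap:
        return None

    split_x = (merged[0][1] + merged[1][0]) / 2
    left = [b for b in bboxes if b[2] <= split_x]
    right = [b for b in bboxes if b[0] >= split_x]
    return left, right


two_col = [(72, 100, 280, 120), (72, 130, 280, 150), (315, 100, 523, 120)]
one_col = [(72, 100, 523, 120), (72, 130, 523, 150)]

columns = split_two_columns(two_col)
no_columns = split_two_columns(one_col)
```

The bounding boxes in such a sketch would be measured inside the content area, which is one reason the page margin needs to be known before sections are parsed.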
- process_font(fonts: Fonts)#
Update font properties, e.g. font name, font line height ratio, of TextSpan.
Args:
fonts (Fonts): Fonts parsed by fonttools.
- property raw_text#
Extracted raw text in the current page. Should be accessed after restoring data with restore().
- restore(**kwargs)#
- property text#
All extracted text in this page, with images represented as <image>. Should be accessed after restoring data with restore().
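The idea of substituting a placeholder for images can be sketched as follows. The block layout used here (a `"kind"` key and a `"text"` key) and the newline separator are assumptions made purely for illustration; they are not pdf2docx's internal representation.

```python
def page_text(blocks):
    """Join block texts, writing '<image>' in place of image blocks.

    Hypothetical helper operating on toy dicts, not on pdf2docx objects.
    """
    parts = []
    for block in blocks:
        if block.get("kind") == "image":
            parts.append("<image>")
        else:
            parts.append(block.get("text", ""))
    return "\n".join(parts)


toy_blocks = [
    {"kind": "text", "text": "Figure 1 shows the layout."},
    {"kind": "image"},
    {"kind": "text", "text": "Caption text."},
]
result = page_text(toy_blocks)
```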