pdf2docx.page.Page module
Page object parsed from the PDF raw dict.
In addition to the base structure described in RawPage, some new features, e.g. sections and table blocks, are also included.
Page elements structure:
{
"id": 0, # page index
"width": w,
"height": h,
"margin": [left, right, top, bottom],
"sections": [{
... # section properties
}, ...],
"floats": [{
... # floating picture
}, ...]
}
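As a rough illustration of the structure above, the sketch below builds such a dictionary by hand, checks its top-level keys, and round-trips it through JSON. The concrete values (an A4 page in points, empty section and float lists) and the helper `check_page_dict` are assumptions made for this example; they are not part of pdf2docx.

```python
import json

# Top-level keys listed in the page structure above.
EXPECTED_KEYS = {"id", "width", "height", "margin", "sections", "floats"}


def check_page_dict(page: dict) -> bool:
    """Return True if `page` has the listed top-level keys and a 4-item margin.

    Hypothetical helper for illustration; pdf2docx does not ship this function.
    """
    if not EXPECTED_KEYS.issubset(page):
        return False
    return len(page["margin"]) == 4


# Illustrative values only: an A4 page in points with 72 pt margins.
page = {
    "id": 0,                              # page index
    "width": 595.0,
    "height": 842.0,
    "margin": [72.0, 72.0, 72.0, 72.0],   # left, right, top, bottom
    "sections": [],                       # section properties would go here
    "floats": [],                         # floating pictures would go here
}

# The structure uses only JSON-friendly types, so it survives a round trip.
serialized = json.dumps(page)
restored = json.loads(serialized)
```

Keeping the layout as plain dicts and lists is what makes it straightforward to save parsed results and reload them later.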
- class pdf2docx.page.Page.Page(id: int = -1, skip_parsing: bool = True, width: float = 0.0, height: float = 0.0, header: Optional[str] = None, footer: Optional[str] = None, margin: Optional[tuple] = None, sections: Optional[Sections] = None, float_images: Optional[BaseCollection] = None)
Bases: BasePage
Object representing the whole page, e.g. margins, sections.
- extract_tables(**settings)
Extract content from tables (top layout only).
Note
Before running this method, the page layout must be either parsed from the source page or restored from parsed data.
- property finalized
- make_docx(doc)
Set page size, margin, and create page.
Note
Before running this method, the page layout must be either parsed from the source page or restored from parsed data.
- Args:
doc (Document): python-docx document object
- parse(**kwargs)
- restore(data: dict)
Restore layout from parsed results.
- store()
Store parsed layout in dict format.
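
The store/restore pair can be pictured with a minimal stand-in class. The sketch below is not pdf2docx's `Page`; `PageSketch` is a hypothetical class that models only the top-level fields from the page structure, and returning `self` from `restore()` is a convenience assumed for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PageSketch:
    """Hypothetical stand-in that mimics a store()/restore() pair.

    Not pdf2docx's Page class; it only models the top-level fields.
    """
    id: int = -1
    width: float = 0.0
    height: float = 0.0
    margin: tuple = (0.0, 0.0, 0.0, 0.0)   # left, right, top, bottom
    sections: list = field(default_factory=list)
    floats: list = field(default_factory=list)

    def store(self) -> dict:
        """Dump the layout into a plain dict."""
        return {
            "id": self.id,
            "width": self.width,
            "height": self.height,
            "margin": list(self.margin),
            "sections": list(self.sections),
            "floats": list(self.floats),
        }

    def restore(self, data: dict) -> "PageSketch":
        """Rebuild the layout from a dict produced by store()."""
        self.id = data.get("id", -1)
        self.width = data.get("width", 0.0)
        self.height = data.get("height", 0.0)
        self.margin = tuple(data.get("margin", (0.0, 0.0, 0.0, 0.0)))
        self.sections = list(data.get("sections", []))
        self.floats = list(data.get("floats", []))
        return self  # assumption of this sketch, for call chaining


# Round trip: a restored page compares equal to the one that was stored.
original = PageSketch(id=0, width=595.0, height=842.0,
                      margin=(72.0, 72.0, 72.0, 72.0))
copy = PageSketch().restore(original.store())
```

A page restored this way satisfies the precondition in the notes above without the source page being parsed again.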