pdf2docx.page.Page module

Page object parsed from the PDF raw dict.

In addition to the base structure described in RawPage, it includes some new features, e.g. sections and table blocks. The page elements are structured as follows:

{
    "id": 0, # page index
    "width" : w,
    "height": h,
    "margin": [left, right, top, bottom],
    "sections": [{
        ... # section properties
    }, ...],
    "floats": [{
        ... # floating picture
    }, ...]
}
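As a rough illustration of the structure above, the sketch below fills the dict with concrete values and computes the area left inside the margins. The numbers (an A4 page measured in points) and the helper name `working_area` are assumptions made for this example and are not part of pdf2docx; only the keys and the `[left, right, top, bottom]` margin order come from the structure shown.

```python
# Illustrative values only: an A4-sized page, assuming units are points.
page = {
    "id": 0,                             # page index
    "width": 595.0,
    "height": 842.0,
    "margin": [72.0, 72.0, 54.0, 54.0],  # [left, right, top, bottom]
    "sections": [],                      # section properties omitted here
    "floats": [],                        # floating pictures omitted here
}


def working_area(page: dict) -> tuple:
    """Return (width, height) of the region inside the page margins.

    Hypothetical helper for this example, not a pdf2docx function.
    """
    left, right, top, bottom = page["margin"]
    return (page["width"] - left - right, page["height"] - top - bottom)


print(working_area(page))  # → (451.0, 734.0)
```

Note that the margin order is left, right, top, bottom, so the first two entries reduce the width and the last two reduce the height.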
class pdf2docx.page.Page.Page(id: int = -1, skip_parsing: bool = True, width: float = 0.0, height: float = 0.0, header: Optional[str] = None, footer: Optional[str] = None, margin: Optional[tuple] = None, sections: Optional[Sections] = None, float_images: Optional[BaseCollection] = None)

Bases: BasePage

Object representing the whole page, e.g. margins, sections.

extract_tables(**settings)

Extract content from tables (top layout only).

Note

Before running this method, the page layout must be either parsed from source page or restored from parsed data.

property finalized
make_docx(doc)

Set the page size and margins, and create the page.

Note

Before running this method, the page layout must be either parsed from source page or restored from parsed data.

Args:

doc (Document): python-docx document object

parse(**kwargs)
restore(data: dict)

Restore the layout from parsed results.

store()

Store the parsed layout in dict format.
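The store/restore pair can be pictured as a dict round trip. The following is a minimal sketch under stated assumptions, not the pdf2docx implementation: `SketchPage` is a hypothetical stand-in class, its fields mirror the page dict documented at the top of this module, and having `restore` return the object itself is a choice made only for this example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SketchPage:
    """Hypothetical stand-in for Page; NOT the pdf2docx implementation."""
    id: int = -1
    width: float = 0.0
    height: float = 0.0
    margin: tuple = (0.0, 0.0, 0.0, 0.0)  # (left, right, top, bottom)
    sections: list = field(default_factory=list)
    floats: list = field(default_factory=list)

    def store(self) -> dict:
        """Dump the layout into a plain, JSON-serializable dict."""
        return {
            "id": self.id,
            "width": self.width,
            "height": self.height,
            "margin": list(self.margin),
            "sections": list(self.sections),
            "floats": list(self.floats),
        }

    def restore(self, data: dict) -> "SketchPage":
        """Rebuild the layout from a dict produced by store()."""
        self.id = data.get("id", -1)
        self.width = data.get("width", 0.0)
        self.height = data.get("height", 0.0)
        self.margin = tuple(data.get("margin", (0.0, 0.0, 0.0, 0.0)))
        self.sections = list(data.get("sections", []))
        self.floats = list(data.get("floats", []))
        return self  # returning self is an assumption of this sketch


# Round trip: store to JSON text, then restore into a fresh object.
original = SketchPage(id=0, width=595.0, height=842.0,
                      margin=(72.0, 72.0, 54.0, 54.0))
saved = json.dumps(original.store())
restored = SketchPage().restore(json.loads(saved))
print(restored == original)  # → True
```

Keeping the stored form a plain dict means the parsed layout can be written to disk and reloaded later without parsing the source page again.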