pdf2docx.page.Page module

Page object parsed with PDF raw dict.

In addition to the base structure described in RawPage, some new features, e.g. sections and table blocks, are also included. The page element structure is:

Page >> Section >> Column
- Blocks
  - TextBlock >> Line >> TextSpan / ImageSpan >> Char
  - TableBlock >> Row >> Cell
    - Blocks
    - Shapes
- Shapes
  - Stroke
  - Fill
  - Hyperlink

{
    "id": 0, # page index
    "width" : w,
    "height": h,
    "margin": [left, right, top, bottom],
    "sections": [{
        ... # section properties
    }, ...],
    "floats": [{
        ... # floating picture
    }, ...]
}
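
As a rough illustration of how a dictionary of this shape might be consumed, the sketch below computes the area left inside the page margins. The helper name `content_box` and all sample values are hypothetical; only the key names and the `[left, right, top, bottom]` margin order are taken from the structure above.

```python
def content_box(page: dict) -> tuple:
    """Return (x0, y0, x1, y1) of the area inside the page margins.

    Assumes the margin order [left, right, top, bottom] shown above.
    """
    left, right, top, bottom = page["margin"]
    return (left, top, page["width"] - right, page["height"] - bottom)


# Hypothetical page data; the numbers are made up for illustration.
sample_page = {
    "id": 0,
    "width": 595.0,
    "height": 842.0,
    "margin": [72.0, 72.0, 56.0, 56.0],
    "sections": [],
    "floats": [],
}

box = content_box(sample_page)
print(box)
```

Note that the right margin sits at index 1 and the top margin at index 2, so unpacking by name is less error-prone than indexing.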

class pdf2docx.page.Page.Page(id: int = -1, skip_parsing: bool = True, width: float = 0.0, height: float = 0.0, header: Optional[str] = None, footer: Optional[str] = None, margin: Optional[tuple] = None, sections: Optional[Sections] = None, float_images: Optional[BaseCollection] = None)

Bases: BasePage

Object representing the whole page, e.g. margins, sections.

extract_tables(**settings): Extract content from tables (top layout only).

Note

Before running this method, the page layout must be either parsed from the source page or restored from parsed data.

property finalized

make_docx(doc)

Set the page size and margins, and create the page.

Note

Before running this method, the page layout must be either parsed from the source page or restored from parsed data.

Args: doc (Document): python-docx document object

parse(**kwargs)

restore(data: dict): Restore layout from parsed results.

store(): Store parsed layout in dict format.
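
Since the layout is stored as a plain dict and restored from one, it can plausibly be saved between runs. The following is a minimal sketch, assuming the stored dict contains only JSON-serializable values. The helpers `dump_layout` and `load_layout` and the sample data are hypothetical, and no pdf2docx call is made here.

```python
import json


def dump_layout(data: dict) -> str:
    """Serialize a stored page layout dict to a JSON string."""
    return json.dumps(data, indent=4)


def load_layout(text: str) -> dict:
    """Load a page layout dict back from a JSON string."""
    return json.loads(text)


# Hypothetical stand-in for a stored page layout; values are made up.
stored = {
    "id": 0,
    "width": 595.0,
    "height": 842.0,
    "margin": [72.0, 72.0, 56.0, 56.0],
    "sections": [],
    "floats": [],
}

text = dump_layout(stored)
loaded = load_layout(text)
print(loaded == stored)
```

A dict reloaded this way has the same shape as the original, so under the assumption above it is the kind of value that would be passed as `data` to `restore`.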