pdf2docx.converter module

PDF to Docx Converter.

exception pdf2docx.converter.ConversionException: Bases: Exception

class pdf2docx.converter.Converter(pdf_file: Optional[str] = None, password: Optional[str] = None, stream: Optional[bytes] = None)

Bases: object

The PDF to docx converter.

* Read the PDF file with PyMuPDF to get raw layout data page by page, including text, images, drawings and their properties, e.g. bounding box, font, size, image width and height.
* Analyze the layout at the document level, e.g. page header, footer and margin.
* Parse the page layout into a docx structure, e.g. paragraphs and their properties such as indentation, spacing and text alignment; tables and their properties such as border, shading and merging.
* Finally, generate the docx with python-docx.
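
A minimal usage sketch of the workflow above, relying only on the constructor, convert() and close() as documented on this page. The helper name open_converter and its factory parameter are hypothetical; factory exists only so the sketch can be exercised without a real PDF.

```python
from contextlib import contextmanager


@contextmanager
def open_converter(pdf_file, password=None, factory=None):
    """Yield a converter and always call close() on exit.

    `open_converter` and `factory` are hypothetical names for this sketch.
    """
    if factory is None:
        # Imported lazily so the helper can be defined without pdf2docx installed.
        from pdf2docx.converter import Converter as factory
    cv = factory(pdf_file=pdf_file, password=password)
    try:
        yield cv
    finally:
        # Release the converter even if conversion raised.
        cv.close()


# Usage (requires pdf2docx and an existing file; paths are placeholders):
# with open_converter("sample.pdf") as cv:
#     cv.convert("sample.docx")
```

Wrapping close() in a context manager is a design choice of this sketch, not something the class provides itself.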

close()

convert(docx_filename: Optional[Union[str, IO]] = None, start: int = 0, end: Optional[int] = None, pages: Optional[list] = None, **kwargs)

Convert specified PDF pages to docx file.

Args:
    docx_filename (str, file-like, optional): docx file to write. Defaults to None.
    start (int, optional): First page to process. Defaults to 0, the first page.
    end (int, optional): Last page to process. Defaults to None, the last page.
    pages (list, optional): Range of page indexes. Defaults to None.
    kwargs (dict, optional): Configuration parameters. Defaults to None.

Refer to default_settings for details of the configuration parameters.

Note

If docx_filename is None, the output file name is the PDF file name with its extension changed from pdf to docx.

Note

start and end are counted from zero if --zero_based_index=True (the default).
If start is omitted, processing starts from the first page.
If end is omitted, processing ends with the last page.

Note

pages has a higher priority than start and end. start and end work only if pages is omitted.

Note

Multi-processing works only for a continuous range of pages specified by start and end, not for pages.
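
The notes above can be illustrated with a short sketch. The helper names default_docx_name and convert_selection are hypothetical; the first only mirrors the documented fallback (pdf extension replaced by docx) and is not the library's own code, and the second passes the page-selection arguments through to convert() unchanged.

```python
import os


def default_docx_name(pdf_file):
    """Replace the file extension with .docx, mirroring the documented fallback."""
    root, _ = os.path.splitext(pdf_file)
    return root + ".docx"


def convert_selection(cv, pdf_file, docx_filename=None, **kwargs):
    """Call cv.convert() with an explicit output name.

    `cv` is an already constructed Converter; start, end, pages and any
    configuration parameters are forwarded as given.
    """
    target = docx_filename or default_docx_name(pdf_file)
    cv.convert(target, **kwargs)
    return target


# Usage (placeholders; needs pdf2docx and a real PDF):
# from pdf2docx.converter import Converter
# cv = Converter("report.pdf")
# convert_selection(cv, "report.pdf", start=2)       # from page index 2 to the last page
# convert_selection(cv, "report.pdf", pages=[0, 3])  # pages takes priority over start/end
# cv.close()
```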

debug_page(i: int, docx_filename: Optional[str] = None, debug_pdf: Optional[str] = None, layout_file: Optional[str] = None, **kwargs)

Parse, create and plot a single page for debugging purposes.

Args:
    i (int): Page index to convert.
    docx_filename (str): docx filename to write to.
    debug_pdf (str): New pdf file storing layout information. Defaults to adding the prefix debug_.
    layout_file (str): New json file storing parsed layout data. Defaults to layout.json.

property default_settings: Default parsing parameters.

deserialize(filename: str): Load parsed pages from specified JSON file.

extract_tables(start: int = 0, end: Optional[int] = None, pages: Optional[list] = None, **kwargs)

Extract table contents from specified PDF pages.

Args:
    start (int, optional): First page to process. Defaults to 0, the first page.
    end (int, optional): Last page to process. Defaults to None, the last page.
    pages (list, optional): Range of page indexes. Defaults to None.
    kwargs (dict, optional): Configuration parameters. Defaults to None.

Returns:
    list: A list of parsed table content.
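
A short sketch of calling extract_tables(), assuming only the signature and return type given above. The helper name tables_from is hypothetical, and the structure of each list element is not specified on this page, so the usage lines simply print each entry.

```python
def tables_from(cv, **kwargs):
    """Return the list produced by cv.extract_tables(), closing cv afterwards.

    `cv` is an already constructed Converter; keyword arguments such as
    start, end or pages are forwarded unchanged.
    """
    try:
        return cv.extract_tables(**kwargs)
    finally:
        cv.close()


# Usage (placeholders; needs pdf2docx and a real PDF):
# from pdf2docx.converter import Converter
# for table in tables_from(Converter("report.pdf"), start=0, end=1):
#     print(table)
```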

property fitz_doc

load_pages(start: int = 0, end: Optional[int] = None, pages: Optional[list] = None)

Step 1 of the conversion process: open the PDF file with PyMuPDF, including password-encrypted files.

Args:
    start (int, optional): First page to process. Defaults to 0, the first page.
    end (int, optional): Last page to process. Defaults to None, the last page.
    pages (list, optional): Range of page indexes to parse. Defaults to None.

make_docx(filename_or_stream=None, **kwargs)

Step 4 of the conversion process: create the docx file from the converted pages.

Args:
    filename_or_stream (str, file-like): docx file to write.
    kwargs (dict, optional): Configuration parameters.

property pages

parse(start: int = 0, end: Optional[int] = None, pages: Optional[list] = None, **kwargs)

Parse pages in three steps:

* open PDF file with PyMuPDF
* analyze whole document, e.g. page section, header/footer and margin
* parse specified pages, e.g. paragraph, image and table

Args:
    start (int, optional): First page to process. Defaults to 0, the first page.
    end (int, optional): Last page to process. Defaults to None, the last page.
    pages (list, optional): Range of page indexes to parse. Defaults to None.
    kwargs (dict, optional): Configuration parameters.

parse_document(**kwargs): Step 2 of the conversion process: analyze the whole document, e.g. page section, header/footer and margin.

parse_pages(**kwargs): Step 3 of the conversion process: parse pages, e.g. paragraph, image and table.
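
The four numbered steps (load_pages, parse_document, parse_pages, make_docx) can be run explicitly instead of through convert(). The sketch below is one way to do that; the helper name convert_stepwise is hypothetical, and it assumes that default_settings behaves like a dict whose entries can be passed as keyword arguments, which this page does not state.

```python
def convert_stepwise(cv, docx_filename, start=0, end=None, pages=None, **overrides):
    """Run conversion steps 1 to 4 one by one on a constructed Converter."""
    # Assumption: default_settings is dict-like; copy it so the defaults stay untouched.
    settings = dict(cv.default_settings)
    settings.update(overrides)

    cv.load_pages(start, end, pages)         # step 1: open the PDF, select pages
    cv.parse_document(**settings)            # step 2: document-level analysis
    cv.parse_pages(**settings)               # step 3: per-page parsing
    cv.make_docx(docx_filename, **settings)  # step 4: write the docx
    return settings


# Usage (placeholders; needs pdf2docx and a real PDF):
# from pdf2docx.converter import Converter
# cv = Converter("report.pdf")
# convert_stepwise(cv, "report.docx", start=0, end=2)
# cv.close()
```

Splitting the steps this way makes it possible to inspect or time each stage separately.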

restore(data: dict): Restore pages from parsed results.

serialize(filename: str): Write parsed pages to specified JSON file.

store(): Store parsed pages in dict format.
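
The four persistence methods pair up: serialize() with deserialize() for JSON files, and store() with restore() for in-memory dicts. A minimal sketch of using them, with hypothetical helper names cache_layout and reuse_layout, and assuming store() returns the dict it describes:

```python
def cache_layout(cv, filename):
    """Write parsed pages to a JSON file and also return them as a dict."""
    cv.serialize(filename)  # parsed pages written to disk as JSON
    return cv.store()       # assumption: returns the parsed pages as a dict


def reuse_layout(cv, filename=None, data=None):
    """Reload parsed pages from a dict if given, otherwise from a JSON file."""
    if data is not None:
        cv.restore(data)
    elif filename is not None:
        cv.deserialize(filename)
    else:
        raise ValueError("pass either filename or data")


# Usage (placeholders; needs pdf2docx and a parsed document):
# data = cache_layout(cv, "layout.json")
# reuse_layout(other_cv, filename="layout.json")
```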

exception pdf2docx.converter.MakedocxException: Bases: ConversionException