pdf2docx.converter module

PDF to Docx Converter.

exception pdf2docx.converter.ConversionException

Bases: Exception

class pdf2docx.converter.Converter(pdf_file: Optional[str] = None, password: Optional[str] = None, stream: Optional[bytes] = None)

Bases: object

The PDF to docx converter.

  • Read the PDF file with PyMuPDF to get raw layout data page by page, including text, images, drawings and their properties, e.g. bounding box, font, size, image width and height.

  • Analyze the layout at document level, e.g. page header, footer and margin.

  • Parse the page layout into docx structure, e.g. paragraphs and their properties like indentation, spacing and text alignment; tables and their properties like border, shading and merging.

  • Finally, generate docx with python-docx.
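The four steps above map onto the methods load_pages(), parse_document(), parse_pages() and make_docx() documented below. The following is a minimal sketch of running them one at a time. It assumes that each step accepts the full dictionary returned by default_settings as keyword arguments; the function name convert_step_by_step is hypothetical.

```python
def convert_step_by_step(pdf_path: str, docx_path: str) -> None:
    """Run the four conversion steps explicitly instead of calling convert()."""
    # Imported lazily so this helper can be defined even where pdf2docx is absent.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        # Assumption: every step is given the complete default settings.
        settings = cv.default_settings
        cv.load_pages()                       # step 1: open the PDF with PyMuPDF
        cv.parse_document(**settings)         # step 2: document-level analysis
        cv.parse_pages(**settings)            # step 3: parse each page layout
        cv.make_docx(docx_path, **settings)   # step 4: write the docx file
    finally:
        # Release the underlying PDF document whether or not conversion succeeded.
        cv.close()
```

In everyday use convert() is the simpler entry point; splitting the steps is mostly useful for inspecting intermediate results.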

close()

convert(docx_filename: Optional[Union[str, IO]] = None, start: int = 0, end: Optional[int] = None, pages: Optional[list] = None, **kwargs)

Convert specified PDF pages to docx file.

Args:

  • docx_filename (str, file-like, optional): docx file to write. Defaults to None.

  • start (int, optional): First page to process. Defaults to 0, the first page.

  • end (int, optional): Last page to process. Defaults to None, the last page.

  • pages (list, optional): Range of page indexes. Defaults to None.

  • kwargs (dict, optional): Configuration parameters. Defaults to None.

Refer to default_settings() for detail of configuration parameters.

Note

If docx_filename is None, the output file name is the PDF file name with the extension changed from pdf to docx.

Note

  • start and end are counted from zero if --zero_based_index=True (the default).

  • Start from the first page if start is omitted.

  • End with the last page if end is omitted.

Note

pages has a higher priority than start and end. start and end take effect only if pages is omitted.
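The precedence between pages, start and end can be modelled with a small pure-Python helper. This is an illustrative sketch, not the library's code: select_page_indexes is a hypothetical name, and treating end as an exclusive bound (as in a Python slice) is an assumption that should be checked against the installed version.

```python
from typing import List, Optional, Sequence


def select_page_indexes(
    num_pages: int,
    start: int = 0,
    end: Optional[int] = None,
    pages: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return the zero-based page indexes that would be processed."""
    # An explicit page list wins; start and end are ignored in that case.
    if pages:
        return [int(i) for i in pages]
    # Assumption: end behaves like a slice bound, so it is exclusive.
    stop = num_pages if end is None else end
    return list(range(num_pages))[start:stop]
```

For a ten-page document, start=2 and end=5 select pages 2 to 4 under this assumption, while adding pages=[0, 7] selects exactly those two pages.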

Note

Multi-processing works only for a continuous range of pages specified by start and end, not for pages.
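A typical call to convert() might look like the sketch below. The wrapper name convert_range and its use_multiprocessing flag are hypothetical; the multi_processing keyword is assumed to be one of the keys listed in default_settings, so verify it there before relying on it.

```python
def convert_range(pdf_path, docx_path, start=0, end=None, use_multiprocessing=False):
    """Convert a continuous page range of a PDF to a docx file."""
    # Imported lazily so this helper can be defined even where pdf2docx is absent.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        # Only start and end are passed, never pages, because multi-processing
        # is documented to work for a continuous range only.
        cv.convert(
            docx_path,
            start=start,
            end=end,
            multi_processing=use_multiprocessing,  # assumed settings key
        )
    finally:
        cv.close()
```

When enabling multi-processing, call the function from code guarded by `if __name__ == "__main__":`, since worker processes may re-import the calling module.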

debug_page(i: int, docx_filename: Optional[str] = None, debug_pdf: Optional[str] = None, layout_file: Optional[str] = None, **kwargs)

Parse, create and plot a single page for debugging purposes.

Args:

  • i (int): Page index to convert.

  • docx_filename (str): docx file to write to.

  • debug_pdf (str): New PDF file storing layout information. Defaults to the PDF file name with the prefix debug_.

  • layout_file (str): New JSON file storing parsed layout data. Defaults to layout.json.
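As a sketch of how debug_page() might be used, the helper below writes all three debug artifacts for one page into a chosen directory. The names debug_output_paths and debug_one_page, and the output file naming scheme, are hypothetical.

```python
import os


def debug_output_paths(out_dir: str, page_index: int) -> dict:
    """Build the three output paths used for one debugged page."""
    stem = f"page{page_index}"
    return {
        "docx": os.path.join(out_dir, f"{stem}.docx"),
        "pdf": os.path.join(out_dir, f"debug_{stem}.pdf"),
        "json": os.path.join(out_dir, f"{stem}_layout.json"),
    }


def debug_one_page(pdf_path: str, out_dir: str, page_index: int = 0) -> dict:
    """Convert a single page and keep the debug PDF and layout JSON next to it."""
    # Imported lazily so this helper can be defined even where pdf2docx is absent.
    from pdf2docx import Converter

    os.makedirs(out_dir, exist_ok=True)
    paths = debug_output_paths(out_dir, page_index)
    cv = Converter(pdf_path)
    try:
        cv.debug_page(
            page_index,
            docx_filename=paths["docx"],
            debug_pdf=paths["pdf"],
            layout_file=paths["json"],
        )
    finally:
        cv.close()
    return paths
```

Passing explicit paths avoids the defaults, which place the debug files alongside the source PDF.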

property default_settings

Default parsing parameters.
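Because configuration is passed as free-form keyword arguments, a misspelled key may go unnoticed. One possible safeguard, sketched below, is to check overrides against the dictionary returned by default_settings before converting. The helper checked_overrides is hypothetical and not part of pdf2docx.

```python
def checked_overrides(defaults: dict, overrides: dict) -> dict:
    """Merge overrides into a copy of defaults, rejecting unknown keys."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown setting(s): {', '.join(unknown)}")
    merged = dict(defaults)   # copy so the caller's dict is left untouched
    merged.update(overrides)
    return merged
```

With a Converter instance cv, the merged result of checked_overrides(cv.default_settings, my_overrides) could then be passed to convert() as keyword arguments.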

deserialize(filename: str)

Load parsed pages from specified JSON file.

extract_tables(start: int = 0, end: Optional[int] = None, pages: Optional[list] = None, **kwargs)

Extract table contents from specified PDF pages.

Args:

  • start (int, optional): First page to process. Defaults to 0, the first page.

  • end (int, optional): Last page to process. Defaults to None, the last page.

  • pages (list, optional): Range of page indexes. Defaults to None.

  • kwargs (dict, optional): Configuration parameters. Defaults to None.

Returns:

list: A list of parsed table content.
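The sketch below shows one way to consume the result of extract_tables(). It assumes that each table is a list of rows and each row a list of cell values (text, or None for an empty cell); that structure is not stated above, so inspect the returned data first. The helper names are hypothetical.

```python
import csv
import io


def table_to_csv_text(table) -> str:
    """Render one table (assumed to be a list of rows of cells) as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        # Replace missing cells with empty strings so every row stays aligned.
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buf.getvalue()


def extract_tables_as_csv(pdf_path: str, start: int = 0, end=None) -> list:
    """Extract tables from a page range and return one CSV string per table."""
    # Imported lazily so this helper can be defined even where pdf2docx is absent.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [table_to_csv_text(table) for table in tables]
```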

property fitz_doc

load_pages(start: int = 0, end: Optional[int] = None, pages: Optional[list] = None)

Step 1 of the conversion process: open the PDF file with PyMuPDF, including password-encrypted files.

Args:

  • start (int, optional): First page to process. Defaults to 0, the first page.

  • end (int, optional): Last page to process. Defaults to None, the last page.

  • pages (list, optional): Range of page indexes to parse. Defaults to None.
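For an encrypted file, the password is supplied to the Converter constructor and used when the pages are loaded. The sketch below opens a document and reports its page count through the fitz_doc property; the helper name count_pages is hypothetical, and how a wrong password is reported is not specified here, so any exception is simply propagated.

```python
def count_pages(pdf_path: str, password=None) -> int:
    """Open a (possibly encrypted) PDF and return its number of pages."""
    # Imported lazily so this helper can be defined even where pdf2docx is absent.
    from pdf2docx import Converter

    cv = Converter(pdf_path, password=password)
    try:
        # Step 1 only: open the document, using the password if one was given.
        cv.load_pages()
        # fitz_doc exposes the underlying PyMuPDF document, whose length
        # is its page count.
        return len(cv.fitz_doc)
    finally:
        cv.close()
```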

make_docx(filename_or_stream=None, **kwargs)

Step 4 of the conversion process: create the docx file from the converted pages.

Args:

  • filename_or_stream (str, file-like): docx file to write.

  • kwargs (dict, optional): Configuration parameters.
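Since filename_or_stream accepts a file-like object and the constructor accepts a stream of PDF bytes, a conversion can in principle run without touching the file system. The following is a minimal sketch under that reading; convert_bytes is a hypothetical helper, and passing the full default_settings to both calls is an assumption.

```python
import io


def convert_bytes(pdf_bytes: bytes) -> bytes:
    """Convert PDF bytes to docx bytes entirely in memory."""
    # Imported lazily so this helper can be defined even where pdf2docx is absent.
    from pdf2docx import Converter

    cv = Converter(stream=pdf_bytes)
    out = io.BytesIO()
    try:
        settings = cv.default_settings   # assumption: pass the full defaults
        cv.parse(**settings)             # steps 1 to 3
        cv.make_docx(out, **settings)    # step 4, written to the buffer
    finally:
        cv.close()
    return out.getvalue()
```

This shape suits web handlers that receive an uploaded PDF and return the docx in the response body.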

property pages

parse(start: int = 0, end: Optional[int] = None, pages: Optional[list] = None, **kwargs)

Parse pages in three steps:

  • open the PDF file with PyMuPDF

  • analyze the whole document, e.g. page section, header/footer and margin

  • parse the specified pages, e.g. paragraph, image and table

Args:

  • start (int, optional): First page to process. Defaults to 0, the first page.

  • end (int, optional): Last page to process. Defaults to None, the last page.

  • pages (list, optional): Range of page indexes to parse. Defaults to None.

  • kwargs (dict, optional): Configuration parameters.

parse_document(**kwargs)

Step 2 of the conversion process: analyze the whole document, e.g. page section, header/footer and margin.

parse_pages(**kwargs)

Step 3 of the conversion process: parse pages, e.g. paragraph, image and table.

restore(data: dict)

Restore pages from parsed results.

serialize(filename: str)

Write parsed pages to specified JSON file.

store()

Store parsed pages in dict format.
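The methods store(), serialize(), restore() and deserialize() let parsed layout data be inspected or saved separately from docx generation. The sketch below parses a document, writes the result to a JSON file and also returns it as a dict. The helper name dump_layout is hypothetical, and the structure of the returned dict is not described here, so treat it as opaque.

```python
def dump_layout(pdf_path: str, json_path: str, pages=None) -> dict:
    """Parse a PDF, save the parsed pages as JSON and return them as a dict."""
    # Imported lazily so this helper can be defined even where pdf2docx is absent.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        # Assumption: parse() is given the complete default settings.
        cv.parse(pages=pages, **cv.default_settings)
        cv.serialize(json_path)   # write parsed pages to the JSON file
        return cv.store()         # the same data in dict form
    finally:
        cv.close()
```

A file written this way can later be loaded into a Converter for the same PDF with deserialize(json_path).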

exception pdf2docx.converter.MakedocxException

Bases: ConversionException