pdf2docx.converter module
PDF to Docx Converter.
- exception pdf2docx.converter.ConversionException
Bases: Exception
- class pdf2docx.converter.Converter(pdf_file: Optional[str] = None, password: Optional[str] = None, stream: Optional[bytes] = None)
Bases: object
The PDF to docx converter.
Read the PDF file with PyMuPDF to get raw layout data page by page, including text, images, drawings and their properties, e.g. bounding box, font, size, image width and height.
Analyze the layout at document level, e.g. page header, footer and margin.
Parse the page layout into docx structure, e.g. paragraphs and their properties such as indentation, spacing and text alignment; tables and their properties such as border, shading and merging.
Finally, generate the docx file with python-docx.
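A minimal usage sketch of the workflow above, using only the constructor, convert() and close() documented on this page. The helper name pdf_to_docx is hypothetical and not part of pdf2docx; the import is deferred into the function so that defining the helper does not require the package to be installed.

```python
def pdf_to_docx(pdf_path, docx_path=None, **settings):
    """Convert a PDF to docx and always release the underlying document.

    `pdf_to_docx` is a hypothetical helper name, not part of pdf2docx.
    """
    # Deferred import: pdf2docx is a third-party package.
    from pdf2docx.converter import Converter

    cv = Converter(pdf_path)
    try:
        # Extra keyword arguments are forwarded as configuration parameters.
        cv.convert(docx_path, **settings)
    finally:
        # close() runs even if the conversion raises.
        cv.close()
```

A call such as pdf_to_docx("input.pdf", "output.docx", start=0, end=2) would then convert a page range; the file names here are placeholders.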
- close()
- convert(docx_filename: Optional[Union[str, IO]] = None, start: int = 0, end: Optional[int] = None, pages: Optional[list] = None, **kwargs)
Convert specified PDF pages to docx file.
- Args:
docx_filename (str, file-like, optional): docx file to write. Defaults to None.
start (int, optional): First page to process. Defaults to 0, the first page.
end (int, optional): Last page to process. Defaults to None, the last page.
pages (list, optional): Range of page indexes. Defaults to None.
kwargs (dict, optional): Configuration parameters. Defaults to None.
Refer to default_settings() for details of the configuration parameters.
Note
The output filename is obtained by changing the extension from pdf to docx if docx_filename is None.
Note
start and end are counted from zero if --zero_based_index=True (the default). Processing starts from the first page if start is omitted and ends with the last page if end is omitted.
Note
pages has a higher priority than start and end. start and end take effect only if pages is omitted.
Note
Multi-processing works only for continuous pages specified by start and end.
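The notes above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the function names are hypothetical, and it assumes that end behaves like a Python slice stop (exclusive), which this page does not state explicitly.

```python
from pathlib import Path


def default_docx_name(pdf_path):
    """Swap the .pdf extension for .docx, as done when no output name is given."""
    return str(Path(pdf_path).with_suffix(".docx"))


def select_page_indexes(page_count, start=0, end=None, pages=None):
    """Return the zero-based page indexes that would be processed.

    Assumption: `end` is treated like a slice stop (exclusive).
    """
    if pages:
        # An explicit page list wins; start and end are ignored.
        return [int(i) for i in pages]
    stop = page_count if end is None else end
    return list(range(page_count)[start:stop])


print(default_docx_name("report.pdf"))                        # report.docx
print(select_page_indexes(10, start=2, end=5))                # [2, 3, 4]
print(select_page_indexes(10, start=2, end=5, pages=[0, 7]))  # [0, 7]
```

Only the start/end form yields a continuous range, which is why the multi-processing note is restricted to it.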
- debug_page(i: int, docx_filename: Optional[str] = None, debug_pdf: Optional[str] = None, layout_file: Optional[str] = None, **kwargs)
Parse, create and plot a single page for debugging purposes.
- Args:
i (int): Page index to convert.
docx_filename (str): docx filename to write to.
debug_pdf (str): New pdf file storing layout information. Defaults to adding the prefix debug_.
layout_file (str): New json file storing parsed layout data. Defaults to layout.json.
- property default_settings
Default parsing parameters.
- deserialize(filename: str)
Load parsed pages from specified JSON file.
- extract_tables(start: int = 0, end: Optional[int] = None, pages: Optional[list] = None, **kwargs)
Extract table contents from specified PDF pages.
- Args:
start (int, optional): First page to process. Defaults to 0, the first page.
end (int, optional): Last page to process. Defaults to None, the last page.
pages (list, optional): Range of page indexes. Defaults to None.
kwargs (dict, optional): Configuration parameters. Defaults to None.
- Returns:
list: A list of parsed table content.
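A sketch of post-processing the returned list. The exact shape of each entry is not specified on this page; the code assumes each table is a list of rows and each row a list of cell values, with None for empty cells. The helper name tables_to_csv and the sample data are hypothetical stand-ins for the result of cv.extract_tables(...).

```python
import csv
import io


def tables_to_csv(tables):
    """Render each table as CSV text; None cells become empty strings.

    Assumption: a table is a list of rows, a row is a list of cell values.
    """
    outputs = []
    for table in tables:
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        for row in table:
            writer.writerow(["" if cell is None else str(cell) for cell in row])
        outputs.append(buffer.getvalue())
    return outputs


# Stand-in data shaped like the assumed return value of extract_tables().
sample_tables = [[["Name", "Qty"], ["Bolt", None]]]
for text in tables_to_csv(sample_tables):
    print(text)
```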
- property fitz_doc
- load_pages(start: int = 0, end: Optional[int] = None, pages: Optional[list] = None)
Step 1 of converting process: open PDF file with PyMuPDF, especially for password encrypted file.
- Args:
start (int, optional): First page to process. Defaults to 0, the first page.
end (int, optional): Last page to process. Defaults to None, the last page.
pages (list, optional): Range of page indexes to parse. Defaults to None.
- make_docx(filename_or_stream=None, **kwargs)
Step 4 of converting process: create docx file with converted pages.
- Args:
filename_or_stream (str, file-like): docx file to write.
kwargs (dict, optional): Configuration parameters.
- property pages
- parse(start: int = 0, end: Optional[int] = None, pages: Optional[list] = None, **kwargs)
Parse pages in three steps:
* open PDF file with PyMuPDF
* analyze whole document, e.g. page section, header/footer and margin
* parse specified pages, e.g. paragraph, image and table
- Args:
start (int, optional): First page to process. Defaults to 0, the first page.
end (int, optional): Last page to process. Defaults to None, the last page.
pages (list, optional): Range of page indexes to parse. Defaults to None.
kwargs (dict, optional): Configuration parameters.
- parse_document(**kwargs)
Step 2 of converting process: analyze whole document, e.g. page section, header/footer and margin.
- parse_pages(**kwargs)
Step 3 of converting process: parse pages, e.g. paragraph, image and table.
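The four numbered steps (load_pages, parse_document, parse_pages, make_docx) can be run one at a time, for example to inspect intermediate state between them. The sketch below takes an already opened Converter as cv; the helper name convert_in_steps is hypothetical. It assumes that the step methods expect a complete settings dict, so user overrides are merged on top of default_settings before being passed on.

```python
def convert_in_steps(cv, docx_filename, start=0, end=None, pages=None, **overrides):
    """Run the four documented steps one by one on an open converter `cv`.

    Assumption: each step expects the full set of configuration parameters,
    so overrides are layered on top of a copy of cv.default_settings.
    """
    settings = dict(cv.default_settings)
    settings.update(overrides)

    cv.load_pages(start, end, pages)         # step 1: open the PDF pages
    cv.parse_document(**settings)            # step 2: document-level analysis
    cv.parse_pages(**settings)               # step 3: per-page parsing
    cv.make_docx(docx_filename, **settings)  # step 4: write the docx
```

Copying default_settings first keeps the caller's overrides from mutating whatever object the property returns.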
- restore(data: dict)
Restore pages from parsed results.
- serialize(filename: str)
Write parsed pages to specified JSON file.
- store()
Store parsed pages in dict format.
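How store() and restore() relate to serialize() and deserialize() can be sketched as below. This is an illustration, not the library's implementation: the helper names are hypothetical, and the code assumes that the dict returned by store() is JSON-compatible.

```python
import json


def save_layout(cv, path):
    """Write the dict from cv.store() to a JSON file (roughly what serialize() does)."""
    # Assumption: the stored dict contains only JSON-compatible values.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cv.store(), f, indent=2)


def load_layout(cv, path):
    """Read a JSON file and hand the dict to cv.restore() (roughly deserialize())."""
    with open(path, encoding="utf-8") as f:
        cv.restore(json.load(f))
```

Working with the dict directly, instead of a file, is useful when the parsed layout should be kept in memory or sent elsewhere.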
- exception pdf2docx.converter.MakedocxException
Bases: ConversionException