pdf2docx.image.ImagesExtractor module

Extract images from PDF.

Both raster images and vector graphics are considered:

  • Normal images such as JPEG or PNG can be extracted with Page.get_text('rawdict') and Page.get_images(). Note that PNG images with an alpha channel require additional processing.

  • Vector graphics are composed of groups of paths, represented by operators like re, m, l and c. They are detected by finding contours with OpenCV.
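As a rough illustration of the raster-image path, the sketch below filters the image blocks out of a Page.get_text('rawdict') result and picks the Page.get_images() entries that carry a soft mask, which is how a PDF typically stores the alpha channel of a PNG. The helper names image_blocks and items_with_soft_mask are hypothetical and not part of pdf2docx, and the sample data is made up so the snippet runs without a PDF file.

```python
def image_blocks(raw):
    """Return the image blocks of a Page.get_text('rawdict') result.

    In PyMuPDF text dictionaries, text blocks have type 0 and image
    blocks have type 1.
    """
    return [block for block in raw.get("blocks", []) if block.get("type") == 1]


def items_with_soft_mask(items):
    """Return the Page.get_images() items whose smask entry is non-zero.

    Each item is a tuple starting with (xref, smask, width, height, ...);
    smask is the xref of the soft mask, or 0 if the image has none.
    """
    return [item for item in items if item[1] > 0]


# Made-up sample data mimicking the shapes returned by PyMuPDF.
sample_raw = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {"type": 0, "bbox": (72, 72, 300, 90), "lines": []},
        {"type": 1, "bbox": (72, 100, 172, 150), "width": 100,
         "height": 50, "ext": "png", "image": b""},
    ],
}
sample_items = [
    (5, 0, 100, 50, 8, "DeviceRGB", "", "Im1", "DCTDecode"),
    (7, 8, 32, 32, 8, "DeviceRGB", "", "Im2", "FlateDecode"),
]

found_blocks = image_blocks(sample_raw)
masked_items = items_with_soft_mask(sample_items)
```

With a real document, raw would come from page.get_text('rawdict') and items from page.get_images() on a PyMuPDF page.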

class pdf2docx.image.ImagesExtractor.ImagesExtractor(page: Page)

Bases: object
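A minimal usage sketch, assuming PyMuPDF (imported as fitz) and pdf2docx are installed. The wrapper function extract_page_images, its parameter names, and the file path passed to it are hypothetical; only the constructor and extract_images() are taken from this page.

```python
def extract_page_images(pdf_path, page_index=0, res_ratio=3.0):
    """Return the image raw dicts found on one page of a PDF file."""
    # Imported lazily so that defining this helper does not require
    # PyMuPDF or pdf2docx to be installed.
    import fitz  # PyMuPDF
    from pdf2docx.image.ImagesExtractor import ImagesExtractor

    doc = fitz.open(pdf_path)
    try:
        extractor = ImagesExtractor(doc[page_index])
        return extractor.extract_images(clip_image_res_ratio=res_ratio)
    finally:
        # Close the document once the images have been collected.
        doc.close()
```

It could be called as extract_page_images("sample.pdf"), where sample.pdf stands for any PDF file on disk.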

clip_page_to_dict(bbox: Optional[Rect] = None, clip_image_res_ratio: float = 3.0)

Clip page pixmap (without text) according to bbox and convert to source image.

Args:

bbox (fitz.Rect, optional): Target area to clip. Defaults to None, i.e. entire page.

clip_image_res_ratio (float, optional): Resolution ratio of clipped bitmap. Defaults to 3.0.

Returns:

list: A list of image raw dicts.

clip_page_to_pixmap(bbox: Optional[Rect] = None, zoom: float = 3.0)

Clip page pixmap (without text) according to bbox.

Args:
bbox (fitz.Rect, optional): Target area to clip. Defaults to None, i.e. entire page.

Note that bbox is expressed in the un-rotated page coordinate system (CS), while the clipping itself is applied to the final page.

zoom (float, optional): Improve resolution by this rate. Defaults to 3.0.

Returns:

fitz.Pixmap: The extracted pixmap.
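To make the effect of zoom concrete, here is a hedged sketch of rendering a clipped region with PyMuPDF. The helpers effective_dpi and render_clip are hypothetical and are not the implementation of this method: in particular, the sketch does not hide text and does not convert bbox from the un-rotated coordinate system.

```python
def effective_dpi(zoom):
    """Resolution obtained when scaling a page by `zoom`.

    PDF user space uses 72 points per inch, so an unscaled render is
    72 dpi and a zoom factor multiplies that value.
    """
    return 72.0 * zoom


def render_clip(page, bbox=None, zoom=3.0):
    """Render `bbox` of a PyMuPDF page (or the whole page) at `zoom`."""
    # Imported lazily so that effective_dpi() works without PyMuPDF.
    import fitz  # PyMuPDF

    matrix = fitz.Matrix(zoom, zoom)  # same scale factor on both axes
    return page.get_pixmap(matrix=matrix, clip=bbox)


default_dpi = effective_dpi(3.0)
```

Under these assumptions the default zoom of 3.0 corresponds to 216 dpi, which trades a larger bitmap for sharper clipped images.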

detect_svg_contours(min_svg_gap_dx: float, min_svg_gap_dy: float, min_w: float, min_h: float)

Find contour of potential vector graphics.

Args:

min_svg_gap_dx (float): Merge svg if the horizontal gap is less than this value.

min_svg_gap_dy (float): Merge svg if the vertical gap is less than this value.

min_w (float): Ignore contours if the bbox width is less than this value.

min_h (float): Ignore contours if the bbox height is less than this value.

Returns:

list: A list of potential svg regions, each given as (external_bbox, inner_bboxes: list).
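The role of the four thresholds can be illustrated with a simplified, pure-Python sketch that works on bounding boxes only. It is not the OpenCV-based contour detection performed by this method: merge_regions and its helpers are hypothetical names, and treating two boxes as mergeable only when both the horizontal and the vertical gap are below their thresholds is an assumption.

```python
def interval_gap(a_lo, a_hi, b_lo, b_hi):
    """Distance between two 1-D intervals, or 0 if they overlap."""
    return max(0.0, max(a_lo, b_lo) - min(a_hi, b_hi))


def is_close(a, b, gap_dx, gap_dy):
    """True if boxes (x0, y0, x1, y1) are nearer than both thresholds."""
    return (interval_gap(a[0], a[2], b[0], b[2]) < gap_dx
            and interval_gap(a[1], a[3], b[1], b[3]) < gap_dy)


def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def merge_regions(boxes, min_gap_dx, min_gap_dy, min_w, min_h):
    """Merge nearby boxes, then drop the ones that are too small."""
    merged = list(boxes)
    changed = True
    while changed:  # repeat until a full pass merges nothing
        changed = False
        result = []
        for box in merged:
            for i, other in enumerate(result):
                if is_close(box, other, min_gap_dx, min_gap_dy):
                    result[i] = union(box, other)
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return [b for b in merged
            if b[2] - b[0] >= min_w and b[3] - b[1] >= min_h]


sample_boxes = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 101, 101)]
regions = merge_regions(sample_boxes, 5, 5, 2, 2)
```

In this sample the first two boxes are 2 units apart horizontally and are merged into one region, while the isolated 1×1 box is discarded by the size filter.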

extract_images(clip_image_res_ratio: float = 3.0)

Extract normal images with Page.get_images().

Args:

clip_image_res_ratio (float, optional): Resolution ratio of clipped bitmap. Defaults to 3.0.

Returns:

list: A list of extracted and recovered image raw dicts.

Note

Page.get_images() lists each image only once, so the number of entries may be less than the real count of images shown on a page.
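One way to see the difference is to count the rectangles in which each listed image is drawn, using PyMuPDF's Page.get_image_rects(). This is a sketch rather than what extract_images() does internally; image_placements and total_placements are hypothetical helpers that accept any object offering the two PyMuPDF page methods.

```python
def image_placements(page):
    """Map each image xref to the rectangles where it appears on the page."""
    placements = {}
    for item in page.get_images(full=True):
        xref = item[0]  # first entry of every get_images() item
        placements[xref] = list(page.get_image_rects(xref))
    return placements


def total_placements(page):
    """Count every drawn image occurrence, including repeated ones."""
    return sum(len(rects) for rects in image_placements(page).values())
```

For a page that shows the same logo twice, len(page.get_images()) would be 1 while total_placements(page) would be 2.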