pdf2docx.image.ImagesExtractor module
Extract images from PDF.
Both raster images and vector graphics are considered:
- Normal images such as JPEG or PNG can be extracted with page.get_text('rawdict') and Page.get_images(). Note that PNG images with an alpha channel need special processing.
- Vector graphics are actually composed of a group of paths, represented by operators like re, m, l and c. They are detected by finding the contours with opencv.
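To make the path operators mentioned above more concrete, here is a minimal sketch that computes a bounding box from a list of path items. The `(operator, operands)` tuple layout and the helper names `path_points` and `path_bbox` are assumptions made for illustration; they are not part of pdf2docx, which detects vector graphics from rendered contours rather than by parsing operators this way.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: each item is (operator, operands), mirroring
# the PDF path-construction operators named in the text:
#   m  -> move to (x, y)
#   l  -> straight line to (x, y)
#   c  -> cubic Bezier curve with operands (x1, y1, x2, y2, x3, y3)
#   re -> rectangle given as (x, y, width, height)
PathItem = Tuple[str, Sequence[float]]
Point = Tuple[float, float]


def path_points(items: List[PathItem]) -> List[Point]:
    """Collect the points named by each operator (control points for curves)."""
    points: List[Point] = []
    for op, args in items:
        if op in ("m", "l"):
            points.append((args[0], args[1]))
        elif op == "c":
            # Pair up x and y operands: (x1, y1), (x2, y2), (x3, y3).
            points.extend(zip(args[0::2], args[1::2]))
        elif op == "re":
            x, y, w, h = args
            points.extend([(x, y), (x + w, y + h)])
        else:
            raise ValueError(f"unsupported operator: {op}")
    return points


def path_bbox(items: List[PathItem]) -> Tuple[float, float, float, float]:
    """Bounding box (x0, y0, x1, y1) of all collected points.

    For curves this is a conservative bound, since a Bezier curve lies
    inside the convex hull of its control points.
    """
    xs, ys = zip(*path_points(items))
    return (min(xs), min(ys), max(xs), max(ys))


triangle = [("m", (10, 10)), ("l", (50, 10)), ("l", (30, 40))]
box = [("re", (100, 20, 40, 30))]
print(path_bbox(triangle + box))  # (10, 10, 140, 50)
```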
- class pdf2docx.image.ImagesExtractor.ImagesExtractor(page: Page)
Bases: object
- clip_page_to_dict(bbox: Optional[Rect] = None, clip_image_res_ratio: float = 3.0)
Clip page pixmap (without text) according to bbox and convert it to a source image.
- Args:
bbox (fitz.Rect, optional): Target area to clip. Defaults to None, i.e. entire page.
clip_image_res_ratio (float, optional): Resolution ratio of clipped bitmap. Defaults to 3.0.
- Returns:
list: A list of raw image dicts.
- clip_page_to_pixmap(bbox: Optional[Rect] = None, zoom: float = 3.0)
Clip page pixmap (without text) according to bbox.
- Args:
bbox (fitz.Rect, optional): Target area to clip. Defaults to None, i.e. entire page. Note that bbox is given in the un-rotated page coordinate system, while clipping is based on the final page.
zoom (float, optional): Improve resolution by this rate. Defaults to 3.0.
- Returns:
fitz.Pixmap: The extracted pixmap.
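As a rough illustration of what the zoom argument means for output size, the sketch below estimates the pixel dimensions of a clipped area. PDF coordinates are in points (72 per inch), so a zoom of 3.0 corresponds to about 216 dpi. The helper `clipped_pixel_size` is hypothetical and not part of pdf2docx; the actual renderer may round edges differently by a pixel.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def clipped_pixel_size(
    page_rect: Rect, bbox: Optional[Rect] = None, zoom: float = 3.0
) -> Tuple[int, int]:
    """Estimate (width, height) in pixels of the clipped bitmap.

    With bbox=None the entire page is used, matching the documented default.
    """
    x0, y0, x1, y1 = page_rect if bbox is None else bbox
    return round((x1 - x0) * zoom), round((y1 - y0) * zoom)


letter = (0.0, 0.0, 612.0, 792.0)  # US Letter page in points
print(clipped_pixel_size(letter))  # (1836, 2376)
print(clipped_pixel_size(letter, bbox=(72, 72, 172, 122), zoom=2.0))  # (200, 100)
```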
- detect_svg_contours(min_svg_gap_dx: float, min_svg_gap_dy: float, min_w: float, min_h: float)
Find contours of potential vector graphics.
- Args:
min_svg_gap_dx (float): Merge svg if the horizontal gap is less than this value.
min_svg_gap_dy (float): Merge svg if the vertical gap is less than this value.
min_w (float): Ignore contours if the bbox width is less than this value.
min_h (float): Ignore contours if the bbox height is less than this value.
- Returns:
list: A list of potential svg regions, each as (external_bbox, inner_bboxes: list).
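The four thresholds can be read as a filter-then-merge step over contour bounding boxes. The sketch below shows one possible interpretation in plain Python; it is not the library's implementation. The function `group_boxes`, the rule that two regions merge when both their horizontal and vertical gaps are below the thresholds, and filtering before merging are all assumptions. The second element of each result simply lists the member boxes of the merged region, which may differ from what pdf2docx stores in inner_bboxes.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _gap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Distance between intervals [a0, a1] and [b0, b1]; 0 if they overlap."""
    return max(0.0, max(a0, b0) - min(a1, b1))


def _close(a: Box, b: Box, dx: float, dy: float) -> bool:
    """Assumed merge rule: both gaps must be below their thresholds."""
    return _gap(a[0], a[2], b[0], b[2]) < dx and _gap(a[1], a[3], b[1], b[3]) < dy


def _union(boxes: List[Box]) -> Box:
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def group_boxes(
    boxes: List[Box],
    min_svg_gap_dx: float,
    min_svg_gap_dy: float,
    min_w: float,
    min_h: float,
) -> List[Tuple[Box, List[Box]]]:
    # Ignore boxes that are too narrow or too short (assumed to happen first).
    groups = [
        [b] for b in boxes if b[2] - b[0] >= min_w and b[3] - b[1] >= min_h
    ]
    # Repeatedly merge any two groups whose outer boxes are close,
    # until no further merge is possible.
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if _close(_union(groups[i]), _union(groups[j]),
                          min_svg_gap_dx, min_svg_gap_dy):
                    groups[i].extend(groups.pop(j))
                    merged = True
                    break
            if merged:
                break
    return [(_union(g), g) for g in groups]


boxes = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 130, 130), (50, 50, 51, 51)]
regions = group_boxes(boxes, min_svg_gap_dx=5, min_svg_gap_dy=5, min_w=2, min_h=2)
for outer, members in regions:
    print(outer, len(members))
```

In this example the first two boxes are 2 units apart horizontally and are merged into one region, the 1x1 box is ignored, and the distant box forms a region of its own.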
- extract_images(clip_image_res_ratio: float = 3.0)
Extract normal images with Page.get_images().
- Args:
clip_image_res_ratio (float, optional): Resolution ratio of clipped bitmap. Defaults to 3.0.
- Returns:
list: A list of extracted and recovered raw image dicts.
Note
Page.get_images() contains each image only once, so its length may be less than the real count of images on a page.
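The note above can be illustrated with a small, made-up example: when the same image object is drawn several times on a page, a list keyed by image identity is shorter than the list of placements. The placement data and the use of an integer xref as the image identifier are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical placements on one page: (xref, bbox).
# The same logo (xref 12) is drawn three times; a photo (xref 27) once.
placements = [
    (12, (36, 36, 96, 66)),
    (12, (36, 400, 96, 430)),
    (27, (120, 100, 480, 340)),
    (12, (36, 700, 96, 730)),
]

unique_xrefs = {xref for xref, _ in placements}       # one entry per image
per_image = Counter(xref for xref, _ in placements)   # placements per image

print(len(unique_xrefs))  # 2
print(len(placements))    # 4
print(per_image[12])      # 3
```

A caller that needs every occurrence therefore has to look up the positions of each image separately rather than rely on the length of the image list.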