pdf2docx.font.Fonts module#

Extract fonts properties from PDF.

Font properties like font name, size are covered in TextSpan, but more generic properties are required further:

  • Font family name. The font name extracted and set in TextSpan might not valid when directly used in MS Word, e.g. “ArialMT” should be “Arial”. So, we need to get font family name, which should be accepted by MS Word, based on the font file itself.

  • Font line height ratio. As line height = font_size * line_height_ratio, it’s used to calculate relative line spacing. In general, 1.12 is an approximate value to this ratio, but it’s in fact a font-related value, especially for CJK font.

    • So, extract font metrics, e.g. ascender and descender, with third party library fontTools in first priority. This can obtain an accurate line height ratio, but sometimes the embedded font data might crash.

    • Then, we have to use the default properties, i.e. ascender and descender, extracted by PyMuPDF directly, but this value isn’t so accurate.

class pdf2docx.font.Fonts.Font(descriptor, name, line_height)#

Bases: tuple

descriptor#

Alias for field number 0

line_height#

Alias for field number 2

name#

Alias for field number 1

class pdf2docx.font.Fonts.Fonts(instances: Optional[list] = None, parent=None)#

Bases: BaseCollection

Extracted fonts properties from PDF.

classmethod extract(fitz_doc)#

Extract fonts from PDF and get properties. * Only embedded fonts (v.s. the base 14 fonts) can be extracted. * The extracted fonts may be invalid due to reason from PDF file itself.

get(font_name: str)#

Get matched font by font name, or return None.

static get_font_family_name(tt_font: TTFont)#

Get the font family name from the font’s names table.

https://gist.github.com/pklaus/dce37521579513c574d0

static get_line_height_factor(tt_font: TTFont)#

Calculate line height ratio based on hhea and OS/2 tables.

Fon non-CJK fonts:

f = (hhea.Ascent - hhea.Descent + hhea.LineGap) / units_per_em

For non-CJK fonts (Windows):

f = (OS/2.winAscent + OS/2.winDescent + [External Leading]) / units_per_em
External Leading = MAX(0, hhea.LineGap - ((OS/2.WinAscent + OS/2.winDescent) - (hhea.Ascent - hhea.Descent)))

For CJK fonts:

f = 1.3 * (hhea.Ascent - hhea.Descent) / units_per_em

Read more: * https://docs.microsoft.com/en-us/typography/opentype/spec/recom#baseline-to-baseline-distances * https://github.com/source-foundry/font-line#baseline-to-baseline-distance-calculations * https://www.zhihu.com/question/23349103 * https://github.com/source-foundry/font-line/blob/master/lib/fontline/metrics.py

static is_cjk_font(tt_font: TTFont)#

Test font object to confirm that it meets our definition of a CJK font file.

The definition is met if any of the following conditions are True: 1. The font has a CJK code page bit set in the OS/2 table 2. The font has a CJK Unicode range bit set in the OS/2 table 3. The font has any CJK Unicode code points defined in the cmap table

https://github.com/googlefonts/fontbakery/blob/main/Lib/fontbakery/profiles/shared_conditions.py