pdf2docx.text.TextSpan module#

Text Span object based on PDF raw dict extracted with PyMuPDF.

The data structure of a span follows the PyMuPDF raw dict (see the PyMuPDF documentation), extended with added keys:

{
    # raw dict
    ---------------------------
    'bbox': (x0,y0,x1,y1),
    'color': sRGB,
    'font': fontname,
    'size': fontsize,
    'flags': fontflags,
    'chars': [ chars ],

    # added dict
    ----------------------------
    'text': text,
    'style': [
        {
            'type': int,
            'color': int,
            'uri': str    # for hyperlink
        },
        ...
    ]
}
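As an illustration of the layout above, here is a minimal sketch of one such dict in plain Python. All values are made up for the example, the `'type'` code `0` is a placeholder rather than a real pdf2docx style code, and `srgb_to_rgb` is a hypothetical helper (not part of pdf2docx) that unpacks the sRGB integer, assuming the usual `0xRRGGBB` packing.

```python
# Illustrative span dict; every value below is invented for this example.
span = {
    # raw dict (keys as extracted with PyMuPDF)
    "bbox": (72.0, 100.0, 95.5, 112.0),
    "color": 0x1F4E79,   # sRGB integer, assumed packed as 0xRRGGBB
    "font": "Helvetica",
    "size": 11.0,
    "flags": 0,
    "chars": [],         # per-character dicts, omitted here
    # added keys
    "text": "Link",
    "style": [
        # 'type' 0 is a placeholder, not an actual pdf2docx style code
        {"type": 0, "color": 0x0000FF, "uri": "https://example.com"},
    ],
}


def srgb_to_rgb(value: int) -> tuple:
    """Unpack an sRGB integer (0xRRGGBB) into (r, g, b), each 0-255."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


print(srgb_to_rgb(span["color"]))  # (31, 78, 121)
```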
class pdf2docx.text.TextSpan.TextSpan(raw: Optional[dict] = None)#

Bases: Element

Object representing text span.

add(char: Char)#

Add char and update bbox accordingly.
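The bbox update can be pictured as a rectangle union. The sketch below is not the pdf2docx implementation: `SpanSketch` and `union` are hypothetical names, chars are plain dicts with `'c'` and `'bbox'` keys, and bboxes are `(x0, y0, x1, y1)` tuples instead of `Rect` objects.

```python
def union(a, b):
    """Smallest rectangle (x0, y0, x1, y1) covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


class SpanSketch:
    """Hypothetical stand-in for TextSpan; not the pdf2docx class."""

    def __init__(self):
        self.chars = []
        self.bbox = None  # no bbox until the first char arrives

    def add(self, char):
        """Append a char dict and grow the bbox to cover it."""
        self.chars.append(char)
        box = char["bbox"]
        self.bbox = box if self.bbox is None else union(self.bbox, box)


span = SpanSketch()
span.add({"c": "H", "bbox": (10.0, 20.0, 16.0, 32.0)})
span.add({"c": "i", "bbox": (16.0, 21.0, 19.0, 32.0)})
print(span.bbox)  # (10.0, 20.0, 19.0, 32.0)
```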

bbox: Rect#
cal_bbox()#

Calculate bbox based on contained instances.

intersects(rect)#

Create new TextSpan object with chars contained in given bbox.

Args:

rect (fitz.Rect): Target bbox.
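A minimal sketch of the selection step, assuming "contained" means the char bbox lies fully inside the target rectangle. The library may use a different criterion (for example an overlap ratio), so treat `contains` and `chars_in_rect` as hypothetical helpers working on plain tuples and dicts.

```python
def contains(rect, box):
    """True if box lies fully inside rect; both are (x0, y0, x1, y1)."""
    return (rect[0] <= box[0] and rect[1] <= box[1]
            and box[2] <= rect[2] and box[3] <= rect[3])


def chars_in_rect(chars, rect):
    """Keep only the chars whose bbox is contained in rect."""
    return [ch for ch in chars if contains(rect, ch["bbox"])]


# Three chars laid out left to right, each 10 units wide.
chars = [
    {"c": "a", "bbox": (0.0, 0.0, 10.0, 12.0)},
    {"c": "b", "bbox": (10.0, 0.0, 20.0, 12.0)},
    {"c": "c", "bbox": (20.0, 0.0, 30.0, 12.0)},
]
selected = chars_in_rect(chars, (10.0, 0.0, 30.0, 12.0))
print("".join(ch["c"] for ch in selected))  # bc
```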

property is_valid_line_height#
lstrip()#

Remove blanks at the left side, but keep one blank.

make_docx(paragraph)#

Add the text span to a docx paragraph and set its text style, e.g. font, color, underline, and hyperlink.

Note

A hyperlink and its style are parsed separately from the PDF. For instance, for a typical underlined hyperlink, the text and URI are parsed as the hyperlink itself, while the underline is treated as a normal text style.

plot(page, color: tuple)#

Plot the bbox on the PDF page for debugging purposes.

rstrip()#

Remove blanks at the right side, but keep one blank.
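The "keep one blank" rule of lstrip() and rstrip() can be sketched at string level. This is only an approximation: the real methods work on the span's chars and bbox, and the helper names below are hypothetical. "Blank" is assumed to mean the space character.

```python
def lstrip_keep_one(text: str) -> str:
    """Collapse a run of leading spaces to a single space."""
    stripped = text.lstrip(" ")
    return " " + stripped if len(stripped) < len(text) else text


def rstrip_keep_one(text: str) -> str:
    """Collapse a run of trailing spaces to a single space."""
    stripped = text.rstrip(" ")
    return stripped + " " if len(stripped) < len(text) else text


print(repr(lstrip_keep_one("   abc")))  # ' abc'
print(repr(rstrip_keep_one("abc   ")))  # 'abc '
print(repr(lstrip_keep_one("abc")))     # 'abc'
```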

split(rect: Shape, horizontal: bool = True)#

Split the span at its intersection with the given shape, giving up to three parts: span, intersection, span.

Args:

rect (Shape): Target shape to split this text span.

horizontal (bool, optional): Text direction. Defaults to True.

Returns:

list: Split text spans.
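The span-intersection-span pattern can be sketched for horizontal text as follows. This is a simplified illustration, not the library code: `split_horizontal` is a hypothetical helper, the target shape is reduced to an x-range, and a char counts as inside when its horizontal center falls in that range. Vertical text would compare y-coordinates instead.

```python
def split_horizontal(chars, x0, x1):
    """Group consecutive chars into runs lying outside or inside [x0, x1]."""
    groups = []  # list of (inside_flag, [chars]) in reading order
    for ch in chars:
        center = (ch["bbox"][0] + ch["bbox"][2]) / 2
        inside = x0 <= center <= x1
        if groups and groups[-1][0] == inside:
            groups[-1][1].append(ch)      # extend the current run
        else:
            groups.append((inside, [ch]))  # start a new run
    return ["".join(c["c"] for c in run) for _, run in groups]


# "hello world", each char 10 units wide on one line.
chars = [{"c": c, "bbox": (i * 10.0, 0.0, i * 10.0 + 10.0, 12.0)}
         for i, c in enumerate("hello world")]
print(split_horizontal(chars, 20.0, 50.0))  # ['he', 'llo', ' world']
```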

store()#

Store properties in raw dict.

property text#

Get span text. Note that text joined from the chars takes priority over the stored text.
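One reading of this priority rule, as a minimal sketch: if the span holds chars, the text is built by joining them, and the stored `'text'` value is used only as a fallback. `SpanTextSketch` is a hypothetical class, and the `'c'` key for a char's character is assumed from the PyMuPDF raw dict.

```python
class SpanTextSketch:
    """Hypothetical stand-in showing the text lookup order only."""

    def __init__(self, chars=None, text=""):
        self.chars = chars or []
        self._text = text  # stored text, used only when there are no chars

    @property
    def text(self):
        if self.chars:
            return "".join(ch["c"] for ch in self.chars)
        return self._text


with_chars = SpanTextSketch(chars=[{"c": "O"}, {"c": "K"}], text="stale")
without_chars = SpanTextSketch(text="stored")
print(with_chars.text)     # OK
print(without_chars.text)  # stored
```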