pdf2docx.text.TextSpan module
Text span object based on the PDF raw dict extracted with PyMuPDF.
For the span data structure, refer to the PyMuPDF documentation; the raw dict and the keys added to it are:
{
    # raw dict
    # ---------------------------
    'bbox': (x0, y0, x1, y1),
    'color': sRGB,
    'font': fontname,
    'size': fontsize,
    'flags': fontflags,
    'chars': [ chars ],

    # added dict
    # ----------------------------
    'text': text,
    'style': [
        {
            'type': int,
            'color': int,
            'uri': str  # for hyperlink
        },
        ...
    ]
}
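As a concrete illustration, the sketch below builds a span dict with the keys listed above and unpacks its sRGB color integer. All sample values and the helper `srgb_to_rgb` are assumptions made for illustration; they are not part of pdf2docx.

```python
# Hypothetical sample span following the layout above; all values are made up.
span = {
    # raw dict
    'bbox': (72.0, 100.0, 96.0, 112.0),
    'color': 0xFF0000,  # sRGB packed as 0xRRGGBB
    'font': 'Helvetica',
    'size': 12.0,
    'flags': 0,
    'chars': [
        {'c': 'H', 'bbox': (72.0, 100.0, 84.0, 112.0)},
        {'c': 'i', 'bbox': (84.0, 100.0, 96.0, 112.0)},
    ],
    # added dict
    'text': 'Hi',
    'style': [],
}


def srgb_to_rgb(value: int) -> tuple:
    """Unpack a 0xRRGGBB integer into an (r, g, b) tuple of 0-255 ints."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


rgb = srgb_to_rgb(span['color'])
joined = ''.join(ch['c'] for ch in span['chars'])
```

Here `joined` is rebuilt from the individual chars and matches the stored `'text'` value.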
- class pdf2docx.text.TextSpan.TextSpan(raw: Optional[dict] = None)
Bases: Element
Object representing text span.
- bbox: Rect
- cal_bbox()
Calculate bbox based on contained instances.
- intersects(rect)
Create a new TextSpan object with the chars contained in the given bbox.
- Args:
rect (fitz.Rect): Target bbox.
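The idea behind intersects() can be sketched without PyMuPDF by filtering chars on how much of their bbox falls inside the target rectangle. This is only a rough illustration: the helper names, the plain-tuple boxes, and the 0.5 overlap threshold are assumptions, not the library's actual rule.

```python
def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes; 0 if disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def chars_in_rect(chars, rect, threshold=0.5):
    """Keep chars whose bbox overlaps rect by more than `threshold` of their own area.

    The threshold is an assumed value chosen for this sketch.
    """
    kept = []
    for ch in chars:
        x0, y0, x1, y1 = ch['bbox']
        area = (x1 - x0) * (y1 - y0)
        if area > 0 and overlap_area(ch['bbox'], rect) / area > threshold:
            kept.append(ch)
    return kept


chars = [
    {'c': 'a', 'bbox': (0, 0, 10, 10)},
    {'c': 'b', 'bbox': (10, 0, 20, 10)},
    {'c': 'c', 'bbox': (20, 0, 30, 10)},
]
# Only 'b' lies mostly inside the target rectangle.
selected = chars_in_rect(chars, (8, 0, 22, 10))
```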
- property is_valid_line_height
- lstrip()
Remove blanks at the left side, but keep one blank.
- make_docx(paragraph)
Add the text span to a docx paragraph and set the text style, e.g. font, color, underline, hyperlink, etc.
Note
A hyperlink and its style are parsed separately from the PDF. For instance, for a typical underlined hyperlink, the text and URI are parsed as the hyperlink itself, while the underline is treated as a normal text style.
- plot(page, color: tuple)
Plot the bbox on the PDF page for debugging purposes.
- rstrip()
Remove blanks at the right side, but keep one blank.
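The "keep one blank" behavior of lstrip() and rstrip() can be sketched on plain strings. The real methods operate on the span's chars and bbox; the two helpers below are hypothetical string analogues, and they assume that "blank" means a space character.

```python
def lstrip_keep_one(text: str) -> str:
    """Collapse leading spaces to a single space; text without them is unchanged."""
    stripped = text.lstrip(' ')
    return ' ' + stripped if len(stripped) < len(text) else text


def rstrip_keep_one(text: str) -> str:
    """Collapse trailing spaces to a single space; text without them is unchanged."""
    stripped = text.rstrip(' ')
    return stripped + ' ' if len(stripped) < len(text) else text


left = lstrip_keep_one('   abc ')
right = rstrip_keep_one('abc   ')
```

Keeping one blank preserves the word boundary between neighboring spans while discarding the extra padding.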
- split(rect: Shape, horizontal: bool = True)
Split the span at its intersection with the given shape: span-intersection-span.
- Args:
rect (Shape): Target shape to split this text span.
horizontal (bool, optional): Text direction. Defaults to True.
- Returns:
list: Split text spans.
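The span-intersection-span pattern can be sketched as a three-way partition of chars along the text direction. The function name and the center-point rule used to assign each char are assumptions made for this sketch; the library may decide membership differently.

```python
def split_chars(chars, rect, horizontal=True):
    """Partition chars into before / inside / after groups relative to rect.

    Boxes are (x0, y0, x1, y1) tuples. A char is assigned by the center of its
    bbox along the text direction (an assumed rule). Empty groups are dropped.
    """
    lo, hi = (0, 2) if horizontal else (1, 3)  # x-range or y-range indices
    before, inside, after = [], [], []
    for ch in chars:
        center = (ch['bbox'][lo] + ch['bbox'][hi]) / 2
        if center < rect[lo]:
            before.append(ch)
        elif center > rect[hi]:
            after.append(ch)
        else:
            inside.append(ch)
    return [group for group in (before, inside, after) if group]


# Five chars, each 10 units wide, laid out left to right.
chars = [{'c': c, 'bbox': (i * 10, 0, i * 10 + 10, 10)} for i, c in enumerate('abcde')]
parts = split_chars(chars, (10, 0, 30, 10))
texts = [''.join(ch['c'] for ch in part) for part in parts]
```

With the sample rectangle covering the second and third chars, `texts` holds one entry per resulting piece, in reading order.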
- store()
Store properties in raw dict.
- property text
Get span text. Note that text obtained by joining chars takes priority.
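A minimal sketch of this priority rule, assuming the stored `'text'` value is used only as a fallback when no chars are available. `SpanSketch` is a hypothetical stand-in, not the library class.

```python
class SpanSketch:
    """Hypothetical stand-in for a text span, showing only the text lookup."""

    def __init__(self, raw=None):
        raw = raw or {}
        self.chars = raw.get('chars', [])
        self._text = raw.get('text', '')

    @property
    def text(self):
        # Chars win when present; the stored text is only a fallback (assumed).
        if self.chars:
            return ''.join(ch['c'] for ch in self.chars)
        return self._text


from_chars = SpanSketch({'chars': [{'c': 'o'}, {'c': 'k'}], 'text': 'stale'}).text
from_text = SpanSketch({'text': 'fallback'}).text
```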