pdf2docx.text.Line module
Text Line objects based on the PDF raw dict extracted with PyMuPDF.
Data structure of a line in a text block, as found in the PyMuPDF raw dict:
{
'bbox': (x0,y0,x1,y1),
'wmode': m,
'dir': [x,y],
'spans': [ spans ]
}
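To make the structure above concrete, here is a minimal sketch of one such line dict and a helper that reads its text. The coordinates, font name, and the `line_text` helper are made up for illustration; the `chars` list with a `c` key per character follows the layout PyMuPDF uses for spans in its raw dict.

```python
# Hand-written sample mimicking one line of a PyMuPDF raw dict.
# All coordinates and font values below are invented for illustration.
raw_line = {
    "bbox": (72.0, 100.0, 83.0, 112.0),
    "wmode": 0,          # 0: horizontal writing mode
    "dir": (1.0, 0.0),   # unit vector of the writing direction
    "spans": [
        {
            "bbox": (72.0, 100.0, 83.0, 112.0),
            "size": 11.0,
            "font": "Helvetica",
            "chars": [
                {"c": "H", "bbox": (72.0, 100.0, 80.0, 112.0)},
                {"c": "i", "bbox": (80.0, 100.0, 83.0, 112.0)},
            ],
        }
    ],
}


def line_text(line: dict) -> str:
    """Concatenate the characters of every span in a raw line dict.

    Hypothetical helper, not part of pdf2docx.
    """
    return "".join(ch["c"] for span in line["spans"] for ch in span["chars"])


print(line_text(raw_line))
```

A dict of this shape is what the `raw` argument of `Line` is documented to accept; constructing a `Line` from it requires pdf2docx to be installed and is not shown here.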
- class pdf2docx.text.Line.Line(raw: Optional[dict] = None)
Bases: Element
Object representing a line in text block.
- add(span_or_list)
Add a span or a list of spans to the current Line.
- Args:
span_or_list (Span, Iterable): TextSpan or TextSpan list to add.
- bbox: fitz.Rect
- property image_spans
Get image spans in this Line.
- intersects(rect)
Create a new Line object with the spans contained in the given bbox.
- Args:
rect (fitz.Rect): Target bbox.
- Returns:
Line: The created Line instance.
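The idea of keeping only the spans that fall inside a target bbox can be sketched without the library. This is a simplified illustration using plain `(x0, y0, x1, y1)` tuples and a strict containment test; the helper names are hypothetical, and the actual method may treat partially overlapping spans differently.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def contains(outer: BBox, inner: BBox) -> bool:
    """Return True if `inner` lies entirely within `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def spans_within(span_bboxes: List[BBox], rect: BBox) -> List[BBox]:
    """Keep only the span boxes fully contained in `rect` (illustrative)."""
    return [b for b in span_bboxes if contains(rect, b)]


# Two made-up span boxes on the same text line.
spans = [(10, 10, 50, 20), (55, 10, 120, 20)]
kept = spans_within(spans, (0, 0, 60, 30))
print(kept)
```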
- make_docx(p)
Create docx line, i.e. a run in python-docx.
- property raw_text
Join span text, ignoring images.
- store()
Store properties in raw dict.
- strip()
Remove redundant blanks at the beginning of the first span and at the end of the last span.
- property text
Join span text. Note that an image is translated to the placeholder <image>.
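The difference between `text` and `raw_text` can be illustrated with a small stand-alone sketch. The span representation here (dicts with a `kind` key) and the `join_text` helper are hypothetical and are not the library's own span classes; only the `<image>` placeholder comes from the description above.

```python
IMAGE_PLACEHOLDER = "<image>"


def join_text(spans, keep_images=True):
    """Join span text; image spans become a placeholder or are skipped.

    `spans` is a list of illustrative dicts, not pdf2docx span objects.
    """
    parts = []
    for span in spans:
        if span["kind"] == "image":
            if keep_images:
                parts.append(IMAGE_PLACEHOLDER)
        else:
            parts.append(span["text"])
    return "".join(parts)


spans = [
    {"kind": "text", "text": "Figure "},
    {"kind": "image"},
    {"kind": "text", "text": " shows"},
]

with_placeholder = join_text(spans)                       # analogous to `text`
without_images = join_text(spans, keep_images=False)      # analogous to `raw_text`
```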
- property text_direction
Get text direction. Consider LEFT_RIGHT and BOTTOM_TOP only.
- Returns:
TextDirection: Text direction of this line.
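One plausible way to derive a direction from the `dir` vector of the raw dict is sketched below. The `Direction` enum and `direction_from_dir` function are illustrative stand-ins, not the library's `TextDirection` or its actual logic. The sketch assumes page coordinates with y growing downward, so `(0, -1)` points up the page.

```python
from enum import Enum


class Direction(Enum):
    """Illustrative stand-in for a text direction enum."""
    LEFT_RIGHT = 0
    BOTTOM_TOP = 1
    OTHER = 2


def direction_from_dir(dir_vec, tol=1e-6):
    """Classify a unit direction vector (x, y) into a coarse direction."""
    x, y = dir_vec
    if abs(x - 1.0) <= tol and abs(y) <= tol:
        return Direction.LEFT_RIGHT   # pointing along +x
    if abs(x) <= tol and abs(y + 1.0) <= tol:
        return Direction.BOTTOM_TOP   # pointing along -y, i.e. up the page
    return Direction.OTHER            # any other rotation is not classified


print(direction_from_dir((1.0, 0.0)))
print(direction_from_dir((0.0, -1.0)))
```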
- property white_space_only
Whether this line contains only white space. If True, the line can safely be removed.