pdf2docx.text.Line module

Text Line objects based on the PDF raw dict extracted with PyMuPDF.

The data structure of a line in a text block, as described in the PyMuPDF documentation:

{
    'bbox': (x0,y0,x1,y1),
    'wmode': m,
    'dir': [x,y],
    'spans': [ spans ]
}
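
For illustration, a raw line dict of this shape might look like the sketch below. All values are invented, and the span dicts are reduced to hypothetical 'bbox' and 'text' keys for brevity; real PyMuPDF spans carry more fields. The helper `union_bbox` is not part of pdf2docx; it only shows how the line bbox relates to the span boxes.

```python
# Hypothetical raw line dict; every value here is made up for illustration.
raw_line = {
    'bbox': (72.0, 100.0, 200.0, 112.0),   # (x0, y0, x1, y1)
    'wmode': 0,                            # writing mode: 0 = horizontal, 1 = vertical
    'dir': [1.0, 0.0],                     # writing direction vector [x, y]
    'spans': [
        # Simplified spans: only 'bbox' and an assumed 'text' key.
        {'bbox': (72.0, 100.0, 130.0, 112.0), 'text': 'Hello '},
        {'bbox': (130.0, 100.0, 200.0, 112.0), 'text': 'world'},
    ],
}


def union_bbox(spans):
    """Return the smallest (x0, y0, x1, y1) box covering all span boxes."""
    return (
        min(s['bbox'][0] for s in spans),
        min(s['bbox'][1] for s in spans),
        max(s['bbox'][2] for s in spans),
        max(s['bbox'][3] for s in spans),
    )
```

In this example the union of the span boxes coincides with the 'bbox' of the line.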

class pdf2docx.text.Line.Line(raw: Optional[dict] = None)

Bases: Element

Object representing a line in text block.

add(span_or_list)

Add a span or a list of spans to the current Line.

Args:

span_or_list (Span, Iterable): TextSpan or TextSpan list to add.

add_span(span: Element)

Add a span to the current Line.
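
The relationship between add and add_span can be sketched with a stand-in class. `SketchLine` is hypothetical and is not the pdf2docx implementation; it only shows one plausible way to accept either a single span or an iterable of spans.

```python
from collections.abc import Iterable


class SketchLine:
    """Hypothetical stand-in for a line that collects spans."""

    def __init__(self):
        self.spans = []

    def add_span(self, span):
        # Append a single span.
        self.spans.append(span)

    def add(self, span_or_list):
        # Treat strings and dicts as single items, not as collections.
        is_collection = isinstance(span_or_list, Iterable) and not isinstance(
            span_or_list, (str, bytes, dict)
        )
        if is_collection:
            for span in span_or_list:
                self.add_span(span)
        else:
            self.add_span(span_or_list)


line = SketchLine()
line.add({'text': 'a'})                      # single span
line.add([{'text': 'b'}, {'text': 'c'}])     # list of spans
```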

bbox: fitz.Rect

property image_spans

Get image spans in this Line.

intersects(rect)

Create a new Line object with the spans contained in the given bbox.

Args:

rect (fitz.Rect): Target bbox.

Returns:

Line: The created Line instance.
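
A minimal sketch of the containment idea, using plain (x0, y0, x1, y1) tuples instead of fitz.Rect. The helpers `contains` and `spans_in_rect` are hypothetical, and strict containment is an assumption; the library may treat partially overlapping spans differently.

```python
def contains(rect, bbox):
    """True if bbox lies entirely inside rect; both are (x0, y0, x1, y1)."""
    return (
        rect[0] <= bbox[0]
        and rect[1] <= bbox[1]
        and bbox[2] <= rect[2]
        and bbox[3] <= rect[3]
    )


def spans_in_rect(spans, rect):
    """Keep only the spans whose bbox is fully contained in rect."""
    return [s for s in spans if contains(rect, s['bbox'])]


spans = [
    {'bbox': (0, 0, 10, 10), 'text': 'inside'},
    {'bbox': (50, 0, 60, 10), 'text': 'outside'},
]
kept = spans_in_rect(spans, (0, 0, 20, 20))
```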

make_docx(p)

Create a docx line, i.e. a run in python-docx.

property raw_text

The joined text of all spans, with images ignored.

store()

Store properties in raw dict.

strip()

Remove redundant blanks at the beginning of the first span and at the end of the last span.
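
The effect can be sketched on a plain list of span texts. `strip_line` is a hypothetical helper, and it assumes that only the leading blanks of the first span and the trailing blanks of the last span are removed, leaving blanks between spans untouched.

```python
def strip_line(texts):
    """Strip leading blanks of the first text and trailing blanks of the last."""
    if not texts:
        return []
    out = list(texts)              # do not modify the caller's list
    out[0] = out[0].lstrip()
    out[-1] = out[-1].rstrip()
    return out
```

Blanks between spans are kept because they may separate words.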

property text

The joined text of all spans. Note that an image is represented by the placeholder <image>.
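
The difference between raw_text and text can be sketched as follows. The span dicts and the 'image' flag are hypothetical simplifications, and `join_text` is not a pdf2docx function.

```python
def join_text(spans, image_placeholder=None):
    """Join span texts; image spans are skipped or replaced by a placeholder."""
    parts = []
    for span in spans:
        if span.get('image'):                  # assumed marker for image spans
            if image_placeholder is not None:
                parts.append(image_placeholder)
        else:
            parts.append(span['text'])
    return ''.join(parts)


spans = [{'text': 'Fig. '}, {'image': True}, {'text': ' caption'}]
raw_text = join_text(spans)                    # images ignored
text = join_text(spans, '<image>')             # images shown as a placeholder
```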

property text_direction

Get text direction. Consider LEFT_RIGHT and BOTTOM_TOP only.

Returns:

TextDirection: Text direction of this line.
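
One way such a property could be derived from the 'dir' vector of the raw dict is sketched below. The `Direction` enum is a hypothetical stand-in for TextDirection, its member names and values are assumptions, and so is the mapping of (1, 0) to left-to-right and (0, -1) to bottom-to-top.

```python
from enum import Enum


class Direction(Enum):
    """Hypothetical stand-in for TextDirection; names and values are assumed."""
    IGNORE = -1
    LEFT_RIGHT = 0
    BOTTOM_TOP = 1


def direction_from_dir(dir_vector):
    """Map a writing direction vector [x, y] to a Direction (assumed mapping)."""
    x, y = dir_vector
    if x == 1.0:
        return Direction.LEFT_RIGHT    # text runs along the positive x axis
    if y == -1.0:
        return Direction.BOTTOM_TOP    # y grows downward, so -1 points up
    return Direction.IGNORE            # any other direction is not handled
```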

property white_space_only

Whether this line contains only white space. If True, the line can safely be removed.
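
A minimal sketch of such a check on a list of span texts. `is_white_space_only` is a hypothetical helper and image spans are not modelled here.

```python
def is_white_space_only(texts):
    """True if every span text is empty or consists of white space only."""
    # Note: an empty list of texts also counts as white space only.
    return all(not t.strip() for t in texts)


# Drop lines that carry no visible text.
lines = [['Hello ', 'world'], ['  ', '\t'], ['']]
kept = [texts for texts in lines if not is_white_space_only(texts)]
```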