tesseract  5.0.0-alpha-619-ge9db
tesseract::PageIterator Class Reference

#include <pageiterator.h>

Inheritance diagram for tesseract::PageIterator:
tesseract::PageIterator is inherited by tesseract::LTRResultIterator, which is inherited by tesseract::ResultIterator, which is in turn inherited by tesseract::MutableIterator.

Public Member Functions

 PageIterator (PAGE_RES *page_res, Tesseract *tesseract, int scale, int scaled_yres, int rect_left, int rect_top, int rect_width, int rect_height)
 
virtual ~PageIterator ()
 
 PageIterator (const PageIterator &src)
 
const PageIterator & operator= (const PageIterator &src)
 
bool PositionedAtSameWord (const PAGE_RES_IT *other) const
 
virtual void Begin ()
 
virtual void RestartParagraph ()
 
bool IsWithinFirstTextlineOfParagraph () const
 
virtual void RestartRow ()
 
virtual bool Next (PageIteratorLevel level)
 
virtual bool IsAtBeginningOf (PageIteratorLevel level) const
 
virtual bool IsAtFinalElement (PageIteratorLevel level, PageIteratorLevel element) const
 
int Cmp (const PageIterator &other) const
 
void SetBoundingBoxComponents (bool include_upper_dots, bool include_lower_dots)
 
bool BoundingBox (PageIteratorLevel level, int *left, int *top, int *right, int *bottom) const
 
bool BoundingBox (PageIteratorLevel level, int padding, int *left, int *top, int *right, int *bottom) const
 
bool BoundingBoxInternal (PageIteratorLevel level, int *left, int *top, int *right, int *bottom) const
 
bool Empty (PageIteratorLevel level) const
 
PolyBlockType BlockType () const
 
Pta * BlockPolygon () const
 
Pix * GetBinaryImage (PageIteratorLevel level) const
 
Pix * GetImage (PageIteratorLevel level, int padding, Pix *original_img, int *left, int *top) const
 
bool Baseline (PageIteratorLevel level, int *x1, int *y1, int *x2, int *y2) const
 
void Orientation (tesseract::Orientation *orientation, tesseract::WritingDirection *writing_direction, tesseract::TextlineOrder *textline_order, float *deskew_angle) const
 
void ParagraphInfo (tesseract::ParagraphJustification *justification, bool *is_list_item, bool *is_crown, int *first_line_indent) const
 
bool SetWordBlamerBundle (BlamerBundle *blamer_bundle)
 

Protected Member Functions

TESS_LOCAL void BeginWord (int offset)
 

Protected Attributes

PAGE_RES * page_res_
 
Tesseract * tesseract_
 
PAGE_RES_IT * it_
 
WERD * word_
 
int word_length_
 
int blob_index_
 
C_BLOB_IT * cblob_it_
 
bool include_upper_dots_
 
bool include_lower_dots_
 
int scale_
 
int scaled_yres_
 
int rect_left_
 
int rect_top_
 
int rect_width_
 
int rect_height_
 

Detailed Description

Class to iterate over tesseract page structure, providing access to all levels of the page hierarchy, without including any tesseract headers or having to handle any tesseract structures.

WARNING! This class points to data held within the TessBaseAPI class, and therefore can only be used while the TessBaseAPI class still exists and has not been subjected to a call of Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.

See apitypes.h for the definition of PageIteratorLevel. See also ResultIterator, derived from PageIterator, which adds in the ability to access OCR output with text-specific methods.

Definition at line 52 of file pageiterator.h.

Constructor & Destructor Documentation

◆ PageIterator() [1/2]

tesseract::PageIterator::PageIterator ( PAGE_RES *  page_res,
Tesseract *  tesseract,
int  scale,
int  scaled_yres,
int  rect_left,
int  rect_top,
int  rect_width,
int  rect_height 
)

page_res and tesseract come directly from the BaseAPI. The rectangle parameters are copied indirectly from the Thresholder, via the BaseAPI. They represent the coordinates of some rectangle in an original image (in top-left-origin coordinates) and therefore the top-left needs to be added to any output boxes in order to specify coordinates in the original image. See TessBaseAPI::SetRectangle. The scale and scaled_yres are in case the Thresholder scaled the image rectangle prior to thresholding. Any coordinates in tesseract's image must be divided by scale before adding (rect_left, rect_top). The scaled_yres indicates the effective resolution of the binary image that tesseract has been given by the Thresholder. After the constructor, Begin has already been called.
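The mapping described above (divide by scale, then add the rectangle origin) can be sketched in isolation. This is a minimal illustration, not library code: RectInfo, Point and ToOriginal are hypothetical names, and integer division is assumed, as in the listings further down this page.

```cpp
// Hypothetical stand-in for the scale_, rect_left_ and rect_top_ members.
struct RectInfo {
  int scale;      // factor by which the Thresholder scaled the rectangle
  int rect_left;  // left edge of the rectangle in the original image
  int rect_top;   // top edge of the rectangle in the original image
};

struct Point {
  int x;
  int y;
};

// Maps a top-left-origin point in the (possibly scaled) working image back
// to the original image: divide by scale, then add (rect_left, rect_top).
inline Point ToOriginal(const RectInfo& r, Point p) {
  return Point{p.x / r.scale + r.rect_left, p.y / r.scale + r.rect_top};
}
```

With scale 2 and a rectangle whose top-left corner is (100, 50), the working-image point (40, 20) maps to (120, 60) in the original image.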

Definition at line 30 of file pageiterator.cpp.

33  : page_res_(page_res),
35  word_(nullptr),
36  word_length_(0),
37  blob_index_(0),
38  cblob_it_(nullptr),
39  include_upper_dots_(false),
40  include_lower_dots_(false),
41  scale_(scale),
42  scaled_yres_(scaled_yres),
43  rect_left_(rect_left),
44  rect_top_(rect_top),
45  rect_width_(rect_width),
46  rect_height_(rect_height) {
47  it_ = new PAGE_RES_IT(page_res);
49 }

◆ ~PageIterator()

tesseract::PageIterator::~PageIterator ( )
virtual

Definition at line 51 of file pageiterator.cpp.

51  {
52  delete it_;
53  delete cblob_it_;
54 }

◆ PageIterator() [2/2]

tesseract::PageIterator::PageIterator ( const PageIterator &  src)

Page/ResultIterators may be copied! This makes it possible to iterate over all the objects at a lower level, while maintaining an iterator to objects at a higher level. These constructors DO NOT CALL Begin, so iterations will continue from the location of src.

Definition at line 61 of file pageiterator.cpp.

62  : page_res_(src.page_res_),
63  tesseract_(src.tesseract_),
64  word_(nullptr),
65  word_length_(src.word_length_),
66  blob_index_(src.blob_index_),
67  cblob_it_(nullptr),
68  include_upper_dots_(src.include_upper_dots_),
69  include_lower_dots_(src.include_lower_dots_),
70  scale_(src.scale_),
71  scaled_yres_(src.scaled_yres_),
72  rect_left_(src.rect_left_),
73  rect_top_(src.rect_top_),
74  rect_width_(src.rect_width_),
75  rect_height_(src.rect_height_) {
76  it_ = new PAGE_RES_IT(*src.it_);
77  BeginWord(src.blob_index_);
78 }

Member Function Documentation

◆ Baseline()

bool tesseract::PageIterator::Baseline ( PageIteratorLevel  level,
int *  x1,
int *  y1,
int *  x2,
int *  y2 
) const

Returns the baseline of the current object at the given level. The baseline is the line that passes through (x1, y1) and (x2, y2). WARNING: with vertical text, baselines may be vertical! Returns false if there is no baseline at the current position.

Definition at line 496 of file pageiterator.cpp.

497  {
498  if (it_->word() == nullptr) return false; // Already at the end!
499  ROW* row = it_->row()->row;
500  WERD* word = it_->word()->word;
501  TBOX box = (level == RIL_WORD || level == RIL_SYMBOL)
502  ? word->bounding_box()
503  : row->bounding_box();
504  int left = box.left();
505  ICOORD startpt(left, static_cast<int16_t>(row->base_line(left) + 0.5));
506  int right = box.right();
507  ICOORD endpt(right, static_cast<int16_t>(row->base_line(right) + 0.5));
508  // Rotate to image coordinates and convert to global image coords.
509  startpt.rotate(it_->block()->block->re_rotation());
510  endpt.rotate(it_->block()->block->re_rotation());
511  *x1 = startpt.x() / scale_ + rect_left_;
512  *y1 = (rect_height_ - startpt.y()) / scale_ + rect_top_;
513  *x2 = endpt.x() / scale_ + rect_left_;
514  *y2 = (rect_height_ - endpt.y()) / scale_ + rect_top_;
515  return true;
516 }
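The last four assignments in the listing flip the y axis (tesseract's internal coordinates are bottom-up) before scaling and offsetting. A minimal sketch of that step alone, with hypothetical names (BaselineGeometry, ImagePoint, BaselinePointToImage) and the rotation step left out:

```cpp
// Hypothetical stand-in for the scale_ and rect_* members used by Baseline.
struct BaselineGeometry {
  int scale;
  int rect_left;
  int rect_top;
  int rect_height;
};

struct ImagePoint {
  int x;
  int y;
};

// Converts a bottom-up point (x, y) to top-down original-image coordinates,
// mirroring the arithmetic in the listing above.
inline ImagePoint BaselinePointToImage(const BaselineGeometry& g, int x, int y) {
  return ImagePoint{x / g.scale + g.rect_left,
                    (g.rect_height - y) / g.scale + g.rect_top};
}
```

Because of the flip, a point that lies higher in tesseract's coordinates (larger y) comes out with a smaller y in the image.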

◆ Begin()

void tesseract::PageIterator::Begin ( )
virtual

Moves the iterator to point to the start of the page to begin an iteration.

Reimplemented in tesseract::ResultIterator.

Definition at line 105 of file pageiterator.cpp.

105  {
107  BeginWord(0);
108 }

◆ BeginWord()

void tesseract::PageIterator::BeginWord ( int  offset)
protected

Sets up the internal data for iterating the blobs of a new word, then moves the iterator to the given offset.

Definition at line 585 of file pageiterator.cpp.

585  {
586  WERD_RES* word_res = it_->word();
587  if (word_res == nullptr) {
588  // This is a non-text block, so there is no word.
589  word_length_ = 0;
590  blob_index_ = 0;
591  word_ = nullptr;
592  return;
593  }
594  if (word_res->best_choice != nullptr) {
595  // Recognition has been done, so we are using the box_word, which
596  // is already baseline denormalized.
597  word_length_ = word_res->best_choice->length();
598  if (word_res->box_word != nullptr) {
599  if (word_res->box_word->length() != word_length_) {
600  tprintf("Corrupted word! best_choice[len=%d] = %s, box_word[len=%d]: ",
602  word_res->box_word->length());
603  word_res->box_word->bounding_box().print();
604  }
605  ASSERT_HOST(word_res->box_word->length() == word_length_);
606  }
607  word_ = nullptr;
608  // We will be iterating the box_word.
609  delete cblob_it_;
610  cblob_it_ = nullptr;
611  } else {
612  // No recognition yet, so a "symbol" is a cblob.
613  word_ = word_res->word;
614  ASSERT_HOST(word_->cblob_list() != nullptr);
615  word_length_ = word_->cblob_list()->length();
616  if (cblob_it_ == nullptr) cblob_it_ = new C_BLOB_IT;
617  cblob_it_->set_to_list(word_->cblob_list());
618  }
619  for (blob_index_ = 0; blob_index_ < offset; ++blob_index_) {
620  if (cblob_it_ != nullptr)
621  cblob_it_->forward();
622  }
623 }

◆ BlockPolygon()

Pta * tesseract::PageIterator::BlockPolygon ( ) const

Returns the polygon outline of the current block. The returned Pta must be ptaDestroy-ed after use. Note that the returned Pta lists the vertices of the polygon, and the last edge is the line segment between the last point and the first point. nullptr will be returned if the iterator is at the end of the document or layout analysis was not used.

Definition at line 368 of file pageiterator.cpp.

368  {
369  if (it_->block() == nullptr || it_->block()->block == nullptr)
370  return nullptr; // Already at the end!
371  if (it_->block()->block->pdblk.poly_block() == nullptr)
372  return nullptr; // No layout analysis used - no polygon.
373  // Copy polygon, so we can unrotate it to image coordinates.
374  POLY_BLOCK* internal_poly = it_->block()->block->pdblk.poly_block();
375  ICOORDELT_LIST vertices;
376  vertices.deep_copy(internal_poly->points(), ICOORDELT::deep_copy);
377  POLY_BLOCK poly(&vertices, internal_poly->isA());
378  poly.rotate(it_->block()->block->re_rotation());
379  ICOORDELT_IT it(poly.points());
380  Pta* pta = ptaCreate(it.length());
381  int num_pts = 0;
382  for (it.mark_cycle_pt(); !it.cycled_list(); it.forward(), ++num_pts) {
383  ICOORD* pt = it.data();
384  // Convert to top-down coords within the input image.
385  int x = static_cast<float>(pt->x()) / scale_ + rect_left_;
386  int y = rect_top_ + rect_height_ - static_cast<float>(pt->y()) / scale_;
389  ptaAddPt(pta, x, y);
390  }
391  return pta;
392 }

◆ BlockType()

PolyBlockType tesseract::PageIterator::BlockType ( ) const

Returns the type of the current block. See apitypes.h for PolyBlockType.

Definition at line 358 of file pageiterator.cpp.

358  {
359  if (it_->block() == nullptr || it_->block()->block == nullptr)
360  return PT_UNKNOWN; // Already at the end!
361  if (it_->block()->block->pdblk.poly_block() == nullptr)
362  return PT_FLOWING_TEXT; // No layout analysis used - assume text.
363  return it_->block()->block->pdblk.poly_block()->isA();
364 }

◆ BoundingBox() [1/2]

bool tesseract::PageIterator::BoundingBox ( PageIteratorLevel  level,
int *  left,
int *  top,
int *  right,
int *  bottom 
) const

Returns the bounding rectangle of the current object at the given level, in coordinates of the original image. See the notes on the coordinate system in the constructor documentation above. Returns false if there is no such object at the current position. The returned bounding box is guaranteed to match the size and position of the image returned by GetBinaryImage, but may clip foreground pixels from a grey image. The padding argument to GetImage can be used to expand the image to include more foreground pixels. See GetImage below.

Definition at line 325 of file pageiterator.cpp.

327  {
328  return BoundingBox(level, 0, left, top, right, bottom);
329 }

◆ BoundingBox() [2/2]

bool tesseract::PageIterator::BoundingBox ( PageIteratorLevel  level,
int  padding,
int *  left,
int *  top,
int *  right,
int *  bottom 
) const

Definition at line 331 of file pageiterator.cpp.

333  {
334  if (!BoundingBoxInternal(level, left, top, right, bottom))
335  return false;
336  // Convert to the coordinate system of the original image.
337  *left = ClipToRange(*left / scale_ + rect_left_ - padding,
339  *top = ClipToRange(*top / scale_ + rect_top_ - padding,
341  *right = ClipToRange((*right + scale_ - 1) / scale_ + rect_left_ + padding,
342  *left, rect_left_ + rect_width_);
343  *bottom = ClipToRange((*bottom + scale_ - 1) / scale_ + rect_top_ + padding,
344  *top, rect_top_ + rect_height_);
345  return true;
346 }
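The conversion in the listing rounds the far edges up, applies the padding, and clips to the rectangle. A self-contained sketch of that arithmetic follows. Box, Geometry, Clip and ToOriginalBox are hypothetical names. The clip limits for left and top do not appear in the listing above, so the rectangle edges used for them here are an assumption.

```cpp
struct Box {
  int left, top, right, bottom;
};

// Hypothetical stand-in for the scale_ and rect_* members.
struct Geometry {
  int scale, rect_left, rect_top, rect_width, rect_height;
};

// Local helper standing in for ClipToRange.
inline int Clip(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

// Converts a working-image box to original-image coordinates with padding.
inline Box ToOriginalBox(const Geometry& g, Box in, int padding) {
  const int max_x = g.rect_left + g.rect_width;
  const int max_y = g.rect_top + g.rect_height;
  Box out;
  // ASSUMPTION: left and top are clipped to the rectangle edges.
  out.left = Clip(in.left / g.scale + g.rect_left - padding, g.rect_left, max_x);
  out.top = Clip(in.top / g.scale + g.rect_top - padding, g.rect_top, max_y);
  // Right and bottom round up, and never fall below left and top.
  out.right = Clip((in.right + g.scale - 1) / g.scale + g.rect_left + padding,
                   out.left, max_x);
  out.bottom = Clip((in.bottom + g.scale - 1) / g.scale + g.rect_top + padding,
                    out.top, max_y);
  return out;
}
```

Using the already-clipped left and top as the lower limits for right and bottom keeps the box from having negative width or height.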

◆ BoundingBoxInternal()

bool tesseract::PageIterator::BoundingBoxInternal ( PageIteratorLevel  level,
int *  left,
int *  top,
int *  right,
int *  bottom 
) const

Returns the bounding rectangle of the current object at the given level in the coordinates of the working image, that is, pix_binary(). The working image rectangle has its origin at (rect_left_, rect_top_) with respect to the original image and is scaled by a factor of scale_. Returns false if there is no such object at the current position.

Definition at line 265 of file pageiterator.cpp.

267  {
268  if (Empty(level))
269  return false;
270  TBOX box;
271  PARA *para = nullptr;
272  switch (level) {
273  case RIL_BLOCK:
276  break;
277  case RIL_PARA:
278  para = it_->row()->row->para();
279  // Fall through.
280  case RIL_TEXTLINE:
283  break;
284  case RIL_WORD:
287  break;
288  case RIL_SYMBOL:
289  if (cblob_it_ == nullptr)
290  box = it_->word()->box_word->BlobBox(blob_index_);
291  else
292  box = cblob_it_->data()->bounding_box();
293  }
294  if (level == RIL_PARA) {
295  PageIterator other = *this;
296  other.Begin();
297  do {
298  if (other.it_->block() &&
299  other.it_->block()->block == it_->block()->block &&
300  other.it_->row() && other.it_->row()->row &&
301  other.it_->row()->row->para() == para) {
302  box = box.bounding_union(other.it_->row()->row->bounding_box());
303  }
304  } while (other.Next(RIL_TEXTLINE));
305  }
306  if (level != RIL_SYMBOL || cblob_it_ != nullptr)
307  box.rotate(it_->block()->block->re_rotation());
308  // Now we have a box in tesseract coordinates relative to the image rectangle,
309  // we have to convert the coords to a top-down system.
310  const int pix_height = pixGetHeight(tesseract_->pix_binary());
311  const int pix_width = pixGetWidth(tesseract_->pix_binary());
312  *left = ClipToRange(static_cast<int>(box.left()), 0, pix_width);
313  *top = ClipToRange(pix_height - box.top(), 0, pix_height);
314  *right = ClipToRange(static_cast<int>(box.right()), *left, pix_width);
315  *bottom = ClipToRange(pix_height - box.bottom(), *top, pix_height);
316  return true;
317 }

◆ Cmp()

int tesseract::PageIterator::Cmp ( const PageIterator &  other) const

Returns whether this iterator is positioned
before other: -1
equal to other: 0
after other: 1

Definition at line 235 of file pageiterator.cpp.

235  {
236  int word_cmp = it_->cmp(*other.it_);
237  if (word_cmp != 0)
238  return word_cmp;
239  if (blob_index_ < other.blob_index_)
240  return -1;
241  if (blob_index_ == other.blob_index_)
242  return 0;
243  return 1;
244 }
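The comparison is lexicographic: word position first, blob index as the tie-breaker. A minimal sketch of the same ordering on a toy position type, where Position and ComparePositions are hypothetical names and word_order stands in for whatever ordering PAGE_RES_IT::cmp provides:

```cpp
// Toy position: which word we are at, and which blob within it.
struct Position {
  int word_order;  // stand-in for the ordering given by PAGE_RES_IT::cmp
  int blob_index;
};

// Returns -1 if a is before b, 0 if equal, 1 if a is after b.
inline int ComparePositions(const Position& a, const Position& b) {
  if (a.word_order != b.word_order) return a.word_order < b.word_order ? -1 : 1;
  if (a.blob_index < b.blob_index) return -1;
  if (a.blob_index == b.blob_index) return 0;
  return 1;
}
```

The blob index only matters when both iterators are on the same word.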

◆ Empty()

bool tesseract::PageIterator::Empty ( PageIteratorLevel  level) const

Returns whether there is no object of a given level.

Definition at line 349 of file pageiterator.cpp.

349  {
350  if (it_->block() == nullptr) return true; // Already at the end!
351  if (it_->word() == nullptr && level != RIL_BLOCK) return true; // image block
352  if (level == RIL_SYMBOL && blob_index_ >= word_length_)
353  return true; // Zero length word, or already at the end of it.
354  return false;
355 }

◆ GetBinaryImage()

Pix * tesseract::PageIterator::GetBinaryImage ( PageIteratorLevel  level) const

Returns a binary image of the current object at the given level. The position and size match the return from BoundingBoxInternal, and so this could be upscaled with respect to the original input image. Use pixDestroy to delete the image after use.

The following methods are used to generate the images:
RIL_BLOCK: mask the page image with the block polygon.
RIL_TEXTLINE: clip the rectangle of the line box from the page image. TODO(rays) fix this to generate and use a line polygon.
RIL_WORD: clip the rectangle of the word box from the page image.
RIL_SYMBOL: render the symbol outline to an image for cblobs (prior to recognition) or the bounding box otherwise.

A reconstruction of the original image (using xor to check for double representation) should be reasonably accurate, apart from removed noise, at the block level. Below the block level, the reconstruction will be missing images and line separators. At the symbol level, kerned characters will invade the bounding box if rendered after recognition, making an xor reconstruction inaccurate, but an or construction better. Before recognition, symbol-level reconstruction should be good, even with xor, since the images come from the connected components.

Definition at line 416 of file pageiterator.cpp.

416  {
417  int left, top, right, bottom;
418  if (!BoundingBoxInternal(level, &left, &top, &right, &bottom))
419  return nullptr;
420  if (level == RIL_SYMBOL && cblob_it_ != nullptr &&
421  cblob_it_->data()->area() != 0)
422  return cblob_it_->data()->render();
423  Box* box = boxCreate(left, top, right - left, bottom - top);
424  Pix* pix = pixClipRectangle(tesseract_->pix_binary(), box, nullptr);
425  boxDestroy(&box);
426  if (level == RIL_BLOCK || level == RIL_PARA) {
427  // Clip to the block polygon as well.
428  TBOX mask_box;
429  Pix* mask = it_->block()->block->render_mask(&mask_box);
430  int mask_x = left - mask_box.left();
431  int mask_y = top - (tesseract_->ImageHeight() - mask_box.top());
432  // AND the mask and pix, putting the result in pix.
433  pixRasterop(pix, std::max(0, -mask_x), std::max(0, -mask_y), pixGetWidth(pix),
434  pixGetHeight(pix), PIX_SRC & PIX_DST, mask, std::max(0, mask_x),
435  std::max(0, mask_y));
436  pixDestroy(&mask);
437  }
438  return pix;
439 }

◆ GetImage()

Pix * tesseract::PageIterator::GetImage ( PageIteratorLevel  level,
int  padding,
Pix *  original_img,
int *  left,
int *  top 
) const

Returns an image of the current object at the given level in greyscale if available in the input. To guarantee a binary image use GetBinaryImage. NOTE that in order to give the best possible image, the bounds are expanded slightly over the binary connected component, by the supplied padding, so the top-left position of the returned image is returned in (left,top). These will most likely not match the coordinates returned by BoundingBox. If you do not supply an original image, you will get a binary one. Use pixDestroy to delete the image after use.

Definition at line 452 of file pageiterator.cpp.

454  {
455  int right, bottom;
456  if (!BoundingBox(level, left, top, &right, &bottom))
457  return nullptr;
458  if (original_img == nullptr)
459  return GetBinaryImage(level);
460 
461  // Expand the box.
462  *left = std::max(*left - padding, 0);
463  *top = std::max(*top - padding, 0);
464  right = std::min(right + padding, rect_width_);
465  bottom = std::min(bottom + padding, rect_height_);
466  Box* box = boxCreate(*left, *top, right - *left, bottom - *top);
467  Pix* grey_pix = pixClipRectangle(original_img, box, nullptr);
468  boxDestroy(&box);
469  if (level == RIL_BLOCK || level == RIL_PARA) {
470  // Clip to the block polygon as well.
471  TBOX mask_box;
472  Pix* mask = it_->block()->block->render_mask(&mask_box);
473  // Copy the mask registered correctly into an image the size of grey_pix.
474  int mask_x = *left - mask_box.left();
475  int mask_y = *top - (pixGetHeight(original_img) - mask_box.top());
476  int width = pixGetWidth(grey_pix);
477  int height = pixGetHeight(grey_pix);
478  Pix* resized_mask = pixCreate(width, height, 1);
479  pixRasterop(resized_mask, std::max(0, -mask_x), std::max(0, -mask_y), width, height,
480  PIX_SRC, mask, std::max(0, mask_x), std::max(0, mask_y));
481  pixDestroy(&mask);
482  pixDilateBrick(resized_mask, resized_mask, 2 * padding + 1,
483  2 * padding + 1);
484  pixInvert(resized_mask, resized_mask);
485  pixSetMasked(grey_pix, resized_mask, UINT32_MAX);
486  pixDestroy(&resized_mask);
487  }
488  return grey_pix;
489 }
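The "Expand the box" step in the listing grows the box by the padding on every side and clamps it to fixed limits. A small sketch of only that step, with hypothetical names (PaddedBox, ExpandBox). The limits mirror the listing, which clamps against 0 and against rect_width_ and rect_height_:

```cpp
#include <algorithm>

struct PaddedBox {
  int left, top, right, bottom;
};

// Grows the box by `padding` on every side, clamped as in the listing above:
// left and top not below 0, right and bottom not above rect_width/rect_height.
inline PaddedBox ExpandBox(PaddedBox box, int padding, int rect_width,
                           int rect_height) {
  box.left = std::max(box.left - padding, 0);
  box.top = std::max(box.top - padding, 0);
  box.right = std::min(box.right + padding, rect_width);
  box.bottom = std::min(box.bottom + padding, rect_height);
  return box;
}
```

This expansion is why the (left, top) reported by GetImage will usually differ from the values returned by BoundingBox.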

◆ IsAtBeginningOf()

bool tesseract::PageIterator::IsAtBeginningOf ( PageIteratorLevel  level) const
virtual

Returns true if the iterator is at the start of an object at the given level. Possible uses include determining if a call to Next(RIL_WORD) moved to the start of a RIL_PARA.

For instance, suppose an iterator it is pointed to the first symbol of the first word of the third line of the second paragraph of the first block in a page, then:
it.IsAtBeginningOf(RIL_BLOCK) = false
it.IsAtBeginningOf(RIL_PARA) = false
it.IsAtBeginningOf(RIL_TEXTLINE) = true
it.IsAtBeginningOf(RIL_WORD) = true
it.IsAtBeginningOf(RIL_SYMBOL) = true

Reimplemented in tesseract::ResultIterator.

Definition at line 185 of file pageiterator.cpp.

185  {
186  if (it_->block() == nullptr) return false; // Already at the end!
187  if (it_->word() == nullptr) return true; // In an image block.
188  switch (level) {
189  case RIL_BLOCK:
190  return blob_index_ == 0 && it_->block() != it_->prev_block();
191  case RIL_PARA:
192  return blob_index_ == 0 &&
193  (it_->block() != it_->prev_block() ||
194  it_->row()->row->para() != it_->prev_row()->row->para());
195  case RIL_TEXTLINE:
196  return blob_index_ == 0 && it_->row() != it_->prev_row();
197  case RIL_WORD:
198  return blob_index_ == 0;
199  case RIL_SYMBOL:
200  return true;
201  }
202  return false;
203 }
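The behaviour can be reproduced on a toy model of the page hierarchy. This sketch is not how the class is implemented (the listing above compares against the previous block and row of the underlying PAGE_RES_IT). It only models the same answers on a flat list of symbols. All names (Level, Symbol, ModelIsAtBeginningOf) are hypothetical, and ids are assumed to be unique across the whole page at each level.

```cpp
#include <cstddef>
#include <vector>

// Toy counterpart of PageIteratorLevel.
enum Level { kBlock, kPara, kTextline, kWord, kSymbol };

// One entry per symbol, tagged with the ids of the objects that contain it.
struct Symbol {
  int block, para, line, word;
};

inline int IdAt(const Symbol& s, Level level) {
  switch (level) {
    case kBlock: return s.block;
    case kPara: return s.para;
    case kTextline: return s.line;
    case kWord: return s.word;
    case kSymbol: break;
  }
  return -1;  // kSymbol has no id; it is handled by the caller.
}

// True if the symbol at `pos` is the first symbol of its object at `level`.
inline bool ModelIsAtBeginningOf(const std::vector<Symbol>& page,
                                 std::size_t pos, Level level) {
  if (pos >= page.size()) return false;  // Already at the end.
  if (level == kSymbol || pos == 0) return true;
  return IdAt(page[pos], level) != IdAt(page[pos - 1], level);
}
```

For a symbol that starts the third line of the second paragraph of the first block, the model answers false for kBlock and kPara and true for kTextline, kWord and kSymbol, matching the example in the description.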

◆ IsAtFinalElement()

bool tesseract::PageIterator::IsAtFinalElement ( PageIteratorLevel  level,
PageIteratorLevel  element 
) const
virtual

Returns whether the iterator is positioned at the last element in a given level (e.g. the last word in a line, the last line in a block).

Here's some two-paragraph example text. It starts off innocuously enough but quickly turns bizarre.

The author inserts a cornucopia of words to guard against confused references.

Now take an iterator it pointed to the start of "bizarre."
it.IsAtFinalElement(RIL_PARA, RIL_SYMBOL) = false
it.IsAtFinalElement(RIL_PARA, RIL_WORD) = true
it.IsAtFinalElement(RIL_BLOCK, RIL_WORD) = false

Reimplemented in tesseract::ResultIterator.

Definition at line 209 of file pageiterator.cpp.

210  {
211  if (Empty(element)) return true; // Already at the end!
212  // The result is true if we step forward by element and find we are
213  // at the the end of the page or at beginning of *all* levels in:
214  // [level, element).
215  // When there is more than one level difference between element and level,
216  // we could for instance move forward one symbol and still be at the first
217  // word on a line, so we also have to be at the first symbol in a word.
218  PageIterator next(*this);
219  next.Next(element);
220  if (next.Empty(element)) return true; // Reached the end of the page.
221  while (element > level) {
222  element = static_cast<PageIteratorLevel>(element - 1);
223  if (!next.IsAtBeginningOf(element))
224  return false;
225  }
226  return true;
227 }
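The step-forward-and-check logic in the listing can be tried out on a toy page. The sketch below repeats a small flat-list model so that it stands alone. Level, Symbol, Page, AtBeginningOf, NextPos and AtFinalElement are hypothetical names, and ids are assumed to be unique across the page at each level. Only the final loop is taken directly from the listing.

```cpp
#include <cstddef>
#include <vector>

enum Level { kBlock, kPara, kTextline, kWord, kSymbol };

// One entry per symbol, tagged with the ids of its containing objects.
struct Symbol {
  int block, para, line, word;
};
using Page = std::vector<Symbol>;

inline int IdAt(const Symbol& s, Level level) {
  switch (level) {
    case kBlock: return s.block;
    case kPara: return s.para;
    case kTextline: return s.line;
    case kWord: return s.word;
    case kSymbol: break;
  }
  return -1;  // kSymbol has no id; callers handle it separately.
}

inline bool AtBeginningOf(const Page& page, std::size_t pos, Level level) {
  if (pos >= page.size()) return false;
  if (level == kSymbol || pos == 0) return true;
  return IdAt(page[pos], level) != IdAt(page[pos - 1], level);
}

// Moves to the start of the next object at `element`; page.size() means end.
inline std::size_t NextPos(const Page& page, std::size_t pos, Level element) {
  ++pos;
  while (pos < page.size() && !AtBeginningOf(page, pos, element)) ++pos;
  return pos;
}

inline bool AtFinalElement(const Page& page, std::size_t pos, Level level,
                           Level element) {
  if (pos >= page.size()) return true;  // Already at the end.
  const std::size_t next = NextPos(page, pos, element);
  if (next >= page.size()) return true;  // Reached the end of the page.
  // The next position must begin every level in [level, element).
  while (element > level) {
    element = static_cast<Level>(element - 1);
    if (!AtBeginningOf(page, next, element)) return false;
  }
  return true;
}
```

On a page whose first paragraph ends with a two-symbol word, followed by a second paragraph in the same block, the start of that word gives false for (kPara, kSymbol), true for (kPara, kWord) and false for (kBlock, kWord), the same pattern as the "bizarre." example.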

◆ IsWithinFirstTextlineOfParagraph()

bool tesseract::PageIterator::IsWithinFirstTextlineOfParagraph ( ) const

Return whether this iterator points anywhere in the first textline of a paragraph.

Definition at line 123 of file pageiterator.cpp.

123  {
124  PageIterator p_start(*this);
125  p_start.RestartParagraph();
126  return p_start.it_->row() == it_->row();
127 }

◆ Next()

bool tesseract::PageIterator::Next ( PageIteratorLevel  level)
virtual

Moves to the start of the next object at the given level in the page hierarchy, and returns false if the end of the page was reached. NOTE that RIL_SYMBOL will skip non-text blocks, but all other PageIteratorLevel level values will visit each non-text block once. Think of non text blocks as containing a single para, with a single line, with a single imaginary word. Calls to Next with different levels may be freely intermixed. This function iterates words in right-to-left scripts correctly, if the appropriate language has been loaded into Tesseract.

Moves to the start of the next object at the given level in the page hierarchy, and returns false if the end of the page was reached. NOTE (CHANGED!) that ALL PageIteratorLevel level values will visit each non-text block at least once. Think of non text blocks as containing a single para, with at least one line, with a single imaginary word, containing a single symbol. The bounding boxes mark out any polygonal nature of the block, and PTIsTextType(BlockType()) is false for non-text blocks. Calls to Next with different levels may be freely intermixed. This function iterates words in right-to-left scripts correctly, if the appropriate language has been loaded into Tesseract.

Reimplemented in tesseract::ResultIterator.

Definition at line 147 of file pageiterator.cpp.

147  {
148  if (it_->block() == nullptr) return false; // Already at the end!
149  if (it_->word() == nullptr)
150  level = RIL_BLOCK;
151 
152  switch (level) {
153  case RIL_BLOCK:
154  it_->forward_block();
155  break;
156  case RIL_PARA:
158  break;
159  case RIL_TEXTLINE:
160  for (it_->forward_with_empties(); it_->row() == it_->prev_row();
162  break;
163  case RIL_WORD:
165  break;
166  case RIL_SYMBOL:
167  if (cblob_it_ != nullptr)
168  cblob_it_->forward();
169  ++blob_index_;
170  if (blob_index_ >= word_length_)
172  else
173  return true;
174  break;
175  }
176  BeginWord(0);
177  return it_->block() != nullptr;
178 }

◆ operator=()

const PageIterator & tesseract::PageIterator::operator= ( const PageIterator &  src)

Definition at line 80 of file pageiterator.cpp.

80  {
81  page_res_ = src.page_res_;
82  tesseract_ = src.tesseract_;
83  include_upper_dots_ = src.include_upper_dots_;
84  include_lower_dots_ = src.include_lower_dots_;
85  scale_ = src.scale_;
86  scaled_yres_ = src.scaled_yres_;
87  rect_left_ = src.rect_left_;
88  rect_top_ = src.rect_top_;
89  rect_width_ = src.rect_width_;
90  rect_height_ = src.rect_height_;
91  delete it_;
92  it_ = new PAGE_RES_IT(*src.it_);
93  BeginWord(src.blob_index_);
94  return *this;
95 }

◆ Orientation()

void tesseract::PageIterator::Orientation ( tesseract::Orientation *  orientation,
tesseract::WritingDirection *  writing_direction,
tesseract::TextlineOrder *  textline_order,
float *  deskew_angle 
) const

Returns orientation for the block the iterator points to.
orientation, writing_direction, textline_order: see publictypes.h
deskew_angle: after rotating the block so the text orientation is upright, how many radians does one have to rotate the block anti-clockwise for it to be level? -Pi/4 <= deskew_angle <= Pi/4

Definition at line 518 of file pageiterator.cpp.

521  {
522  BLOCK* block = it_->block()->block;
523 
524  // Orientation
525  FCOORD up_in_image(0.0, 1.0);
526  up_in_image.unrotate(block->classify_rotation());
527  up_in_image.rotate(block->re_rotation());
528 
529  if (up_in_image.x() == 0.0F) {
530  if (up_in_image.y() > 0.0F) {
531  *orientation = ORIENTATION_PAGE_UP;
532  } else {
533  *orientation = ORIENTATION_PAGE_DOWN;
534  }
535  } else if (up_in_image.x() > 0.0F) {
536  *orientation = ORIENTATION_PAGE_RIGHT;
537  } else {
538  *orientation = ORIENTATION_PAGE_LEFT;
539  }
540 
541  // Writing direction
542  bool is_vertical_text = (block->classify_rotation().x() == 0.0);
543  bool right_to_left = block->right_to_left();
544  *writing_direction =
545  is_vertical_text
547  : (right_to_left
550 
551  // Textline Order
552  const bool is_mongolian = false; // TODO(eger): fix me
553  *textline_order = is_vertical_text
554  ? (is_mongolian
558 
559  // Deskew angle
560  FCOORD skew = block->skew(); // true horizontal for textlines
561  *deskew_angle = -skew.angle();
562 }
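Two parts of the listing are simple enough to sketch on their own: classifying the rotated "up" vector into one of four orientations, and negating the skew angle. PageOrientation, ClassifyUpVector and DeskewAngle are hypothetical names. The sketch also assumes that FCOORD::angle() is the usual atan2(y, x), which this page does not show.

```cpp
#include <cmath>

// Toy counterpart of tesseract::Orientation (values are illustrative only).
enum PageOrientation { kPageUp, kPageRight, kPageDown, kPageLeft };

// Mirrors the branch structure of the listing: an exactly vertical vector is
// up or down depending on the sign of y; otherwise the sign of x decides.
inline PageOrientation ClassifyUpVector(float x, float y) {
  if (x == 0.0f) return y > 0.0f ? kPageUp : kPageDown;
  return x > 0.0f ? kPageRight : kPageLeft;
}

// ASSUMPTION: the skew vector's angle is atan2(y, x). The deskew angle is
// its negation, as in the last line of the listing.
inline float DeskewAngle(float skew_x, float skew_y) {
  return -std::atan2(skew_y, skew_x);
}
```

A skew vector of (1, 0) means the textlines are already level, so the deskew angle is zero.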

◆ ParagraphInfo()

void tesseract::PageIterator::ParagraphInfo ( tesseract::ParagraphJustification *  justification,
bool *  is_list_item,
bool *  is_crown,
int *  first_line_indent 
) const

Returns information about the current paragraph, if available.

justification - LEFT if ragged right, or fully justified and script is left-to-right. RIGHT if ragged left, or fully justified and script is right-to-left. Unknown if it looks like source code or we have very few lines.

is_list_item - true if we believe this is a member of an ordered or unordered list.

is_crown - true if the first line of the paragraph is aligned with the other lines of the paragraph even though subsequent paragraphs have first line indents. This typically indicates that this is the continuation of a previous paragraph or that it is the very first paragraph in the chapter.

first_line_indent - For LEFT aligned paragraphs, the first text line of paragraphs of this kind is indented this many pixels from the left edge of the rest of the paragraph. For RIGHT aligned paragraphs, the first text line of paragraphs of this kind is indented this many pixels from the right edge of the rest of the paragraph.
NOTE 1: This value may be negative.
NOTE 2: If *is_crown == true, the first line of this paragraph is actually flush, and first_line_indent is set to the "common" first_line_indent for subsequent paragraphs in this block of text.

Definition at line 564 of file pageiterator.cpp.

567  {
569  if (!it_->row() || !it_->row()->row || !it_->row()->row->para() ||
570  !it_->row()->row->para()->model)
571  return;
572 
573  PARA *para = it_->row()->row->para();
574  *is_list_item = para->is_list_item;
575  *is_crown = para->is_very_first_or_continuation;
576  *first_line_indent = para->model->first_indent() -
577  para->model->body_indent();
578  *just = para->model->justification();
579 }
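As the listing shows, first_line_indent is the difference between the paragraph model's first-line indent and its body indent, which is why it can be negative. A tiny sketch with a hypothetical ToyParagraphModel type standing in for the model object:

```cpp
// Hypothetical stand-in for the paragraph model's two indent values.
struct ToyParagraphModel {
  int first_indent;  // indent of the first line, in pixels
  int body_indent;   // indent of the remaining lines, in pixels
};

// Positive: the first line is indented relative to the body.
// Negative: the body is indented more than the first line.
inline int FirstLineIndent(const ToyParagraphModel& m) {
  return m.first_indent - m.body_indent;
}
```

A model with first_indent 30 and body_indent 10 yields 20, while first_indent 0 and body_indent 15 yields -15.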

◆ PositionedAtSameWord()

bool tesseract::PageIterator::PositionedAtSameWord ( const PAGE_RES_IT *  other) const

Are we positioned at the same location as other?

Definition at line 97 of file pageiterator.cpp.

97  {
98  return (it_ == nullptr && it_ == other) ||
99  ((other != nullptr) && (it_ != nullptr) && (*it_ == *other));
100 }

◆ RestartParagraph()

void tesseract::PageIterator::RestartParagraph ( )
virtual

Moves the iterator to the beginning of the paragraph. This class implements this functionality by moving it to the zero indexed blob of the first (leftmost) word on the first row of the paragraph.

Definition at line 110 of file pageiterator.cpp.

110  {
111  if (it_->block() == nullptr) return; // At end of the document.
112  PAGE_RES_IT para(page_res_);
113  PAGE_RES_IT next_para(para);
114  next_para.forward_paragraph();
115  while (next_para.cmp(*it_) <= 0) {
116  para = next_para;
117  next_para.forward_paragraph();
118  }
119  *it_ = para;
120  BeginWord(0);
121 }
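
The loop above walks paragraph starts from the beginning of the page and keeps the last one that does not lie after the current position. The sketch below models that scan with plain integers. ParagraphStart and para_starts are hypothetical; positions are integers rather than PAGE_RES_IT objects, and the explicit bounds check is an addition that the sketch needs but the listing does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the start position of the paragraph that contains `current`.
// `para_starts` is a hypothetical, non-empty, ascending list of paragraph
// start positions whose first entry is the start of the page.
int ParagraphStart(const std::vector<int>& para_starts, int current) {
  std::size_t para = 0;  // candidate paragraph, initially the first one
  std::size_t next = 1;  // the paragraph after the candidate
  // Advance while the next paragraph still starts at or before `current`,
  // mirroring `next_para.cmp(*it_) <= 0` in the listing.
  while (next < para_starts.size() && para_starts[next] <= current) {
    para = next;
    ++next;
  }
  return para_starts[para];
}
```

The scan is linear in the number of paragraphs that precede the current position, since it always starts from the first paragraph on the page.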

◆ RestartRow()

void tesseract::PageIterator::RestartRow ( )
virtual

Moves the iterator to the beginning of the text line. This class implements this functionality by moving it to the zero indexed blob of the first (leftmost) word of the row.

Definition at line 129 of file pageiterator.cpp.

129  {
130  it_->restart_row();
131  BeginWord(0);
132 }

◆ SetBoundingBoxComponents()

void tesseract::PageIterator::SetBoundingBoxComponents ( bool  include_upper_dots,
bool  include_lower_dots 
)
inline

Controls what to include in a bounding box. Bounding boxes of all levels between RIL_WORD and RIL_BLOCK can include or exclude potential diacritics. Between layout analysis and recognition, it isn't known where all diacritics belong, so this control is used to include or exclude some diacritics that are above or below the main body of the word. In most cases where the placement is obvious, and after recognition, it doesn't make as much difference, as the diacritics will already be included in the word.

Definition at line 190 of file pageiterator.h.

191  {
192  include_upper_dots_ = include_upper_dots;
193  include_lower_dots_ = include_lower_dots;
194  }
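
To make the effect of the two flags concrete, here is a simplified model, not the Tesseract implementation. BoxSketch, Part, BlobSketch and RestrictedBox are all hypothetical, and the sketch assumes each blob has already been labelled as body, upper dot or lower dot, which is a step the documentation above does not describe. Coordinates follow the image convention, with top smaller than bottom.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical box in image coordinates (top < bottom).
struct BoxSketch { int left, top, right, bottom; };

// Hypothetical labelling of a blob relative to the main body of the word.
enum class Part { kBody, kUpperDot, kLowerDot };

struct BlobSketch { BoxSketch box; Part part; };

// Union of the blob boxes, skipping upper or lower dots when the matching
// flag is false. Returns an all-zero box if every blob is skipped.
BoxSketch RestrictedBox(const std::vector<BlobSketch>& blobs,
                        bool include_upper_dots, bool include_lower_dots) {
  BoxSketch result{0, 0, 0, 0};
  bool have_box = false;
  for (const BlobSketch& blob : blobs) {
    if (blob.part == Part::kUpperDot && !include_upper_dots) continue;
    if (blob.part == Part::kLowerDot && !include_lower_dots) continue;
    if (!have_box) {
      result = blob.box;
      have_box = true;
    } else {
      result.left = std::min(result.left, blob.box.left);
      result.top = std::min(result.top, blob.box.top);
      result.right = std::max(result.right, blob.box.right);
      result.bottom = std::max(result.bottom, blob.box.bottom);
    }
  }
  return result;
}
```

In this model, enabling include_upper_dots can only move the top edge up and enabling include_lower_dots can only move the bottom edge down; the body of the word always contributes.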

◆ SetWordBlamerBundle()

bool tesseract::PageIterator::SetWordBlamerBundle ( BlamerBundle * blamer_bundle)

Definition at line 625 of file pageiterator.cpp.

625  {
626  if (it_->word() != nullptr) {
627  it_->word()->blamer_bundle = blamer_bundle;
628  return true;
629  } else {
630  return false;
631  }
632 }

Member Data Documentation

◆ blob_index_

int tesseract::PageIterator::blob_index_
protected

The current blob index within the word.

Definition at line 341 of file pageiterator.h.

◆ cblob_it_

C_BLOB_IT* tesseract::PageIterator::cblob_it_
protected

Iterator to the blobs within the word. If nullptr, then we are iterating OCR results in the box_word. Owned by this ResultIterator.

Definition at line 347 of file pageiterator.h.

◆ include_lower_dots_

bool tesseract::PageIterator::include_lower_dots_
protected

Definition at line 350 of file pageiterator.h.

◆ include_upper_dots_

bool tesseract::PageIterator::include_upper_dots_
protected

Control over what to include in bounding boxes.

Definition at line 349 of file pageiterator.h.

◆ it_

PAGE_RES_IT* tesseract::PageIterator::it_
protected

The iterator to the page_res_. Owned by this ResultIterator. A pointer just to avoid dragging in Tesseract includes.

Definition at line 332 of file pageiterator.h.

◆ page_res_

PAGE_RES* tesseract::PageIterator::page_res_
protected

Pointer to the page_res owned by the API.

Definition at line 325 of file pageiterator.h.

◆ rect_height_

int tesseract::PageIterator::rect_height_
protected

Definition at line 357 of file pageiterator.h.

◆ rect_left_

int tesseract::PageIterator::rect_left_
protected

Definition at line 354 of file pageiterator.h.

◆ rect_top_

int tesseract::PageIterator::rect_top_
protected

Definition at line 355 of file pageiterator.h.

◆ rect_width_

int tesseract::PageIterator::rect_width_
protected

Definition at line 356 of file pageiterator.h.

◆ scale_

int tesseract::PageIterator::scale_
protected

Parameters saved from the Thresholder. Needed to rebuild coordinates.

Definition at line 352 of file pageiterator.h.

◆ scaled_yres_

int tesseract::PageIterator::scaled_yres_
protected

Definition at line 353 of file pageiterator.h.

◆ tesseract_

Tesseract* tesseract::PageIterator::tesseract_
protected

Pointer to the Tesseract object owned by the API.

Definition at line 327 of file pageiterator.h.

◆ word_

WERD* tesseract::PageIterator::word_
protected

The current input WERD being iterated. If there is an output from OCR, then word_ is nullptr. Owned by the API.

Definition at line 337 of file pageiterator.h.

◆ word_length_

int tesseract::PageIterator::word_length_
protected

The length of the current word_.

Definition at line 339 of file pageiterator.h.


The documentation for this class was generated from the following files:
pageiterator.h
pageiterator.cpp