tesseract  4.0.0-1-g2a2b
pdfrenderer.cpp
1 // File: pdfrenderer.cpp
3 // Description: PDF rendering interface to inject into TessBaseAPI
4 //
5 // (C) Copyright 2011, Google Inc.
6 // Licensed under the Apache License, Version 2.0 (the "License");
7 // you may not use this file except in compliance with the License.
8 // You may obtain a copy of the License at
9 // http://www.apache.org/licenses/LICENSE-2.0
10 // Unless required by applicable law or agreed to in writing, software
11 // distributed under the License is distributed on an "AS IS" BASIS,
12 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 // See the License for the specific language governing permissions and
14 // limitations under the License.
15 //
17 
18 // Include automatically generated configuration file if running autoconf.
19 #ifdef HAVE_CONFIG_H
20 #include "config_auto.h"
21 #endif
22 
23 #include <memory> // std::unique_ptr
24 #include "allheaders.h"
25 #include "baseapi.h"
26 #include <cmath>
27 #include "renderer.h"
28 #include <cstring>
29 #include "tprintf.h"
30 
31 /*
32 
33 Design notes from Ken Sharp, with light editing.
34 
35 We think one solution is a font with a single glyph (.notdef) and a
36 CIDToGIDMap which maps all the CIDs to 0. That map would then be
37 stored as a stream in the PDF file, and when flate compressed should
38 be pretty small. The font, of course, will be approximately the same
39 size as the one you currently use.
40 
41 I'm working on such a font now, the CIDToGIDMap is trivial, you just
42 create a stream object which contains 128k bytes (2 bytes per possible
43 CID and your CIDs range from 0 to 65535) and where you currently have
44 "/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
45 
46 Note that if, in future, you were to use a different (ie not 2 byte)
47 CMap for character codes you could trivially extend the CIDToGIDMap.
48 
49 The following is an explanation of how some of the font stuff works,
50 this may be too simple for you in which case please accept my
51 apologies, it's hard to know how much knowledge someone has. You can
52 skip all this anyway, it's just for information.
53 
54 The font embedded in a PDF file is usually intended just to be
55 rendered, but extensions allow for at least some ability to locate (or
56 copy) text from a document. This isn't something which was an original
57 goal of the PDF format, but it's been retrofitted, presumably due to
58 popular demand.
59 
60 To do this reliably the PDF file must contain a ToUnicode CMap, a
61 device for mapping character codes to Unicode code points. If one of
62 these is present, then this will be used to convert the character
63 codes into Unicode values. If it's not present then the reader will
64 fall back through a series of heuristics to try and guess the
65 result. This is, as you would expect, prone to failure.
66 
67 You always write a ToUnicode CMap, and because you are writing the
68 text in text rendering mode 3 (invisible) it would seem that you
69 don't really need to worry about the font at all. But in the PDF
70 spec you cannot have an isolated ToUnicode CMap, it has to be
71 attached to a font, so in order to get even copy/paste to work you
72 need to define a font.
73 
74 This is what leads to problems, tools like pdfwrite assume that they
75 are going to be able to (or even have to) modify the font entries, so
76 they require that the font being embedded be valid, and to be honest
77 the font Tesseract embeds isn't valid (for this purpose).
78 
79 
80 To see why, let's look at how text is specified in a PDF file:
81 
82 (Test) Tj
83 
84 Now that looks like text but actually it isn't. Each of those bytes is
85 a 'character code'. When it comes to rendering the text a complex
86 sequence of events takes place, which converts the character code into
87 'something' which the font understands. It's entirely possible via
88 character mappings to have that text render as 'Sftu'
89 
90 For simple fonts (PostScript type 1), we use the character code as the
91 index into an Encoding array (256 elements), each element of which is
92 a glyph name, so this gives us a glyph name. We then consult the
93 CharStrings dictionary in the font, that's a complex object which
94 contains pairs of keys and values, you can use the key to retrieve a
95 given value. So we have a glyph name, we then use that as the key to
96 the dictionary and retrieve the associated value. For a type 1 font,
97 the value is a glyph program that describes how to draw the glyph.
98 
99 For CIDFonts, it's a little more complicated. Because CIDFonts can be
100 large, using a glyph name as the key is unreasonable (it would also
101 lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
102 as the key. CIDs are just numbers.
103 
104 But.... We don't use the character code as the CID. What we do is use
105 a CMap to convert the character code into a CID. We then use the CID
106 to key the CharStrings dictionary and proceed as before. So the 'CMap'
107 is the equivalent of the Encoding array, but it's a more compact and
108 flexible representation.
109 
110 Note that you have to use the CMap just to find out how many bytes
111 constitute a character code, and it can be variable. For example you
112 can say if the first byte is 0x00->0x7f then it's just one byte, if it's
113 0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
114 have seen CMaps defining character codes up to 5 bytes wide.
115 
116 Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
117 TrueType CIDFonts. The thing is that TrueType fonts are accessed using
118 a Glyph ID (GID) (and the LOCA table) which may well not be anything
119 like the CID. So for this case PDF includes a CIDToGIDMap. That maps
120 the CIDs to GIDs, and we can then use the GID to get the glyph
121 description from the GLYF table of the font.
122 
123 So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
124 
125 Looking at the PDF file I was supplied with we see that it contains
126 text like :
127 
128 <0x0075> Tj
129 
130 So we start by taking the character code (117) and look it up in the
131 CMap. Well you don't supply a CMap, you just use the Identity-H one
132 which is predefined. So character code 117 maps to CID 117. Then we
133 use the CIDToGIDMap, again you don't supply one, you just use the
134 predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
135 were supplied with only contains 116 glyphs.
136 
137 Now for Latin that's not a huge problem, you can just supply a bigger
138 font. But for more complex languages that *is* going to be more of a
139 problem. Either you need to supply a font which contains glyphs for
140 all the possible CID->GID mappings, or we need to think laterally.
141 
142 Our solution using a TrueType CIDFont is to intervene at the
143 CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
144 font with just one glyph, the .notdef glyph at GID 0. This is what I'm
145 looking into now.
146 
147 It would also be possible to have a 'PostScript' (ie type 1 outlines)
148 CIDFont which contained 1 glyph, and a CMap which mapped all character
149 codes to CID 0. The effect would be the same.
150 
151 It's possible (I haven't checked) that the PostScript CIDFont and
152 associated CMap would be smaller than the TrueType font and associated
153 CIDToGIDMap.
154 
155 --- in a followup ---
156 
157 OK there is a small problem there, if I use GID 0 then Acrobat gets
158 upset about it and complains it cannot extract the font. If I set the
159 CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
160 mad......
161 
162 */
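// Illustration only, not part of the Tesseract sources: the design notes
// above end with a CIDToGIDMap that sends every CID to one glyph ID
// (GID 1, after the Acrobat follow-up). A minimal sketch of that table
// before compression might look like the following. The helper names
// BuildConstantCidToGidMap and GidForCid are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: build an uncompressed CIDToGIDMap that maps every
// 2-byte CID (0..65535) to the same glyph ID. Each entry is a big-endian
// 16-bit value, so the table is 2 * 65536 = 131072 bytes long.
static std::vector<unsigned char> BuildConstantCidToGidMap(uint16_t gid) {
  const std::size_t kNumCids = 1u << 16;
  std::vector<unsigned char> map(2 * kNumCids);
  for (std::size_t cid = 0; cid < kNumCids; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xFF);  // low byte
  }
  return map;
}

// Hypothetical helper: read back the glyph ID stored for one CID.
static uint16_t GidForCid(const std::vector<unsigned char> &map,
                          uint16_t cid) {
  assert(2u * cid + 1 < map.size());
  return static_cast<uint16_t>((map[2 * cid] << 8) | map[2 * cid + 1]);
}
```

// With gid = 1 the byte pattern is 00 01 00 01 ..., which is highly
// repetitive and should therefore compress well with flate.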
163 
164 namespace tesseract {
165 
166 // Use for PDF object fragments. Must be large enough
167 // to hold a colormap with 256 colors in the verbose
168 // PDF representation.
169 static const int kBasicBufSize = 2048;
170 
171 // If the font is 10 pts, nominal character width is 5 pts
172 static const int kCharWidth = 2;
173 
174 // Used for memory allocation. A codepoint must take no more than this
175 // many bytes, when written in the PDF way. e.g. "<0063>" for the
176 // letter 'c'
177 static const int kMaxBytesPerCodepoint = 20;
178 
179 /**********************************************************************
180  * PDF Renderer interface implementation
181  **********************************************************************/
182 TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
183  bool textonly)
184  : TessResultRenderer(outputbase, "pdf"),
185  datadir_(datadir) {
186  obj_ = 0;
187  textonly_ = textonly;
188  offsets_.push_back(0);
189 }
190 
191 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
192  offsets_.push_back(objectsize + offsets_.back());
193  obj_++;
194 }
195 
196 void TessPDFRenderer::AppendPDFObject(const char *data) {
197  AppendPDFObjectDIY(strlen(data));
198  AppendString(data);
199 }
200 
201 // Helper function to prevent us from accidentally writing
202 // scientific notation to an HOCR or PDF file. Besides, three
203 // decimal places are all you really need.
204 static double prec(double x) {
205  double kPrecision = 1000.0;
206  double a = round(x * kPrecision) / kPrecision;
207  if (a == -0)
208  return 0;
209  return a;
210 }
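
// Illustration only: a standalone sketch of the rounding idea in prec().
// RoundToMillis and FormatPdfNumber are hypothetical names, and the "%.3f"
// formatting is an assumption made for this example; the renderer itself
// goes through STRING::add_str_double.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Round to three decimal places and fold negative zero into zero.
static double RoundToMillis(double x) {
  const double kPrecision = 1000.0;
  const double a = std::round(x * kPrecision) / kPrecision;
  return (a == 0.0) ? 0.0 : a;  // -0.0 == 0.0 is true, so -0.0 becomes 0.0
}

// Format a number in fixed notation, never scientific notation.
static std::string FormatPdfNumber(double x) {
  assert(std::isfinite(x));
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.3f", RoundToMillis(x));
  return std::string(buf);
}
```

// Tiny magnitudes such as 1e-7 or -1e-7 collapse to "0.000" instead of
// appearing as "1e-07" or "-0.000" in the content stream.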
211 
212 static long dist2(int x1, int y1, int x2, int y2) {
213  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
214 }
215 
216 // Viewers like evince can get really confused during copy-paste when
217 // the baseline wanders around. So I've decided to project every word
218 // onto the (straight) line baseline. All numbers are in the native
219 // PDF coordinate system, which has the origin in the bottom left and
220 // the unit is points, which is 1/72 inch. Tesseract reports baselines
221 // left-to-right no matter what the reading order is. We need the
222 // word baseline in reading order, so we do that conversion here. Returns
223 // the word's baseline origin and length.
224 static void GetWordBaseline(int writing_direction, int ppi, int height,
225  int word_x1, int word_y1, int word_x2, int word_y2,
226  int line_x1, int line_y1, int line_x2, int line_y2,
227  double *x0, double *y0, double *length) {
228  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
229  Swap(&word_x1, &word_x2);
230  Swap(&word_y1, &word_y2);
231  }
232  double word_length;
233  double x, y;
234  {
235  int px = word_x1;
236  int py = word_y1;
237  double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
238  if (l2 == 0) {
239  x = line_x1;
240  y = line_y1;
241  } else {
242  double t = ((px - line_x2) * (line_x2 - line_x1) +
243  (py - line_y2) * (line_y2 - line_y1)) / l2;
244  x = line_x2 + t * (line_x2 - line_x1);
245  y = line_y2 + t * (line_y2 - line_y1);
246  }
247  word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
248  word_x2, word_y2)));
249  word_length = word_length * 72.0 / ppi;
250  x = x * 72 / ppi;
251  y = height - (y * 72.0 / ppi);
252  }
253  *x0 = x;
254  *y0 = y;
255  *length = word_length;
256 }
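
// Illustration only: the two steps inside GetWordBaseline, split into
// hypothetical helpers. ProjectOntoLine drops a point onto the line
// baseline, and PixelsToPdfPoints converts from image pixels (origin top
// left) to PDF points (origin bottom left). The names and the PointD type
// are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cmath>

struct PointD {
  double x;
  double y;
};

// Orthogonal projection of (px, py) onto the line through (x1, y1) and
// (x2, y2). A degenerate line (both ends equal) returns its single point.
static PointD ProjectOntoLine(int px, int py, int x1, int y1, int x2,
                              int y2) {
  const double dx = x2 - x1;
  const double dy = y2 - y1;
  const double l2 = dx * dx + dy * dy;
  if (l2 == 0.0) return {static_cast<double>(x1), static_cast<double>(y1)};
  const double t = ((px - x1) * dx + (py - y1) * dy) / l2;
  return {x1 + t * dx, y1 + t * dy};
}

// Scale pixels to points (72 per inch) and flip the y axis.
static PointD PixelsToPdfPoints(PointD p, int ppi, double page_height_pts) {
  assert(ppi > 0);
  return {p.x * 72.0 / ppi, page_height_pts - p.y * 72.0 / ppi};
}
```

// For a 300 ppi page that is 792 pt tall, a word starting at pixel
// (50, 30) on a flat baseline at y = 100 lands at (12, 768) in PDF space.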
257 
258 // Compute coefficients for an affine matrix describing the rotation
259 // of the text. If the text is right-to-left such as Arabic or Hebrew,
260 // we reflect over the Y-axis. This matrix will set the coordinate
261 // system for placing text in the PDF file.
262 //
263 // RTL
264 // [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
265 // [ y' ] [ c d ][ y ] [ 0 1 ] [-sin cos ][ y ]
266 static void AffineMatrix(int writing_direction,
267  int line_x1, int line_y1, int line_x2, int line_y2,
268  double *a, double *b, double *c, double *d) {
269  double theta = atan2(static_cast<double>(line_y1 - line_y2),
270  static_cast<double>(line_x2 - line_x1));
271  *a = cos(theta);
272  *b = sin(theta);
273  *c = -sin(theta);
274  *d = cos(theta);
275 switch(writing_direction) {
276 case WRITING_DIRECTION_RIGHT_TO_LEFT:
277 *a = -*a;
278 *b = -*b;
279 break;
280 case WRITING_DIRECTION_TOP_TO_BOTTOM:
281 // TODO(jbreiden) Consider using the vertical PDF writing mode.
282 break;
283  default:
284  break;
285  }
286 }
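
// Illustration only: a self-contained version of the matrix computation,
// using a hypothetical TextMatrix struct and a bool flag in place of the
// WritingDirection enum. It follows the same formula as AffineMatrix.

```cpp
#include <cassert>
#include <cmath>

struct TextMatrix {
  double a, b, c, d;
};

// Derive the rotation part of a PDF text matrix from a baseline given in
// image pixels. Image y grows downward while PDF y grows upward, which is
// why the angle uses y1 - y2.
static TextMatrix BaselineToMatrix(int x1, int y1, int x2, int y2,
                                   bool right_to_left) {
  assert(x1 != x2 || y1 != y2);  // a baseline needs two distinct points
  const double theta = std::atan2(static_cast<double>(y1 - y2),
                                  static_cast<double>(x2 - x1));
  TextMatrix m = {std::cos(theta), std::sin(theta), -std::sin(theta),
                  std::cos(theta)};
  if (right_to_left) {  // reflect over the Y-axis
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

// A flat left-to-right baseline gives the identity rotation (1 0 0 1);
// the same baseline read right-to-left gives (-1 0 0 1).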
287 
288 // There are some really awkward PDF viewers in the wild, such as
289 // 'Preview' which ships with the Mac. They do a better job with text
290 // selection and highlighting when given perfectly flat baseline
291 // instead of very slightly tilted. We clip small tilts to appease
292 // these viewers. I chose this threshold large enough to absorb noise,
293 // but small enough that lines probably won't cross each other if the
294 // whole page is tilted at almost exactly the clipping threshold.
295 static void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
296  int *line_x1, int *line_y1,
297  int *line_x2, int *line_y2) {
298  *line_x1 = x1;
299  *line_y1 = y1;
300  *line_x2 = x2;
301  *line_y2 = y2;
302  int rise = abs(y2 - y1) * 72;
303  int run = abs(x2 - x1) * 72;
304  if (rise < 2 * ppi && 2 * ppi < run)
305  *line_y1 = *line_y2 = (y1 + y2) / 2;
306 }
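
// Illustration only: the clipping test restated with a hypothetical
// Baseline struct. Comparing rise * 72 against 2 * ppi is the same as
// asking whether the rise is under 2 points, without dividing.

```cpp
#include <cassert>
#include <cstdlib>

struct Baseline {
  int x1, y1, x2, y2;
};

// Flatten a baseline whose vertical drift is under 2 pt while its
// horizontal extent is over 2 pt; otherwise return it unchanged.
static Baseline FlattenSmallTilt(int ppi, Baseline in) {
  assert(ppi > 0);
  const int rise = std::abs(in.y2 - in.y1) * 72;
  const int run = std::abs(in.x2 - in.x1) * 72;
  Baseline out = in;
  if (rise < 2 * ppi && 2 * ppi < run) {
    out.y1 = out.y2 = (in.y1 + in.y2) / 2;  // snap both ends to the midpoint
  }
  return out;
}
```

// At 300 ppi, a 1000 px line that drifts 5 px is flattened to y = 102,
// while one that drifts 20 px (4.8 pt) keeps its tilt.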
307 
308 static bool CodepointToUtf16be(int code, char utf16[kMaxBytesPerCodepoint]) {
309  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
310  tprintf("Dropping invalid codepoint %d\n", code);
311  return false;
312  }
313  if (code < 0x10000) {
314  snprintf(utf16, kMaxBytesPerCodepoint, "%04X", code);
315  } else {
316  int a = code - 0x010000;
317  int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
318  int low_surrogate = (0x03FF & a) + 0xDC00;
319  snprintf(utf16, kMaxBytesPerCodepoint,
320  "%04X%04X", high_surrogate, low_surrogate);
321  }
322  return true;
323 }
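
// Illustration only: the same UTF-16BE hex encoding written against
// std::string, to show the surrogate pair arithmetic on a concrete
// code point. AppendUtf16BeHex is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Append the UTF-16BE hex digits for one code point. Returns false and
// leaves *out untouched for surrogates and values above U+10FFFF.
static bool AppendUtf16BeHex(uint32_t code, std::string *out) {
  assert(out != nullptr);
  if ((code >= 0xD800 && code <= 0xDFFF) || code > 0x10FFFF) return false;
  char buf[16];
  if (code < 0x10000) {
    std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(code));
  } else {
    const uint32_t a = code - 0x10000;  // 20 bits, split 10 and 10
    const unsigned high = 0xD800 + ((a >> 10) & 0x3FF);
    const unsigned low = 0xDC00 + (a & 0x3FF);
    std::snprintf(buf, sizeof(buf), "%04X%04X", high, low);
  }
  *out += buf;
  return true;
}
```

// 'c' (U+0063) becomes "0063", and U+1F600 becomes the surrogate pair
// "D83DDE00".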
324 
325 char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
326  double width, double height) {
327  STRING pdf_str("");
328  double ppi = api->GetSourceYResolution();
329 
330  // These initial conditions are all arbitrary and will be overwritten
331  double old_x = 0.0, old_y = 0.0;
332  int old_fontsize = 0;
333 tesseract::WritingDirection old_writing_direction =
334 WRITING_DIRECTION_LEFT_TO_RIGHT;
335  bool new_block = true;
336  int fontsize = 0;
337  double a = 1;
338  double b = 0;
339  double c = 0;
340  double d = 1;
341 
342  // TODO(jbreiden) This marries the text and image together.
343  // Slightly cleaner from an abstraction standpoint if this were to
344  // live inside a separate text object.
345  pdf_str += "q ";
346  pdf_str.add_str_double("", prec(width));
347  pdf_str += " 0 0 ";
348  pdf_str.add_str_double("", prec(height));
349  pdf_str += " 0 0 cm";
350  if (!textonly_) {
351  pdf_str += " /Im1 Do";
352  }
353  pdf_str += " Q\n";
354 
355  int line_x1 = 0;
356  int line_y1 = 0;
357  int line_x2 = 0;
358  int line_y2 = 0;
359 
360  ResultIterator *res_it = api->GetIterator();
361  while (!res_it->Empty(RIL_BLOCK)) {
362  if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
363  pdf_str += "BT\n3 Tr"; // Begin text object, use invisible ink
364  old_fontsize = 0; // Every block will declare its fontsize
365  new_block = true; // Every block will declare its affine matrix
366  }
367 
368  if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
369  int x1, y1, x2, y2;
370  res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
371  ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
372  }
373 
374  if (res_it->Empty(RIL_WORD)) {
375  res_it->Next(RIL_WORD);
376  continue;
377  }
378 
379  // Writing direction changes at a per-word granularity
380  tesseract::WritingDirection writing_direction;
381  {
382  tesseract::Orientation orientation;
383  tesseract::TextlineOrder textline_order;
384  float deskew_angle;
385  res_it->Orientation(&orientation, &writing_direction,
386  &textline_order, &deskew_angle);
387  if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
388  switch (res_it->WordDirection()) {
389  case DIR_LEFT_TO_RIGHT:
390  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
391  break;
392  case DIR_RIGHT_TO_LEFT:
393  writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
394  break;
395  default:
396  writing_direction = old_writing_direction;
397  }
398  }
399  }
400 
401  // Where is word origin and how long is it?
402  double x, y, word_length;
403  {
404  int word_x1, word_y1, word_x2, word_y2;
405  res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
406  GetWordBaseline(writing_direction, ppi, height,
407  word_x1, word_y1, word_x2, word_y2,
408  line_x1, line_y1, line_x2, line_y2,
409  &x, &y, &word_length);
410  }
411 
412  if (writing_direction != old_writing_direction || new_block) {
413  AffineMatrix(writing_direction,
414  line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
415  pdf_str.add_str_double(" ", prec(a)); // . This affine matrix
416  pdf_str.add_str_double(" ", prec(b)); // . sets the coordinate
417  pdf_str.add_str_double(" ", prec(c)); // . system for all
418  pdf_str.add_str_double(" ", prec(d)); // . text that follows.
419  pdf_str.add_str_double(" ", prec(x)); // .
420  pdf_str.add_str_double(" ", prec(y)); // .
421  pdf_str += (" Tm "); // Place cursor absolutely
422  new_block = false;
423  } else {
424  double dx = x - old_x;
425  double dy = y - old_y;
426  pdf_str.add_str_double(" ", prec(dx * a + dy * b));
427  pdf_str.add_str_double(" ", prec(dx * c + dy * d));
428  pdf_str += (" Td "); // Relative moveto
429  }
430  old_x = x;
431  old_y = y;
432  old_writing_direction = writing_direction;
433 
434  // Adjust font size on a per word granularity. Pay attention to
435 // fontsize, old_fontsize, and pdf_str. We've found that for
436 // Arabic, Tesseract will happily return a fontsize of zero,
437  // so we make up a default number to protect ourselves.
438  {
439  bool bold, italic, underlined, monospace, serif, smallcaps;
440  int font_id;
441  res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
442  &serif, &smallcaps, &fontsize, &font_id);
443  const int kDefaultFontsize = 8;
444  if (fontsize <= 0)
445  fontsize = kDefaultFontsize;
446  if (fontsize != old_fontsize) {
447  char textfont[20];
448  snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
449  pdf_str += textfont;
450  old_fontsize = fontsize;
451  }
452  }
453 
454  bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
455  bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
456  STRING pdf_word("");
457  int pdf_word_len = 0;
458  do {
459  const std::unique_ptr<const char[]> grapheme(
460  res_it->GetUTF8Text(RIL_SYMBOL));
461  if (grapheme && grapheme[0] != '\0') {
462  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(grapheme.get());
463  char utf16[kMaxBytesPerCodepoint];
464  for (char32 code : unicodes) {
465  if (CodepointToUtf16be(code, utf16)) {
466  pdf_word += utf16;
467  pdf_word_len++;
468  }
469  }
470  }
471  res_it->Next(RIL_SYMBOL);
472  } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
473  if (word_length > 0 && pdf_word_len > 0) {
474  double h_stretch =
475  kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
476  pdf_str.add_str_double("", h_stretch);
477  pdf_str += " Tz"; // horizontal stretch
478  pdf_str += " [ <";
479  pdf_str += pdf_word; // UTF-16BE representation
480  pdf_str += "> ] TJ"; // show the text
481  }
482  if (last_word_in_line) {
483  pdf_str += " \n";
484  }
485  if (last_word_in_block) {
486  pdf_str += "ET\n"; // end the text object
487  }
488  }
489  char *ret = new char[pdf_str.length() + 1];
490  strcpy(ret, pdf_str.string());
491  delete res_it;
492  return ret;
493 }
494 
495 bool TessPDFRenderer::BeginDocumentHandler() {
496 char buf[kBasicBufSize];
497  size_t n;
498 
499  n = snprintf(buf, sizeof(buf),
500  "%%PDF-1.5\n"
501  "%%%c%c%c%c\n",
502  0xDE, 0xAD, 0xBE, 0xEB);
503  if (n >= sizeof(buf)) return false;
504  AppendPDFObject(buf);
505 
506  // CATALOG
507  n = snprintf(buf, sizeof(buf),
508  "1 0 obj\n"
509  "<<\n"
510  " /Type /Catalog\n"
511  " /Pages %ld 0 R\n"
512  ">>\n"
513  "endobj\n",
514  2L);
515  if (n >= sizeof(buf)) return false;
516  AppendPDFObject(buf);
517 
518  // We are reserving object #2 for the /Pages
519  // object, which I am going to create and write
520  // at the end of the PDF file.
521  AppendPDFObject("");
522 
523  // TYPE0 FONT
524  n = snprintf(buf, sizeof(buf),
525  "3 0 obj\n"
526  "<<\n"
527  " /BaseFont /GlyphLessFont\n"
528  " /DescendantFonts [ %ld 0 R ]\n"
529  " /Encoding /Identity-H\n"
530  " /Subtype /Type0\n"
531  " /ToUnicode %ld 0 R\n"
532  " /Type /Font\n"
533  ">>\n"
534  "endobj\n",
535  4L, // CIDFontType2 font
536  6L // ToUnicode
537  );
538  if (n >= sizeof(buf)) return false;
539  AppendPDFObject(buf);
540 
541  // CIDFONTTYPE2
542  n = snprintf(buf, sizeof(buf),
543  "4 0 obj\n"
544  "<<\n"
545  " /BaseFont /GlyphLessFont\n"
546  " /CIDToGIDMap %ld 0 R\n"
547  " /CIDSystemInfo\n"
548  " <<\n"
549  " /Ordering (Identity)\n"
550  " /Registry (Adobe)\n"
551  " /Supplement 0\n"
552  " >>\n"
553  " /FontDescriptor %ld 0 R\n"
554  " /Subtype /CIDFontType2\n"
555  " /Type /Font\n"
556  " /DW %d\n"
557  ">>\n"
558  "endobj\n",
559  5L, // CIDToGIDMap
560  7L, // Font descriptor
561  1000 / kCharWidth);
562  if (n >= sizeof(buf)) return false;
563  AppendPDFObject(buf);
564 
565  // CIDTOGIDMAP
566  const int kCIDToGIDMapSize = 2 * (1 << 16);
567  const std::unique_ptr<unsigned char[]> cidtogidmap(
568  new unsigned char[kCIDToGIDMapSize]);
569  for (int i = 0; i < kCIDToGIDMapSize; i++) {
570  cidtogidmap[i] = (i % 2) ? 1 : 0;
571  }
572  size_t len;
573  unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
574  n = snprintf(buf, sizeof(buf),
575  "5 0 obj\n"
576  "<<\n"
577  " /Length %lu /Filter /FlateDecode\n"
578  ">>\n"
579  "stream\n",
580  (unsigned long)len);
581  if (n >= sizeof(buf)) {
582  lept_free(comp);
583  return false;
584  }
585  AppendString(buf);
586  long objsize = strlen(buf);
587  AppendData(reinterpret_cast<char *>(comp), len);
588  objsize += len;
589  lept_free(comp);
590  const char *endstream_endobj =
591  "endstream\n"
592  "endobj\n";
593  AppendString(endstream_endobj);
594  objsize += strlen(endstream_endobj);
595  AppendPDFObjectDIY(objsize);
596 
597  const char *stream =
598  "/CIDInit /ProcSet findresource begin\n"
599  "12 dict begin\n"
600  "begincmap\n"
601  "/CIDSystemInfo\n"
602  "<<\n"
603  " /Registry (Adobe)\n"
604  " /Ordering (UCS)\n"
605  " /Supplement 0\n"
606  ">> def\n"
607  "/CMapName /Adobe-Identify-UCS def\n"
608  "/CMapType 2 def\n"
609  "1 begincodespacerange\n"
610  "<0000> <FFFF>\n"
611  "endcodespacerange\n"
612  "1 beginbfrange\n"
613  "<0000> <FFFF> <0000>\n"
614  "endbfrange\n"
615  "endcmap\n"
616  "CMapName currentdict /CMap defineresource pop\n"
617  "end\n"
618  "end\n";
619 
620  // TOUNICODE
621  n = snprintf(buf, sizeof(buf),
622  "6 0 obj\n"
623  "<< /Length %lu >>\n"
624  "stream\n"
625  "%s"
626  "endstream\n"
627  "endobj\n", (unsigned long) strlen(stream), stream);
628  if (n >= sizeof(buf)) return false;
629  AppendPDFObject(buf);
630 
631  // FONT DESCRIPTOR
632  n = snprintf(buf, sizeof(buf),
633  "7 0 obj\n"
634  "<<\n"
635  " /Ascent %d\n"
636  " /CapHeight %d\n"
637  " /Descent -1\n" // Spec says must be negative
638  " /Flags 5\n" // FixedPitch + Symbolic
639  " /FontBBox [ 0 0 %d %d ]\n"
640  " /FontFile2 %ld 0 R\n"
641  " /FontName /GlyphLessFont\n"
642  " /ItalicAngle 0\n"
643  " /StemV 80\n"
644  " /Type /FontDescriptor\n"
645  ">>\n"
646  "endobj\n",
647  1000,
648  1000,
649  1000 / kCharWidth,
650  1000,
651  8L // Font data
652  );
653  if (n >= sizeof(buf)) return false;
654  AppendPDFObject(buf);
655 
656  n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_.c_str());
657  if (n >= sizeof(buf)) return false;
658  FILE *fp = fopen(buf, "rb");
659  if (!fp) {
660  tprintf("Can not open file \"%s\"!\n", buf);
661  return false;
662  }
663  fseek(fp, 0, SEEK_END);
664  long int size = ftell(fp);
665  if (size < 0) {
666  fclose(fp);
667  return false;
668  }
669  fseek(fp, 0, SEEK_SET);
670  const std::unique_ptr<char[]> buffer(new char[size]);
671  if (!tesseract::DeSerialize(fp, buffer.get(), size)) {
672  fclose(fp);
673  return false;
674  }
675  fclose(fp);
676  // FONTFILE2
677  n = snprintf(buf, sizeof(buf),
678  "8 0 obj\n"
679  "<<\n"
680  " /Length %ld\n"
681  " /Length1 %ld\n"
682  ">>\n"
683  "stream\n", size, size);
684  if (n >= sizeof(buf)) {
685  return false;
686  }
687  AppendString(buf);
688  objsize = strlen(buf);
689  AppendData(buffer.get(), size);
690  objsize += size;
691  AppendString(endstream_endobj);
692  objsize += strlen(endstream_endobj);
693  AppendPDFObjectDIY(objsize);
694  return true;
695 }
696 
697 bool TessPDFRenderer::imageToPDFObj(Pix *pix,
698  const char* filename,
699  long int objnum,
700  char **pdf_object,
701  long int* pdf_object_size,
702  const int jpg_quality) {
703  size_t n;
704  char b0[kBasicBufSize];
705  char b1[kBasicBufSize];
706  char b2[kBasicBufSize];
707  if (!pdf_object_size || !pdf_object)
708  return false;
709  *pdf_object = nullptr;
710  *pdf_object_size = 0;
711  if (!filename && !pix)
712  return false;
713 
714  L_Compressed_Data *cid = nullptr;
715 
716  int sad = 0;
717  if (pixGetInputFormat(pix) == IFF_PNG)
718  sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);
719  if (!cid) {
720  sad = l_generateCIDataForPdf(filename, pix, jpg_quality, &cid);
721  }
722 
723  if (sad || !cid) {
724  l_CIDataDestroy(&cid);
725  return false;
726  }
727 
728  const char *group4 = "";
729  const char *filter;
730  switch(cid->type) {
731  case L_FLATE_ENCODE:
732  filter = "/FlateDecode";
733  break;
734  case L_JPEG_ENCODE:
735  filter = "/DCTDecode";
736  break;
737  case L_G4_ENCODE:
738  filter = "/CCITTFaxDecode";
739  group4 = " /K -1\n";
740  break;
741  case L_JP2K_ENCODE:
742  filter = "/JPXDecode";
743  break;
744  default:
745  l_CIDataDestroy(&cid);
746  return false;
747  }
748 
749  // Maybe someday we will accept RGBA but today is not that day.
750  // It requires creating an /SMask for the alpha channel.
751  // http://stackoverflow.com/questions/14220221
752  const char *colorspace;
753  if (cid->ncolors > 0) {
754  n = snprintf(b0, sizeof(b0),
755  " /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
756  cid->ncolors - 1, cid->cmapdatahex);
757  if (n >= sizeof(b0)) {
758  l_CIDataDestroy(&cid);
759  return false;
760  }
761  colorspace = b0;
762  } else {
763  switch (cid->spp) {
764  case 1:
765  colorspace = " /ColorSpace /DeviceGray\n";
766  break;
767  case 3:
768  colorspace = " /ColorSpace /DeviceRGB\n";
769  break;
770  default:
771  l_CIDataDestroy(&cid);
772  return false;
773  }
774  }
775 
776  int predictor = (cid->predictor) ? 14 : 1;
777 
778  // IMAGE
779 n = snprintf(b1, sizeof(b1),
780 "%ld 0 obj\n"
781 "<<\n"
782 " /Length %lu\n"
783 " /Subtype /Image\n",
784 objnum, (unsigned long) cid->nbytescomp);
785  if (n >= sizeof(b1)) {
786  l_CIDataDestroy(&cid);
787  return false;
788  }
789 
790  n = snprintf(b2, sizeof(b2),
791  " /Width %d\n"
792  " /Height %d\n"
793  " /BitsPerComponent %d\n"
794  " /Filter %s\n"
795  " /DecodeParms\n"
796  " <<\n"
797  " /Predictor %d\n"
798  " /Colors %d\n"
799  "%s"
800  " /Columns %d\n"
801  " /BitsPerComponent %d\n"
802  " >>\n"
803  ">>\n"
804  "stream\n",
805  cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
806  group4, cid->w, cid->bps);
807  if (n >= sizeof(b2)) {
808  l_CIDataDestroy(&cid);
809  return false;
810  }
811 
812  const char *b3 =
813  "endstream\n"
814  "endobj\n";
815 
816  size_t b1_len = strlen(b1);
817  size_t b2_len = strlen(b2);
818  size_t b3_len = strlen(b3);
819  size_t colorspace_len = strlen(colorspace);
820 
821  *pdf_object_size =
822  b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
823  *pdf_object = new char[*pdf_object_size];
824 
825  char *p = *pdf_object;
826  memcpy(p, b1, b1_len);
827  p += b1_len;
828  memcpy(p, colorspace, colorspace_len);
829  p += colorspace_len;
830  memcpy(p, b2, b2_len);
831  p += b2_len;
832  memcpy(p, cid->datacomp, cid->nbytescomp);
833  p += cid->nbytescomp;
834  memcpy(p, b3, b3_len);
835  l_CIDataDestroy(&cid);
836  return true;
837 }
838 
839 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
840 size_t n;
841  char buf[kBasicBufSize];
842  char buf2[kBasicBufSize];
843  Pix *pix = api->GetInputImage();
844  const char* filename = api->GetInputName();
845  int ppi = api->GetSourceYResolution();
846  if (!pix || ppi <= 0)
847  return false;
848  double width = pixGetWidth(pix) * 72.0 / ppi;
849  double height = pixGetHeight(pix) * 72.0 / ppi;
850 
851  snprintf(buf2, sizeof(buf2), "/XObject << /Im1 %ld 0 R >>\n", obj_ + 2);
852  const char *xobject = (textonly_) ? "" : buf2;
853 
854  // PAGE
855  n = snprintf(buf, sizeof(buf),
856  "%ld 0 obj\n"
857  "<<\n"
858  " /Type /Page\n"
859  " /Parent %ld 0 R\n"
860  " /MediaBox [0 0 %.2f %.2f]\n"
861  " /Contents %ld 0 R\n"
862  " /Resources\n"
863  " <<\n"
864  " %s"
865  " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
866  " /Font << /f-0-0 %ld 0 R >>\n"
867  " >>\n"
868  ">>\n"
869  "endobj\n",
870  obj_,
871  2L, // Pages object
872  width, height,
873  obj_ + 1, // Contents object
874  xobject, // Image object
875  3L); // Type0 Font
876  if (n >= sizeof(buf)) return false;
877  pages_.push_back(obj_);
878  AppendPDFObject(buf);
879 
880  // CONTENTS
881  const std::unique_ptr<char[]> pdftext(GetPDFTextObjects(api, width, height));
882  const size_t pdftext_len = strlen(pdftext.get());
883  size_t len;
884  unsigned char *comp_pdftext = zlibCompress(
885  reinterpret_cast<unsigned char *>(pdftext.get()), pdftext_len, &len);
886  long comp_pdftext_len = len;
887  n = snprintf(buf, sizeof(buf),
888  "%ld 0 obj\n"
889  "<<\n"
890  " /Length %ld /Filter /FlateDecode\n"
891  ">>\n"
892  "stream\n", obj_, comp_pdftext_len);
893  if (n >= sizeof(buf)) {
894  lept_free(comp_pdftext);
895  return false;
896  }
897  AppendString(buf);
898  long objsize = strlen(buf);
899  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
900  objsize += comp_pdftext_len;
901  lept_free(comp_pdftext);
902  const char *b2 =
903  "endstream\n"
904  "endobj\n";
905  AppendString(b2);
906  objsize += strlen(b2);
907  AppendPDFObjectDIY(objsize);
908 
909  if (!textonly_) {
910  char *pdf_object = nullptr;
911  int jpg_quality;
912  api->GetIntVariable("jpg_quality", &jpg_quality);
913  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize,
914  jpg_quality)) {
915  return false;
916  }
917  AppendData(pdf_object, objsize);
918  AppendPDFObjectDIY(objsize);
919  delete[] pdf_object;
920  }
921  return true;
922 }
923 
924 
925 bool TessPDFRenderer::EndDocumentHandler() {
926 size_t n;
927  char buf[kBasicBufSize];
928 
929  // We reserved the /Pages object number early, so that the /Page
930  // objects could refer to their parent. We finally have enough
931  // information to go fill it in. Using lower level calls to manipulate
932  // the offset record in two spots, because we are placing objects
933  // out of order in the file.
934 
935  // PAGES
936  const long int kPagesObjectNumber = 2;
937  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
938  n = snprintf(buf, sizeof(buf),
939  "%ld 0 obj\n"
940  "<<\n"
941  " /Type /Pages\n"
942  " /Kids [ ", kPagesObjectNumber);
943  if (n >= sizeof(buf)) return false;
944  AppendString(buf);
945  size_t pages_objsize = strlen(buf);
946  for (size_t i = 0; i < pages_.unsigned_size(); i++) {
947  n = snprintf(buf, sizeof(buf),
948  "%ld 0 R ", pages_[i]);
949  if (n >= sizeof(buf)) return false;
950  AppendString(buf);
951  pages_objsize += strlen(buf);
952  }
953  n = snprintf(buf, sizeof(buf),
954  "]\n"
955  " /Count %d\n"
956  ">>\n"
957  "endobj\n", pages_.size());
958  if (n >= sizeof(buf)) return false;
959  AppendString(buf);
960  pages_objsize += strlen(buf);
961  offsets_.back() += pages_objsize; // manipulation #2
962 
963  // INFO
964  STRING utf16_title = "FEFF"; // byte_order_marker
965  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(title());
966  char utf16[kMaxBytesPerCodepoint];
967  for (char32 code : unicodes) {
968  if (CodepointToUtf16be(code, utf16)) {
969  utf16_title += utf16;
970  }
971  }
972 
973  char* datestr = l_getFormattedDate();
974  n = snprintf(buf, sizeof(buf),
975  "%ld 0 obj\n"
976  "<<\n"
977  " /Producer (Tesseract %s)\n"
978  " /CreationDate (D:%s)\n"
979  " /Title <%s>\n"
980  ">>\n"
981  "endobj\n",
982 obj_, TessBaseAPI::Version(),
983 datestr, utf16_title.c_str());
984  lept_free(datestr);
985  if (n >= sizeof(buf)) return false;
986  AppendPDFObject(buf);
987  n = snprintf(buf, sizeof(buf),
988  "xref\n"
989  "0 %ld\n"
990  "0000000000 65535 f \n", obj_);
991  if (n >= sizeof(buf)) return false;
992  AppendString(buf);
993  for (int i = 1; i < obj_; i++) {
994  n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
995  if (n >= sizeof(buf)) return false;
996  AppendString(buf);
997  }
998  n = snprintf(buf, sizeof(buf),
999  "trailer\n"
1000  "<<\n"
1001  " /Size %ld\n"
1002  " /Root %ld 0 R\n"
1003  " /Info %ld 0 R\n"
1004  ">>\n"
1005  "startxref\n"
1006  "%ld\n"
1007  "%%%%EOF\n",
1008  obj_,
1009  1L, // catalog
1010  obj_ - 1, // info
1011  offsets_.back());
1012  if (n >= sizeof(buf)) return false;
1013  AppendString(buf);
1014  return true;
1015 }
1016 } // namespace tesseract
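
// Illustration only: the offset bookkeeping behind the xref table, as a
// sketch outside the renderer class. ObjectOffsets and XrefEntry are
// hypothetical names. The running sum mirrors what AppendPDFObjectDIY
// does with offsets_, and the entry format matches the one used in
// EndDocumentHandler.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// chunks[0] is the file header and chunks[i] is serialized object i.
// The result holds one more entry than chunks: offsets[i] is the byte
// offset where object i starts, and the last entry is the total size.
static std::vector<long> ObjectOffsets(
    const std::vector<std::string> &chunks) {
  std::vector<long> offsets(1, 0L);
  for (const std::string &chunk : chunks) {
    offsets.push_back(offsets.back() + static_cast<long>(chunk.size()));
  }
  return offsets;
}

// One in-use xref entry: 10-digit offset, 5-digit generation, 'n', and a
// two-byte end of line (space plus newline), 20 bytes in total.
static std::string XrefEntry(long offset) {
  assert(offset >= 0);
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offset);
  return std::string(buf);
}
```

// Because every entry has the same fixed width, a reader can seek
// straight to the entry for object i once it knows where the table starts.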