Fonts for Tesseract training
Tesseract can be trained on images rendered from text with a list of fonts. Those fonts must be installed on the host where the training process runs.
The fonts that were used to train 3.05’s OCR engine and the legacy OCR engine in 4.0.0 are defined in training/language-specific.sh.
Many more fonts are listed in langdata/font_properties. If you add fonts to training/language-specific.sh (or specify them explicitly via a command line parameter), you must add them to langdata/font_properties as well.
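The rule above can be checked mechanically. The following is a minimal sketch, not part of the Tesseract tooling: the function name missing_font_properties is hypothetical, and it assumes that each font_properties line starts with the font name written with underscores instead of spaces and without commas, followed by the property flags.

```shell
#!/bin/sh
# Print every font from stdin (one name per line) that has no entry in the
# given font_properties file.
# Assumption: a font_properties entry begins with the font name, with spaces
# replaced by underscores and commas removed, e.g. "Arial_Bold 0 1 0 0 0".
missing_font_properties() {
    props_file=$1
    while IFS= read -r font; do
        # Skip blank lines in the font list.
        [ -n "$font" ] || continue
        key=$(printf '%s' "$font" | tr ' ' '_' | tr -d ',')
        # The name must match the complete first field of some line.
        if ! awk -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$props_file"; then
            printf '%s\n' "$font"
        fi
    done
}

# Usage (the font names are examples only):
# printf 'Arial Bold\nFreeSerif\n' | missing_font_properties ./langdata/font_properties
```

Any name that is printed still needs a line in langdata/font_properties.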
The fonts that were used to train the LSTM OCR engine in 4.0.0 are defined in the <lang>/okfonts.txt files in the langdata_lstm repo.
Find Fonts
To find fonts already installed on your system which can render a given training text, you can use the following command (change the language code and directory locations to match your setup). The resulting fontslist.txt contains text that can be used in training/language-specific.sh.
text2image --find_fonts \
--fonts_dir /usr/share/fonts \
--text ./langdata/eng/eng.training_text \
--min_coverage .9 \
--outputbase ./langdata/eng/eng \
|& grep raw \
| sed -e 's/ :.*/@ \\/g' \
| sed -e "s/^/ '/" \
| sed -e "s/@/'/g" >./langdata/eng/fontslist.txt
The above cannot single out Fraktur fonts: it also identifies all Latin fonts. Review the generated images and choose appropriate fonts.
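To make the sed stages of the command above easier to follow, here is a small sketch that applies the same three substitutions to one sample line. The function name format_font_lines is hypothetical, and the sample input only approximates a text2image report line; its exact format is an assumption.

```shell
#!/bin/sh
# Turn font report lines into quoted, backslash-continued list entries,
# using the same three substitutions as the pipeline above.
format_font_lines() {
    # 1. Replace everything from " :" onwards with "@ \".
    # 2. Prefix the line with a space and an opening quote.
    # 3. Turn the "@" marker into the closing quote.
    sed -e 's/ :.*/@ \\/g' -e "s/^/ '/" -e "s/@/'/g"
}

# Illustrative input; the real report format may differ (assumption).
printf '%s\n' 'Arial Bold : 100 hits = 100.00%, raw = 100 = 100.00%' \
    | format_font_lines
# prints:  'Arial Bold' \
```

Each output line is a quoted font name followed by a line continuation, which is the form a shell list of font names takes.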
Font installation
Debian
On Debian GNU/Linux and similar distributions (Linux Mint, Ubuntu, …), the required fonts can be installed as follows:
# AMHARIC_FONTS (todo)
# ANCIENT_GREEK_FONTS (todo)
# ARABIC_FONTS (todo)
# ARMENIAN_FONTS (todo)
# BENGALI_FONTS (todo)
# BURMESE_FONTS (todo)
# CHI_SIM_FONTS (todo)
# CHI_TRA_FONTS (todo)
# DEVANAGARI_FONTS (see also external links below)
apt-get install fonts-deva
# EARLY_LATIN_FONTS (todo)
# FRAKTUR_FONTS (todo)
# GEORGIAN_FONTS (todo)
# GREEK_FONTS (todo)
# GUJARATI_FONTS (todo)
# HEBREW_FONTS (todo)
# JPN_FONTS (todo)
apt-get install fonts-noto-cjk fonts-japanese-mincho.ttf fonts-takao-gothic fonts-vlgothic
# KANNADA_FONTS (todo)
# KHMER_FONTS (todo)
# KOREAN_FONTS (todo)
# KURDISH_FONTS (todo)
# KYRGYZ_FONTS (todo)
# LAOTHIAN_FONTS (todo)
# LATIN_FONTS
apt-get install fonts-dejavu gsfonts ttf-mscorefonts-installer
# MALAYALAM_FONTS (todo)
# NEOLATIN_FONTS (still incomplete)
apt-get install fonts-ebgaramond fonts-gfs-didot fonts-gfs-didot-classic fonts-junicode
# NORTH_AMERICAN_ABORIGINAL_FONTS (todo)
# OLD_GEORGIAN_FONTS (todo)
# ORIYA_FONTS (todo)
# PERSIAN_FONTS (todo)
# PUNJABI_FONTS (todo)
# RUSSIAN_FONTS (todo)
# SINHALA_FONTS (todo)
# SYRIAC_FONTS (todo)
# TAMIL_FONTS (todo)
# TELUGU_FONTS (todo)
# THAANA_FONTS (todo)
# THAI_FONTS (todo)
# TIBETAN_FONTS (todo)
# VERTICAL_FONTS (todo)
# VIETNAMESE_FONTS (todo)
The installed fonts are shown by the command fc-list. See also the Debian wiki.
text2image --fonts_dir /usr/share/fonts --list_available_fonts
will also show all fonts.
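The listing can also be compared against a list of wanted fonts, for example an okfonts.txt file. The sketch below is only an illustration: the function names strip_index and report_missing_fonts are hypothetical, and it assumes that each listing line has the form "  0: Arial Bold" (an index, a colon, then the font name). Adjust the sed expression if your output looks different.

```shell
#!/bin/sh
# Remove the leading index from each listing line.
# Assumption: lines look like "  0: Arial Bold".
strip_index() {
    sed -e 's/^[[:space:]]*[0-9][0-9]*:[[:space:]]*//'
}

# Print the fonts from a wanted list (one name per line) that do not appear
# in the font listing read from stdin.
report_missing_fonts() {
    wanted=$1
    listing=$(mktemp)
    strip_index > "$listing"
    # -F fixed strings, -x whole-line match, -v keep lines with no match.
    grep -Fxv -f "$listing" "$wanted" || true
    rm -f "$listing"
}

# Usage (paths are placeholders):
# text2image --fonts_dir /usr/share/fonts --list_available_fonts 2>&1 \
#     | report_missing_fonts ./langdata_lstm/eng/okfonts.txt
```

An empty result means every font in the wanted list was found in the listing.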
Links
Sources of (mostly free) fonts
Fonts which cover many scripts
- https://savannah.gnu.org/projects/unifont/
Latin Fonts
- https://fontlibrary.org/en (GFS Bodoni)
- https://fonts.google.com/
- http://iginomarini.com/fell/the-revival-fonts/
- http://scholarsfonts.net/ (Cardo)
- http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=FontDownloads (SIL Fonts)
- http://www.ctan.org/tex-archive/fonts (GFS Bodoni)
- http://www.steffmann.de/wordpress/test-2/
Arabic Fonts
- https://fonts.google.com/?subset=arabic
Devanagari Fonts
- Aksharayogini2
- AksharayoginiBoldItalic
- AksharayoginiBold
- AksharayoginiItalic
- Aksharayogini
- Ananda Akchyar Devanagari
- AnnapurnaSIL
- CDAC-Surekh Bold
- CDAC-Surekh Normal
- CDAC-Yogesh Bold
- CDAC-Yogesh Italic
- CDAC-Yogesh Normal
- Chandas
- Gotu
- Jaini
- Jaini Purva
- Lohit Devanagari
- Nakula
- Mukta
- Murty Hindi
- Murty Sanskrit
- Sahadeva
- Sanskrit2003
- Santipur OT
- Sharad76
- Shobhika
- Shree-DV0726-OT
- Siddhanta
- Uttara
- Yashomudra Fonts
- Google Devanagari Fonts
- fonts from TDIL Hindi CD
- Linked from Bihar Vidhan Parishad
- Linked from bih.nic.in
Fraktur Fonts
- http://unifraktur.sourceforge.net/maguntia.html (UnifrakturMaguntia)
- http://www.orbitals.com/self/ligature/ligature.htm (Wyld)
- https://www.fontyukle.net/de/1,Walbaum
- http://de.ffonts.net/Walbaum-Fraktur.font.download
- http://www.1001fonts.com/fraktur-fonts.html
- http://www.dafont.com/fette-unz-fraktur.font
- http://www.1001freefonts.com/fette_fraktur.font
- http://www.ligafaktur.de/Schriften.html
- http://www.morscher.com/3r/fonts/fraktur.htm
Hebrew Fonts
Collections of fonts
- http://www.abstractfonts.com/
- http://www.schriftarten-fonts.de/ (German)
More information on fonts
- https://en.wikipedia.org/wiki/Fraktur
- http://www.orbitals.com/self/ligature/ligature.htm 18th Century Ligatures and Fonts
- http://www.steffmann.de/wordpress/ (German)