public inbox for cygwin@cygwin.com
* [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4
@ 2018-08-09  7:23 Marco Atzeri
  2018-08-09  8:05 ` Stefan Weil
From: Marco Atzeri @ 2018-08-09  7:23 UTC
  To: cygwin

Version 4.0.0-0.4 of the packages

    libtesseract-ocr_4   (API bump)
    tesseract-ocr
    tesseract-ocr-devel
    tesseract-training-util

and version 4.00-0.4 of the related language data

    tesseract-ocr-languages (source only)
    tesseract-ocr-deu
    tesseract-ocr-eng
    tesseract-ocr-fra
    tesseract-ocr-ita
    tesseract-ocr-nld
    tesseract-ocr-por
    tesseract-ocr-spa
    tesseract-ocr-vie
    tesseract-training-core
    tesseract-training-deu
    tesseract-training-eng
    tesseract-training-fra
    tesseract-training-ita
    tesseract-training-nld
    tesseract-training-por
    tesseract-training-spa
    tesseract-training-vie

are available in the Cygwin distribution.

Other language-specific data are available upstream at
   https://github.com/tesseract-ocr/tessdata/

while training data for building new language data are at
   https://github.com/tesseract-ocr/langdata
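
As a rough illustration, the Python sketch below checks which of the
packaged language codes are present in a local tessdata directory. It
assumes each language is stored as <code>.traineddata; the function name
is hypothetical, and the directory location is left to the caller because
it is not stated in this announcement.

```python
from pathlib import Path

# Language codes shipped as tesseract-ocr-<code> packages in this announcement.
PACKAGED_LANGUAGES = ["deu", "eng", "fra", "ita", "nld", "por", "spa", "vie"]


def missing_languages(tessdata_dir, wanted=PACKAGED_LANGUAGES):
    """Return the codes in `wanted` that have no <code>.traineddata file.

    `tessdata_dir` is whatever directory holds the traineddata files on
    your system; its location is an assumption left to the caller.
    """
    present = {p.stem for p in Path(tessdata_dir).glob("*.traineddata")}
    return [code for code in wanted if code not in present]
```

Calling missing_languages() on the local tessdata directory returns the
codes that still need to be installed or downloaded.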

CHANGES
Upstream Beta 4 release of the upcoming 4.x series.
https://github.com/tesseract-ocr/tesseract/releases

DESCRIPTION
Tesseract is probably the most accurate open source OCR engine
available. Combined with the Leptonica Image Processing Library
it can read a wide variety of image formats and convert them to
text in over 60 languages. It was one of the top 3 engines in
the 1995 UNLV Accuracy test.
It has since been improved extensively by Google.
It is released under the Apache License 2.0.


HOMEPAGE
https://github.com/tesseract-ocr/


Marco Atzeri

If you have questions or comments, please send them to the
cygwin mailing list at: cygwin (at) cygwin (dot) com .

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


* Re: [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4
  2018-08-09  7:23 [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4 Marco Atzeri
@ 2018-08-09  8:05 ` Stefan Weil
  2018-08-09  8:19   ` Marco Atzeri
From: Stefan Weil @ 2018-08-09  8:05 UTC
  To: cygwin

On 08.08.2018 at 19:27, Marco Atzeri wrote:
> Version 4.0.0-0.4  of packages
> 
>    libtesseract-ocr_4   (API bump)
>    tesseract-ocr
>    tesseract-ocr-devel
>    tesseract-training-util
> 
> and version 4.00-0.4 of relative language data
> 
>    tesseract-ocr-languages (source only)
>    tesseract-ocr-deu
>    tesseract-ocr-eng
>    tesseract-ocr-fra
>    tesseract-ocr-ita
>    tesseract-ocr-nld
>    tesseract-ocr-por
>    tesseract-ocr-spa
>    tesseract-ocr-vie
>    tesseract-training-core
>    tesseract-training-deu
>    tesseract-training-eng
>    tesseract-training-fra
>    tesseract-training-ita
>    tesseract-training-nld
>    tesseract-training-por
>    tesseract-training-spa
>    tesseract-training-vie
> 
> are available in the Cygwin distribution:
> 
> Other language specific data are available upstream
>   https://github.com/tesseract-ocr/tessdata/
> 
> while training data for building new language data are in
>   https://github.com/tesseract-ocr/langdata


Hi Marco,

Thank you for providing these Tesseract packages.

A hint: I suggest removing the tesseract-training-* packages, as no
training data currently exists for Tesseract 4.0.0.

Regards
Stefan Weil


* Re: [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4
  2018-08-09  8:05 ` Stefan Weil
@ 2018-08-09  8:19   ` Marco Atzeri
  2018-08-09  9:19     ` Stefan Weil
From: Marco Atzeri @ 2018-08-09  8:19 UTC
  To: cygwin

On 09.08.2018 at 10:05, Stefan Weil wrote:
> On 08.08.2018 at 19:27, Marco Atzeri wrote:

>
> Hi Marco,
>
> thank you for providing those Tesseract packages.
>
> A hint: I suggest to remove the tesseract-training-* packages as there
> currently does not exist training data for Tesseract 4.0.0.
>
> Regards
> Stefan Weil
>

My understanding is that the trained data (tessdata, tessdata_fast,
tessdata_best) come from the same training data as version 3:

https://github.com/tesseract-ocr/langdata

I would not expect the raw language data to have changed.

Regards
Marco





* Re: [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4
  2018-08-09  8:19   ` Marco Atzeri
@ 2018-08-09  9:19     ` Stefan Weil
From: Stefan Weil @ 2018-08-09  9:19 UTC
  To: cygwin

On 09.08.2018 at 10:19, Marco Atzeri wrote:
> My understanding is that the trained data "tessdata, tessdata_fast,
> tessdata_best" are coming from the same training data then version 3
> 
> https://github.com/tesseract-ocr/langdata
> 
> It is not that the languages raw data should be changed.
> 
> Regards
> Marco

https://github.com/tesseract-ocr/langdata is valid for Tesseract 3.05.x
and earlier versions.

Tesseract 4.0.0 still supports the old traineddata format, but adds new
(and typically better) traineddata based on neural networks. No langdata
is currently available for the new traineddata.

tessdata_best only contains the new traineddata.

tessdata_fast also contains only new traineddata, but is faster and less
accurate.

tessdata still contains old traineddata for most languages and
additionally new traineddata made from tessdata_best, but using integer
instead of float models (which makes them faster).
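
The differences between the three repositories can be summarised in a
small Python sketch. It records only the properties stated above. The
helper names are hypothetical, and the /raw/<branch>/ download layout and
the default branch name are assumptions that should be checked against
the repositories themselves.

```python
# The three upstream repositories, summarised from the description above.
# Only properties stated in this thread are recorded here.
TESSDATA_REPOS = {
    "tessdata_best": {
        "old_format": False,
        "note": "new traineddata only",
    },
    "tessdata_fast": {
        "old_format": False,
        "note": "new traineddata only, faster and less accurate",
    },
    "tessdata": {
        "old_format": True,
        "note": "old traineddata plus integer models made from tessdata_best",
    },
}


def repos_with_old_format():
    """Return the repositories that still carry old-format traineddata."""
    return [repo for repo, info in TESSDATA_REPOS.items() if info["old_format"]]


def traineddata_url(repo, name, branch="master"):
    """Build a download URL for <name>.traineddata in one repository.

    The URL layout and the default branch name are assumptions, not
    something confirmed in this thread.
    """
    if repo not in TESSDATA_REPOS:
        raise ValueError(f"unknown repository: {repo}")
    return f"https://github.com/tesseract-ocr/{repo}/raw/{branch}/{name}.traineddata"
```

For example, traineddata_url("tessdata_best", "eng") builds the address
of the English model in tessdata_best under those assumptions.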

tessdata_best, tessdata_fast and tessdata not only contain traineddata
for many languages, but also for "scripts", for example in
https://github.com/tesseract-ocr/tessdata/tree/master/script. Those
models support all languages using the same script, so
https://github.com/tesseract-ocr/tessdata/blob/master/script/Latin.traineddata
supports all languages which use Latin characters (English, French,
Spanish, Italian, German, Danish, ...). A selection of those script
models would be useful for Cygwin, too.
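
As a hedged illustration of how a script model might be used, the Python
sketch below builds a tesseract command line. It assumes that
Latin.traineddata sits in a script/ subdirectory of the tessdata
directory and that it can then be selected with "-l script/Latin"; the
function name and the file names in the usage note are hypothetical.

```python
def tesseract_command(image, output_base, model="script/Latin", tessdata_dir=None):
    """Build an argument list for running tesseract with one model.

    `model` is passed to -l. Selecting a script model as "script/Latin"
    assumes script/Latin.traineddata exists below the tessdata directory.
    """
    cmd = ["tesseract", image, output_base, "-l", model]
    if tessdata_dir is not None:
        # Point tesseract at a non-default directory of traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True), for
example with tesseract_command("page.png", "page"), which would write
the recognised text to page.txt if the model is installed.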

Regards,
Stefan

