* [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4
From: Marco Atzeri @ 2018-08-09 7:23 UTC
To: cygwin
Version 4.0.0-0.4 of packages
libtesseract-ocr_4 (API bump)
tesseract-ocr
tesseract-ocr-devel
tesseract-training-util
and version 4.00-0.4 of the related language data
tesseract-ocr-languages (source only)
tesseract-ocr-deu
tesseract-ocr-eng
tesseract-ocr-fra
tesseract-ocr-ita
tesseract-ocr-nld
tesseract-ocr-por
tesseract-ocr-spa
tesseract-ocr-vie
tesseract-training-core
tesseract-training-deu
tesseract-training-eng
tesseract-training-fra
tesseract-training-ita
tesseract-training-nld
tesseract-training-por
tesseract-training-spa
tesseract-training-vie
are available in the Cygwin distribution.
Other language-specific data are available upstream at
https://github.com/tesseract-ocr/tessdata/
while training data for building new language data are in
https://github.com/tesseract-ocr/langdata
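For languages that are not packaged for Cygwin, the upstream files can be fetched directly. The following is only a minimal Python sketch: it assumes GitHub's usual `raw` URL layout and the `master` branch that appears in the links in this thread (the branch name may differ today). The helper names `traineddata_url` and `fetch_traineddata` are hypothetical, and the destination directory is whatever your installation uses for its tessdata files.

```python
from pathlib import Path
from urllib.request import urlretrieve

UPSTREAM = "https://github.com/tesseract-ocr"


def traineddata_url(lang, repo="tessdata", branch="master"):
    """Build the raw-download URL for one language file (assumed layout)."""
    return f"{UPSTREAM}/{repo}/raw/{branch}/{lang}.traineddata"


def fetch_traineddata(lang, dest_dir, repo="tessdata", branch="master"):
    """Download <lang>.traineddata into dest_dir and return the local path."""
    dest = Path(dest_dir) / f"{lang}.traineddata"
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(traineddata_url(lang, repo, branch), str(dest))
    return dest


# Only builds the URL; call fetch_traineddata() to actually download.
print(traineddata_url("dan"))
```

Calling `fetch_traineddata("dan", "/some/tessdata/dir")` would then place `dan.traineddata` in that directory; the path here is a placeholder, not the Cygwin package's actual location.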
CHANGES
Upstream Beta 4 release of the upcoming 4.x series.
https://github.com/tesseract-ocr/tesseract/releases
DESCRIPTION
Tesseract is probably the most accurate open source OCR engine
available. Combined with the Leptonica Image Processing Library
it can read a wide variety of image formats and convert them to
text in over 60 languages. It was one of the top 3 engines in
the 1995 UNLV Accuracy test.
It has since been improved extensively by Google.
It is released under the Apache License 2.0.
HOMEPAGE
https://github.com/tesseract-ocr/
Marco Atzeri
If you have questions or comments, please send them to the
cygwin mailing list at: cygwin (at) cygwin (dot) com .
---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
* Re: [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4
From: Stefan Weil @ 2018-08-09 8:05 UTC
To: cygwin
On 08.08.2018 at 19:27, Marco Atzeri wrote:
> Version 4.0.0-0.4 of packages
>
>   libtesseract-ocr_4 (API bump)
>   tesseract-ocr
>   tesseract-ocr-devel
>   tesseract-training-util
>
> and version 4.00-0.4 of the related language data
>
>   tesseract-ocr-languages (source only)
>   tesseract-ocr-deu
>   tesseract-ocr-eng
>   tesseract-ocr-fra
>   tesseract-ocr-ita
>   tesseract-ocr-nld
>   tesseract-ocr-por
>   tesseract-ocr-spa
>   tesseract-ocr-vie
>   tesseract-training-core
>   tesseract-training-deu
>   tesseract-training-eng
>   tesseract-training-fra
>   tesseract-training-ita
>   tesseract-training-nld
>   tesseract-training-por
>   tesseract-training-spa
>   tesseract-training-vie
>
> are available in the Cygwin distribution.
>
> Other language-specific data are available upstream at
>   https://github.com/tesseract-ocr/tessdata/
>
> while training data for building new language data are in
>   https://github.com/tesseract-ocr/langdata
Hi Marco,
thank you for providing those Tesseract packages.
A hint: I suggest removing the tesseract-training-* packages, as there
is currently no training data for Tesseract 4.0.0.
Regards
Stefan Weil
* Re: [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4
From: Marco Atzeri @ 2018-08-09 8:19 UTC
To: cygwin
On 09.08.2018 at 10:05, Stefan Weil wrote:
> On 08.08.2018 at 19:27, Marco Atzeri wrote:
>
> Hi Marco,
>
> thank you for providing those Tesseract packages.
>
> A hint: I suggest removing the tesseract-training-* packages, as there
> is currently no training data for Tesseract 4.0.0.
>
> Regards
> Stefan Weil
>
My understanding is that the trained data ("tessdata", "tessdata_fast",
"tessdata_best") come from the same training data as version 3:
https://github.com/tesseract-ocr/langdata
As far as I can tell, the raw language data did not have to change.
Regards
Marco
* Re: [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4
From: Stefan Weil @ 2018-08-09 9:19 UTC
To: cygwin
On 09.08.2018 at 10:19, Marco Atzeri wrote:
> My understanding is that the trained data ("tessdata", "tessdata_fast",
> "tessdata_best") come from the same training data as version 3:
>
> https://github.com/tesseract-ocr/langdata
>
> As far as I can tell, the raw language data did not have to change.
>
> Regards
> Marco
https://github.com/tesseract-ocr/langdata is valid for Tesseract 3.05.x
and earlier versions.
Tesseract 4.0.0 still supports the old traineddata format, but added new
(and typically better) traineddata based on neural networks. There is
currently no langdata available for those new traineddata.
tessdata_best only contains the new traineddata.
tessdata_fast also contains only new traineddata, but is faster and less
accurate.
tessdata still contains old traineddata for most languages and
additionally new traineddata made from tessdata_best, but using integer
instead of float models (which makes them faster).
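To make the differences between the three repositories easier to compare, here is a small Python sketch that records what is said above and derives which engine modes could be used with each. The `--oem` values (0 = legacy engine only, 1 = neural-net LSTM engine only) come from the tesseract 4 command-line help, not from this thread; the `REPOS` table and the `usable_oem_modes` helper are hypothetical names for illustration, and "legacy" for tessdata holds only for most languages.

```python
# What each upstream repository contains, as described in this message.
# "legacy" = old-format traineddata, "lstm" = new neural-network traineddata.
REPOS = {
    "tessdata_best": {"legacy": False, "lstm": True},
    "tessdata_fast": {"legacy": False, "lstm": True},
    # Legacy data is present for most (not necessarily all) languages.
    "tessdata": {"legacy": True, "lstm": True},
}


def usable_oem_modes(repo):
    """Return the --oem values (0 legacy, 1 LSTM) the repository can serve."""
    info = REPOS[repo]
    modes = []
    if info["legacy"]:
        modes.append(0)
    if info["lstm"]:
        modes.append(1)
    return modes


for name in REPOS:
    print(name, usable_oem_modes(name))
```

In practice this means that a file from tessdata_best or tessdata_fast should be expected to work only with the LSTM engine, while a file from tessdata can usually be used with either.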
tessdata_best, tessdata_fast and tessdata not only contain traineddata
for many languages, but also for "scripts", for example in
https://github.com/tesseract-ocr/tessdata/tree/master/script. Those
models support all languages using the same script, so
https://github.com/tesseract-ocr/tessdata/blob/master/script/Latin.traineddata
supports all languages which use Latin characters (English, French,
Spanish, Italian, German, Danish, ...). A selection of those script
models would be useful for Cygwin, too.
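As an illustration of how such a script model might be used, the Python sketch below assembles a tesseract command line. It assumes that `Latin.traineddata` has been placed in a `script/` subdirectory of the tessdata directory, so that it can be selected with `-l script/Latin`; the helper name `script_ocr_command` and the file names are hypothetical.

```python
import shlex


def script_ocr_command(image, output_base, script="Latin", tessdata_dir=None):
    """Build a tesseract command that selects a script model via -l."""
    cmd = ["tesseract", image, output_base]
    if tessdata_dir is not None:
        # Optional: point tesseract at a non-default tessdata directory.
        cmd += ["--tessdata-dir", tessdata_dir]
    # Assumes <tessdata>/script/<script>.traineddata exists.
    cmd += ["-l", f"script/{script}"]
    return cmd


# Print the command instead of running it; pass the list to
# subprocess.run() to execute it on a machine with tesseract installed.
print(shlex.join(script_ocr_command("page.png", "page")))
```

A single model selected this way would cover a page mixing, say, German and French text without having to list each language separately.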
Regards,
Stefan