Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used as a form of data entry from some sort of original paper data source, whether documents, sales receipts, mail, or any number of printed records. It is crucial to the computerization of printed texts so that they can be electronically searched, stored more compactly, displayed on-line, and used in machine processes such as machine translation, text-to-speech and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.
Early versions needed to be programmed with images of each character,
and worked on one font at a time. "Intelligent" systems with a high
degree of recognition accuracy for most fonts are now common. Some
systems are capable of reproducing formatted output that closely
approximates the original scanned page including images, columns and
other non-textual components.
Importance of OCR to the Blind
In 1974 Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognize text printed in virtually any font. He decided that the best application of this technology would be to create a reading machine for the blind, which would allow blind people to have a computer read text to them out loud. This device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976 the successful finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind. In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Xerox eventually spun it off as Scansoft, which merged with Nuance Communications.
OCR software
Desktop & Server OCR Software
OCR and ICR software are analytical artificial intelligence systems that
consider sequences of characters rather than whole words or phrases.
Based on the analysis of sequential lines and curves, OCR and ICR make
'best guesses' at characters, using database look-up tables to closely
associate or match the strings of characters that form words.
WebOCR & OnlineOCR
As information technology has developed, the platforms on which people
use software have shifted from the single PC to a mix of PC, Web-based,
cloud computing and mobile platforms. After 30 years of development as a
desktop product, OCR software began adapting to these new application
requirements. WebOCR, also known as OnlineOCR or Web-based OCR service,
is a newer trend that serves larger volumes and larger groups of users.
Internet and broadband technologies have made WebOCR and OnlineOCR
practical for both individual users and enterprise customers. Since
2000, some major OCR vendors have offered WebOCR and online software,
and a number of new entrants have seized the opportunity to develop
innovative Web-based OCR services, some of which are free of charge.
Application-Oriented OCR
As OCR technology has been applied more and more widely in
paper-intensive industries, it has had to cope with more complex images
in the real world: complicated backgrounds, degraded images, heavy
noise, paper skew, picture distortion, low resolution, interference from
grids and lines, and text containing special fonts, symbols, glossary
words and so on. All of these factors affect the stability of an OCR
product's recognition accuracy.
In recent years, the major OCR technology providers have begun to
develop dedicated OCR systems, each for a special type of image. They
combine various optimization methods related to that image type, such as
business rules, standard expressions, glossaries or dictionaries, and
the rich information contained in color images, to improve recognition
accuracy.
This strategy of customizing OCR technology is called
"Application-Oriented OCR" or "Customized OCR". It is widely used in
fields such as Business-card OCR, Invoice OCR, Screenshot OCR, ID card
OCR, Driver-license OCR and Auto plant OCR.
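The idea of combining a business rule with the raw recognition result can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field is taken to be an invoice amount with two decimal places, and the look-alike table (DIGIT_LOOKALIKES) and function name (correct_amount_field) are hypothetical, not part of any particular OCR product.

```python
import re
from typing import Optional

# Hypothetical confusion table: letters that a recognizer might emit
# in place of visually similar digits. Chosen for illustration only.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Assumed business rule: an amount is digits, a point, then two digits.
AMOUNT_PATTERN = re.compile(r"\d+\.\d{2}")


def correct_amount_field(raw: str) -> Optional[str]:
    """Map look-alike letters to digits, then accept the result only if
    it satisfies the amount rule; otherwise return None so the field
    can be sent for human review."""
    candidate = raw.strip().translate(DIGIT_LOOKALIKES)
    return candidate if AMOUNT_PATTERN.fullmatch(candidate) else None


if __name__ == "__main__":
    for raw in ("1O4.5O", "l2.S0", "abc"):
        print(repr(raw), "->", correct_amount_field(raw))
```

Because the rule knows the field can only contain digits and a decimal point, it can resolve ambiguities that the character shapes alone cannot, which is the advantage claimed for application-specific systems.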
Current state of OCR technology
Commissioned by the U.S. Department of Energy
(DOE), the Information Science Research Institute (ISRI) had the
mission of fostering the improvement of automated technologies for
understanding machine-printed documents. It conducted the most authoritative of these evaluations, the Annual Test of OCR Accuracy, for five consecutive years in the mid-1990s.
Recognition of Latin-script,
typewritten text is still not 100% accurate even where clear imaging is
available. One study based on recognition of 19th- and early
20th-century newspaper pages concluded that character-by-character OCR
accuracy for commercial OCR software varied from 71% to 98%; total accuracy can be achieved only by human review. Other areas, including recognition of hand printing, cursive
handwriting, and printed text in other scripts (especially those East
Asian language characters which have many strokes for a single
character), are still the subject of active research.
Accuracy rates can be measured in several ways, and how they are
measured can greatly affect the reported accuracy rate. For example, if
word context (basically a lexicon of words) is not used to correct
recognized strings that are not real words, a character error rate of 1%
(99% accuracy) may result in an error rate of 5% (95% accuracy) or worse
if the measurement is based on whether each whole word was recognized
with no incorrect letters.
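The relationship between the two figures can be checked with a short calculation. The Python sketch below rests on two simplifying assumptions that are not stated in the text: character errors are independent of one another, and a typical word is about five characters long. Under those assumptions a word is correct only if every one of its characters is correct.

```python
def word_error_rate(char_accuracy: float, word_length: int) -> float:
    """Probability that a word contains at least one wrong character,
    assuming each character is recognized independently."""
    return 1.0 - char_accuracy ** word_length


if __name__ == "__main__":
    # 99% character accuracy, for several assumed word lengths.
    for length in (3, 5, 8):
        rate = word_error_rate(0.99, length)
        print(f"word length {length}: word error rate {rate:.1%}")
```

For five-character words this gives a word error rate of about 4.9%, in line with the 5% quoted above, and longer words push the figure higher.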
On-line character recognition is sometimes confused with Optical Character Recognition. OCR is an instance of off-line character recognition, where the system recognizes the fixed static shape of the character, while on-line character recognition instead recognizes the dynamic motion during handwriting. For example, on-line recognition, such as that used for gestures in the Penpoint OS or the Tablet PC,
can tell whether a horizontal mark was drawn right-to-left or
left-to-right. On-line character recognition is also referred to by
other terms such as dynamic character recognition, real-time character
recognition, and Intelligent Character Recognition or ICR.
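The difference between the two kinds of input can be made concrete with a small Python sketch. It assumes, purely for illustration, that an on-line device reports a stroke as a time-ordered list of (x, y) points with x increasing to the right; the function name horizontal_direction is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def horizontal_direction(stroke: List[Point]) -> str:
    """Classify a stroke by comparing the x coordinate of its first
    and last sampled points (points are in the order they were drawn)."""
    if len(stroke) < 2:
        raise ValueError("a stroke needs at least two points")
    dx = stroke[-1][0] - stroke[0][0]
    if dx > 0:
        return "left-to-right"
    if dx < 0:
        return "right-to-left"
    return "undetermined"


if __name__ == "__main__":
    drawn_rightwards = [(0.0, 5.0), (4.0, 5.0), (9.0, 5.0)]
    drawn_leftwards = list(reversed(drawn_rightwards))

    # On-line view: the point order distinguishes the two strokes.
    print(horizontal_direction(drawn_rightwards))
    print(horizontal_direction(drawn_leftwards))

    # Off-line view: the inked positions are identical, so a scanned
    # image of either stroke would look the same.
    print(set(drawn_rightwards) == set(drawn_leftwards))
```

Both strokes leave the same mark on the page, so only a system that records the pen's motion can tell them apart.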
On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton
pioneered this kind of product. The algorithms used in these devices
take advantage of the fact that the order, speed, and direction of
individual line segments at input are known. Also, the user can be
retrained to use only specific letter shapes. These methods cannot be
used in software that scans paper documents, so accurate recognition of
hand-printed documents is still largely an open problem. Accuracy rates
of 80% to 90% on neat, clean hand-printed characters can be achieved,
but that accuracy rate still translates to dozens of errors per page,
making the technology useful only in very limited applications.
Recognition of cursive text is an active area of research, with
recognition rates even lower than those of hand-printed text. Higher
rates of recognition of general cursive script will likely not be
possible without the use of contextual or grammatical information. For
example, recognizing entire words from a dictionary is easier than
trying to parse individual characters from script. Reading the Amount line of a cheque
(which is always a written-out number) is an example where using a
smaller dictionary can increase recognition rates greatly. Knowledge of
the grammar of the language being scanned can also help determine if a
word is likely to be a verb or a noun, for example, allowing greater
accuracy. The shapes of individual cursive characters themselves simply
do not contain enough information to accurately (greater than 98%)
recognise all handwritten cursive script.
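The benefit of a small dictionary can be sketched in Python with the standard library's difflib module. The word list, the similarity cutoff and the function name snap_to_lexicon are assumptions chosen for illustration; a real cheque reader would use its own lexicon and matching method.

```python
import difflib
from typing import List, Optional

# Assumed lexicon: words that can appear in a written-out amount.
AMOUNT_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
    "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety", "hundred", "thousand", "and",
]


def snap_to_lexicon(
    raw_word: str,
    lexicon: List[str] = AMOUNT_WORDS,
    cutoff: float = 0.6,
) -> Optional[str]:
    """Return the lexicon word most similar to a noisy recognition
    result, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        raw_word.lower(), lexicon, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    for noisy in ("thausand", "fjfty", "xqzv"):
        print(noisy, "->", snap_to_lexicon(noisy))
```

With only a few dozen candidate words, a result with one or two misread characters still has a single plausible match, whereas the same errors against a full dictionary would leave many candidates.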
OCR is a basic technology that is also used in advanced scanning
applications. As a result, an advanced scanning solution can be unique,
patented and not easily copied, despite being built on the same basic
OCR technology.
For more complex recognition problems, intelligent character recognition systems are generally used, as artificial neural networks can be made indifferent to both affine and non-linear transformations.
A technique which is having considerable success in recognising
difficult words and character groups within documents generally amenable
to computer OCR is to submit them automatically to humans in the reCAPTCHA system.