Handwritten text recognition in 2020

Main challenges in recognizing handwritten text

  • Tremendous variability and ambiguity of strokes from person to person
  • Inconsistency in writing style
  • Poor quality of the source document due to degradation over time
  • Oftentimes text is not written in a strictly straight line
  • Cursive handwriting makes separation and recognition of characters even more challenging
  • Text in handwriting can have a variable rotation
  • Collecting a high-quality dataset with labels is relatively expensive

Use cases

Healthcare and pharmaceuticals



Source: The Pennsylvania State University

Online libraries

Source: akg-images


Some of the techniques

Multi-dimensional Recurrent Neural Networks

  • Best-path (greedy) decoding is the generic decoding we have implicitly discussed so far. At each time step, we take the output of the model and simply pick the symbol with the highest probability.
  • Beam search decoding instead keeps the several output paths with the highest probabilities, expanding each of them with new outputs at every step and dropping the lower-probability paths to keep the beam size constant.
  • Beam search usually provides more accurate results than best-path decoding, but there is still room for improvement. One way to aim for greater performance is to use a language model alongside beam search, combining the probabilities from the recognition model with those from the language model (which, for example, scores character sequences according to their probabilities) to generate the final results.
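The two decoding strategies above can be sketched in a few lines of Python. The probability matrix, the two-symbol alphabet, and the blank index below are made-up toy values for illustration; the beam search is simplified (it scores raw paths and collapses only the winner) to keep the idea visible:

```python
import numpy as np

# Toy per-time-step output: probabilities over 'a', 'b', and the CTC blank '-'.
# These numbers are invented for the example, not from a real model.
probs = np.array([
    [0.6, 0.3, 0.1],   # t=0
    [0.5, 0.4, 0.1],   # t=1
    [0.1, 0.2, 0.7],   # t=2
    [0.2, 0.7, 0.1],   # t=3
])
alphabet = ['a', 'b']
BLANK = 2

def collapse(path):
    """Standard CTC post-processing: merge repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != '-':
            out.append(sym)
        prev = sym
    return ''.join(out)

def best_path_decode(probs):
    """Greedy decoding: take the argmax symbol at every time step."""
    path = ['-' if i == BLANK else alphabet[i] for i in probs.argmax(axis=1)]
    return collapse(path)

def beam_search_decode(probs, beam_width=3):
    """Keep the beam_width most probable paths, extending each with every
    possible symbol at each step and pruning the rest."""
    beams = [('', 1.0)]
    for t in range(len(probs)):
        candidates = {}
        for prefix, score in beams:
            for idx, p in enumerate(probs[t]):
                sym = '-' if idx == BLANK else alphabet[idx]
                new = prefix + sym
                candidates[new] = candidates.get(new, 0.0) + score * p
        beams = sorted(candidates.items(), key=lambda kv: -kv[1])[:beam_width]
    return collapse(beams[0][0])

print(best_path_decode(probs))    # both collapse the path 'aa-b' to 'ab'
print(beam_search_decode(probs))
```

A language model would plug into the beam search by rescoring each candidate prefix (score * p multiplied by the language-model probability of the extended character sequence) before pruning.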

Encoder-decoder and attention networks

  • Content-based attention: The idea behind this methodology is to find the similarity between the current hidden state of the decoder and the feature map from the encoder. We can find the most correlated feature vectors in the feature map of the encoder, which can be used to predict the current character at the current time step. A more intuitive way of looking at how attention mechanisms work can be found here.
  • Location-based attention: The main disadvantage of content-based mechanisms is the implicit assumption that location information is embedded in the output of the encoder; otherwise, there would be no way for the decoder to differentiate between repeated characters in the output. Let's look at an example to make things clearer. Assume we have the word Parashift, in which the character a appears twice. Without location information, the decoder cannot predict the two occurrences as separate characters. To address this problem, location-based attention predicts the current character and its alignment using both the encoder output and the previous alignment. If you are curious about how the methodology works in greater detail, you can dig deeper here.

Transformer models

Handwriting text generation



