Technology Overview

Even with the ongoing digital transformation, many companies and financial institutions still spend considerable time manually processing information from countless documents. Because digital files such as PDFs and images do not expose their contents as structured data, records and figures must be read and entered by hand.


Consequently, extracting relevant information remains problematic. The manual operation is error-prone and costly, and it is virtually impossible to scale.

Automating data capture from documents is an enormous challenge that requires a new set of advanced technologies. Gemina harnesses these ground-breaking algorithms to automatically capture 85% of invoice data, leaving only a small portion for correction and monitoring.

Invoice OCR Technology

How do we do that?

Before answering that question, let’s review the history of NLP approaches to data extraction and classification.

The first approach to be used was Regular Expression Matching (RegEx). A regular expression is a sequence of characters that specifies a search pattern in text. Such patterns are usually used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. For data extraction, patterns such as date formats and invoice numbers are loaded into a rule engine that runs RegEx operations on the input text and looks for a match.

That approach, however, requires complete knowledge of every format in advance, and it is difficult to maintain. We believe it has a low ceiling of roughly 50% precision.
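
To make the RegEx approach concrete, here is a minimal sketch of such a rule engine in Python. The rule names, the DD/MM/YYYY date format, and the "INV-" invoice-number prefix are assumptions chosen for illustration, not patterns taken from any real system.

```python
import re

# Hypothetical rule set: field name -> compiled pattern.
# The formats (DD/MM/YYYY dates, "INV-" prefixed numbers) are assumed
# for illustration only.
RULES = {
    "date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "invoice_number": re.compile(r"\bINV-\d{4,}\b"),
}


def apply_rules(text):
    """Return the first match of each rule, or None where a rule finds nothing."""
    result = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        result[field] = match.group(0) if match else None
    return result


# An invoice written in the formats the rules anticipate.
sample = "Invoice INV-20418 issued on 03/11/2021, total due 1,250.00"
print(apply_rules(sample))

# The same information in a layout the rules do not anticipate.
other = "Invoice No. 20418, dated 3 Nov 2021"
print(apply_rules(other))
```

The first call finds both fields, while the second returns None for each of them even though the text carries the same information. Every new layout needs another hand-written pattern, which is why rule sets of this kind are hard to maintain.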


Next, with the introduction of more advanced Machine Learning algorithms and greater computational power, many organizations adopted an ML approach. Combining it with the RegEx approach is still very widespread. While the ML approach can deliver success rates of 60%-70% on average, it requires a particularly large dataset as well as meticulous feature selection. Last but not least, each field or data type requires its own customized algorithm, e.g. classification, extraction, or ranking.


Finally, a new set of NLP algorithms has been developed and has managed to break through the seemingly impenetrable ceiling of average quality. It is based on multiple approaches such as Recurrent Neural Networks, Deep Learning, and Language Models, just to name a few. This approach has famously revolutionized major internet services, e.g. Google Translate, and the Google Search Engine in 2019.

This approach takes data capture to new heights and can reach an unprecedented level of 85% accuracy across thousands of invoice formats and designs. It does not come easily, though: not only does it require a high level of expertise, but it is also difficult to develop and deploy. To give one example, training a single financial language model can take several weeks to complete!

Gemina is the only company that implements these state-of-the-art algorithms in the Hebrew language. Our service is one of a kind, and our superior set of technologies creates a clear competitive advantage among Fintech providers.

Have More Questions?

Contact one of our specialists and find out how our product can work for your company.