Knoldus Inc

Tax Notices, Entity Extraction and Document Classification Using Document AI



Tax Notices comprise many different types of notices and contain information that can be recorded and saved to keep track of the tax status of an individual account. Identifying the document types and manually extracting data from hundreds of different types of notices is resource-intensive, time-consuming, and prone to errors. The client wanted this task automated using Machine Learning and OCR (Optical Character Recognition) to classify the different types of notices and extract useful information from them.

To make the extraction and Tax Notice document classification process automated, efficient, and standardized, the Knoldus team was asked to conduct a Proof of Concept (POC) and provide a solution. The objective of this POC was to perform accurate data extraction from Tax Notices by leveraging the services provided by Google Cloud Platform (GCP), which included Optical Character Recognition (OCR), Document AI, and Cloud Data Loss Prevention (DLP). For document classification, the modeling approaches implemented and tested were Jaccard Similarity with MinHash, plain Jaccard Similarity, and a Naive Bayes Classifier on Term-Frequency (TF) vectors using the hashing trick to handle out-of-vocabulary (OOV) terms. The target was to automate the classification of Tax Notice documents and the data extraction from those documents in a reliable way.

About the Organization

Ultimate Kronos Group (UKG) is an American multinational technology company with dual headquarters in Lowell, Massachusetts, and Weston, Florida. It provides workforce management and human resource management services. As a leading global provider of HCM, payroll, HR service delivery, and workforce management solutions, UKG’s award-winning Pro, Dimensions, and Ready solutions help tens of thousands of organizations across geographies and in every industry drive better business outcomes, improve HR effectiveness, streamline the payroll process, and help make work a better and more connected experience for everyone.

The client receives nearly 200-400 different types of US Tax Notice documents. The Knoldus team’s job was to create a generalized approach for data/entity extraction from documents using Google Cloud Platform services that was cost-efficient and extracted information with high confidence and/or accuracy, and then to classify those documents using a Machine Learning algorithm such as Naive Bayes on the OCR-extracted data. This OCR data is processed using the hashing trick, which addresses the memory consumption of a large vocabulary, and Term Frequency, which emphasizes context-relevant terms that occur infrequently in any particular type of document.

The Challenges

The most important challenges were with respect to the data provided. The Tax Notices data came as PDF files with issues such as rotated pages, poorly scanned pages with incorrect orientation, and blank pages appearing at random positions in the document.

Another challenge concerned the types of Tax Notices and the differentiation between them. With 200-400 different notice types, classifying these documents was difficult: the text/context in the documents was mostly similar, and finding the terms that exactly differentiate one document type from another was hard. This required us to implement different approaches such as Jaccard Similarity and Naive Bayes.

For information and entity extraction from documents, the Google Cloud Platform Document AI service was used. The Document AI Form Parser was used to process and convert unstructured data into a structured format. Although the required data was extracted with high accuracy, the parser extracted a few important entities with low confidence. While saving this extracted data, garbage values were extracted along with it as well; these needed to be analyzed carefully and discarded.


The Solution

Our team started by analyzing the different notice types and what relevant information could be extracted from each. The first task was to leverage the Document AI service, which extracted the data from documents using the Form Parser. The Doc AI Form Parser parses the data in the form of key-value pairs. The extracted data from all the documents was stored in a BigQuery table for use at later stages. Here is a sample of how the Document AI Form Parser extracts data from documents.
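Conceptually, the Form Parser's output can be pictured as a list of key-value pairs, each with a confidence score. The sketch below is a minimal illustration of flattening such output into rows for a BigQuery-style table; the field names, sample values, and helper function are hypothetical, not taken from the client's data or the actual pipeline.

```python
# Hypothetical shape of Form Parser output: each form field is a
# key/value pair carrying a confidence score. Real Document AI
# responses carry the same information inside the parsed Document.
parsed_fields = [
    {"key": "Notice Date", "value": "04/15/2021", "confidence": 0.97},
    {"key": "Tax-Payer Name", "value": "ACME LLC", "confidence": 0.91},
    {"key": "Amount Due", "value": "$1,250.00", "confidence": 0.64},
]

def to_rows(fields, doc_id):
    """Flatten key-value pairs into rows suitable for a BigQuery table."""
    return [
        {
            "doc_id": doc_id,
            "entity_key": f["key"].strip(),
            "entity_value": f["value"].strip(),
            "confidence": f["confidence"],
        }
        for f in fields
    ]

rows = to_rows(parsed_fields, doc_id="notice-0001")
```

Storing the confidence alongside each pair makes it possible to filter out low-confidence entities downstream instead of discarding them at extraction time.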


However, using this service/processor our team was only able to extract data that had key-value pairs. Some data, such as dates, company names, and identification numbers, had no key associated with it, while a few entities were embedded in paragraph content and still needed to be extracted. Google Cloud Data Loss Prevention (DLP) proved to be a very good solution. DLP uses built-in infotype detectors to extract information from documents. It has nearly 150 different infotypes; an infotype represents a type of sensitive PII data such as an email address, identification number, credit card number, or date of birth. Our team used the built-in infotypes and also created custom infotypes for extracting entities. With both services, i.e. Google Document AI and Data Loss Prevention (DLP), our team was able to extract all the data that was needed. The data/entity extraction accuracy was 85%-95%.
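DLP custom infotypes can be defined by a regular expression that describes the entity to detect. The stand-alone sketch below mimics that idea in plain Python; the infotype names and patterns are illustrative examples, not the ones used in the actual POC.

```python
import re

# Illustrative custom "infotype" detectors in the spirit of DLP:
# each maps an infotype name to a regex. These patterns are
# hypothetical examples, not the client's real detectors.
CUSTOM_INFOTYPES = {
    "US_EIN": re.compile(r"\b\d{2}-\d{7}\b"),        # e.g. 12-3456789
    "NOTICE_NUMBER": re.compile(r"\bCP\d{2,4}\b"),   # e.g. CP2000
}

def detect(text):
    """Return (infotype, match) findings, similar to DLP inspect results."""
    findings = []
    for name, pattern in CUSTOM_INFOTYPES.items():
        for m in pattern.finditer(text):
            findings.append((name, m.group()))
    return findings

sample = "Notice CP2000 regarding EIN 12-3456789 dated 04/15/2021."
findings = detect(sample)
```

In the real service, such patterns are registered as custom infotypes alongside the ~150 built-in detectors, so keyless entities buried in paragraph text can still be surfaced.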

In the next step, we used this extracted data and manually mapped it to the expected data fields needed from each notice type. This was done manually in order to analyze the labels that represent a particular entity in a notice document. For example, for the expected field “NAME” there can be different labels in different documents, such as [Name, Tax-Payer Name, etc.]. This manually mapped data was then used to create an automated pipeline that maps each entity’s exact label and value to the expected field.
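The mapping step above can be sketched as a simple alias lookup. The aliases below are a hypothetical illustration built from the “NAME” example in the text, not the full mapping the team produced.

```python
# Hypothetical label-to-field mapping: different notices label the
# same entity differently; normalize them to one expected field name.
LABEL_MAP = {
    "NAME": {"name", "tax-payer name", "taxpayer name"},
    "NOTICE_DATE": {"notice date", "date of notice"},
}

def map_label(raw_label):
    """Map a raw extracted label to its expected field, if known."""
    needle = raw_label.strip().lower()
    for field, aliases in LABEL_MAP.items():
        if needle in aliases:
            return field
    return None  # unmapped labels can be reviewed manually
```

Once the alias sets are curated by hand, the lookup itself runs automatically for every document, which is what turns the one-time manual analysis into a repeatable pipeline.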

After this, the document classification of Tax Notices was implemented, for which we first used the Jaccard Similarity with MinHash approach. MinHash is an LSH (Locality-Sensitive Hashing) family for Jaccard distance, where the input features are sets of natural numbers. The Jaccard distance of two sets is defined in terms of the cardinalities of their intersection and union. MinHash applies a random hash function to each element in a set and takes the minimum of all hashed values. However, we observed that the accuracy obtained from this approach was quite low and the results were not satisfactory.
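The MinHash idea described above can be sketched in a few lines: each of several random hash functions is applied to every token, the minimum survives, and the fraction of matching signature slots estimates the Jaccard similarity of the two token sets. This is a minimal stand-alone sketch, not the team's implementation; the sample sentences are invented.

```python
import random

def minhash_signature(tokens, num_hashes=64, seed=42):
    """MinHash: apply each random hash to every token, keep the minimum."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # large prime for (a*x + b) mod p hashing
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    hashed = [hash(t) & 0xFFFFFFFF for t in set(tokens)]
    return [min((a * x + b) % p for x in hashed) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / len(sig1)

doc_a = "your tax return shows a balance due".split()
doc_b = "your tax return shows an overpayment amount".split()
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
est = estimated_jaccard(sig_a, sig_b)
```

When most notice types share boilerplate wording, these set-overlap estimates cluster together, which is consistent with the low classification accuracy the team observed.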

The next approach used for classification was Multi-Class Classification using a Naive Bayes Classifier along with the hashing trick to handle OOV terms. Using the hashing trick addresses the memory consumption of a large vocabulary, and it also mitigates the problem of filter circumvention. With this approach, accuracy increased to 75-85% and the model was able to classify the tax notices far more accurately.

