Tax Notices, Entity Extraction and Document Classification Using Document AI
GCP Services - OCR, Document AI, DLP & ML/NLP - Term Frequency, Hashing Trick, Jaccard Similarity, Naive Bayes Algorithm.
“Technology is a great equalizer that enables our clients to compete with the largest banks in the world. One of the significant technology advantages that Knoldus provides is the ability to share expertise across our product portfolio, supporting the significant events that occur throughout an end user’s financial journey, from opening an account, to initiating a home or small business loan, to saving for college or retirement,” said the Vice President of Hosting Architecture.
Our team started by analyzing the different notice types and identifying the relevant information that could be extracted from each. The first task was to leverage the Document AI service, extracting data from the documents with the Form Parser processor. The Doc AI Form Parser returns the data as key-value pairs. The extracted data from all the documents was stored in a BigQuery table for use at later stages. Here is a sample of how the Document AI Form Parser extracts data from documents.
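As a rough illustration of this step, the snippet below flattens a Form Parser-style response into a plain key-value dict. The nested dict mimics the shape of the processor's output (pages, form fields, field name/value); the sample field names and values are invented, not taken from a real notice.

```python
# Hypothetical sketch: flattening Form Parser-style output into key-value
# pairs. The nested dict below only mimics the response shape
# (pages -> form_fields -> field_name / field_value); values are invented.

def extract_key_value_pairs(document: dict) -> dict:
    """Collect every form field on every page as a {key: value} dict."""
    pairs = {}
    for page in document.get("pages", []):
        for field in page.get("form_fields", []):
            key = field["field_name"]["text"].strip().rstrip(":")
            value = field["field_value"]["text"].strip()
            pairs[key] = value
    return pairs

sample_document = {
    "pages": [
        {
            "form_fields": [
                {"field_name": {"text": "Tax-Payer Name:"},
                 "field_value": {"text": "Jane Doe"}},
                {"field_name": {"text": "Notice Date:"},
                 "field_value": {"text": "01/15/2021"}},
            ]
        }
    ]
}

print(extract_key_value_pairs(sample_document))
# {'Tax-Payer Name': 'Jane Doe', 'Notice Date': '01/15/2021'}
```

In the real pipeline, a dict like the one returned here is what would be written to the BigQuery table, one row per document.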
Using this processor, however, our team could only extract data that appeared as key-value pairs. Some data, such as dates, company names, and identification numbers, had no key associated with it, while a few entities were embedded in paragraph content and still needed to be extracted. Google Cloud Data Loss Prevention (DLP) proved a very good solution here. DLP uses built-in infoType detectors to extract information from documents, and it ships with roughly 150 of them. An infoType is a category of sensitive or PII data, such as an email address, identification number, credit card number, or date of birth. Our team used the built-in infoTypes and also created custom infoTypes for extracting entities. With both services, Google Document AI and DLP, our team was able to extract all the data that was needed, with entity extraction accuracy of 85%-95%.
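DLP's real infoType detectors run server-side in the API, but the idea can be sketched locally. The toy "detectors" below loosely mimic the built-in EMAIL_ADDRESS infoType and a custom infoType for a notice identification number; both regex patterns and the NOTICE_ID format are invented for illustration, not DLP's actual detection logic.

```python
import re

# Illustrative sketch only: real DLP infoType detection happens in the API.
# These regexes loosely mimic the built-in EMAIL_ADDRESS infoType and a
# custom infoType for a notice ID (the CP-nnnn pattern is invented).
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+\w"),
    "NOTICE_ID": re.compile(r"\bCP-\d{4}\b"),  # hypothetical custom infoType
}

def inspect(text: str) -> list:
    """Return (infotype, match) findings, like a DLP inspect response."""
    findings = []
    for infotype, pattern in DETECTORS.items():
        for match in pattern.findall(text):
            findings.append((infotype, match))
    return findings

text = "Notice CP-2057 was mailed to jane.doe@example.com on 04/12/2021."
print(inspect(text))
# [('EMAIL_ADDRESS', 'jane.doe@example.com'), ('NOTICE_ID', 'CP-2057')]
```

A custom infoType in DLP is configured in much the same spirit: you supply a regex or a word list, and the service reports each match along with the infoType name.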
In the next step, we manually mapped this extracted data to the expected data fields needed from each notice type. Doing this by hand let us analyze the different labels that represent a particular entity across notice documents: for the expected field “NAME”, for example, different documents may use labels such as Name or Tax-Payer Name. This manually mapped data was then used to build an automated pipeline that maps each entity’s label and its value to the expected field.
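The mapping step can be pictured as an inverted lookup table from normalized labels to canonical field names. The label lists below are examples of the kind of variation described above, not the project's actual mapping.

```python
# Sketch of the label-to-field mapping step. The label lists are invented
# examples of the variation seen across notices, not the real mapping.
FIELD_LABELS = {
    "NAME": ["name", "tax-payer name", "taxpayer name"],
    "NOTICE_DATE": ["date", "notice date", "date of notice"],
}

# Invert into a lookup table keyed by normalized (lowercased) label.
LABEL_TO_FIELD = {
    label: field
    for field, labels in FIELD_LABELS.items()
    for label in labels
}

def map_to_expected_fields(pairs: dict) -> dict:
    """Map extracted {label: value} pairs onto the expected field names."""
    mapped = {}
    for label, value in pairs.items():
        field = LABEL_TO_FIELD.get(label.strip().lower())
        if field:
            mapped[field] = value
    return mapped

print(map_to_expected_fields({"Tax-Payer Name": "Jane Doe",
                              "Notice Date": "01/15/2021"}))
# {'NAME': 'Jane Doe', 'NOTICE_DATE': '01/15/2021'}
```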
After this, we implemented document classification of the tax notices, first using a Jaccard similarity MinHash approach. MinHash is an LSH family for Jaccard distance, where the input features are sets of natural numbers. The Jaccard similarity of two sets is the size of their intersection divided by the size of their union (Jaccard distance is one minus this value). MinHash applies a random hash function to each element of a set and keeps the minimum of all hashed values; the probability that two sets share the same minimum equals their Jaccard similarity. However, we observed that the accuracy obtained from this approach was quite low, and the results were not satisfying.
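A minimal MinHash sketch, assuming sets of integer token ids: each of k random hash functions of the form h(x) = (a·x + b) mod p contributes the minimum hashed value per set, and the fraction of matching minima estimates the Jaccard similarity. The constants and example sets here are for illustration only.

```python
import random

# Minimal MinHash sketch (pure Python). Each hash function is
# h(x) = (a*x + b) mod P; the signature keeps the minimum per function,
# and the fraction of matching minima estimates Jaccard similarity.
P = 2_147_483_647  # a Mersenne prime, larger than any token id here

def make_hash_funcs(k: int, seed: int = 0):
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]

def minhash_signature(tokens: set, hash_funcs) -> list:
    return [min((a * t + b) % P for t in tokens) for a, b in hash_funcs]

def estimated_jaccard(sig1, sig2) -> float:
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

funcs = make_hash_funcs(128)
a = {1, 2, 3, 4, 5, 6, 7, 8}
b = {1, 2, 3, 4, 5, 6, 9, 10}
sig_a = minhash_signature(a, funcs)
sig_b = minhash_signature(b, funcs)
# True Jaccard similarity is |a ∩ b| / |a ∪ b| = 6/10 = 0.6;
# the estimate below should land near that value.
print(round(estimated_jaccard(sig_a, sig_b), 2))
```

With only 128 hash functions the estimate is noisy, which hints at why a similarity-only classifier over short, template-like notices can struggle.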
The next approach was multi-class classification using a Naive Bayes classifier, with the hashing trick to handle out-of-vocabulary (OOV) terms. The hashing trick addresses the memory cost of storing a large vocabulary, and it also mitigates the problem of filter circumvention. With this approach the accuracy increased to 75-85%, and the model could classify the tax notices much more reliably.
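The combination can be sketched in a few lines: the hashing trick maps every token (including unseen ones) to one of a fixed number of buckets, so no vocabulary is stored, and a multinomial Naive Bayes model with Laplace smoothing is trained on the bucket counts. The tiny training corpus and the CP14/CP21 labels below are invented for illustration and are not the project's data.

```python
import math
import zlib
from collections import Counter

# Toy sketch: hashing trick + multinomial Naive Bayes. Tokens are hashed
# into a fixed number of buckets (no vocabulary to store; OOV words still
# land in some bucket). Training corpus and labels are invented.
N_BUCKETS = 64

def hash_features(text: str) -> Counter:
    """Map each token to a bucket index via a deterministic CRC32 hash."""
    return Counter(zlib.crc32(tok.encode()) % N_BUCKETS
                   for tok in text.lower().split())

def train(docs):
    """docs: list of (text, label). Returns class priors and bucket counts."""
    priors, counts = Counter(), {}
    for text, label in docs:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(hash_features(text))
    return priors, counts

def predict(text, priors, counts):
    feats = hash_features(text)
    total_docs = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(counts[label].values())
        # log P(label) + sum over buckets of log P(bucket | label),
        # with Laplace (add-one) smoothing over the N_BUCKETS buckets.
        score = math.log(prior / total_docs)
        for bucket, n in feats.items():
            score += n * math.log((counts[label][bucket] + 1) /
                                  (total + N_BUCKETS))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ("balance due payment owed", "CP14"),
    ("payment due balance unpaid tax", "CP14"),
    ("refund issued to taxpayer", "CP21"),
    ("your refund amount was adjusted", "CP21"),
]
priors, counts = train(docs)
print(predict("balance due on unpaid tax", priors, counts))  # "CP14"
```

Note that "on" never appears in training and OOV-hashes into some bucket anyway; smoothing keeps its contribution finite, which is exactly the robustness the hashing trick buys over a fixed vocabulary.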