Dictionary Based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Description
Dictionary matching is the inverse of full-text search: instead of finding the documents that match a query string, it finds all occurrences of a list of strings within a single document. This is easy when the list is small, but far from trivial when there are millions of patterns to search for. We describe a system that annotates large volumes of text held in Spark DataFrames, using Solr to hold one or more dictionaries. The system tags exact matches in the incoming text using SolrTextTagger, a Solr plugin that wraps Lucene’s Finite State Transducer (FST) technology to provide a very low-memory matcher. It also supports fuzzy tagging by using OpenNLP to chunk the incoming text into phrases and matching various normalized forms of each phrase against the dictionary. The functionality is invoked from Spark via a map() call and returns a list of 4-tuples, each consisting of the start and end character offsets of the match in the text, the ID of the matched entity, and a confidence score between 0 and 1 indicating how closely the dictionary entry matches the matched text segment. A modest Solr setup with 8 GB RAM and 30 GB disk can support up to 120 million dictionary entries from one or more dictionaries on a single box. Near-infinite horizontal scaling can be achieved by routing specific sets of dictionaries to specific boxes.
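To make the output format concrete, the following is a minimal pure-Python sketch of a dictionary annotator that returns (start, end, entity ID, confidence) 4-tuples. It is not the system described above: it uses neither Solr, SolrTextTagger, nor OpenNLP, and a plain in-memory dict stands in for the FST-backed dictionary. The names `DICTIONARY`, `normalize`, and `annotate`, the entity IDs, the three-word window, and the confidence values 1.0 (exact) and 0.75 (normalized match) are all assumptions chosen for illustration.

```python
import re
from typing import Dict, List, Tuple

# (start offset, end offset, entity ID, confidence in [0, 1])
Annotation = Tuple[int, int, str, float]

# Hypothetical toy dictionary: surface form -> entity ID.
DICTIONARY: Dict[str, str] = {
    "heart attack": "E001",
    "aspirin": "E002",
}


def normalize(phrase: str) -> str:
    """Lowercase and collapse punctuation and whitespace to single spaces.

    A simple stand-in for the "normalized forms" used in fuzzy tagging.
    """
    return re.sub(r"[^a-z0-9]+", " ", phrase.lower()).strip()


def annotate(text: str, dictionary: Dict[str, str],
             max_words: int = 3) -> List[Annotation]:
    """Greedy longest-match tagging over word n-grams of the text."""
    norm_dict = {normalize(k): v for k, v in dictionary.items()}
    # Character offsets of each word, so matches map back to the raw text.
    words = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    annotations: List[Annotation] = []
    i = 0
    while i < len(words):
        consumed = 1  # advance one word if nothing matches here
        # Try the longest window first, then shrink it.
        for n in range(min(max_words, len(words) - i), 0, -1):
            start, end = words[i][0], words[i + n - 1][1]
            span = text[start:end]
            if span in dictionary:
                # Exact match: illustrative confidence of 1.0.
                annotations.append((start, end, dictionary[span], 1.0))
            elif normalize(span) in norm_dict:
                # Match only after normalization: illustrative lower score.
                annotations.append((start, end, norm_dict[normalize(span)], 0.75))
            else:
                continue
            consumed = n
            break
        i += consumed
    return annotations


if __name__ == "__main__":
    print(annotate("Aspirin after a heart attack.", DICTIONARY))
```

In a Spark job, a function of this shape could be applied to each text row through a map() call, with the dictionary lookup replaced by a request to the Solr-hosted dictionary; that wiring is omitted here because it depends on deployment details not given in this description.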