|
Hello all,
We have been indexing a large collection of OCR'd text. About 5 million books in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error rate creates a relatively large number of meaningless unique terms. (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words ) We would like to remove some *fraction* of these nonsense words caused by OCR errors prior to indexing. ( We don't want to remove "real" words, so we need some method with very few false positives.) A dictionary based approach does not seem feasible given the number of languages and the inclusion of proper names, place names, and technical terms. We are considering using some heuristics, such as looking for strings over a certain length or strings containing more than some number of punctuation characters. This paper has a few such heuristics: Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. Automatic Removal of ``Garbage Strings'' in OCR Text: An Implementation. In The 5th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, Florida, July 2001. http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream? Tom Burton-West University of Michigan Library www.hathitrust.org |
|
> Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?
one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only appear once in the document. -- Robert Muir [hidden email] |
|
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir <[hidden email]> wrote:
> > Can anyone suggest any practical solutions to removing some fraction of > the tokens containing OCR errors from our input stream? > > one approach would be to try > http://issues.apache.org/jira/browse/LUCENE-1812 > > and filter terms that only appear once in the document. > In another life (and with another search engine) I also had to find a solution to the dirty OCR problem. Fortunately only in English, unfortunately a corpus containing many non-American/non-English names, so we also had to be very conservative and reduce the number of false positives. There wasn't any completely satisfactory solution; there were a large number of two and three letter n-grams so we were able to use a dictionary approach to eliminate those (names tend to be longer). We also looked for runs of punctuation, unlikely mixes of alpha/numeric/punctuation, and also eliminated longer words which consisted of runs of not-ocurring-in-English bigrams. Hope this helps -Simon > > -- > |
|
In reply to this post by Robert Muir
Thanks Robert,
I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon.) There are a couple of use cases where removing words that occur only once might be a problem. One is for genealogical searches where a user might want to retrieve a document if their relative is only mentioned once in the document. We have quite a few government documents and other resources such as the "Lineage Book" of the Daughters of the American Revolution. Another use case is humanities researchers doing phrase searching for quotes. In this case, if we remove one of the words in the quote because it occurs only once in a document, then the phrase search would fail. For example if someone were searching Macbeth and entered the phrase query: "Eye of newt and toe of frog" it would fail if we had removed "newt" from the index because "newt" occurs only once in Macbeth. I ran a quick check against a couple of our copies of Macbeth and found out of about 5,000 unique words about 3,000 occurred only once. Of these about 1,800 were in the unix dictionary, so at least 1800 words that would be removed would be "real" words as opposed to OCR errors (a spot check of the words not in the unix /usr/share/dict/words file revealed most of them also as real words rather than OCR errors.) I also ran a quick check against a document with bad OCR and out of about 30,000 unique words, 20,000 occurred only once. Of those 20,000 only about 300 were in the unix dictionary so your intuition that a lot of OCR errors will occur only once seems spot on. A quick look at the words not in the dictionary revealed a mix of technical terms, common names, and obvious OCR nonsense such as "ffllllll.lj'slall'lm" " I guess the question I need to determine is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above. When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing just to get an idea of the potential performance gains of pruning the index. I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list. Can you suggest which list I should post an index pruning question to? Tom -----Original Message----- From: Robert Muir [mailto:[hidden email]] Sent: Tuesday, March 09, 2010 2:36 PM To: [hidden email] Subject: Re: Cleaning up dirty OCR > Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream? one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only appear once in the document. -- Robert Muir [hidden email] |
|
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom <[hidden email]> wrote:
> Thanks Robert, > > I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon.) > You are correct. I really hate recommending you 'remove data', but at the same time, as perhaps an intermediate step, this could be a brutally simple approach to move you along. > I guess the question I need to determine is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above. When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing just to get an idea of the potential performance gains of pruning the index. Well, one thing I did with Andrzej's patch is immediately relevance-test this approach against some corpora I had. The results are on the JIRA issue, and the test collection itself is in openrelevance. In my opinion the P@N is probably overstated, and the MAP values are probably understated (due to it being a pooled relevance collection), but I think its fair to say for that specific large text collection, pruning terms that only appear in the document a single time does not hurt relevance. At the same time I will not dispute that it could actually help P@N, I am just saying I'm not sold :) Either way its extremely interesting, cut your index size in half, and get the same relevance! > > I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list. Can you suggest which list I should post an index pruning question to? > I would recommend posting it to the JIRA issue: http://issues.apache.org/jira/browse/LUCENE-1812 This way someone who knows more (Andrzej) could see it, too. -- Robert Muir [hidden email] |
|
In reply to this post by simon-2748
Thanks Simon,
We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of unicode character blocks. For example some of the CJK material ends up with Cyrillic characters. (except we would have to watch out for any Russian-Chinese dictionaries:) Tom
|
|
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West <[hidden email]> wrote:
> > Thanks Simon, > > We can probably implement your suggestion about runs of punctuation and > unlikely mixes of alpha/numeric/punctuation. I'm also thinking about > looking for unlikely mixes of unicode character blocks. For example some of > the CJK material ends up with Cyrillic characters. (except we would have to > watch out for any Russian-Chinese dictionaries:) > Ok this is a new one for me, I am just curious, have you figured out why this is happening? Separately, i would love to know some sort of character frequency data for your non-english text, are you OCR'ing that data too? Are you using Unicode normalization or anything to prevent explosion of terms that are really the same? -- Robert Muir [hidden email] |
|
In reply to this post by Tom Burton-West
: We can probably implement your suggestion about runs of punctuation and : unlikely mixes of alpha/numeric/punctuation. I'm also thinking about : looking for unlikely mixes of unicode character blocks. For example some of : the CJK material ends up with Cyrillic characters. (except we would have to : watch out for any Russian-Chinese dictionaries:) Since you are dealing with multiple langugaes, and multiple varient usages of langauges (ie: olde english) I wonder if one way to try and generalize the idea of "unlikely" letter combinations into a math problem (instead of grammer/spelling problem) would be to score all the hapax legomenon words in your index based on the frequency of (character) N-grams in each of those words, relative the entire corpus, and then eliminate any of the hapax legomenon words whose score is below some cut off threshold (that you'd have to pick arbitrarily, probably by eyeballing the sorted list of words and their contexts to deide if they are legitimate) ? -Hoss |
|
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:
> I wonder if one way to try and generalize > the idea of "unlikely" letter combinations into a math problem (instead of > grammer/spelling problem) would be to score all the hapax legomenon > words in your index Hmm, how about a classifier? Common words are the "yes" training set, hapax legomenons are the "no" set, and n-grams are the features. But why isn't the OCR program already doing this? wunder |
|
In reply to this post by hossman
Interesting. I wonder though if we have 4 million English documents and 250 in Urdu, if the Urdu words would score badly when compared to ngram statistics for the entire corpus.
|
|
In reply to this post by Walter Underwood
We've been thinking about running some kind of a classifier against each book to select books with a high percentage of dirty OCR for some kind of special processing. Haven't quite figured out a multilingual feature set yet other than the punctuation/alphanumeric and character block ideas mentioned above.
I'm not sure I understand your suggestion. Since real word hapax legomenons are generally pretty common (maybe 40-60% of unique words) wouldn't using them as the "no" set provide mixed signals to the classifier? Tom
|
|
In reply to this post by Tom Burton-West
: Interesting. I wonder though if we have 4 million English documents and 250 : in Urdu, if the Urdu words would score badly when compared to ngram : statistics for the entire corpus. Well it doesn't have to be a strict ratio cutoff .. you could look at the average frequency of all character Ngrams in your index, and then consider any Ngram that appeared fewer then X stddev's below the average to be suspicious, and eliminate any work that contains Y or more suspicious Ngrams. Of you could just start really simple and eliminate any work that contains an Ngram that doesn't appear in *any* other word in your corpus. I don't deal with a lot of multi-lingual stuff, but my understanding is that this sort of thing gets a lot easier if you can partition your docs by language -- and even if you can't, doing some langauge detection on the (dirty) OCRed text to get a language guess (and then partition by language and attempt to find the suspicious words in each partition) -Hoss |
|
>
> I don't deal with a lot of multi-lingual stuff, but my understanding is > that this sort of thing gets a lot easier if you can partition your docs > by language -- and even if you can't, doing some langauge detection on the > (dirty) OCRed text to get a language guess (and then partition by language > and attempt to find the suspicious words in each partition) > and if you are really OCR'ing Urdu text and trying to search it automatically, then this is your last priority. -- Robert Muir [hidden email] |
| Powered by Nabble | Edit this page |
