
Cleaning up dirty OCR


Cleaning up dirty OCR

Tom Burton-West-2
Hello all,

We have been indexing a large collection of OCR'd text: about 5 million books in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error rate creates a relatively large number of meaningless unique terms. (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words )

We would like to remove some *fraction* of these nonsense words caused by OCR errors prior to indexing. (We don't want to remove "real" words, so we need a method with very few false positives.)

A dictionary-based approach does not seem feasible given the number of languages and the inclusion of proper names, place names, and technical terms. We are considering using some heuristics, such as looking for strings over a certain length or strings containing more than some number of punctuation characters.

This paper has a few such heuristics:
Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. "Automatic Removal of Garbage Strings in OCR Text: An Implementation." In Proceedings of the 5th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, Florida, July 2001. http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf
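As a rough illustration, a sketch of such length/punctuation heuristics might look like the following (the thresholds are arbitrary placeholders, not values taken from the paper):

```python
import re

# Arbitrary placeholder thresholds (not values from the Taghva et al. paper).
MAX_TOKEN_LENGTH = 25       # very long tokens are usually OCR run-ons
MAX_PUNCT_CHARS = 2         # more than a couple of punctuation marks is suspicious
MAX_ALNUM_SWITCHES = 3      # repeated alpha/digit switching, e.g. "l0nd0n1a2"

_PUNCT_RE = re.compile(r"[^\w\s]", re.UNICODE)

def looks_like_ocr_garbage(token: str) -> bool:
    """Conservative heuristic check for a single token."""
    if len(token) > MAX_TOKEN_LENGTH:
        return True
    if len(_PUNCT_RE.findall(token)) > MAX_PUNCT_CHARS:
        return True
    # Count alpha/digit transitions inside the token.
    switches = sum(
        1 for a, b in zip(token, token[1:])
        if (a.isalpha() and b.isdigit()) or (a.isdigit() and b.isalpha())
    )
    return switches > MAX_ALNUM_SWITCHES

if __name__ == "__main__":
    for t in ["house", "Daughters", "ffllllll.lj'slall'lm", "l0nd0n1a2b3"]:
        print(t, looks_like_ocr_garbage(t))
```

In practice this kind of check would presumably be ported into the analysis chain (e.g. as a custom token filter) rather than run as a separate pass, but the rules themselves stay the same.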

Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?

Tom Burton-West
University of Michigan Library
www.hathitrust.org


Re: Cleaning up dirty OCR

Robert Muir
> Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.
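LUCENE-1812 itself prunes an already-built index; purely as an illustration of the underlying idea (not the patch's API), a toy preprocessing sketch might look like this:

```python
from collections import Counter

def drop_document_hapaxes(tokens):
    """Drop terms that occur only once in this document.

    Illustration of the idea only: LUCENE-1812 prunes postings from an
    already-built index rather than filtering at analysis time."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] > 1]

if __name__ == "__main__":
    doc = "eye of newt and toe of frog".split()
    print(drop_document_hapaxes(doc))  # only 'of' survives; every other term is a hapax here
```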


--
Robert Muir
[hidden email]

Re: Cleaning up dirty OCR

simon-2748
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir <[hidden email]> wrote:

> > Can anyone suggest any practical solutions to removing some fraction of
> the tokens containing OCR errors from our input stream?
>
> one approach would be to try
> http://issues.apache.org/jira/browse/LUCENE-1812
>
> and filter terms that only appear once in the document.
>

In another life (and with another search engine) I also had to find a solution to the dirty OCR problem. Fortunately it was only in English; unfortunately the corpus contained many non-American/non-English names, so we also had to be very conservative and reduce the number of false positives.

There wasn't any completely satisfactory solution. There were a large number of two- and three-letter n-grams, so we were able to use a dictionary approach to eliminate those (names tend to be longer). We also looked for runs of punctuation and unlikely mixes of alpha/numeric/punctuation, and eliminated longer words which consisted of runs of not-occurring-in-English bigrams.
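As a rough sketch of that last heuristic, assuming a reference English word list at /usr/share/dict/words (the path and threshold are assumptions):

```python
def english_bigrams(wordlist_path="/usr/share/dict/words"):
    """Collect the set of letter bigrams seen in a reference English word list."""
    bigrams = set()
    with open(wordlist_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            w = line.strip().lower()
            bigrams.update(w[i:i + 2] for i in range(len(w) - 1))
    return bigrams

def has_unseen_bigrams(word, known_bigrams, max_unseen=0):
    """Flag a word containing more than max_unseen bigrams absent from the reference set."""
    w = word.lower()
    unseen = sum(1 for i in range(len(w) - 1) if w[i:i + 2] not in known_bigrams)
    return unseen > max_unseen

# known = english_bigrams()
# flagged = [w for w in tokens if has_unseen_bigrams(w, known)]  # 'tokens' is a hypothetical token list
```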

Hope this helps

-Simon

>
> --
>

RE: Cleaning up dirty OCR

Tom Burton-West-2
In reply to this post by Robert Muir
Thanks Robert,

I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon).

There are a couple of use cases where removing words that occur only once might be a problem.  

One is for genealogical searches where a user might want to retrieve a document if their relative is only mentioned once in the document.  We have quite a few government documents and other resources such as the "Lineage Book" of the Daughters of the American Revolution.  

Another use case is humanities researchers doing phrase searching for quotes. In this case, if we remove one of the words in the quote because it occurs only once in a document, then the phrase search would fail. For example, if someone were searching Macbeth and entered the phrase query "Eye of newt and toe of frog", it would fail if we had removed "newt" from the index, because "newt" occurs only once in Macbeth.

I ran a quick check against a couple of our copies of Macbeth and found that, out of about 5,000 unique words, about 3,000 occurred only once. Of these, about 1,800 were in the unix dictionary, so at least 1,800 words that would be removed would be "real" words as opposed to OCR errors (a spot check of the words not in the unix /usr/share/dict/words file revealed most of them also to be real words rather than OCR errors).

I also ran a quick check against a document with bad OCR and, out of about 30,000 unique words, 20,000 occurred only once. Of those 20,000, only about 300 were in the unix dictionary, so your intuition that a lot of OCR errors will occur only once seems spot on. A quick look at the words not in the dictionary revealed a mix of technical terms, common names, and obvious OCR nonsense such as "ffllllll.lj'slall'lm".
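For reference, a rough sketch of the kind of quick check described above (the tokenization pattern, dictionary path, and input file name are simplifying assumptions):

```python
import re
from collections import Counter

def hapax_report(text_path, dict_path="/usr/share/dict/words"):
    """Count unique words, hapaxes, and how many hapaxes appear in the unix word list."""
    with open(text_path, encoding="utf-8", errors="ignore") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    counts = Counter(words)
    hapaxes = {w for w, c in counts.items() if c == 1}
    with open(dict_path, encoding="utf-8", errors="ignore") as f:
        dictionary = {line.strip().lower() for line in f}
    print("unique words:", len(counts))
    print("hapaxes:", len(hapaxes))
    print("hapaxes in dictionary:", len(hapaxes & dictionary))

# hapax_report("macbeth.txt")  # hypothetical input file
```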

I guess the question I need to answer is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above. When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing, just to get an idea of the potential performance gains of pruning the index.

I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list.  Can you suggest which list I should post an index pruning question to?

Tom

-----Original Message-----
From: Robert Muir [mailto:[hidden email]]
Sent: Tuesday, March 09, 2010 2:36 PM
To: [hidden email]
Subject: Re: Cleaning up dirty OCR

> Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.


--
Robert Muir
[hidden email]

Re: Cleaning up dirty OCR

Robert Muir
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom <[hidden email]> wrote:
> Thanks Robert,
>
> I've been thinking about this since you suggested it on another thread.  One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon.)
>

You are correct. I really hate recommending that you 'remove data', but at
the same time, perhaps as an intermediate step, this could be a
brutally simple approach to move you along.


> I guess the question I need to determine is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above.   When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing just to get an idea of the potential performance gains of pruning the index.

Well, one thing I did with Andrzej's patch was to immediately relevance-test this approach against some corpora I had. The results are on the JIRA issue, and the test collection itself is in openrelevance.

In my opinion the P@N is probably overstated and the MAP values are probably understated (due to it being a pooled relevance collection), but I think it's fair to say that for that specific large text collection, pruning terms that appear in a document only a single time does not hurt relevance.

At the same time, I won't dispute that it could actually help P@N; I'm just saying I'm not sold :)

Either way it's extremely interesting: cut your index size in half and get the same relevance!

>
> I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list.  Can you suggest which list I should post an index pruning question to?
>

I would recommend posting it to the JIRA issue:
http://issues.apache.org/jira/browse/LUCENE-1812

This way someone who knows more (Andrzej) could see it, too.


--
Robert Muir
[hidden email]

Re: Cleaning up dirty OCR

Tom Burton-West
In reply to this post by simon-2748
Thanks Simon,

We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of Unicode character blocks. For example, some of the CJK material ends up with Cyrillic characters. (Though we would have to watch out for any Russian-Chinese dictionaries :)
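A minimal sketch of the character-block idea, using Unicode character names to approximate scripts (the list of "suspicious" script pairs is an assumption and would need tuning per collection):

```python
import unicodedata

# Script pairs unlikely to co-occur inside a single real token (an assumption).
SUSPICIOUS_PAIRS = [{"CJK", "CYRILLIC"}, {"LATIN", "CYRILLIC"}]

def scripts_in(token):
    """Rough set of scripts used in a token, from the first word of each Unicode character name."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue
        try:
            scripts.add(unicodedata.name(ch).split()[0])  # e.g. 'LATIN', 'CYRILLIC', 'CJK'
        except ValueError:
            continue
    return scripts

def mixes_unlikely_scripts(token):
    found = scripts_in(token)
    return any(pair <= found for pair in SUSPICIOUS_PAIRS)

if __name__ == "__main__":
    print(mixes_unlikely_scripts("книга"))   # False: Cyrillic only
    print(mixes_unlikely_scripts("kнига"))   # True: Latin 'k' mixed into a Cyrillic word
```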

Tom


There wasn't any completely satisfactory solution. There were a large number of two- and three-letter n-grams, so we were able to use a dictionary approach to eliminate those (names tend to be longer). We also looked for runs of punctuation and unlikely mixes of alpha/numeric/punctuation, and eliminated longer words which consisted of runs of not-occurring-in-English bigrams.

Hope this helps

-Simon

>
> --
>

Re: Cleaning up dirty OCR

Robert Muir
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West <[hidden email]> wrote:
>
> Thanks Simon,
>
> We can probably implement your suggestion about runs of punctuation and
> unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
> looking for unlikely mixes of unicode character blocks.  For example some of
> the CJK material ends up with Cyrillic characters. (except we would have to
> watch out for any Russian-Chinese dictionaries:)
>

Ok, this is a new one for me. I am just curious: have you figured out why this is happening?

Separately, I would love to see some character frequency data for your non-English text. Are you OCR'ing that data too? Are you using Unicode normalization or anything to prevent an explosion of terms that are really the same?

--
Robert Muir
[hidden email]

Re: Cleaning up dirty OCR

hossman
In reply to this post by Tom Burton-West

: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
: looking for unlikely mixes of unicode character blocks.  For example some of
: the CJK material ends up with Cyrillic characters. (except we would have to
: watch out for any Russian-Chinese dictionaries:)

Since you are dealing with multiple languages, and multiple variant usages of languages (i.e. Olde English), I wonder if one way to generalize the idea of "unlikely" letter combinations into a math problem (instead of a grammar/spelling problem) would be to score all the hapax legomenon words in your index based on the frequency of the (character) n-grams in each of those words, relative to the entire corpus, and then eliminate any of the hapax legomenon words whose score is below some cutoff threshold (which you'd have to pick arbitrarily, probably by eyeballing the sorted list of words and their contexts to decide if they are legitimate)

        ?
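A rough sketch of that scoring idea, assuming a corpus vocabulary with term frequencies and a list of hapax words are already available (the n-gram size and cutoff are arbitrary assumptions):

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    w = "_" + word.lower() + "_"   # pad so word boundaries contribute n-grams too
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def build_ngram_counts(term_freqs, n=3):
    """term_freqs: dict of word -> corpus term frequency."""
    counts = Counter()
    for word, freq in term_freqs.items():
        for g in char_ngrams(word, n):
            counts[g] += freq
    return counts

def ngram_score(word, ngram_counts, n=3):
    """Average log frequency of the word's character n-grams; lower means more unusual."""
    grams = char_ngrams(word, n)
    return sum(math.log1p(ngram_counts.get(g, 0)) for g in grams) / len(grams)

def suspicious_hapaxes(hapax_words, ngram_counts, cutoff=2.0):
    """Hapax words scoring below an (eyeballed) cutoff are candidates for elimination."""
    return [w for w in hapax_words if ngram_score(w, ngram_counts) < cutoff]
```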


-Hoss


Re: Cleaning up dirty OCR

Walter Underwood
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:

> I wonder if one way to try and generalize
> the idea of "unlikely" letter combinations into a math problem (instead of
> grammer/spelling problem) would be to score all the hapax legomenon
> words in your index


Hmm, how about a classifier? Common words are the "yes" training set, hapax legomenons are the "no" set, and n-grams are the features.
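A minimal sketch of such a classifier over character n-gram features (scikit-learn is an assumption here, any similar toolkit would do, and the training word lists are placeholders):

```python
# scikit-learn is assumed to be available; the word lists are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_word_classifier(common_words, hapax_words):
    """Common corpus words form the 'yes' (real word) class, hapax words the 'no' class,
    with character n-grams as features."""
    words = list(common_words) + list(hapax_words)
    labels = [1] * len(common_words) + [0] * len(hapax_words)
    model = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
        MultinomialNB(),
    )
    model.fit(words, labels)
    return model

# model = train_word_classifier(common_words, hapax_words)   # hypothetical word lists
# model.predict(["newt", "ffllllll"])                        # 1 = looks like a real word
```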

But why isn't the OCR program already doing this?

wunder





Re: Cleaning up dirty OCR

Tom Burton-West
In reply to this post by hossman
Interesting. I wonder, though, if we have 4 million English documents and 250 in Urdu, whether the Urdu words would score badly when compared to n-gram statistics for the entire corpus.

hossman wrote

Since you are dealing with multiple languages, and multiple variant usages of languages (i.e. Olde English), I wonder if one way to generalize the idea of "unlikely" letter combinations into a math problem (instead of a grammar/spelling problem) would be to score all the hapax legomenon words in your index based on the frequency of the (character) n-grams in each of those words, relative to the entire corpus, and then eliminate any of the hapax legomenon words whose score is below some cutoff threshold (which you'd have to pick arbitrarily, probably by eyeballing the sorted list of words and their contexts to decide if they are legitimate)

        ?


-Hoss

Re: Cleaning up dirty OCR

Tom Burton-West
In reply to this post by Walter Underwood
We've been thinking about running some kind of classifier against each book to select books with a high percentage of dirty OCR for some kind of special processing. We haven't quite figured out a multilingual feature set yet, other than the punctuation/alphanumeric and character-block ideas mentioned above.

I'm not sure I understand your suggestion. Since real-word hapax legomenons are generally pretty common (maybe 40-60% of unique words), wouldn't using them as the "no" set provide mixed signals to the classifier?

Tom

Walter Underwood-2 wrote
Hmm, how about a classifier? Common words are the "yes" training set, hapax legomenons are the "no" set, and n-grams are the features.

But why isn't the OCR program already doing this?

wunder




Re: Cleaning up dirty OCR

hossman
In reply to this post by Tom Burton-West

: Interesting. I wonder, though, if we have 4 million English documents and 250
: in Urdu, whether the Urdu words would score badly when compared to n-gram
: statistics for the entire corpus.

Well, it doesn't have to be a strict ratio cutoff... you could look at the
average frequency of all character n-grams in your index, consider any
n-gram whose frequency falls more than X standard deviations below the
average to be suspicious, and eliminate any word that contains Y or more
suspicious n-grams.

Or you could just start really simple and eliminate any word that contains
an n-gram that doesn't appear in *any* other word in your corpus.
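A rough sketch of the standard-deviation variant, given a dict of corpus-wide character n-gram frequencies (X and Y are arbitrary placeholders, and computing the statistics over log counts is an added assumption to cope with heavily skewed frequencies):

```python
import math
import statistics

def suspicious_ngrams(ngram_counts, x_stddevs=2.0):
    """Character n-grams whose (log) corpus frequency falls more than X standard
    deviations below the average.  Using log counts is an added assumption,
    since raw n-gram frequencies are heavily skewed."""
    logs = [math.log1p(c) for c in ngram_counts.values()]
    floor = statistics.fmean(logs) - x_stddevs * statistics.pstdev(logs)
    return {g for g, c in ngram_counts.items() if math.log1p(c) < floor}

def word_is_suspicious(word, suspect_ngrams, y=2, n=3):
    """True if the word contains Y or more suspicious character n-grams."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    return sum(1 for g in grams if g in suspect_ngrams) >= y
```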

I don't deal with a lot of multilingual stuff, but my understanding is
that this sort of thing gets a lot easier if you can partition your docs
by language -- and even if you can't, you could do some language detection
on the (dirty) OCR'd text to get a language guess, then partition by
language and attempt to find the suspicious words in each partition.


-Hoss


Re: Cleaning up dirty OCR

Robert Muir
>
> I don't deal with a lot of multi-lingual stuff, but my understanding is
> that this sort of thing gets a lot easier if you can partition your docs
> by language -- and even if you can't, doing some langauge detection on the
> (dirty) OCRed text to get a language guess (and then partition by language
> and attempt to find the suspicious words in each partition)
>

and if you are really OCR'ing Urdu text and trying to search it automatically,
then this is your last priority.

--
Robert Muir
[hidden email]