Indexing and Hit Highlighting OCR Data

Indexing and Hit Highlighting OCR Data

Corey Keith
Hi,
 
I am involved in a project that is trying to provide searching and hit highlighting on scanned images of historical newspapers.  We have an XML-based OCR format; a sample is below.  We need to index the CONTENT attribute of the String element, which is the easy part.  We would also like to be able to find the "hits" within this XML document so that we can use the positioning information to draw highlight boxes on the image.  It doesn't make a lot of sense to just extract the CONTENT and index that, because we lose the positioning information.  My second thought was to write a custom analyzer that drops everything except the CONTENT attribute, and then use the highlighting class in the sandbox to reanalyze the XML document and mark the hits.  With the hits marked in the XML, we could find the position information and draw on the image.  Has anyone else worked with OCR information and Lucene?  What was your approach?  Does this approach seem sound?  Any recommendations?
 
Thanks, Corey
 
     <TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
      <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
      <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
      <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
      <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
      <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0" HPOS="2192.0" VPOS="123711.0" CONTENT="lays" WC="1.0"/>
      <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
      <String STYLEREFS="ID4" HEIGHT="1981.0" WIDTH="360.0" HPOS="2528.0" VPOS="123711.0" CONTENT="about" WC="1.0"/>
      <SP WIDTH="236.0" HPOS="2292.0" VPOS="123711.0"/>
      <String STYLEREFS="ID4" HEIGHT="1855.0" WIDTH="216.0" HPOS="3000.0" VPOS="123770.0" CONTENT="140" WC="1.0"/>
      <SP WIDTH="112.0" HPOS="2888.0" VPOS="123711.0"/>
      <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0" HPOS="3316.0" VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
      <SP WIDTH="100.0" HPOS="3216.0" VPOS="123770.0"/>
     </TextLine>



Re: Indexing and Hit Highlighting OCR Data

Chris Hostetter-3

This is a pretty interesting problem.  I envy you.

I would avoid the existing highlighter for your purposes -- highlighting
in token space is a very different problem from "highlighting" in 2D
space.

Based on the XML sample you provided, it looks like your XML files
are already a "tokenized" form of the original OCR data -- by which I
mean the page has already been tokenized into words whose positions are
recorded.

I would parse these XML docs to generate two things:
    1) a stream of words for analysis/filtering (e.g. stop words, stemming,
       synonyms)
    2) a data structure mapping words to lists of positions (i.e. if the
       same word appears in multiple places, list the word once, followed
       by each set of coordinates)

Use #1 in the usual way, and add a serialized form of #2 to your index as
a Stored Keyword -- at query time, the words from your initial query can
be looked up in that data structure to find the regions to "highlight".
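
A rough sketch of the two-output parse described above, written in Python only to keep it short and independent of any particular Lucene version. The names parse_ocr, regions_for and normalize are made up for illustration. The sketch assumes the OCR XML has no namespace, and it uses lowercasing as a stand-in for whatever analyzer is really applied. The keys of the position map need to be in the same form as the terms that are looked up at query time.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict

# Three String elements taken from the sample in the original post.
SAMPLE = """<TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
  <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
  <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
  <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
  <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
  <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0" HPOS="2192.0" VPOS="123711.0" CONTENT="lays" WC="1.0"/>
</TextLine>"""


def normalize(word):
    # Assumption: a stand-in for the real analysis chain (stemming etc.).
    return word.lower()


def parse_ocr(xml_text):
    """Return (words, boxes).

    words: the raw word stream, in document order (output #1).
    boxes: normalized word -> list of (HPOS, VPOS, WIDTH, HEIGHT) (output #2).
    """
    root = ET.fromstring(xml_text)
    words = []
    boxes = defaultdict(list)
    for el in root.iter("String"):
        word = el.get("CONTENT")
        if not word:
            continue
        words.append(word)
        box = tuple(float(el.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        boxes[normalize(word)].append(box)
    return words, dict(boxes)


def regions_for(query_terms, serialized):
    """Look up each query term in the serialized map and collect its boxes."""
    boxes = json.loads(serialized)
    return [tuple(b) for t in query_terms for b in boxes.get(normalize(t), [])]


words, boxes = parse_ocr(SAMPLE)
stored = json.dumps(boxes)  # the string that would go into the stored field
print(" ".join(words))
print(regions_for(["female"], stored))
```

JSON is only one possible serialization. Anything that round-trips the word-to-coordinates map as a single stored string would serve the same purpose.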



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Indexing and Hit Highlighting OCR Data

Erik Hatcher

On Jun 2, 2005, at 9:02 PM, Chris Hostetter wrote:

> [...]
> I would parse these XML docs to generate two things:
>     1) a stream of words for analysis/filtering (e.g. stop words,
>        stemming, synonyms)
>     2) a data structure mapping words to lists of positions (i.e. if
>        the same word appears in multiple places, list the word once,
>        followed by each set of coordinates)
>
> Use #1 in the usual way, and add a serialized form of #2 to your
> index as a Stored Keyword -- at query time, the words from your
> initial query can be looked up in that data structure to find the
> regions to "highlight".

Chris - that is a great recommendation.  I second it.  The only minor
thing I'll add is that you should probably use an unindexed field for
#2 rather than literally a Field.Keyword - there is no point in indexing
it, as you would never search on that data structure.

     Erik


Re: Indexing and Hit Highlighting OCR Data

Corey Keith
In reply to this post by Corey Keith
With this approach, all work is done at the word level.  For a phrase query, the results will contain only pages with the entire phrase, but when we go to highlight a page, _every_ occurrence of each word in the phrase will be highlighted, whether or not that occurrence is part of the phrase.  Is that correct?  Wouldn't it also be difficult to pick the best fragment the way the current highlighter does?


Re: Indexing and Hit Highlighting OCR Data

Richard Krenek
In reply to this post by Corey Keith
Corey,
  I have one off-the-wall approach that may or may not work for you.
You could convert your scanned images to PDF, then use something like
Acrobat to convert those PDFs into PDFs with hidden text (the OCR
data). You can then tell Acrobat Reader via XML what to highlight when
your user opens the PDF.
  Not sure if that helps you, but it may give you some alternative ideas.

Richard



Re: Indexing and Hit Highlighting OCR Data

Erik Hatcher
In reply to this post by Corey Keith

On Jun 3, 2005, at 8:50 AM, Corey Keith wrote:

> With this approach all work is done at the word level.  When we  
> have a phrase query the results will contain pages with the entire  
> phrase but when we go to highlight the document _all_ words in the  
> phrase regardless of being in the phrase will be highlighted.  Is  
> that correct?  It would also be difficult to get the best fragment  
> in a similar way to the current highlighter?

The current Highlighter also works term by term, even when the query is
a PhraseQuery - so you're not losing any capability by not using
Highlighter in this case.
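
If phrase-only highlighting were wanted, one possible approach (not something Highlighter does, and not proposed elsewhere in this thread) is to store the page's words in reading order together with their boxes, and to highlight only runs that match the phrase exactly. The sketch below is in Python for brevity. The name phrase_boxes and the small integer boxes are made up for illustration, and the sketch assumes the page terms and the phrase terms have already been analyzed the same way. It does not handle PhraseQuery slop.

```python
def phrase_boxes(page_tokens, phrase_terms):
    """Return boxes of tokens that belong to an exact occurrence of the phrase.

    page_tokens: list of (term, box) pairs in reading order.
    phrase_terms: list of terms making up the phrase.
    """
    terms = [term for term, _ in page_tokens]
    n = len(phrase_terms)
    if n == 0:
        return []
    hit_indexes = set()
    for i in range(len(terms) - n + 1):
        if terms[i:i + n] == phrase_terms:
            # Record token indexes so overlapping matches are not duplicated.
            hit_indexes.update(range(i, i + n))
    return [page_tokens[i][1] for i in sorted(hit_indexes)]


# Hypothetical page: the first "eggs" is a stray occurrence outside the phrase.
# Boxes are made-up (x, y, width, height) values, not real OCR coordinates.
PAGE = [
    ("eggs", (10, 10, 40, 12)),
    ("the", (60, 10, 30, 12)),
    ("female", (100, 10, 60, 12)),
    ("lays", (170, 10, 40, 12)),
    ("about", (220, 10, 50, 12)),
    ("140", (280, 10, 30, 12)),
    ("eggs", (320, 10, 40, 12)),
]

print(phrase_boxes(PAGE, ["140", "eggs"]))
```

A plain word-to-boxes lookup would return both "eggs" boxes for this page. Keeping the order allows the stray one to be left out, at the cost of storing one entry per token instead of one per distinct word.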

     Erik



Re: Indexing and Hit Highlighting OCR Data

steve_rowe
In reply to this post by Corey Keith
There is a proposal to extend indexing (item #11 in the API Changes
section):

http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard

An excerpt:

    11. (Hard) Make indexing more flexible, so that one could
    e.g., not store positions or even frequencies, or alternately,
    to store extra information with each position, or to even use
    different posting compression algorithms.

I'm pretty sure that an implementation of this proposal would allow you
to store the positioning information with each position/token.

Doug Cutting posted recently about this (at the bottom of the message):

http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200505.mbox/%3c4291FDB7.30500@...%3e

Steve
