Hyphenated word


Markus Wiederkehr
Hello,

I work on an application that has to index OCR texts of scanned books.
Naturally, many words are hyphenated across line breaks.

I wonder if there is already an Analyzer or maybe a TokenFilter that
can merge those syllables back into whole words? It looks like Erik
Hatcher uses something like that at http://www.lucenebook.com/.

Thanks in advance,

Markus

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Hyphenated word

Erik Hatcher

On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote:
> I work on an application that has to index OCR texts of scanned books.
> Naturally, many words are hyphenated across line breaks.
>
> I wonder if there is already an Analyzer or maybe a TokenFilter that
> can merge those syllables back into whole words? It looks like Erik
> Hatcher uses something like that at http://www.lucenebook.com/.

Markus - you're right, I did develop something to handle hyphenated  
words for lucenebook.com.  It was sort of a hack in that I had to  
build in a static list of exceptions in how I handled this, so you'll  
likely have to use caution as well.  The LiaAnalyzer is this:

   public TokenStream tokenStream(String fieldName, Reader reader) {
     TokenFilter filter = new DashSplitterFilter(
               new HyphenatedFilter(
                 new DashDashFilter(
                   new LiaTokenizer(reader))));

     filter = new LengthFilter(3, filter);
     filter = new StopFilter(filter, stopSet);

     if (stem) {
       filter = new SnowballFilter(filter, "English");
     }

     return filter;
   }


And my HyphenatedFilter is this:

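import java.io.IOException;
import java.util.HashMap;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class HyphenatedFilter extends TokenFilter {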
   private HashMap exceptions = new HashMap();

   private static final String[] EXCEPTION_LIST = {
      "full-text", "information-retrieval", "license-code",
      "old-fashioned", "well-designed", "free-form", "file-based",
      "ramdirectory-based", "ram-based",
      "index-modifying", "read-only",
      "top-scoring", "most-recently-used", "queryparser-parsed",
      "in-order", "per-document", "lower-caser", "domain-specific",
      "high-level",
      "utf-encoding", "non-english", "phraseprefix-it", "all-inclusive",
      "date-range", "computation-intensive", "hits-returning",
      "lower-level",
      "number-padding", "utf-address-book", "third-party",
      "plain-text", "google-like",
      "re-add", "english-specific", "file-handling",
      "already-created", "d-add", "d-add",
      "hits-length", "hits-doc", "hits-score", "d-get", "writer-new",
      "porteranalyzer-new",
      "writer-set", "document-new", "doc-add", "field-keyword",
      "field-unstored", "writer-add",
      "writer-optimize", "queryparser-new", "porteranalyzer-new",
      "parser-parse", "indexsearcher-new",
      "hitcollector-new", "searcher-doc", "searcher-search",
      "jakarta-lucene", "www-ibm", "java-specific",
      "non-java", "vis--vis", "medium-sized", "browser-based",
      "utf-before", "concept-based",
      "natural-language", "queue-based", "high-likelihood", "slp-or",
      "noisy-channel", "al-rasheed",
      "hands-free", "top-notch", "google-esque", "search-config",
      "java-related",
      "lucene-so", "lucene-tar", "lucene-jar", "lucene-demos-jar",
      "lucene-web", "lucene-webindex",
      "command-line", "lucene-version", "issue-tracking"
   };

   protected HyphenatedFilter(TokenStream tokenStream) {
     super(tokenStream);

     for (int i = 0; i < EXCEPTION_LIST.length; i++) {
       exceptions.put(EXCEPTION_LIST[i], "");
     }
   }

   private Token savedToken;

   public Token next() throws IOException {

     if (savedToken != null) {
       Token token = savedToken;
       savedToken = null;
       return token;
     }

     Token firstToken = input.next();

     if (firstToken == null)
       return firstToken;


     if (firstToken.termText().endsWith("-")) {
       String firstPart;
       firstPart = firstToken.termText();

       // consume next token
       Token secondToken = input.next();
       if (secondToken == null)
         return firstToken;

       String termText = firstPart.substring(0, firstPart.length() - 1)
           + secondToken.termText();

       if (exceptions.containsKey(firstPart + secondToken.termText())) {
         savedToken = secondToken;
         return firstToken;
       }

       return new Token(termText, firstToken.startOffset(),
           firstToken.endOffset() + secondToken.termText().length() + 1);
     }

     return firstToken;
   }
}

Not all that pretty, I'm afraid, but by all means use it if it's useful.

     Erik


Re: Hyphenated word

Markus Wiederkehr
I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!

Markus

--
Always remember you're unique. Just like everyone else.


Re: Hyphenated word

Andy Roberts-3
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!
>

I expect you'll need to do some pre-processing. Read your text into a
buffer, line-by-line. If a given line ends with a hyphen, you can manipulate
the buffer to merge the hyphenated tokens.
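
A minimal sketch of that pre-processing step, assuming the OCR text is already available as a list of lines. The class and method names are made up for illustration and are not part of Lucene. It joins blindly, so it makes no attempt to tell a line-break hyphen from a real compound or a dash.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene class.
final class NaiveLineJoiner {

    /**
     * Joins every line that ends with '-' to the line that follows it,
     * dropping the hyphen. Lines are trimmed first so that indentation on
     * the continuation line does not end up inside the merged word.
     */
    static List<String> joinHyphenatedLines(List<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean open = false; // true while the previous line ended with '-'

        for (String raw : lines) {
            String line = raw.trim();
            if (line.endsWith("-")) {
                // Keep everything except the trailing hyphen and wait for more.
                pending.append(line, 0, line.length() - 1);
                open = true;
            } else {
                pending.append(line);
                out.add(pending.toString());
                pending.setLength(0);
                open = false;
            }
        }
        if (open) {
            // The last line ended with a hyphen and nothing followed it:
            // put the hyphen back rather than silently losing it.
            out.add(pending.append('-').toString());
        }
        return out;
    }
}
```

Given the three lines "we are work-", "ing on a read-" and "only index", this returns the single line "we are working on a readonly index": "working" comes out right, but "read-only" is damaged.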

Andy


Re: Hyphenated word

Erik Hatcher

On Jun 13, 2005, at 10:55 AM, Andy Roberts wrote:

> On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
>
>> I see, the list of exceptions makes this a lot more complicated than I
>> thought... Thanks a lot, Erik!
>
> I expect you'll need to do some pre-processing. Read in your text into a
> buffer, line-by-line. If a given line ends with a hyphen, you can manipulate
> the buffer to merge the hyphenated tokens.

The problem I encountered when indexing "Lucene in Action" was that I  
couldn't just blindly concatenate two tokens because the first ends  
with a hyphen.  Some lines ended with a hyphen because it was a dash,  
not a hyphenated word.

I'm sure other more clever implementations could do this better, by  
looking up the concatenated word in a dictionary for instance.

     Erik


Re: Hyphenated word

Markus Wiederkehr
In reply to this post by Andy Roberts-3
On 6/13/05, Andy Roberts <[hidden email]> wrote:
> On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > I see, the list of exceptions makes this a lot more complicated than I
> > thought... Thanks a lot, Erik!
> >
>
> I expect you'll need to do some pre-processing. Read in your text into a
> buffer, line-by-line. If a given line ends with a hyphen, you can manipulate
> the buffer to merge the hyphenated tokens.

As Erik wrote, it is not that simple, unfortunately. For example, if
one line ends with "read-" and the next line begins with "only", the
correct word is "read-only", not "readonly", whereas "work-" and "ing"
should of course be merged into "working".

Markus

Re: Hyphenated word

Peter A. Friend
In reply to this post by Markus Wiederkehr

On Jun 13, 2005, at 6:18 AM, Markus Wiederkehr wrote:

> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!

There is a section about the problems that hyphens create in  
"Foundations of Statistical Natural Language Processing". Not only  
are the cases numerous, but seemingly simple rules such as joining  
hyphenated forms at the ends of lines do not always work. Sometimes  
the hyphen was added to break the word, sometimes you are already  
dealing with a hyphenated form that just happened to occur at the end  
of a line, so the hyphen serves two purposes. I've toyed with the  
idea of indexing hyphenated words in their raw as well as split  
forms, but I think that would wreak havoc on the word position stuff,  
as well as bloat the index with potentially meaningless gibberish.
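
One way to picture the "index both forms" idea without disturbing word positions is to stack the alternative form on the same position as the first. Lucene tokens carry a position increment, and an increment of 0 places a token at the same position as the previous one. The sketch below is plain Java with a made-up Tok class standing in for a real token; it only shows the shape of the output and does not address the index-bloat concern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; Tok is a stand-in, not Lucene's Token class.
final class StackedForms {

    /** A term together with its position increment relative to the previous term. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /**
     * For a line-end pair such as "read-" / "only", emits the hyphenated
     * form at a new position (increment 1) and the joined form stacked on
     * that same position (increment 0).
     */
    static List<Tok> bothForms(String firstPart, String secondPart) {
        String stem = firstPart.endsWith("-")
                ? firstPart.substring(0, firstPart.length() - 1)
                : firstPart;
        List<Tok> out = new ArrayList<>();
        out.add(new Tok(stem + "-" + secondPart, 1)); // e.g. "read-only"
        out.add(new Tok(stem + secondPart, 0));       // e.g. "readonly"
        return out;
    }
}
```

With stacking like this, the words that follow keep the positions they would have had with a single token, though the extra terms still take up space in the index.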

Peter


Re: Hyphenated word

Andy Roberts-3
In reply to this post by Markus Wiederkehr
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote:

> On 6/13/05, Andy Roberts <[hidden email]> wrote:
> > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > > I see, the list of exceptions makes this a lot more complicated than I
> > > thought... Thanks a lot, Erik!
> >
> > I expect you'll need to do some pre-processing. Read in your text into a
> > buffer, line-by-line. If a given line ends with a hyphen, you can
> > manipulate the buffer to merge the hyphenated tokens.
>
> As Erik wrote it is not that simple, unfortunately. For example, if
> one line ends with "read-" and the next line begins with "only" the
> correct word is "read-only" not "readonly". Whereas "work-" and "ing"
> should of course be merged into "working".
>
> Markus

Perhaps you could do some crude checking against a dictionary. Combine the
two parts anyway and check whether the result is in the dictionary. If so,
keep it merged; otherwise it's a compound, so revert to the hyphenated form.

Word lists come as part of all good OSS dictionary projects, as well as other
language resources, like the BNC word lists etc.
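
A rough sketch of that dictionary check, assuming the word list has already been loaded into a Set of lower-case words. The class name and the loading step are assumptions; nothing here is Lucene API.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not a Lucene class.
final class DictionaryHyphenResolver {

    private final Set<String> dictionary; // assumed to hold lower-case words

    DictionaryHyphenResolver(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    /**
     * Decides what to do with a word split across a line break.
     *
     * @param firstPart  end of the first line, including the hyphen ("work-")
     * @param secondPart start of the next line ("ing")
     * @return the merged word if the dictionary knows it, otherwise the
     *         hyphenated form with the hyphen kept
     */
    String resolve(String firstPart, String secondPart) {
        if (!firstPart.endsWith("-")) {
            throw new IllegalArgumentException("firstPart must end with '-'");
        }
        String merged = firstPart.substring(0, firstPart.length() - 1) + secondPart;
        if (dictionary.contains(merged.toLowerCase(Locale.ROOT))) {
            return merged;                // "work-" + "ing" becomes "working"
        }
        return firstPart + secondPart;    // "read-" + "only" stays "read-only"
    }
}
```

The result is only as good as the word list: if the list happens to contain "readonly", the compound gets merged anyway, and a dash at the end of a line is still treated as a hyphen.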

Andy
