Sentence detection/extraction as Tokenizer?

Sentence detection/extraction as Tokenizer?

Otis Gospodnetic-2
Hello,

The contrib/wordnet package contains an AnalyzerUtil class with a method that extracts sentences from text/String.  It is super-simplistic, so probably not very accurate, but I am wondering if *conceptually* it would make sense to have a Tokenizer that extracts sentences?  I suppose that means each Token would be a complete sentence.

Would you say it makes sense to implement sentence detection/extraction as a Tokenizer?
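For illustration, here is a minimal sketch of what such a sentence-per-Token Tokenizer might look like against the attribute-based TokenStream API, using java.text.BreakIterator for the boundary detection. The class name and the use of BreakIterator are assumptions for the sketch, not the contrib/wordnet AnalyzerUtil code:

import java.io.IOException;
import java.text.BreakIterator;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/** Sketch: emits one token per sentence, found with java.text.BreakIterator. */
public final class SentenceTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private BreakIterator sentences;
  private String text;
  private int start;

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    if (sentences == null) {
      // Buffer the whole input once; fine for short fields, not for huge documents.
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[1024];
      int n;
      while ((n = input.read(buf)) != -1) {
        sb.append(buf, 0, n);
      }
      text = sb.toString();
      sentences = BreakIterator.getSentenceInstance();
      sentences.setText(text);
      start = sentences.first();
    }
    int end;
    while ((end = sentences.next()) != BreakIterator.DONE) {
      int sentStart = start;
      start = end;
      String sentence = text.substring(sentStart, end).trim();
      if (sentence.isEmpty()) {
        continue; // skip whitespace-only spans
      }
      termAtt.setEmpty().append(sentence);
      // Offsets cover the raw (untrimmed) span in the original text.
      offsetAtt.setOffset(correctOffset(sentStart), correctOffset(end));
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    sentences = null;
    text = null;
    start = 0;
  }
}

The buffering of the whole input is the main simplification here; a production version would have to stream.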

Thanks,
Otis



Re: Sentence detection/extraction as Tokenizer?

Shai Erera
Hi Otis

I've implemented sentence detection as part of my tokenizer. It does not extract sentences, but rather "detects" EOS (based on several characters from the Unicode spec). Upon detection, it returns a Token of EOS type. I then have an EOS Filter which can be configured with the appropriate behavior for that token: for example, set posIncr to 100 on the next token so that phrase/fuzzy searches don't find matches across sentences. There are other uses as well, such as highlighting.
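A minimal sketch of an EOS filter along these lines, assuming the tokenizer marks sentence boundaries with a token of type "EOS" (the class name, the type string, and the gap of 100 are assumptions taken from the description above, not Shai's actual code):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/** Sketch: drops EOS marker tokens and adds a position gap to the next token. */
public final class EOSPositionFilter extends TokenFilter {
  private static final String EOS_TYPE = "EOS"; // assumed token type from the tokenizer
  private static final int SENTENCE_GAP = 100;  // gap suggested in the thread

  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);

  private int pendingGap = 0;

  public EOSPositionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (EOS_TYPE.equals(typeAtt.type())) {
        // Swallow the boundary token itself; remember to add the gap later.
        pendingGap = SENTENCE_GAP;
        continue;
      }
      if (pendingGap > 0) {
        // Bump the next real token's position so phrase/sloppy queries
        // don't match across the sentence boundary.
        posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + pendingGap);
        pendingGap = 0;
      }
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingGap = 0;
  }
}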

So I would, personally, not think of EOS detection as a Tokenizer in and of itself, but rather as a capability of a Tokenizer (Standard?).

Shai
