I have found a kind of strange behavior in StandardAnalyzer

Eugenio Martinez
I am indexing a huge set of log files with Lucene, about 130 GB of plain text on disk so far. The plan is to build a system capable of searching terabytes of such data through a kind of meta-index built from a mesh of smaller indexes, all of them created and maintained with Lucene.

File sizes vary randomly, from 1 KB to several hundred MB of plain text, and I have run tests with files of about 2 GB, obtaining very good indexing time and search performance. Of course, once we got search results from the system, we became confident that Lucene was doing its job correctly, i.e., splitting all the content and indexing all the tokens.

But last week, with our first beta release in our LAN environment, some problems arose. In certain situations we found that the analysis stage "fails", or rather, behaves anomalously. We have isolated one case that can be reproduced in the Search window of Luke: a URL domain that ends with a period, as in "www.my.domain.es.", becomes a single token with the text "wwwmydomaines".

This behavior may extend to email addresses, since we cannot get search results for some addresses that are definitely present in the log files, and the same happens with some plain words.

Such behavior is not acceptable to anybody, since in natural language such URLs can easily appear at the end of a sentence. Is this an effect of document vectorization? I ask because the structure of log content does not follow natural-language rules...

Does anyone have any information about this?

We are working on a log analyzer now, but I'm sure I'm not the only one in the world with this issue... Do you know of anyone else?

Thanks for your attention.
 
Eugenio F. Martínez Pacheco

Fundación Instituto Tecnológico de Galicia - Área TIC

Tel: 981 173 206            FAX: 981 173 223

Videoconference: 981 173 596


Re: I have found a kind of strange behavior in StandardAnalyzer

Shai Erera
Hi

I tried this code:

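        // "analyzer" is assumed to be a StandardAnalyzer instance (Lucene 2.x API).
        TokenStream ts = analyzer.tokenStream("content", new StringReader("www.abc.com"));
        Token t;
        // Print each token with its offsets and type.
        while ((t = ts.next()) != null) {
            System.out.println(t);
        }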
If I pass "www.abc.com" (without an extra '.'), it prints
(www.abc.com,0,11,type=<HOST>)
---> it recognizes the type HOST.
If I pass "www.abc.com." (with an extra '.'), it prints
(wwwabccom,0,12,type=<ACRONYM>) ---> it recognizes the type ACRONYM.

Personally, I think it is a bug, as ACRONYMs are usually of the form A.B.C.
and not ABC.DEF. Maybe you can try the java-dev mailing list and ask there
whether you should open an issue for it.
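
To make the A.B.C. versus ABC.DEF. distinction concrete, here is a small sketch in plain Java. The three regular expressions are my own simplified assumptions for illustration; they are not copied from the StandardTokenizer grammar, and the class and method names are hypothetical. The sketch only shows how a permissive acronym pattern could swallow a host name with a trailing dot, while a single-letter pattern would not.

```java
import java.util.regex.Pattern;

public class AcronymPatternSketch {
    // Assumed, simplified patterns for illustration only; they are not
    // taken from Lucene's StandardTokenizer grammar.

    // Permissive acronym: runs of letters, each followed by a dot ("ABC.DEF.").
    static final Pattern LOOSE_ACRONYM =
            Pattern.compile("[A-Za-z]+\\.(?:[A-Za-z]+\\.)+");
    // Strict acronym: single letters, each followed by a dot ("A.B.C.").
    static final Pattern STRICT_ACRONYM =
            Pattern.compile("[A-Za-z]\\.(?:[A-Za-z]\\.)+");
    // Host: alphanumeric labels separated by dots, with no trailing dot.
    static final Pattern HOST =
            Pattern.compile("[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+");

    // Classifies the whole input string, trying the given acronym pattern first.
    static String classify(String text, Pattern acronym) {
        if (acronym.matcher(text).matches()) {
            // Mimics the observed output, where the dots are dropped.
            return "ACRONYM:" + text.replace(".", "");
        }
        if (HOST.matcher(text).matches()) {
            return "HOST:" + text;
        }
        return "OTHER:" + text;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"www.abc.com", "www.abc.com.", "A.B.C."}) {
            System.out.println(s
                    + " loose=" + classify(s, LOOSE_ACRONYM)
                    + " strict=" + classify(s, STRICT_ACRONYM));
        }
    }
}
```

With the permissive pattern, "www.abc.com." is classified as an acronym and loses its dots, which matches what I see above. With the strict pattern it is no longer an acronym. Note that this sketch matches whole strings only, so the strict case falls through to OTHER; a real tokenizer would presumably match the host part and leave the final dot out. Until the grammar question is settled, stripping a trailing period from host-like strings before analysis might serve as a workaround, though I have not tested that against a real index.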

On Nov 26, 2007 5:47 PM, Eugenio Martinez <[hidden email]> wrote:

> But last week, with our first beta release in our LAN environment, some
> problems arose. In certain situations we found that the analysis stage
> "fails", or rather, behaves anomalously. We have isolated one case that
> can be reproduced in the Search window of Luke: a URL domain that ends
> with a period, as in "www.my.domain.es.", becomes a single token with
> the text "wwwmydomaines".



--
Regards,

Shai Erera