Analyzer

Analyzer

Allahbaksh Mohammedali Asadullah
Hi all,
I am indexing a set of file types (html, js, jsp, xml, etc.). All the file types have a common field called text, which contains all the file data. Can I use a different analyzer depending on the file type?

Note: I am indexing all file types with the same indexer.

Regards,
Allahbaksh

Allahbaksh Mohammedali Asadullah

http://allahbaksh.blogspot.com/

Re: Analyzer

Ian Lea
Yes, you can.  But it is generally best to use the same analyzer for
indexing and for searching so, assuming that you want searches to find
matches in files of whatever type, I'd recommend pre-processing the
files to a common text format before indexing and then using the same
analyzer for all of them.
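
A minimal sketch of that pre-processing step, assuming the file type can be inferred from the file extension. The class name `PlainTextExtractor` and the regex-based tag stripping are illustrative assumptions only (a real HTML/XML parser would be more robust); nothing here is Lucene API. The resulting plain string is what would go into the common text field, so a single analyzer can be used for every document.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical pre-processing step: reduce each file type to plain text before indexing. */
final class PlainTextExtractor {

    // Crude markup stripper for illustration; a real parser handles far more cases.
    private static String stripTags(String s) {
        return s.replaceAll("(?s)<%.*?%>", " ")   // JSP scriptlets and directives
                .replaceAll("<[^>]*>", " ")       // HTML/XML tags
                .replaceAll("\\s+", " ")          // collapse runs of whitespace
                .trim();
    }

    // One extractor per file extension; unknown extensions pass through unchanged.
    private static final Map<String, UnaryOperator<String>> EXTRACTORS = Map.of(
            "html", PlainTextExtractor::stripTags,
            "xml", PlainTextExtractor::stripTags,
            "jsp", PlainTextExtractor::stripTags,
            "js", String::trim);

    /** Picks an extractor from the file extension and returns plain text. */
    static String extract(String fileName, String rawContent) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTRACTORS.getOrDefault(ext, UnaryOperator.identity()).apply(rawContent);
    }

    public static void main(String[] args) {
        System.out.println(extract("page.html", "<p>Hello <b>world</b></p>"));
        System.out.println(extract("view.jsp", "<% int x = 1; %><h1>Title</h1>"));
    }
}
```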


--
Ian.


Re: Analyzer

Erick Erickson
In reply to this post by Allahbaksh Mohammedali Asadullah
I'm assuming that you want a different analyzer to handle extracting
the relevant information to put into a "text" field of the Lucene document.
I know of no way you can attach different analyzers to a single field.
You can certainly attach different analyzers to *different* fields...

The first thing I'd try would be to write your own analyzer that keeps
track of what kind of file it's currently analyzing and knows how to
"do the right thing" to extract the next token for the text field.

A cruder way would be to detect the type of document yourself,
extract the text into a string (or some such) then feed that into
your document.
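
The first option above (an analyzer that keeps track of the file type it is currently handling) could be modelled roughly as follows. This is a plain-Java sketch of the idea, not Lucene's Analyzer API: the class `TypeAwareTokenizer`, its `setCurrentType` method, and the tokenization rules are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of a type-aware analyzer; illustrative only, not a Lucene class. */
final class TypeAwareTokenizer {
    private String currentType = "text";

    /** The indexing loop would call this before handing over each file's content. */
    void setCurrentType(String type) {
        this.currentType = type.toLowerCase(Locale.ROOT);
    }

    /** Tokenizes content according to the type set most recently. */
    List<String> tokenize(String content) {
        String text = content;
        boolean markup = currentType.equals("html")
                || currentType.equals("xml")
                || currentType.equals("jsp");
        if (markup) {
            text = text.replaceAll("<[^>]*>", " ");  // drop tags for markup types only
        }
        List<String> tokens = new ArrayList<>();
        // Lowercase, then split on anything that is not a letter or digit.
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        TypeAwareTokenizer tokenizer = new TypeAwareTokenizer();
        tokenizer.setCurrentType("html");
        System.out.println(tokenizer.tokenize("<p>Some text here</p>"));
        tokenizer.setCurrentType("js");
        System.out.println(tokenizer.tokenize("var x = 1;"));
    }
}
```

Because the current type is mutable state, a design like this would need one instance per indexing thread, or some other way of passing the type along with the content.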

Best
Erick


Re: Analyzer

Erick Erickson
In reply to this post by Ian Lea
Hmmmm, how would you do this without opening and closing your
IndexWriter around different types of documents? And as far
as querying is concerned, I doubt the input would be a file, so
one of the canned analyzers should do. Although "care should
be taken <G>"....

Best
Erick

RE: Analyzer

Allahbaksh Mohammedali Asadullah
In reply to this post by Erick Erickson
Hi Erick,
Thanks a lot for your reply. You are right: I want a different analyzer on the same field, depending on the other fields in the document.

For example:
Doc1
text: "Some text here"
type: html

Doc2
text: "Some jsp text here"
type: jsp

Now, depending on the type, I want to use a different analyzer for extracting and searching. Is there any way to do this?

I am using the approach you suggested.

Warm Regards,
Allahbaksh

Re: Analyzer

Erick Erickson
Why not index into different fields based upon the file type? That would be
MUCH easier and PerFieldAnalyzerWrapper would be your friend. For
example.....

Doc1
text_html: "some text here"
type: html

Doc2
text_jsp: "some jsp text here"
type: jsp

etc...

Now your PerFieldAnalyzerWrapper has your html extractor for the text_html
field and your jsp extractor for the text_jsp field.
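
To make the dispatch idea concrete, here is a plain-Java model of what a per-field wrapper does: look up an analyzer by field name and fall back to a default. It is only a sketch; `PerFieldDispatch`, `fieldFor`, and the lambda "analyzers" are hypothetical stand-ins, not Lucene classes. In Lucene itself, PerFieldAnalyzerWrapper is built around a default analyzer with per-field analyzers registered on it, and the exact constructor and method signatures vary between Lucene versions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java model of per-field analyzer dispatch. Illustrative only; not a Lucene class. */
final class PerFieldDispatch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldDispatch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers the analyzer to use for one field name. */
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
    }

    /** Uses the field's own analyzer if one is registered, else the default. */
    List<String> analyze(String field, String text) {
        return byField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Field naming convention from the post: text_html, text_jsp, ... */
    static String fieldFor(String type) {
        return "text_" + type.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Default: split on whitespace.
        PerFieldDispatch dispatch =
                new PerFieldDispatch(s -> Arrays.asList(s.trim().split("\\s+")));
        // Hypothetical HTML analyzer: drop tags, then split on whitespace.
        dispatch.addAnalyzer(fieldFor("html"),
                s -> Arrays.asList(s.replaceAll("<[^>]*>", " ").trim().split("\\s+")));
        System.out.println(dispatch.analyze("text_html", "<b>some text here</b>"));
        System.out.println(dispatch.analyze("type", "html"));
    }
}
```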

Searching would require you to search across all your different types, but
that's do-able. Or you'd have to restrict your searching to one of the
text_* fields if you really only wanted to search that type.
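
Searching across all the types could be done by expanding the user's terms into one clause per text_* field. The sketch below only builds a query string in Lucene's classic query syntax; the class `MultiTypeQuery` is a hypothetical helper, and it assumes the terms contain no query-syntax special characters (they would need escaping otherwise). Lucene also ships a MultiFieldQueryParser that may serve the same purpose.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that fans a search out over the per-type text fields. */
final class MultiTypeQuery {

    /** Builds one field:(terms) clause per type and joins the clauses with OR. */
    static String acrossTypes(List<String> types, String terms) {
        return types.stream()
                .map(type -> "text_" + type + ":(" + terms + ")")
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        // Search every type:
        System.out.println(acrossTypes(List.of("html", "jsp", "xml"), "login form"));
        // Restrict the search to a single type:
        System.out.println(acrossTypes(List.of("jsp"), "login form"));
    }
}
```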

But I still think you're not clear on the difference when searching. Your
search query analyzer will probably have no relation to your indexing
analyzer since the type of input is irrelevant to your query (I guess
anyway). There you're probably using one of the analyzers provided out of
the box.

Best
Erick

