Developper Question - Highlighting

Fisheye
Hello everybody,

I have integrated Lucene into a piece of knowledge software; Lucene's main job will be to index and search the files supplied by that software.

My problem now is that I need Lucene to return text "snippets" showing where the search query was found, with the matching terms highlighted.

To do this I tried to access the "contents" field, which holds the whole text and is the field I let Lucene search.

I saw some posts in this forum suggesting that accessing this field is a problem because Lucene does not store the text contents in the index. The alternative they propose is to extract the text from the original file again and run it through the Highlighter class.

I know that is one way to do it, but with a lot of binary files such as Word, Excel and PowerPoint documents it will become very slow and consume too many resources.

So, does anyone know a better way to solve this problem? Is there no way to make the "contents" field store the text from the file?

Thanks for any answers and help.

Simon Dietschi
Re: Developper Question - Highlighting

mark harwood
Please post "how do I?" questions to the Java-user
group.
The dev list is for people maintaining the core Lucene
code.

>>because lucene does not store the text contents
>>in index

It does if you want it to. See the Field.Store.YES
option when adding new docs.
The Highlighter class in the contrib section contains
a JUnit test which offers a completely self-contained
example of how indexing/searching/highlighting can
work.

Hopefully this should give you enough to go on. If you
have any further questions please post them to
java-user group.

Cheers
Mark
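
To make concrete what highlighting needs from the index: once the field text is stored, a snippet can be cut out around a match and the match wrapped in markup. The following is a minimal plain-JDK sketch of that idea only. It is not the contrib Highlighter and does not use any Lucene API; the class name SnippetSketch, the method bestFragment and the <B> tags are all assumptions made up for illustration.

```java
/** Plain-JDK stand-in for snippet highlighting; this is NOT Lucene's Highlighter. */
class SnippetSketch {

    /**
     * Returns a fragment of storedText around the first case-insensitive
     * occurrence of term, with the match wrapped in <B> tags, or null if
     * the term does not occur. 'context' is the number of characters kept
     * on each side of the match.
     */
    static String bestFragment(String storedText, String term, int context) {
        if (term.isEmpty()) {
            return null;
        }
        int len = term.length();
        for (int hit = 0; hit + len <= storedText.length(); hit++) {
            // regionMatches(true, ...) compares while ignoring case
            if (storedText.regionMatches(true, hit, term, 0, len)) {
                int end = hit + len;
                int from = Math.max(0, hit - context);
                int to = Math.min(storedText.length(), end + context);
                return storedText.substring(from, hit)
                        + "<B>" + storedText.substring(hit, end) + "</B>"
                        + storedText.substring(end, to);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical stored field value, standing in for the "contents" field.
        String stored = "Lucene can store the field text in the index.";
        System.out.println(bestFragment(stored, "store", 11));
    }
}
```

The real Highlighter works on analyzed tokens and scores fragments against the query, so treat this only as a picture of why the original text has to be available at search time.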


               

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Developper Question - Highlighting

Fisheye
Ok, thx for your answer.

Yes, I have seen the "Field.Store.YES" option, but it does not work if you create the "contents" field by passing the reader to it directly:

        // open a file input stream for the file f
        FileInputStream is = new FileInputStream(f);
        java.io.Reader reader = new BufferedReader(new InputStreamReader(is));
        // create a new field for the contents; this constructor takes no
        // Field.Store argument, so the text is tokenized and indexed but not stored
        Field textContents = new Field("contents", reader);
RE: Developper Question - Highlighting

Aditya Liviandi-2
In reply to this post by Fisheye
When you add the field "content" to the document,
you can choose to have it

indexed and stored
indexed and not stored
stored and not indexed

so in this case, you would want to add it as the indexed and stored
kind.
However, bear in mind that the Directory containing the Documents will
then grow considerably, because it holds a copy of the text of each
stored (and parsed) file as well as the index.

Storing is less of a burden if your files contain substantially less
searchable text than their total size; for example, if you are indexing
JPEG files and "content" is just the EXIF data of each JPEG...

Anyway, this belongs to [hidden email] instead of the dev
list...


--------------------------------------------------- I²R Disclaimer ------------------------------
This email is confidential and may be privileged.  If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
-------------------------------------------------------------------------------------------------


RE: Developper Question - Highlighting

Fisheye
OK, but as I already said (see the Lucene Javadoc for org.apache.lucene.document.Field), it does not seem to be possible to do that if I pass a stream such as a Reader directly:

Syntax: Field(String name, Reader reader)

Do you perhaps have any code samples?

Thx

Simon Dietschi

=> I tried to move these posts to "Users", but I got errors...
RE: Developper Question - Highlighting

Aditya Liviandi-2
Lucene offers a method in Field that indexes and stores, but it works
for String, not Reader. You might want to use that.

OR

You might just want to add an extra (stored but not indexed) Field using
new Field(String name, byte[] contents, Field.Store store)

I think you can use the same Field name, so your searching code will
still be the same.
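
Both options above need the text in memory rather than as a stream. A minimal sketch of that step, using only the JDK: drain the Reader into a String (for the String-based field), or encode that String as bytes (for the extra stored binary field). The class and method names here are hypothetical, and the Lucene call shown in the comment is an assumption based on the 1.9/2.0-era Field constructors; it is not compiled as part of this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StoredTextSketch {

    /** Drains a Reader into a String so the text can be passed to a String-based field. */
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    /** Encodes the text as UTF-8 bytes for an extra stored (not indexed) binary field. */
    static byte[] toStoredBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes read back from the stored field, e.g. before highlighting. */
    static String fromStoredBytes(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = readAll(new StringReader("text extracted from a file"));
        // Assumption (Lucene 1.9/2.0-era API, not compiled here):
        //   doc.add(new Field("contents", text, Field.Store.YES, Field.Index.TOKENIZED));
        // or, for the second option:
        //   doc.add(new Field("contents", toStoredBytes(text), Field.Store.YES));
        System.out.println(fromStoredBytes(toStoredBytes(text)));
    }
}
```

Reading the whole stream into memory costs heap proportional to the document size, which is the trade-off against the streaming Reader constructor.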



RE: Developper Question - Highlighting

Aditya Liviandi-2
In reply to this post by Fisheye
Oh, and before this gets out of hand,

For things regarding lucene usage please use the java-user list.

To subscribe, e-mail: [hidden email]
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

For things regarding development of lucene (stuff like lucene
implementation details etc.), then you're on the right mailing list.

To subscribe, e-mail: [hidden email]
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Developper Question - Highlighting

Randy Puttick
In reply to this post by Fisheye
Actually, it's a very small tweak to the Field class to permit Reader values other than through the Text helper function. DocumentWriter (which actually extracts and tokenizes the field data) doesn't care about the artificial restriction in Field.

Randy

-----Original Message-----
From: Aditya Liviandi [mailto:[hidden email]]
Sent: Thursday, March 30, 2006 1:32 AM
To: [hidden email]
Subject: RE: Developper Question - Highlighting

Lucene offers a method in Field that indexes and stores, but it works
for String, not Reader. You might want to use that.

OR

You might just want to add an extra (stored but not indexed) Field using
new Field(String name, byte[] contents, Field.Store store)

I think you can use the same Field name, so your searching code will
still be the same.


