highlighter / fragmenter performance for large fields

Beard, Brian
We index documents that have an "all" field containing all of the
data that can be searched on.

One of the problems we're having is that when this field is, say,
10 MB, the highlighter takes about a second to calculate the best
fragments, while the search itself takes only 30 milliseconds. I've
accounted for the load time of the text, which is generally 5-10x
faster: 0.1-0.2 seconds goes to loading the text from the document,
and the other 0.8-0.9 seconds to performing the highlighting.

I've overridden maxDocBytesToAnalyze so it will analyze the entire
field of the document. At least for the moment, we need to try to
match against the entire document.

I've also tried using a SimpleAnalyzer when the highlighting is
performed, but this doesn't seem to affect performance much.
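
For reference, this is roughly the shape of that setup, sketched
against the Lucene contrib highlighter API (fragment size and count
are arbitrary; "all" is our catch-all field):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;

public class WholeFieldHighlight {
    public static String[] bestFragments(Query query, String text) throws Exception {
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        // Raise the analysis limit so the entire field is analyzed
        // (the default cuts off after ~50 KB).
        highlighter.setMaxDocBytesToAnalyze(Integer.MAX_VALUE);
        highlighter.setTextFragmenter(new SimpleFragmenter(100));
        // SimpleAnalyzer: a cheap letter/lowercase tokenizer for the highlight pass.
        Analyzer analyzer = new SimpleAnalyzer();
        TokenStream tokens = analyzer.tokenStream("all", new StringReader(text));
        return highlighter.getBestFragments(tokens, text, 3);
    }
}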

Also, I've modified the QueryScorer so it can do wildcard term
matches without extracting the terms from the index (we're using a
ConstantScoreQuery to get around the MaxBooleanClauses exception,
but that prevents highlighting from working). Basically, if a term
doesn't match in the highlighter, it will try to pattern-match
against the wildcard search terms. That adds some processing, but
disabling it doesn't affect the performance much.
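
The heart of that change is just a pattern-match fallback. A
simplified standalone sketch (class and method names here are made
up for illustration, not the actual QueryScorer modification):

import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class WildcardTokenMatcher {
    private final Set<String> queryTerms;         // terms extracted from the query
    private final List<Pattern> wildcardPatterns; // e.g. "foo*" compiled as "foo.*"

    public WildcardTokenMatcher(Set<String> queryTerms, List<Pattern> wildcardPatterns) {
        this.queryTerms = queryTerms;
        this.wildcardPatterns = wildcardPatterns;
    }

    // Returns a score > 0 if the token should be highlighted.
    public float tokenScore(String tokenText) {
        if (queryTerms.contains(tokenText)) {
            return 1.0f; // normal exact-term path
        }
        // Fallback: pattern-match against the wildcard search terms, so we
        // never have to expand the wildcard into terms from the index.
        for (Pattern p : wildcardPatterns) {
            if (p.matcher(tokenText).matches()) {
                return 1.0f;
            }
        }
        return 0.0f;
    }
}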

One other thing I tried was just doing a simple regex search
without using a scorer or analyzer. This runs about 2x faster, but
is still relatively slow.
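
That baseline amounts to something like this (a sketch; window size
and fragment cap are arbitrary):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFragmenter {
    // Cut a fixed-size window of text around each regex match,
    // with no analyzer or scorer involved.
    public static List<String> fragments(String text, Pattern term,
                                         int window, int maxFragments) {
        List<String> out = new ArrayList<String>();
        Matcher m = term.matcher(text);
        while (m.find() && out.size() < maxFragments) {
            int start = Math.max(0, m.start() - window);
            int end = Math.min(text.length(), m.end() + window);
            out.add(text.substring(start, end));
        }
        return out;
    }
}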

Has anyone had any good experience with performing fragmentation and
highlighting for larger documents?

Thanks,

Brian Beard


Re: highlighter / fragmenter performance for large fields

Karsten F.-2
Hi Brian,

I don't know the internals of highlighting ("explanation") in Lucene.
But I know that XTF ( http://xtf.wiki.sourceforge.net/underHood_Documents#tocunderHood_Documents5 )
can handle very large documents (above 100 MB) with highlighting very
fast. The difference from your approach is that XTF divides the
document into small (overlapping) chunks and stores the original text
separately as XML, connected to the Lucene-indexed fields via numbered
XML nodes.
For large texts (above 200 KB), it is the best tool I know.
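
To illustrate the chunking idea, here is a minimal sketch (my own
illustration, not XTF's actual code; chunk and overlap sizes are
arbitrary):

import java.util.ArrayList;
import java.util.List;

public class ChunkedText {
    static final int CHUNK = 4096;   // chunk size (arbitrary)
    static final int OVERLAP = 256;  // overlap so hits on a chunk boundary are not lost

    public static List<String> chunks(String text) {
        List<String> out = new ArrayList<String>();
        for (int start = 0; start < text.length(); start += CHUNK - OVERLAP) {
            int end = Math.min(text.length(), start + CHUNK);
            out.add(text.substring(start, end));
            if (end == text.length()) {
                break;
            }
        }
        return out;
    }
}

The highlighter then only ever sees a few kilobytes at a time
instead of the whole stored field.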

Best regards
  Karsten

Re: highlighter / fragmenter performance for large fields

brian beard-2

Karsten,

Thanks, I will look into this.


