Using Highlighter to highlight entire HTML documents?


Fred Toth
Hi,

We have a need to present HTML documents with all search
terms highlighted. Everything I've seen regarding the Highlighter
code seems to point to the typical case of extracting relevant
fragments from the text for presentation of hit lists.

Is it possible to use the core highlighting code to process an
entire document? Instead of extracting fragments, we would want
the entire document back. Has anyone done this?

Or is this the wrong approach? Even if the Highlighter is not an
exact fit for this, it seems that the term positions could still be
useful?

Any suggestions would be appreciated.

Thanks,

Fred Toth


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Using Highlighter to highlight entire HTML documents?

Will Allen-2
The challenge with this is always to avoid breaking the HTML page itself
(for example, when a search term also occurs inside a tag or an attribute
value).

-----Original Message-----
From: Fred Toth [mailto:[hidden email]]
Sent: Tuesday, May 24, 2005 3:47 PM
To: [hidden email]
Subject: Using Highlighter to highlight entire HTML documents?



Re: Using Highlighter to highlight entire HTML documents?

mark harwood
In reply to this post by Fred Toth
Fred Toth wrote:

> Hi,
>
> We have a need to present HTML documents with all search
> terms highlighted. Everything I've seen regarding the Highlighter
> code seems to point to the typical case of extracting relevant
> fragments from the text for presentation of hit lists.

If you don't want to fragment your docs, either pass the highlighter an
instance of the default fragmenter with its "fragment size in bytes"
property set to a very large number, or pass a custom fragmenter
implementation that always returns false when asked whether the next token
starts a new fragment.
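
A minimal sketch of the second option, in plain Java so that it runs without
Lucene on the classpath. FragmenterSketch and the highlight method are
hypothetical stand-ins, not the real Fragmenter interface from the
highlighter package, whose method signatures should be checked against the
version in use. The regex tokenizer and the <b> markup are assumptions too,
and the sketch treats its input as plain text, so it does not address the
HTML tag problem raised earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the highlighter's fragmenter callback. */
interface FragmenterSketch {
    /** Asked once per token; returning true closes the current fragment. */
    boolean isNewFragment(int tokenStartOffset);
}

class WholeDocumentHighlightSketch {
    /** Never starts a new fragment, so the whole text comes back in one piece. */
    static final FragmenterSketch NEVER_SPLIT = offset -> false;

    /** Starts a new fragment roughly every `size` characters, for comparison. */
    static FragmenterSketch fixedSize(int size) {
        int[] fragmentStart = {0};
        return offset -> {
            if (offset - fragmentStart[0] >= size) {
                fragmentStart[0] = offset;
                return true;
            }
            return false;
        };
    }

    /** Wraps matching words in <b> tags and cuts the text where the fragmenter says so. */
    static List<String> highlight(String text, Set<String> lowerCasedTerms,
                                  FragmenterSketch fragmenter) {
        List<String> fragments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        int copied = 0; // everything before this offset is already emitted
        while (m.find()) {
            current.append(text, copied, m.start()); // text between tokens
            copied = m.start();
            if (fragmenter.isNewFragment(m.start())) {
                fragments.add(current.toString());
                current = new StringBuilder();
            }
            String token = m.group();
            if (lowerCasedTerms.contains(token.toLowerCase())) {
                current.append("<b>").append(token).append("</b>");
            } else {
                current.append(token);
            }
            copied = m.end();
        }
        current.append(text.substring(copied)); // trailing text after the last token
        fragments.add(current.toString());
        return fragments;
    }
}
```

With NEVER_SPLIT the returned list always has exactly one element, which is
the complete highlighted text; with fixedSize(n) the same text comes back cut
into several pieces.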

Cheers
Mark




Finding docs which contain at least x of the queryterms

bkrausz

Hi,

Consider a query with, e.g., 4 terms (t1, t2, t3, t4). I want to retrieve all
documents which contain at least, e.g., 3 of the query terms. How can I
implement this?
The first idea is to use BooleanQueries such as
(t1 and t2 and t3 and t4) or (t1 and t2 and t3) or (t1 and t2 and t4) or
(t1 and t3 and t4) ...

But the performance is not very good when I have 20 query terms and I want
to retrieve all docs which contain at least 15 of the terms.
Can I modify the skipTo algorithm in ConjunctionScorer in order to
achieve this?
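
To put a number on why the enumeration approach scales badly, here is a small
counting sketch (class and method names are made up for illustration). It
counts the AND-clauses needed when only the k-term conjunctions are listed,
which is enough because any document containing at least k terms contains
some k-term subset, and the count when every subset of size k up to n is
listed, as in the 4-term example above.

```java
import java.math.BigInteger;

class ClauseCountSketch {
    /** Binomial coefficient C(n, k), computed exactly. */
    static BigInteger choose(int n, int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            // After step i the value is C(n - k + i, i), so the division is exact.
            result = result.multiply(BigInteger.valueOf(n - k + i))
                           .divide(BigInteger.valueOf(i));
        }
        return result;
    }

    /** Number of clauses when only the k-term conjunctions are listed. */
    static BigInteger minimalClauses(int n, int k) {
        return choose(n, k);
    }

    /** Number of clauses when every subset of size k..n is listed. */
    static BigInteger allSubsetClauses(int n, int k) {
        BigInteger total = BigInteger.ZERO;
        for (int size = k; size <= n; size++) {
            total = total.add(choose(n, size));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(minimalClauses(20, 15));   // 15504
        System.out.println(allSubsetClauses(20, 15)); // 21700
    }
}
```

So "at least 15 of 20" needs 15504 conjunctions of 15 terms each even in the
most compact form, compared with 4 conjunctions for "at least 3 of 4".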

Thanks
Barbara

PS: Has anybody written a statistics class which reports how many terms and
how many distinct terms are in the index, and perhaps computes the mean
length of the documents in the index along with the standard deviation?


Re: Finding docs which contain at least x of the queryterms

Erik Hatcher

On May 25, 2005, at 7:00 AM, Barbara Krausz wrote:

> Consider a Query with e.g. 4 terms (t1,t2,t3,t4). I want to
> retrieve all documents which contain at least e.g. 3 of the
> queryterms. How can I implement this?

There is an interesting trick you can play with a custom Similarity
class on a BooleanQuery: check out the coord method, which is passed the
number of query terms that matched a document (the "overlap"). This could
be used to ensure that an overlap of at least 3 is mandatory for a match,
for example.

I'll leave the details of this as an exercise for the reader for the
moment.
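
One way the exercise might be worked, shown as a plain-Java sketch of the
arithmetic only. In Lucene this logic would go into an override of
coord(int overlap, int maxOverlap) in a Similarity subclass set on the
searcher; the class name and the minOverlap field below are made up for
illustration. Whether a zero coord factor really removes a document from the
results depends on how zero-score hits are collected in the Lucene version in
use, so that part is an assumption to verify.

```java
class MinimumOverlapSketch {
    private final int minOverlap;

    MinimumOverlapSketch(int minOverlap) {
        this.minOverlap = minOverlap;
    }

    /**
     * Same shape as Similarity.coord: overlap is the number of query terms
     * found in the document, maxOverlap the number of terms in the query.
     */
    float coord(int overlap, int maxOverlap) {
        if (overlap < minOverlap) {
            return 0.0f; // too few terms matched: zero out the score
        }
        return overlap / (float) maxOverlap; // the usual overlap fraction
    }

    public static void main(String[] args) {
        MinimumOverlapSketch atLeastThree = new MinimumOverlapSketch(3);
        System.out.println(atLeastThree.coord(2, 4)); // 0.0
        System.out.println(atLeastThree.coord(3, 4)); // 0.75
    }
}
```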

     Erik



Re: Using Highlighter to highlight entire HTML documents?

Dan Funk
In reply to this post by Fred Toth

I wrote a very simple SAX parser for our XML content: I check for the
search tokens (from analyzer.tokenStream) in the text and place a span tag
around each token found. This process could work well with XHTML as well.

In other words, I could never get the highlighter to do what I wanted,
but there's a lot to be learned from the highlighter source.
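
A rough sketch of that kind of SAX pass, using only the JDK. It is not the
code described above: a regex word matcher stands in for
analyzer.tokenStream, and the class name and the "hit" CSS class are made up
for illustration. Because only character data is touched, tag names and
attribute values are never altered. Comments, processing instructions and the
DOCTYPE are dropped by this sketch, and a word that the parser happens to
split across two characters() calls would be missed.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxHighlightSketch extends DefaultHandler {
    private static final Pattern WORD = Pattern.compile("\\w+");
    private final Set<String> terms; // expected to be lower-cased
    private final StringBuilder out = new StringBuilder();

    SaxHighlightSketch(Set<String> lowerCasedTerms) {
        this.terms = lowerCasedTerms;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Re-emit the tag unchanged; attribute values are never highlighted.
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        Matcher m = WORD.matcher(text);
        int copied = 0;
        while (m.find()) {
            out.append(escape(text.substring(copied, m.start())));
            String word = m.group(); // \w+ never contains markup characters
            if (terms.contains(word.toLowerCase())) {
                out.append("<span class=\"hit\">").append(word).append("</span>");
            } else {
                out.append(word);
            }
            copied = m.end();
        }
        out.append(escape(text.substring(copied)));
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Parses well-formed XML/XHTML and returns it with the terms wrapped in spans. */
    static String highlight(String xml, Set<String> lowerCasedTerms) {
        SaxHighlightSketch handler = new SaxHighlightSketch(lowerCasedTerms);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
        return handler.out.toString();
    }
}
```

For example, highlighting the term "search" in
<p title="search">Search the <b>index</b></p> wraps the word in the text but
leaves the title attribute alone.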


--
Dan Funk
Software Engineer

Information Technology Solutions
Battelle Charlottesville Operations
1000 Research Park Boulevard, Suite 105
Charlottesville, Virginia 22911

434.984.0951 x244
434.984.0947 (fax)
[hidden email]




Re: Finding docs which contain at least x of the queryterms

Paul Elschot
In reply to this post by bkrausz
On Wednesday 25 May 2005 13:00, Barbara Krausz wrote:

> But the perfomance is not very good when I have 20 queryterms and I want
> to retrieve all docs which contain at least 15 of the terms.
> Can I modify the skipto-algorithm in ConjunctionScorer in order to
> achieve this?

I don't think so, but in case you can describe a method to do this,
please share it.

In the svn trunk there is a DisjunctionSumScorer that has the
minimum number of subquery matchers as a constructor parameter:

http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/search/

It has this javadoc comment in the advanceAfterCurrent method:
* @todo Investigate whether it is possible to use skipTo() when
* the minimum number of matchers is bigger than one, ie. try and use the
* character of ConjunctionScorer for the minimum number of matchers.

The constructor parameter is not used (even in the trunk), so you'll have
to write the code to use it yourself. I'd recommend starting from the trunk
and extending BooleanQuery for this.
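
To illustrate the idea of a minimum number of matchers outside Lucene, here
is a plain-Java sketch that merges sorted doc id lists, one per term, and
keeps the documents that occur in at least minMatch of them. It is not the
DisjunctionSumScorer code; the class name and the int[][] representation of
postings are assumptions made for the example, and it does no skipping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class MinMatchSketch {
    /**
     * postings[t] holds the strictly increasing doc ids for term t.
     * Returns the doc ids that appear in at least minMatch of the lists.
     */
    static List<Integer> docsMatchingAtLeast(int[][] postings, int minMatch) {
        // Each queue entry is {termIndex, positionInList}, ordered by current doc id.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
            (a, b) -> Integer.compare(postings[a[0]][a[1]], postings[b[0]][b[1]]));
        for (int t = 0; t < postings.length; t++) {
            if (postings[t].length > 0) {
                queue.add(new int[] {t, 0});
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] first = queue.peek();
            int doc = postings[first[0]][first[1]];
            int matchers = 0;
            // Pop every list currently positioned on this doc and advance it.
            while (!queue.isEmpty()
                    && postings[queue.peek()[0]][queue.peek()[1]] == doc) {
                int[] top = queue.poll();
                matchers++;
                if (top[1] + 1 < postings[top[0]].length) {
                    queue.add(new int[] {top[0], top[1] + 1});
                }
            }
            if (matchers >= minMatch) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

With lists {1, 2, 5}, {2, 3, 5}, {2, 5, 7} and {5, 9} and minMatch = 3, the
result is documents 2 and 5. Every posting is visited once, which is the cost
that the skipTo() idea in the javadoc comment would try to reduce.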

Regards,
Paul Elschot



utf-8 & Lucene 1.4.3 & Solaris &windows

arno13
In reply to this post by bkrausz
Hi,

I don't get a UTF-8 index when I use Lucene on Solaris, while my
characters are OK under Windows. My indexing program is the same on both,
and it uses Lucene 1.4.3.

Does anyone have an idea that could help me?

Regards,

Arnaud.


RE: utf-8 & Lucene 1.4.3 & Solaris &windows

Angelov, Rossen
Your Unix system probably has a different default encoding than your
Windows machine.
You have to make sure that the text you give the IndexWriter has been
decoded with the correct encoding.

Do you explicitly set the encoding in your code before you index the text
with Lucene?
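
As an illustration of setting the encoding explicitly, here is a small
JDK-only sketch (the class and method names are made up). It decodes a byte
stream with a named charset instead of relying on the platform default, which
is what classes such as FileReader fall back to. The decoded text is then
what would be handed to Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitEncodingSketch {
    /** Decodes the whole stream with the given charset, not the platform default. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent, stored here as UTF-8 bytes.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);
        System.out.println(right.length()); // 4: decoded as intended
        System.out.println(wrong.length()); // 5: the accented letter became two characters
    }
}
```

The same bytes give different strings under different charsets, which is the
kind of difference that can show up between two machines whose default
encodings differ.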

Ross

-----Original Message-----
From: gaudinat [mailto:[hidden email]]
Sent: Friday, May 27, 2005 10:58 AM
To: [hidden email]
Subject: utf-8 & Lucene 1.4.3 & Solaris &windows



RE: utf-8 & Lucene 1.4.3 & Solaris &windows

Grant Ingersoll
In reply to this post by arno13
Also, see if http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages
helps at all.


>>> [hidden email] 5/27/2005 12:09:32 PM >>>
Probably your Unix system has a different default encoding than your
Windows machine.
You have to make sure you give the IndexWriter a string that has the
correct encoding.

Do you specifically set the encoding in you code before you index it
with Lucene?

Ross
