Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Doug Cutting
Thanks for making this change!

A few comments:

[hidden email] wrote:
> ==============================================================================
> --- lucene/nutch/trunk/src/java/org/apache/nutch/searcher/OpenSearchServlet.java (original)
> +++ lucene/nutch/trunk/src/java/org/apache/nutch/searcher/OpenSearchServlet.java Tue May  9 16:04:40 2006
[...]
> -        addNode(doc, item, "description", summaries[i]);
> +        addNode(doc, item, "description", summaries[i].toString());

This means there's no markup in the OpenSearch output?

Shouldn't there be?

> Modified: lucene/nutch/trunk/src/web/jsp/search.jsp
> URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/web/jsp/search.jsp?rev=405565&r1=405564&r2=405565&view=diff
> ==============================================================================
> +    
> +    // Build the summary
> +    StringBuffer sum = new StringBuffer();
> +    Fragment[] fragments = summaries[i].getFragments();
> +    for (int j=0; j<fragments.length; j++) {
> +      if (fragments[j].isHighlight()) {
> +        sum.append("<span class=\"highlight\">")
> +           .append(Entities.encode(fragments[j].getText()))
> +           .append("</span>");
> +      } else if (fragments[j].isEllipsis()) {
> +        sum.append("<span class=\"ellipsis\"> ... </span>");
> +      } else {
> +        sum.append(Entities.encode(fragments[j].getText()));
> +      }
> +    }
> +    String summary = sum.toString();

Perhaps this should be a method on Summary, to render it as html?

Doug

Jérôme Charron
> This means there's no markup in the OpenSearch output?

Yes, no markup for now.


> Shouldn't there be?

The restriction on the description field is: "Can contain simple escaped HTML
markup, such as <b>, <i>, <a>, and <img> elements."
So, yes, why not. We can add <b> around highlights.
What do you and others think?


> Perhaps this should be a method on Summary, to render it as html?

I had some hesitations about this while coding...
In fact, as suggested in the issue's comments, I would like to add a generic
method on Summary:
String toString(Encoder, Formatter), as in Lucene's Highlighter, and
provide some basic implementations of Encoder and Formatter.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Doug Cutting
Jérôme Charron wrote:
>> This means there's no markup in the OpenSearch output?
>
>
> Yes, no markup for now.

Doesn't this break any existing application that uses OpenSearch and
displays summaries in a web browser?  This is an incompatible change
which we should avoid.

>> Shouldn't there be?
>
>
> The restriction on the description field is: "Can contain simple escaped HTML
> markup, such as <b>, <i>, <a>, and <img> elements."
> So, yes, why not. We can add <b> around highlights.
> What do you and others think?

+1

>> Perhaps this should be a method on Summary, to render it as html?
>
>
> I had some hesitations about this while coding...
> In fact, as suggested in the issue's comments, I would like to add a
> generic method on Summary:
> String toString(Encoder, Formatter), as in Lucene's Highlighter, and
> provide some basic implementations of Encoder and Formatter.

That sounds fine, but in the meantime, let's not reproduce the
html-specific code in lots of places.  We need it in both search.jsp and
in OpenSearchServlet.java.  So we should have it in a common place.  A
method on Summary seems like a good place.  If we subsequently add a
more general API then we could re-implement the toHtml() method using
that API, but I think a generic toHtml() method will be useful for quite
a while yet.
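
A minimal sketch of what such a shared toHtml() method could look like, lifting the loop from the search.jsp diff quoted earlier in the thread. The Fragment class and the encode() helper below are simplified, hypothetical stand-ins (the quoted diff uses Nutch's own Fragment and Entities.encode), so this shows the shape of the idea rather than any committed implementation:

```java
import java.util.Arrays;
import java.util.List;

class SummaryHtmlSketch {

  /** Hypothetical stand-in for a summary fragment (not the real Nutch class). */
  static class Fragment {
    final String text;
    final boolean highlight;
    final boolean ellipsis;

    Fragment(String text, boolean highlight, boolean ellipsis) {
      this.text = text;
      this.highlight = highlight;
      this.ellipsis = ellipsis;
    }
  }

  /** Minimal HTML escaping, standing in for Entities.encode(). */
  static String encode(String s) {
    return s.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;");
  }

  /** Renders the fragments as HTML, mirroring the loop in the search.jsp diff. */
  static String toHtml(List<Fragment> fragments) {
    StringBuilder sb = new StringBuilder();
    for (Fragment f : fragments) {
      if (f.highlight) {
        sb.append("<span class=\"highlight\">")
          .append(encode(f.text))
          .append("</span>");
      } else if (f.ellipsis) {
        sb.append("<span class=\"ellipsis\"> ... </span>");
      } else {
        sb.append(encode(f.text));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<Fragment> fragments = Arrays.asList(
        new Fragment("Nutch & ", false, false),
        new Fragment("<search>", true, false),
        new Fragment("", false, true));
    System.out.println(toHtml(fragments));
  }
}
```

With one method like this in a common place, both search.jsp and OpenSearchServlet.java could call it instead of each carrying its own copy of the loop.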

Doug

Sami Siren-2

>
> Doesn't this break any existing application that uses OpenSearch and
> displays summaries in a web browser?  This is an incompatible change
> which we should avoid.
>
Also a friendly hint to all plugin hackers, you need to enable
summary-basic in your existing nutch-site.xml to get things working.
Took me some time to realize this fact :)

>
> That sounds fine, but in the meantime, let's not reproduce the
> html-specific code in lots of places.  We need it in both search.jsp
> and in OpenSearchServlet.java.  So we should have it in a common
> place.  A method on Summary seems like a good place.  If we
> subsequently add a more general API then we could re-implement the
> toHtml() method using that API, but I think a generic toHtml() method
> will be useful for quite a while yet.
>
+1

--
 Sami Siren

Andrzej Białecki-2
Sami Siren wrote:
>
>>
>> Doesn't this break any existing application that uses OpenSearch and
>> displays summaries in a web browser?  This is an incompatible change
>> which we should avoid.
>>
> Also a friendly hint to all plugin hackers, you need to enable
> summary-basic in your existing nutch-site.xml to get things working.
> Took me some time to realize this fact :)

I think we should add this to nutch-default.xml, if omitting this
results in a non-working installation ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Doug Cutting
In reply to this post by Sami Siren-2
Sami Siren wrote:
> Also a friendly hint to all plugin hackers, you need to enable
> summary-basic in your existing nutch-site.xml to get things working.
> Took me some time to realize this fact :)

Sounds like we should enable it by default, no?

Doug


Jérôme Charron
> > Also a friendly hint to all plugin hackers, you need to enable
> > summary-basic in your existing nutch-site.xml to get things working.
> > Took me some time to realize this fact :)
> Sounds like we should enable it by default, no?

The summary-basic plugin is already enabled by default in nutch-default.xml
(but if nutch-site.xml overrides the plugin.include property and doesn't
include it, it will not be activated, like any other plugin).
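
To illustrate why an overriding nutch-site.xml can silently drop the summaries, here is a small sketch. It assumes the plugin include value is a regular expression that must match a plugin's id for that plugin to be activated. Both values below are made up for illustration and are not the shipped defaults, and my-plugin is a hypothetical plugin id:

```java
import java.util.regex.Pattern;

class PluginIncludeSketch {

  /** Assumption: a plugin is active only if the include regex matches its whole id. */
  static boolean isIncluded(String includeRegex, String pluginId) {
    return Pattern.compile(includeRegex).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    // Illustrative "default" value that lists summary-basic.
    String fromDefault =
        "protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic";
    // Illustrative site override, written by a plugin hacker before
    // summary plugins existed: it adds my-plugin but never mentions summary-basic.
    String fromSite =
        "protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|my-plugin";

    System.out.println(isIncluded(fromDefault, "summary-basic")); // true
    System.out.println(isIncluded(fromSite, "summary-basic"));    // false
    System.out.println(isIncluded(fromSite, "my-plugin"));        // true
  }
}
```

Under that assumption the site value replaces the default one rather than extending it, so any plugin not named in the override is simply left out.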

Jérôme

Jérôme Charron
In reply to this post by Andrzej Białecki-2
> > Also a friendly hint to all plugin hackers, you need to enable
> > summary-basic in your existing nutch-site.xml to get things working.
> > Took me some time to realize this fact :)
> I think we should add this to nutch-default.xml,

Did I miss something?
summary-basic is activated in nutch-default.xml ... no?


> if omitting this
> results in a non-working installation ...

In my tests, it only results in no summaries on the results pages...
Isn't that the case?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Jérôme Charron
In reply to this post by Doug Cutting
> > String toString(Encoder, Formatter), as in Lucene's Highlighter, and
> > provide some basic implementations of Encoder and Formatter.
> That sounds fine, but in the meantime, let's not reproduce the
> html-specific code in lots of places.  We need it in both search.jsp and
> in OpenSearchServlet.java.  So we should have it in a common place.  A
> method on Summary seems like a good place.  If we subsequently add a
> more general API then we could re-implement the toHtml() method using
> that API, but I think a generic toHtml() method will be useful for quite
> a while yet.

Yes Doug, but in fact the idea is to add the toString(Formatter) method in
a common place (Summary),
and to add one specific Formatter implementation for OpenSearch and another
one for search.jsp.
The reason is that they should not use the same HTML code:
1. OpenSearch should only use <b> around highlights
2. search.jsp should use some more complicated HTML code (<span ... >)

In fact, I don't know if the "Formatter" solution is the right one, but
toString() or toHtml() must be parameterized,
since the two pieces of code that use this method should produce distinct
output.
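
A rough sketch of the parameterized variant, assuming a small Formatter interface. The interface, its two methods, the Fragment class and the encode() helper are all hypothetical names made up for this example; they are not the Lucene Highlighter API and not existing Nutch code:

```java
import java.util.Arrays;
import java.util.List;

class SummaryFormatterSketch {

  /** Hypothetical callback deciding how highlights and ellipses are rendered. */
  interface Formatter {
    String highlight(String encodedText);
    String ellipsis();
  }

  /** For OpenSearch: only <b> around highlights. */
  static final Formatter OPEN_SEARCH = new Formatter() {
    public String highlight(String encodedText) { return "<b>" + encodedText + "</b>"; }
    public String ellipsis() { return " ... "; }
  };

  /** For search.jsp: the richer <span> markup. */
  static final Formatter SEARCH_JSP = new Formatter() {
    public String highlight(String encodedText) {
      return "<span class=\"highlight\">" + encodedText + "</span>";
    }
    public String ellipsis() { return "<span class=\"ellipsis\"> ... </span>"; }
  };

  /** Hypothetical stand-in for a summary fragment. */
  static class Fragment {
    final String text;
    final boolean highlight;
    final boolean ellipsis;

    Fragment(String text, boolean highlight, boolean ellipsis) {
      this.text = text;
      this.highlight = highlight;
      this.ellipsis = ellipsis;
    }
  }

  /** Minimal HTML escaping, standing in for a real encoder. */
  static String encode(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  /** One shared rendering loop; the Formatter supplies the markup. */
  static String toString(List<Fragment> fragments, Formatter formatter) {
    StringBuilder sb = new StringBuilder();
    for (Fragment f : fragments) {
      if (f.highlight) {
        sb.append(formatter.highlight(encode(f.text)));
      } else if (f.ellipsis) {
        sb.append(formatter.ellipsis());
      } else {
        sb.append(encode(f.text));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<Fragment> fragments = Arrays.asList(
        new Fragment("a ", false, false),
        new Fragment("b", true, false),
        new Fragment("", false, true),
        new Fragment("c", false, false));
    System.out.println(toString(fragments, OPEN_SEARCH));
    System.out.println(toString(fragments, SEARCH_JSP));
  }
}
```

The loop lives in one place, and each caller only chooses which markup it wants.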

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Doug Cutting
Jérôme Charron wrote:

> Yes Doug, but in fact the idea is to add the toString(Formatter) method in
> a common place (Summary),
> and to add one specific Formatter implementation for OpenSearch and another
> one for search.jsp.
> The reason is that they should not use the same HTML code:
> 1. OpenSearch should only use <b> around highlights
> 2. search.jsp should use some more complicated HTML code (<span ... >)
>
> In fact, I don't know if the "Formatter" solution is the right one, but
> toString() or toHtml() must be parameterized,
> since the two pieces of code that use this method should produce distinct
> output.

This all sounds fine, I'm just remarking that, at present, the
OpenSearch output has changed incompatibly, which is a bad thing, and
that I wish, until this is fully worked out, OpenSearch returned what it
did before (markup, although perhaps exceeding what's advised).

Doug

Dawid Weiss
In reply to this post by Jérôme Charron
> The reason is that they should not use the same HTML code :
> 1. OpenSearch should only use <b> around highlights
> 2. search.jsp should use some more complicated HTML code (<span ... >)

Add 3. Clustering would benefit from a plain text version.

D.

Jérôme Charron
> Add 3. Clustering would benefit from a plain text version.

Yes Dawid, but it is already committed: the clustering now uses the plain
text version returned by the toString() method.

Dawid, I have a question about clustering.
Currently, the clustering uses the summaries as input. I assume it would
give better results if it took the whole document content, no?
I assume that clustering uses the summaries instead of the document content
for performance reasons.
But there is a (bad) side effect: since the size of the summaries is
configurable, the clustering "quality" will vary depending on the configured
summary size. I find this very confusing: when folks adjust
this parameter, it is only for front-end reasons (they want to display
a long or a short summary), certainly not for clustering reasons.

What do you and others think about this?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Marvin Humphrey

On May 11, 2006, at 3:36 AM, Jérôme Charron wrote:

> Currently, the clustering uses the summaries as input. I assume it would
> give better results if it took the whole document content, no?
> I assume that clustering uses the summaries instead of the document content
> for performance reasons.
> But there is a (bad) side effect: since the size of the summaries is
> configurable, the clustering "quality" will vary depending on the configured
> summary size. I find this very confusing: when folks adjust
> this parameter, it is only for front-end reasons (they want to display
> a long or a short summary), certainly not for clustering reasons.
>
> What do you and others think about this?

Bob Carpenter of alias-i had this to say when I brought up this very  
idea:

http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Sami Siren-2
In reply to this post by Jérôme Charron
Jérôme Charron wrote:

> (but if nutch-site.xml overrides the plugin.include property and
> doesn't include it, it will not be activated, like any other plugin)

Yes, that's what I meant. I guess that's the default case for people
hacking plugins.

--
 Sami Siren

Jérôme Charron
> > (but if nutch-site.xml overrides the plugin.include property and
> > doesn't include it, it will not be activated, like any other plugin)
> Yes, that's what I meant. I guess that's the default case for people
> hacking plugins.

Oh, yes Sami, I understand what you mean...
Sorry, I just forgot to mention this point on the list (so, plugin hackers,
you need to add one of the new summary plugins if you want to have
summaries displayed).
Sorry, I also forgot to add the summary plugins to the default webapp context
file (nutch.xml)... I will add this once svn write access is
available.
And once more, sorry, because I also forgot to propagate the summary API
changes to the web2 module...

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Jérôme Charron
In reply to this post by Marvin Humphrey
> Bob Carpenter of alias-i had this to say when I brought up this very
> idea:
> http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599

Thanks for your response, Marvin.
But finally my question is: shouldn't the Nutch clustering use
fixed-size snippets instead of the configurable display size?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Dawid Weiss
In reply to this post by Jérôme Charron

Hi Jerome,

> Yes Dawid, but it is already committed => the clustering now uses the plain
> text version returned by the toString() method.

Ugh, yes, sorry about that, it uses Summary.toStrings(summaries) to be
specific and that uses toString internally.

> Currently, the clustering uses the summaries as input. I assume it would
> give better results if it took the whole document content, no?
> I assume that clustering uses the summaries instead of the document content
> for performance reasons.

Not always. Or rather: depends what your goals are. Full document
clustering will take longer (word segmentation, feature extraction etc),
but since you have more data to work with, document similarity should be
more accurate and hence clusters more sensible. In practice, however,
similarity between documents and "cluster quality" is just a
mathematical concept which is never shown to the user -- what the user
sees is the representation of a cluster, which in case of full-document
clustering is usually quite inconvenient to build and has a weak
relationship with the actual mathematical model of clusters.

Contextual (keyword-in-context) snippets have a great advantage: they
are shorter and carry the neighborhood of your query's terms. This very
neighborhood (or rather: repetitive sequences of terms) can be used to
first determine "clusters" of documents and then to describe them to the
user. This is how most Web clustering algorithms work (excuse me if I
explained it in a very imprecise way).
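
A toy illustration of that idea, grouping snippets by the word pairs they share and using the shared pair as the group's label. This is deliberately naive and is not the algorithm used by the Nutch clustering plugin or by any particular Web clustering system; the class and method names are made up for the example:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SnippetPhraseSketch {

  /**
   * Maps each two-word phrase to the indexes of the snippets containing it,
   * keeping only phrases that occur in at least two snippets.
   */
  static Map<String, Set<Integer>> groupBySharedPhrase(List<String> snippets) {
    Map<String, Set<Integer>> byPhrase = new TreeMap<>();
    for (int i = 0; i < snippets.size(); i++) {
      String[] words = snippets.get(i).toLowerCase().split("\\W+");
      for (int j = 0; j + 1 < words.length; j++) {
        if (words[j].isEmpty()) {
          continue; // split() yields a leading "" if the snippet starts with punctuation
        }
        String phrase = words[j] + " " + words[j + 1];
        byPhrase.computeIfAbsent(phrase, k -> new TreeSet<>()).add(i);
      }
    }
    // A phrase seen in a single snippet cannot describe a group.
    byPhrase.values().removeIf(docs -> docs.size() < 2);
    return byPhrase;
  }

  public static void main(String[] args) {
    List<String> snippets = Arrays.asList(
        "open source search engine",
        "a search engine for the web",
        "web crawler");
    System.out.println(groupBySharedPhrase(snippets));
  }
}
```

Because the phrases come straight from the snippet text, shortening or lengthening the snippets changes which phrases are available to group on.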

> But there is a (bad) side effect: since the size of the summaries is
> configurable, the clustering "quality" will vary depending on the configured
> summary size. I find this very confusing: when folks adjust
> this parameter, it is only for front-end reasons (they want to display
> a long or a short summary), certainly not for clustering reasons.

You're right -- changing anything with the input (snippets length,
number of documents etc) will alter the clusters. This is basically how
it works. If you want clustering in your search engine then, depending
on the type of data you serve, you'll have to experiment with the
settings a bit and see which give you satisfactory results. I don't
think there is any particular reason to provide different data to the
clusterer. Moreover, it'd complicate things quite badly.

D.





Jérôme Charron
> You're right -- changing anything with the input (snippets length,
> number of documents etc) will alter the clusters. This is basically how
> it works. If you want clustering in your search engine then, depending
> on the type of data you serve, you'll have to experiment with the
> settings a bit and see which give you satisfactory results. I don't
> think there is any particular reason to provide different data to the
> clusterer. Moreover, it'd complicate things quite badly.

Thanks Dawid for your response.
In fact, I don't really want to change this; I just want to be sure that
everybody is aware of it, and to get some opinions.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Dawid Weiss

Yes, this should definitely be mentioned somewhere (in the documentation :)).
At least we left a trace on the mailing list, so it'll be possible to
refer to it.

D.

Jérôme Charron wrote:

>> You're right -- changing anything with the input (snippets length,
>> number of documents etc) will alter the clusters. This is basically how
>> it works. If you want clustering in your search engine then, depending
>> on the type of data you serve, you'll have to experiment with the
>> settings a bit and see which give you satisfactory results. I don't
>> think there is any particular reason to provide different data to the
>> clusterer. Moreover, it'd complicate things quite badly.
>
> Thanks Dawid for your response.
> In fact, I don't really want to change this; I just want to be sure that
> everybody is aware of it, and to get some opinions.
>
> Regards
>
> Jérôme
>