Sorting in nutch-webinterface - how?

Sorting in nutch-webinterface - how?

Stefan Neufeind
Hi,

I am using index-basic and index-more, and I see lastModified in the
RSS output. Now I want to use &sort=lastModified, but it does not work.
The same goes for &sort=title. However, &sort=url does work.

What am I doing wrong here?


Regards,
 Stefan

Re: Sorting in nutch-webinterface - how?

Marko Bauhardt-2

On 25.05.2006 at 13:21, Stefan Neufeind wrote:

> Hi,
>
> I am using index-basic and index-more, and I see lastModified in the
> RSS output. Now I want to use &sort=lastModified, but it does not work.

Try sort=date.

Regards,
Marko



Re: Sorting in nutch-webinterface - how?

Stefan Neufeind
Marko Bauhardt wrote:
>
> On 25.05.2006 at 13:21, Stefan Neufeind wrote:
>
>> Hi,
>>
>> I am using index-basic and index-more, and I see lastModified in the
>> RSS output. Now I want to use &sort=lastModified, but it does not work.
>
> Try sort=date.

Hmm, that works. But why, given that I think the field is named lastModified?


Thank you very much for your help,
 Stefan

Re: Sorting in nutch-webinterface - how?

Marko Bauhardt-2


>
> Hmm, that works. But why, given that I think the field is named
> lastModified?

lastModified is only used if a last-modified value is available from the
HTML meta tags. If so, lastModified is stored but not indexed.
The date field, however, is always indexed. If lastModified is
available as a meta tag, then date=lastModified. If not, date=FetchTime.
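
The fallback rule described here can be illustrated in a few lines of plain Java. This is only a sketch of the rule as stated in this message, not the actual index-more code; the class name, method name, and the use of millisecond timestamps are assumptions made for the example.

```java
// Hypothetical illustration of the rule "date = lastModified if present, else fetch time".
// None of these names come from Nutch itself.
final class IndexDateSketch {

    // lastModifiedMillis is null when the page supplied no last-modified value.
    static long indexedDate(Long lastModifiedMillis, long fetchTimeMillis) {
        return (lastModifiedMillis != null) ? lastModifiedMillis : fetchTimeMillis;
    }

    public static void main(String[] args) {
        long fetchTime = 1_148_556_060_000L; // arbitrary example timestamp
        System.out.println(indexedDate(1_148_000_000_000L, fetchTime)); // uses lastModified
        System.out.println(indexedDate(null, fetchTime));               // falls back to fetch time
    }
}
```

Because the date field always gets a value under this rule, every document has something to sort on, which would explain why sort=date works while sort=lastModified does not.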

HTH,
Marko


Re: Sorting in nutch-webinterface - how?

Stefan Neufeind
Marko Bauhardt wrote:
>
>> Hmm, that works. But why, given that I think the field is named lastModified?
>
> lastModified is only used if a last-modified value is available from the
> HTML meta tags. If so, lastModified is stored but not indexed.
> The date field, however, is always indexed. If lastModified is available
> as a meta tag, then date=lastModified. If not, date=FetchTime.

Hi Marko,

that hint really helped. Can you maybe also help me out with sort=title?
See also:
http://issues.apache.org/jira/browse/NUTCH-287

The problem is that it works on some searches - but not always. Could it
be that maybe some plugins don't write a title or write title as
null/empty and that leads to problems? What could I do:
a) as a quick fix to prevent the exception, and
b) to track down which result(s) actually cause the problem, and why?

I've taken a look at the Javadoc for the Lucene interface. It looks as
if, when you sort by a field, fields[0] should always be set to the value
of the field you sorted by. As far as I can tell, though, it is actually
null, or fields may even be empty.


Regards,
 Stefan

Re: Sorting in nutch-webinterface - how?

Marko Bauhardt-2

On 26.05.2006 at 01:57, Stefan Neufeind wrote:
>> Modified. If not, date=FetchTime.
>
> Hi Marko,
>

Hi Stefan,

> that hint really helped. Can you maybe also help me out with  
> sort=title?
> See also:
> http://issues.apache.org/jira/browse/NUTCH-287
>
> The problem is that it works on some searches - but not always.  
> Could it
> be that maybe some plugins don't write a title or write title as
> null/empty and that leads to problems? What could I do:

If an HTML page begins with "<?xml", then the TextParser is used
instead of the HTML parser (I am not sure about this). If the TextParser
is used to parse the page, no title is extracted. So in this case the
title is empty and the summary is XML code.

Please check the pages that have no title and see whether "<?xml"
appears at the beginning of the page.
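
One way to run this check over a batch of fetched pages is a small helper like the one below. It is a minimal sketch in plain Java and does not use any Nutch API; the class and method names are made up for the example, and skipping a UTF-8 byte order mark and leading whitespace is an assumption about what "begins with" should tolerate.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports whether raw page content starts with an XML declaration.
final class XmlPrologCheck {

    static boolean startsWithXmlDeclaration(byte[] content) {
        // Only the first few bytes matter; ISO-8859-1 maps every byte to one char.
        int len = Math.min(content.length, 64);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);

        // A UTF-8 byte order mark (EF BB BF) decodes to these three chars in ISO-8859-1.
        if (head.startsWith("\u00EF\u00BB\u00BF")) {
            head = head.substring(3);
        }

        // Assumption: tolerate leading whitespace before the declaration.
        int i = 0;
        while (i < head.length() && Character.isWhitespace(head.charAt(i))) {
            i++;
        }
        return head.startsWith("<?xml", i);
    }

    public static void main(String[] args) {
        byte[] xhtml = "<?xml version=\"1.0\"?><html></html>".getBytes(StandardCharsets.ISO_8859_1);
        byte[] html = "<html><head><title>t</title></head></html>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(startsWithXmlDeclaration(xhtml));
        System.out.println(startsWithXmlDeclaration(html));
    }
}
```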

Marko





Re: Sorting in nutch-webinterface - how?

Stefan Neufeind
Marko Bauhardt wrote:

>
> On 26.05.2006 at 01:57, Stefan Neufeind wrote:
>>> Modified. If not, date=FetchTime.
>>
>> Hi Marko,
>>
>
> Hi Stefan,
>
>> that hint really helped. Can you maybe also help me out with sort=title?
>> See also:
>> http://issues.apache.org/jira/browse/NUTCH-287
>>
>> The problem is that it works on some searches - but not always. Could it
>> be that maybe some plugins don't write a title or write title as
>> null/empty and that leads to problems? What could I do:
>
> If an HTML page begins with "<?xml", then the TextParser is used instead
> of the HTML parser (I am not sure about this). If the TextParser is used
> to parse the page, no title is extracted. So in this case the title is
> empty and the summary is XML code.
>
> Please check the pages that have no title and see whether "<?xml"
> appears at the beginning of the page.

I could understand that those documents are "problematic" in sorting -
e.g. they would all be in front or at the end of the sorted list. But
why does this actually lead to no output/an exception/...?

Maybe in case no title is present at least _something_ could be used -
e.g. the URL instead or so?


Regards,
 Stefan

Re: Sorting in nutch-webinterface - how?

Doug Cutting
In reply to this post by Stefan Neufeind
Stefan Neufeind wrote:
> Can you maybe also help me out with sort=title?

Lucene's sorting works with indexed, non-tokenized fields.  The title field is
tokenized.  If you need to sort by title then you'd need to add a plugin
that indexes another field (e.g., "sortTitle") containing the
un-tokenized title, perhaps lowercased, if you want case-independent
sorting.
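
As a rough illustration, the value for such a "sortTitle" field could be derived as below. This is a sketch in plain Java under stated assumptions, not Nutch or Lucene code: the class and method names are hypothetical, and the fallback to the URL for pages without a title is taken from the suggestion earlier in this thread rather than from anything Lucene requires. The plugin would then add the returned string as an indexed, non-tokenized field; the exact call depends on the Lucene and Nutch versions in use and is not shown.

```java
import java.util.Locale;

// Hypothetical helper that builds a single, untokenized sort key per document.
final class SortTitleSketch {

    static String sortKey(String title, String url) {
        String key = (title == null) ? "" : title.trim();
        if (key.isEmpty()) {
            // Assumption: fall back to the URL so every document has a sort value.
            key = (url == null) ? "" : url.trim();
        }
        // Lowercasing with a fixed locale gives case-independent ordering.
        return key.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(sortKey("  Apache Nutch ", "http://example.org/a"));
        System.out.println(sortKey(null, "http://Example.org/B"));
    }
}
```

Keeping the whole title as one term is the point of the extra field: sorting needs exactly one value per document, which a tokenized title does not provide.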

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Sort.html

Doug