Restrictive searching approaches?

Restrictive searching approaches?

Andrew Libby
Hello,

I'm currently using Nutch on a single site.  The site has a few major
sections, and I'd like to let users restrict their searches to a
particular portion of the site.  The general guideline is by top-level
directory.  For example, if I've got

http://site.com/documentation/
http://site.com/marketing/
http://site.com/technialstats/

I'd like to allow users to choose any combination of these top-level
trees to search.  One approach I can think of would be to somehow set a
field in the index, say "section", whose value is the top-level name.
Searches could then be limited to these sections by adding
section:marketing or section:documentation to the search query.

Am I on the right track here, and if so, how do I go about accomplishing
this?
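
To make the idea concrete, here is a minimal Python sketch of how a
"section" value could be derived from each page's URL at indexing time.
It is not Nutch code: the function name section_of is hypothetical, and
it assumes the section is simply the first path segment of the URL.

```python
from urllib.parse import urlparse


def section_of(url):
    """Return the first path segment of a URL, or None if there is none.

    Hypothetical helper: assumes the "section" is the top-level directory.
    """
    path = urlparse(url).path                      # e.g. "/marketing/plans.html"
    segments = [s for s in path.split("/") if s]   # drop empty pieces
    return segments[0] if segments else None


for url in ("http://site.com/documentation/",
            "http://site.com/marketing/plans.html",
            "http://site.com/"):
    print(url, "->", section_of(url))
```

Note that a page sitting directly under the site root (for example
http://site.com/index.html) would get its file name as the "section"
with this rule, so a real implementation would probably want a list of
allowed section names.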

Thanks,

Andy

--
Andrew Libby                                  
[hidden email]
http://philadelphiariders.com/


Re: Restrictive searching approaches?

Zaheed Haque
Maybe this could help you..

http://issues.apache.org/jira/browse/NUTCH-201

Cheers


Re: Restrictive searching approaches?

juan_barbancho_rsi
Hi,

I have a similar problem.

I have an intranet search engine with VIP users and normal users.  VIP
users can search both the normal documentation and the VIP
documentation.

One way to solve this with Nutch is to use the subcollection API.

In the past I used the Lucene API: I created two indexes and ran two
searches for a VIP user.

I then did a "manual" merge of the two result sets.

I'm considering Nutch because I need to search more file types and I
need to add Internet web pages to the normal users' index.
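
The "manual" merge of two result sets described above might look
roughly like the following Python sketch.  It is an illustration only,
not Lucene or Nutch code: the (score, document) tuples and the name
merge_hits are assumptions, and scores coming from two separate indexes
may not be directly comparable.

```python
import heapq
from itertools import islice


def merge_hits(normal_hits, vip_hits, limit=10):
    """Merge two hit lists that are each sorted by descending score.

    Each hit is a (score, doc_id) tuple; this shape is illustrative.
    """
    merged = heapq.merge(normal_hits, vip_hits,
                         key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


# Made-up sample data for a VIP user's search.
normal = [(0.9, "handbook.html"), (0.4, "faq.html")]
vip = [(0.7, "board-minutes.pdf"), (0.2, "salaries.xls")]

for score, doc in merge_hits(normal, vip):
    print(f"{score:.1f}  {doc}")
```

A normal user would simply get the hits from the normal index, with no
merge step.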

Thank you



Re: Restrictive searching approaches?

Andrew Libby
In reply to this post by Zaheed Haque

Thank you very much!

Andy



Re: Restrictive searching approaches?

Andrew Libby
In reply to this post by Zaheed Haque

I'm running from a Subversion checkout, and I don't see the patches
there.  I'm clearly not familiar with the development process used, but
does anyone know if or when the patches for this will be applied to
Subversion?

Thanks.

Andy



Re: Restrictive searching approaches?

Andrew Libby
In reply to this post by Zaheed Haque

I've applied the patch from the NUTCH-201 ticket that Zaheed linked.  I
browsed the patch to try to figure out how to use this plugin, and I'm
having trouble getting it working.

Before I get into the details: if someone has a source of information
describing how Nutch starts up and initializes plugins, so that I can
get a feel for whether this patch is even being used properly in the
system, I'd very much appreciate it.

----

Here's what I did:

Applied the patch with patch -p0 < subcollection.2.path

Compiled the tarball with ant tar

Extracted the tarball in my runtime location with
tar -zxvpf nutch-0.8-dev.tar.gz

Created urls/urls.txt containing my site name
(http://www.philadelphiariders.com/)

Edited crawl-urlfilter.txt to accept the aforementioned site name

Edited subcollections.xml and added the following:

    <subcollection>
        <name>wiki</name>
        <id>wiki</id>
        <whitelist>http://www.philadelphiariders.com/wiki</whitelist>
        <blacklist />
    </subcollection>

    <subcollection>
        <name>moto-web</name>
        <id>moto-web</id>
        <whitelist>http://www.philadelphiariders.com/c/dmoz</whitelist>
        <blacklist />
    </subcollection>

    <subcollection>
        <name>gallery</name>
        <id>gallery</id>
        <whitelist>http://www.philadelphiariders.com/gallery</whitelist>
        <blacklist />
    </subcollection>

Crawled and indexed my site with ./bin/nutch crawl urls -dir ../nutch-index
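
As a way to reason about what the subcollections.xml entries above are
meant to do, here is a small Python sketch that assigns URLs to
subcollections.  It is not the plugin's code.  It ASSUMES that a URL
belongs to a subcollection when it contains one of the whitelist
strings and none of the blacklist strings; the real plugin's matching
rules may differ, and the names below are hypothetical.

```python
# Mirrors the three entries from subcollections.xml above.
DEFINITIONS = [
    {"id": "wiki",
     "whitelist": ["http://www.philadelphiariders.com/wiki"],
     "blacklist": []},
    {"id": "moto-web",
     "whitelist": ["http://www.philadelphiariders.com/c/dmoz"],
     "blacklist": []},
    {"id": "gallery",
     "whitelist": ["http://www.philadelphiariders.com/gallery"],
     "blacklist": []},
]


def subcollections_for(url, definitions):
    """Return the ids of all subcollections whose rules accept the URL.

    ASSUMPTION: plain substring matching against whitelist and blacklist.
    """
    matched = []
    for definition in definitions:
        in_whitelist = any(w in url for w in definition["whitelist"])
        in_blacklist = any(b in url for b in definition["blacklist"])
        if in_whitelist and not in_blacklist:
            matched.append(definition["id"])
    return matched


print(subcollections_for("http://www.philadelphiariders.com/wiki/Loudon",
                         DEFINITIONS))
```

Running a few known URLs through a check like this is a quick way to
confirm that the whitelist strings are spelled the way the crawled URLs
actually look.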

When I start Tomcat and do some test searches, I get links from the
wiki area as long as no collection field is added to the query.  But if
I run a query like:

collection:wiki loudon

which should return documents, I get none.  Additionally, if I simply
query collection:wiki, I get no hits.

If anyone has any ideas, I'll be very grateful.


Re: Restrictive searching approaches?

jay jiang
Shouldn't that be subcollection:wiki instead?  Also, I assume you have
subcollection added to plugin.includes in the config file
(nutch-site.xml).
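
One way to double-check the plugin.includes setting is sketched below
in Python.  The XML content is an illustrative stand-in for
nutch-site.xml, and the plugin list in it is an assumption, not a
recommended or default value.  The sketch also assumes the value is a
regular expression that must match the whole plugin id, and it uses
Python's regex engine, which can differ from Java's in corner cases.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative stand-in for nutch-site.xml (plugin list is made up).
SITE_XML = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|subcollection</value>
  </property>
</configuration>"""


def plugin_included(site_xml, plugin_id):
    """Return True if plugin_id fully matches the plugin.includes regex."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            pattern = prop.findtext("value", default="").strip()
            return re.fullmatch(pattern, plugin_id) is not None
    return False  # property not set in this file


print(plugin_included(SITE_XML, "subcollection"))
print(plugin_included(SITE_XML, "parse-pdf"))
```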


Re: Restrictive searching approaches?

Andrew Libby
In reply to this post by Zaheed Haque

Okay, I sort of answered my own question on this.

From looking at the index with Luke, it seemed that I could use the url
field to restrict searches.  I found that the main categories in my
site had url field values matching the top-level path in the URL.  So I
just add search terms like

url:gallery or url:wiki to my query strings, and voilà.  How nice!
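
For a search form where users tick the sections they want, the query
strings could be built along these lines.  This is a Python sketch with
a hypothetical function name.  Nothing in this thread shows a syntax
for matching any one of several sections in a single query, so the
sketch builds one query per selected section; the result lists would
then have to be combined by the caller.

```python
def restricted_queries(user_query, sections, field="url"):
    """Build one query string per selected section, e.g. 'url:wiki loudon'."""
    return [f"{field}:{section} {user_query}" for section in sections]


for query in restricted_queries("loudon", ["gallery", "wiki"]):
    print(query)
```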

I was also able to find some information on the wiki about how plugins
are initialized, in case anyone is interested:

http://wiki.apache.org/nutch/WritingPluginExample

The example plugin developed there was enough to help me work out
whether a plugin was properly configured.

Andy



Re: Restrictive searching approaches?

Andrew Libby
In reply to this post by jay jiang

Indeed, you are correct.

Thanks.



Re: Restrictive searching approaches?

jay jiang
One caveat I found when backporting the subcollection patch to
nutch-0.7.2: you need to explicitly set a non-zero boost for
SubcollectionQueryFilter.  Otherwise you will get no results for
collection searches, because the default boost of its parent class,
RawFieldQueryFilter, is zero.

Jay Jiang
