Boolean Queries and extracting all index terms

Nick Rowlands
Hi

I am at the start of my uni dissertation on interactive query expansion. At
the moment I am using an Ajax framework and Wordnet to suggest alternative or
additional search terms based on the user's original query. The webpage is
updated as the user types. I now wish to integrate my system into a search
engine, and Nutch seems suitable. I have successfully completed the whole-web
crawl tutorial. I have two questions:

1. I wish to formulate a boolean query using the OR operator to search on
all of the alternative search terms Wordnet has suggested. I have found no
documentation on this in either the Wiki or the mailing list archive. Are
boolean queries possible in Nutch?

2. How do I extract all index terms from Nutch, and possibly their tf/idf
scores too? I intend to use this information to build a function similar to
Google Suggest: as you type, suggested terms appear based on terms actually
in the index. I would want to put the terms and their associated scores into
a database like PostgreSQL.

Any pointers would be much appreciated!

Regards,
Nick.

Re: [Nutch-general] Boolean Queries and extracting all index terms

praveen pathiyil
Hi Nick,

For implementing Boolean OR queries, you will have to write your own
plugin. Look at the code of query-basic and query-site for examples
of how to write a new query plugin.

Look at the javadocs of org.apache.lucene.search.BooleanQuery for
details. Adding a clause as non-required (and non-prohibited) gives you
OR behavior. [The API for adding a new clause is add(Query query,
boolean required, boolean prohibited), so you would pass 'false' for
both required and prohibited.]
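
To make the flag semantics concrete, here is a toy model in plain Java. It is
not the Lucene class: ToyBooleanQuery, Clause and matches are made-up names,
and a "document" is reduced to a set of terms. It only sketches how clauses
added as non-required and non-prohibited combine into an OR, under the
assumption that a query with no required clause needs at least one optional
clause to hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy stand-in for a boolean query; NOT org.apache.lucene.search.BooleanQuery.
final class ToyBooleanQuery {
    // One clause: a bare term plus the two flags (hypothetical helper type).
    private static final class Clause {
        final String term;
        final boolean required;
        final boolean prohibited;

        Clause(String term, boolean required, boolean prohibited) {
            this.term = term;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    private final List<Clause> clauses = new ArrayList<>();

    // Same flag order as add(query, required, prohibited), but takes a term.
    ToyBooleanQuery add(String term, boolean required, boolean prohibited) {
        if (required && prohibited) {
            throw new IllegalArgumentException("clause cannot be both required and prohibited");
        }
        clauses.add(new Clause(term, required, prohibited));
        return this;
    }

    // A document matches if no prohibited term is present, every required
    // term is present, and (when nothing is required) some optional term hits.
    boolean matches(Set<String> docTerms) {
        boolean hasRequired = false;
        boolean anyOptionalHit = false;
        for (Clause c : clauses) {
            boolean hit = docTerms.contains(c.term);
            if (c.prohibited) {
                if (hit) return false;
            } else if (c.required) {
                hasRequired = true;
                if (!hit) return false;
            } else if (hit) {
                anyOptionalHit = true;
            }
        }
        return hasRequired || anyOptionalHit;
    }
}
```

With every Wordnet suggestion added as add(term, false, false), a document
containing any one of the terms matches, which is the OR behavior you are
after.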

For your second question, you might want to start by looking at this email:
http://mail-archives.apache.org/mod_mbox/jakarta-lucene-dev/200309-incomplete.mbox/%3C3F6A2EEC.8010007@...%3E
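
Once the terms and their document frequencies have been dumped out of the
index (into a map, or into your database), the suggest step itself is small.
Below is a minimal sketch in plain Java, not Nutch or Lucene code:
TermSuggester and its methods are hypothetical names, the data is assumed to
be a term -> document-frequency map, and the idf formula
1 + ln(N / (df + 1)) is one common formulation that I am assuming here, so
check it against whatever scoring your setup really uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory term dictionary: term -> number of docs containing it.
final class TermSuggester {
    private final TreeMap<String, Integer> docFreq = new TreeMap<>();
    private final int numDocs;

    TermSuggester(Map<String, Integer> docFreqByTerm, int numDocs) {
        this.docFreq.putAll(docFreqByTerm);
        this.numDocs = numDocs;
    }

    // Assumed formula: idf = 1 + ln(N / (df + 1)). Unknown terms get df = 0.
    double idf(String term) {
        int df = docFreq.getOrDefault(term, 0);
        return 1.0 + Math.log(numDocs / (double) (df + 1));
    }

    // Terms starting with the prefix, most frequent first, at most 'limit'.
    List<String> suggest(String prefix, int limit) {
        // The sorted map lets us cut out the prefix range without a full scan.
        SortedMap<String, Integer> range =
                docFreq.subMap(prefix, prefix + Character.MAX_VALUE);
        List<Map.Entry<String, Integer>> hits = new ArrayList<>(range.entrySet());
        // Stable sort: terms with equal frequency stay in alphabetical order.
        hits.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : hits) {
            if (out.size() >= limit) break;
            out.add(e.getKey());
        }
        return out;
    }
}
```

Ranking by raw document frequency suits a "popular completions" list; if
you would rather surface rarer terms, sort by idf instead.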


Regards,
Praveen.

On 7/11/05, Nick Rowlands <[hidden email]> wrote:

> Hi
>
> I am at the start of my uni dissertation on interactive query expansion. At
> the moment I am using an Ajax framework and Wordnet to suggest alternative or
> additional search terms based on the user's original query. The webpage is
> updated as the user types. I now wish to integrate my system into a search
> engine, and Nutch seems suitable. I have successfully completed the whole-web
> crawl tutorial. I have two questions:
>
> 1. I wish to formulate a boolean query using the OR operator to search on
> all of the alternative search terms Wordnet has suggested. I have found no
> documentation on this in either the Wiki or the mailing list archive. Are
> boolean queries possible in Nutch?
>
> 2. How do I extract all index terms from Nutch, and possibly their tf/idf
> scores too? I intend to use this information to build a function similar to
> Google Suggest: as you type, suggested terms appear based on terms actually
> in the index. I would want to put the terms and their associated scores into
> a database like PostgreSQL.
>
> Any pointers would be much appreciated!
>
> Regards,
> Nick.
>
>

Cnt of real pages in segments

luti
Dear List,

How can I determine how many real (indexed, not deleted) pages are in a
segment?
I think that if we have several backends, we need to balance the segments
between them.
I first tried using the number of fetched pages, but this does not give a
real balance.
I used the lukeall.jar tool on my WinXP client, but I can't run graphical
interfaces on the servers.

Regards,
Ferenc

Re: Cnt of real pages in segments

Andrzej Bialecki
[hidden email] wrote:
> Dear List,
>
> How can I determine how many real (indexed, not deleted) pages are in a
> segment?
> I think that if we have several backends, we need to balance the segments
> between them.
> I first tried using the number of fetched pages, but this does not give a
> real balance.
> I used the lukeall.jar tool on my WinXP client, but I can't run graphical
> interfaces on the servers.

You can use two tools:

1. nutch segread -list : this will give you the total number of records
in a segment. Note, however, that this also includes pages which failed
to be fetched or parsed.

2. You can use LuCli (in lucene/contrib) as a command-line frontend to
Lucene.
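
Once you have a page count per segment from either tool, the balancing step
can be sketched roughly as below. This is plain Java, not a Nutch feature:
SegmentBalancer and assign are hypothetical names, the counts are assumed to
be already collected into a map, and the strategy (largest segment first,
always onto the least-loaded backend) is just a simple greedy heuristic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: spreads segments over backends by page count.
final class SegmentBalancer {
    // Returns one list of segment names per backend (index = backend number).
    static List<List<String>> assign(Map<String, Integer> docsPerSegment, int backends) {
        if (backends <= 0) {
            throw new IllegalArgumentException("backends must be positive");
        }
        List<Map.Entry<String, Integer>> segs = new ArrayList<>(docsPerSegment.entrySet());
        // Largest segments first; ties broken by name so the plan is repeatable.
        segs.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<List<String>> plan = new ArrayList<>();
        long[] load = new long[backends];
        for (int i = 0; i < backends; i++) {
            plan.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> seg : segs) {
            // Pick the backend holding the fewest pages so far.
            int target = 0;
            for (int i = 1; i < backends; i++) {
                if (load[i] < load[target]) target = i;
            }
            plan.get(target).add(seg.getKey());
            load[target] += seg.getValue();
        }
        return plan;
    }
}
```

The result is only as good as the counts you feed in, so use the number of
indexed, non-deleted pages rather than the raw record count where you can.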


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Search Script

quovadis
Hi

Does anyone know of a way to get the "real" number of documents
shown to the user for a particular search when the persite
variable is active (not 0), as opposed to the total number of documents
returned?

Does anyone understand what I mean?