Restricting query to a domain

Restricting query to a domain

Bill de hOra
Hi,

I need to provide a search that allows a person to restrict results to
a specific domain (and probably a group of them). AFAICT that's not
supported (apologies if I'm wrong). Before I go rolling my own, are there
plans to support anything like "site:"?

cheers
Bill

RE: Restricting query to a domain

Bogdan Kecman
Use the "query-site" plugin. It supports the site field.
Also, if you look at

NutchBean.search(query, start + hitsToRetrieve, hitsPerSite, "site", sort,
reverse);

you will notice that you can get results grouped
by the site field, that is, at most hitsPerSite
results per site.
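
To picture what "at most hitsPerSite results per site" means, here is a minimal sketch in plain Java. It is not Nutch code and does not show how NutchBean implements grouping; the class name PerSiteLimiter and the {site, url} array shape are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of per-site grouping (hypothetical class, not part of Nutch).
class PerSiteLimiter {

    // Keeps hits in their ranked order, but at most hitsPerSite hits from any one site.
    // Each hit is a two-element array {site, url}; this shape is an assumption.
    static List<String[]> limit(List<String[]> rankedHits, int hitsPerSite) {
        Map<String, Integer> seenPerSite = new HashMap<>();
        List<String[]> kept = new ArrayList<>();
        for (String[] hit : rankedHits) {
            // merge() returns the updated count for this hit's site
            int count = seenPerSite.merge(hit[0], 1, Integer::sum);
            if (count <= hitsPerSite) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> hits = Arrays.asList(
            new String[] {"www.aaa.com", "http://www.aaa.com/1"},
            new String[] {"www.aaa.com", "http://www.aaa.com/2"},
            new String[] {"www.bbb.com", "http://www.bbb.com/1"});
        // With hitsPerSite = 1, the second www.aaa.com hit is dropped.
        for (String[] hit : limit(hits, 1)) {
            System.out.println(hit[1]);
        }
    }
}
```

The limit is applied after ranking here, so the best-ranked hits of each site survive; whether Nutch applies it at the same stage is not something this thread establishes.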

This works with 0.7.1. I don't know about 0.7.2 and 0.8, as
I have had no time to check them, but there should not be
much difference.

Note that site: is a filter, so a query like

 findme andme site:"www.aaa.com"

will limit the result set to www.aaa.com only, but the query

 site:"www.aaa.com"

is an empty query and will not return anything.

Hope this helps
Bogdan

> -----Original Message-----
> From: Bill de hÓra [mailto:[hidden email]]
> Sent: Sunday, June 18, 2006 6:33 PM
> To: [hidden email]
> Subject: Restricting query to a domain
>
> Hi,
>
> I need to provide a search that allows a person to restrict
> results to a specific domain (and probably a group of them).
> AFAICT that's not supported (apologies if I'm wrong). Before
> I go rolling my own, are there plans to support anything like "site:"?
>
> cheers
> Bill

Re: Restricting query to a domain

Stefan Neufeind
Bogdan Kecman wrote:

> Use the "query-site" plugin. It supports the site field.
> Also, if you look at
>
> NutchBean.search(query, start + hitsToRetrieve, hitsPerSite, "site", sort,
> reverse);
>
> you will notice that you can get results grouped
> by the site field, that is, at most hitsPerSite
> results per site.
>
> This works with 0.7.1. I don't know about 0.7.2 and 0.8, as
> I have had no time to check them, but there should not be
> much difference.
>
> Note that site: is a filter, so a query like
>
>  findme andme site:"www.aaa.com"
>
> will limit the result set to www.aaa.com only, but the query
>
>  site:"www.aaa.com"
>
> is an empty query and will not return anything.

Why won't that return anything?

And is grouping with "brackets" somehow possible? I know the query
below does not work, but it would be nice if it could, wouldn't it?

        abc && (site:"www.aaa.com" || site:"www.bbb.com")



Regards,
 Stefan


RE: Restricting query to a domain

Bogdan Kecman
 

> > Note that site: is a filter, so a query like
> >
> >  findme andme site:"www.aaa.com"
> >
> > will limit the result set to www.aaa.com only, but the query
> >
> >  site:"www.aaa.com"
> >
> > is an empty query and will not return anything.
>
> Why won't that return anything?

Well, as I understand it (and I must admit I'm no Nutch expert, having played with
it for only a few weeks), site:"something" is just a query filter, meaning it filters
the main query, so to speak. If you do not have a main query, there is
nothing to filter. As I said, this might be completely untrue, but
this is how I understood it.

>
> And is grouping with "brackets" somehow possible? I know the
> thing mentioned below does not work - but would be nice if it
> could, wouldn't it?
>
> abc && (site:"www.aaa.com" || site:"www.bbb.com")

Well, Lucene will allow you to do this. As I remember,
+content:"abc" +(site:"www.aaa.com" site:"www.bbb.com")
will do the trick. For some reason this does not work in Nutch
search, though there must be a way to make it work (along with the nicer syntax such as
something~ som*ing some?hing).
Now, some of the experts could remain silent or shed some light on it :) I
spent a lot of time going through the archives and wiki to get to the point where I
can write useful plugins and use the system, although I still miss some basics
(like the unanswered question of the difference between field and raw-field).
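
For reference, the query string above can be assembled for any group of sites with a few lines of plain Java. This is only a sketch: SiteGroupQuery and its build method are hypothetical names, not part of Lucene or Nutch, and the code only produces a string in Lucene's query syntax. Whether a given Nutch version accepts that string is exactly the open question in this thread.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene or Nutch.
class SiteGroupQuery {

    // Builds: +content:"<phrase>" +(site:"a" site:"b" ...)
    // The leading + marks a clause as required; the clauses inside the
    // parentheses have no prefix, so matching any one of them satisfies the group.
    // Note: no escaping is done, so phrase and sites must not contain quotes.
    static String build(String phrase, List<String> sites) {
        if (sites.isEmpty()) {
            throw new IllegalArgumentException("at least one site is required");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("+content:\"").append(phrase).append("\" +(");
        for (int i = 0; i < sites.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append("site:\"").append(sites.get(i)).append('"');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(build("abc", Arrays.asList("www.aaa.com", "www.bbb.com")));
    }
}
```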

Bogdan

Migrating crawled data (urls) from version 0.7.1 to 0.8-dev.

Bipin Parmar
Hi,

Is there any way to migrate segments and webdb data generated using 0.7.1 to 0.8-dev version? I ask this question because the directory structures and files are different between the two versions.

Thanks,

Bipin


Re: Migrating crawled data (urls) from version 0.7.1 to 0.8-dev.

Thomas Delnoij-3
Unfortunately this is only feasible with *a lot* of custom code.
Probably you will be done sooner refetching and indexing your pages.

Rgrds, Thomas


On 6/19/06, [hidden email] <[hidden email]> wrote:

> Hi,
>
> Is there any way to migrate segments and webdb data generated using 0.7.1 to 0.8-dev version? I ask this question because the directory structures and files are different between the two versions.
>
> Thanks,
>
> Bipin
>
>
>

Re: Migrating crawled data (urls) from version 0.7.1 to 0.8-dev.

Andrzej Białecki-2
TDLN wrote:
> Unfortunately this is only feasible with *a lot* of custom code.
> Probably you will be done sooner refetching and indexing your pages.

I confirm. Theoretically, you probably could use some classloader
tricks to load both versions of classes and libraries, and then use
other (temporary container) classes loaded from the parent classloader
to transfer the data. But it would be a LOT of pain to code it well and
reliably ...
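
A rough sketch of the classloader arrangement described above, in plain Java. It only shows the skeleton: two sibling loaders sharing one parent, with a class from the parent acting as the common container. TwoVersionLoaders is a hypothetical name, the URL arrays are left empty here (in a real attempt they would point at the 0.7.1 and 0.8 jars, whose paths are not given in this thread), and java.util.HashMap merely stands in for the "temporary container" class.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch, not Nutch code.
class TwoVersionLoaders {

    // One isolated loader per version; both delegate to the same parent.
    static URLClassLoader isolatedLoader(URL[] versionJars, ClassLoader parent) {
        return new URLClassLoader(versionJars, parent);
    }

    // True if both loaders resolve className to the very same Class object,
    // which is what makes instances of it safe to hand from one side to the other.
    static boolean sameClass(ClassLoader a, ClassLoader b, String className) {
        try {
            return a.loadClass(className) == b.loadClass(className);
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        ClassLoader parent = TwoVersionLoaders.class.getClassLoader();
        // Assumption: empty jar lists keep the sketch runnable without Nutch.
        try (URLClassLoader oldVersion = isolatedLoader(new URL[0], parent);
             URLClassLoader newVersion = isolatedLoader(new URL[0], parent)) {
            // The container class comes from the parent, so both sides share it.
            System.out.println(sameClass(oldVersion, newVersion, "java.util.HashMap"));
        }
    }
}
```

Classes that exist only in each version's own jars would be distinct on each side, so the real work (and the pain mentioned above) lies in copying every record into container types that both sides can see.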

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Migrating crawled data (urls) from version 0.7.1 to 0.8-dev.

Bipin Parmar
Thank you, Andrzej and Thomas. Your answers make
sense. Re-crawling is not a problem because I had not
crawled a lot anyway and am still in development. I
had about 700,000 urls. I worry about directory
structure & file formats because I plan to crawl 100M
urls and have some manual effort after crawling that I
would not like to repeat as new versions of nutch are
released.

I also ask the question because I read that people
have crawled/indexed hundreds of millions of pages
using Nutch. I assumed that they must have used Nutch
0.7.1 or a prior version, and therefore I want to know
how they would migrate to the Nutch 0.8 production
release.

Thanks again to both of you.

Bipin Parmar


Re: Migrating crawled data (urls) from version 0.7.1 to 0.8-dev.

Thomas Delnoij-3
I would say that with any (open source) software <= version 1.0 you
should *plan* for change.

Rgrds, Thomas




On 6/23/06, Bipin Parmar <[hidden email]> wrote:

> Thank you, Andrzej and Thomas. Your answers make
> sense. Re-crawling is not a problem because I had not
> crawled a lot anyway and am still in development. I
> had about 700,000 urls. I worry about directory
> structure & file formats because I plan to crawl 100M
> urls and have some manual effort after crawling that I
> would not like to repeat as new versions of nutch are
> released.
>
> I also ask the question because I read that people
> have crawled/indexed hundreds of millions of pages
> using Nutch. I assumed that they must have used nutch
> 0.7.1 or prior version. And therefore I want to know
> how they would migrate to nutch 0.8 production
> release.
>
> Thanks again to both of you.
>
> Bipin Parmar

RE: Restricting query to a domain

Rajasekar Karthik
In reply to this post by Bogdan Kecman
I recently came across the same problem. Here are the interesting points I found (maybe old news for you by now, but reporting just in case), with another solution at the end of my reply.

a) Search depends on which analyzer you use for query parsing. I found that
org.apache.lucene.analysis.KeywordAnalyzer
org.apache.lucene.analysis.standard.StandardAnalyzer

yield results for a query similar to yours:
+content:"abc" +(site:"www.aaa.com" site:"www.bbb.com")

Maybe Nutch is using a different analyzer. With something like
org.apache.lucene.analysis.SimpleAnalyzer
I don't get any results.

I believe the lib-lucene-analyzers plugin might contain different analyzers which you could use. I'm not sure, as I haven't tried it; some expert opinion would help here.
I found this post useful for changing the analyzers:
http://www.nabble.com/forum/ViewPost.jtp?post=1036075&framed=y


b) [This method works in Nutch] An alternative (maybe a little inefficient): in Nutch you could use
<query> -site:"ccc.com" -site:"ddd.com"
This eliminates results from these sites from the original search results.
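
A small sketch of how the exclusion form in (b) might be generated when the goal is to keep only a group of sites. It assumes the full list of crawled sites is known up front, which nothing in this thread guarantees; SiteExclusionQuery and its methods are hypothetical names, not part of Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
class SiteExclusionQuery {

    // Returns every known site that is not in the allowed list, in original order.
    static List<String> sitesToExclude(List<String> knownSites, List<String> allowedSites) {
        List<String> excluded = new ArrayList<>();
        for (String site : knownSites) {
            if (!allowedSites.contains(site)) {
                excluded.add(site);
            }
        }
        return excluded;
    }

    // Appends one -site:"..." clause per excluded site to the user's query.
    static String build(String userQuery, List<String> excludedSites) {
        StringBuilder sb = new StringBuilder(userQuery);
        for (String site : excludedSites) {
            sb.append(" -site:\"").append(site).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("aaa.com", "bbb.com", "ccc.com", "ddd.com");
        List<String> allowed = Arrays.asList("aaa.com", "bbb.com");
        // prints: findme -site:"ccc.com" -site:"ddd.com"
        System.out.println(build("findme", sitesToExclude(known, allowed)));
    }
}
```

The query grows with the number of excluded sites, which is one reason this approach may be inefficient for a large crawl.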

Hope it helps someone.
