getting seed list for vertical search engine

getting seed list for vertical search engine

DS jha
Hello,
We are developing a vertical search engine for the medical industry,
and I need to estimate server and sizing requirements to set up my
environment. How do I estimate how many documents I will be fetching
for a particular vertical? And where do I get a seed list of all the
sites? Will the dmoz health category be sufficient, or will I have to
purchase a seed list?

Thanks

Re: getting seed list for vertical search engine

Otis Gospodnetic-2-2
Sizing seems to be a common request.  I think the best you can do is use existing search engines to estimate how many pages the sites you are interested in have.  You will have to know the exact sites (their URLs) and use the "site:" search operator (Google, Yahoo).  Yahoo also has something called SiteExplorer that might help.  Getting the seed list is typically a (semi-)manual process.
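
As a rough illustration of the sizing approach above, here is a minimal Python sketch that turns per-site page counts into a document and storage estimate. Everything in it is an assumption for illustration: the site names and counts are made up (in practice they would be read off "site:" queries by hand), and the average page size and index overhead are placeholder values you would need to measure for your own crawl.

```python
# Hypothetical per-site page counts, e.g. noted down from "site:" queries.
site_page_counts = {
    "example-hospital.org": 12_000,
    "example-medjournal.com": 85_000,
    "example-clinic.net": 3_500,
}

AVG_PAGE_KB = 50        # assumed average size of a fetched page, in KB
INDEX_OVERHEAD = 0.3    # assumed index size as a fraction of raw content


def estimate_sizing(page_counts, avg_page_kb=AVG_PAGE_KB,
                    index_overhead=INDEX_OVERHEAD):
    """Return a rough page count and storage estimate in GB."""
    total_pages = sum(page_counts.values())
    raw_gb = total_pages * avg_page_kb / (1024 * 1024)  # KB -> GB
    index_gb = raw_gb * index_overhead
    return {"pages": total_pages, "raw_gb": raw_gb, "index_gb": index_gb}


estimate = estimate_sizing(site_page_counts)
print(estimate["pages"], round(estimate["raw_gb"], 2))
```

The counts reported by search engines are themselves estimates, so the result is only an order-of-magnitude guide.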

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----

> From: DS jha <[hidden email]>
> To: [hidden email]
> Sent: Monday, June 16, 2008 11:04:06 PM
> Subject: getting seed list for vertical search engine


Re: getting seed list for vertical search engine

DS jha
Thanks for your reply. However, the problem with this approach is that
you have to know the set of websites first, whereas we are using a
focused crawling approach to build our vertical. The idea is that the
crawler will be able to determine which outlinks to fetch (or discard).
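
To make the focused crawling idea concrete, here is a minimal Python sketch of a frontier that scores each outlink and either queues or discards it. It is only an illustration under stated assumptions: the term list, the anchor-text scoring, the threshold, and the in-memory TOY_WEB link graph are all hypothetical stand-ins, and no real fetching is done.

```python
import heapq

# Assumed vocabulary for the medical vertical (illustrative only).
MEDICAL_TERMS = {"health", "medical", "clinic", "disease",
                 "treatment", "hospital"}


def relevance(text, terms=MEDICAL_TERMS):
    """Fraction of words in the text that are known vertical terms."""
    words = [w.strip(".,;:()") for w in text.lower().split()]
    if not words:
        return 0.0
    return sum(1 for w in words if w in terms) / len(words)


def focused_crawl(seeds, get_outlinks, threshold=0.2, max_pages=100):
    """Visit pages best-first, discarding low-scoring outlinks."""
    frontier = [(-1.0, url) for url in seeds]  # negate: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched = []
    while frontier and len(fetched) < max_pages:
        _, url = heapq.heappop(frontier)
        fetched.append(url)
        for link, anchor_text in get_outlinks(url):
            if link in seen:
                continue
            seen.add(link)
            score = relevance(anchor_text)
            if score >= threshold:  # otherwise the outlink is discarded
                heapq.heappush(frontier, (-score, link))
    return fetched


# Hypothetical link graph standing in for real fetches.
TOY_WEB = {
    "http://seed.example/": [
        ("http://a.example/", "diabetes treatment clinic"),
        ("http://b.example/", "cheap car insurance"),
    ],
    "http://a.example/": [("http://c.example/", "hospital directory")],
}


def toy_outlinks(url):
    return TOY_WEB.get(url, [])


print(focused_crawl(["http://seed.example/"], toy_outlinks))
```

A real focused crawler would usually score the fetched page content as well, not just the anchor text, but the fetch-or-discard decision has the same shape.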

Another problem with manually preparing a seed list from the known site
list is that I am sure to miss lots of small, individual sites. I
wonder how Google, MSN, and Yahoo do it. They must be getting lists
from ISPs, hosting providers, etc.?

Thanks
Jha

On Mon, Jun 16, 2008 at 11:15 PM, Otis Gospodnetic
<[hidden email]> wrote:


Re: getting seed list for vertical search engine

Otis Gospodnetic-2-2
In reply to this post by DS jha
Jha,

Nutch doesn't include anything that would let it decide which pages are good and should be kept in your vertical search index and which should be discarded.  One could write a custom plugin that does this type of classification, though.
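
For a sense of what such classification logic might look like, here is a minimal Python sketch of a keep/discard page filter based on a small naive Bayes model. This is not a Nutch plugin and does not use any Nutch API (a real plugin would be written in Java); the class name, labels, and training sentences are all hypothetical, and a usable classifier would need far more training data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    tokens = (w.strip(".,;:()").lower() for w in text.split())
    return [t for t in tokens if t]


class NaiveBayesPageFilter:
    """Tiny multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"keep": Counter(), "discard": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label, vocab_size):
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / total_docs)
        for t in tokens:
            count = self.word_counts[label][t]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def keep(self, text):
        """True if the page text looks more like 'keep' than 'discard'.

        Assumes at least one training example exists for each label.
        """
        tokens = tokenize(text)
        vocab = set(self.word_counts["keep"]) | set(self.word_counts["discard"])
        size = len(vocab)
        return (self._log_score(tokens, "keep", size)
                > self._log_score(tokens, "discard", size))


# Made-up training examples for illustration.
page_filter = NaiveBayesPageFilter()
page_filter.train("symptoms and treatment of type 2 diabetes", "keep")
page_filter.train("clinical trial results for asthma medication", "keep")
page_filter.train("used cars for sale at low prices", "discard")
page_filter.train("celebrity gossip and movie reviews", "discard")

print(page_filter.keep("new treatment for asthma symptoms"))
print(page_filter.keep("movie reviews and cars"))
```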


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----

> From: DS jha <[hidden email]>
> To: [hidden email]
> Sent: Tuesday, June 17, 2008 2:11:35 PM
> Subject: Re: getting seed list for vertical search engine