How can I modify the crawler?

How can I modify the crawler?

suxiaoke79
 
  I want to build a topic-based search engine by modifying Nutch. For example, if I define a "computer" topic, I want searches to return only information about computers. I can't find the appropriate point in Fetcher.java where I can insert my own statements. Could you please tell me how to modify the Fetcher and the parser? Thanks.
 
 

Re: How can I modify the crawler?

Jim R. Wilson
To answer your question, some more information is needed:

1) How do you decide which "topic" a particular page belongs to?  URL
segments?  The title?  Other HTML page elements?  Latent Semantic Analysis
(http://en.wikipedia.org/wiki/Latent_semantic_indexing)?

2) Given a topic, how will your end users find pages on this topic?
Search?  Link navigation?  Hierarchical categories?

3) If the answer to question 2 was "search", how is your topic search
different from the standard Nutch search?

4) Do you control all the content or the servers hosting the content (like
in an Intranet)?

I ask these because your question, though simply stated, is not necessarily
an easy problem to solve.  Any solution will probably require hooking into
Nutch at several different locations.
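
To make question 1 concrete, here is a rough sketch of about the simplest
possible answer: count how many topic keywords appear in a page's title and
text. It is plain Java and does not call any Nutch API. The class name
TopicMatcher, the keyword list, and the minHits threshold are all made up for
illustration. In Nutch, a check like this would more likely live in a plugin
(for example a parse filter or indexing filter) than in Fetcher.java itself,
but treat that as an assumption until the questions above are answered.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch: decides whether a page is on-topic. */
public class TopicMatcher {
    private final Set<String> keywords = new HashSet<>();
    private final int minHits;

    /**
     * @param topicKeywords words that characterize the topic (assumed, hand-picked)
     * @param minHits       how many distinct keywords must appear for a match
     */
    public TopicMatcher(List<String> topicKeywords, int minHits) {
        for (String k : topicKeywords) {
            keywords.add(k.toLowerCase(Locale.ROOT));
        }
        this.minHits = minHits;
    }

    /** Counts the distinct topic keywords found in the title and body text. */
    public int countHits(String title, String text) {
        String combined = (title + " " + text).toLowerCase(Locale.ROOT);
        Set<String> seen = new HashSet<>();
        // Split on anything that is not a letter or a digit.
        for (String token : combined.split("[^\\p{L}\\p{Nd}]+")) {
            if (keywords.contains(token)) {
                seen.add(token);
            }
        }
        return seen.size();
    }

    /** A page is on-topic if it mentions at least minHits distinct keywords. */
    public boolean matches(String title, String text) {
        return countHits(title, text) >= minHits;
    }

    public static void main(String[] args) {
        TopicMatcher computers =
            new TopicMatcher(Arrays.asList("computer", "software", "hardware"), 2);
        // prints true
        System.out.println(computers.matches("Computer news", "New software released today"));
        // prints false
        System.out.println(computers.matches("Cooking tips", "A recipe for soup"));
    }
}
```

Exact keyword matching like this misses plurals and synonyms ("computers" does
not match "computer"), which is one reason the answer to question 1 matters so
much.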

Also, I'm curious as to why you want topic-based search.  Are you trying to
provide clustered results like Vivisimo (http://vivisimo.com/)?

-- Jim
