Can I "chunk" during the crawl?

Paul Tomblin
Forgive me if this is a bit of a n00b question.
I've been tasked with taking someone else's code and replacing all the
DieselPoint code with Lucene/Nutch.  What they do in DieselPoint is crawl
specific parts of the web, split the returned pages into "chunks" using some
proprietary logic, and then index the chunks themselves.  I think they do it
in a rather naive way: it appears that DieselPoint crawls and indexes the
pages, then this code goes through the index and creates chunk files
(possibly several from any given page), and then DieselPoint is set loose
to crawl and index those chunk files.  The app then uses *that* index in
proprietary searches.
I'm trying to learn my way around Nutch, and I'm wondering if there might be
a way to get rid of the separate chunking stage by chunking directly during
the initial crawl, possibly by writing a plugin.  Unfortunately I'm under NDA
so I can't give away too much of what the chunking process does, but I hope
I've given enough information on what I'm trying to do.  Is that possible?
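
For illustration only, here is a minimal sketch in plain Java of the one-pass
idea: a fetched page goes in, and several separately indexable chunks come
out.  Everything in it is an assumption made for the example: the names
ChunkSketch and Chunk, the blank-line splitting rule (a stand-in for the
proprietary logic), and the "#chunk-N" id scheme.  It uses no Nutch or Lucene
API, since where such a step would hook into Nutch is exactly the open
question.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSketch {

    /** One indexable unit derived from a fetched page (hypothetical type). */
    static final class Chunk {
        final String id;   // assumed key: source URL plus chunk ordinal
        final String text; // the chunk's own text, indexed on its own

        Chunk(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /**
     * Placeholder splitter: breaks the page text on blank lines.
     * The real splitting rule is proprietary and would replace this body.
     */
    static List<Chunk> chunk(String url, String pageText) {
        List<Chunk> chunks = new ArrayList<>();
        for (String part : pageText.split("\\n\\s*\\n")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                // Each non-empty piece becomes its own entry to be indexed.
                chunks.add(new Chunk(url + "#chunk-" + chunks.size(), trimmed));
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String page = "First section.\n\nSecond section.\n\n\nThird section.";
        for (Chunk c : chunk("http://example.com/page", page)) {
            System.out.println(c.id + " -> " + c.text);
        }
    }
}
```

The point of the sketch is only the shape of the data flow: one page yields
many index entries in a single pass, with no intermediate chunk files and no
second crawl.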

--
http://www.linkedin.com/in/paultomblin