nutch/lucene question..


i have a possible project where i'm looking at extracting information from
various public/college websites. i don't need to index the text/content of
the sites. i do need to extract specific information.

as an example, a site might have a course schedule page, which in turn has
links to the department pages, which in turn have links to the class
information pages. as a tree structure this would be:
  course listings by semester
     departments for the semester
       classes of each department
          class information

obviously nutch/lucene has the ability to crawl a given site. does it also
have the ability to somehow 'link'/maintain a given relationship to the
upstream page for a given piece of information?

for my needs, i need to keep track of the semester i started from as i follow
the "link" to the department, etc... this approach allows me to then store the complete
course information in a db, so i can then iterate through the course
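
to make the "maintain the upstream relationship" idea concrete, here is a
rough python sketch of what i mean. it is not nutch code and uses no nutch
api. the PAGES dict, the urls, the level names and the crawl() function are
all made up for illustration, and a real crawler would fetch and parse pages
instead of reading a dict. the point is only that each queued link carries
the context (semester, department) of the page it was found on:

```python
from collections import deque

# hypothetical in-memory "site": url -> (page level, label, outgoing links).
# every url, label and level name here is invented for illustration.
PAGES = {
    "/schedule": ("root", None, ["/fall", "/spring"]),
    "/fall": ("semester", "fall", ["/fall/cs"]),
    "/spring": ("semester", "spring", ["/spring/cs", "/spring/math"]),
    "/fall/cs": ("department", "cs", ["/fall/cs/101"]),
    "/spring/cs": ("department", "cs", ["/spring/cs/102"]),
    "/spring/math": ("department", "math", ["/spring/math/201"]),
    "/fall/cs/101": ("class", "cs 101", []),
    "/spring/cs/102": ("class", "cs 102", []),
    "/spring/math/201": ("class", "math 201", []),
}


def crawl(start, pages):
    """breadth-first crawl where every queued url carries its upstream context."""
    records = []
    seen = set()
    queue = deque([(start, {})])
    while queue:
        url, context = queue.popleft()
        if url in seen or url not in pages:
            continue
        seen.add(url)
        level, label, links = pages[url]
        if label is not None:
            # copy the parent's context and add this page's own level to it
            context = {**context, level: label}
        if level == "class":
            # leaf page: the context now holds semester, department and class
            records.append({**context, "url": url})
        for link in links:
            # hand the accumulated context down to every child link
            queue.append((link, context))
    return records


records = crawl("/schedule", PAGES)
```

each entry in records is one flat row (semester, department, class, url) that
could be written to a db table. note that with the seen set, a page reachable
from two parents only keeps the context of the first parent that reached it.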

i can accomplish this now by creating a unique crawling app for each
school. my curiosity is whether nutch/lucene can provide a basic crawling
engine that i could then plug into for my specific needs. i'm also curious
as to the amount of additional development that would have to be done for
my needs...