Nutch - New Features (?)

Fuad Efendi
Since we have such a strange plugin structure (DI? IoC?), and many utility
classes with a single UNIX shell script to run everything...


1. Separate concerns. Clearly.
- Crawl
- Parse
- Generate URL List
- Crawl
- ...
(The WebDB interfaces should be clearer, so that we can plug in databases, etc.)
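
To illustrate what a clearer WebDB boundary could look like, here is a rough
sketch only. All names (PageStore, InMemoryPageStore, the status strings) are
hypothetical and are not taken from Nutch. The idea is that the crawl, parse,
and generate steps would talk only to the interface, so a relational database
could be swapped in behind it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical storage boundary; not an actual Nutch interface.
interface PageStore {
    void put(String url, String status);
    Optional<String> status(String url);
    // Supports a "Generate URL List" step: all URLs currently in a given status.
    List<String> urlsWithStatus(String status);
}

// In-memory backend for illustration; a JDBC-backed class could implement
// the same interface without the crawl or parse code changing.
class InMemoryPageStore implements PageStore {
    private final Map<String, String> pages = new LinkedHashMap<>();

    @Override
    public void put(String url, String status) {
        pages.put(url, status);
    }

    @Override
    public Optional<String> status(String url) {
        return Optional.ofNullable(pages.get(url));
    }

    @Override
    public List<String> urlsWithStatus(String status) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : pages.entrySet()) {
            if (e.getValue().equals(status)) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A generate step would then only call urlsWithStatus("unfetched") (the status
value is again an assumption) and would not care where the data lives.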

1a. Data Mining (finding new language constructs)


2. Automate Classification
- Anchor text is the true subject of a page
- Page contains anchors
- Anchor text is the class of the referenced pages
Sample: the page "Network Cards" has referenced pages. The page "Computer
Hardware" has a link with anchor text "Network Cards".
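
The anchor-text idea above could be sketched roughly as follows. This is an
assumption about one possible approach, not existing Nutch code: the class
AnchorClassifier and its methods are hypothetical. Each inbound link votes
with its normalized anchor text, and the most frequent text becomes the class
of the referenced page.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; the names are illustrative and not part of Nutch.
class AnchorClassifier {
    // target URL -> (normalized anchor text -> number of inbound links using it)
    private final Map<String, Map<String, Integer>> votes = new HashMap<>();

    // Record one link pointing at targetUrl with the given anchor text.
    void addLink(String targetUrl, String anchorText) {
        String label = anchorText.trim().toLowerCase(Locale.ROOT);
        if (label.isEmpty()) {
            return; // links without text carry no class information
        }
        votes.computeIfAbsent(targetUrl, k -> new TreeMap<>())
             .merge(label, 1, Integer::sum);
    }

    // Most frequent anchor text for the page, or null if no links were seen.
    // The TreeMap makes ties resolve to the alphabetically first label.
    String classify(String targetUrl) {
        Map<String, Integer> counts = votes.get(targetUrl);
        if (counts == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

With the sample above, the link from "Computer Hardware" would be recorded as
addLink(targetUrl, "Network Cards"), and classify(targetUrl) would return
"network cards" once that text is the most common one for the page.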


3. Data Mining (???)
- String Tokenization
- Sentence
- Human Language
- AJAX, Red Rouge, Opteron, Break Barrel, Caviar, The Jacobian Conjecture,
... - different language constructs for different sites?

Nothing "Agile".

Much stuff has changed in trunk, such as 'Link' and 'WebDB'; it simplifies...

Thanks