Since we have such a strange plugin structure (DI? IoC?) and many utility
classes with a single UNIX shell script to run everything...
1. Separate concerns. Clearly.
- Generate URL List
(The WebDB interfaces should be clearer, so we can use databases, etc...)
1a. Data Mining (finding new language constructs)
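A minimal sketch of what a clearer WebDB boundary from item 1 could look like, so that URL list generation does not care whether the storage is a flat file or a database. Everything here is hypothetical and only for illustration: the `Link` fields, the `WebDB` method names, `InMemoryWebDB`, and `generate_url_list` are assumptions, not the actual 'Link' and 'WebDB' classes in trunk.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, NamedTuple


class Link(NamedTuple):
    """Hypothetical link record (field names are assumptions)."""
    from_url: str
    to_url: str
    anchor: str


class WebDB(ABC):
    """Hypothetical storage interface; a database backend would implement it too."""

    @abstractmethod
    def add_link(self, link: Link) -> None:
        """Store one link."""

    @abstractmethod
    def links_to(self, url: str) -> Iterator[Link]:
        """Yield every stored link that points at `url`."""

    @abstractmethod
    def urls(self) -> Iterator[str]:
        """Yield every known URL exactly once."""


class InMemoryWebDB(WebDB):
    """Simplest possible backend, useful for tests."""

    def __init__(self) -> None:
        self._links: List[Link] = []

    def add_link(self, link: Link) -> None:
        self._links.append(link)

    def links_to(self, url: str) -> Iterator[Link]:
        return (link for link in self._links if link.to_url == url)

    def urls(self) -> Iterator[str]:
        seen = set()
        for link in self._links:
            for url in (link.from_url, link.to_url):
                if url not in seen:
                    seen.add(url)
                    yield url


def generate_url_list(db: WebDB) -> List[str]:
    """The 'Generate URL List' concern, written only against the interface."""
    return sorted(db.urls())
```

With this split, the shell script would only need to pick a backend; the URL list step itself stays unchanged.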
2. Automate Classification
- Anchor text is the true subject of a page
- Page contains anchors
- Anchor text is the class of the referenced pages
Sample: the page "Computer Hardware" has a link with anchor text "Network
Cards", so the referenced page gets the class "Network Cards". That page in
turn has its own referenced pages.
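The anchor-text idea in item 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: links arrive as `(from_url, to_url, anchor_text)` tuples, anchors are normalized by collapsing whitespace and lowercasing, and when several anchors point at the same page the most frequent one wins. The function name `classify_by_anchor` and the majority-vote rule are hypothetical choices, not something the notes prescribe.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def classify_by_anchor(links: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Map each referenced URL to its most common incoming anchor text."""
    votes = defaultdict(Counter)  # to_url -> Counter of normalized anchors
    for _from_url, to_url, anchor in links:
        # Assumption: whitespace and case differences do not change the class.
        label = " ".join(anchor.split()).lower()
        if label:  # skip empty anchors such as image-only links
            votes[to_url][label] += 1
    return {url: counter.most_common(1)[0][0] for url, counter in votes.items()}


links = [
    ("hardware.html", "netcards.html", "Network Cards"),
    ("index.html", "netcards.html", "network  cards"),
    ("index.html", "netcards.html", "click here"),
    ("index.html", "hardware.html", "Computer Hardware"),
]
classes = classify_by_anchor(links)
```

Here `netcards.html` ends up with the class "network cards" (two votes against one for "click here"), which also shows why some voting or filtering is needed: not every anchor is a useful class name.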
3. Data Mining (???)
- String Tokenization
- Human Language
- AJAX, Red Rouge, Opteron, Break Barrel, Caviar, The Jacobian Conjecture,
... - different language constructs for different sites?
Much stuff changed in trunk, such as 'Link' and 'WebDB'; it simplifies...