Document Classification - indexing question

Document Classification - indexing question

Bastian Preindl
Hi,

I'd like to use Nutch to crawl parts of the web and to classify the fetched
documents automatically before indexing them. I've already investigated how
to achieve this and have read about different classification techniques such
as Bayes, SVM, and so on. I've also run some offline classification tests with
several libraries, and I think the best approach would be a binary
pre-classification (has interesting content / doesn't have interesting
content) using something similar to CRM114, followed by a fine-grained
multi-class classification of the interesting documents using an SVM or
something similar to LingPipe.

My question is now: which extension point is appropriate for such a plugin or
extension, and how can I prevent documents that are not interesting from being
indexed at all?

To illustrate my approach, I'd like to apply the following steps:

1: Fetch a new document from the web
2: Pre-classify the document (interesting / not interesting) with an already
trained filter/classifier - positive: go to 3, negative: go to 4
3: Classify the interesting document using an already trained multi-class
classifier and index it together with meta-information about the document's
class/category, go to 5
4: Throw the document's content and URL away, forget it, don't index it,
go to 5
5: Fetch the next document (go to 1)

Where are the best points to "hook in" such a classification, and how do I
tell Nutch to throw a document away completely so that it never gets indexed?
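
To make this concrete, I imagine an indexing filter plugin roughly along
these lines (just a sketch; I haven't verified the exact interface signatures
against the current Nutch version, and the two classifier methods are only
placeholders for the CRM114-style pre-filter and the SVM/LingPipe classifier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class ClassifyingFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    String text = parse.getText();

    // Step 2: binary pre-classification (placeholder for a CRM114-like filter)
    if (!isInteresting(text)) {
      // Step 4: returning null should drop the document from the index
      return null;
    }

    // Step 3: fine-grained multi-class classification (placeholder for SVM/LingPipe)
    String category = classify(text);
    doc.add(new Field("category", category, Field.Store.YES,
                      Field.Index.UN_TOKENIZED));
    return doc;
  }

  // stand-ins for the real, already trained classifiers
  private boolean isInteresting(String text) {
    return text != null && text.length() > 0;
  }

  private String classify(String text) {
    return "uncategorized";
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

If I understand the plugin system correctly, this would then be registered in
the plugin's plugin.xml against the org.apache.nutch.indexer.IndexingFilter
extension point - but that is exactly the part I'm unsure about.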

I would be very grateful if somebody could provide some hints on this or
(even better) a field report on how this can be achieved.

Thank you very much in advance

Bastian

RE: Document Classification - indexing question

Armel T. Nene
Bastian,

I have been working on a similar project for the last couple of months, but I
am taking a slightly different approach, because fetching - parsing - indexing
can be time-consuming and, in my case, I also need the unclassified indexes.
Using a classification algorithm and the Lucene API, I build classified
indexes, using the first (unclassified) index as the corpus.
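
In rough terms the loop looks something like the following (heavily simplified
and from memory, so take the Lucene calls and field names as a sketch rather
than working code; it also assumes the "content" field was stored in the
first index):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class ReclassifyIndex {

  public static void main(String[] args) throws Exception {
    // the unclassified index produced by the normal Nutch crawl/index cycle
    IndexReader reader = IndexReader.open("crawl/index");
    // a second, classified index built from the first one
    IndexWriter writer =
        new IndexWriter("classified-index", new StandardAnalyzer(), true);

    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) {
        continue;
      }
      Document doc = reader.document(i);
      String content = doc.get("content"); // only works if "content" is stored
      if (content == null) {
        continue;
      }
      // plug in whatever classifier you like here (Bayes, SVM, ...)
      String category = classify(content);
      doc.add(new Field("category", category, Field.Store.YES,
                        Field.Index.UN_TOKENIZED));
      writer.addDocument(doc);
    }

    writer.optimize();
    writer.close();
    reader.close();
  }

  // trivial stand-in for the real classification algorithm
  private static String classify(String text) {
    return text.toLowerCase().indexOf("nutch") >= 0 ? "search-engines" : "other";
  }
}

The original index stays untouched, which is why I still have the
unclassified data available afterwards.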

Maybe we should discuss this over Skype or MSN - let me know. My Skype name is
etapix.


Re: Document Classification - indexing question

Bastian Preindl
Hi Armel,

thanks for your quick reply!

> I have been working on a similar project for the last couple of months, but I
> am taking a slightly different approach, because fetching - parsing - indexing
> can be time-consuming and, in my case, I also need the unclassified indexes.
> Using a classification algorithm and the Lucene API, I build classified
> indexes, using the first (unclassified) index as the corpus.

This is definitely a good idea and a somewhat different approach, as it moves
the classification task out of Nutch and into Lucene. Are there any
frameworks/plugins already available for applying document classification
within Lucene? The much faster parsing and indexing process within Nutch when
no "online" classification takes place has to be weighed against the disk
space consumption, which is potentially thousands of times greater when all
parsed documents are indexed instead of only the positively classified ones.

> Maybe we should discuss this over Skype or MSN - let me know. My Skype name is
> etapix.

That would be really nice, thanks for the offer! I'll let you know my MSN
account once I've created one.

Best regards

Bastian

RE: Document Classification - indexing question

Armel T. Nene
Bastian,

When trying to classify documents with the dynamic (online) classification
approach, Nutch can take a while to parse the data, depending on the file
type. While working with Nutch I have encountered some null pointer
exceptions caused by the parsing processes. This is due to a Hadoop setting
that is not exposed in the nutch-default.xml file. The setting should allow
you to increase the time Hadoop waits before marking a task as inactive.
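
If I remember correctly, the property involved is Hadoop's mapred.task.timeout;
putting something like this into nutch-site.xml (or hadoop-site.xml) and
raising the value should give the parsers more time before the task is killed
(the value below is just an example, in milliseconds):

<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
  <description>Milliseconds before a task is terminated if it neither reads
  input, writes output, nor updates its status string.</description>
</property>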

Some questions you should investigate: How will your classification process
handle failed parses, and what happens if the data cannot be parsed into text
(i.e. an unsupported file type)? What happens to the index being built if the
classification fails - does it become corrupted? In a multithreaded
environment such as Nutch, what happens to concurrent classification
processes - can their data get mixed up? I currently have a problem with
Nutch where it seems unable to generate dynamic fields per document when
using more than a single thread: the index becomes corrupted, with data from
different files ending up in the wrong Lucene document. There are many other
questions that will come up once you start working on your classification
project.

Best regards

Armel
