Clustering single doc as multiple docs


Bogdan94202
Hi,

I would like to run clustering on a single document and have multiple
clusters extracted from it.
I guess I have to find a way to split the doc into multiple docs / input
vectors, but I am wondering whether there are any best practices for how to
do the split.
Should I derive vectors based on sentences or paragraphs? Is there a
paragraph boundary detection tool around?
Any recommendations would be appreciated.

Best regards,
Bogdan
Re: Clustering single doc as multiple docs

Grant Ingersoll-2
This strikes me a little bit as an XY problem: http://people.apache.org/~hossman/#xyproblem

Perhaps it would be helpful if you could back up a little and describe the higher level problem you are trying to solve.  You certainly can split up your documents and then cluster them, but I'm not sure that is actually going to give you what you need.

Cheers,
Grant

On Apr 30, 2010, at 5:29 AM, Bogdan Vatkov wrote:

> Hi,
>
> I would like to run clustering on a single document and have multiple
> clusters extracted from it.
> I guess I have to find a way to split the doc into multiple docs / input
> vectors, but I am wondering whether there are any best practices for how to
> do the split.
> Should I derive vectors based on sentences or paragraphs? Is there a
> paragraph boundary detection tool around?
> Any recommendations would be appreciated.
>
> Best regards,
> Bogdan



Re: Clustering single doc as multiple docs

Bogdan94202
Hi Grant,

You are probably right.
What I want is to use my Mahout setup to extract topics from a single
document.
So, in popular terms, I am trying to do topic extraction via document
clustering.
Does it make sense to split a doc into sub-docs so that I can leverage
the clustering algorithm and thus find the key topics of the
document?

Best regards,
Bogdan

On Fri, Apr 30, 2010 at 6:18 PM, Grant Ingersoll <[hidden email]> wrote:

> This strikes me a little bit as an XY problem:
> http://people.apache.org/~hossman/#xyproblem
>
> Perhaps it would be helpful if you could back up a little and describe the
> higher level problem you are trying to solve.  You certainly can split up
> your documents and then cluster them, but I'm not sure that is actually
> going to give you what you need.
>
> Cheers,
> Grant

Re: Clustering single doc as multiple docs

Robin Anil
On Fri, Apr 30, 2010 at 10:40 PM, Bogdan Vatkov <[hidden email]> wrote:

> Hi Grant,
>
> You are probably right.
> What I want is to use my Mahout setup to extract topics from a single
> document.
> So, in popular terms, I am trying to do topic extraction via document
> clustering.
> Does it make sense to split a doc into sub-docs so that I can leverage
> the clustering algorithm and thus find the key topics of the
> document?
>
Have you heard of LDA (it's in Mahout)? Or are you trying to do something
different for topic extraction?


Re: Clustering single doc as multiple docs

Grant Ingersoll-2

On Apr 30, 2010, at 1:15 PM, Robin Anil wrote:

> Have you heard of LDA (it's in Mahout)? Or are you trying to do something
> different for topic extraction?

That's more across docs.  You might also have a look at TextRank, which is a graph-based approach to keyword/topic extraction that is nice to implement (one of these days, I'll do it in Mahout).
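
For readers who want to see the shape of the idea, here is a minimal Python sketch of TextRank-style keyword extraction on a single document. It is a simplification, not a Mahout implementation: the function name `textrank_keywords`, the tiny stopword list, the window size, and the damping factor of 0.85 are all assumptions made for illustration, and the stopword filter stands in for the part-of-speech filtering a fuller implementation would use.

```python
import re
from collections import defaultdict

# Small illustrative stopword list (an assumption; use a real one in practice).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
             "of", "on", "or", "that", "the", "this", "to", "with"}


def textrank_keywords(text, window=3, damping=0.85, iterations=50, top_k=5):
    """Rank words by a PageRank-style score on a word co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]

    # Undirected graph: link words that occur within `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Fixed number of score-propagation rounds over the graph.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Calling `textrank_keywords(document_text, top_k=10)` returns up to ten candidate keywords; words that co-occur with many different neighbours tend to rise to the top.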

Re: Clustering single doc as multiple docs

Bogdan94202
I will check it, but I am not sure I have the right knowledge to
implement it. Is there a ready-to-use implementation somewhere?
Btw, why do you think splitting and clustering won't work? Has anybody
tried this?
I am not sure it will be successful, but I also have no arguments
for why it should not lead to a meaningful result.
If I split a doc per sentence I might not get good results, but if I use
larger pieces, e.g. paragraphs, it might give some topics (sets of keywords).
Has anyone tried something like this?

On Fri, Apr 30, 2010 at 8:24 PM, Grant Ingersoll <[hidden email]> wrote:

> That's more across docs.  You might also have a look at TextRank, which is
> a graph-based approach to keyword/topic extraction that is nice to implement
> (one of these days, I'll do it in Mahout).




--
Best regards,
Bogdan

Re: Clustering single doc as multiple docs

Ted Dunning
Yes.  Splitting by paragraph should work fine (been there, done that).

Splitting by sentence works well if you do something like SVD to smooth
over the fact that you have few words per sentence.
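
One way to read "something like SVD" here is latent semantic analysis; that reading is an assumption, not something spelled out in this thread. As a sketch, let $A$ be a term-by-sentence count matrix with $m$ terms and $n$ sentences, let $e_j$ be the $j$-th standard basis vector, and let $k$ be a small rank chosen by hand:

```latex
A = U \Sigma V^{\top}, \qquad
A_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{s}_j = \Sigma_k V_k^{\top} e_j \in \mathbb{R}^{k}
```

Each sentence $j$ would then be clustered using the dense $k$-dimensional vector $\hat{s}_j$ instead of its sparse column of $A$, so two sentences can come out similar even when they share few literal words.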

Splitting by paragraph is pretty easy, but corpus-specific.  For plain text,
try looking for blank lines.  For HTML, make a list of breaking markup and
insert split points wherever you find those.  For other formats you will
need to put on your thinking cap.
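
A minimal Python sketch of both cases, assuming regular expressions are good enough for the corpus at hand. The function names and the `BREAKING_TAGS` list are illustrative assumptions; the set of breaking markup would need to be tuned to the HTML actually being processed.

```python
import re

# Tags treated as paragraph breaks. This list is an assumption and should
# be adjusted to the markup that really occurs in the corpus.
BREAKING_TAGS = ("p", "div", "br", "li", "tr", "blockquote",
                 "h1", "h2", "h3", "h4", "h5", "h6")


def split_plain_text(text):
    """Split plain text into paragraphs on one or more blank lines."""
    parts = re.split(r"\n\s*\n", text)
    # Collapse internal whitespace and drop empty pieces.
    return [" ".join(p.split()) for p in parts if p.strip()]


def split_html(html):
    """Split HTML into paragraphs by turning breaking tags into blank lines."""
    tag_pattern = r"</?(?:%s)\b[^>]*>" % "|".join(BREAKING_TAGS)
    marked = re.sub(tag_pattern, "\n\n", html, flags=re.IGNORECASE)
    # Remove any remaining inline tags such as <b> or <a>.
    stripped = re.sub(r"<[^>]+>", "", marked)
    return split_plain_text(stripped)
```

Each returned string can then be turned into its own input vector, exactly as if it were a separate document. Regex-based tag handling is fragile on messy HTML, so a real HTML parser may be the better choice there.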

Sentence splitting is easy to do 90% correctly, hard to do better than 99%
especially in some domains.  For your purposes, 90% is probably fine.  Start
with the simplest possible case and add a few special cases and you will be
set.  There may be usable software to be found on the net, but your needs
are very modest.
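
As an illustration of "the simplest possible case plus a few special cases", here is a hedged Python sketch. The function name and the `ABBREVIATIONS` set are assumptions chosen for the example; it will still get some sentences wrong, for instance when an abbreviation really does end a sentence.

```python
# A few abbreviations that should not end a sentence (illustrative only).
ABBREVIATIONS = {"e.g.", "i.e.", "mr.", "mrs.", "dr.", "etc.", "vs."}


def split_sentences(text):
    """Naive splitter: a token ending in '.', '!' or '?' closes a sentence,
    unless the token is a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = (token[-1] in ".!?"
                         and token.lower() not in ABBREVIATIONS)
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    # Keep any trailing text that has no closing punctuation.
    if current:
        sentences.append(" ".join(current))
    return sentences
```

New special cases can be added to `ABBREVIATIONS` as they show up in the data.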

Good luck!

Let us know how it goes.

On Fri, Apr 30, 2010 at 10:32 AM, Bogdan Vatkov <[hidden email]> wrote:

> Btw, why do you think splitting and clustering won't work? Has anybody
> tried this?
> I am not sure it will be successful, but I also have no arguments
> for why it should not lead to a meaningful result.
> If I split a doc per sentence I might not get good results, but if I use
> larger pieces, e.g. paragraphs, it might give some topics (sets of
> keywords).
> Has anyone tried something like this?
>

Re: Clustering single doc as multiple docs

Bogdan94202
Thanks Ted! That was what I needed!

On Fri, Apr 30, 2010 at 10:21 PM, Ted Dunning <[hidden email]> wrote:

> Yes.  Splitting by paragraph should work fine (been there, done that).
>
> Splitting by sentence works well if you do something like SVD to smooth
> over the fact that you have few words per sentence.
>
> Splitting by paragraph is pretty easy, but corpus-specific.  For plain text,
> try looking for blank lines.  For HTML, make a list of breaking markup and
> insert split points wherever you find those.  For other formats you will
> need to put on your thinking cap.
>
> Sentence splitting is easy to do 90% correctly, hard to do better than 99%
> especially in some domains.  For your purposes, 90% is probably fine.  Start
> with the simplest possible case and add a few special cases and you will be
> set.  There may be usable software to be found on the net, but your needs
> are very modest.
>
> Good luck!
>
> Let us know how it goes.



--
Best regards,
Bogdan