Using Regex fragmenter to extract paragraphs

Mark Ferguson
Hello,

I am trying to use the regex fragmenter and am having a hard time getting
the results I want. I am trying to get fragments that start on a word
character and end on punctuation, but for some reason the fragments being
returned to me seem to be very inflexible, even though I've provided a
large slop. Here are the relevant parameters I'm using; maybe someone can
help point out where I've gone wrong:

<str name="hl.fragsize">500</str>
<str name="hl.fragmenter">regex</str>
<str name="hl.regex.slop">0.8</str>
<str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
<str name="hl">true</str>
<str name="q">chinese</str>

This should match between 400 and 600 characters, beginning with a word
character and ending with one of .!?. Here is an example of a typical
result:

. Check these pictures out. Nine panda cubs on display for the first time
Thursday in southwest China. They're less than a year old. They just
recently stopped nursing. There are only 1,600 of these guys left in the
mountain forests of central China, another 120 in <span
class='hl'>Chinese</span> breeding facilities and zoos. And they're about 20
that live outside China in zoos. They exist almost entirely on bamboo. They
can live to be 30 years old. And these little guys will eventually get much
bigger. They'll grow

As you can see, it starts with a period and ends on a word character!
It's almost as if the fragments are just coming out as they will and the
regex isn't doing anything at all, but the results are different when I use
the gap fragmenter. In the above result I don't see any reason why it
shouldn't have stripped out the preceding period and the last two words;
there is plenty of room in the slop and in the regex pattern. Please help me
figure out what I'm doing wrong...

Thanks a lot,

Mark Ferguson
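
For reference, the size window implied by hl.fragsize and hl.regex.slop can be worked out by hand. The sketch below assumes the meaning of hl.regex.slop given in the Solr highlighting documentation: a factor by which a fragment may stray from hl.fragsize, so that a slop of 0.2 with a fragsize of 100 allows roughly 80 to 120 characters. The function name is hypothetical, and this is plain arithmetic rather than anything taken from Solr's code.

```python
def fragment_size_window(fragsize, slop):
    """Return the (min, max) fragment length implied by fragsize and slop.

    Assumes slop is a fraction of fragsize applied in both directions.
    """
    # round() avoids off-by-one results from floating-point error,
    # e.g. 500 * (1 - 0.8) is slightly below 100 in binary floating point.
    low = round(fragsize * (1 - slop))
    high = round(fragsize * (1 + slop))
    return low, high


# Values from the post: hl.fragsize=500, hl.regex.slop=0.8
print(fragment_size_window(500, 0.8))  # (100, 900)
```

Under that assumption, the settings in the post allow fragments anywhere from 100 to 900 characters.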

Re: Using Regex fragmenter to extract paragraphs

Mark Ferguson
Someone helped me with the regex and pointed out a couple mistakes, most
notably the extra quantifier in .*{400,600}. My new regex is this:

\w.{400,600}[\.!?]

Unfortunately, my results still aren't any better. Some results start with a
word character, some don't, and none seem to end with punctuation. Any ideas
what else could be wrong?

Mark
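
As a sanity check on the pattern itself, here is a minimal sketch of what the corrected regex matches when run as an ordinary regular expression. It uses Python's re module rather than the Java regex engine that Solr runs on, and the sample text and function name are made up for illustration. It says nothing about how the fragmenter applies the pattern.

```python
import re

# The corrected pattern from the post: a word character, 400-600 characters
# of anything except newlines, then one sentence-ending punctuation mark.
PATTERN = re.compile(r"\w.{400,600}[.!?]")


def first_match(text):
    """Return (start_index, matched_text) for the first match, or None."""
    m = PATTERN.search(text)
    return (m.start(), m.group(0)) if m else None


# Hypothetical sample: a stray leading period, then twenty short sentences.
sample = ". " + "They exist almost entirely on bamboo. " * 20

start, fragment = first_match(sample)
print(start)                      # index of the first word character
print(len(fragment))              # always between 402 and 602
print(fragment[0], fragment[-1])  # a word character and a punctuation mark
```

On its own, then, the pattern skips the leading period and stops on punctuation, so the regex as written does describe the fragments being asked for.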



On Fri, Dec 12, 2008 at 2:37 PM, Mark Ferguson <[hidden email]> wrote:

> <str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>

Re: Using Regex fragmenter to extract paragraphs

Erick Erickson
Shouldn't you escape the question mark at the end too?

On Fri, Dec 12, 2008 at 6:22 PM, Mark Ferguson <[hidden email]> wrote:

> Someone helped me with the regex and pointed out a couple mistakes, most
> notably the extra quantifier in .*{400,600}. My new regex is this:
>
> \w.{400,600}[\.!?]

Re: Using Regex fragmenter to extract paragraphs

Mark Ferguson
You actually don't need to escape most characters inside a character class,
so escaping the period was unnecessary.
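
A quick check of the escaping point, again using Python's re as a stand-in for Java's regex syntax (an assumption, although both treat '.' and '?' as literals inside a character class). The helper name is hypothetical.

```python
import re

escaped = re.compile(r"[\.!?]")    # period escaped, as in the second post
unescaped = re.compile(r"[.!?]")   # nothing escaped


def same_matches(chars):
    """True if both classes accept or reject every character identically."""
    return all(
        bool(escaped.fullmatch(c)) == bool(unescaped.fullmatch(c)) for c in chars
    )


print(same_matches(".!?ab \\"))         # True
print(bool(unescaped.fullmatch("?")))   # True: '?' is literal in a class
print(bool(unescaped.fullmatch("a")))   # False: '.' is not a wildcard here
```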

I've tried using the example regex ([-\w ,/\n\"']{20,200}), and I'm _still_
getting lots of highlighted snippets that don't match the regex (starting
with a period, etc.). Has anyone else had any trouble with the default regex
fragmenter? If someone has used it and gotten the expected results, can you
let me know, so I know that the problem is on my end?
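
To make the mismatch concrete, the sketch below runs the example pattern as a plain regular expression over the opening of the snippet from the first post. It uses Python's re rather than Java's engine, and it only shows what the pattern can match by itself, not how the fragmenter uses it.

```python
import re

# The example pattern quoted above: 20-200 characters drawn from hyphen,
# word characters, space, comma, slash, newline and quote marks.
EXAMPLE = re.compile(r"""[-\w ,/\n\"']{20,200}""")

# Opening of the snippet from the first post, including the leading period.
snippet = (
    ". Check these pictures out. Nine panda cubs on display for the first "
    "time Thursday in southwest China. They're less than a year old."
)

runs = EXAMPLE.findall(snippet)
for run in runs:
    print(repr(run))

# No run can contain or start with a period, because '.' is not in the class.
print(any("." in run for run in runs))  # False
```

A snippet that begins with a period is therefore not itself a match of this pattern, which is consistent with the behaviour described above. Whether the fragmenter is meant to return whole matches is a separate question that this sketch does not answer.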

Thanks for your help,

Mark


On Sun, Dec 14, 2008 at 8:34 AM, Erick Erickson <[hidden email]> wrote:

> Shouldn't you escape the question mark at the end too?