[jira] Created: (SOLR-211) regex split() Tokenizer

[jira] Created: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
regex split() Tokenizer
-----------------------

                 Key: SOLR-211
                 URL: https://issues.apache.org/jira/browse/SOLR-211
             Project: Solr
          Issue Type: New Feature
          Components: search
            Reporter: Ryan McKinley


A TokenizerFactory that makes tokens from:

  string.split( regex );




--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-211:
-------------------------------

    Attachment: SOLR-211-RegexSplitTokenizer.patch

A simple regex tokenizer and a test.


<fieldType name="splitText" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.RegexSplitTokenizerFactory" regex="--"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>


Given a field value:
  "Architecture--United States--19th century"

it will create the tokens:
  "Architecture"
  "United States"
  "19th century"

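For reference, the tokens listed above are what plain String.split() followed by trim() produces. A minimal sketch in plain Java, not the code from the patch; the class and method names (SplitExample, splitAndTrim) are made up for illustration and no Solr classes are involved:

```java
import java.util.ArrayList;
import java.util.List;

final class SplitExample {
    // Splits on the regex and trims each piece, roughly what a split
    // tokenizer followed by a trim filter would yield.
    static List<String> splitAndTrim(String input, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.split(regex)) {
            tokens.add(piece.trim());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Architecture, United States, 19th century]
        System.out.println(splitAndTrim("Architecture--United States--19th century", "--"));
    }
}
```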



[jira] Commented: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491106 ]

Hoss Man commented on SOLR-211:
-------------------------------

Some quick comments based on a cursory reading of the patch...

1) RegexSplitTokenizerFactory.init should probably compile the regex into a Pattern that can be reused more than once ... I think String.split recompiles it on each call.
2) I don't think the offset stuff will work properly ... the length of the regex string is not the same as the length of the string it matches when splitting (e.g. \p{javaWhitespace}) ... we would probably need to use the Matcher API and iterate over the individual matches.
3) In the vein of like things having like names, we may want to call this the PatternSplitTokenizer and name its init param "pattern" (to match PatternReplaceFilter).
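Points 1 and 2 can be illustrated with a small sketch. It is plain java.util.regex code under assumed names (DelimiterOffsets, firstDelimiter), not the patch itself: the Pattern is compiled once, and the Matcher reports where each delimiter actually starts and ends, which has nothing to do with the length of the regex string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimiterOffsets {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    static final Pattern WHITESPACE = Pattern.compile("\\p{javaWhitespace}+");

    // Returns {start, end} of the first delimiter match, or null if there is none.
    static int[] firstDelimiter(String input) {
        Matcher m = WHITESPACE.matcher(input);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        // The regex string is 19 characters long, but the delimiter it matches
        // here is the 4-character run " \t\n " at offsets 3..7.
        int[] span = firstDelimiter("foo \t\n bar");
        System.out.println(span[0] + ".." + span[1]);
    }
}
```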


[jira] Commented: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491109 ]

Yonik Seeley commented on SOLR-211:
-----------------------------------

> should probably compile the regex [...]

Yep... beat me to it.
I was off trying to look up if there was a way to avoid reading everything into a String too... but I don't see a way to use a regex directly on a Reader.


[jira] Commented: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491125 ]

Hoss Man commented on SOLR-211:
-------------------------------

> but I don't see a way to use a regex directly on a Reader.

...I think it's pretty much impossible to have a robust regex system that can operate on character streams; regex engines need to be able to back up ... a lot.
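The fallback discussed here, buffering the whole Reader into a String before matching, might look like the sketch below. The helper name readAll and the 1024-character buffer are arbitrary choices for illustration, and the memory cost grows with the size of the field value.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

final class ReaderSlurp {
    // Drains the reader into memory so a regex can see the complete input.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = readAll(new StringReader("a--b--c"));
        System.out.println(String.join("|", Pattern.compile("--").split(text)));
    }
}
```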


[jira] Updated: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-211:
-------------------------------

    Attachment: SOLR-211-RegexSplitTokenizer.patch

Thanks for the quick feedback!

Here is an updated version that:

1. uses a compiled Pattern
2. uses matcher.find() to set proper start and end offsets
3. is called PatternSplitTokenizerFactory
4. has tests that make sure the output is the same as you would get with string.split( pattern )
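The approach in points 1, 2 and 4 could be sketched as follows. This is an illustration, not the attached patch: OffsetSplit and Token are hypothetical names, and the comparison with String.split() is only intended for non-empty input and a pattern that never matches the empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OffsetSplit {
    // A token together with its character offsets in the original input.
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Walks the delimiter matches and emits the text between them.
    static List<Token> split(Pattern pattern, String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int start = 0;
        while (m.find()) {
            tokens.add(new Token(input.substring(start, m.start()), start, m.start()));
            start = m.end();
        }
        tokens.add(new Token(input.substring(start), start, input.length()));
        // String.split() drops trailing empty strings, so do the same here.
        while (!tokens.isEmpty() && tokens.get(tokens.size() - 1).text.isEmpty()) {
            tokens.remove(tokens.size() - 1);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Architecture--United States--19th century";
        for (Token t : split(Pattern.compile("--"), input)) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

A test in the spirit of point 4 would compare the token texts against input.split( pattern ) element by element.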




[jira] Updated: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-211:
-------------------------------

    Attachment: SOLR-211-RegexSplitTokenizer.patch

Using a Matcher to generate the tokens makes it easy enough to return the match itself as the token, not just the split() pieces.

* Updated to take a "group" argument - if the group is less than zero, it behaves as a split; otherwise it uses the matched group as the token.

* Changed the name to PatternTokenizerFactory, as it is more general than just split.
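The two modes of the "group" argument can be sketched like this. It is a rough illustration with made-up names (GroupTokens, tokenize), not the committed code; skipping null or empty groups is an assumption of the sketch, and the split mode simply delegates to Pattern.split().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupTokens {
    // group < 0: the pattern is a delimiter, so behave like split().
    // group >= 0: each match is a token; group 0 is the whole match.
    static List<String> tokenize(Pattern pattern, String input, int group) {
        if (group < 0) {
            return Arrays.asList(pattern.split(input));
        }
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            String token = m.group(group);
            // Assumption: groups that did not participate, or matched nothing, are skipped.
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern quoted = Pattern.compile("'([^']+)'");
        System.out.println(tokenize(quoted, "aaa 'bbb' 'ccc'", 1));
        System.out.println(tokenize(Pattern.compile("--"), "a--b--c", -1));
    }
}
```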


[jira] Work started: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on SOLR-211 started by Ryan McKinley.

|

[jira] Assigned: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley reassigned SOLR-211:
----------------------------------

    Assignee: Ryan McKinley

|

[jira] Commented: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491852 ]

Ken Krugler commented on SOLR-211:
----------------------------------

I think we must be working on similar types of projects :)

I did something similar to the above, but in two different ways:

# I extended WhitespaceTokenizerFactory to take optional pattern & replacement parameters. If these exist, then I apply them before the tokenizer gets called. This lets me do something like strip out all XML fields other than the content of the one that I want to index from a bunch of XML going into a Solr field.
# I added a CSVTokenizerFactory, which takes an optional split character and an optional remapping file. This lets me get a field like "Java,Python,C#" and turn it into "java python csharp", which are the index tokens I need, while leaving the display text as-is.

I don't know if your new PatternTokenizerFactory could replace either of these, though. For the first case, I still want the white space tokenization after I've stripped off all the junk I don't want. And for the second, I need to be able to do the remapping.
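The second case (split on a character, then remap) can be sketched roughly as below. The CSVTokenizerFactory itself is not shown in this thread, so everything here is hypothetical: the names CsvRemap and remap, the in-memory Map standing in for the remapping file, and the choice to lowercase before the lookup.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class CsvRemap {
    // Splits on a literal separator, lowercases each piece, then applies the remapping.
    static String remap(String field, String separator, Map<String, String> remapping) {
        StringBuilder out = new StringBuilder();
        for (String piece : field.split(Pattern.quote(separator))) {
            String key = piece.trim().toLowerCase(Locale.ROOT);
            if (key.isEmpty()) {
                continue; // ignore empty entries such as "a,,b"
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(remapping.getOrDefault(key, key));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> remapping = Collections.singletonMap("c#", "csharp");
        // prints java python csharp
        System.out.println(remap("Java,Python,C#", ",", remapping));
    }
}
```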

|

[jira] Commented: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492132 ]

Ryan McKinley commented on SOLR-211:
------------------------------------

>
> I don't know if your new PatternTokenizerFactory could replace either of these, though. For the first case, I still want the white space tokenization after I've stripped off all the junk I don't want. And for the second, I need to be able to do the remapping.
>

If you're really good with regular expressions, perhaps it could all be combined... I'm not ;)

In my real use case, I use the general PatternTokenizerFactory to split the input into a bunch of tokens, then I have a custom (ugly!) TokenFilter transform the stream with other one-off transformations similar to what you describe.  



|

[jira] Resolved: (SOLR-211) regex split() Tokenizer

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley resolved SOLR-211.
--------------------------------

    Resolution: Fixed

added in rev:532508

I'm not sure how to make the svn changelog show up in JIRA. It looks like issues may get automatically linked if you start the svn comment with SOLR-XXX. Is this true?

https://issues.apache.org/jira/browse/SOLR-104?page=com.atlassian.jira.plugin.ext.subversion:subversion-commits-tabpanel
