Extracting URLs while indexing


Extracting URLs while indexing

Bogdan94202
Hi,

I want to extract URLs (http://..., as well as file://... or even //.....)
while pushing documents into Solr.
Is it possible with the Filters/Analyzers available nowadays?
I looked into the doc but could not find anything related to it.

Best regards,
Bogdan

Re: Extracting URLs while indexing

Erick Erickson
Do you mean you want the URLs to be extracted on the client?
If so, no. Filters/analyzers reside on the server, not the client.
You'll have to do it with custom code....

Erick
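
A minimal sketch of what such custom client-side code could look like, assuming a regular expression is good enough to spot the three shapes mentioned in the question (http://..., file://..., and //...). The pattern is a heuristic, not a full URL grammar, and the "urls" field name is hypothetical; "msg_body" is the field name used in the schema posted later in this thread.

```python
import re

# Heuristic pattern (an assumption, not a full URL grammar):
# an optional "scheme:" followed by "//" and a run of non-space characters.
URL_RE = re.compile(r"(?:\b[a-zA-Z][a-zA-Z0-9+.-]*:)?//[^\s<>\"']+")


def extract_urls(text):
    """Return URL-like substrings of text, in order of appearance."""
    # Trailing sentence punctuation is trimmed from each match.
    return [m.group(0).rstrip(".,;:!?)") for m in URL_RE.finditer(text)]


def to_solr_doc(doc_id, body):
    """Build a document dict; 'urls' is a hypothetical multi-valued field."""
    return {"id": doc_id, "msg_body": body, "urls": extract_urls(body)}


if __name__ == "__main__":
    sample = "See http://example.com/a?b=1, file:///tmp/x.txt and //cdn.example.org/lib.js."
    print(to_solr_doc("1", sample)["urls"])
```

The extracted values would then be sent along with the document, so no analyzer on the server has to recognize URLs.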

On Tue, Jan 19, 2010 at 5:48 PM, Bogdan Vatkov <[hidden email]> wrote:

> Hi,
>
> I want to extract URLs (http://..., as well as file://... or even //.....)
> while pushing documents into Solr.
> Is it possible with the Filters/Analyzers available nowadays?
> I looked into the doc but could not find anything related to it.
>
> Best regards,
> Bogdan
>

Re: Extracting URLs while indexing

Bogdan94202
Sorry, I meant completely server-side. Moreover, I want it done at indexing
time (I do not care about query time, as I read the whole index later
anyway).

On Wed, Jan 20, 2010 at 2:40 AM, Erick Erickson <[hidden email]> wrote:

> Do you mean you want the URLs to be extracted on the client?
> If so, no. Filters/analyzers reside on the server, not the client.
> You'll have to do it with custom code....
>
> Erick



--
Best regards,
Bogdan

Re: Extracting URLs while indexing

Erick Erickson
I guess it depends on what you mean by "extract". There's
nothing that I know of that, say, stores them to a file or
separate field, or even does anything special with them.

I think StandardTokenizerFactory tries to keep URLs
together as a token in the field, but it's just another
token... You should check though....

FWIW
Erick

On Wed, Jan 20, 2010 at 9:52 AM, Bogdan Vatkov <[hidden email]> wrote:

> Sorry, I meant completely server-side - even more I want that at indexing
> time (I do not care about query-time as I am reading later the whole index
> anyway).

Re: Extracting URLs while indexing

Bogdan94202
I am not absolutely sure, but I think that after
tokenization I get the URLs as single tokens, but with all the "interesting
symbols" :) like "/" and ":" removed from the token.
Is that normal? Is there a chance I misconfigured something?

Best regards,
Bogdan

On Wed, Jan 20, 2010 at 7:03 PM, Erick Erickson <[hidden email]> wrote:

> I guess it depends on what you mean by "extract". There's
> nothing that I know of that, say, stores them to a file or
> separate field, or even does anything special with them.
>
> I think StandardTokenizerFactory tries to keep URLs
> together as a token in the field, but it's just another
> token... You should check though....
>
> FWIW
> Erick

Re: Extracting URLs while indexing

Erick Erickson
That's really hard to say without seeing your configuration <G>...

If your field has WordDelimiterFilterFactory with the proper catenate
options set to one, that'd do it.

Can you post the relevant parts of your schema?

Erick

On Wed, Jan 20, 2010 at 12:46 PM, Bogdan Vatkov <[hidden email]> wrote:

> I am not absolutely sure about what I am saying but I think after
> tokenization I get the URLs as single tokens but with all the "interesting
> symbols" :) like "/",":" removed from the token.
> Is it normal? Is there a chance I misconfigured something?
>
> Best regards,
> Bogdan

Re: Extracting URLs while indexing

Bogdan94202
This is the field type:

    <fieldType name="body_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
        <!-- <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> -->
        <filter class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
                language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
      </analyzer>


And this is the field definition:

    <field name="msg_body" type="body_text" termVectors="true" indexed="true" stored="true"/>


On Wed, Jan 20, 2010 at 7:53 PM, Erick Erickson <[hidden email]> wrote:

> That's really hard to say without seeing your configuration <G>...
>
> If your field has WordDelimiterFactory with the proper catenate
> options set to one, that'd do it.
>
> Can you post the relevant parts of your schema?
>
> Erick



--
Best regards,
Bogdan

Need help : Solr configuration issue for sorting on title field

EL KASMI  Hicham
In reply to this post by Bogdan94202
Hello,

We have a problem sorting on the title field in the Solr instance of our
production repository. We get the error message:

"HTTP Status 500 - there are more terms than documents in field
"titleStr", but it's impossible to sort on tokenized fields".

After some googling and searching this list, we found that a
sort field has to be untokenized, but our sort field "titleStr",
which is a copy of the "title" field, already has a string type.

Here are the configurations we tried in our schema.xml file:

1st config
++++++++++

    <fieldtype name="string" class="solr.StrField"
sortMissingLast="true" omitNorms="true"/>


    <fieldtype name="text" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
version="icu4j" composed="false" remove_diacritics="true"
remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
version="icu4j" composed="false" remove_diacritics="true"
remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

===============================

   <field name="title" type="text" indexed="true" stored="true"
termVectors="true"/>
   <field name="titleStr" type="string" indexed="true" stored="false"/>

=================================

<copyField source="title" dest="titleStr"/>


As you can see, the title field has the termVectors property set to true; we
dropped it in the second attempt.

2nd attempt
++++++++++++

    <fieldtype name="text" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
version="icu4j" composed="false" remove_diacritics="true"
remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
version="icu4j" composed="false" remove_diacritics="true"
remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

===============================

   <field name="title" type="text" indexed="true" stored="true"/>
   <field name="titleStr" type="string" indexed="true" stored="false"/>

=================================

<copyField source="title" dest="titleStr"/>


3rd attempt
+++++++++++
Create a new field type named 'text_exact' which doesn't use the
"WhitespaceTokenizer" tokenizer but instead uses the "KeywordTokenizer"
tokenizer.


    <fieldtype name="text" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
version="icu4j" composed="false" remove_diacritics="true"
remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
version="icu4j" composed="false" remove_diacritics="true"
remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="text_exact" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
      <analyzer type="index">
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
input string
             is preserved as a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
version="icu4j" composed="false" remove_diacritics="true"
remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
input string
             is preserved as a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
version="icu4j" composed="false" remove_diacritics="true"
remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>


===============================

   <field name="title" type="text" indexed="true" stored="true"/>
   <field name="titleStr" type="text_exact" indexed="true"
stored="false"/>

=================================

No copyField.


4th attempt
+++++++++++
Same config as the 3rd attempt but adding explicitly that the property
'multiValued' of titleStr (text_exact type) field is false

<field name="titleStr" type="text_exact" indexed="true" stored="false"
multiValued="false"/>

With this last config, we noticed that the number of documents in our
index dropped from ~22500 records to ~17800 records! We don't
understand this behavior of Solr/Lucene.


With all these configs we got the same error message. Please note that
we encounter this issue on our production server
http://difusion.ulb.ac.be/vufind/Search/Home?lookfor=&sort=pubdate+desc&submitButton=Recherche&type=general&sort=title
(with ~22500 records),

whereas with the same config (the first one) on our test server
http://bib17.ulb.ac.be/vufind/Search/Home?lookfor=&sort=pubdate+desc&submitButton=Find&type=general&sort=title
(with ~57700 records), sorting on title works fine!

Thanks in advance for any time you can spend looking at this.

Best regards,

Hicham El Kasmi
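
The error text suggests a sanity check made when the sort data for a field is built: the number of distinct terms in the field is compared with the number of documents. The toy model below only illustrates that condition; it is not Lucene's code, and the function name and input shape are hypothetical. If every document contributes at most one term, the check cannot fail, so a failure hints that some documents carry several terms in "titleStr" (for example several title values, or index data written before a schema change).

```python
def can_sort_on(terms_per_doc):
    """Toy check: True if distinct terms do not outnumber documents.

    terms_per_doc is a list with one entry per document; each entry is the
    list of indexed terms that document has in the sort field.
    """
    num_docs = len(terms_per_doc)
    distinct_terms = {term for terms in terms_per_doc for term in terms}
    return len(distinct_terms) <= num_docs


if __name__ == "__main__":
    # One untokenized value per document: sortable.
    print(can_sort_on([["a first title"], ["another title"]]))
    # Several terms per document (tokenized or multi-valued): not sortable.
    print(can_sort_on([["a", "first"], ["another", "title"]]))
```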


Re: Extracting URLs while indexing

Erick Erickson
In reply to this post by Bogdan94202
You really need to have this page as a handy reference:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Look in particular at what happens with WordDelimiterFilterFactory:
you're breaking your tokens up on non-alpha characters,
case changes, and letter<->number transitions. Then
you're asking that things "of a kind" be put back together into
words.

You might try StandardTokenizerFactory instead....

Erick
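
A rough Python simulation of the effect described above, to show why a URL ends up as one token without "/" and ":". It only approximates generateWordParts=1 with catenateWords=1 and catenateNumbers=1; it is not the real filter, and the function name is hypothetical.

```python
import re
from itertools import groupby

# Sub-words: digit runs, upper-case runs, or an optional capital followed by
# lower-case letters. Every other character acts as a discarded delimiter.
PART_RE = re.compile(r"[0-9]+|[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def word_delimiter_sketch(token):
    """Return (parts, catenated) for one whitespace-delimited token."""
    parts = PART_RE.findall(token)
    catenated = []
    # Adjacent parts "of a kind" (all letters or all digits) are glued together.
    for _is_word, run in groupby(parts, key=str.isalpha):
        run = list(run)
        if len(run) > 1:
            catenated.append("".join(run))
    return parts, catenated


if __name__ == "__main__":
    print(word_delimiter_sketch("http://example.com/path"))
```

In this sketch the delimiters never reach the output, which matches the observation that the "interesting symbols" disappear from the token.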

On Wed, Jan 20, 2010 at 12:55 PM, Bogdan Vatkov <[hidden email]> wrote:

> that is the field type:
>    <fieldType name="body_text" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
>        <!-- Case insensitive stop word removal.
>          add enablePositionIncrements=true in both the index and query
>          analyzers to leave a 'gap' for more accurate phrase queries.
>        -->
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> <!--        <filter class="solr.LowerCaseFilterFactory"/> -->
> <!--        <filter class="solr.SnowballPorterFilterFactory"
> language="English" protected="protwords.txt"/> -->
>        <filter
>
> class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
> language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
>      </analyzer>
>
>
> and that is the field def:
>
> <field name="msg_body" type="body_text" termVectors="true" indexed="true"
> stored="true"/>
>

Re: Extracting URLs while indexing

Bogdan94202
Now I see that I didn't review all the config that I took from the default
config.
I removed the WordDelimiterFilter, and the StandardTokenizer seems to keep URLs
intact, but it splits relative paths (e.g. /file/location/file.txt) and I want
to keep those as single tokens.
Any ideas?
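
One way to think about the requirement, sketched in Python: if tokens are split on whitespace only, both URLs and relative paths survive as single tokens, and a simple test can then pick out the path-like ones. This only illustrates the idea and is not a Solr filter. The function name is hypothetical, and the "contains a slash" test is a heuristic that also accepts tokens such as "and/or".

```python
def path_like_tokens(text):
    """Split on whitespace only and keep tokens that look like paths or URLs."""
    tokens = []
    for raw in text.split():
        # Trim opening brackets/quotes and trailing sentence punctuation.
        token = raw.lstrip("([\"'").rstrip(".,;:!?)]\"'")
        # Heuristic: a slash plus at least one letter or digit.
        if "/" in token and any(ch.isalnum() for ch in token):
            tokens.append(token)
    return tokens


if __name__ == "__main__":
    print(path_like_tokens("Open /file/location/file.txt, then see http://example.com/a."))
```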

On Wed, Jan 20, 2010 at 8:13 PM, Erick Erickson <[hidden email]> wrote:

> You really need to have this page as a handy reference.....
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> Look in
> particular at what happens with
> WordDelimiterFilterFactory,
> you're breaking your tokens up on non-alpha characters and
> case change and letter<->number transitions. Then
> you're asking that things "of a kind" be put back into
> words.
>
> You might try StandardTokenizerFactory instead....
>
> Erick



--
Best regards,
Bogdan

Re: Need help : Solr configuration issue for sorting on title field

hossman
In reply to this post by EL KASMI Hicham

: Subject: Need help : Solr configuration issue for sorting on title field
: In-Reply-To: <[hidden email]>
: References: <[hidden email]>
:     <[hidden email]>
:     <[hidden email]>
:     <[hidden email]>
:     <[hidden email]>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking


-Hoss


RE : Need help : Solr configuration issue for sorting on title field

EL KASMI  Hicham
Sorry Chris and others, this is the first time I have used a mailing list to ask a question.
I'll send my question again in a new, clean message.

Thanks for the references.
Hicham

-------- Original message --------
From: Chris Hostetter [mailto:[hidden email]]
Date: Thu. 21/01/2010 0:12
To: [hidden email]
Subject: Re: Need help : Solr configuration issue for sorting on title field
 

: Subject: Need help : Solr configuration issue for sorting on title field
: In-Reply-To: <[hidden email]>
: References: <[hidden email]>
:     <[hidden email]>
:     <[hidden email]>
:     <[hidden email]>
:     <[hidden email]>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking


-Hoss