Is it possible to exclude results from other languages?

Is it possible to exclude results from other languages?

raimon.bosch

Hi,

In our indexes, we sometimes have documents written in a language other than the index's main language. Is there any way to give these documents a lower boost?

Thanks in advance,
Raimon Bosch.

Re: Is it possible to exclude results from other languages?

iorixxx

> In our indexes, sometimes we have some documents written in
> other languages
> different to the most common index's language. Is there any
> way to give less
> boosting to this documents?

If you are aware of those documents, at index time you can boost those documents with a value less than 1.0:

<add>
  <doc boost="0.5">
    <!-- document written in another language -->
    <field name="...">...</field>
    <field name="...">...</field>
  </doc>
</add>

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_on_.22doc.22 


     
Re: Is it possible to exclude results from other languages?

raimon.bosch

Yes, it's true that we could do it at index time if we had a way to know. I was thinking of a search-time solution, maybe measuring the percentage of stopwords in each document. Normally, a document in another language will contain few, if any, stopwords of the index's main language.
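
A minimal sketch of the stopword idea, in Python. The stopword list, the function names and the 0.1 threshold are all assumptions made for illustration; a real setup would use the stopword file configured for the index's main language and tune the threshold on its own data.

```python
import re

# Hypothetical, deliberately tiny English stopword list (an assumption);
# replace it with the full stopword list of the index's main language.
MAIN_LANGUAGE_STOPWORDS = frozenset([
    "the", "a", "an", "and", "or", "of", "to", "in",
    "is", "it", "that", "for", "on", "with",
])


def stopword_ratio(text, stopwords=MAIN_LANGUAGE_STOPWORDS):
    """Return the fraction of tokens in `text` that are main-language stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stopwords)
    return hits / len(tokens)


def looks_foreign(text, threshold=0.1):
    """Guess that `text` is not in the main language when its stopword ratio is low.

    The threshold is an arbitrary starting point, not a recommended value.
    Note that empty text has a ratio of 0.0 and is therefore flagged too.
    """
    return stopword_ratio(text) < threshold
```

Very short documents carry few stopwords in any language, so a check like this is probably only reliable on longer text.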

If you know of any external software that detects the language of a source text, that would be useful too.

Thanks,
Raimon Bosch.


Re: Is it possible to exclude results from other languages?

Lance Norskog-2
There is

--
Lance Norskog
[hidden email]

Re: Is it possible to exclude results from other languages?

Jan Høydahl / Cominvent
It is much more efficient to tag documents with their language at index time. Look at language identification tools such as http://www.sematext.com/products/language-identifier/index.html or http://ngramj.sourceforge.net/ or http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html
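
As a rough sketch of index-time tagging, the Python below builds a Solr `<add>` message, stores the detected language in a field, and applies the `boost="0.5"` attribute from earlier in the thread to documents that are not in the main language. The `language` field name, the `MAIN_LANGUAGE` value and the `detect_language` callable are assumptions; the callable stands in for whichever identifier you choose, and the field would have to be declared in your schema.

```python
import xml.etree.ElementTree as ET

MAIN_LANGUAGE = "en"   # assumption: the index's dominant language
FOREIGN_BOOST = "0.5"  # same value as the <doc boost="0.5"> example in this thread


def build_add_message(docs, detect_language):
    """Build a Solr <add> update message as a string.

    `docs` is a list of dicts mapping field names to string values.
    `detect_language` is any callable mapping text to a language code;
    it is a hypothetical hook, not part of Solr.
    """
    add = ET.Element("add")
    for fields in docs:
        language = detect_language(" ".join(fields.values()))
        doc = ET.SubElement(add, "doc")
        if language != MAIN_LANGUAGE:
            # Demote documents that are not in the main language.
            doc.set("boost", FOREIGN_BOOST)
        for name, value in fields.items():
            ET.SubElement(doc, "field", name=name).text = value
        # "language" is a hypothetical field name; declare it in the schema.
        ET.SubElement(doc, "field", name="language").text = language
    return ET.tostring(add, encoding="unicode")
```

With the language stored in its own field, other languages could also be excluded outright at query time with a filter query such as fq=language:en (again assuming that field name).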

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

Re: Is it possible to exclude results from other languages?

Lance Norskog-2
That's what I was going to look up :)

The Nutch language identifier works reasonably well. It comes with
training data for various languages, though there were some UTF-8
problems in the files. The trick here is to train on a balanced volume
of text for all languages so that one language's patterns do not
overwhelm the others.
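
To illustrate the balancing point, here is a small character-trigram identifier in Python. It is a generic sketch of the n-gram approach and not how Nutch, ngramj or tcatng are implemented; the function names are made up for this example. Each training sample is cut to the length of the shortest one, so every profile is estimated from the same amount of text.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Return the relative frequencies of character trigrams in `text`."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def train(samples):
    """Build one trigram profile per language.

    `samples` maps a language code to its training text. Every text is
    truncated to the length of the shortest one so that all languages
    contribute the same volume of training data.
    """
    limit = min(len(text) for text in samples.values())
    return {lang: trigram_profile(text[:limit]) for lang, text in samples.items()}


def cosine(p, q):
    """Cosine similarity between two sparse trigram profiles."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to `text`."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))
```

Real training data would need to be far larger than a sentence or two per language; the truncation step is the part that relates to the balance issue.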

Thanks for the pointer to ngramj (LGPL license), which then leads to
another contender, http://tcatng.sourceforge.net/ (BSD license). The
latter would make a great DIH Transformer that could go into contrib/
(hint hint).

--
Lance Norskog
[hidden email]

Re: Is it possible to exclude results from other languages?

Shalin Shekhar Mangar
On Wed, Feb 10, 2010 at 10:09 AM, Lance Norskog <[hidden email]> wrote:

>
> Thanks for the pointer to ngramj (LGPL license), which then leads to
> another contender, http://tcatng.sourceforge.net/ (BSD license). The
> latter would make a great DIH Transformer that could go into contrib/
> (hint hint).
>
>
SOLR-1768 :)

--
Regards,
Shalin Shekhar Mangar.