LanguageIdentifier refactoring


LanguageIdentifier refactoring

Jérôme Charron
Hi,

In my last LanguageIdentifier patch, I split the code so that the core
of this plugin can now be used as a standalone lib.
I think it could be a good idea to move this language identification lib
from Nutch to Lucene (in order to make it available in Lucene), and have the
LanguageIdentifier plugin simply rely on this Lucene code.
What do you think about that?

Jerome

PS: Looking at the Jira issues, it seems that a lot of patches
(LanguageIdentifier, for instance) have not been applied to the trunk. What is
the reason? What is the "process" for applying a patch?


--
http://motrech.free.fr/
http://www.frutch.org/

Re: LanguageIdentifier refactoring

Andrzej Białecki-2
Jérôme Charron wrote:

> Hi,
>
> In my last LanguageIdentifier patch, I split the code so that the core
> of this plugin can now be used as a standalone lib.
> I think it could be a good idea to move this language identification lib
> from Nutch to Lucene (in order to make it available in Lucene), and have the
> LanguageIdentifier plugin simply rely on this Lucene code.
> What do you think about that?
>
> Jerome
>
> PS: Looking at Jira issues, it seems that a lot of patches
> (LanguageIdentifier for instance) are not applied to the trunk. What is the
> reason? What is the "process" for applying a patch?
>
>

I monitor your work, and as soon as you say "go" I'm ready to apply the
patches - but I'd rather avoid doing this every couple of days. So, for
now, I'm waiting for a more or less stable situation... ;-)


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: LanguageIdentifier refactoring

Jérôme Charron
>
> I monitor your work, and as soon as you say "go" I'm ready to apply the
> patches - but I'd rather avoid doing this every couple of days. So, for
> now, I'm waiting for a more or less stable situation... ;-)

Ok Andrzej,

the last patch seems to be stable. I performed some functional tests on around
200,000 docs, and it seems to be OK.
So, "Go"... feel free to apply the last patch. ;-)
Thanks.

Jerome
--
http://motrech.free.fr/
http://www.frutch.org/

Re: LanguageIdentifier refactoring

Andrzej Białecki-2
Jerome,

I have an issue with the language detection plugin, which I'm not sure
how to address. The plugin first tries to extract the language
identifier from meta tags. However, the meta tag values people put there
are often completely wrong, or follow obscure pseudo-standards.

Example: there is a bunch of pages, generated by FrontPage, where the author
apparently forgot to change the default settings. So, the meta tags say
"en-us", while the real content of the page is in Spanish. The
identify() method shows this clearly.

The final value put in X-meta-lang is "en-us". Now, the question is -
should the plugin override that value with the one from the
auto-detection? This means that it should always run the detection
step... Can we have more confidence in our detection mechanism than in
the author's knowledge? Well, perhaps, if for content longer than xxx
bytes the detection is nearly unambiguous.

Another example: for a bunch of pages in Swedish, I collected the
following values of X-meta-lang:

(SCHEME=ISO.639-1) sv
(SCHEME=ISO639-1) sv
(SCHEME=RFC1766) sv-FI
(SCHEME=Z39.53) SWE
EN_US, SV, EN, EN_UK
English Swedish
English, swedish
English,Swedish
Other (Svenska)
SE
SV
SV charset=iso-8859-1
SV-FI
SV; charset=iso-8859-1
SVE
SW
SWE
SWEDISH
Sv
Sve
Svenska
Swedish
Swedish, svenska
en, sv
se
se, en
se,en,de
se-sv
sv
sv, be, dk, de, fr, no, pt, ch, fi, en
sv, dk, fi, gl, is, fo
sv, dk, no
sv, en
sv, eng
sv, eng, de
sv, fr, eng
sv, nl
sv, no, de
sv, no, en, de, dk, fi
sv,en
sv,en,de,fr
sv,eng
sv,eng,de,fr
sv,no,fi
sv-FI
sv-SE
sv-en
sv-fi
sv-se
sv; Content-Language: sv
sv_SE
sve
svenska
svenska, swedish, engelska, english, norsk, norwegian, polska, polish
sw
swe
swe.SPR.
sweden
swedish
swedish,
text/html; charset=sv-SE
text/html; sv
torp, stuga, uthyres, bed & breakfast


In all cases the value from the detection routine was unambiguous: Swedish.

In this light, I propose the following changes:

* modify the identify() method to return a pair of lang code + relative
score (normalized to 0..1)

* in HTMLLanguageParser we should always run
LanguageIdentifier.identify(parse.getText())

* if the meta tag is null, we take the value from identify()

* if the value from identify() is null, we take the meta tag value.

* if the meta tag is not null and the value from identify() is not null:

        * if the content is shorter than "lang.analyze.max.length",
          we take the meta tag value

        * else, if the meta tag and identify values are different:

                * if the score from identify() is above "certainty"
                  threshold (0.8?), we take the value from identify().

                * else, we take the meta tag value.

Similar changes would be needed in LanguageIndexingFilter.filter(), to
handle text coming from other content types.
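The arbitration rules above could be sketched like this (a hedged sketch, not actual Nutch code: the class, method, the 0.8 threshold, and the 2048-byte stand-in for "lang.analyze.max.length" are all hypothetical):

```java
import java.util.Locale;

public class LangChooser {

    static final double CERTAINTY = 0.8;      // proposed "certainty" threshold
    static final int MAX_ANALYZE_LEN = 2048;  // stands in for "lang.analyze.max.length"

    /**
     * Arbitrates between the meta tag value and the auto-detected value,
     * following the rules proposed above.
     */
    static String chooseLang(String metaLang, String detected, double score,
                             int contentLength) {
        if (metaLang == null) return detected;       // no meta tag: use identify()
        if (detected == null) return metaLang;       // no detection: use meta tag
        // Both present: trust the meta tag on short content,
        // where detection is least reliable.
        if (contentLength < MAX_ANALYZE_LEN) return metaLang;
        // On long content, override the meta tag only when the values differ
        // and detection is confident ("en-us" is treated as matching "en").
        if (!metaLang.toLowerCase(Locale.ROOT).startsWith(detected)
                && score > CERTAINTY) {
            return detected;
        }
        return metaLang;
    }

    public static void main(String[] args) {
        // The FrontPage case: meta says "en-us", confident detection says "es".
        System.out.println(chooseLang("en-us", "es", 0.95, 10000)); // es
        // Short content: keep the meta tag value even if detection disagrees.
        System.out.println(chooseLang("en-us", "es", 0.95, 100));   // en-us
    }
}
```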

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: LanguageIdentifier refactoring

Jérôme Charron
> I have an issue with the language detection plugin, which I'm not sure
> how to address. The plugin first tries to extract the language
> identifier from meta tags. However, meta tag values people put there are
> often completely wrong, or follow obscure pseudo-standards.
>
> Example: there is a bunch of pages, generated by FrontPage, where the author
> apparently forgot to change the default settings. So, the meta tags say
> "en-us", while the real content of the page is in Spanish. The
> identify() method shows this clearly.


> The final value put in X-meta-lang is "en-us". Now, the question is -
> should the plugin override that value with the one from the
> auto-detection? This means that it should always run the detection
> step... Can we have more confidence in our detection mechanism than in
> the author's knowledge? Well, perhaps, if for content longer than xxx
> bytes the detection is nearly unambiguous.

I think this is an issue for all detection mechanisms...
For the content-type it is the same problem: what is the right value - the
one provided by the protocol layer, the one provided by the extension
mapping, or the one provided by the detection (mime-magic)?

I think we need to change the actual process to use auto-detection
mechanisms (this is true at least for the code that uses the language-identifier
and the code that uses the mime-type identifier). Instead of doing something
like:

1. Get info from protocol
2. If no info from protocol, get info from parsing
3. If no info from parsing, get info from auto-detection

We need to do something like:

1. Get info from protocol
2. Get info from parsing
3. Get degrees of confidence from auto-detection, and check:
3.1 If the value extracted from the protocol has a high degree of
confidence, take the protocol value.
3.2 If the value extracted from parsing has a high degree of confidence,
take the parsing value.
3.3 If neither has a high degree of confidence, but the auto-detection
returns another value with a high degree of confidence, take the
auto-detection value.
3.4 Otherwise, take a default value.
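The cascade above could be sketched like this (a hypothetical sketch: the class, method, and the 0.8 "high confidence" threshold are assumptions, not actual Nutch code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfidenceResolver {

    static final double HIGH = 0.8; // hypothetical "high degree of confidence"

    /**
     * Resolves a property (language, content-type, ...) from the protocol
     * value, the parsed value, and per-value detector confidences,
     * following steps 3.1-3.4 above.
     */
    static String resolve(String protocolVal, String parsedVal,
                          Map<String, Double> confidence, String defaultVal) {
        if (protocolVal != null
                && confidence.getOrDefault(protocolVal, 0.0) >= HIGH) {
            return protocolVal; // 3.1
        }
        if (parsedVal != null
                && confidence.getOrDefault(parsedVal, 0.0) >= HIGH) {
            return parsedVal;   // 3.2
        }
        // 3.3: does the detector itself propose a confident value?
        for (Map.Entry<String, Double> e : confidence.entrySet()) {
            if (e.getValue() >= HIGH) return e.getKey();
        }
        return defaultVal;      // 3.4
    }

    public static void main(String[] args) {
        Map<String, Double> conf = new LinkedHashMap<>();
        conf.put("es", 0.95);
        conf.put("en", 0.05);
        // Protocol and parsing both say "en", but only "es" is detected
        // with high confidence, so step 3.3 wins.
        System.out.println(resolve("en", "en", conf, "unknown")); // es
    }
}
```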

> Another example: for a bunch of pages in Swedish, I collected the
> following values of X-meta-lang:
>
> [long list of X-meta-lang values snipped - see above]
>
> In all cases the value from the detection routine was unambiguous: Swedish.

Yes, I recently saw this problem while analyzing my indexes...
A first step could be to improve the Content-Language / dc.language / html
lang parsers.
(It could be done in the HTMLLanguageParser.)

> In this light, I propose the following changes:
>
> * modify the identify() method to return a pair of lang code + relative
> score (normalized to 0..1)

What do you think about returning a sorted array of lang/score pairs?
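Such a ranked result could look like this (a sketch only; the class name, the raw scores, and the normalization to 1.0 for the best candidate are made-up assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RankedIdentify {

    /**
     * Hypothetical shape for identify(): instead of a single language,
     * return all candidates sorted by descending score, with scores
     * normalized so the best candidate gets 1.0.
     */
    static List<Map.Entry<String, Double>> rank(Map<String, Double> rawScores) {
        double max = rawScores.values().stream()
                .mapToDouble(Double::doubleValue).max().orElse(1.0);
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        for (Map.Entry<String, Double> e : rawScores.entrySet()) {
            out.add(Map.entry(e.getKey(), e.getValue() / max));
        }
        // Sort by descending normalized score.
        out.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return out;
    }

    public static void main(String[] args) {
        // Raw n-gram similarity scores for a Swedish document (made-up numbers).
        List<Map.Entry<String, Double>> ranked =
                rank(Map.of("sv", 120.0, "no", 30.0, "da", 15.0));
        System.out.println(ranked.get(0).getKey());   // sv
        System.out.println(ranked.get(0).getValue()); // 1.0
    }
}
```

A caller that only wants the single best guess can then just take the first entry, while callers that need a confidence check can compare the top two scores.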

> * in HTMLLanguageParser we should always run
> LanguageIdentifier.identify(parse.getText())

Yes!

For information, there are some other issues with the language-identifier:
I was focused on performance and precision, and now that I run it outside
of the "lab" and perform some tests in real life, with real documents, I
see a very big issue: the LanguageIdentifierPlugin is UTF-8 oriented!!!
I discovered this issue and analyzed it yesterday: with UTF-8 encoded input
documents you get some very fine identification, but with any other encoding
it is a disaster.
Sami (I think you were the original and first coder of the
LanguageIdentifierPlugin), do you already know about this problem? Do you
have some ideas about solving it?
Actually, it is a very big issue, and the language-identifier cannot be
used on a real crawl.

Thanks, Andrzej, for your feedback and ideas.
(I will continue to focus my work on the encoding problem, but once I can
commit, I will implement the changes you suggest in this mail.)

In fact, there are still a lot of TODOs in the language-identifier => the
more I work on it, the more issues I see to fix, but it is a very important
module if we want to add multi-lingual support to Nutch.
So, I will update the Wiki pages about the language identifier in order to
keep track of all these fixes/ideas/issues...

Best Regards

Jerome

--
http://motrech.free.fr/
http://www.frutch.org/

Re: LanguageIdentifier refactoring

Andrzej Białecki-2
Jérôme Charron wrote:

> I think this is an issue for all detection mechanisms...
> For the content-type it is the same problem: what is the right value - the
> one provided by the protocol layer, the one provided by the extension
> mapping, or the one provided by the detection (mime-magic)?
>
> I think we need to change the actual process to use auto-detection
> mechanisms (this is true at least for the code that uses the language-identifier
> and the code that uses the mime-type identifier). Instead of doing something
> like:
>
> 1. Get info from protocol
> 2. If no info from protocol, get info from parsing
> 3. If no info from parsing, get info from auto-detection
>
> We need to do something like:
>
> 1. Get info from protocol
> 2. Get info from parsing
> 3. Get degrees of confidence from auto-detection, and check:
> 3.1 If the value extracted from the protocol has a high degree of
> confidence, take the protocol value.
> 3.2 If the value extracted from parsing has a high degree of confidence,
> take the parsing value.
> 3.3 If neither has a high degree of confidence, but the auto-detection
> returns another value with a high degree of confidence, take the
> auto-detection value.
> 3.4 Otherwise, take a default value.

Yes, I agree.

>>* modify the identify() method to return a pair of lang code + relative
>>score (normalized to 0..1)
>
>
> What do you think about returning a sorted array of lang/score pair?

Yes, that would make sense too. I've been working with a proprietary
language detection tool (based on similar principles), and it also
returned a sorted array.

> For information, there are some other issues with the language-identifier:
> I was focused on performance and precision, and now that I run it outside
> of the "lab" and perform some tests in real life, with real documents, I
> see a very big issue: the LanguageIdentifierPlugin is UTF-8 oriented!!!
> I discovered this issue and analyzed it yesterday: with UTF-8 encoded input
> documents you get some very fine identification, but with any other encoding
> it is a disaster.

Mhm. I'm not so sure. The NGramProfile load/save methods are safe; they
both use UTF-8. LanguageIdentifier.identify() seems to be safe, too,
because it only works with Strings, which are not encoded (native
Unicode). So, the only place where it would be problematic seems to be
the command-line utilities (the main methods in both classes), where a
simple change to use InputStreamReader(inputstream, encoding) would fix
the issue...
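That kind of fix could look like this (a self-contained sketch, not the actual Nutch code: the class and method names are hypothetical, only `InputStreamReader(in, charset)` is the real JDK API being suggested):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadWithEncoding {

    /**
     * Decodes the stream with an explicit charset instead of the platform
     * default, so non-UTF-8 documents become correct Strings before
     * identify() ever sees them.
     */
    static String readAll(InputStream in, String encoding) {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, Charset.forName(encoding)))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "på svenska".getBytes(StandardCharsets.ISO_8859_1);
        // Decoding with the declared charset recovers the original text;
        // decoding the same bytes as UTF-8 would mangle the "å".
        System.out.println(readAll(new ByteArrayInputStream(latin1), "ISO-8859-1"));
    }
}
```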

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: LanguageIdentifier refactoring

Jérôme Charron
> Mhm. I'm not so sure. The NGramProfile load/save methods are safe, they
> both use UTF-8. LanguageIdentifier.identify() seems to be safe, too -
> because it only works with Strings, which are not encoded (native
> Unicode). So, the only place where it would be problematic seems to be
> in the command-line utilities (main methods in both classes), where
> simple change to use InputStreamReader(inputstream, encoding) would fix
> the issue...

In fact, what I see while looking at the code (correct me if I'm wrong) is
that the Writers and Readers used by Nutch don't take the encoding into
account (only the HtmlParser performs some encoding detection and adds some
metadata about the encoding).
So, my idea is simply to:
1. Move the encoding detection used in HtmlParser to a more generic place
(ParseSegment could be a good candidate)
2. Use the encoding metadata in all the read/write related methods

Seems like a huge amount of work... but I think it is necessary... no?
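Step 2 could be sketched like this (purely illustrative: the class, method, and the "charset" metadata key are hypothetical names, not actual Nutch identifiers):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class EncodingAwareText {

    /**
     * When turning raw fetched bytes into text, consult the encoding
     * recorded in the (hypothetical) metadata by the detection step,
     * falling back to UTF-8 when nothing was detected.
     */
    static String toText(byte[] raw, Map<String, String> metadata) {
        String enc = metadata.getOrDefault("charset", "UTF-8");
        return new String(raw, Charset.forName(enc));
    }

    public static void main(String[] args) {
        byte[] raw = "på svenska".getBytes(StandardCharsets.ISO_8859_1);
        // With the detected charset carried in the metadata,
        // the text round-trips correctly.
        System.out.println(toText(raw, Map.of("charset", "ISO-8859-1")));
    }
}
```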

Jerome

--
http://motrech.free.fr/
http://www.frutch.org/