how to use HTMLStripCharFilter in solrJ?

Arturas Mazeika
Hi Solr Folk,

What would be the easiest way to use some of the Solr and Lucene components
in SolrJ?

I am pretty amazed at how much thought and careful engineering went into some
individual components to cover the wild real world effectively, and I
wonder whether one could re-use some of them in other contexts.

Ultimately, I want to strip the HTML markup and store the output in Solr
(for various reasons [0]). I approached the problem pragmatically: I
googled "HTMLStripCharFilter and example" and got to [1], checked which jar
I need for that (solr-core), googled for the pom dependencies [2], and
integrated this into my SolrJ app:

    StringReader strReader = new StringReader(content);
    HTMLStripCharFilter stripper = new HTMLStripCharFilter(new BufferedReader(strReader));
    StringBuilder o = new StringBuilder();
    char[] cbuf = new char[1024 * 10];
    while (true) {
        int count = stripper.read(cbuf);
        if (count == -1)
            break; // end of stream mark is -1
        if (count > 0)
            o.append(cbuf, 0, count);
    }
    stripper.close();
    doc.addField("content_stripped", o.toString());
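
Since Lucene's CharFilter extends java.io.Reader, the read loop in the snippet can be checked in isolation, without Solr or Lucene on the classpath. The following is only a minimal sketch under that assumption: the class name ReaderDrain and the method drain are hypothetical helpers, not part of any library, and a plain StringReader stands in for the filter, so no HTML is actually stripped here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helper (not part of Solr or Lucene): drains any Reader into a String. */
final class ReaderDrain {

    /** Reads the reader to the end, closes it, and returns everything it produced. */
    static String drain(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[10 * 1024];
        // try-with-resources closes the reader even if read() throws
        try (Reader in = reader) {
            int count;
            while ((count = in.read(buf)) != -1) { // -1 marks end of stream
                out.append(buf, 0, count);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input: a plain StringReader, so the text comes back unchanged.
        // In the original snippet the argument would be the HTMLStripCharFilter.
        System.out.println(drain(new StringReader("<b>hello</b> world")));
    }
}
```

If this loop terminates as expected with a plain reader, the hang is more likely to come from somewhere else in the program than from the loop itself.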


The dependencies were downloaded [3], but when I start the program nothing
happens (I have a feeling that a web server is being started).

Comments?

Cheers,
Arturas

References

[0] The reasons range from optimizing how highlighted text is shown to the
end user, through getting hands-on exposure to individual Solr components at
the deepest level, to analyzing the impact on algorithms such as machine
learning or data management.

[1]
https://www.programcreek.com/java-api-examples/index.php?api=org.apache.lucene.analysis.charfilter.HTMLStripCharFilter

[2] pom.xml:

    <dependencies>
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-solrj</artifactId>
            <version>7.3.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-core</artifactId>
            <version>7.3.0</version>
        </dependency>
    </dependencies>

[3] Included jars:
hppc-0.7.3.jar already exists in destination.
jackson-annotations-2.5.4.jar already exists in destination.
jackson-core-2.5.4.jar already exists in destination.
jackson-databind-2.5.4.jar already exists in destination.
jackson-dataformat-smile-2.5.4.jar already exists in destination.
caffeine-2.4.0.jar already exists in destination.
guava-14.0.1.jar already exists in destination.
protobuf-java-3.1.0.jar already exists in destination.
t-digest-3.1.jar already exists in destination.
commons-cli-1.2.jar already exists in destination.
commons-codec-1.10.jar already exists in destination.
commons-collections-3.2.2.jar already exists in destination.
commons-configuration-1.6.jar already exists in destination.
commons-fileupload-1.3.2.jar already exists in destination.
commons-io-2.5.jar already exists in destination.
commons-lang-2.6.jar already exists in destination.
dom4j-1.6.1.jar already exists in destination.
gmetric4j-1.0.7.jar already exists in destination.
metrics-core-3.2.2.jar already exists in destination.
metrics-ganglia-3.2.2.jar already exists in destination.
metrics-graphite-3.2.2.jar already exists in destination.
metrics-jetty9-3.2.2.jar already exists in destination.
metrics-jvm-3.2.2.jar already exists in destination.
javax.servlet-api-3.1.0.jar already exists in destination.
tools.jar already exists in destination.
joda-time-2.2.jar already exists in destination.
log4j-1.2.17.jar already exists in destination.
eigenbase-properties-1.1.5.jar already exists in destination.
antlr4-runtime-4.5.1-1.jar already exists in destination.
calcite-core-1.13.0.jar already exists in destination.
calcite-linq4j-1.13.0.jar already exists in destination.
avatica-core-1.10.0.jar already exists in destination.
commons-exec-1.3.jar already exists in destination.
commons-lang3-3.6.jar already exists in destination.
commons-math3-3.6.1.jar already exists in destination.
curator-client-2.8.0.jar already exists in destination.
curator-framework-2.8.0.jar already exists in destination.
curator-recipes-2.8.0.jar already exists in destination.
hadoop-annotations-2.7.4.jar already exists in destination.
hadoop-auth-2.7.4.jar already exists in destination.
hadoop-common-2.7.4.jar already exists in destination.
hadoop-hdfs-2.7.4.jar already exists in destination.
htrace-core-3.2.0-incubating.jar already exists in destination.
httpclient-4.5.3.jar already exists in destination.
httpcore-4.4.6.jar already exists in destination.
httpmime-4.5.3.jar already exists in destination.
lucene-analyzers-common-7.3.0.jar already exists in destination.
lucene-analyzers-kuromoji-7.3.0.jar already exists in destination.
lucene-analyzers-phonetic-7.3.0.jar already exists in destination.
lucene-backward-codecs-7.3.0.jar already exists in destination.
lucene-classification-7.3.0.jar already exists in destination.
lucene-codecs-7.3.0.jar already exists in destination.
lucene-core-7.3.0.jar already exists in destination.
lucene-expressions-7.3.0.jar already exists in destination.
lucene-grouping-7.3.0.jar already exists in destination.
lucene-highlighter-7.3.0.jar already exists in destination.
lucene-join-7.3.0.jar already exists in destination.
lucene-memory-7.3.0.jar already exists in destination.
lucene-misc-7.3.0.jar already exists in destination.
lucene-queries-7.3.0.jar already exists in destination.
lucene-queryparser-7.3.0.jar already exists in destination.
lucene-sandbox-7.3.0.jar already exists in destination.
lucene-spatial-extras-7.3.0.jar already exists in destination.
lucene-spatial3d-7.3.0.jar already exists in destination.
lucene-suggest-7.3.0.jar already exists in destination.
solr-core-7.3.0.jar already exists in destination.
solr-solrj-7.3.0.jar already exists in destination.
zookeeper-3.4.11.jar already exists in destination.
jackson-core-asl-1.9.13.jar already exists in destination.
jackson-mapper-asl-1.9.13.jar already exists in destination.
commons-compiler-2.7.6.jar already exists in destination.
janino-2.7.6.jar already exists in destination.
stax2-api-3.1.4.jar already exists in destination.
woodstox-core-asl-4.4.1.jar already exists in destination.
jetty-continuation-9.4.8.v20171121.jar already exists in destination.
jetty-deploy-9.4.8.v20171121.jar already exists in destination.
jetty-http-9.4.8.v20171121.jar already exists in destination.
jetty-io-9.4.8.v20171121.jar already exists in destination.
jetty-jmx-9.4.8.v20171121.jar already exists in destination.
jetty-rewrite-9.4.8.v20171121.jar already exists in destination.
jetty-security-9.4.8.v20171121.jar already exists in destination.
jetty-server-9.4.8.v20171121.jar already exists in destination.
jetty-servlet-9.4.8.v20171121.jar already exists in destination.
jetty-servlets-9.4.8.v20171121.jar already exists in destination.
jetty-util-9.4.8.v20171121.jar already exists in destination.
jetty-webapp-9.4.8.v20171121.jar already exists in destination.
jetty-xml-9.4.8.v20171121.jar already exists in destination.
spatial4j-0.7.jar already exists in destination.
noggit-0.8.jar already exists in destination.
asm-5.1.jar already exists in destination.
asm-commons-5.1.jar already exists in destination.
org.restlet-2.3.0.jar already exists in destination.
org.restlet.ext.servlet-2.3.0.jar already exists in destination.
jcl-over-slf4j-1.7.24.jar already exists in destination.
slf4j-api-1.7.24.jar already exists in destination.

Re: how to use HTMLStripCharFilter in solrJ?

Alexandre Rafalovitch
I am confused. Why do you not just add the CharFilter definition to the
field type you need?

You seem to be trying to do it completely on the client side? Not sure.

Regards,
    Alex

On Thu, Jul 5, 2018, 2:53 AM Arturas Mazeika, <[hidden email]> wrote:

> Hi Solr Folk,
>
> What would be the easiest way to use some of the Solr and Lucene components
> in SolrJ?

Re: how to use HTMLStripCharFilter in solrJ?

Ahmet Arslan
In reply to this post by Arturas Mazeika
Hi Arturas,

Here are some things to try:

1) HTMLStripCharFilter stripper = new HTMLStripCharFilter(strReader.markSupported() ? strReader : new BufferedReader(strReader));

2) Consider using the HTML strip update processor factory (HTMLStripFieldUpdateProcessorFactory), which strips the markup on the Solr side.

3) Create a custom Lucene analyzer that uses the HTML strip char filter and a whitespace tokenizer. Use the "Invoking the Analyzer" example given in http://lucene.apache.org/core/7_4_0/core/org/apache/lucene/analysis/package-summary.html
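
Point 1 relies on Reader.markSupported(): a StringReader already supports mark/reset, so wrapping it in a BufferedReader adds nothing, while a reader that does not support it gets wrapped. A minimal sketch of that check using only JDK classes follows; the class name MarkSupport and the method ensureMarkSupported are hypothetical names chosen for illustration, and the sketch makes no claim about whether HTMLStripCharFilter itself requires mark support.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper illustrating the markSupported() check from point 1. */
final class MarkSupport {

    /** Returns the reader itself if it supports mark/reset, else a BufferedReader around it. */
    static Reader ensureMarkSupported(Reader reader) {
        return reader.markSupported() ? reader : new BufferedReader(reader);
    }

    public static void main(String[] args) {
        Reader fromString = new StringReader("<p>text</p>");
        Reader fromStream = new InputStreamReader(
                new ByteArrayInputStream("<p>text</p>".getBytes(StandardCharsets.UTF_8)),
                StandardCharsets.UTF_8);

        // StringReader supports mark/reset, so it is passed through unchanged.
        System.out.println(ensureMarkSupported(fromString) == fromString);
        // InputStreamReader does not, so it is wrapped in a BufferedReader.
        System.out.println(ensureMarkSupported(fromStream) instanceof BufferedReader);
    }
}
```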

Ahmet



On Thursday, July 5, 2018, 9:53:58 AM GMT+3, Arturas Mazeika <[hidden email]> wrote:





Hi Solr Folk,

What would be the easiest way to use some of the Solr and Lucene components
in SolrJ?


Re: how to use HTMLStripCharFilter in solrJ?

Arturas Mazeika
In reply to this post by Alexandre Rafalovitch
Hi Alex,

I suppose the explanation in [0] (references) did not shed enough light
on the reasons, so I'll try to give additional, more detailed arguments.

I can take the HTML text, strip it of HTML tags, and index the content. I
can store the original text in Solr as well. Storing the intermediate
results, however, is not possible, and those intermediate results (what
happens to the text in the analysis or query chain) may be of high value.
At a high level, the reasons are presented in [0].

Quite a bit can already be achieved using the suggestions by Ahmet in this
thread. There are solid advantages to implementing customizations as
extensions in Solr/Lucene. Having the flexibility to decide whether to
implement something in Solr or outside it is also very advantageous (for
both Solr and non-Solr systems).

Cheers,
Arturas

On Thu, Jul 5, 2018 at 1:07 PM, Alexandre Rafalovitch <[hidden email]>
wrote:

> I am confused. Why do you not just add the CharFilter definition to the
> field type you need?
>
> You seem to be trying to do it completely on the client side? Not sure.
>
> Regards,
>     Alex
> > org.restlet.ext.servlet-2.3.0.jar already exists in destination.
> > jcl-over-slf4j-1.7.24.jar already exists in destination.
> > slf4j-api-1.7.24.jar already exists in destination.
> >
>

Re: how to use HTMLStripCharFilter in solrJ?

Arturas Mazeika
In reply to this post by Ahmet Arslan
Hi Ahmet,

Thanks a lot for the post and all the details.

I've started trying out the options that you suggested. And... I must
say that I can no longer reproduce my error, which means that even the
code that I originally posted works with flying colors.

I am puzzled.

Cheers,
Arturas


On Thu, Jul 5, 2018 at 5:25 PM, Ahmet Arslan <[hidden email]> wrote:

> Hi Arturas,
>
> Here are some things to try:
>
> 1) HTMLStripCharFilter stripper = new HTMLStripCharFilter(
>        strReader.markSupported() ? strReader : new BufferedReader(strReader));
>
> 2) Consider using the HTML Strip update processor factory.
>
> 3) Create a custom Lucene analyzer using the HTML strip char filter and a
> whitespace tokenizer. Use the "invoking the analyzer" example given in
> http://lucene.apache.org/core/7_4_0/core/org/apache/lucene/analysis/package-summary.html
>
> Ahmet
>
>
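For anyone following along, the read loop from the original post and the markSupported() check from option 1 can be factored into two small helpers, since HTMLStripCharFilter is itself a java.io.Reader. This is only a minimal sketch: the class name ReaderDrain and the method names markable and drain are made up for illustration and are not part of Solr or Lucene. It uses only the JDK, so it runs without solr-core on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper class, not part of Solr/Lucene.
final class ReaderDrain {

    // Same check as in option 1: wrap the reader in a BufferedReader
    // only if it does not already support mark()/reset().
    static Reader markable(Reader source) {
        return source.markSupported() ? source : new BufferedReader(source);
    }

    // Reads everything the reader produces into a String.
    // The reader is closed when done, even if read() throws.
    static String drain(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[8192];
        try (Reader in = reader) {
            int count;
            while ((count = in.read(buf)) != -1) { // -1 marks end of stream
                out.append(buf, 0, count);
            }
        }
        return out.toString();
    }
}
```

With Lucene on the classpath, the stripping step would then presumably become a single expression: ReaderDrain.drain(new HTMLStripCharFilter(ReaderDrain.markable(new StringReader(content)))). A StringReader already reports markSupported() as true, so markable returns it unchanged in that case.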
>