indexing XML stored on HDFS

7 messages

indexing XML stored on HDFS

Matthew Roth-2
Hi All,

Is there a DIH for HDFS? I see this old feature request [0
<https://issues.apache.org/jira/browse/SOLR-2096>] that never seems to have
gone anywhere. Google searches and searches on this list don't get me very
far.

Essentially, my workflow is this: I have many thousands of XML documents
stored in HDFS. I run an XSLT transformation in Spark [1
<https://github.com/elsevierlabs-os/spark-xml-utils>], which transforms them
to the expected Solr input of <add><doc><field ... /></doc></add>. This is
then written back to HDFS. Now how do I get it back into Solr? I suppose
I could move the data back to the local fs, but on the surface that feels
like the wrong way.
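For reference, the <add><doc><field ... /></doc></add> payload described above can be built with the Python standard library alone. A minimal sketch; the field names ("id", "title") are hypothetical examples, not taken from any real schema:

```python
# Build a Solr <add> XML update message from plain Python dicts.
# Sketch only: field names here are hypothetical examples.
import xml.etree.ElementTree as ET

def to_solr_add_xml(docs):
    """Render a list of {field: value} dicts as Solr's <add><doc> XML."""
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            field = ET.SubElement(doc, "field", attrib={"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml_payload = to_solr_add_xml([{"id": "1", "title": "first doc"}])
```

The same structure could just as well be emitted by the XSLT step itself; this is only a way to produce it from already-parsed records.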

I don't need to store the documents in HDFS after the Spark transformation,
so I wonder if I can write them using SolrJ. However, I am not really
familiar with SolrJ. I am also running a single node, and most of the
material I have read on spark-solr expects you to be running SolrCloud.

Best,
Matt



[0] https://issues.apache.org/jira/browse/SOLR-2096
[1] https://github.com/elsevierlabs-os/spark-xml-utils

Re: indexing XML stored on HDFS

Erick Erickson
Perhaps the bin/post tool? See:
https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
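An equivalent raw HTTP POST to Solr's XML update handler can also be sketched in Python, without bin/post. The host, port, and core name ("mycore") below are hypothetical placeholders:

```python
# Post a Solr XML update over HTTP without bin/post.
# Sketch only: the URL and core name "mycore" are hypothetical.
import urllib.request

def build_update_request(solr_url, xml_bytes):
    """Build (but do not send) a POST for Solr's XML update handler."""
    return urllib.request.Request(
        url=solr_url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )

req = build_update_request(
    "http://localhost:8983/solr/mycore/update?commit=true",
    b"<add><doc><field name='id'>1</field></doc></add>",
)
# urllib.request.urlopen(req)  # uncomment to actually send it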

On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth <[hidden email]> wrote:


Re: indexing XML stored on HDFS

Matthew Roth-2
Yes, the post tool would also be an acceptable option, and one I am familiar
with. However, I am also not seeing exactly how I would read from HDFS with
it. The hadoop-solr [0
<https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers>] tool by
Lucidworks looks the most promising. I have a meeting to attend shortly;
maybe I can explore that further in the afternoon.

I would also like to look further into SolrJ. I have no real reason to
store the results of the XSLT transformation anywhere other than Solr; I am
simply not familiar with it. But on the surface it seems like it might be
the most performant way to handle this problem.

If I do pursue this with SolrJ and Spark, will Solr handle multiple SolrJ
connections all trying to add documents?

[0] https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers
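On the concurrency question: Solr's update handlers generally accept concurrent requests, so a common client-side pattern is to post independent batches from a small thread pool. A hypothetical Python sketch, where the `send` callable stands in for a real HTTP POST or SolrJ call:

```python
# Post batches of documents to Solr from several worker threads.
# Sketch only: `send` is a placeholder for a real HTTP/SolrJ call.
from concurrent.futures import ThreadPoolExecutor

def index_concurrently(batches, send, max_workers=4):
    """Apply `send` to each batch on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, batches))

# Example with a stub sender that just counts documents per batch:
results = index_concurrently(
    [["doc1", "doc2"], ["doc3"]],
    send=lambda batch: len(batch),
)
```

Keeping `send` pluggable makes it easy to swap the stub for a real client later without changing the concurrency logic.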

On Wed, Dec 6, 2017 at 5:36 PM, Erick Erickson <[hidden email]>
wrote:


Re: indexing XML stored on HDFS

Rick Leir-2
In reply to this post by Erick Erickson
Matthew,
Do you have some sort of script calling XSLT? Sorry, I do not know Scala and I have not had time to look into your Spark utils. The script (or the Scala code) could shell out to curl, or, if it is Python, use the requests library to send a doc to Solr. Extra points for batching the documents.
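The "extra points for batching" part can be as simple as chunking the document list before each POST. A minimal stdlib sketch (the batch size is arbitrary):

```python
# Group documents into fixed-size batches so each request to Solr
# carries many docs instead of one. Pure stdlib; no Solr specifics.
def batches(docs, size):
    """Yield successive lists of at most `size` docs."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]

grouped = list(batches(["a", "b", "c", "d", "e"], size=2))
```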

Erick,
The last time I used the post tool, it was spinning up a JVM each time I called it (natch). Is there a simple way to launch it from a Java app server so you can call it repeatedly without the start-up overhead? It has been a few years; maybe I am wrong.
Cheers -- Rick


--
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: indexing XML stored on HDFS

Rick Leir-2
Matthew, oops, I should have mentioned re-indexing. With Solr, you want to be able to re-index quickly so you can try out different analysis chains. XSLT may not be fast enough for this if you have millions of docs, so I would be inclined to save the docs to a normal filesystem, perhaps as JSONL, and then use DIH, the post tool, or Python to post the docs to Solr.
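Saving the transformed docs as JSONL is one JSON object per line, so re-indexing can re-post without re-running the XSLT. A minimal stdlib sketch:

```python
# Write documents to JSONL (one JSON object per line) and read them
# back, so a re-index can re-post without re-running the transform.
import io
import json

def write_jsonl(docs, fh):
    for d in docs:
        fh.write(json.dumps(d) + "\n")

def read_jsonl(fh):
    return [json.loads(line) for line in fh if line.strip()]

# In-memory round trip for illustration; a real run would use a file.
buf = io.StringIO()
write_jsonl([{"id": "1"}, {"id": "2"}], buf)
buf.seek(0)
roundtrip = read_jsonl(buf)
```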
Rick

On December 7, 2017 10:14:37 AM EST, Rick Leir <[hidden email]> wrote:



Re: indexing XML stored on HDFS

Cassandra Targett
In reply to this post by Matthew Roth-2
Matthew,

The hadoop-solr project you mention would give you the ability to index
files in HDFS. It's a job jar, so you submit it to Hadoop with the params
you need, and it processes the files and sends them to Solr. It might not
be the fastest thing in the world, since it uses MapReduce, but we (I work
at Lucidworks) do have a number of people using it.

However, you mention that you're already processing your files with Spark,
and you don't really need them in HDFS in the long run - have you seen the
Spark-Solr project at https://github.com/lucidworks/spark-solr/? It has an
RDD for indexing docs to Solr, so you would be able to get the files from
wherever they originate, transform them in Spark, and get them into Solr.
It might be a better solution for your existing workflow.

Hope it helps -
Cassandra

On Thu, Dec 7, 2017 at 9:03 AM, Matthew Roth <[hidden email]> wrote:

> Yes the post tool would also be an acceptable option and one I am familiar
> with. However, I also am not seeing exactly how I would query hdfs. The
> hadoop-solr [0
> <https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers>] tool by
> lucidworks looks the most promising. I have a meeting to attend to shortly,
> and maybe I can explore that further in the afternoon.
>
> I also would like to look further into solrj. I have no real reason to
> store the results of the XSL transformation anywhere other than solr. I am
> simply not familiar with it. But on the surface it seems like it might be
> the most performant way to handle this problem.
>
> If I do pursue this with solrj and spark will solr handle multiple solrj
> connections all trying to add documents?
>
> [0] https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers
>
> On Wed, Dec 6, 2017 at 5:36 PM, Erick Erickson <[hidden email]>
> wrote:
>
> > Perhaps the bin/post tool? See:
> > https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
> >
> > On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth <[hidden email]> wrote:
> > > Hi All,
> > >
> > > Is there a DIH for HDFS? I see this old feature request [0
> > > <https://issues.apache.org/jira/browse/SOLR-2096>] that never seems to
> > have
> > > gone anywhere. Google searches and searches on this list don't get me
> to
> > > far.
> > >
> > > Essentially my workflow is that I have many thousands of XML documents
> > > stored in hdfs. I run an xslt transformation in spark [1
> > > <https://github.com/elsevierlabs-os/spark-xml-utils>]. This transforms
> > to
> > > the expected solr input of <add><doc><field ... /></doc></add>. This is
> > > than written the back to hdfs. Now how do I get it back to solr? I
> > suppose
> > > I could move the data back to the local fs, but on the surface that
> feels
> > > like the wrong way.
> > >
> > > I don't need to store the documents in HDFS after the spark
> > transformation,
> > > I wonder if I can write them using solrj. However, I am not really
> > familiar
> > > with solrj. I am also running a single node. Most of the material I
> have
> > > read on spark-solr expects you to be running SolrCloud.
> > >
> > > Best,
> > > Matt
> > >
> > >
> > >
> > > [0] https://issues.apache.org/jira/browse/SOLR-2096
> > > [1] https://github.com/elsevierlabs-os/spark-xml-utils
> >
>

Re: indexing XML stored on HDFS

Matthew Roth-2
Thanks, Rick.

While long-term storage of the documents in HDFS is not necessary, you
raise a good point that easy access to these documents during the
development phase will be useful.

Cassandra,

With spark-solr, I am under the impression that I must be running SolrCloud.
At this time I need some of the features that are not available in
SolrCloud, e.g. joining across cores. Additionally, the projected demands
on Solr mean running it as a single node will be acceptable.

The hadoop-solr project does look the most promising at the moment. I am
hoping to play with it some this afternoon, but it may have to wait until
next week.

Thanks for the help.

Best,
Matt

On Fri, Dec 8, 2017 at 1:36 PM, Cassandra Targett <[hidden email]>
wrote:
