creating Lucene document from an external XML file.


creating Lucene document from an external XML file.

Phanindra Reva
Hello All,
I am a newbie using Solr and Lucene. In my task, I have
to create org.apache.lucene.document.Document objects from external,
valid Solr XML files. To be brief, depending on the names of the fields,
I need to modify the corresponding values in a way that is specific to our
project. So I would like to know whether there is an API that
creates an org.apache.lucene.document.Document object directly from
an external XML file, because in my case I need to make changes to
the created Document object.
Please don't mind if this does not make sense.
Thanks.

Re: creating Lucene document from an external XML file.

Otis Gospodnetic-2
Hi,
If I understand you correctly, you really want to be constructing SolrInputDocuments (not Lucene's Documents) and indexing those with SolrJ.  I don't think there is anything in the API that can read in an XML file and convert it into a SolrInputDocument instance, but aren't there libraries that can convert XML into Java objects and vice versa?  Maybe one of those could be used.
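
A minimal sketch of the "XML into Java objects" route, assuming the files use the usual <add><doc><field name="...">value</field></doc></add> layout. It uses only the DOM parser that ships with the JDK. The class SolrAddXmlReader and its methods are hypothetical names made up for illustration; they are not part of Solr or Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only; not a Solr or Lucene class.
final class SolrAddXmlReader {

    /** Parses Solr-style add XML into one field map per <doc> element. */
    static List<Map<String, List<String>>> parse(String xml) {
        org.w3c.dom.Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<Map<String, List<String>>> docs = new ArrayList<>();
        NodeList docNodes = dom.getElementsByTagName("doc");
        for (int i = 0; i < docNodes.getLength(); i++) {
            Element doc = (Element) docNodes.item(i);
            // A field name may repeat (multi-valued field), so keep a list per name.
            Map<String, List<String>> fields = new LinkedHashMap<>();
            NodeList fieldNodes = doc.getElementsByTagName("field");
            for (int j = 0; j < fieldNodes.getLength(); j++) {
                Element field = (Element) fieldNodes.item(j);
                fields.computeIfAbsent(field.getAttribute("name"), k -> new ArrayList<>())
                      .add(field.getTextContent());
            }
            docs.add(fields);
        }
        return docs;
    }

    /** Example of a project-specific change: upper-case every value of one field. */
    static void upperCaseField(Map<String, List<String>> doc, String fieldName) {
        List<String> values = doc.get(fieldName);
        if (values != null) {
            values.replaceAll(String::toUpperCase);
        }
    }

    public static void main(String[] args) {
        String xml = "<add><doc>"
                + "<field name=\"id\">doc1</field>"
                + "<field name=\"title\">hello</field>"
                + "</doc></add>";
        for (Map<String, List<String>> doc : parse(xml)) {
            upperCaseField(doc, "title");
            System.out.println(doc);
        }
    }
}
```

To read from a file path instead of a string, DocumentBuilder also has a parse(File) overload. Each resulting map could then be copied field by field into a SolrInputDocument with addField(name, value) before sending it through SolrJ.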

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----

> From: Phanindra Reva <[hidden email]>
> To: [hidden email]
> Sent: Fri, November 20, 2009 10:32:45 AM
> Subject: creating Lucene document from an external XML file.


Re: creating Lucene document from an external XML file.

hossman

: If I understand you correctly, you really want to be constructing
: SolrInputDocuments (not Lucene's Documents) and indexing those with
: SolrJ.  I don't think there is anything in the API that can read in an

I read your question differently than Otis did.  My understanding is that
you already have code that builds up files in the "<add><doc>..." update
message syntax Solr expects, but you want to modify those documents (w/o
changing your existing code).

One possibility to think about is that instead of modifying the documents
before sending them to Solr, you could write an UpdateProcessor that runs
directly in Solr and gets access to those Documents after Solr has already
parsed that XML (or even if the documents come from someplace else, like
DIH, or a CSV file) and then make your changes.
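
A rough sketch of the processor-chain idea, written with stand-in classes so that it runs without Solr on the classpath. Every type below (Processor, TrimFieldProcessor, CollectingProcessor) is hypothetical and only mimics the shape of the real thing. In Solr itself the corresponding pieces would be a subclass of UpdateRequestProcessor that overrides processAdd(AddUpdateCommand), created by an UpdateRequestProcessorFactory and registered in solrconfig.xml; how the document is reached from the command object depends on the Solr version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in types that only imitate the structure of an update processor chain.
// None of them are Solr classes.
final class UpdateChainSketch {

    /** One step in the chain; the default behaviour is to hand the document on. */
    abstract static class Processor {
        private final Processor next;

        Processor(Processor next) {
            this.next = next;
        }

        void processAdd(Map<String, Object> doc) {
            if (next != null) {
                next.processAdd(doc);
            }
        }
    }

    /** Project-specific step: rewrites the value of one named field. */
    static final class TrimFieldProcessor extends Processor {
        private final String fieldName;

        TrimFieldProcessor(String fieldName, Processor next) {
            super(next);
            this.fieldName = fieldName;
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            Object value = doc.get(fieldName);
            if (value instanceof String) {
                doc.put(fieldName, ((String) value).trim());
            }
            super.processAdd(doc); // pass the modified document down the chain
        }
    }

    /** Final step: stands in for the processor that would write to the index. */
    static final class CollectingProcessor extends Processor {
        final List<Map<String, Object>> indexed = new ArrayList<>();

        CollectingProcessor() {
            super(null);
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            indexed.add(doc);
        }
    }

    public static void main(String[] args) {
        CollectingProcessor sink = new CollectingProcessor();
        Processor chain = new TrimFieldProcessor("title", sink);

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "doc1");
        doc.put("title", "  hello  ");
        chain.processAdd(doc);

        System.out.println(sink.indexed);
    }
}
```

The point of the pattern is that the field changes live in one place on the server side, so they apply no matter whether the documents arrive as XML, through DIH, or from a CSV file.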


If Otis and I have *both* misunderstood your question, please clarify.



-Hoss


Re: creating Lucene document from an external XML file.

Phanindra Reva
Hello,
Thank you both for patiently reading and understanding my question.

> you already have code that builds up files in the "<add><doc>..." update
> message syntax Solr expects, but you want to modify those documents (w/o
> changing your existing code)

Yes, I already have the document collection. I have to
change the values of some fields in all the documents before indexing.

> one possibility to think about is that instead of modifying the documents
> before sending them to Solr, you could write an UpdateProcessor that runs
> directly in Solr and gets access to those Documents after Solr has already
> parsed that XML (or even if the documents come from someplace else, like
> DIH, or a CSV file) and then make your changes.

I have decided not to modify the document files themselves; instead I want to
modify them at run time (that is, modify the variables of the Java objects that
contain the information extracted from each document file).
My question is: is there any part of the API that takes a document file
path as input, returns a Java object, and gives us a way to modify that object
before sending it for indexing (to the
IndexWriter in the Lucene API)?
I think Otis answered that there is no such API and that I should instead use
external Java XML APIs to complete the task.
I am sorry if my description is making things complicated.
Thanks.


(In reply to Chris Hostetter's message of Mon, Nov 23, 2009 at 9:36 PM.)


Re: creating Lucene document from an external XML file.

hossman

: > one possibility to think about is that instead of modifying the documents
: > before sending them to Solr, you could write an UpdateProcessor that runs
: > directly in Solr and gets access to those Documents after Solr has already
: > parsed that XML (or even if the documents come from someplace else, like
: > DIH, or a CSV file) and then make your changes.
: I have decided not to modify the document files themselves; instead I want to
: modify them at run time (that is, modify the variables of the Java objects that
: contain the information extracted from each document file).
: My question is: is there any part of the API that takes a document file
: path as input, returns a Java object, and gives us a way to modify that object
: before sending it for indexing (to the
: IndexWriter in the Lucene API)?

Yes ... as I mentioned, the UpdateProcessor API is where you have access to
the Documents as Lucene objects inside of Solr before they are indexed.



-Hoss


Re: creating Lucene document from an external XML file.

Phanindra Reva
Hello,
You mentioned that I can make use of the UpdateProcessor API.
May I know at what point the flow of execution enters the
UpdateRequestProcessor class? To be brief, it would be perfect for
my case if that happened after analysis but just before the document is added to
the index.
Thanks a lot.

(In reply to Chris Hostetter's message of Wed, Dec 2, 2009 at 8:56 PM.)


Re: creating Lucene document from an external XML file.

Otis Gospodnetic-2
I think you'd have to dig into Solr (Lucene, actually) to inject yourself after analysis.  The UpdateRequestProcessor, as the name implies, is at the request level, so pretty high up/early on.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----

> From: Phanindra Reva <[hidden email]>
> To: [hidden email]
> Sent: Fri, December 4, 2009 7:48:46 AM
> Subject: Re: creating Lucene document from an external XML file.