multiple binary documents into a single solr document - Vignette/OpenText integration

Fábio Aragão da Silva
hello there,
I'm working on the development of a piece of code that integrates Solr
with Vignette/OpenText Content Management, meaning Vignette content
instances will be indexed in solr when published and deleted from solr
when unpublished. I'm using solr 1.4, solrj and solr cell.

I've implemented most of the code and I've run into only a single
issue so far: vignette content management supports the attachment of
multiple binary documents (such as .doc, .pdf or .xls files) to a
single content instance. I am mapping each content instance in
Vignette to a solr document, but now I have a content instance in
vignette with multiple binary files attached to it.

So my question is: is it possible to have more than one binary file
indexed into a single document in solr?

I'm a beginner in solr, but from what I understood I have two options
to index content using solrj: either to use UpdateRequest() and the
add() method to add a SolrInputDocument to the request (in case the
document doesn't represent a binary file), or to use
ContentStreamUpdateRequest() and the addFile() method to add a binary
file to the content stream request.

I don't see a way, though, to say "this document is comprised of two
files, a word and a pdf, so index them as one document in solr using
content1 and content2 fields (or merge their content into a single
'content' field)".

I tried calling addFile() twice (one call for each file) and got no
error, but nothing was indexed either.

ContentStreamUpdateRequest req = new
ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("file1.doc"));
req.addFile(new File("file2.pdf"));
req.setParam("literal.id", "multiple_files_test");
req.setParam("uprefix", "attr_");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
server.request(req);

Any thoughts on this would be greatly appreciated.

greetings from Brazil,
Fábio.

Re: multiple binary documents into a single solr document - Vignette/OpenText integration

Andrzej Białecki-2
On 2010-03-24 15:58, Fábio Aragão da Silva wrote:

> [...]
> So my question is: is it possible to have more than one binary file
> indexed into a single document in solr?
> [...]

Write your own RequestHandler that uses the existing
ExtractingRequestHandler to actually parse the streams, and then you
combine the results arbitrarily in your handler, eventually sending an
AddUpdateCommand to the update processor. You can obtain both the update
processor and SolrCell instance from req.getCore().
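
The "combine the results" step could be sketched roughly as below. This is only a Solr-free illustration under stated assumptions: the class name CombinedDocSketch and the combine() helper are hypothetical, and the per-file text is assumed to have been extracted already (in a real handler that part would come from Solr Cell). It shows the two layouts asked about in the original post: one multivalued "content" field, or numbered content1, content2 fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: models only the "combine" step of a custom handler.
// Text extraction is assumed to have already produced one string per file.
class CombinedDocSketch {

    // Builds the fields of ONE document from several extracted texts.
    // mergeIntoOneField == true  -> all texts become values of "content"
    // mergeIntoOneField == false -> texts go to content1, content2, ...
    static Map<String, List<String>> combine(String id,
                                             List<String> extractedTexts,
                                             boolean mergeIntoOneField) {
        Map<String, List<String>> fields = new LinkedHashMap<String, List<String>>();
        List<String> idValue = new ArrayList<String>();
        idValue.add(id);
        fields.put("id", idValue);

        for (int i = 0; i < extractedTexts.size(); i++) {
            String name = mergeIntoOneField ? "content" : "content" + (i + 1);
            List<String> values = fields.get(name);
            if (values == null) {
                values = new ArrayList<String>();
                fields.put(name, values);
            }
            values.add(extractedTexts.get(i));
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<String>();
        texts.add("text extracted from file1.doc");
        texts.add("text extracted from file2.pdf");
        System.out.println(combine("multiple_files_test", texts, true));
        System.out.println(combine("multiple_files_test", texts, false));
    }
}
```

The resulting field map would then be copied into a SolrInputDocument and handed to the update processor; that wiring is left out here because it depends on the Solr version in use.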


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: multiple binary documents into a single solr document - Vignette/OpenText integration

Chris Hostetter-3

: > I tried calling the addFile() twice (one call for each file) and no
: > error but nothing getting indexed as well.
        ...
: Write your own RequestHandler that uses the existing ExtractingRequestHandler
: to actually parse the streams, and then you combine the results arbitrarily in
: your handler, eventually sending an AddUpdateCommand to the update processor.
: You can obtain both the update processor and SolrCell instance from
: req.getCore().

The key bit being: yes, you can attach multiple files to your request,
and yes, the SolrQueryRequest abstraction can handle that (they appear as
two "ContentStreams" to the RequestHandler), but the existing
ExtractingRequestHandler assumes there will only be one ContentStream and
constructs one document for it -- the API isn't really designed around
the idea of generating a single SolrInputDocument from multiple
ContentStreams (where would you get the "title" from? etc...)
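
One possible answer to the "title" question is a simple precedence rule. The sketch below is an assumption for illustration, not anything Solr does today: for each single-valued metadata field, the first stream that supplies a non-blank value wins. The class FirstWinsMetadataSketch and its resolve() method are hypothetical names, and the per-stream metadata maps stand in for whatever the extraction step returns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical policy: for single-valued fields such as "title", the first
// stream (in request order) that provides a non-blank value wins.
class FirstWinsMetadataSketch {

    static Map<String, String> resolve(List<Map<String, String>> perStreamMetadata) {
        Map<String, String> resolved = new LinkedHashMap<String, String>();
        for (Map<String, String> metadata : perStreamMetadata) {
            for (Map.Entry<String, String> entry : metadata.entrySet()) {
                String value = entry.getValue();
                boolean blank = value == null || value.trim().isEmpty();
                // Keep the earlier value if one was already chosen.
                if (!blank && !resolved.containsKey(entry.getKey())) {
                    resolved.put(entry.getKey(), value);
                }
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("title", "");                 // the .doc has no usable title
        doc.put("author", "first author");
        Map<String, String> pdf = new LinkedHashMap<String, String>();
        pdf.put("title", "title from the pdf");
        pdf.put("author", "second author");

        List<Map<String, String>> streams = new ArrayList<Map<String, String>>();
        streams.add(doc);
        streams.add(pdf);
        System.out.println(resolve(streams));
    }
}
```

Other rules (always use the first stream, or take the title from a literal.* parameter instead) would be just as defensible; the point is only that a multi-stream handler has to pick one.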

There was talk about trying to generalize this, but I don't think anyone
else has looked into it much.  Here's one reference, but I definitely
remember a more recent thread about this idea...

http://n3.nabble.com/ExtractingRequestHandler-and-XmlUpdateHandler-tt492202.html#a492211



-Hoss


Re: multiple binary documents into a single solr document - Vignette/OpenText integration

Lance Norskog-2
Do you want to index the text in the attachments?

If so, you probably are better off creating a unique document for the
mail body and each attachment. A field in the document could give the
id of the main email document. The main email document could contain a
multivalued field giving all of the attachment ids.
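
A minimal sketch of that layout, with plain maps standing in for Solr documents. The field names parent_id, attachment_ids and content, the "_att" id suffix, and the class AttachmentDocsSketch are all assumptions made for illustration; nothing here is part of solrj.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical layout: one main document plus one document per attachment.
// Each attachment points back to the main document via "parent_id", and the
// main document lists its attachments in a multivalued "attachment_ids" field.
class AttachmentDocsSketch {

    static List<Map<String, Object>> build(String parentId, List<String> attachmentTexts) {
        List<Map<String, Object>> docs = new ArrayList<Map<String, Object>>();
        List<String> attachmentIds = new ArrayList<String>();

        Map<String, Object> parent = new LinkedHashMap<String, Object>();
        parent.put("id", parentId);
        parent.put("attachment_ids", attachmentIds);
        docs.add(parent);

        for (int i = 0; i < attachmentTexts.size(); i++) {
            String childId = parentId + "_att" + (i + 1);
            attachmentIds.add(childId);

            Map<String, Object> child = new LinkedHashMap<String, Object>();
            child.put("id", childId);
            child.put("parent_id", parentId);
            child.put("content", attachmentTexts.get(i));
            docs.add(child);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<String>();
        texts.add("text extracted from file1.doc");
        texts.add("text extracted from file2.pdf");
        for (Map<String, Object> doc : build("content_instance_1", texts)) {
            System.out.println(doc);
        }
    }
}
```

With this layout each attachment can still go through /update/extract on its own, one file per request, with the id and parent id passed as literal.* parameters. Unpublishing then means deleting the main document and every document that carries its parent id.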

On Thu, Mar 25, 2010 at 10:14 AM, Chris Hostetter
<[hidden email]> wrote:

> [...]



--
Lance Norskog
[hidden email]

Re: multiple binary documents into a single solr document - Vignette/OpenText integration

briankous
Hi there,

We are trying to replace OpenText (V7.6) Autonomy with Solr so that we can index other content, too. Due to a lack of manpower and time, management wants to buy an adapter if one is available. Do you know of any vendor that sells such an adapter or offers professional services? Thank you.

Brian Ko
bko@behr.com