[lucy-user] Input format to Lucy

Anil Pachuri


Hi,

Does Lucy have a utility to accept raw XML files as input? I have 50 XML files and I need to index selected fields in them using Lucy.

Also, is there any general perl utility to merge multiple XML files or convert these into tabular format?

Thank you!
AP

RE: [lucy-user] Input format to Lucy

Zebrowski, Zak
Hi,
Lucy will accept raw XML files as input, since it indexes whatever text you give it. If your XML is at all complicated, though, you will probably want to use XML::Simple or another Perl module to extract the relevant text and store the processed fields in Lucy. Use search.cpan.org to find a module to help you with XML parsing.
Zak
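
A minimal sketch of that approach, assuming each XML file holds one record whose selected fields are plain-text children of the root element. The sub name index_xml_files and the field names in the usage line are hypothetical; the Lucy and XML::Simple calls are the documented ones, but treat the whole thing as a starting point rather than a finished indexer.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Index::Indexer;

# Index the selected fields of each XML file; returns the number of files added.
# (Hypothetical helper, not part of Lucy.)
sub index_xml_files {
    my ( $index_path, $fields, @files ) = @_;

    # One full-text field per selected XML element.
    my $schema = Lucy::Plan::Schema->new;
    my $type   = Lucy::Plan::FullTextType->new(
        analyzer      => Lucy::Analysis::PolyAnalyzer->new( language => 'en' ),
        highlightable => 1,    # only needed if you want excerpts later
    );
    $schema->spec_field( name => $_, type => $type ) for @$fields;

    my $indexer = Lucy::Index::Indexer->new(
        index  => $index_path,
        schema => $schema,
        create => 1,
    );

    for my $file (@files) {
        my $xml = XMLin($file);    # hashref keyed by child element name
        my %doc;
        for my $field (@$fields) {
            my $value = $xml->{$field};
            # Nested or empty elements come back as references; skip those here.
            $doc{$field} = ( defined $value && !ref $value ) ? $value : '';
        }
        $indexer->add_doc( \%doc );
    }
    $indexer->commit;
    return scalar @files;
}
```

It could be called as index_xml_files( 'xml_index', [qw(title abstract)], glob 'data/*.xml' ), where the index path, field names, and directory are placeholders for your own.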

Re: [lucy-user] Input format to Lucy

Peter Karman
Anil Pachuri wrote on 2/21/13 3:22 PM:
>
>
> Hi,
>
> Does Lucy have a utility to accept raw XML files as input? I have 50 XML files and I need to index selected fields in them using Lucy.
>

If you install SWISH::Prog::Lucy from CPAN, you also get the swish3 tool, which
will index XML (and HTML, etc.) files for Lucy. You can specify in a configuration
file which XML elements you want treated as Lucy fields. For example:

# a document like
<doc>
  <foo>bar</foo>
</doc>

# a config file like
MetaNames foo
PropertyNames foo

# and then index the file like:

% swish3 -F lucy -c configfile -i doc.xml

# and search like:

% swish3 -q foo:bar

The configuration docs are at:

http://swish-e.org/docs/swish-config.html

You might also want to look at Dezi, which does the same thing with a
server/client setup. http://dezi.org/



> Also, is there any general perl utility to merge multiple XML files or convert these into tabular format?

CPAN has many XML handling tools. I'm sure there's something there that will do
most or all of what you want.
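
As one possible illustration, here is a small sketch that flattens simple XML records into tab-separated rows with XML::Simple. It assumes each file holds a single record with plain-text child elements; the sub name xml_to_row and the column list are hypothetical.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Turn one XML record (a file name or an XML string) into a tab-separated row.
# (Hypothetical helper; column names must match child element names.)
sub xml_to_row {
    my ( $source, @columns ) = @_;
    my $rec = XMLin($source);
    my @cells;
    for my $col (@columns) {
        my $value = $rec->{$col};
        # Nested or empty elements come back as references; emit an empty cell.
        $value = '' if !defined $value || ref $value;
        $value =~ s/[\t\r\n]+/ /g;    # keep each record on one line
        push @cells, $value;
    }
    return join "\t", @cells;
}

my @columns = qw(foo);    # assumed element names; adjust to your data
print join( "\t", @columns ), "\n";
print xml_to_row( $_, @columns ), "\n" for @ARGV;
```

Run with the XML files as arguments, it prints a header line followed by one row per file, which in effect also merges the files into a single table.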


--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: [lucy-user] Input format to Lucy

Anil Pachuri
Finally, I have been able to run Lucy. :) Thanks a lot, Peter, for your help.
Is there a way in Lucy to generate a tag map/cloud of specific types of terms or phrases that occur in the documents returned by Lucy for a particular query? For example, for a query gene (e.g. nuclear factor 1), I want a tag map showing all gene names and cell-tissue names (taken from their respective name lists) that are co-mentioned in the returned documents, together with their document frequency.
One other question: how can I change the default size of the text excerpt reported in the search results?
Thank you very much.

Re: [lucy-user] Input format to Lucy

Peter Karman
On 4/20/13 11:50 PM, Anil Pachuri wrote:
> Finally, I have been able to run Lucy..:). Thanks a lot Peter for your help.

Glad you've got it working. Did you write your own Lucy application or
implement one of the existing ones (swish3 or Dezi)?


>
> Is there a way in Lucy to generate a tag map/cloud of specific types of
> terms/phrases that might be present in the search results/documents
> returned by Lucy for a particular query? For example, I want to generate
> a tag map showing all gene-names and also cell-tissue names with their
> document frequency (from their respective name-lists) that might be
> co-mentioned in the search results/documents returned by Lucy for a
> query gene (e.g. nuclear factor 1)?

That sounds like a variation on facets. You might want to look at

https://metacpan.org/source/KARMAN/Search-OpenSearch-Engine-Lucy-0.16/lib/Search/OpenSearch/Engine/Lucy.pm#L86

for inspiration.
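
A rough sketch of the counting step, independent of how the linked module does it: given the stored text of each hit and a name list, count how many hit documents mention each name. The sub name name_doc_freq is hypothetical, and the sketch assumes you have already collected the hit texts yourself (for example from a stored field while iterating over $hits->next).

```perl
use strict;
use warnings;

# For each name in a list, count how many of the given texts mention it
# (document frequency within the result set). Hypothetical helper.
sub name_doc_freq {
    my ( $names, $texts ) = @_;
    my %df;
    for my $text (@$texts) {
        for my $name (@$names) {
            # Case-insensitive whole-phrase match; names that start or end
            # with punctuation would need different boundary handling.
            $df{$name}++ if $text =~ /\b\Q$name\E\b/i;
        }
    }
    return \%df;
}

# Made-up texts standing in for the documents returned for a query.
my $df = name_doc_freq(
    [ 'nuclear factor 1', 'liver' ],
    [ 'Nuclear factor 1 is expressed in liver.', 'Liver samples were used.' ],
);
printf "%s\t%d\n", $_, $df->{$_} for sort keys %$df;
```

This scans every hit for every name, which is fine for short name lists and a few hundred hits; for large lists, indexing the names as their own field at index time is likely the better route.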


>
> One other question, how can I change the default size of text excerpt
> reported in the search results?
>

That depends on how you are generating your search results. If you are
using Dezi and/or Search::OpenSearch::Engine::Lucy, you can alter that
with snipper_config:

https://metacpan.org/module/Search::OpenSearch::Engine#snipper_config

which is passed through to Search::Tools::Snipper:

https://metacpan.org/module/Search::Tools::Snipper

If you are writing your own highlighter code with Lucy directly, then see
https://metacpan.org/module/LOGIE/Lucy-0.3.2/lib/Lucy/Highlight/Highlighter.pod
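
For the direct-Lucy case, a minimal sketch using the excerpt_length parameter of Lucy::Highlight::Highlighter (200 characters by default). It assumes the field was indexed with highlightable => 1 in its FullTextType; the sub name excerpts_for and its arguments are hypothetical.

```perl
use strict;
use warnings;
use Lucy::Search::IndexSearcher;
use Lucy::Highlight::Highlighter;

# Return one highlighted excerpt per hit, with a caller-chosen maximum length.
# (Hypothetical helper, not part of Lucy.)
sub excerpts_for {
    my ( $index_path, $field, $query, $length ) = @_;

    my $searcher = Lucy::Search::IndexSearcher->new( index => $index_path );
    my $highlighter = Lucy::Highlight::Highlighter->new(
        searcher       => $searcher,
        query          => $query,     # a query string or a Query object
        field          => $field,     # must be a highlightable field
        excerpt_length => $length,    # maximum excerpt size in characters
    );

    my $hits = $searcher->hits( query => $query, num_wanted => 10 );
    my @excerpts;
    while ( my $hit = $hits->next ) {
        push @excerpts, $highlighter->create_excerpt($hit);
    }
    return @excerpts;
}
```

For example, excerpts_for( 'xml_index', 'abstract', 'nuclear factor 1', 400 ) would ask for excerpts of up to 400 characters from a field named abstract, with both names being placeholders.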




--
Peter Karman  .  http://peknet.com/  .  [hidden email]