Solr is indexing XML only?

David Trattnig
Hello!

I'd like to set up and develop a search server. I thought I would use
Lucene, then I read about Solr, so I worked through the Solr tutorial. At
first I was really happy about the features it adds on top of Lucene, but
now I have noticed that Solr can index only XML files. Or am I completely
wrong?

What should I use for the following situation:

1. Copy HTML-files to the Live-Server (via RSync)
2. Index them by the search engine
3. Exclude some "tagged" files (these files for example would have a
specific meta-data-tag)
4. Exclude HTML-tags and other unworthy stuff

How much development work would that be with Lucene or Solr (if it is
possible at all)?

Any help would be appreciated!

Thx in advance,
david

Re: Solr is indexing XML only?

Erik Hatcher
David,

Solr doesn't index XML files, but rather XML is used as the wrapper  
of the text that does get indexed.  The document structure is defined  
in schema.xml, and the field text to be indexed is sent wrapped in an  
XML request.

For your scenario, you would need to write code that parses the HTML as
desired (taking any exclude rules into account), wraps the text to be
indexed, along with any metadata such as the HTML filename or URL, in
XML, and POSTs it to Solr using the XML structure described here:

        <http://wiki.apache.org/solr/UpdateXmlMessages>

The XML request body is just a carrier of the data in a structured  
way, nothing more.
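
A minimal sketch of such a request in Python, assuming a Solr instance at
the hypothetical URL http://localhost:8983/solr/update and hypothetical
field names (id, title, text) that would have to match your schema.xml.
build_add_message and post_to_solr are illustrative helper names, not
part of Solr:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed update URL; adjust to wherever your Solr instance is running.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_message(docs):
    """Wrap a list of {field name: text} dicts in an <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


def post_to_solr(xml_message, url=SOLR_UPDATE_URL):
    """POST an update message (<add>, <commit/>, ...) and return the reply."""
    request = urllib.request.Request(
        url,
        data=xml_message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    message = build_add_message(
        [{"id": "docs/index.html", "title": "Welcome", "text": "Hello & welcome"}]
    )
    print(message)
    # post_to_solr(message)      # send the document
    # post_to_solr("<commit/>")  # make it visible to searches
```

The network calls are left commented out so the message can be inspected
before anything is sent to a server.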

        Erik


On Apr 26, 2006, at 4:27 AM, David Trattnig wrote:

> Hello!
>
> I'd like to setup/develop a search-server. I thought I would use  
> Lucene,
> then I read about Solr. So I have done the Solr-Tutorial. Firstly  
> really
> happy about the additional features to the Lucene-Functionality I now
> noticed that Solr can index only XML files. Or am I completely wrong?
>
> What should I use for the following situation:
>
> 1. Copy HTML-files to the Live-Server (via RSync)
> 2. Index them by the search engine
> 3. Exclude some "tagged" files (these files for example would have a
> specific meta-data-tag)
> 4. Exclude HTML-tags and other unworthy stuff
>
> How much work of development would that be with Lucene or Solr (If
> possible)?
>
> Any help would be appreciated!
>
> Thx in advance,
> david


Re: Solr is indexing XML only?

Bill Au
With Solr you can index anything Lucene can index, since Solr uses
Lucene under the covers.  The input to Solr is in XML format.  You
will need to process the data you want to index (i.e. exclude certain
files and remove HTML tags) and put it into Solr's input format.
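
That preprocessing could look roughly like this in Python, using only
the standard library. The marker <meta name="robots" content="noindex">
is an assumed convention for "tagged" files; the thread does not say
which metadata tag is used. TextExtractor and extract_text are
illustrative names:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and note whether the page asks to be skipped."""

    SKIPPED_ELEMENTS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.excluded = False
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Assumed exclusion marker: <meta name="robots" content="noindex">
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").lower()
            if name == "robots" and "noindex" in content:
                self.excluded = True
        elif tag in self.SKIPPED_ELEMENTS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_ELEMENTS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the plain text of a page, or None if the page is excluded."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return None if parser.excluded else parser.text()
```

Text chunks are joined with a space, so a word split by inline markup
(for example Hel<b>lo</b>) would come out as two tokens; a real indexer
may want finer control over that.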

Bill



Re: Solr is indexing XML only?

Chris Hostetter-3
: will need to process that data you want to index (ie exclude certain
: files and remove HTML tags) and put them into Solr's input format.

Minor clarification: Solr does ship with two Tokenizers that do a pretty
good job of throwing away HTML markup, so you don't have to parse it
yourself -- but they are still analyzers: all of the tokens they produce
go into one field, so there's no way to use them to parse an entire HTML
file and put the <title> in one field and <body> in another.
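
Splitting a page into separate fields therefore has to happen on the
client, before the document is sent to Solr. A rough Python sketch; the
field names title and body are assumptions that would need matching
entries in schema.xml, and TitleBodySplitter / split_fields are
illustrative names:

```python
from html.parser import HTMLParser


class TitleBodySplitter(HTMLParser):
    """Route text inside <title> and <body> into separate buckets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {"title": [], "body": []}
        self._current = None  # name of the field currently being filled

    def handle_starttag(self, tag, attrs):
        if tag in self.fields:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current].append(data)


def split_fields(html):
    """Return {'title': ..., 'body': ...} with whitespace normalised."""
    parser = TitleBodySplitter()
    parser.feed(html)
    parser.close()
    return {
        name: " ".join(" ".join(parts).split())
        for name, parts in parser.fields.items()
    }
```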

: > 1. Copy HTML-files to the Live-Server (via RSync)
: > 2. Index them by the search engine
: > 3. Exclude some "tagged" files (these files for example would have a
: > specific meta-data-tag)
: > 4. Exclude HTML-tags and other unworthy stuff
: >
: > How much work of development would that be with Lucene or Solr (If
: > possible)?

With the exception of item #4 in your list (which I addressed above),
the amount of work necessary to process your files and extract the text
you want to index will largely be the same regardless of whether you use
Lucene or Solr -- what Solr provides is all of the "service" layer stuff,
for example...
  * an HTTP-based API, so the file processing and the searching don't have
    to live on the same machine.
  * a schema that allows you to say "this text should be searchable, and
    this number should be sortable" without needing to hardcode those rules
    into your indexer ... you can change your mind later and only modify
    your schema, not your code.
  * a really smart caching system that knows when the data in your index
    has been modified.

...etc.
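
To illustrate the first point: a search client only needs to issue an
HTTP request, from any machine that can reach the server. A minimal
Python sketch, assuming a Solr instance at the hypothetical base URL
http://localhost:8983/solr with its select request handler; the exact
path depends on your installation, and the helper names are illustrative:

```python
import urllib.parse
import urllib.request

SOLR_BASE_URL = "http://localhost:8983/solr"  # assumed location


def build_select_url(query, rows=10, base_url=SOLR_BASE_URL):
    """Build the URL for a search request against the select handler."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return f"{base_url}/select?{params}"


def search(query, rows=10, base_url=SOLR_BASE_URL):
    """Run the query and return Solr's raw response body as text."""
    with urllib.request.urlopen(build_select_url(query, rows, base_url)) as resp:
        return resp.read().decode("utf-8")
```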



-Hoss


Re: Solr is indexing XML only?

David Trattnig
Hi Chris,

Thank you so much! Could you also explain to me how to use these two
Tokenizers?
And if there is a Tokenizer that throws away HTML markup, shouldn't it
also be possible to extend it to exclude additional content easily?

TIA,
david



Re: Solr is indexing XML only?

Yonik Seeley
On 4/27/06, David Trattnig <[hidden email]> wrote:
> thank you so much! Could you also explain me how to use these two
> Tokenizers?

Here's the HTMLStrip tokenizer description:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e

Read through the Solr example schema.xml and it should hopefully be
apparent how to use it.

> But if there is a Tokenizer which throws away HTML markup it should be also
> possible to extend it and exclude additional content easily?

If the additional content has nothing to do with HTML, it should be
developed as a separate TokenFilter.  Filters are meant to be chained
together to gain more configuration flexibility.
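
The chaining idea can be sketched in a few lines of Python. This is only
a conceptual analogy, not Lucene's Java TokenFilter API: each
hypothetical filter below takes a stream of tokens and yields a new
stream, so filters can be stacked in any order.

```python
def whitespace_tokenizer(text):
    """Split raw text into tokens on whitespace."""
    yield from text.split()


def lowercase_filter(tokens):
    """Normalise every token to lower case."""
    for token in tokens:
        yield token.lower()


def exclude_filter(tokens, excluded=frozenset({"the", "a", "of"})):
    """Stand-in for 'additional content' that should not be indexed."""
    for token in tokens:
        if token not in excluded:
            yield token


def analyze(text, filters):
    """Run the tokenizer, then apply each filter in the given order."""
    stream = whitespace_tokenizer(text)
    for token_filter in filters:
        stream = token_filter(stream)
    return list(stream)
```

Because each stage only sees the output of the previous one, order
matters: here the exclusion list is lower case, so it has to run after
lowercase_filter.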

-Yonik