search engine for regional bulletins

search engine for regional bulletins

Thorsten Scherler-3
Hi all,

I am developing a search engine for a governmental body. This search
engine has to index pure xml documents which follow a custom xml schema.
The xml documents contain information about laws and official
announcements for Andalusia.

I need to implement different filters for the search. The current search
engine, which can be found here [1], would need to be extended with
filters for the organizational body, the kind of announcement (law,
resolution, ...), and so on.

I played a bit with Nutch 0.8 and asked myself whether it is the best
tool for the task. I got Nutch to index the xml documents and I can
search the index as well, but I would need to add filter conditions for
the search. The alternative I see would be pure Lucene, since I am not
really "crawling" the site: the documents are not linked with each
other, so I put all the files (which have to be indexed) in the
urls/bulletin file. Then Zaheed pointed me to Solr and I have played
around with it a wee bit.
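
If the Solr route is taken, filter conditions like the ones described
above might be expressed as filter queries next to the main query. The
following is only a sketch under assumptions: the index field names
`type` and `organisation` are hypothetical (nothing in this thread
defines a schema), and the helper `build_query` is made up for
illustration. It relies only on Solr's `q` and `fq` request parameters
and builds the query string without contacting a server.

```python
from urllib.parse import urlencode, parse_qs


def build_query(text, filters):
    """Build a URL-encoded query string: one q plus one fq per filter.

    `filters` maps a (hypothetical) index field name to the exact value
    that matching documents must have in that field.
    """
    params = [("q", text)]
    for field, value in filters.items():
        # Quote the value so multi-word names are matched as one phrase,
        # escaping backslashes and double quotes inside it first.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    return urlencode(params)


if __name__ == "__main__":
    query = build_query(
        "avifauna",
        {"type": "Decreto", "organisation": "Consejería de la Presidencia"},
    )
    print(query)
    # Decode it again to show the individual parameters.
    print(parse_qs(query))
```

Each filter becomes its own `fq` parameter, so a search form could add
or drop a restriction (organizational body, kind of announcement)
without touching the user's free-text query.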

To give you a better impression of the underlying architecture and xml
documents: each weekday there is a new bulletin (containing approx. 100
- 200 pages), e.g. [2]. This bulletin is stored on the file system and
needs to be indexed.

We have two different document types: summaries and dispositions. The
summary looks like:
<summary year="2006" number="209" date="27-10-2006" section="1"
  startPage="8" endPage="20">
  <title>1. DISPOSICIONES GENERALES</title>
  <organisation name="Consejería de la Presidencia">
    <disposition bojaYear="2006" bojaNumber="209"
      bojaSection="1" type="Decreto" startPage="8" endPage="10"
      date="10-11-2006" detail="999952" law="178/2006"> Decreto
      178/2006, de 10 de octubre, por el que se establecen normas de
      protección de la avifauna para las instalaciones eléctricas de
      alta tensión</disposition>
  </organisation>
  <organisation name="Consejería de Economia y Hacienda">
    <disposition bojaYear="2006" bojaNumber="209"
      bojaSection="1" type="Resolución" startPage="10"
      endPage="12" date="10-11-2006" detail="999961">
      Resolución de 10 de octubre de 2006, de la Dirección General de
      Tesorería y Deuda Pública, por la que se realiza una
      convocatoria de subasta de carácter ordinario dentro del
      Programa de Emisión de Bonos y Obligaciones de la Junta de
      Andalucía.</disposition>
  </organisation>
</summary>

Following the tutorial and looking at the examples it seems that solr
only supports one document type.

<add><doc>
  <field name="id">3007WFP</field>
  <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
  <!-- ... -->
</doc></add>

The root element add is "just" the command for the server that we want
to add the document. Does that mean I would need to stick with this
doctype and transform our internal format for adding the document
information?

Further since the project is for a customer I would need a released
version when I put my engine in production. When does this community
expect to make its first release, or better asked which are the
blockers?

TIA for any information.

salu2

[1] http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html
[2] http://andaluciajunta.es/portal/boletines/2006/11/aj-bojaVerPagina-2006-11/0,23167,bi%253D693228039889,00.html

Re: search engine for regional bulletins

Bertrand Delacretaz
Hi Thorsten, good to see you here!

On 11/28/06, Thorsten Scherler
<[hidden email]> wrote:

> ...Following the tutorial and looking at the examples it seems that solr
> only supports one document type.
>
> <add><doc>
>   <field name="id">3007WFP</field>
>   <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
>   <!-- ... -->
> </doc></add>...

That's right, to add documents to a Solr index you need to transform
them to this model. You're basically creating fields to be indexed,
and the Solr schema.xml allows you to define precisely how you want
each field to be indexed, including strict data types, pluggable
Lucene analyzers, etc.

This means some work in converting your content model to an "indexing
model", but it's well worth it as it gives you very precise control
over what you index and how.
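
As a rough illustration of such a conversion, the sketch below flattens
a summary document like the one posted above into one `<doc>` per
disposition inside an `<add>` command. It is a minimal sketch, not a
recommended mapping: the target field names (`id`, `type`,
`organisation`, `date`, `text`) and the use of the `detail` attribute as
the unique id are assumptions, and the function `summary_to_solr_add`
is hypothetical. It uses Python's standard library only and does not
talk to a Solr server.

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the custom summary format from the original post.
SAMPLE = """<summary year="2006" number="209" date="27-10-2006" section="1">
  <organisation name="Consejería de la Presidencia">
    <disposition type="Decreto" date="10-11-2006" detail="999952"
      law="178/2006"> Decreto
      178/2006, de 10 de octubre</disposition>
  </organisation>
  <organisation name="Consejería de Economia y Hacienda">
    <disposition type="Resolución" date="10-11-2006" detail="999961">
      Resolución de 10 de octubre de 2006</disposition>
  </organisation>
</summary>"""


def summary_to_solr_add(summary_xml):
    """Return an <add> command with one <doc> per <disposition>."""
    summary = ET.fromstring(summary_xml)
    add = ET.Element("add")
    for organisation in summary.findall("organisation"):
        for disposition in organisation.findall("disposition"):
            doc = ET.SubElement(add, "doc")
            # Assumed field names; a real schema.xml would define these.
            fields = {
                "id": disposition.get("detail"),
                "type": disposition.get("type"),
                "organisation": organisation.get("name"),
                # Copied as-is; a typed date field would need conversion.
                "date": disposition.get("date"),
                # Collapse the line breaks and indentation of the source.
                "text": " ".join((disposition.text or "").split()),
            }
            for name, value in fields.items():
                if value is not None:
                    field = ET.SubElement(doc, "field", name=name)
                    field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(summary_to_solr_add(SAMPLE))
```

Copying the organisation name onto every disposition is what makes each
document self-contained in the flat field model, at the cost of
repeating that value in the index.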

> ...Further since the project is for a customer I would need a released
> version when I put my engine in production. When does this community
> expect to make its first release, or better asked which are the
> blockers?...

I'm relatively new here so I'll let others complete this info, but
IIUC the only work needed to do a first release is to make sure all
source files are "clean" w.r.t. required Apache license notices. I
don't think there are any technical blockers for a release; many of us
are happily using Solr on production sites.

You might want to look at these links for more info:
  http://wiki.apache.org/solr/SolrResources
  http://wiki.apache.org/solr/PublicServers

-Bertrand

Re: search engine for regional bulletins

Thorsten Scherler-3
On Tue, 2006-11-28 at 10:00 +0100, Bertrand Delacretaz wrote:
> Hi Thorsten, good to see you here!

:)

Hi Bertrand, thanks very much for this warm welcome. I am glad to meet
you here as well.

>
> On 11/28/06, Thorsten Scherler
> <[hidden email]> wrote:
>
> > ...Following the tutorial and looking at the examples it seems that solr
> > only supports one document type.
> >
> > <add><doc>
> >   <field name="id">3007WFP</field>
> >   <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
> >   <!-- ... -->
> > </doc></add>...
>
> That's right, to add documents to a Solr index you need to transform
> them to this model. You're basically creating fields to be indexed,
> and the Solr schema.xml allows you to define precisely how you want
> each field to be indexed, including strict data types, pluggable
> Lucene analyzers, etc.
>
> This means some work in converting your content model to an "indexing
> model", but it's well worth it as it gives you very precise control
> over what you index and how.
>

Yeah, I thought about it last night and I came to the same conclusion.
The "extra" work involved is "just" an XSL transformation in my use
case, so not really the biggest part of this project.

> > ...Further since the project is for a customer I would need a released
> > version when I put my engine in production. When does this community
> > expect to make its first release, or better asked which are the
> > blockers?...
>
> I'm relatively new here so I'll let others complete this info, but
> IIUC the only work needed to do a first release is to make sure all
> source files are "clean" w.r.t. required Apache license notices. I
> don't think there are any technical blockers for a release; many of us
> are happily using Solr on production sites.

That is good to hear, so if somebody (e.g. me) would check all files for
cleanness then we could release, right? Perfect.

>
> You might want to look at these links for more info:
>   http://wiki.apache.org/solr/SolrResources
>   http://wiki.apache.org/solr/PublicServers

Thanks very much Bertrand, I will look at this information. I am still
evaluating what is best for this project, but solr sounds very
interesting ATM.

salu2
>
> -Bertrand

Re: search engine for regional bulletins

Yonik Seeley-2
On 11/28/06, Thorsten Scherler
<[hidden email]> wrote:
> That is good to hear, so if somebody (e.g. me) would check all files for
> cleanness then we could release, right? Perfect.

Correct. All IP issues have been cleared, so it's just a matter of
taking the time to put the release into a form that will be accepted
by the incubator. I expect we will be making a release candidate
within a few weeks. Of course the incubator guys always find
problems, so getting an actual release out takes longer.

-Yonik

Re: search engine for regional bulletins

Thorsten Scherler-3
On Tue, 2006-11-28 at 11:30 -0500, Yonik Seeley wrote:

> On 11/28/06, Thorsten Scherler
> <[hidden email]> wrote:
> > That is good to hear, so if somebody (e.g. me) would check all files for
> > cleanness then we could release, right? Perfect.
>
> Correct. All IP issues have been cleared, so it's just a matter of
> taking the time to put the release into a form that will be accepted
> by the incubator. I expect we will be making a release candidate
> within a few weeks. Of course the incubator guys always find
> problems, so getting an actual release out takes longer.
>

Yeah, I have been in the incubator with Lenya and we gained some
valuable experience back then. Further, I see many committers here with
experience in different Apache PMCs, so hopefully we get it right
straight away and the incubator PMC does not find many issues.

I will try to help the best I can.

> -Yonik

Thanks Yonik.

salu2


Re: search engine for regional bulletins

Yonik Seeley-2
On 11/28/06, Thorsten Scherler
<[hidden email]> wrote:

> On Tue, 2006-11-28 at 11:30 -0500, Yonik Seeley wrote:
> > On 11/28/06, Thorsten Scherler
> > <[hidden email]> wrote:
> > > That is good to hear, so if somebody (e.g. me) would check all files for
> > > cleanness then we could release, right? Perfect.
> >
> > Correct. All IP issues have been cleared, so it's just a matter of
> > taking the time to put the release into a form that will be accepted
> > by the incubator. I expect we will be making a release candidate
> > within a few weeks. Of course the incubator guys always find
> > problems, so getting an actual release out takes longer.
> >
>
> Yeah, I have been in the incubator with Lenya and we gained some
> valuable experience back then. Further, I see many committers here with
> experience in different Apache PMCs, so hopefully we get it right
> straight away and the incubator PMC does not find many issues.

The incubator is a different level of scrutiny though... I think they
would find problems with a majority of non-incubating ASF projects
too.

> I will try to help the best I can.

Cool, thanks!

-Yonik