subcollection

subcollection

Edward Quick

Hi,

I'm trying to get subcollections working in Nutch 1.0-dev, and have crawled our intranet with subcollection.xml configured as below. However, when I submit a query to search.jsp, e.g.

subcollection:im database

I don't get any results, whereas the same query without subcollection:im does return results.

Is this configured wrongly? I realise that subcollection.xml doesn't support regular expressions, but I wasn't sure whether I could put in just part of the URL, or had to put in the full stem pattern, e.g. http://planet.somedomain.com/level1/

Thanks,
Ed.

<subcollections>
        <subcollection>
                <name>default</name>
                <id>default</id>
                <whitelist>
                </whitelist>
                <blacklist>
                   planet.somedomain.com/general/aptrix/bani.nsf/Content/Weekly+news
                   /aptprop.nsf/Content/Americas+
                   /aptprop.nsf/Content/AB+CityFlyer+
                   /aptprop.nsf/Content/CityFlyer+
                   /im/barch/
                   /im/dms/
                   /im/tech/
                </blacklist>
        </subcollection>

        <subcollection>
                <name>im</name>
                <id>im</id>
                <whitelist>
                   planet.somedomain.com/general/aptrix/aptim.nsf/
                   planet.somedomain.com/im/barch/
                   planet.somedomain.com/im/dms/
                   planet.somedomain.com/im/tech/
                </whitelist>
                <blacklist />
        </subcollection>

        <subcollection>
                <name>news</name>
                <id>news</id>
                <whitelist>
                   planet.somedomain.com/general/aptrix/bani.nsf/Content/Weekly+news
                </whitelist>
                <blacklist />
        </subcollection>

</subcollections>
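
The question above turns on how whitelist and blacklist entries are compared with a URL. A minimal sketch of the model the question assumes: each non-blank line is a plain substring (no regular expressions), and a URL belongs to a subcollection when it contains at least one whitelist entry and no blacklist entry. This is an illustration under that assumption, not the plugin's actual code; the class name, method names and the index.html URLs are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SubcollectionMatchSketch {

    // True if the URL contains any non-blank pattern as a plain substring.
    public static boolean containsAny(String url, List<String> patterns) {
        for (String pattern : patterns) {
            String trimmed = pattern.trim();
            if (!trimmed.isEmpty() && url.contains(trimmed)) {
                return true;
            }
        }
        return false;
    }

    // Assumed rule: at least one whitelist hit and no blacklist hit.
    public static boolean belongsTo(String url, List<String> whitelist, List<String> blacklist) {
        return containsAny(url, whitelist) && !containsAny(url, blacklist);
    }

    public static void main(String[] args) {
        // Whitelist of the "im" subcollection from the config above.
        List<String> imWhitelist = Arrays.asList(
                "planet.somedomain.com/general/aptrix/aptim.nsf/",
                "planet.somedomain.com/im/barch/",
                "planet.somedomain.com/im/dms/",
                "planet.somedomain.com/im/tech/");
        List<String> noBlacklist = Collections.emptyList();

        // Hypothetical URLs, used only to exercise the sketch.
        System.out.println(belongsTo(
                "http://planet.somedomain.com/im/dms/index.html", imWhitelist, noBlacklist));
        System.out.println(belongsTo(
                "http://planet.somedomain.com/hr/index.html", imWhitelist, noBlacklist));
    }
}
```

Under this model a partial URL such as /im/dms/ would match as well as the full stem, and an empty whitelist (as in the "default" subcollection) would match nothing. Whether the plugin behaves this way should be checked against its source.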


RE: subcollection

Edward Quick



> From: [hidden email]
> To: [hidden email]
> Subject: subcollection
> Date: Tue, 30 Sep 2008 08:55:35 +0000
>
>
> Hi,
>
> I'm trying to get subcollections working in nutch 1.0-dev, and have crawled our intranet with the subcollection.xml configured as below. However when I submit a query to search.jsp eg,
>
> subcollection:im database

Duh! I realised what I was doing wrong. I was literally typing
subcollection: in the query instead of my subcollection name, e.g.
I should have typed im:database


RE: subcollection

Edward Quick



> From: [hidden email]
> To: [hidden email]
> Subject: RE: subcollection
> Date: Tue, 30 Sep 2008 13:13:39 +0000
>
> Duh! I realised what I was doing wrong. I was literally typing
> subcollection: instead of my subcollection name  in the query eg,
> should have typed im:database

Double duh! I wish I could delete the last comment :-( I think I was right the first time, actually. Still stuck!


RE: subcollection

Edward Quick
In reply to this post by Edward Quick


I have enabled logging and get the following error from the subcollection plugin. Any ideas on what I can do to fix this?

Thanks,

Ed.

2008-10-02 09:37:49,401 INFO  collection.CollectionManager - Instantiating CollectionManager
2008-10-02 09:37:49,401 INFO  collection.CollectionManager - initializing CollectionManager
2008-10-02 09:37:49,405 WARN  collection.CollectionManager - Error occured:java.lang.ClassCastException: org.apache.xerces.dom.DeferredCommentImpl
2008-10-02 09:37:49,405 WARN  collection.CollectionManager - java.lang.ClassCastException: org.apache.xerces.dom.DeferredCommentImpl
2008-10-02 09:37:49,405 WARN  collection.CollectionManager - at org.apache.nutch.util.DomUtil.getDom(DomUtil.java:63)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.parse(CollectionManager.java:85)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:75)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.<init>(CollectionManager.java:56)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.getCollectionManager(CollectionManager.java:115)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.addSubCollectionField(SubcollectionIndexingFilter.java:66)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.filter(SubcollectionIndexingFilter.java:72)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
2008-10-02 09:37:49,406 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)
2008-10-02 09:37:49,407 DEBUG collection.CollectionManager - subcollections:
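
One possible reading of the trace above, offered as an assumption and not a confirmed diagnosis: a ClassCastException naming DeferredCommentImpl is what you would see if code takes the first child node of the parsed document and casts it to Element while the file starts with an XML comment (for example a licence header) ahead of the root element. The sketch below reproduces that situation with the JDK's built-in DOM parser only. It does not use or reproduce Nutch's DomUtil, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class LeadingCommentSketch {

    // Parse an XML string with the default JDK DOM parser.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // False when something other than the root element (such as a comment)
    // comes first; casting that node to Element would then fail.
    public static boolean firstChildIsElement(String xml) {
        Node first = parse(xml).getChildNodes().item(0);
        return first instanceof Element;
    }

    // getDocumentElement() returns the root element regardless of
    // comments that precede it.
    public static String rootName(String xml) {
        return parse(xml).getDocumentElement().getNodeName();
    }

    public static void main(String[] args) {
        String xml = "<!-- licence header -->\n<subcollections></subcollections>";
        System.out.println(firstChildIsElement(xml));
        System.out.println(rootName(xml));
    }
}
```

If this reading applies, removing any comment that precedes the root element in the configuration file would be a cheap thing to try. The config as posted in this thread contains no comments, so the file on disk may differ from it.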
