SessionIDs and forums are killing my fetch

Jon Shoberg
I'm getting a ton of duplicate content from a forum with session IDs.
It's a phpBB forum, which puts a sid parameter after the question mark
in the URL.

What have other people done to crawl forums and minimize duplicates?
These are ones that dedup is not catching.

Can anyone explain how regex-normalize.xml is used? I'm about to open
the source and see...

URLs like these appear to serve the same content to the user:

http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592
http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611
http://domain/forum/login.php?redirect=profile.php&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea

Below is my regex normalize file:

<?xml version="1.0"?>
<!-- This is the configuration file for the RegexUrlNormalize Class.
      This is intended so that users can specify substitutions to be
      done on URLs. The regex engine that is used is Perl5 compatible.
      The rules are applied to URLs in the order they occur in this
      file.  -->

<!-- WATCH OUT: an xml parser reads this file, so ampersands must be
      expanded to &amp; -->

<!-- The following rules show how to strip out session IDs
      that are 32 characters long and have the parameter
      name of PHPSESSID or sid. Order does matter!  -->
<regex-normalize>
<regex>
   <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}$</pattern>
   <substitution></substitution>
</regex>
<regex>
   <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
   <substitution>$1$3</substitution>
</regex>
<regex>
   <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}$</pattern>
   <substitution></substitution>
</regex>
<regex>
   <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
   <substitution>$1$3</substitution>
</regex>
</regex-normalize>
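
For reference, here is a minimal sketch of what the two sid rules do to the sample URLs. It assumes java.util.regex stands in for the Perl5-compatible engine the config comment mentions, and it writes the patterns as they read after XML unescaping (the backslash before & is dropped because it is not needed). The class and method names are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

final class SidNormalizeSketch {
    // Third rule from the file, after XML unescaping: sid is the last parameter.
    static final Pattern SID_AT_END =
        Pattern.compile("(\\?|&|&amp;)sid=[a-zA-Z0-9]{32}$");

    // Fourth rule: sid is followed by further parameters.
    static final Pattern SID_BEFORE_OTHERS =
        Pattern.compile("(\\?|&|&amp;)sid=[a-zA-Z0-9]{32}(&|&amp;)(.*)");

    // Applies the two rules in file order, as the config comment describes.
    static String normalize(String url) {
        String out = SID_AT_END.matcher(url).replaceAll("");
        return SID_BEFORE_OTHERS.matcher(out).replaceAll("$1$3");
    }

    public static void main(String[] args) {
        // Both faq.php URLs from the post collapse to the same string.
        System.out.println(normalize(
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592"));
        System.out.println(normalize(
            "http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611"));
        // The trailing &sid=... is removed and the other parameters are kept.
        System.out.println(normalize(
            "http://domain/forum/login.php?redirect=profile.php&mode=editprofile"
            + "&sid=aa4d00c2717784c0c8f5a8182ce772ea"));
    }
}
```

Under these assumptions the patterns themselves handle all three sample URLs, so if duplicates persist, one thing worth checking is whether the normalizer is actually being applied to these URLs.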

Re: SessionIDs and forums are killing my fetch

Gal Nitzan

Hi Jon,

I'm not sure the normalize file is the right place for this. I use
regex-urlfilter.txt with the following rule:

-(session|Session|SESS|sid)

I know it will also exclude a URL like obsession.url, but that is better
than your fetcher running in circles :-)

Hope it helps,

Gal
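
To illustrate the trade-off described above, here is a rough sketch that assumes an exclusion rule rejects a URL when its pattern matches anywhere in the URL. The NARROW pattern is a hypothetical tighter variant that is not from this thread, and the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class SessionFilterSketch {
    // The exclusion pattern from the post above.
    static final Pattern BROAD = Pattern.compile("(session|Session|SESS|sid)");

    // Hypothetical narrower variant: only a query parameter named sid or PHPSESSID.
    static final Pattern NARROW = Pattern.compile("[?&](sid|PHPSESSID)=");

    // Assumption: a '-' rule rejects a URL when the pattern is found anywhere in it.
    static boolean rejects(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String forumUrl = "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592";
        String harmless = "http://domain/obsession.url";

        System.out.println(rejects(BROAD, forumUrl));   // true
        System.out.println(rejects(BROAD, harmless));   // true, "obsession" contains "session"
        System.out.println(rejects(NARROW, forumUrl));  // true
        System.out.println(rejects(NARROW, harmless));  // false
    }
}
```

Either pattern still drops the forum pages rather than crawling them once, so this only narrows the false positives.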

Re: SessionIDs and forums are killing my fetch

Jon Shoberg

Yes, that is better than running in circles, but I'm looking to refine the
config so that these URLs can be crawled, not just avoided.

-j

Re: SessionIDs and forums are killing my fetch

Jack.Tang
Hi Jon

I think you can revise the URL by discarding the "sid" parameter before
putting it into the fetchlist.

Regards
/Jack

On 9/28/05, Jon Shoberg <[hidden email]> wrote:

> Yes, that is better than running in circles, but I'm looking to refine the
> config so that these URLs can be crawled, not just avoided.
>
> -j


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: SessionIDs and forums are killing my fetch

Gal Nitzan
Hi Jack,

How can you discard a URL from the fetchlist?

Regards,
Gal


Re: SessionIDs and forums are killing my fetch

Jack.Tang
Hi Jon

Please see the getOutlinks() method in the DOMContentUtils class of the
parse-html plugin.

You can revise the URLs before this call:

outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));

Hope it helps

Regards
/Jack
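
A minimal sketch of the kind of revision suggested above: a standalone helper that drops the sid query parameter from a URL string, which could be applied to url.toString() before the outlinks.add(...) call. The class and method names are hypothetical, it does not use any Nutch API, and it deliberately ignores URL fragments and &amp;-encoded separators.

```java
import java.util.ArrayList;
import java.util.List;

final class SidStripSketch {
    // Returns the URL with every query parameter named "sid" removed.
    // Assumption: parameters are separated by a plain '&' and there is no fragment.
    static String stripSid(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to strip
        }
        String base = url.substring(0, q);
        List<String> kept = new ArrayList<>();
        for (String param : url.substring(q + 1).split("&")) {
            boolean isSid = param.equals("sid") || param.startsWith("sid=");
            if (!param.isEmpty() && !isSid) {
                kept.add(param);
            }
        }
        // Drop the '?' as well when sid was the only parameter.
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }

    public static void main(String[] args) {
        // prints http://domain/forum/faq.php
        System.out.println(stripSid(
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592"));
        // prints http://domain/forum/login.php?redirect=profile.php&mode=editprofile
        System.out.println(stripSid(
            "http://domain/forum/login.php?redirect=profile.php&mode=editprofile"
            + "&sid=aa4d00c2717784c0c8f5a8182ce772ea"));
    }
}
```

Unlike the 32-character regex rules, this version removes sid whatever its length, which may or may not be what you want for parameters that merely happen to be called sid.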

On 9/28/05, Gal Nitzan <[hidden email]> wrote:

> Hi Jack,
>
> How can you discard a URL from the fetchlist?
>
> Regards,
> Gal

regex-normalize - Re: SessionIDs and forums are killing my fetch

Jon Shoberg
I thought this could be done via regex-normalize? My preference is to
use the features of the configuration rather than maintaining a local
patch.

-j


search with ndfs/mapred index

Gal Nitzan
Hi,

I have successfully run mapred.

How do I set the servlet to search the index which is under NDFS?

Thanks,

Gal

Re: search with ndfs/mapred index

Gal Nitzan

OK, I figured out the bin/nutch server part, and the server is now
running.

I have created a file, /mapred/search-servers.txt, which contains the line:

localhost:8070

I'm not sure that is what should be there.

In WEB-INF/classes/nutch-default.xml I set the value of searcher.dir to
point to /mapred, where the aforementioned file is.

Thanks,

Gal


Re: search with ndfs/mapred index

Gal Nitzan
Please ignore this; I found the information. Thanks.

java.lang.ClassNotFoundException: org.apache.nutch.ipc.RPC$NullInstance

Gal Nitzan
Hi,

While connecting to the search server I get the following exception. Does
anybody have a clue?

050927 205228 10 opening indexes in /user/root/crawl-20050927142856/indexes/indexes
050927 205228 10 opening segments in /user/root/crawl-20050927142856/indexes/segments
050927 205228 10 opening linkdb in /user/root/crawl-20050927142856/indexes/linkdb
050927 205228 12 Server listener on port 8070: starting
050927 205228 13 Server handler on 8070: starting
050927 205228 14 Server handler on 8070: starting
050927 205228 15 Server handler on 8070: starting
050927 205228 16 Server handler on 8070: starting
050927 205228 17 Server handler on 8070: starting
050927 205228 18 Server handler on 8070: starting
050927 205228 19 Server handler on 8070: starting
050927 205228 21 Server handler on 8070: starting
050927 205228 22 Server handler on 8070: starting
050927 205228 20 Server handler on 8070: starting
050928 021500 23 Server connection on port 8070 from 127.0.0.1: starting
050928 021500 21 Call: getSegmentNames()
050928 021500 21 Return: [Ljava.lang.String;@10da5eb
050928 021500 23 Server connection on port 8070 from 127.0.0.1 caught: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.ipc.RPC$NullInstance
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.ipc.RPC$NullInstance
        at org.apache.nutch.io.ObjectWritable.readObject(ObjectWritable.java:183)
        at org.apache.nutch.ipc.RPC$Invocation.readFields(RPC.java:88)
        at org.apache.nutch.ipc.Server$Connection.run(Server.java:136)
050928 021500 23 Server connection on port 8070 from 127.0.0.1: exiting
050928 021510 24 Server connection on port 8070 from 127.0.0.1: starting
050928 021510 13 Call: getSegmentNames()
050928 021510 13 Return: [Ljava.lang.String;@10da5eb

Gal

pattern matching and boolean searches

Edward Quick
Hi,

I posted this question the other day but didn't get a reply, which may have
been because it was an annoying FAQ, or because the subject wasn't catchy
enough! Anyway, one more try, so here goes! Please help if you can.

Should I be able to do Lucene-type searches with Nutch? I know Nutch can now
do type: and url: queries, but how about pattern-matching queries? For
example:

te*t
tes?t

Or Boolean searches? I haven't got it to work so far, but wondered whether
there was something I needed to enable. Incidentally, yes, I did enable the
index-more and query-more plugins.

Thanks for any help.

Ed.



Re: java.lang.ClassNotFoundException: org.apache.nutch.ipc.RPC$NullInstance - IGNORE sorry

Gal Nitzan

Re: pattern matching and boolean searches

Robert Benea
I think you can build your own plugin and do whatever type of search you
want (Lucene style). I added a query plugin myself to handle my needs ;-).

Cheers,
R.

On 9/28/05, Edward Quick <[hidden email]> wrote:

> Hi,
>
> I posted this question the other day but didn't get a reply, which may have
> been because it was an annoying FAQ, or because the subject wasn't catchy
> enough! Anyway, one more try, so here goes! Please help if you can.
>
> Should I be able to do Lucene-type searches with Nutch? I know Nutch can now
> do type: and url: queries, but how about pattern-matching queries? For
> example:
>
> te*t
> tes?t
>
> Or Boolean searches? I haven't got it to work so far, but wondered whether
> there was something I needed to enable. Incidentally, yes, I did enable the
> index-more and query-more plugins.
>
> Thanks for any help.
>
> Ed.