I'm just going to throw this out there...

raycrawford
And it may get me banned, but so be it.

I've been trying to get a Nutch/Solr setup running and, after many hours of
cruising StackOverflow, this list and many documentation sites which talked
about various versions, I've got nothing to show for it.

Why is this so complex and why is a reasonable set of documentation about
how to integrate the solutions so hard to find?

Can anyone point me to an ACCURATE Nutch 2.3/Solr tutorial?  If someone
can help me here, I'll write a Chef cookbook that automates the whole
thing.  However, I can't get any of the tutorials I've tried so far to work.

Thanks and hopefully the community will help me (and others) work through
this or absolve me of my apparent ignorance.

- Ray.

Re: I'm just going to throw this out there...

Michael Chen
Hey Ray...

It worked for me a couple of weeks ago with the same setup, so don't worry; there should be a solution.

Check Solr's log file as well as the Nutch Hadoop log. There should be more information there.

I used a Solr URL ending in 8983/solr/nutch; try that.

My guess is that the problem is still with the Solr schema setup; check the logs to make sure. Let me know what you find.
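
A minimal sketch of that log check, in case it helps. It assumes a Nutch local runtime that writes to logs/hadoop.log and a Solr install that writes to server/logs/solr.log; both paths are assumptions and vary by version and install. `scan_log` is a made-up helper name, not part of Nutch or Solr.

```shell
#!/usr/bin/env bash
# scan_log: print the last few ERROR/Exception lines of a log file.
# Hypothetical helper for illustration; not part of Nutch or Solr.
scan_log() {
  local file="$1" max="${2:-20}"
  if [ ! -r "$file" ]; then
    echo "cannot read: $file" >&2
    return 1
  fi
  # -n keeps line numbers so a hit can be found again in the full log.
  # "|| true" keeps a log with no matches from counting as a failure.
  { grep -n -E 'ERROR|Exception|Caused by' "$file" || true; } | tail -n "$max"
}

# Assumed default locations; adjust to the actual install.
NUTCH_LOG="${NUTCH_LOG:-/opt/nutch/runtime/local/logs/hadoop.log}"
SOLR_LOG="${SOLR_LOG:-/opt/solr/server/logs/solr.log}"

# Only scan files that exist, so the sketch is safe to run anywhere.
for f in "$NUTCH_LOG" "$SOLR_LOG"; do
  if [ -r "$f" ]; then
    echo "== $f"
    scan_log "$f"
  fi
done
```

Whatever turns up in the Solr log around the time of the failed job is usually the most useful part to post.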

Best,
Michael


Re: I'm just going to throw this out there...

raycrawford
Do you have any good examples of what the schema should be?  I'm using:
https://pastebin.com/Rtnr1CVU

This time, the parser and indexer seemed to succeed, but the MapReduce job
failed.  See: https://pastebin.com/1zgF8hcG

Thanks!


Re: I'm just going to throw this out there...

Michael Chen
I couldn't see anything wrong with the schema at first glance. The Nutch log may not be as informative as the Solr one here. Could you check the Solr log in the deployment folder?

Thanks,
Michael


Re: I'm just going to throw this out there...

Sebastian Nagel
Hi Ray,

Which version of Solr are you using?

As Michael said, it's probably a schema incompatibility.
Nutch 2.3 is built for Solr 4.6.0, and yes, with a different version there may be issues.

See also
  https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2400
for a port of Nutch 1.x to a recent Solr version.
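
A quick way to answer the version question is Solr's system info endpoint. The sketch below assumes Solr listens on 127.0.0.1:8983 and that `/solr/admin/info/system` is available (older 4.x installs may expose the handler per core instead). `solr_version_from_json`, `same_minor`, and `CHECK_SOLR` are made-up names for illustration.

```shell
#!/usr/bin/env bash
# Extract the "solr-spec-version" value from the system info JSON on stdin.
solr_version_from_json() {
  sed -n 's/.*"solr-spec-version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed when two versions share the same major.minor (e.g. 4.6.0 and 4.6.1).
same_minor() {
  [ "$(echo "$1" | cut -d. -f1,2)" = "$(echo "$2" | cut -d. -f1,2)" ]
}

EXPECTED="4.6.0"   # the Solr version Nutch 2.3 is built for, per this thread
SOLR_BASE="${SOLR_BASE:-http://127.0.0.1:8983/solr}"   # assumed base URL

# Only contacts Solr when asked to, so the sketch can be sourced safely.
if [ "${CHECK_SOLR:-0}" = "1" ]; then
  found="$(curl -s "$SOLR_BASE/admin/info/system?wt=json" | solr_version_from_json)"
  if same_minor "$EXPECTED" "$found"; then
    echo "Solr $found is on the same line as $EXPECTED"
  else
    echo "Solr '$found' differs from $EXPECTED; schema or client issues are likely"
  fi
fi
```

Run it with CHECK_SOLR=1 against a live instance; the comparison only looks at major.minor, which is a deliberate simplification.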

Best,
Sebastian


Re: I'm just going to throw this out there...

lewis john mcgibbney-2
In reply to this post by raycrawford
Hi Ray,
Apart from not being able to find a tutorial, what is wrong exactly?
New users of Nutch are advised to use the Nutch 1.X series.
The Nutch 2.X tutorial introduces more moving parts. This has been well
documented on this mailing list for a number of years now.
If you can enumerate what is wrong, we will help you out.
Thanks
Lewis


--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney

Re: I'm just going to throw this out there...

Alejandro Caceres
hey Lewis,

I think he's just trying to say that your documentation sucks :D. Glad I
could clarify.

Alex


--
___

Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO

Re: I'm just going to throw this out there...

Sebastian Nagel
Hi Alex,

I would like to point out that it's *your* documentation as well,
as you're part of the community if you follow this list.

If I had the time to rewrite the tutorials and documentation
(and no open issues on Jira), no question, I probably would
work on it. If you have spare time, you're invited to improve
the documentation in any way you can. Just ask for access to
the Nutch wiki.

Thanks,
Sebastian


Re: I'm just going to throw this out there...

Alejandro Caceres
Hey Sebastian,

I was just giving Lewis s*** because I know him personally :P. I'm aware
this is an open source project and we're all in this together! No one likes
writing docs..... I should probably be working on my own docs right now.

Alex


--
___

Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO

Re: I'm just going to throw this out there...

Sebastian Nagel
Hi Alex,

no problem. Let's be productive and work!

Best,
Sebastian



Re: I'm just going to throw this out there...

raycrawford
The documentation is a little bit tough... :)

Really, I couldn't find a clear path for the novice from point A to point
B.  Because of this, I'm hoping this Chef Cookbook can be the tool.

Here's what I have so far:
https://github.com/raycrawford/cb_rayCrawford_nutch2

Two problems.  First, when I run the following, documents get into Solr, but
the run ends with the error shown in the output below:
cd /opt/nutch/runtime/local/bin
export JAVA_HOME='/etc/alternatives/jre_1.8.0'
/opt/hbase/bin/start-hbase.sh
mkdir urls
echo "http://www.bidfta.com/" > /opt/nutch/runtime/local/bin/urls/seed.txt
/opt/nutch/runtime/local/bin/nutch inject urls/seed.txt
/opt/nutch/runtime/local/bin/crawl ./urls nutch http://127.0.0.1:8983/solr/nutch 3


DbUpdaterJob: finished at 2017-08-16 05:01:46, time elapsed: 00:00:05
Indexing nutch on SOLR index -> http://127.0.0.1:8983/solr/nutch
/opt/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://127.0.0.1:8983/solr/nutch -all -crawlId nutch
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

IndexingJob: done.
SOLR dedup -> http://127.0.0.1:8983/solr/nutch
/opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
Exception in thread "main" java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local491881398_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:383)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:393)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:403)
Error running:
  /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
Failed with exit value 1.
---
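
For reference, here is a rough sketch of the inject and crawl steps above as one script that stops at the first failing step, roughly what the cookbook would wrap. The paths, crawl ID, Solr URL, and round count are the ones from my commands; `DRY_RUN`, `run`, and `crawl_steps` are names made up for this sketch, and starting HBase and creating the seed file are left out.

```shell
#!/usr/bin/env bash
# Sketch: the inject and crawl steps from this message as one sequence.
# DRY_RUN and run() are illustrative additions, not Nutch features.
NUTCH_BIN="${NUTCH_BIN:-/opt/nutch/runtime/local/bin}"
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr/nutch}"
CRAWL_ID="${CRAWL_ID:-nutch}"
ROUNDS="${ROUNDS:-3}"
DRY_RUN="${DRY_RUN:-1}"   # default to printing the commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# "&&" stops the sequence as soon as one step fails.
crawl_steps() {
  run "$NUTCH_BIN/nutch" inject urls/seed.txt &&
  run "$NUTCH_BIN/crawl" ./urls "$CRAWL_ID" "$SOLR_URL" "$ROUNDS"
}

crawl_steps
```

With DRY_RUN=0 it actually runs the two commands from the current directory, so it expects the urls directory to be there.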

Second, the site I'm indexing is essentially 3 layers deep.  The first one
has a field on it '<p class="auctionLocation">'. All other children of that
page relate to the following link, but do not have that data on them. What
I would like to do is capture the <p class="auctionLocation"> data and
relate it to all children of that block. I altered the managed schema to
include '<field name="auctionLocation" type="strings"/>', but it doesn't
seem to be adding that to the index.  Also, I don't know how to add that to
the child pages.

My question has two parts.  I realize the first part is a
Nutch 2/Solr integration issue and the second is a Solr issue, but hopefully
y'all can help me figure this out...

Thanks!


Re: I'm just going to throw this out there...

Michael Chen
Hi Ray,

Haha, the documentation :) Let's hope that it'll get better or we'll all
need superhuman problem-solving abilities. But perhaps you're on a
better path by making a cookbook and contributing as you go...

Anyway, I happen to be working on it right now, so I can help you troubleshoot
some stuff. As I said earlier, you need to go to the Solr logs, which you can
get either from the Solr directory directly or from the webapp logs.
They will tell you if there's a schema mismatch or something else. Post
the log and we can all take a look.

As to your second question, I think I had a similar problem and we're
both in luck because jsoup-extractor just came out. It can parse HTML
with CSS selectors and I think there should be a way to mark the indexed
metadata as outlinks to include in the next round of search.
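
On the Solr side, it may also be worth confirming that the running core actually picked up the auctionLocation field. Here is a small sketch using the Schema API, assuming a core named nutch at 127.0.0.1:8983 as in your commands and a Solr version recent enough to have a managed schema. `field_url` and `CHECK_SOLR` are made-up names for illustration.

```shell
#!/usr/bin/env bash
# Build the Schema API URL that describes one field of one core.
field_url() {
  local base="$1" core="$2" field="$3"
  echo "$base/$core/schema/fields/$field"
}

SOLR_BASE="${SOLR_BASE:-http://127.0.0.1:8983/solr}"   # assumed base URL
URL="$(field_url "$SOLR_BASE" nutch auctionLocation)"
echo "$URL"

# Only contacts Solr when asked to. Anything other than a 200 status
# suggests the running core does not know the field.
if [ "${CHECK_SOLR:-0}" = "1" ]; then
  curl -s -o /dev/null -w '%{http_code}\n' "$URL"
fi
```

Even if the field exists in Solr, Nutch still has to send it, and as far as I know the field mapping in solrindex-mapping.xml (listed in your indexer output) is part of how that is wired up.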

Hope this helps! Let me know if I missed something,

Michael



>> to
>>>>>>> work.
>>>>>>>
>>>>>>> Thanks and hopefully the community will help me (and others) work
>>>> through
>>>>>>> this or absolve me of my apparent ignorance.
>>>>>>>
>>>>>>> - Ray.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> http://home.apache.org/~lewismc/
>>>>>> @hectorMcSpector
>>>>>> http://www.linkedin.com/in/lmcgibbney
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>


Re: I'm just going to throw this out there...

Edward Capriolo
On Wednesday, August 16, 2017, Michael Chen <
[hidden email]> wrote:

> Hi Ray,
>
> Haha the documentations :) Let's hope that it'll get better or we'll all
> need super human problem solving abilities. But perhaps you're on a better
> path by making a cookbook and contributing as you go...
>
> Anyway, I happen to be working on it right now, so I can help you
> troubleshoot some stuff. As I said earlier, you need to go to the Solr logs,
> which you can get either from the Solr directory directly or from the webapp
> logs. They will tell you if there's a schema mismatch or something else.
> Post the log and we can all take a look.
>
> As to your second question, I think I had a similar problem and we're both
> in luck because jsoup-extractor just came out. It can parse HTML with CSS
> selectors and I think there should be a way to mark the indexed metadata as
> outlinks to include in the next round of search.
>
> Hope this helps! let me know if I missed something,
>
> Michael
>
>
>
> On 08/15/2017 10:15 PM, Ray Crawford wrote:
>
>> The documentation is a little bit tough... :)
>>
>> Really, I couldn't find a clear path for the novice from point A to point
>> B.  Because of this, I'm hoping this Chef Cookbook can be the tool.
>>
>> Here's what I have so far:
>> https://github.com/raycrawford/cb_rayCrawford_nutch2
>>
>> Two problems.  When I do the following, stuff gets into Solr, but it
>> results in:
>> cd /opt/nutch/runtime/local/bin
>> export JAVA_HOME='/etc/alternatives/jre_1.8.0'
>> /opt/hbase/bin/start-hbase.sh
>> mkdir urls
>> echo "http://www.bidfta.com/" > /opt/nutch/runtime/local/bin/urls/seed.txt
>> /opt/nutch/runtime/local/bin/nutch inject urls/seed.txt
>> /opt/nutch/runtime/local/bin/crawl ./urls nutch http://127.0.0.1:8983/solr/nutch 3
>>
>>
>> DbUpdaterJob: finished at 2017-08-16 05:01:46, time elapsed: 00:00:05
>>
>> Indexing nutch on SOLR index -> http://127.0.0.1:8983/solr/nutch
>>
>> /opt/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D
>> mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -D solr.server.url=
>> http://127.0.0.1:8983/solr/nutch -all -crawlId nutch
>>
>> IndexingJob: starting
>>
>> Active IndexWriters :
>>
>> SOLRIndexWriter
>>
>> solr.server.url : URL of the SOLR instance (mandatory)
>>
>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>
>> solr.mapping.file : name of the mapping file for fields (default
>> solrindex-mapping.xml)
>>
>> solr.auth : use authentication (default false)
>>
>> solr.auth.username : username for authentication
>>
>> solr.auth.password : password for authentication
>>
>> IndexingJob: done.
>>
>> SOLR dedup -> http://127.0.0.1:8983/solr/nutch
>>
>> /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D
>> mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
>>
>> Exception in thread "main" java.lang.RuntimeException: job failed:
>> name=apache-nutch-2.3.1.jar, jobid=job_local491881398_0001
>>
>> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>>
>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:383)
>>
>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:393)
>>
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>
>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:403)
>>
>> Error running:
>>
>>    /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D
>> mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
>>
>> Failed with exit value 1.
>> ---
>>
>> Second, the site I'm indexing is essentially 3 layers deep.  The first one
>> has a field on it '<p class="auctionLocation">'. All other children of that
>> page relate to the following link, but do not have that data on them. What
>> I would like to do is capture the <p class="auctionLocation"> data and
>> relate it to all children of that block. I altered the managed schema to
>> include '<field name="auctionLocation" type="strings"/>', but it doesn't
>> seem to be adding that to the index.  Also, I don't know how to add that to
>> the children pages.
>>
>> What I'm asking here is two parts.  I realize the first part is a
>> nutch2/Solr integration thing and the second is a solr thing, but hopefully
>> y'all can help me figure this out...
>>
>> Thanks!
>
As others have suggested, using Nutch 1.x is the way to go. A problem with
Nutch 2.x is that all the pluggable pieces are version specific.

For example, the Cassandra support uses Gora and a really old version of
Cassandra.

HBase is a similar story: the latest HBase has breaking API changes.

The management server doesn't catch problems well. Sections won't load or
work until you figure out the root cause, and the logging that would catch
the problems seems to be off by default.
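
When chasing a root cause like this, a quick scan of the logs mentioned
earlier in the thread can help. The following is only a rough sketch: the
Nutch log path assumes the /opt/nutch/runtime/local layout used above, the
Solr log path is a guess at a typical install, and show_errors is just a
helper name made up for this example.

```shell
#!/bin/sh
# Print the most recent error-looking lines from a log file.
# Usage: show_errors LOGFILE [MAX_LINES]
show_errors() {
    log_file="$1"
    max_lines="${2:-20}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    # -n prefixes each match with its line number in the log file
    grep -n -E 'ERROR|Exception|Caused by' "$log_file" | tail -n "$max_lines"
}

# Assumed locations, override via the environment to match your install.
NUTCH_LOG="${NUTCH_LOG:-/opt/nutch/runtime/local/logs/hadoop.log}"
SOLR_LOG="${SOLR_LOG:-/opt/solr/server/logs/solr.log}"  # hypothetical Solr path

for f in "$NUTCH_LOG" "$SOLR_LOG"; do
    echo "== $f"
    show_errors "$f" || true
done
```

The line numbers in the output make it easier to open the log at the right
spot and read the full stack trace around the first failure.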




--
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.