Copy DB by the piece


Copy DB by the piece

Jakob Heidebrecht
Hi,

I'm trying to copy the Nutch database.

It seems to be enough to list all pages by MD5
and get all links of those pages.

I open a reader on the db directory, create a new db directory, and open a
writer for it.

When I copy the whole database in one pass, there isn't enough disk space to
merge the temp file for big databases, although it works for small ones.

So I tried to do it by the piece: close the writer after a number of pages
and reopen it again.
That works for pages, but now links are missing from the new db.
The more pages and links I process in one round, the more links end up in the
new db.

Can somebody help me with this?
Is there a way to avoid it?

Regards,
Jakob


Re: Copy DB by the piece

Massimo Miccoli
Dear Nutch dev,

I want to know whether the boost calculated for pages from the inlink count at
indexing and fetching time is used at search time.
With DistributedSearch it seems that the page boost is not used to calculate
the rank of pages. In my result pages, most pages with a low page boost are
on top and some with a high boost are below.
For example, from explain.jsp:

1) boost = 5.3968873  score for query = 50.692223
2) boost = 5.586193   score for query = 46.90389
3) boost = 6.0371985  score for query = 43.306103
4) boost = 7.388178   score for query = 37.984783
....

So is only the score for the query used to sort (rank) the hits?
I would expect the rank of a hit to be boost * score for query, or
am I wrong?

Thanks,

Massimo
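
As a quick check on the figures quoted in the post, here is a minimal Python sketch (not part of Nutch; the variable names are made up for illustration). It only uses the four boost/score pairs from the explain.jsp listing and compares the listed order with the order that the proposed boost * score product would give.

```python
# (boost, score for query) pairs copied from the explain.jsp listing in the post.
hits = [
    (5.3968873, 50.692223),
    (5.586193, 46.90389),
    (6.0371985, 43.306103),
    (7.388178, 37.984783),
]

scores = [score for _, score in hits]
boosts = [boost for boost, _ in hits]

# Is the listing in descending order of "score for query"?
sorted_by_score = scores == sorted(scores, reverse=True)

# Is the boost ascending down the same listing?
ascending_boost = boosts == sorted(boosts)

# Positions (1-based) in the order the proposed boost * score product would give.
product_order = [
    position
    for position, _ in sorted(
        enumerate(hits, start=1),
        key=lambda item: item[1][0] * item[1][1],
        reverse=True,
    )
]

print(sorted_by_score, ascending_boost, product_order)
```

For these four hits the listing follows the query score, and multiplying by the boost a second time would reorder them. That fits the reading that the displayed score already includes the boost.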

RE: Copy DB by the piece

Chirag Chaman
In reply to this post by Massimo Miccoli
Massimo,

The boost gets multiplied in at search time.

It has already been applied to the "field norms". A good way to confirm
this is to look at a field norm that was originally one (URL or anchor is a
good candidate); it should now be higher. Many of the other fields, like
"content", are way too small to begin with to show any difference.

In short, if you see the boost at the top of the explain page, it's
definitely there in the field norms, and thus being applied.
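
To illustrate why a short field shows the boost clearly while a long field does not, here is a rough Python sketch. It assumes the classic Lucene-style length normalization, boost * 1/sqrt(number of terms); the exact formula used by the Nutch version discussed here may differ per field, and stored norms are also rounded, so treat the numbers as illustrative. The function name and field lengths are hypothetical.

```python
import math


def field_norm(doc_boost: float, num_terms: int, field_boost: float = 1.0) -> float:
    """Index-time norm under an ASSUMED classic formula:
    document boost * field boost * 1/sqrt(number of terms in the field)."""
    return doc_boost * field_boost / math.sqrt(num_terms)


# Hypothetical field lengths: a one-term anchor and a 2,500-term content field.
for name, terms in [("anchor", 1), ("content", 2500)]:
    unboosted = field_norm(1.0, terms)
    boosted = field_norm(5.0, terms)
    print(f"{name}: norm {unboosted:.3f} without boost, {boosted:.3f} with boost 5.0")
```

Under this assumption a one-term field moves from a norm of one to the boost value itself, which is easy to spot on the explain page, while a long content field stays a small number either way.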

CC-

--------------------------------------------
Filangy, Inc.
We're Improving Search!
www.filangy.com


-----Original Message-----
From: Massimo Miccoli [mailto:[hidden email]]
Sent: Tuesday, June 28, 2005 11:21 AM
To: [hidden email]
Subject: Re: [Nutch-dev] Copy DB by the piece


Re: Copy DB by the piece

Massimo Miccoli
In reply to this post by Chirag Chaman
Chirag,

So the boost at the top of explain.jsp is used for sorting the results, i.e.
it is the final value for the rank? If so, the hits on the results pages are
not ordered by boost, because I have hits with a low boost in the first
positions.

Thanks

_______________________________________________
Nutch-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/nutch-developers

RE: Copy DB by the piece

Chirag Chaman
Boosts are multiplied into the "match score" (a.k.a. the tf-idf).

Thus, pages are not sorted by boost, but by the final score.

Here's an example:

You have 3 pages:

- news.google.com
- www.blogspot.com/googguy (blog talking about google)
- www.yahoo.com/google-launches-ship-into-space.html

Let's say the boost factors are 1, 2 and 3 respectively.

Now, you do a search for "google".
Let's take the raw scores to be 50, 20 and 15 for the 3 URLs.

After boosts are applied:

news.google.com - 50 * 1 = 50
www.blogspot.com - 20 * 2 = 40
www.yahoo.com - 15 * 3 = 45

Thus, you'll get the ranking:

news.google.com
www.yahoo.com...
www.blogspot.com...
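
The arithmetic in this example can be reproduced with a few lines of Python. This is only an illustration of "final score = raw score * boost, then sort by final score" using the numbers from the post; it is not Nutch or Lucene code, and the variable names are invented.

```python
# (url, raw match score, boost) taken from the example in the post.
pages = [
    ("news.google.com", 50, 1),
    ("www.blogspot.com/googguy", 20, 2),
    ("www.yahoo.com/google-launches-ship-into-space.html", 15, 3),
]

# Final score is the raw score multiplied by the page boost.
final_scores = [(url, raw * boost) for url, raw, boost in pages]

# Hits are ranked by the final score, highest first, not by the boost alone.
ranked = sorted(final_scores, key=lambda item: item[1], reverse=True)

for url, score in ranked:
    print(score, url)
```

The page with the highest boost (3) ends up second, and the page with the lowest boost (1) stays first because its raw score is much larger.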


HTH!





 

-----Original Message-----
From: Massimo Miccoli [mailto:[hidden email]]
Sent: Tuesday, June 28, 2005 12:51 PM
To: [hidden email]
Subject: Re: [Nutch-dev] Copy DB by the piece
