Indexing Urls pointing to same content

Indexing Urls pointing to same content

mamcxyz
I found that the data I'm searching contains a lot of duplicated content.
The only difference is the URL, e.g. one is
http://localhost/sample.html and the other http://localhost/sample2.html.
However, sample1 and sample2 are different files; that is, no redirection,
linking or anything like that is involved. Sample1 and Sample2
are two different pages copied on different dates but with exactly the same
content.

I think this accounts for something like 20% of the cases, so I think it is
worthwhile to avoid indexing all of it. So I'm thinking of building
link/location + content databases: one holds the list of links/URLs and
the other only the content, so I have a start structure around the content...

But I'm wondering if there is a smart way to do this in the current Lucene
1.4 codebase...

--
Mario Alejandro Montoya
http://sourceforge.net/projects/mutis
MUTIS: The Open source Delphi search engine
AnyNET: Convert from ANY .NET assembly to Delphi code

Re: Indexing Urls pointing to same content

Otis Gospodnetic-2
Mario,

Lucene != web indexer, so Lucene doesn't know anything about files or URLs, etc.  It just indexes what it's told.  You should check how Nutch does it, and I believe it does it by comparing "fingerprints" of web pages.  Fingerprints are MD5 checksums, but I believe the recent changes there allow you to define your own mechanism.

In any case, this is not really a question for java-dev@.  nutch-user@ may be a better place to ask.

Otis
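
The fingerprint idea mentioned above can be illustrated with a minimal sketch. This is not Nutch's code; the class name ContentFingerprint and the method md5Hex are made up for the example, and it assumes the page text is already available as a String. Two pages with identical text produce the same digest regardless of their URL.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Lucene or Nutch.
final class ContentFingerprint {

    /** Returns the MD5 digest of the text as 32 lowercase hex characters. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same content under two different URLs gives the same fingerprint.
        String page1 = md5Hex("exact same content");
        String page2 = md5Hex("exact same content");
        System.out.println(page1.equals(page2));
    }
}
```

A fingerprint like this only catches byte-for-byte identical text; pages that differ by a timestamp or whitespace would need some normalisation first.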

----- Original Message ----
From: Mario Alejandro M. <[hidden email]>
To: Lucene Developers List <[hidden email]>
Sent: Fri 20 Jan 2006 05:27:01 PM EST
Subject: Indexing Urls pointing to same content

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Indexing Urls pointing to same content

mamcxyz
I know Lucene is not a web indexer... maybe I explained this badly.

I'm asking how to STORE the data, not how to locate it. If two files are the
same (using MD5 is my current approach), then I plan to STORE the content once,
but it is necessary to add the two locations.

Example:

c:\file1 Content: One
c:\file2 Content: One

In the index:

Content:One
    Location: C:\File1
    Location: C:\File2

So either I put the locations and the content in separate Lucene indexes or I put
them in the same one, but I don't know which is best; this is the advice
I'm asking for...
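
The "content once, several locations" layout above can be sketched in plain Java, independent of Lucene. The names DedupStore, Entry and fingerprint are hypothetical. Each entry corresponds to what could become a single index document whose location field is added once per location, assuming the Lucene version in use accepts repeated field names on one document.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model: one entry per distinct content.
final class DedupStore {

    /** The content stored once, plus every location that holds it. */
    static final class Entry {
        final String content;
        final List<String> locations = new ArrayList<>();
        Entry(String content) { this.content = content; }
    }

    // Keyed by MD5 fingerprint; insertion order is kept for predictable output.
    private final Map<String, Entry> byFingerprint = new LinkedHashMap<>();

    /** Records a location; content is stored only the first time it is seen. */
    void add(String location, String content) {
        Entry entry = byFingerprint.computeIfAbsent(
                fingerprint(content), key -> new Entry(content));
        entry.locations.add(location);
    }

    Collection<Entry> entries() {
        return byFingerprint.values();
    }

    /** MD5 of the text as 32 hex characters. Equal digests are treated as equal content. */
    static String fingerprint(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DedupStore store = new DedupStore();
        store.add("c:\\file1", "One");
        store.add("c:\\file2", "One");
        for (Entry e : store.entries()) {
            System.out.println("Content:" + e.content + " Locations:" + e.locations);
        }
    }
}
```

Keeping everything in one entry makes a search hit return all locations at once; the trade-off is that adding a location later means rewriting that entry.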

RE: Indexing Urls pointing to same content

Gwyn Carwardine
I'm just a novice but I had to do this recently to store items and attached
files. There is a many-to-many relationship between items and attached
files. If the relationships change I don't want to reindex the
items/attachments.

So I added the item documents (with unique key in ID), I added the
attachment documents (with unique key in ID) and then I added extra
documents to represent the link with the item id stored in the ID field and
the attachment id stored in the ATTID field.

When I have found a matching attachment and I want to look up the items that
contain it, I can find all documents that have an ATTID equal to the
attachment id and then use the list of IDs from the retrieved documents to go
get the actual items, if that's what I want to do.

I couldn't think of a better way so I look forward to seeing the responses
you get!

-Gwyn
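
The link-document scheme described above can be modelled in memory as a rough sketch. This is not Lucene code; LinkIndex and its methods are hypothetical names, and the maps and list stand in for the three kinds of documents. Only the field names ID and ATTID come from the message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of item, attachment and link documents.
final class LinkIndex {

    /** A link document: item id in ID, attachment id in ATTID. */
    static final class Link {
        final String id;
        final String attId;
        Link(String id, String attId) { this.id = id; this.attId = attId; }
    }

    private final Map<String, String> items = new HashMap<>();
    private final Map<String, String> attachments = new HashMap<>();
    private final List<Link> links = new ArrayList<>();

    void addItem(String id, String body) { items.put(id, body); }

    void addAttachment(String attId, String body) { attachments.put(attId, body); }

    void link(String itemId, String attId) { links.add(new Link(itemId, attId)); }

    /** Removes a relationship without touching the item or the attachment. */
    boolean unlink(String itemId, String attId) {
        return links.removeIf(l -> l.id.equals(itemId) && l.attId.equals(attId));
    }

    /** Finds link documents whose ATTID matches and returns their item IDs. */
    List<String> itemIdsFor(String attId) {
        List<String> ids = new ArrayList<>();
        for (Link l : links) {
            if (l.attId.equals(attId)) {
                ids.add(l.id);
            }
        }
        return ids;
    }

    /** Second step: fetch the actual item by its ID. */
    String item(String id) { return items.get(id); }

    public static void main(String[] args) {
        LinkIndex index = new LinkIndex();
        index.addItem("i1", "first item");
        index.addItem("i2", "second item");
        index.addAttachment("a1", "shared attachment");
        index.link("i1", "a1");
        index.link("i2", "a1");
        for (String id : index.itemIdsFor("a1")) {
            System.out.println(id + " -> " + index.item(id));
        }
    }
}
```

Applied to the original question, the attachment plays the role of the shared content and each item plays the role of a location, so a change in which locations point at a piece of content touches only the link records.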

-----Original Message-----
From: Mario Alejandro M. [mailto:[hidden email]]
Sent: 23 January 2006 15:58
To: Otis Gospodnetic
Cc: [hidden email]
Subject: Re: Indexing Urls pointing to same content
