Format of Wikipedia Index


Armins Stepanjans
Hi,

I have a question about the format of the index that DocMaker creates
from EnWikiContentSource.

After creating the index from the dump of all Wikipedia articles
(https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2),
I'm having trouble understanding the format of the documents it contains:
when I get a document from the index, its only field is docid.
Does this indicate that indexing went wrong? If not, how should I use
the index to search for occurrences of a term within an article?
(I was imagining a boolean query, with one sub-query matching the
article's name and the other matching the term I'm searching for within
the article.)

Regards,
Armīns
Re: Format of Wikipedia Index

wmartinusa
From the Javadoc for DocMaker:


  * *doc.stored* - specifies whether fields should be stored (default
    *false*).
  * *doc.body.stored* - specifies whether the body field should be
    stored (default = *doc.stored*).

So out of the box you won't get the content stored. Does this help?
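
As a rough sketch, assuming the index is built from a benchmark algorithm (.alg) file, the two properties quoted above could be set as follows. The values are only illustrative, and storing the body of every article will make the index larger. In Lucene generally, a field that is indexed but not stored can still be searched; it just isn't returned when the document is fetched, which would be consistent with seeing only docid.

```
# Illustrative values only; both properties are the ones described in the DocMaker Javadoc.
# Store field values so they are returned with each retrieved document (default: false).
doc.stored=true
# Store the article body as well (defaults to the value of doc.stored).
doc.body.stored=true
```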

regards
-will


On 1/22/2018 10:27 PM, Armins Stepanjans wrote:

> [...] when I get a document from the Index,
> its only field is docid.