[jira] [Commented] (TIKA-3069) Unpack with header X-Tika-PDFextractInlineImages does not extract content from image


Parth (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059582#comment-17059582 ]

Carina Antunes commented on TIKA-3069:
--------------------------------------

Thank you so much for the details! 
{quote} If you're not looking for the literal bytes of the embedded files, /unpack is not for you. Perhaps we could look into compressing /rmeta?
{quote}
That would be great! Please look into it. Until then, /unpack still seems the best option for us.
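
For anyone weighing the two endpoints, here is a minimal sketch of pulling the OCR text for embedded images out of a /rmeta/text response. It assumes the response body is a JSON array with one object per document (the container first, then each embedded resource) and relies on the keys visible in the output quoted in the issue, "X-TIKA:embedded_resource_path" and "X-TIKA:content". The function name and the sample payload are hypothetical and only for illustration.

```python
import json


def embedded_text(rmeta_json):
    """Map each embedded resource path to its extracted text.

    Assumes rmeta_json is a JSON array of metadata objects, as returned
    by /rmeta/text. The container document carries no
    "X-TIKA:embedded_resource_path" key, so it is skipped.
    """
    result = {}
    for doc in json.loads(rmeta_json):
        path = doc.get("X-TIKA:embedded_resource_path")
        if path is None:
            continue  # container document, not an embedded resource
        result[path] = doc.get("X-TIKA:content", "").strip()
    return result


# Hypothetical, heavily shortened stand-in for a real response.
sample = json.dumps([
    {"Content-Type": "application/pdf", "X-TIKA:embedded_depth": "0"},
    {
        "Content-Type": "image/jpeg",
        "X-TIKA:embedded_resource_path": "/image0.jpg",
        "X-TIKA:content": "\n\nLorem Ipsum\n",
    },
])

print(embedded_text(sample))
```

This only reads the extracted text; it does not give access to the literal bytes of the embedded files, which is what /unpack is for.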

> Unpack with header X-Tika-PDFextractInlineImages does not extract content from image
> ------------------------------------------------------------------------------------
>
>                 Key: TIKA-3069
>                 URL: https://issues.apache.org/jira/browse/TIKA-3069
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.23
>         Environment: Docker image *apache/tika:1.23-full*
>            Reporter: Carina Antunes
>            Priority: Major
>         Attachments: file.pdf, output.zip, parser.json
>
>
> Expected content to be extracted from a PDF with an inline image using Tesseract, i.e. the same behaviour as _/rmeta/text_, but instead no content is extracted.
> Response from */unpack/all*:
> {code:java}
> $ curl -T file.pdf http://localhost:9998/unpack/all --header "X-Tika-PDFextractInlineImages: true" > output.zip    
> __TEXT__
>  [image: image0.jpg]
> __METADATA__
>  "pdf:unmappedUnicodeCharsPerPage","0"
>  "pdf:PDFVersion","1.4"
>  "X-Parsed-By","org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"
>  "pdf:hasXFA","false"
>  "access_permission:modify_annotations","true"
>  "access_permission:can_print_degraded","true"
>  "access_permission:extract_for_accessibility","true"
>  "access_permission:assemble_document","true"
>  "xmpTPg:NPages","1"
>  "pdf:hasXMP","false"
>  "dc:format","application/pdf; version=1.4"
>  "pdf:charsPerPage","0"
>  "access_permission:extract_content","true"
>  "access_permission:can_print","true"
>  "access_permission:fill_in_form","true"
>  "pdf:encrypted","false"
>  "access_permission:can_modify","true"
>  "Content-Type","application/pdf"
> {code}
>  
> Expected a response similar to */rmeta/text*:
> {code:java}
> $ curl -T file.pdf http://localhost:9998/rmeta/text --header "X-Tika-PDFextractInlineImages: true"
> {
>   "Content-Type": "application/pdf",
>   "X-Parsed-By": [
>     "org.apache.tika.parser.DefaultParser",
>     "org.apache.tika.parser.pdf.PDFParser"
>   ],
>   "X-TIKA:embedded_depth": "0",
>   "X-TIKA:parse_time_millis": "4112",
>   "access_permission:assemble_document": "true",
>   "access_permission:can_modify": "true",
>   "access_permission:can_print": "true",
>   "access_permission:can_print_degraded": "true",
>   "access_permission:extract_content": "true",
>   "access_permission:extract_for_accessibility": "true",
>   "access_permission:fill_in_form": "true",
>   "access_permission:modify_annotations": "true",
>   "dc:format": "application/pdf; version\u003d1.4",
>   "pdf:PDFVersion": "1.4",
>   "pdf:charsPerPage": "0",
>   "pdf:encrypted": "false",
>   "pdf:hasXFA": "false",
>   "pdf:hasXMP": "false",
>   "pdf:unmappedUnicodeCharsPerPage": "0",
>   "xmpTPg:NPages": "1"
> },
> {
>   "Component 1": "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
>   "Component 2": "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
>   "Component 3": "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
>   "Compression Type": "Baseline",
>   "Content-Type": "image/jpeg",
>   "Data Precision": "8 bits",
>   "File Modified Date": "Wed Mar 11 19:28:01 +00:00 2020",
>   "File Name": "apache-tika-16610492346701338708.tmp",
>   "File Size": "319936 bytes",
>   "Image Height": "1554 pixels",
>   "Image Width": "1206 pixels",
>   "Number of Components": "3",
>   "Number of Tables": "4 Huffman tables",
>   "X-Parsed-By": [
>     "org.apache.tika.parser.DefaultParser",
>     "org.apache.tika.parser.ocr.TesseractOCRParser",
>     "org.apache.tika.parser.jpeg.JpegParser"
>   ],
>   "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLorem Ipsum\n\n\"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...\"\n\nLorem ipsum dolor sit amet, consectetur\nadipiscing elit. Etiam at posuere mauris.\nInterdum et malesuada fames ac ante ipsum\nprimis in faucibus. Suspendisse potenti. Donec\nut dapibus lectus. Aenean neque mauris,\nconvallis quis eros nec, molestie rhoncus\nlectus. Aliquam dui mauris, sagittis ut posuere\nquis, tempor id tellus. Nunc id varius dolor.\nFusce in elementum enim. Vestibulum\nimperdiet pretium est et rhoncus. Nam in urna\nmauris. Nulla facilisi. Nullam sed sapien libero.\nSed ligula arcu, auctor non nunc sed, viverra\nvehicula sem. Vestibulum orci felis, tristique at\norci id, interdum sodales lectus. Donec sed\nrhoncus massa. Donec laoreet sodales velit at\nfaucibus.\n\nAenean sit amet velit eros. Nam congue\nplacerat eros, vitae mattis turpis ultricies ac.\nPraesent vestibulum, tortor tempor tristique\nsagittis, mi risus semper neque, vel vehicula\ntortor sapien in lorem. Sed sit amet mattis leo.\nPraesent euismod lacinia sapien, nec cursus\ndolor dignissim pharetra. Mauris eleifend\npellentesque erat fermentum tempus. Nulla\ncommodo dolor urna, quis tincidunt diam\nconvallis vel.\n\nAenean ornare imperdiet nibh, sed gravida ante\nsagittis et. Fusce dignissim lectus vitae\nullamcorper malesuada. Donec ultricies ornare\nquam a placerat. Donec euismod nibh vitae\nfacilisis consectetur. Nunc in interdum neque,\nvarius vehicula massa. Ut fermentum lorem id\nante porta mattis. Praesent quis nulla ut lectus\nsodales ultricies. Sed sodales mollis ex, a\nsemper metus faucibus ac. Nulla tempor, ipsum\nvel egestas venenatis, enim est gravida mauris,\na lacinia justo quam eget felis. Maecenas\ncommodo, arcu sit amet aliquam molestie, urna\neros rutrum enim, et blandit nisi magna sit amet\n\nlorem. Suspendisse accumsan nulla vitae\naugue tempus, sed fermentum metus viverra.\nEtiam dapibus tellus eget venenatis rhoncus.\nVivamus eu dolor faucibus, malesuada tellus sit\namet, vulputate orci.\n\nNunc at diam eu nisi sollicitudin varius. Sed a\ntincidunt arcu. Integer vitae fermentum libero,\nac semper justo. Nunc dapibus in magna\ntempus aliquet. Proin interdum lorem eget\nsuscipit ullamcorper. Nulla vitae tincidunt\naugue. Cras turpis elit, dignissim eget metus\nnec, fermentum scelerisque ante. Suspendisse\naliquam tortor in eros rhoncus, eget elementum\nvelit sagittis. Donec et tellus ac dui interdum\nmattis. Duis condimentum quis velit et\ncommodo. Sed congue quam vitae neque\nvolutpat viverra.\n\nProin finibus nunc vel elit iaculis vestibulum.\nNulla et mattis magna. Nunc a ligula leo.\nAliquam bibendum semper tellus at molestie.\nCurabitur pellentesque ullamcorper dolor, at\nfinibus elit iaculis ac. Aliquam vestibulum sit\namet diam sit amet condimentum. Donec\nrhoncus, nisi eu dapibus elementum, tellus ex\nornare dui, nec molestie nulla nulla eget nulla.\nUt sem massa, tristique ac commodo id, rutrum\nat massa. Donec enim velit, luctus ac nisi ac,\nbibendum tempus elit. Proin posuere ex odio,\nsed faucibus elit volutpat in. Suspendisse\nscelerisque mauris nunc, ut tincidunt velit\nvulputate quis. Integer efficitur diam vel urna\ndignissim, a  sodales magna _ eleifend.\nVestibulum malesuada ornare diam, faucibus\nmaximus tellus aliquam et. Sed sed libero\negestas, varius sapien faucibus, interdum\nquam. Nullam a accumsan dui. Vivamus\nscelerisque justo in metus ornare interdum.\n\n",
>   "X-TIKA:content_handler": "ToTextContentHandler",
>   "X-TIKA:embedded_depth": "1",
>   "X-TIKA:embedded_resource_path": "/image0.jpg",
>   "X-TIKA:parse_time_millis": "4023",
>   "embeddedResourceType": "INLINE",
>   "pdf:hasXMP": "false",
>   "resourceName": "image0.jpg",
>   "tiff:BitsPerSample": "8",
>   "tiff:ImageLength": "1554",
>   "tiff:ImageWidth": "1206"
> }
> {code}
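
To check what an /unpack/all archive actually contains, a short script can list its entries and print the text entry. This is a sketch only: it assumes the archive holds entries named `__TEXT__` and `__METADATA__`, as the output quoted in the issue suggests, next to the raw embedded files. The `image0.jpg` entry name, its bytes, and the helper name `read_unpack_zip` are hypothetical; the archive is built in memory here so the example runs without a Tika server.

```python
import io
import zipfile


def read_unpack_zip(data):
    """Return a dict of entry name -> raw bytes for a zip held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


# Build a stand-in archive shaped like the reported output.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("__TEXT__", "\n [image: image0.jpg]\n")
    zf.writestr("__METADATA__", '"pdf:charsPerPage","0"\n')
    zf.writestr("image0.jpg", b"\xff\xd8\xff")  # hypothetical embedded file

entries = read_unpack_zip(buf.getvalue())
text = entries["__TEXT__"].decode("utf-8").strip()

print(sorted(entries))
print(repr(text))
```

With a real archive, replace the in-memory buffer with the bytes of output.zip, for example `open("output.zip", "rb").read()`. In the case reported here the text entry holds only the image placeholder, while the image bytes themselves are present as a separate entry.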



--
This message was sent by Atlassian Jira
(v8.3.4#803005)