Call for Microsoft OneNote experts for help on OneNote parsing in Tika

Nicholas DiPiazza
Dear Tika Devs:

I am working on a OneNote parser for Tika, and I'm at the point where I
need some help with the inner workings of OneNote documents.

Here is the project so far:

https://github.com/nddipiazza/onenote-parser-java

Basically I just need some help understanding some of the finer details of
the OneNote format and how to extract info from it.

https://stackoverflow.com/questions/59008205/onenote-parsing-how-to-get-to-the-text-blobs-in-the-document
https://stackoverflow.com/questions/59020176/onenote-not-able-to-find-all-the-property-ids-in-the-microsoft-documentation

If anyone has a moment, could you please drop in, take a peek at the
source, and see if you can answer my questions?

-Nicholas

Re: Call for Microsoft OneNote experts for help on OneNote parsing in Tika

Nick Burch-2
On Sun, 24 Nov 2019, Nicholas DiPiazza wrote:
> Basically I just need some help understanding some of the finer details of
> the OneNote format and how to extract info from it.
>
> https://stackoverflow.com/questions/59008205/onenote-parsing-how-to-get-to-the-text-blobs-in-the-document
> https://stackoverflow.com/questions/59020176/onenote-not-able-to-find-all-the-property-ids-in-the-microsoft-documentation

If you're having issues with implementing bits described in the specs, you
might find it best to ping the Apache POI dev list for help. Most of the
Apache people who've worked with the Microsoft binary file formats are
there!

If you're finding gaps in the published Microsoft specifications, the best
option is to contact the Microsoft docs team. They're really nice people!
And they want to help! They can't always help, because some bits of the
file formats are, for complicated reasons, not covered by the open
specifications, but often they can.

For the case where properties are found "in the wild" but missing from the
documentation, it's probably worth just dropping the Microsoft docs team
an email:
<https://docs.microsoft.com/en-us/openspecs/dev_center/ms-devcentlp/a7729059-1a2f-4698-a995-c0c011df2580>
Link to the page of the docs you're following, give them the list of IDs
you've found, and ask if it is expected that those IDs are missing. Based
on past experience, they'll take a few days to find someone on the
relevant team, and either come back with a "whoops, our bad, will be fixed
in the next 1-2 releases of the docs" or a "sorry, deliberately excluded
for now".

Nick