Microsoft releases specifications for binary formats (microsoft.com)
31 points by mqt on Feb 18, 2008 | 18 comments


"Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in these materials. Except as expressly provided in the Microsoft Open Specification Promise and this notice, the furnishing of these materials does not give you any license to these patents, trademarks, copyrights, or other intellectual property."

It's better than nothing but it's still a dangerous format to use.


If you look at the Microsoft Open Specification Promise it says:

"Microsoft irrevocably promises not to assert any Microsoft Necessary Claims against you for making, using, selling, offering for sale, importing or distributing any implementation to the extent it conforms to a Covered Specification (“Covered Implementation”), subject to the following..." [Doesn't cover non-MSoft patents. Not surprising.]

Office Binary formats are under Covered Specifications.

I wouldn't use the format, but I'm not sure it's "dangerous" to write apps that read or generate the format.


Awesome. I'm a junior trader at a major investment bank and I have to deal with Excel all the time... it would be great to be able to programmatically generate well-formed Excel files without having to deal with VBA, COM automation, or anything else. Now someone just has to write a nice Haskell library...


VB.NET (which now has static metaprogramming and closures) + COM automation isn't too bad in my experience. You could also use Python for COM.

(I've found that when you set the application to be visible and each command is carried out visually, it's very impressive to non-programmers, especially ones who don't use macros.)

Other than that, http://poi.apache.org/ and a marginal language that targets the JVM is probably your best bet for now.


POI has worked well for our projects where we have to export customer data to XLS (a commonly requested feature that is almost like a checklist item).

I find the file system within a file (OLE2 compound document) fascinating. I wonder who at Microsoft came up with that idea (or was it really an idea by technical committee?).
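
For anyone curious what the "file system within a file" looks like on disk, here is a minimal Python sketch that reads a few fixed fields from the start of a compound file. It assumes the header layout as I understand it: an 8-byte signature, a 16-byte CLSID, then little-endian 16-bit fields for minor version, major version, byte order, sector shift, and mini sector shift. The names `parse_cfb_header` and `make_fake_header` are made up for illustration, and this only peeks at the header; it does not walk the sector allocation table or the directory.

```python
import struct

# Signature that opens every OLE2 compound file.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(data: bytes) -> dict:
    """Parse a few fixed header fields (hypothetical helper, header only)."""
    if len(data) < 34:
        raise ValueError("need at least 34 bytes of header")
    if data[:8] != CFB_MAGIC:
        raise ValueError("not an OLE2 compound file")
    # Bytes 8..23 hold a CLSID (skipped). The next five fields are
    # little-endian unsigned 16-bit integers starting at offset 24.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        # Sizes are stored as powers of two.
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
    }


def make_fake_header() -> bytes:
    """Build a synthetic version-3 style header for demonstration only."""
    return CFB_MAGIC + bytes(16) + struct.pack("<5H", 0x003E, 3, 0xFFFE, 9, 6)


if __name__ == "__main__":
    print(parse_cfb_header(make_fake_header()))
```

To try it on a real .doc or .xls, pass the first 512 bytes of the file, e.g. `parse_cfb_header(open(path, "rb").read(512))`. The sectors, allocation table, and directory entries that make up the actual "file system" all hang off this header.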


I'd probably use the new Office Open XML (Office 2007+) formats rather than the old binary formats.


You could always whip up scripts that emit XML + XSL stylesheets to convert them to Office XML, yeah.


The Word spec is 210 pages. Yikes! I wonder what kind of open-source tools these specs will spawn.

Related: anyone know how Scribd and folks like that read/display Microsoft-formatted documents?


This is nothing; the Office Open XML Document Format is over 6000 pages long.


It's a wild guess but I would have used COM access to the word.dll to convert it to a more reasonable format.

Again, a wild guess but that is how I would have done it, trying to reverse engineer formats as bloated as the office formats is generally not a good idea if avoidable.


"Antiword is a free MS Word reader for Linux and RISC OS. There are ports to FreeBSD, BeOS, OS/2, Mac OS X, Amiga, VMS, NetWare, Plan9, EPOC, Zaurus PDA, MorphOS, Tru64/OSF and DOS. Antiword converts the binary files from Word 2, 6, 7, 97, 2000, 2002 and 2003 to plain text and to PostScript."

- http://www.winfield.demon.nl/


There is a project from Apache that works across all (the binary) MS Office formats.

http://poi.apache.org/


Yeah, that'll work too. The point is that you need to leverage someone else's work to do it. Focus on your core, find shortcuts for everything else.


Well, given the release of these documents, as well as the existence of the Office Open XML format, there's nothing left to reverse engineer.

Granted, it's no picnic implementing the specs these documents outline, but it's certainly better than having to figure it all out from a binary file.


Well, it is a picnic, a picnic in the park, Jurassic Park that is.


Yes, that's surprisingly short, for what it claims to contain. HTTP/1.1, as of June 2007, is 222 pages, for example.


Has anybody with Word parsing experience read these who can speak to their level of detail?

Was this release prompted by any legal decision that anybody knows about?


EU antitrust litigation over MS Office's lack of interoperability; they were already fined 613M USD.



