My SaaS deals primarily with legal documents that for years had been maintained with Word. The pain of emailing documents is real, but the comfort level with how Word works is also real. Over the years, most organizations have developed internal workflows to share and send documents around that bypass the pains, and while they may not be perfect, they work.
The funny thing is that the document authors like these ways of working. It is the tech people who don't. I've seen "Git for Word" proposed many times a year for a while now. And all of the ideas are interesting, but none of them appeal to my audience because they don't care about git's feature set. Nobody wants to branch and merge. Nobody wants a straight version history. ("Nobody" meaning nobody in my market, not nobody in the world.)
They want a storytelling experience. They want to know the why, not the what. And the workflow tends to be unidirectional, not with collaborative changes coming back together, but with expanding changes as each person adds their ideas and makes changes for a specific instance of using a document. The experience we build for them brings in pieces of version history, pieces of comments, pieces of telling the story of why something was done, so people down the line can have more context to decide whether to accept or reject the changes.
It isn't that "Git for Word" is a bad idea - on the contrary, it would be great if someone pulls it off. My point is that building something that improves on Word isn't actually about the software, it is about the document workflows. If you find groups who work like software devs do, where documents receive small updates from a team, and bring all changes together for a final product, there is probably a market. But when evaluating such ideas, there has to be a reality check of whether the actual use of the documents truly matches the use case for git.
As someone who is at the intersection of tech and arts, one of the things I like about using git in projects is that it is very clear what is the latest official definitive final variant of a piece of data and you don't have to ask anybody to get it.
When I worked as a VFX freelancer I was amazed at the number of hours (=money) burned by marketing agencies who didn't manage to give me the definitive variant for a simple list of things they wanted. In one instance they gave me everything they had, including crude and unrecognisable filenames, hints about things that I should ignore via telephone etc. I had to make sense of it and compile a list which I sent them to approve. They ended up approving another list (!) which they themselves sent me two weeks prior and they only managed to correct this once I hinted at this.
Of course this is an example of how things should never be. This usually involves somebody getting sick and some uninformed person taking over etc. But what I learned on film sets is that you should choose the defaults of your communication culture in such a way, that it works under the absolute worst conditions (bad weather, hungry, stressed, confused, etc).
And I have seen so many organisations fail at precisely that. If you get ill, someone else should be able to take over without heading to an oracle. This is not a special function limited to a version control workflow, it is something that has to do with clear communication.
Using git can sometimes help avoid the whole problem by making it obvious which file is the latest and which is a variant of it, but the people using it will still have to communicate clearly as well (e.g. by writing good commit messages, choosing the "right" commit sizes, naming things the right way etc). So if you know how to use git, you just might value clear communications a little bit more than the average person.
> As someone who is at the intersection of tech and arts, one of the things I like about using git in projects is that it is very clear what is the latest official definitive final variant of a piece of data and you don't have to ask anybody to get it.
As git is a distributed system I think it’s not at all clear what the definitive final variant might be, and that is a strength.
That can be handled externally to git via ad hoc convention, say by using a system like gitlab or github and letting it declare one as “primary”, or by having someone post to a mailing list (“Commit X on a repo you can reach at URI Y is the official release”) both of which are common.
But in your example various people could mail you commits and not have any consensus on which is authoritative.
> you should choose the defaults of your communication culture in such a way, that it works under the absolute worst conditions
The defaults are sensible. Throw money at it and pay someone enough to sort things out and get it done, e.g. you as a freelancer get a data dump and ask the right question and the problem is solved. Sure it costs money. But everything costs money.
Git works great among peers. But most organizations are hierarchical. And the boss doesn't have to give a shit about which draft is the latest because the boss is the boss.
> And the boss doesn't have to give a shit about which draft is the latest because the boss is the boss.
As a boss myself I have to say: I totally give a shit. Salaries are my company’s #1 expense by a wide margin. I don’t want my staff spending their time manually merging docs received via email when there are much better solutions out there. I hire people because they are smart and can get shit done that makes money, not because I want servants.
This is the killer app for Office 365 and google docs: stop wasting time emailing shit around, one canonical version even outside of company walls.
Having a lot of experience in both legal (academic and lawyer) and tech (developer and founder), I totally agree with you and think you framed very well the situation.
As a lawyer I can fully confirm that our industry works as you have described (as regards document workflows), and with my tech background I can also confirm that most features of dev-oriented solutions like git are mostly uninteresting from a lawyer's perspective.
Similarly, I am a qualified lawyer. I previously worked in a big law firm in London, which describes itself internally as both the “best” and “the most advanced” law firm in the world. I now work full-time as an engineer in software, and increasingly in hardware.
I agree with both comments.
To add, in large-scale corporate/commercial practice (which is the area in which I practised), Git would be useful in replacing email-based collaboration, but the switching costs seem too high.
Currently, the corporate law contract negotiation workflow is as follows:
1. party A adds their tracked changes to a Word document based on a template contract;
2. party A emails this document to party B;
3. party B reads the changes, may discuss the changes with their client, adds their tracked changes, and then emails the updated document to party A.
This process repeats for every document, punctuated by occasional conference calls between the parties, until the parties agree.
‘Git for law’ would be useful for lawyers in increasing efficiency - and thus reducing costs for clients.
However, the benefits for law firms of adopting a new Git-based workflow are likely to seem relatively small to lawyers. Their current email-based version control system is messy and time-inefficient, but generally functions with minimal error.
On this basis, I would predict that most corporate law firms would be very slow to adopt a Git-based system - the benefits may not justify the costs.
One should also note that lawyers, particularly contract/commercial lawyers, are conservative by profession. In my experience, most lawyers are very slow to adopt new technologies, highly risk-averse, and skilled at spotting risks. The combination of these traits means that any technology will have to offer a very high benefit to replace an existing legal workflow.
I am still working in big law and I disagree that the benefits of adopting more advanced version control would be small. However, I work in a field that is a mix of regulatory law and litigation, where almost no work is based on templates and most things are drafted from scratch.
One very large problem that typically comes up in large teams: Only 1 team member can edit the "live version" of a document (it's locked for editing by the version control system), the other team members need to work "offline" and then reintegrate their changes/drafts into the main document. Everybody has lived through the horror story of a team member in a different time zone still having checked out the live version and going to sleep :)
Sometimes you have to circulate a draft document in parallel to multiple parties (e.g. colleagues with special subject expertise + client's inhouse lawyers + client's technical experts + other party's law firm + other party's inhouse lawyers + other party's technical experts). It can happen that you need to reintegrate comments from different parties to different versions of the drafts, e.g., if your client gives feedback quickly and you re-circulate updated version internally, then you receive other party's comments to the older draft version...
Besides the mechanical aspects of reintegrating comments, it is also difficult to track if everybody who needs to sign off has actually signed off on the parts of the documents they had to review. Often it gets lost who made which comment/change. It can be quite awkward if a regulator asks you "Isn't the technical statement on page 12 contradicted by fact XYZ" - please explain until tomorrow - and you have to quickly figure out who actually put that in...
Actually what is so time inefficient about this workflow? Sending by email doesn't take a minute. Also it is a great way to transfer responsibility (in whose court the ball is).
There are many flows that don't work. You can only have a single person working at the file at a time. You need a mutex (which email is in this case). However imagine that you send back a file, then realize that you missed something. You can add more changes and send it again, but if the other party has started working already now they have no reasonable way to merge in the new changes.
Also is there anything that actually guarantees that the tracked changes were the only changes made? I haven't seen this but it seems like a serious flaw in the process.
Also what if you get an intern to do some of the work, then you want to review the changes between version:$lastyousaw and version:$current. IIUC the mail with tracked changes only allows you to view one "patch" at a time.
To argue for this workflow: the mutex is soft (social) which is a good thing, sometimes ppl do not reply on time and you need to move forward. For having new ideas/paragraphs after sending out, you either inform the colleague to hold it until you're done, or you send a paragraph and ask him to merge it.
The mail allows you to see all patches, sometimes 'clean slate' is done by accepting all changes. While this sounds like a problem in theory, in practice it's not.
I agree that things can be better somehow, but it is really difficult to see any solution which is at least 10% better. The current workflow also has the advantages:
- data is as safe as your filesystem and email system together
- Word file is generally not considered a vendor lock-in
- everybody understands the workflow
- nobody can block the workflow (like not checking in again with sharepoint)
> Git would be useful in replacing email-based collaboration, but the switching costs seem too high.
Git was originally designed on an email based workflow for software development (hence the commands am, format-patch and send-email).
For contract negotiation, if the template contract was in plain text, then it could be emailed as a patch. The party would then apply that patch to their local git repository, make the changes and email the diff from the original template back to the first party.
So essentially, you could still use email, but have the diff between changes as the content in those emails (along with inline comments).
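To make the "diff as email content" idea concrete, here is a minimal sketch using Python's standard `difflib` instead of git itself. It only shows what a unified diff of a plain-text contract would look like in the mail body; the clause text and the file labels `template.txt` and `party_b.txt` are made up for illustration.

```python
import difflib


def contract_diff(template: str, revised: str) -> str:
    """Return a unified diff between two plain-text contract versions."""
    diff_lines = difflib.unified_diff(
        template.splitlines(keepends=True),
        revised.splitlines(keepends=True),
        fromfile="template.txt",   # hypothetical label for the original
        tofile="party_b.txt",      # hypothetical label for the markup
    )
    return "".join(diff_lines)


# Made-up two-clause contract; party B changes only the term.
template = "1. Term: 12 months.\n2. Fee: 1000 EUR.\n"
revised = "1. Term: 24 months.\n2. Fee: 1000 EUR.\n"

patch = contract_diff(template, revised)
print(patch)
```

The printed text is what would go into the email; with real git the equivalent would be produced by `format-patch` and applied by `am`, as mentioned above.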
Unfortunately, corporate Outlook/O365 based email systems don't work very well when used in that fashion.
I am not a lawyer, but as an entrepreneur I've had to send-and-receive a lot of legal documents with investor's lawyers, almost exclusively in .docx.
I never trust the received file's "track changes", always compare to the latest version I've sent -- and it is extremely common to find a change that wasn't mentioned/discussed, and somehow magically "accepted" or otherwise not tracked in the other side's "track changes". Whenever I point these out, I always get an "oh, yes, forgot about that one", or "I didn't intend to put that in" or "I'm not sure why it didn't appear in the track-changes view" -- but out of tens of these (with multiple lawyers over multiple years), not one was ever in my favor.
Branching might not be as interesting on a single project - but diffing is, very much; and I'm sure it's not more coveted mostly because most lawyers either (a) don't realize how good it makes life for you when you can diff and blame easily, or (b) are abusing the fact that it is so hard to diff/blame on documents, and certainly (c) usually charge by the hour, so some efficiencies are actually going to cost them money if they implement them (a famous Upton Sinclair quote comes to mind).
You are right, this happens a lot when drafting out-of-court documents.
IMHO this is a useful "feature" for lawyers. Don't forget that usually lawyers of two parties are working "together" only apparently, when in fact they are always litigating for their client's best interest.
The goal is not to reach a common agreement, but to reach the agreement that best serves the interest of one's client, most of the time at the expense of the other party.
This is achieved in many ways, one being having text in a contract that the other party is not fully aware of, either because it's not properly understood or noticed.
This means that including text in a document without the other party noticing is a good old trick that is quite valuable to any lawyer.
As a lawyer my position on this is that it's the other party to blame if it did not check the document properly (I always compare the documents for differences even when sent with revisions).
> (c) usually charge by the hour, so some efficiencies are actually going to cost them money if they implement them
I tend to disagree with this line of thought. Lawyers have a thousand ways to inflate their timesheets. Using a tool that makes their life more miserable by forcing them to do manual work that could be automated is certainly not one of them.
> As a lawyer my position on this is that it's the other party to blame if it did not check the document properly (I always compare the documents for differences even when sent with revisions).
As a non-lawyer, this is why they say "the problem with lawyers is that 95% of them give the rest of them a bad name". And as I mentioned, I also always compare.
> Lawyers have a thousand ways to inflate their timesheets. Using a tool that makes their life more miserable by forcing them to do manual work that could be automated is certainly not one of them.
I agree, and they do inflate them regularly -- all lawyers I asked to draft NDAs and employment agreements for me charged a few hours worth for the first one "because they had to write it" even though it was unchanged from another client (for sure; I've seen that exact one before).
Still, they need to keep an air of "being busy" and "working hard", and the best way to do that is to occasionally work hard.
In 2013, we were hacking away at a document tracking system to solve exactly this. We thought we were disrupting the legal market while in reality the lawyers were way too comfortable with Word and emailing docx files.
Exactly like you've hinted, the right way to crack this is to bring a full-fledged word processor like Google Docs, but instead of ad-hoc realtime collaborations the software has to enable customizable unidirectional document workflows with controlled collaboration.
Most serious document creators don't want to branch and merge, instead they want to pass on the document through a series of stages. They want statistics on when, what and why of each stage. And at any point of time the document is in one definitive stage not scattered across emails/folders/versions/forks.
Interesting perspective. But I wouldn't say I want to use git because I want to branch and merge. I would prefer all history to be linear -- it's just not always possible. My main draw toward git is just for keeping track of past versions along with comments describing the changes (that's what I see as "the story" as you put it). It's nice to know I have merging tools available to help me if I get stuck in such a situation, but I would prefer to never have to merge anything.
I agree with this being about workflows, not documents, but isn't branching and merging a workflow and storytelling tool?
It allows multiple people to work in parallel (and in private). When somebody sends a pull-request eventually, they are presenting a story of changes that they want to get into the document and people can discuss them and approve them individually. (Of course, git the tool isn't necessarily suitable for non-technical people, but git-the-workflow seems to be a good foundation.)
Could you elaborate on what such a tool could look like without git style branching?
I think the issue is that parallel workflow is a bridge too far for the current legal profession. They do sequential and they like it. A tool that makes the sequential workflow better will gain more ground than one that tries to change the process whole hog.
I have worked on several hundred of these types of contract processes in the last few years, and you are absolutely correct: sequential is where it's at for these situations. I have, however, encountered a few situations where time was an issue so I had two or three different versions out to different parties at the same time, and then merged the proposed changes where possible and sent out alternate versions where the changes differed. That process was... not fun, and could definitely use a more coherent workflow than manual merges or Word's built in merge features.
The anecdotes told here suggest that most think their work is sequential, while because a lot can happen asynchronously, hell situations happen all the time. Do you agree?
But the workflow only "looks" sequential, doesn't it? In many stories told here a user may have inadvertently reverted clauses to old versions and this can be missed. This happens because it is also asynchronous work.
Personally, I'd be tempted to write Git for Word as a plugin that Just Worked. Users are given a new interface (The Document Repository?) from whence to select documents. Every auto-save is a commit. Enforce "explaining why" by ... what, requiring an in-document comment near the changes? A popup asking for explanation, and that's added as a commit message?
I don't know, I'm just spitballing. Sounds like it'd be fun for awhile to attempt to seamlessly get this into the workflow and see how it's accepted.
Exactly. Git for Word doesn't make sense because software applications can be edited independently because of abstractions -- you can't have different people changing different parts of a Word document without ANY idea of what each other is doing the way you can in software. You can't change the tone in one section to conversational in isolation.
A typical workflow with Word documents requires multiple people reviewing, commenting, and merging or rejecting the comments. Microsoft Word's change tracking, comment, and review features are adequate for most people.
The big problem with git for word is that git is not designed for the population in general. Using git is not easy, even for developers. Although powerful and flexible, it has a very complicated workflow that is too close to its implementation. In my opinion it is just like wishing that the general population use LaTeX instead of Word.
It is _their clients_ who don't, not just tech people. I hired lawyers a few times. IMO their redline and email workflow is error-prone craziness that could use improvement. That said, I'm a "tech" person, so I might be biased.
I've spent many years in 'collaborative writing' in R&D, mainly grant proposals and joint reports/deliverables, most in the CS/IT domains. Writing those texts is very different from writing the software.
First thing you should realize is there are no 'tests', and all the 'code' is usually in a single big file. Anyone that has touched the document can have potentially messed up everything, both content, layouts and meta-data, and there is no automatic way to check whether it still makes sense. Many times people will not use the agreed upon editor/version, and sometimes (often) that means a boatload of minor edits to the document all over the place just from opening and saving. Imagine everyone in your software team using different editors all with their preferred coding conventions that are automatically applied to the whole project at load.
From this you can deduce the enormous responsibility of ownership and gate-keeping in the workflow. The absolute worst collaborations I have been part of were those that somehow believed that if they used a collaborative document editing facility, wikis or Google docs for instance, that would negate the need for assigned owners/editors. Those tug-of-war shitstorms got exponential the closer one came to the submission deadline (technically incorrect, I know, but you know what I mean).
Some tips:
- Have well defined ownership for each section or part of your document. The owner receives and makes all changes for that part.
- have a final editor that is responsible for the complete document receiving the changes of the parts from their owners only.
- Do not trust 'track changes', but use Word's built in document compare if you are the final editor. For complex formatted documents (nearly all instances require you to use an insanely styled template), you 'clean room' import (C/P through notepad) the text changes into the correctly formatted doc under your control.
- release the current trunk document often, ideally once per day. This requires staggering, with subeditors closing submission windows and submitting their updates to the main editor before EoB. Everyone editing should work against the latest release.
- Every version published by the final editor should be immutable. Mail it to everyone if needed, but if you use a link to some sort of repository make sure it is a deep link to a version that can not be updated in the repository, or hilarity will ensue.
- use versioning in the filename. filename_YYYYMMDD_HHMM_dXXX_rNN.docx where XXX is the assigned party acronym for the person making the update. 'YYYYMMDD_HHMM' is only touched by the editor, 'dXXX_rNN' is the NN-th change release by party XXX against version YYYYMMDD_HHMM.
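The filename convention in the last tip is regular enough to check mechanically. Below is a rough sketch of a parser for it. Two things are my assumptions, not part of the tips: that an editor release carries no `_dXXX_rNN` suffix at all, and that the party acronym consists of letters only. The name `DocVersion` and the sample filenames are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DocVersion(NamedTuple):
    base: str                 # the "filename" part
    stamp: str                # YYYYMMDD_HHMM, set by the final editor only
    party: Optional[str]      # XXX, None for an editor release (assumption)
    revision: Optional[int]   # NN, None for an editor release (assumption)


# filename_YYYYMMDD_HHMM_dXXX_rNN.docx, with the _dXXX_rNN part optional.
PATTERN = re.compile(
    r"^(?P<base>.+?)_(?P<stamp>\d{8}_\d{4})"
    r"(?:_d(?P<party>[A-Za-z]+)_r(?P<rev>\d+))?\.docx$"
)


def parse_name(filename: str) -> DocVersion:
    """Split a versioned filename into its parts, or raise ValueError."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a versioned filename: {filename!r}")
    rev = match.group("rev")
    return DocVersion(
        base=match.group("base"),
        stamp=match.group("stamp"),
        party=match.group("party"),
        revision=int(rev) if rev is not None else None,
    )


# A change release by a made-up party "ACME" against the 2024-03-15 17:30 release.
update = parse_name("proposal_20240315_1730_dACME_r02.docx")
release = parse_name("proposal_20240315_1730.docx")
print(update)
print(release)
```

A final editor could run something like this over incoming attachments to reject updates made against a stale `stamp` before even opening them.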
Most certainly Git can function as a repository, but there will be people that will not work with it (nor any other repository) so always assume mail interactions as well.
Finally, there should be a special place in hell for the people that designed SharePoint versioning. Don't even think of going there.
nice story (really), makes me wonder if a special kind of group merge to give a better idea of who/why on the changes at a particular time would be interesting
I've made this script which automatically extracts the Office file format (which is a ZIP archive of XML documents) and versions the XML documents and their extracted text contents alongside the binary Office file. This is done using a Git hook and it seems to work pretty well. If you're in need of versioning Office documents, this might be a good enough solution for you.
Edit: I should also address why not use the built-in Office versioning feature? The reason I don't use it is because I like to be able to view the diffs in Git. I don't want to have to use Office just to see the changes. My solution offers that. By doubling-up the way the original is versioned in the way of tracking the extracted XML and text contents as well, each commit's diff will have the binary change as well as the textual diff which in my experience is good enough to tell the gist of changes. And you're using standard Git / text manipulation tools you would use with any other diff.
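For readers who want to see the general shape of the extraction step: the following is a Python stand-in, not the script described above (which is a separate tool hooked into Git). It assumes only what is stated here, that a .docx is a ZIP archive of XML, plus two facts about the format: the main body lives in `word/document.xml` and text runs are `w:t` elements inside `w:p` paragraphs. The helper `make_demo_docx` builds a stripped-down archive purely for the demo; it is not a file Word would open.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(docx_bytes: bytes) -> str:
    """Return the paragraph text of a .docx, one paragraph per line."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is the concatenation of its w:t runs.
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs) + "\n"


def make_demo_docx(paragraphs) -> bytes:
    """Build a minimal ZIP with only word/document.xml (demo helper)."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


demo = make_demo_docx(["Clause one.", "Clause two."])
text = docx_to_text(demo)
print(text, end="")
```

Writing `text` to a `.txt` file next to the original before each commit gives the skimmable diff; headers, footnotes and nested text boxes are ignored in this sketch.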
This looks very interesting. Do you think it can be applied to other kinds of XML files? I'm interested in using git with a vfx software (The Foundry Nuke) that writes XML projects, and it would be great to have some versioning system for it.
I've tried using the git diff patience algorithm, but it didn't work well - frequently, the diff was about to remove every single line and add them all back to the XML file.
As with source code, if you can get a consistent linter/formatter run on the file before commit you should see less "jitter" in the diffs those commits produce.
I got some decent results with `xmllint --format` which is the linter/formatter from libxml2 (so available in most Linux distros and ported to most platforms).
(I was using xmllint as a formatting step when unpacking ODT files in my similar tool to the directly above; mentioned in a sibling comment. I found the XML files in ODT files were much more prone to being minimalized and reformatted/reordered on every save in comparison to DOCX which was surprisingly more stable in XML formatting.)
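If `xmllint` isn't at hand, the same normalize-before-commit idea can be roughed out with the Python standard library. This is only a sketch of the principle, not a replacement for `xmllint --format`: it needs Python 3.9+ for `ET.indent`, it additionally sorts attributes (my choice, to remove one more source of jitter), and ElementTree will rename namespace prefixes it doesn't know, so for heavily namespaced files the real tool is probably the safer bet. The function name `normalize_xml` and the sample documents are made up.

```python
import xml.etree.ElementTree as ET


def normalize_xml(source: str) -> str:
    """Re-serialize XML with sorted attributes and uniform indentation."""
    root = ET.fromstring(source)
    for elem in root.iter():
        # Attributes are serialized in dict order, so re-insert them sorted.
        items = sorted(elem.attrib.items())
        elem.attrib.clear()
        elem.attrib.update(items)
    ET.indent(root, space="  ")  # Python 3.9+
    return ET.tostring(root, encoding="unicode") + "\n"


# The same made-up document, saved by two tools with different habits.
minified = '<doc b="2" a="1"><p>Hi</p><q/></doc>'
reflowed = '<doc a="1"   b="2">\n\n   <p>Hi</p>\n<q></q>\n</doc>'

print(normalize_xml(minified), end="")
print(normalize_xml(minified) == normalize_xml(reflowed))
```

Run as a pre-commit step, both variants end up byte-identical in the repository, so only real content changes show up in the diff.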
In your situation, I'd just whip together a quick PowerShell script like I have here, but tailor it to the structure of your file format: traverse the XML tree and have a few if-else statements which filter out noisy metadata you don't need to see in the diff, if any, and save the resulting collected text node contents as a text file alongside the XML files. Each commit with changes to the XML will thanks to the Git hook also have a corresponding TXT file so you can very easily view the changes in a skimable way, unlike the potentially really big and messy XML diff you'd have if you versioned only the original.
Because I built it to be extensible/support plugins I've used it for all sorts of interesting file types beyond DOCX too. (CELTX, a screenwriting format from years back; prettier diffs for Inform 7 source text; experimented with an SQLite deconstructor; ...)
Looks like I take a slightly different approach too, in that I store a bunch more metadata about the deconstructed contents (not just relying on directory listings), so I end up trusting my reconstruction tool a bit more and I mostly don't store the binary blobs in git, as I assume I can reconstruct them quickly enough.
One benefit of your solution over the `textconv`-based approach mentioned in the article is that your solution offers two different levels of diffs (XML and TXT).
To simulate that with textconv, you’d have to switch between two `diff.doc.textconv` variants.
On Windows you can just use TortoiseGIT, it can do diff and even merge by calling Word's internal compare tools. I can attest that diff works fine (differences show up as if you had used track changes within word), but I haven't had occasion to try merging Word documents with TortoiseGIT yet. The same functionality was already available in TortoiseSVN.
Interesting, I didn't know about fodt! Only knew that godot engine had done something similar (for git specifically).
I downloaded a docx document from the net, opened it in libre office, removed a single word, saved it as fodt, removed a single word again, saved it as fodt again, and the diff between the two fodt is gigantic.
Apparently there are lots of items like <text:p text:style-name="P20"> whose content didn't change, but their ID did. It didn't even only affect IDs of content after the removed word, but content before as well.
The file has 19361 lines and the diff size is 1110 lines so there is some level of locality, but note that a lot of those lines are just base64 data of image content. The fodt is 1.5 times as large as the original file.
You have to save, close, re-open, save, close, re-open a few times before the diffs stabilise – and even then it'll seemingly-arbitrarily rename all the tags.
I recommend having a commit hook that (somewhat) pretty-prints and line-wraps the XML – perhaps splitting on sentences too, so that adding a word doesn't proliferate all down the page. I haven't tried this, though, so it might not help. If you do, could you release the code?
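The sentence-splitting part is easy to try in isolation. Here is a deliberately naive sketch, untested on real FODT files like the suggestion itself: it puts each sentence on its own line so that a one-word edit touches one line of the diff. The splitter breaks after `.`, `!` or `?` followed by whitespace, which will mis-split abbreviations such as "e.g."; the sample text is invented.

```python
import re

# Break after sentence-ending punctuation that is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def one_sentence_per_line(paragraph: str) -> str:
    """Return the paragraph with every sentence on its own line."""
    sentences = SENTENCE_END.split(paragraph.strip())
    return "\n".join(sentences) + "\n"


before = ("The fee is due monthly. Late payment incurs interest. "
          "Either party may terminate.")
after = ("The fee is due monthly. Late payment incurs statutory interest. "
         "Either party may terminate.")

lines_before = one_sentence_per_line(before).splitlines()
lines_after = one_sentence_per_line(after).splitlines()

# Count how many lines a line-based diff would report as changed.
changed = sum(1 for old, new in zip(lines_before, lines_after) if old != new)
print(changed, "of", len(lines_before), "lines changed")
```

With fixed-width line wrapping instead, the inserted word would push text onto every following line of the paragraph, which is exactly the proliferation described above.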
I don’t understand why this 6 year old article has been posted when current Microsoft 365 versions of Word et al have built in version control and real time collaboration.
Yep. They've had version control for at least a decade, diff'ing also, by way of Compare. I'm also not sure why people are fascinated with using git here. It's weird seeing all of the complex solutions in this thread for a problem which does not exist.
Edit: I meant 'fascinated with using git here in this context'.
I don't know why you are getting down voted for speaking the truth. I don't know a single person who this article would relate to in today's climate since everyone I know is using the latest version of office or on google docs.
I suspect most people in this crowd either don't have direct experience with Office 365, or haven't discovered its versioning/real-time collab features.
That said, "track changes" is still used extensively especially with parties outside the organization, especially for legal documents.
Some of the proposed solutions were very nice, particularly Draftable - but it's expensive and my bosses didn't feel it was worth it. To this day they still work on huge slide decks that are partially shared, but I'm just not involved anymore with that side of things so I stopped pushing. I still think a way of tracking Powerpoint decks on a slide-by-slide basis, with partial merging and synching, would be really good to have (existing features for embedding are '90s-era).
For Word there are quite a few solutions nowadays, most are clearly superior to the stuff Office ships with. So the problem is still there, just not as bad as 15 years ago.
I remember that discussion. If I recall, your experience was poor because you didn't have Sharepoint storage, yes? (merge conflicts?)
I use O365 collab features daily (with SharePoint/OneDrive storage) and the experience has been similar to that of GSuite. I regularly work on PowerPoints with multiple people simultaneously editing the slides.
It already existed at the time this article was written, as well, I think. There is probably a niche for purely ergonomic tooling that works with, not against, the built-in features, but a lot of this is a matter of positioning - I know some expensive and widely used legal document management systems that are strictly worse than the built-in features (regularly lose important data thanks to user error in ways that are impossible with the built in features). They still sell and get used.
I really don't understand how Word remains so popular. It was created at a time when few people had internet access, and was designed to produce printed documents. It was the perfect tool to write newsletters, flyers, articles, academic papers and manuscripts. The world has moved on though, and I fail to see Word's relevance today, other than the sheer number of people that are familiar with it.
Word is expensive, proprietary and the XML it generates is unfathomable. There are so many better FOSS tools and systems that we could be using. If you're collaborating on a document then markdown or LaTeX has you covered. You get version control though git and multiple people can contribute. If you're writing a book or article, then the graphic designers and typesetters are going to make the design decisions, not the author, so why bother messing around with fonts and colours and the infuriating placement of images and tables.
I authored a kid's book on coding, and the process was a nightmare. I authored in markdown, used pandoc to convert and then further edited in libreoffice, to be able to send stuff through in docx format. Then revisions were sent back in docx and I had to reverse the whole process, so I could maintain my plain-text version of the book. Then the proofs were sent through as PDFs, which I then had to markup for corrections. Many of the mistakes were due to the crappy way Word places images. In the end I just bought a copy of Word, and submitted to the way my publisher wanted me to work, which disrupted the authorial process.
It's time we ditched Word, in the same way we ditched VHS and DVD. It's an outdated technology that remains dominant just because everyone uses it at school, and then refuses to move on. If schools insisted that all homework was submitted in something like markdown, we'd see a dramatic change in a very short period of time. (BTW when I was teaching CS, my kids authored in markdown and submitted on GitHub)
> If you're collaborating on a document then markdown or LaTeX has you covered.
These are not WYSIWYG solutions which answers 99% of your question "why". When people want to write a document they want to write things and have the things appear on a page, possibly in different formatting. Injecting ideas like source files, rendering pipeline, etc. will just result in confused people.
That's why online solutions like Google docs are popular. No special app, things look like expected, you can collaborate, and few people actually need any fancy features.
The worst part is the default font, Computer Modern, which is absolutely deplorable. I'm not a big fan of the general style of "modern" fonts (a name coined at the end of the 19th century when they still were actually modern). But, worse than that, Computer Modern is horrifically slim and spindly. I've read repeated rumours over the years that it was deliberately this way because the printers of the time used ink that would run a bit so the font was like that to compensate, but I don't know if it's true.
That's not such a big deal since you can obviously change the font. For a long time there were lots of text fonts but very few math fonts, and those math fonts that did exist would either have some symbols from Computer Modern or wouldn't have a suitably similar text font. But now there are a fair number of choices. Personally I like mathpazo (with Palatino for text) but I've found people used to Computer Modern can find this a bit much of a radical departure. (Edit: I've found a more conservative choice is Times for text and Utopia (MathDesign) for math.)
TeX does have a few small typesetting niggles. For example, if you set f(x)g(y) with normal small brackets around the x but large brackets around the y (because it's really a displayed fraction) then you'll find g is miles away from its argument but right next to f's argument. (I'll avoid opening the can of worms about what the root cause is here, but it's very clearly wrong in this case.) This is actually not that big a deal either - there are lots of problems like this but they're all fairly minor and small in number compared to the huge number of things typeset correctly. The only problem comes when people refuse to correct things because they assume that if TeX typesets it that way then that must be correct by definition.
Have you tried XeTeX with the TeX Gyre fonts? It made a big difference for me. (I previously used pslatex and then pdflatex with Palatino/mathpazo as well.)
I can't stand the fonts, don't like how it applies space, etc.
TeX is like a programmable pocket calculator from the 1970s, way ahead of its time but today it's something that conspires with Word, Google Docs, and other dull tools to suck out the oxygen for sharp tools.
I think my argument is that WYSIWYG needs to die. For the vast majority of people they want nothing more than:
> text
> image
> more text
> table
> more text
There are any number of applications that allow you to write markdown and view the generated HTML in whatever formatting you want. Your recipient then gets to choose their own fonts, colours etc, which from an accessibility point of view, is much better.
Unless you're printing a hardcopy or creating a PDF, what is the point of Word?
I don't believe you'd be able to convince anyone not in tech that typing Markdown image syntax by hand is in any way better than clicking "insert image" if you just want to send someone a report. Never mind explaining that paths need to be relative and resources included in the attachments.
Even I, happily maintaining some pages in reST, wouldn't want to inflict that on people.
> I think my argument is that WYSIWYG needs to die.
WYSIWYG is a big part of what made the GUI revolution so successful. The computer for the rest of us, wouldn't be for the rest of us, if we had to worry about Git and how to render our file format.
I've had the same frustrations dealing with publishers and Word templates as you had. Your mistake is that you are conflating our experience writing a technical book with the vast majority of users who are not writing technical literature. A writing system for the masses should be as easy to use (for the basics at least) as paper and pencil. Git and learning even a simple markup language does not meet this standard.
WYSIWYG for professional word processing is like training wheels - it lets you start being productive on day one, but if you don't spend the effort to learn how to work without them, they get in the way and make you slower -- although you wouldn't know that unless you've seen someone who can do the job without them.
I used Word for ~10 years, but have not in the last ~20 or so years, after I realized how much time and effort it cost me -- I nearly missed an important deadline because of a Word 2 vs Word 6 incompatibility that manifested at a very inopportune moment.
It's been around for almost 30 years. I'm constantly receiving documents from people who've used it for >25years. And there is never use of styles, often spaces instead of tabs, many "new lines" instead of a page break, and a host of other things like that. References are not dynamic (just typed out) meaning that an item inserted in the middle of a list makes many of them wrong.
The vast majority of people who have used it for decades use it mostly as a smart typewriter, because the "pro" features like styles require a lot of discipline and the "let's just press the bold button" is too easy and enticing.
WYSIWYG needs to die whenever anything professional is needed.
You have now lost almost all people who currently write documents. Nobody who is not a developer wants to write in markdown. The mass market wants point and click, buttons, and WYSIWYG.
I think you can turn off that smart selection thing. There is corresponding "smart" behaviour when you drag a selection to a different place so you don't end up with missing or duplicated spaces, so the precise boundary of the selection isn't as important as you think unless you want to select mid-word. I think for most people this behaviour that you find annoying is actually useful - remember that Microsoft is one of the few companies that actually does real user testing.
For your other objection (and maybe what you were also really getting at with your selection objection), maybe you'd like WordPerfect 5.1's "reveal codes"? :-) We can all agree that Microsoft wouldn't have hesitated to steal that feature if it would have benefitted them. The fact that they didn't is proof that formatting markup is something that was historically tried (or considered) and rejected, rather than something waiting to happen in the future.
In any case, for a program as huge as Microsoft Word, I think this is all quite minor. How much of your day is really ruined if you start typing after some bold text, find that the new text is bold when you didn't want it to be, and have to manually turn it off again? It's a fundamental problem with the model, like you said, but has surprisingly tiny impact on usability. If this is your biggest objection, it's almost proof that the program is pretty good. (But I can sympathise with minor objections: I hate that copy and paste works differently in Excel than in any other program!)
I deeply and honestly would love to agree with you but unfortunately I can't based on my experience.
I write a lot of stuff in the legal area (articles, books, contracts, court documents, etc) and there's nothing that comes close to Word.
For some time I had tried to switch to LibreOffice. My goal was to quit Word, which is the only software that still binds me to Windows/Mac (not interested in Wine). I hoped to finally be able to switch to Linux without any hiccups.
Unfortunately LibreOffice is not quite as good as Word. I use many of the advanced features of Word, and the more you use these in LibreOffice, the more you encounter bugs. At one point I had a .odt file with tons of cross-references in footnotes pointing to other footnotes. When I was ready to ship the document I found out that all cross-references were messed up and I had to redo them all.
Now it's true that LibreOffice has a huge and active community that works hard to improve the product, but as word processors are my main and most important tool for work, I need the most reliable software I can get. Unfortunately that is still Word...
On top of that I must add that I do need to properly format documents 99% of the time, and on this too I find Word slightly superior, even if admittedly it is quite comparable to FOSS solutions here. The only fairly big problem in this regard is interoperability. Since I know that most, if not all, of my colleagues/counterparts use Word, whenever I send a document I need to send something that "will just work" for them, which is a docx. This means that using anything other than Word might cause problems with formatting, which in some cases is pretty important.
I agree that Word is bad, but I don't think we have a great replacement yet.
Markdown -- standard markdown isn't expressive enough (no tables for example), there are lots of extensions but none which are "standard".
LaTeX -- doesn't produce accessible documents, so is a non-starter in lots of areas (seriously, the PDFs it generates are some of the worst around when it comes to accessibility. Word's are amazing).
If we put aside the Word file format, and maybe the ribbon, is Word bad though?
I've been using it for decades, and have tried OpenOffice and LibreOffice too over the years - nothing comes close to Word.
Markdown is not suitable for "normal" users, but as a developer, I've come to prefer markdown for technical documentation and such (especially where I want a history, diffs etc), but I still use Word for a lot of other things.
Word is incredibly fully-featured - I use a lot of functionality, but am likely still only using a fraction of what it has. It really does have all your document editing needs covered.
Aside from the file format, I think Word is a fantastic piece of software. I have a few annoyances with it now and then, but it's been very dependable and kept me in good stead over the years.
> If we put aside the Word file format, and maybe the ribbon, is Word bad though?
Is the ribbon bad, though, or are we just living through 2003 era UI shock for 13 years now?
I lambasted the ribbon as was fashionable back then, but actually using it in 2020?
It's good. Like, legit amazing UI design. Makes it easy to peck through menus and find what you want. Structures chords for power users, and importantly makes those chords visually discoverable. I've found a pile of good keyboard shortcuts just by starting to slowly press a chord on the ribbon.
It's great for mouse users.
It's great for noobs just learning the ropes.
It's great for keyboard users who already know their stuff.
It's amazing at facilitating the learning process into knowing your stuff
It's been great for 17 years. Do we still have to pretend it sucks because we were used to a different way of doing things?
I said that because I know the ribbon is still controversial, and some people still don't like it - personally, I've come to like it, if not love it as you have.
In my opinion the issue is path dependency. Everybody in my generation grew up with Microsoft Office free on every installation of Windows, and teaching everyone new efficiency software is costly. It looks like Google is trying to win the next generation and Chromebooks are taking over the classroom. When today's youth grow up I wouldn't be surprised to see Google Docs as the new standard.
> Microsoft Office free on every installation of Windows
Microsoft Office is quite useful, and probably good value for money. But it has never ever been free, and I’ve been buying it for work and home since Office 4.x on Windows 3.1 was the new hotness.
My theory is that the answer lies along a different axis than the other commenters. I think Word is a consequence of the fact that modern office work is optimized to extract the maximum short-term resources out of people, or at least put up the appearance of doing so.
Consider someone dealing with inter-departmental collaboration on documents at a company in the 70s or 80s. They could potentially invent their own system, make paper copies mandatory, go full computer, or any number of solutions in between. Technology was considered hard and looked recognizably so, and management was less likely to question technical views and opinions about this. People were way less likely to get fired and generally visualized staying there for a while, so they were comfortable sticking to their viewpoints.
Today, your boss and their boss are all concerned with how to get the maximum amount of work out of you in the time you're at the company. So if you propose retraining everyone on Open Office or Markdown, because it has high potential for a better way of tracking changes or something, you'll get pushback from a) management, because the CEO is going to say “but I use Word all the time, why can't you just use that?” and b) the workers, because they know they will be forced to learn it on their own time rather than being given a proper amount of time to train and learn. [1]
I think modern society and modern work are slowly defaulting to the idea of quickly throwing in the towel and just using whatever technology is approved by the milieu. This is true even in our industry, consider this article [2] by Latacora [3] for instance: it's full of statements which approximately say “Just use CloudTrail”, “Just Use Jamf”, “Just Use Okta SSO” etc. If our industry is doing things like this to optimize extraction (even the article acknowledges that SOC2 is purely documentation optimized for selling to big companies), why would we be so surprised that publishing departments and such are optimized to Just Use Microsoft Word rather than a technically better system?
-------
[1] Think back: when was the last time you had a proper training about how to use a certain piece of software by people from the company building it, or at least certified trainers? These were way more common back in the day.
Word is not the best anything. It's not the best for layout (InDesign) or best for tracking changes/collaborating (Git) or the best for visual archiving (PDF) or responsive layout (HTML) or…
Basically, it's a "good enough" WYSIWYG, and a number of industries have standardized on it, in spite of the fact they should actually use an open standard + tool that actually fits their needs. I think screenwriting might be the one industry to escape Word, since they use Final Draft as I understand it.
> I really don't understand how Word remains so popular.
Because Office/Word has become the hammer of the document writing world.
It isn't an issue that it's a bad product and better products are out there (and there are).
It is that everyone is expected to know how to use Word at a basic level. From secretaries to VPs and CEOs, almost universally these people can open a Word document, edit it, and save it.
Because of this expectation, it is easier to throw money at Microsoft and have the tool you can expect everyone to use.
I gave up using Word to write manuscripts when I switched to Markdown documents in git.
In the last few months, though, I gave up on Markdown to switch to a more robust format - LaTeX. Before I switched, I didn't know LaTeX at all, but I knew from my reading that it had the features I needed.
I don't know if you're already familiar, but pandoc perfectly bridges this gap for me. You can write things in markdown, then convert it to LaTeX no problem with pandoc. You can even make templates for it, and write math mode in markdown.
It certainly makes for less _noisy_ source files in my opinion, and it also means that you get to take advantage of the fact that, if you want to, you can easily convert your markdown to HTML, with maths using something like mathjax.
This was a bit of a ramble, but I honestly can't say enough nice things about pandoc.
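For anyone who wants to script the markdown-to-LaTeX (or HTML) step described above, here is a rough sketch in Python. The helper names `pandoc_command` and `convert` are made up for illustration, and the file names in the usage note are placeholders; the flags themselves (`--from`, `--to`, `--standalone`, `--template`, `--mathjax`, `--output`) are regular pandoc options. It assumes pandoc is installed and on your PATH.

```python
import shutil
import subprocess


def pandoc_command(source, target, to="latex", template=None):
    """Build the pandoc argument list for one markdown conversion (hypothetical helper)."""
    cmd = ["pandoc", "--from", "markdown", "--to", to, "--standalone", "--output", target]
    if template:
        # Optional custom template file, as mentioned in the comment above.
        cmd += ["--template", template]
    if to == "html":
        # Ask pandoc to render maths in the HTML output with MathJax.
        cmd.append("--mathjax")
    cmd.append(source)
    return cmd


def convert(source, target, **options):
    """Run pandoc; raises if pandoc is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, target, **options), check=True)
```

Something like `convert("chapter.md", "chapter.tex")` or `convert("chapter.md", "chapter.html", to="html")` would then cover both outputs from the same source.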
I hope to see a post from you one day saying "I gave up on LaTeX and switched to org" :-)
Seriously, org has served all my authoring needs for over a decade now. You can export to LaTeX and HTML easily, and now pandoc does a decent job of exporting to other formats. You can embed LaTeX lines in your org document, so you get the full power of LaTeX, without having to write LaTeX for everything. Tables are hellish in LaTeX, and even lists are a pain.
Of course, there is the whole "You have to use Emacs" thing...
Oh, I was all about LaTeX for a decade prior to switching to org, so I know the feeling. However, org is just so much more lightweight that I found myself writing/authoring a lot more once I switched. I now often author emails where I need rich text (embedded code with syntax highlighting, tables, etc) in org.
Oh, and I try to do all my presentations in it too - it can export to Beamer.
As for Emacs, I know what you mean. I tried it on and off for 10-11 years before I finally stuck to it. In my case, what helped was that vi/vim really was much worse, so it's not like I had a seriously good alternative. I tired of repeatedly switching editors per task (had one for Python, another for LaTeX, etc). I finally one day said "I need to learn a really good editor and stick to it." I bought the Emacs book, spent a whole week reading it, and forced myself to Google a solution whenever I couldn't remember how to do something. I was surprised how quickly I became proficient in it - within a month of use.
(All without learning elisp - I was a "power" Emacs user for 8-9 years before I learned elisp properly).
And then I discovered org mode. While I've encountered people who were proficient Emacs users but left for something else, I haven't found anyone who is an org mode user who switched to something else. I know people who use other editors in general, but still use Emacs just to use org mode.
That's really cool. It sounds like your commitment to learning Emacs mirrored my commitment to LaTeX - didn't know it, but was highly motivated to learn. I love that.
FWIW you can also convert ReStructuredText (RST) to LaTeX. RST is very similar to Markdown, and it provides useful features that are missing. It also renders consistently across different viewers (presumably due to a tight specification). If you're working a lot with tables then you'll really enjoy the ease of the "list table" syntax.
I'd say RST is suitable for many types of documentation but I'm not convinced that it's suitable for conference/workshop submissions.
Not long ago I read some article here on HN that the world is still waiting for a git equivalent for documents. This seems like a good start.
Now we need a native diff viewer for structured files, where the changes are presented with attribution either side by side, or alongside (like gitk, or like gitlab diff viewer).
Then we need an editor that supports doing the gitty stuff natively, so that the non-technical writer doesn't have to worry about creating repos and committing the changes from the command line.
After a lot of trying to get Git and Word to play nice together, we ended up building a collaboration tool to bring the power of Git (branching & merging) to non-technical Word users.
Yes, it very much depends on the type of content you're editing. My take is that if you don't care about tracking formatting changes, you don't really need Microsoft Word - many other tools could do better.
However, when you're a professional copy editor of non-fiction publications, a substantial part of your work will consist of checking little details, such as making sure the titles of books, magazines, and articles are formatted correctly and so on.
Potentially this could be used to remove some content (or make it appear removed) without it being highlighted. I doubt this is an issue in practice, considering there's a full audit trail and collaborators are usually trusted, but this is good feedback and we'll see if we can improve this.
You do know that a Word document is really a ZIP file? The text content is inside an XML document that, in principle, Github would work on. All you have to do is unzip the document, store the directory in GitHub and repack it for Word to use.
Wait, wouldn't this actually be fairly simple to set up? I'll admit my Git knowledge is a little shaky, but couldn't you set up a Hook that runs when you commit a Docx/Pptx/etc. file to unzip it in memory first, and then another Hook when you checkout to zip it back into the original structure? I guess conflicts are the major issue... asking users to navigate the XML/binary structure stored inside could be a mess. A GUI could help, but that would invoke different issues.
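To make the unzip/rezip idea concrete, here is a minimal sketch of the two halves in Python, using only the standard library. The function names are invented for illustration and nothing here wires them into git; a hook or a clean/smudge filter would have to call them. It also does not check that Word opens the repacked file in every case, and it does nothing about merge conflicts inside the XML.

```python
import zipfile
from pathlib import Path


def unpack_docx(docx_path, out_dir):
    """Extract every member of the .docx ZIP container into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(str(out))
    # Return sorted relative paths so the listing is stable between runs.
    return sorted(p.relative_to(out).as_posix() for p in out.rglob("*") if p.is_file())


def pack_docx(src_dir, docx_path):
    """Zip a previously unpacked directory back into a single .docx file."""
    src = Path(src_dir)
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to the root, with forward slashes.
                archive.write(str(path), path.relative_to(src).as_posix())
```

The unpacked directory is what you would commit; `pack_docx` rebuilds a file for Word after checkout.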
Before Word integrated its own improved version tracking in more modern versions, during my undergrad I participated in a research project to add version tracking to Word documents by abusing its zip file format[1]. My research partner created a plugin to manage the versions, and my main contribution was a Java tool that attempted version merging[2].
It wasn't fleshed out or usable, but it was an interesting project. I was impressed at how open the Word/Office format was, this was before Microsoft's reemergence into openness and open source.
Ages ago I wrote a little Word VBA that exported a plaintext copy to go along with the .doc every time I hit save. Worked quite well for eyeballing the changes in a diff. Obviously you don't get merge support for .doc but since that was still running on SVN where workflows tend to be less merge-heavy (or was it still CVS? I feel old..) and I was working solo anyways the human-readable diff worked well enough.
I'm using Fossil for my book. My book is about business systems simplicity so it's a great fit. If I hadn't started using sqlite for a project, I would have never even heard of Fossil. What a great, beautifully simple combination. Don't add complexity unless the complexity is worth the dysfunction it addresses.
It's currently badly broken (see the issues, someone points out what needs fixing), but I have a tool that uses Word's built-in track changes functionality as a git diff backend: https://github.com/Gaelan/WordGit
Wow nice. I'm a big fan of git's `--word-diff` option for text edits. The output is almost as good as `latexdiff` and so much faster.
Another useful trick is to pipe the ANSI-colored terminal output through `aha` (https://github.com/theZiz/aha or `brew install aha`) which produces HTML output, e.g.
git diff --color-words | aha > ~/Desktop/mydiff.html
You can then send the file mydiff.html to collaborators by email or add to CI build script.
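If you don't have git or `aha` handy, the same kind of word-level HTML diff can be roughly approximated with Python's standard library. To be clear, this is not what `git --word-diff` or `aha` do internally; it is a standalone sketch with a made-up function name, splitting on whitespace and marking changes with `<del>`/`<ins>` tags.

```python
import difflib
import html


def word_diff_html(old, new):
    """Return an HTML fragment marking removed and added words."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(html.escape(" ".join(a[i1:i2])))
        if tag in ("delete", "replace"):
            # Words present only in the old text.
            parts.append("<del>" + html.escape(" ".join(a[i1:i2])) + "</del>")
        if tag in ("insert", "replace"):
            # Words present only in the new text.
            parts.append("<ins>" + html.escape(" ".join(b[j1:j2])) + "</ins>")
    return " ".join(parts)
```

Note that whitespace and line breaks are collapsed, so this only suits prose, not layout-sensitive text.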
Back in the bad old days of version control (thinking of VSS here), I was overall pretty satisfied with how the check-in/check-out mechanics worked for Word docs and the like. In this case you have the benefit of the sequential workflow, in fact enforced or hinted by the tool itself, while also getting rid of the recurrent weakness of email-based document storage. There were plenty of other things to dislike about VSS (like, pretty much the rest of them) but it wasn't so bad for maintaining documents.
I work with telco standards and the organizations that I follow use Word documents. The way we keep a paper trail of all the changes to a new standard’s draft is by separating the change proposals into their own documents (using change marks against the latest agreed draft) and only allowing a named editor to actually implement the agreed change proposals back to the master document. The change proposal documents, together with the meeting minutes create a perfect history of who proposed what changes and when.
Has anybody used SimulDocs[0], which sells itself as a "version control for Microsoft Word documents"? I've been really curious if it's a decent solution in this space, but I tend to keep myself away from Word docs in my life recently.
.docx is just an archive format. If I remember correctly the contents inside the .docx archive are plain text. Can’t we just use version control inside of there? We would have to of course figure out a way to have git unpack and pack the archive each time.
Apart from attachments and metadata the actual document is some kind of xml monstrosity that contains the text and the markup. It’s not very useful to just create diffs from that, it looks a bit like the HTML created by FrontPage if you remember that.
You can just rename a docx file to .zip, unpack it and peek around.
Not really, as there isn't a linearity or markup feel to the XML. Outside of straight text changes, formatting, rearranging, and internal markups, are not possible to 'visually' diff in the XML.
It is a zip with a collection of XML files. A diff on the as-is XML from Word doesn't work; there are a lot of false positives. Things look the same from a user's perspective, but internally they are different. You would have to interpret/render the content to really tell if it is different.
There is also plain tracking noise from Word itself.
However, a diff on Word XML is a perfect tool for understanding how Microsoft interprets the spec.
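A small illustration of the false-positive problem, and of the "interpret the content first" workaround, sketched in Python. It assumes the main text lives in `word/document.xml` and only looks at `w:p` paragraphs and `w:t` text runs, so formatting, tables layout, tabs, breaks, and tracked deletions are all ignored; the function names are invented for this example.

```python
import difflib
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs(document_xml):
    """Return the visible text of each paragraph, ignoring how runs are split."""
    root = ET.fromstring(document_xml)
    return ["".join(t.text or "" for t in p.iter(W + "t")) for p in root.iter(W + "p")]


def docx_paragraphs(docx_path):
    """Read word/document.xml out of a .docx and extract its paragraph text."""
    with zipfile.ZipFile(docx_path) as archive:
        return paragraphs(archive.read("word/document.xml"))


def text_diff(old_xml, new_xml):
    """Unified diff of paragraph text only; empty when only the XML structure changed."""
    return list(difflib.unified_diff(paragraphs(old_xml), paragraphs(new_xml), lineterm=""))
```

Two documents whose XML differs only in how a sentence is split into runs produce an empty `text_diff`, while a raw diff of the XML would flag them as changed.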
At first, I'm like... but it's just a zipped archive of XML and other content files which can be used with git successfully, but yeah there's a mess in there. It's not really meant to be human readable.
yes! i have a collection of tweets where fiction/non-fiction writers joke about naming their versions different things. i'm like, use git?
i wrote a novella using a folder system + text editor + git. i'm trying to put that into a web app. don't know how useful it would be for other people though. and don't know if it will ever be finished because i need to write.
I wrote a CLI tool in Python for this exact use case and am currently using it to write a novel. Basically the CLI tool solves a lot of the tedious issues that come up (e.g., combining all of the text files, reordering them, etc.)
If you like writing out of a text editor (I use Atom) it's super useful.