The Parser that Cracked the MediaWiki Code

neilk · on May 6, 2011

This isn't the first alternative parser for MediaWiki content -- there are 28 rows in this table. (I just added Sweble's and my own project...)

http://www.mediawiki.org/wiki/Alternative_parsers#Known_impl...

Most of these are special purpose hacks. Kiwi and Sweble are the most serious projects I'm aware of, that have tried to generate a full parse.

However, few of these projects are useful for upgrading Wikipedia itself. Even the general parsers like Sweble are effectively special-purpose, since we have a lot of PHP that hooks into the parser and warps its behaviour in "interesting" ways. The average parser geek usually wants to write to a cleaner spec in, well, any language other than PHP. ;)

Currently the Wikimedia Foundation is just starting a MediaWiki.next project. Parsing is just one of the things we are going to change in major ways -- fixing this will make it much easier to do WYSIWYG editing or to publish content in ways that aren't just HTML pages.

(Obviously we will be looking at Sweble carefully.)

If this sounds like a fun project to you, please get in touch! Or check out the "Future" portal on MediaWiki.org.

http://www.mediawiki.org/wiki/Future

knowtheory · on May 6, 2011

Hey Neilk!

Did you ever turn up anything regarding this? http://news.ycombinator.com/item?id=2216249

btw, neat js parser, i'll have to check it out. :)

neilk · on May 6, 2011

FYI the JS parser is broken for some cases, but it works great for most things you want from message strings.

As for your original question, I don't think there is a forum that tries to unite the left-brained and right-brained wikipedians. There is a bit of a divide. I'll send an email right now to someone who might know better.

We don't have contests per se to try to steer the community, other than I guess GSoC, or reaching out to developers that we think are already doing good things.

sigil · on May 6, 2011

It's great to see people tackling this problem, but I wouldn't declare victory for sweble just yet ("The Parser That Cracked..."). There are other promising MediaWiki parser efforts out there.

For one, sweble is a Java parser, and I'm not sure this makes it a good drop-in replacement for the current MediaWiki PHP code. The DBPedia Project also has what looks like a decent AST-based Java parser [1]. I would be interested in a comparison between sweble and DBPedia's WikiParser.

I stumbled across a very nice MediaWiki scanner and parser in C a while ago [2]. It uses ragel [3] for the scanner; the parser is not a completely generic AST builder, but is rather specific to the problem of converting MediaWiki markup to some other wiki markup. It does do quite a bit of the parser work already though.

Presumably a PHP extension around a C or C++ scanner/parser could someday replace the current MediaWiki parsing code.

[1] http://wiki.dbpedia.org/DeveloperDocumentation/WikiParser?v=...

[2] http://git.wincent.com/wikitext.git

[3] http://www.complang.org/ragel/

ZoFreX · on May 6, 2011

Given the complexity of Wikipedia's deployment compared to a typical MediaWiki installation, it really wouldn't be much effort to hook into a parser in say, Java rather than PHP, and would be well worth doing if it had significant benefits.

Of course, a PHP parser would still have to be maintained in parallel as not everyone would be able to do the Java option.

sigil · on May 6, 2011

> Given the complexity of Wikipedia's deployment compared to a typical MediaWiki installation, it really wouldn't be much effort to hook into a parser in say, Java rather than PHP...

No doubt the incremental complexity for Wikipedia would be small in relative terms. I assume that argument would support a variety of proposals.

A solid scanner and parser in C/C++ would benefit a broader audience though. All the major scripting languages can be extended in C/C++. In fact, the ragel-based parser I mentioned earlier [1] was built to be used from within Ruby code.

[1] http://git.wincent.com/wikitext.git

bjonathan · on May 6, 2011

Site down, here is a mirror: https://www.readability.com/articles/r9i55x6e

cache version: http://webcache.googleusercontent.com/search?q=cache:8xjwEj-...

rwolf · on May 6, 2011

Your readability link redirects me to readabilities home page.

VMG · on May 6, 2011

dito (because upvote doesn't suffice anymore)

bjonathan · on May 6, 2011

Mirror of the mirror: http://dl.dropbox.com/u/2577298/The%20Parser%20that%20Cracke...

ropers · on May 6, 2011

Or try Coral Cache:

http://dirkriehle.com.nyud.net/2011/05/01/the-parser-that-cr...

"It worked for me." ;-)

car · on May 6, 2011

http://www.sweble.org is the actual Wikitext parser project homepage. Please go there until dirkriehle.com is back up.

sunir · on May 6, 2011

This is a breakthrough and a welcome one. From a end user point of view, it has a couple major implications.

First, I believe this reveals the complexity of the parser, which implies a complex syntax, which implies a complex user interface as felt by end users. A more complex the user interface may make it harder it is to attract new editors, although it's unclear (to me) if that is a fact.

Second, having an AST representation is awesome. It makes it possible to even think about building a path towards WYSIWYG or some other form of rich text editing. It was not really possible to build a WYSIWYG editor around the wiki syntax.

If you have an AST, you can also store the page as the AST since you can regenerate the wiki syntax from the AST for people who need text-based editors.

tokenadult · on May 6, 2011

A more complex the user interface may make it harder it is to attract new editors

There may be friction against gaining new editors from the user interface of the MediaWiki software, but I think the greatest barrier to participation by new editors is the hostile, drama-filled environment on many controversial topics on Wikipedia. My evidence for that is the decline in "unsustainable fashion"

http://strategy.wikimedia.org/wiki/Story_of_Wikimedia_Editor...

in the number of Wikipedian administrators, who presumably for the most part are people who know how to use Wikimedia software. Too many of best contributors (people who look up facts in reliable sources and edit articles for better readability) on Wikipedia feel attacked and that their time is wasted. I know a lot of dedicated hobbyists who quietly work on their hobby-related subjects putting together great articles, but on any subject that is controversial, and for which looking up reliable sources takes some effort, Wikipedia is becoming a war zone and is not improving in quality.

http://strategy.wikimedia.org/wiki/Strategic_Plan/Movement_P...

_delirium · on May 7, 2011

First, I believe this reveals the complexity of the parser, which implies a complex syntax, which implies a complex user interface as felt by end users.

That's the case to some extent, but the opposite is also the case to some extent. Some of the difficulty of parsing is because "ease of human use" has been a much higher priority than "ease of parsing" when discussing syntax, which leads to some constructs that aren't easy to parse with typical CFG-type parsing approaches. It's also designed to be very lenient to ordering and common errors, much like a modern non-strict HTML parser, which makes hand-writing the syntax more friendly and forgiving, but with a tradeoff that the parser has to be more complex, because it doesn't have the luxury of just returning a parse error.

mdaniel · on May 6, 2011

From reading the article, and especially the interesting comments thereon, it seems this problem is half a bogus "language" specification and half that the unwashed masses are inputting any damn thing they like and Wikipedia accepts it.

I suppose this is one of the knobs that must be tuned to balance between reproducible I/O and turning away meaningful contributions from the community.

Semiapies · on May 6, 2011

I hadn't realized that there were any parsing issues around MediaWiki's markup. 5000 lines of PHP? Eek.

sigil · on May 6, 2011

It's worse. The MediaWiki PHP code doesn't implement a proper scanner and parser, it's a bunch of regexes around which the code has grown more or less organically. Silent compensation for mismatched starting and ending tokens abounds, and causes problems for all consumers of the markup, in the same way that lenient HTML parsers have. The difference is that Wikipedia, as the sole channel for editing markup, could have easily rejected syntax errors with helpful messages instead of silently compensating.

If it was anything else, I'd say "who cares," but this is "the world's knowledge" -- we absolutely should care about the format it's stored in. I'm glad to see people tackling this problem.

VMG · on May 6, 2011

Interestingly markdown has the same problem.

Another example of the imperfect but working implementation winning.

seanp2k · on May 6, 2011

Markdown...ugh. Let's just stick to DokuWiki or Mediawiki syntax for everything, please. If you need something more advanced than that, you should be using LaTeX. Actually, it'd be cool to build a working MediaWiki + Markdown => LaTeX converter....in something like Python.

VMG · on May 6, 2011

there is markdown2pdf written Haskell, which seems to have XeTeX as an intermediary step.

Personally I'd be happy to see any markup language becoming the default, regardless which one it is. Having a proper grammar would be a bonus.

gbog · on May 7, 2011

"Personally I'd be happy to see any markup language becoming the default."

I don't agree with this. All lightweight humane markup languages are not born equal, some are better others, and Mediawiki's is not in the list of the best ones. Now there seem to be a trend towards Markdown, but it should be improved and then, migrating Wikipedia to this Markdown2 could be a real good thing.

sigil · on May 6, 2011

> it'd be cool to build a working MediaWiki + Markdown => LaTeX converter....in something like Python.

For parsing MediaWiki in Python, check out mwlib [1], which was part of cooperation between Wikimedia Foundation and PediaPress. It's neither very complete nor very fast, but you might be able to hack up some LaTeX conversion with it.

[1] http://code.pediapress.com/wiki/wiki/mwlib

lloeki · on May 7, 2011

There are actually a bunch of Markdown parsers that create a DOM, or at least a proper tree (e.g markdown2 in PyPI). Markdown is outrageously simple compared to MediaWiki.

I actually wrote a subset-of-MediaWiki parser in C#, which stayed at the "subset" stage because of the ridiculous complexity and corner-cases that crop up even very early.

cpeterso · on May 7, 2011

Regarding the storage and access of Wikipedia's "world knowledge", DBpedia is a project that scrapes Wikipedia (InfoBoxes and categories) to create a structured, semantic database of knowledge.

https://secure.wikimedia.org/wikipedia/en/wiki/DBpedia

pornel · on May 6, 2011

AST of an example page is the interesting bit:

http://sweble.org/crystalball/result?query=ASDF&format=t...

car · on May 6, 2011

Site is down due to harddisk problems, but the actually referenced Sweble Wikipedia Parser project site is at http://www.sweble.org.

brianjolney · on May 6, 2011

link died. any mirrors?

pornel · on May 6, 2011

Googlecache: http://webcache.googleusercontent.com/search?client=opera...

driehle · on May 6, 2011

It's back up at http://dirkriehle.com - the project site is actually http://sweble.org where under Crystalball Demo you can play with the parser without having to install anything.

seanp2k · on May 6, 2011

I think we killed this poor site.