Abstract Wikipedia (wikimedia.org)
798 points by infodocket on July 2, 2020 | 227 comments


Example notation for the project, called AbstractText:

————

Input 1:

Subclassification(Wikipedia, Encyclopedia)

Result 1:

English: Wikipedias are encyclopedias.

German: Wikipedien sind Enzyklopädien.

————

Input 2:

  Article(
    content: [
      Instantiation(
        instance: San Francisco (Q62),
        class: Object_with_modifier_and_of(
          object: center,
          modifier: And_modifier(
            conjuncts: [cultural, commercial, financial]
          ),
          of: Northern California (Q1066807)
        )
      ),
      Ranking(
        subject: San Francisco (Q62),
        rank: 4,
        object: city (Q515),
        by: population (Q1613416),
        local_constraint: California (Q99),
        after: [Los Angeles (Q65), San Diego (Q16552), San Jose (Q16553)]
      )
    ]
  )
Result 2:

English: San Francisco is the cultural, commercial, and financial center of Northern California. It is the fourth-most populous city in California, after Los Angeles, San Diego and San Jose.

German: San Francisco ist das kulturelle, kommerzielle und finanzielle Zentrum Nordkaliforniens. Es ist, nach Los Angeles, San Diego und San Jose, die viertgrößte Stadt in Kalifornien.

————

I didn’t quite understand what the proposal was until I saw these examples from https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Examples


This example is quite childlike.

It's the kind of thing that bright children learning about computers in the '70s thought would work.

50 years later it still hasn't been solved, because it doesn't work that way.

Is there a real example not using proper nouns?

A city changes, and a population changes depending on the country, the language, and the time. Town X with population Y might be considered village X with population Z, because in some countries the population figure includes the rural parts; the population of San Francisco might come out differently under another country's conventions.

The rabbit hole goes on forever, and more importantly it's been tried constantly for over 50 years.

Contrast that with machine translation, which is amazing compared to 50 years ago and still getting better, and which you could see being integrated more deeply with Wikipedia (it's already used there); yet it's tossed out in the white paper for no good reason I can see. There are also lots of approaches, like Duolingo-style methods, that you could look at.


English and German have very similar vocabulary and syntactic structure, so this example is not very illuminating. Comparing it to Chinese, Turkish or Javanese would probably be better.


This maps rather nicely to something like Grammatical Framework [0]. I wonder whether they'll adopt an existing project for translation; getting things into this graph form seems like the hard part, honestly.

As far as the comparison goes, it should be easy enough to map the trees from the abstract form into language-specific trees. Were you hoping to understand the current limitations? Maybe get a benchmark of the state of things that updates automatically as the project continues?

[0]: https://www.grammaticalframework.org/


>English and German have very similar vocabulary and syntactic structure

Hm, the sentences are structured in a parallel way, but is that really proper German? I don't remember anything from high school German class, but people make jokes about putting the verb way at the end. Or is that an obsolete style?


It's actually great German. Syntactically sophisticated. I am surprised by the use of the subclause (not sure what the proper name for this is) which puts the three larger cities in the middle of the last sentence.

It would have been possible to place the three larger cities at the end of the sentence similar to the English example. This would have sounded a bit more bot-like, and was somehow what I expected.

So seeing this particular German example is actually quite a good example showing the power of this approach.


I think the syntax needs to be derived, not designed. This one here is just English.


And worthless for most of us English speakers...


I wonder what happens in more literal languages, where "center" doesn't mean "main area".


Yes, this. Well, in this case, the solution is obvious: you need to have two separate concepts for center. But…

When I first learned about the OmegaWiki project (called WiktionaryZ then, I think), I was thrilled. It tried to represent lexical (Wiktionary) definitions and other language concepts using data. For each sense of each word, a so called DefinedMeaning was created. In the same sense, Wikidata has its entities. But soon, I learned about a problematic aspect of OmegaWiki’s concept, and the same thing appears on Wikidata: You represent some set of concepts in a single language, then another language comes and needs to split some concepts in two, because your language uses one word for both, but the other differentiates between them. Then, a third language comes and it maps its concepts to your existing set still a bit differently, so you might get four entities for just three languages. Etc.

On Wikidata, more focus is, I guess, on “concrete” entities: people, places, etc., where this does not appear that often. But it contains the abstract entities as well, and the problem appears there all the time. You might try to “fix” the problematic entities by splitting them into more elementary ones, linked using “subclass of” etc.; in some cases it might work quite fine (but you lose the interwiki links in the process, which is unfortunate, given those were the original use case of Wikidata), in others it is basically impossible to correctly distinguish and represent their relations without a degree in philosophy and a deep understanding of ten languages. And imagine somebody trying to _use_ those entities. Like “I would like to say this person was a writer”, but there are seventeen entities with the English label of “writer”, distinguished by some obscure difference used by a group of Sino-Tibetan languages.

And… Wikidata entities represent basically just nouns.

So… I am a bit sceptical.


I believe that Wikidata's Lexeme system is trying to fix that, is it not?


In Input 2 "center" is a keyword, because the markup is using English for keywords. The example output just happens to be in English as well. I assume it will be mapped to a more appropriate word in another language.


But the word/concept 'Center' does not appear anywhere in the input data, as far as I can see? It just lists a number of things for which SF ranks highly, and whether that means you call it a 'center' is up to the template writer - unless I'm misreading.


Yes it does. Line 6: "object: center"


Well... I retract my previous comment then. Thanks for pointing it out.

(I blame viewing it on mobile.)


Also the grammars of English and German are pretty similar. How well would it scan in other languages? Perhaps “well enough” is sufficient.


The key idea is that if the semantic description is abstracted enough, a grammar engine can convert the ideas encoded in it into the right structure for the language.

Not all languages have "X is Y" constructs, but all known human languages have some structure to declare that object X has property Y. Capture the idea "Object X has property Y" in your semantic language, and a grammar engine can wire that down to your target language.

The largest risk is that the resulting text will be dry as hell, not that it's an impossible task.
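To make that concrete, here is a toy sketch of the Subclassification example from the top of the thread. The renderer functions, their signatures, and the idea of passing plural forms explicitly are all invented for illustration; this is not the project's actual notation or API. The point is only that the language-specific knowledge lives in the renderer, not in the abstract content.

  # Toy sketch: one abstract "constructor", one renderer per language.
  # Names and signatures are invented for illustration only.
  from dataclasses import dataclass

  @dataclass
  class Subclassification:
      subclass: str     # e.g. "Wikipedia"
      superclass: str   # e.g. "encyclopedia"

  def render_en(c: Subclassification) -> str:
      # English plural here is a naive "+s"; a real engine needs lexical data.
      return f"{c.subclass}s are {c.superclass}s."

  def render_de(c: Subclassification, plural_sub: str, plural_super: str) -> str:
      # German plurals are irregular and nouns are capitalized, so the
      # renderer has to be fed lexical data rather than a simple "+s" rule.
      return f"{plural_sub} sind {plural_super}."

  fact = Subclassification(subclass="Wikipedia", superclass="encyclopedia")
  print(render_en(fact))                                 # Wikipedias are encyclopedias.
  print(render_de(fact, "Wikipedien", "Enzyklopädien"))  # Wikipedien sind Enzyklopädien.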


Being dry doesn't diminish the value of the text, though. Very exciting.

I'd also be worried about ambiguity; humans can (sometimes) detect when what they say may be parsed the wrong way in context. I wonder if there will be a way to flag results that don't properly convey the data. How would that be integrated into the generator? (There's probably an answer in the literature.)

Lots of fun questions to explore.


The main problem is that language X has an implicit definition of Foo, which is similar but not identical to language Y's definition of Bar. This might work when the languages share common ancestry, like German and English, where Foo and Bar are both descended from Baz and have similar meanings, but it will not work when you try to translate to language Z, whose speakers have a different word Foobar whose meaning encompasses Baz and Qux but excludes Xyzzy, with a completely different connotation.


Try Finnish or Hungarian, for example.


Finnish would likely work, though it would require very extensive rules on declensions. Some compound-word and list rules are also fun... Finnish is rather liberal in word order, but that's a simple fix.

What is hard is that the conjuncts do not have unique identifiers in the example. That is an essential thing to have, as there are plenty of synonyms and meanings might change. The same applies to "center".


Hopefully some word sense index is applied (or implied).


This is one big hurdle, I think. If one has to refer to the English meaning of words for the whole project to work, then how is this different from just writing the whole thing in English and translating everything from that?


What a horribly myopic way to organize information. They seem to have unthinkingly copied from vernacular English various loosely defined concepts like "city". What do they mean by San Francisco? The City and County of San Francisco? What about Los Angeles? Is that the entire LA metro or just LA County? Is Santa Monica a part of Los Angeles or a separate settlement? How are the concepts of "city", "metro", and "town" going to translate into "市", "Burg", and "Grad"?


This is getting very close to the Universal Language that Umberto Eco describes in his book The Search for the Perfect Language. I wonder what he would think about this if he were alive today...


The syntax looks well optimized for human editing.

The example seems like it would be machine generated though.

I hope the syntax learns from SQL, and allows for easy generation by either man or machine, preferably a little of both.


That's the beauty here. It's not the syntax. It's just a syntax to express the abstract thing. Saying this syntax is an issue is like saying "I don't like binary trees because their syntax is so weird". One particular syntax may be weird, but the syntax is only specific to one specific representation. Everybody will be free to choose any representation they like, as long as it can somewhat automatically be translated back into the abstract thing that this project is aiming to produce and maintain.


The way I'd do it, would be to store an intermediate representation, and have multiple front-ends with different syntaxes. Have the editable text be generated from the IR.

This would be a huge plus, as it would not require the editor to know English keywords. Most keywords could be translated into the contributor's native language, lowering the barrier for editing.

It would also allow the syntax to be changed over time, or provide multiple different syntax paradigms, a bit like wikipedia's code vs visual editors.

Of course, comments are an issue, but hopefully, this is as close to "self-commenting" code as it gets.


How does it go beyond the headline and general info?


For reference, this is from the same developer [1] who created Semantic MediaWiki [2] and led the development of Wikidata [3]. Here's a link to the white paper [4] describing Abstract Wikipedia (and Wikilambda). Considering the success of Wikidata, I'm hopeful this effort succeeds, but it is pretty ambitious.

[1] https://meta.wikimedia.org/wiki/User:Denny

[2] https://en.wikipedia.org/wiki/Semantic_MediaWiki

[3] https://en.wikipedia.org/wiki/Wikidata

[4] https://arxiv.org/abs/2004.04733



Damn. Big kudos to Denny.

And to all the other people doing awesome work but not on the top of HN.


Considering the close relationship between Google and Wikimedia https://en.wikipedia.org/wiki/Google_and_Wikipedia and the considerable money Google gives them, how can one not see this project as "crowdsourcing better training datasets for Google"?

Can the data be licensed as GPL-3 or similar?


That's an incredibly zero-sum way of looking at the world.

Almost every research group and company doing NLP work uses Wikipedia. I'd say it is a fantastic donation by Google, one which improves science generally.

> Can the data be licensed as GPL-3 or similar?

It's under CC BY-SA and (with a few exceptions) the GNU Free Documentation License.


I don't think the relationship is that close - all it says is Google donated a chunk of money in 2010 and in 2019. It was a large chunk of money (~3% of donations), but not so much as to create a dependency.

> Can the data be licensed as GPL-3 or similar?

Pretty unlikely, tbh. I don't know if anything has been decided for licensing, but if it is to be a "copyleft" license it would be CC BY-SA (like Wikipedia), since this is not a program.

Keep in mind that in the United States, an abstract list of facts cannot be copyrighted, AFAIK (I don't think this qualifies as that; Wikidata might, though).


How so? Wikimedia-provided data can be used by anyone. Google could have kept using and building on their Freebase dataset had they wanted to - other actors in the industry don't have it nearly as easy.


Denny seems to be leaving Google and joining Wikimedia Foundation to lead the project this month, so probably you do not need to worry too much about Denny's affiliation with Google.


As a long-time Wikipedian, this track record is actually worrisome.

Semantic Mediawiki (which I attempted to use at one point) is difficult to work with and far too complicated and abstract for the average Wiki editor. (See also Tim Berners-Lee and the failure of Semantic Web.)

WikiData is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.


> WikiData is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.

Note that the internal data format used by Wikidata is _not_ RDF triples [0], and it's also highly non-relational, since every statement can be annotated by a set of property-value pairs; the full data set is available as a JSON dump. The RDF export (there are actually two; I'm referring to the full dump here) maps this to RDF by reifying statements as RDF nodes; if you wanted to end up with something queryable by SQL, you would also need to resort to reification – but then SPARQL is still the better choice of query language since it allows you to easily do path queries, whereas WITH RECURSIVE at the very least makes your SQL queries quite clunky.

[0] https://www.mediawiki.org/wiki/Wikibase/DataModel

[1] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Fo...
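As a concrete illustration of what "path queries" buys you, here is a rough sketch that asks the public Wikidata endpoint for transitive subclasses of "city". P279 ("subclass of") and Q515 ("city") are real Wikidata IDs and query.wikidata.org/sparql is the real endpoint; everything else is just illustrative, and the equivalent SQL would need WITH RECURSIVE.

  # Sketch: a transitive path query via the public Wikidata SPARQL endpoint.
  # wdt:P279+ means "one or more 'subclass of' hops" - the part that would
  # require a recursive CTE in SQL.
  import requests

  QUERY = """
  SELECT ?cls ?clsLabel WHERE {
    ?cls wdt:P279+ wd:Q515 .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 20
  """

  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "abstract-wikipedia-example/0.1"},  # WDQS asks for a UA
  )
  for row in resp.json()["results"]["bindings"]:
      print(row["clsLabel"]["value"])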


The SPARQL API is no fun. Being limited to 60s per query, for example, is death. I had to resort to getting the full dump.


How do you dump general purpose, encyclopedic data into a relational database? What database schema would you use? The whole point of "triples" as a data format is that they're extremely general and extensible.


Most structured data in Wikipedia articles is in either infoboxes or tables, which can easily be represented as tabular data.

  Table country:

  Name,Capital,Population
  Aland,Foo,100
  Bland,Bar,200
Now you need a graph for representing connections between pages, but as long as the format is consistent (as it is in templates/infoboxes) that can be done with foreign keys, as in the sketch after these tables.

  Table capital
  ID,Name
  123,Foo
  456,Bar

  Table country
  Name,Capital_id,Population
  Aland,123,100
  Bland,456,200
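A minimal sqlite3 sketch of the same schema (the table and column names are just the ones from this comment, not anything Wikipedia or Wikidata actually uses):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
  CREATE TABLE capital (id INTEGER PRIMARY KEY, name TEXT);
  CREATE TABLE country (name TEXT, capital_id INTEGER REFERENCES capital(id),
                        population INTEGER);
  INSERT INTO capital VALUES (123, 'Foo'), (456, 'Bar');
  INSERT INTO country VALUES ('Aland', 123, 100), ('Bland', 456, 200);
  """)

  # The foreign key is the cross-page link: each country row points at a capital row.
  for row in db.execute("""
      SELECT country.name, capital.name, country.population
      FROM country JOIN capital ON country.capital_id = capital.id
  """):
      print(row)   # ('Aland', 'Foo', 100), then ('Bland', 'Bar', 200)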


> Most structured data in Wikipedia articles is in either infoboxes or tables

Most of the data in Wikidata does not end up in either infoboxes or tables in some Wikipedia, however, and graph-like data such as family trees works quite poorly in a relational database, even if you don't consider qualifiers at all.


Those infoboxes get edited all the time to add new data, change data formats, etc. With a relational db, every single such edit would be a schema change. And you would have to somehow keep old schemas around for the wiki history. A triple-based format is a lot more general than that.


RDF shouldn't be lumped in with SPARQL


That’s the same technology stack. SPARQL is used to query RDF graphs; that’s pretty tightly coupled.


People might be interested to know that semantic web ideas have been more successful in some niches than others. Computational biology, for example, makes extensive use of "ontologies", which are domain-specific DAGs that do exactly what Abstract Wikipedia is attempting. Much of the analysis of organisms' genomes and related sequences relies on these ontologies to automatically annotate the results so that meaningful relationships can be discovered.

There are of course HUGE issues with the ontologies. They are not sexy projects, so they are often underfunded and under-resourced - even though the entirety of bioinformatics uses them! The ontologies are incomplete and sometimes their information is years behind the current research.

For the curious, the Gene Ontology is the golden child of biology ontologies. See here: http://geneontology.org/


Amazingly fascinating field. I have learned a (very) small amount about this from Dr. David Sinclair's book Lifespan


Semantic Web[1] reborn (after alleged[2] death)? Also I wonder how helpful Prolog infrastructure could be since they provided some useful frameworks [3][4] for that.

[1] https://www.w3.org/standards/semanticweb/

[2] https://twobithistory.org/2018/05/27/semantic-web.html

[3] https://www.swi-prolog.org/web/

[4] https://www.swi-prolog.org/pldoc/doc_for?object=section(%27p...


We actually looked into the SWI-Prolog semantic web package for corporate work! We ended up finding RDFox ( https://www.oxfordsemantic.tech/ ), which is the bleeding edge in research on inference databases and linked data. Unfortunately COVID changed the plans, but we were really, really impressed with the capabilities.

Semantic web is used broadly: the Google structured data you see for reviews and infoboxes, Wikidata. Data is broadly available, even if jobs in semantic technologies are not.

We're familiar with common databases like key-value stores, OLAP, OLTP, but reasoning technology offers unique properties many people aren't aware of. For example, you can have your business logic integrated with your database in a way that's much more flexible than stored procedures. You express your business rules as logic programs; they automatically run multi-core; they run as soon as data is inserted into the database, with no function call needed; the data does not need to be aware of what logic is in the database; logical rules are applied incrementally, so that adding new data or new rules does not trigger re-computation of all the data; business rules can use data produced by other business rules; and finally you can use the explain command to get a mathematical proof of why an outcome happened.

Reasoning technology may be old, but recently this idea of automatically stating things in a declarative form and having the application reconcile the differences has been the differentiating factor for the most popular software out there: Kubernetes, Terraform, Ansible, React, GraphQL, Flutter. Without their declarative reasoning capabilities, these tools might not be considered some of the best.

Think PostgreSQL 12 generated columns, except infinitely chainable, recursive, and connectable to other tables. Think pre-computed materialized views, but automatically updated as new data is inserted (no refresh needed).
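To illustrate the "rules fire as data is inserted" idea, here is a toy forward-chaining store in Python. It is only a naive sketch; real engines such as RDFox use far more sophisticated incremental algorithms, and none of the names below come from any actual product.

  # Toy incremental rule engine: inserting a fact fires rules, and derived
  # facts can trigger further rules (chainable), with no explicit call by
  # the code that inserts the data.
  facts = set()
  rules = []   # each rule: fact -> iterable of newly derived facts

  def insert(fact):
      if fact in facts:
          return
      facts.add(fact)
      for rule in rules:            # applied incrementally, only to the new fact
          for derived in rule(fact):
              insert(derived)

  def add_rule(rule):
      rules.append(rule)
      for fact in list(facts):      # new rules also apply to existing data
          for derived in rule(fact):
              insert(derived)

  # "Business rule": anything located in California is located in the USA.
  def california_is_in_usa(fact):
      subject, predicate, obj = fact
      if predicate == "located_in" and obj == "California":
          yield (subject, "located_in", "USA")

  add_rule(california_is_in_usa)
  insert(("San Francisco", "located_in", "California"))
  print(("San Francisco", "located_in", "USA") in facts)   # True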


OH MY GOD. Is this jimmyruska of jimmyr.com and those youtube tutorials from way way back?

I'm going to take every opportunity I get to tell you this, but you are the reason I'm where I am. I followed you since middle/primary school. Always checked up on you every now and then, and your site[0] is STILL my homepage. In fact I reached this very link from your HN tab (although the link to the HN tab leads to a webarchive).

Thank you. This is a little weird, I'm sure, but you've definitely had a tangible and very significant impact on my life!

[0] http://www.jimmyr.com


>> Also I wonder how helpful Prolog infrastructure could be since they provided some useful frameworks [3][4] for that.

That's a good point, because looking at the working paper on the proposed architecture of the project [1], the example of a "constructor" in Figure 1 is basically a set of frames, which has a straightforward translation into Prolog, and the example of a "renderer" in English is basically a pattern with holes, which also has a very straightforward Prolog implementation via Definite Clause Grammars. In fact the whole architecture reminds me a lot of IBM's Watson - the good bits (i.e. the Prolog stuff they used to store the knowledgebase).

________

[1] https://arxiv.org/pdf/2004.04733.pdf


Reminder to link to abstracts rather than directly to the PDF: https://arxiv.org/abs/2004.04733


Can you explain why this is preferred? Is it to ensure everyone knows where it comes from (journal or preprint)?


For me, at least two reasons:

1. As you say, to see more information about where the paper comes from.

2. It's easy to get from the abstract page to the PDF, but not vice versa.

Personally, I also think it's good for people to get into the habit of linking to a text description of data-heavy resources rather than directly to the resources. PDFs aren't that data-heavy, but there are plenty of other things that are, and that could do with a text landing page, and I think it's good to get in that habit.


Makes sense, thanks!


Oh, sorry. My bad and thanks for posting the link to the abstract.


The prolog renaissance is unstoppable!


The Twobithistory article is pretty good. It outlines several things I wasn't aware of, like DBpedia.


That kind of AI has been tried, 30 years ago, and it doesn't go far enough. It's really difficult to get that out of the toy domains.


A bit has changed in AI over the last 30 years. The way we use the Internet has changed as well. Perhaps if we had a better semantic network and today's algorithms, we could go further?


You could have said exactly the same thing for neural networks...


Holy shit!

> The goal of Abstract Wikipedia is to let more people share in more knowledge in more languages. Abstract Wikipedia is an extension of Wikidata. In Abstract Wikipedia, people can create and maintain Wikipedia articles in a language-independent way. A Wikipedia in a language can translate this language-independent article into its language. Code does the translation.

from https://meta.wikimedia.org/wiki/Abstract_Wikipedia

Will this mean that knowledge is encoded in machine readable format and that we can start to write programs over this knowledge graph? This is huge.


Very cool. I’m fascinated by the Wolfram Language paradigm of Knowledge Base+Programming language=Computable everything (demo: https://youtu.be/3yrVuM2SYZ8). But I could never get into the Wolfram ecosystem because it’s totally proprietary. This makes me think, does Wikidata’s model (ontologies?) provide a way to recreate the Wolfram computable everything concept as an open community project?


Yes, essentially.

Wikidata models assertions of facts about things and it has a very powerful system for writing queries to get at the facts and relationships you are interested in.

Here are some examples of what you can do with Wikidata:

https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...


The Abstract Wikipedia idea is a great advance.

> knowledge is encoded in machine readable format and that we can start to write programs over this knowledge graph

But surely that ability has been the goal of Wikidata from the start?


Perhaps, but this seems to be moving towards a more holistic machine-readable article graph. If you look at a page from Wikidata[0], it seems to be basically a key-value database (e.g. earth.highest point = [ mount everest { from sea level, 8000m } ]), while the "full article" terminology used in the announcement seems like it may be even more connected/informative/structured than that.

[0]: https://www.wikidata.org/wiki/Q2


I don't see any indication that Abstract Wikipedia articles are anything more than a sequence of "constructors" and those "constructors" are essentially just triples (with qualifiers) that a "renderer" turns into a specific human language.

The example they give is the constructor:

rank(SanFrancisco, city, 4, population, California)

And the English renderer will output:

"San Francisco is the fourth largest city by population in California."


Agreed, but my point was that the aim has always been to encode these facts and then mix them into Wikipedia for any assertion/attribute, so that any fact is backed by an assertion.


Exactly. The main difference is that they would now not be used to generate the infoboxes, but actual prose.


The data is stored as a bunch of triples, but it encodes a graph. You can use tools like SPARQL to query that graph.

https://www.stardog.com/tutorials/sparql/


I'm curious whether this new project has been driven in any way by the difficulty of integrating data from Wikidata into Wikipedia. It varies a lot by language, but the user communities are quite hostile to Wikidata in some cases. I think it's generally on the grounds that since Wikidata is a wiki, it can be easily vandalized and its data can't be trusted.


I don't really think so - the goal is to fill in the gaps in knowledge, not to replace existing Wikipedias and whatever policies they may have.


It assumes articles will say the same thing in every language, which to me means that edit wars can now proceed on a more global basis. You're no longer fighting only the people who feel comfortable enough with your language to edit in it, you're fighting anyone who can edit the article at all around the world.

Do the Hebrew Wikipedia and the Arabic Wikipedia agree on the status of Israel?


> Do the Hebrew Wikipedia and the Arabic Wikipedia agree on the status of Israel?

If they don’t agree, perhaps they can agree to disagree and we can encode that fact instead.


A lot of serious social problems would be non-issues if only people could agree to disagree.


Unless they agree to disagree about the facts in a case where one side is factually right while the other has the consistency of a lie your 3-year-old would make up to stop you from discovering they ate all the cookies.

Agreeing to disagree is like saying there is no right side in this matter, which is okay for topics where there isn't one. Many topics, however, are not a matter of perspective; they are a matter of who is factually right.

IMO in such a case agreeing to disagree can often be destructive, because it legitimises a position which is factually wrong and constructs an illusion of balance where there is none.

If one side says it rains and the other says it doesn't, agreeing to disagree is wrong. If there is disagreement about what the facts mean, then sometimes both perspectives can be true at the same time. This is, however, not as often the case as I wish it were.


"If one side says it rains and the other says it doesn't, agreeing to disagree is wrong"

No. Agreeing to disagree is accepting that a subject is disputed, with different, even contradictory, opinions.

An AI could in theory extract both versions and present them as disputed. Meaning, there are different definitions of a word and different views on the correctness of the facts.

And "raining or not" is also not as simple as you think. A person from Spain will consider a certain state of the weather as raining, which a British person might see as some humidity in the air...


Let’s take the theory of evolution for a spin. Being a theory and not a hypothesis, it is a proven scientific fact. Yet a large chunk of the population chooses to not believe it to be true. So how do we encode this knowledge?

Option 1: we call it a controversy and make it sound like because some people don’t believe in it means it might not be correct.

Option 2: we state upfront that the theory of evolution is correct but link to a “incorrect but competing viewpoint” of creationism.

Option 3: we create three articles: one about the theory of evolution, one about creationism, and one about the disagreements between the creationists and the rest of the modern world.

I like option 3 best as it is the most complete picture of the three, and all it requires is the abstract concept of controversy or disagreement. I don’t know how you encode “dumbass” in this new format but it might be a useful concept to explain to aliens if they decide to visit Earth.


>Being a theory and not a hypothesis, it is a proven scientific fact

A theory is distinct from a hypothesis, but surely it isn't itself a fact either.

Wouldn't it be better to say that the theory of evolution explains a lot of facts, which we may also call (observed) evolution? Really, aren't we just calling two related concepts "evolution"?


We use evolution every day in pharmaceuticals. We use it in our crops and in domesticated animals. If it was under question we wouldn’t call it a theory. Evolution is as real as gravity except it has even more evidence and scientific understanding, while gravity still doesn’t play nice with quantum mechanics and of course general relativity defines it as something most people cannot intuit. Evolution is a fact. I don’t see two related concepts here. Besides, consider that the second closest hypothesis that explains life on earth is that a bearded man in the sky got bored one day and created the universe, then made a man and a woman and gave them a bunch of dinosaurs to play with but instead they played with a snake, an apple, and each other’s bodies until he kicked them out of his play garden because they didn’t play by his rules, so the two of them through tremendous amounts of incest populated the earth (a fact easily disproven by a number of methods including simple genetic testing). No I don’t think it would be better to call the theory of evolution anything but a proven fact. If people want to believe in fairy tales that’s fine. But that’s not a reason to cloud scientific discovery.


I don’t know about other people, but your definition of “theory” doesn’t match mine. To me the word “theory” is almost identical to “hypothesis”, but generally a bit more comprehensive (e.g. it consists of multiple hypotheses). Calling something a theory doesn’t require any proof, nor that it be true.

There are a number of theories related to gravity. That doesn’t mean gravity isn’t a fact, it just means we don’t 100% understand how it works in all situations (eg quantum).

Similar for evolution - there are various scientific theories about the origins of species. And it is absolutely the case that we don’t know 100% how we got from Big Bang to here, thus the theories are still theories. If you just want to point to Darwinism and say fact, you’d be doing a large disservice to us all.


Gravity continued being the same thing before Newton, after Newton, and after Einstein. If we come up with a new theory of gravity, it's distinct from the phenomenon we observe. The planets are still going to go around and around, etc.


The semantic structure is the structure of the language (or the independent thing between languages). This does not automatically facilitate machine-understandable knowledge. You would need to write the code to understand it first, which is probably almost as difficult as understanding English, for example.

Only when relations are defined in a kind of Prolog style in those examples will it be usable as knowledge about things other than language.

A computer might spit out texts which it guessed are connected to some question you ask it though. Does not mean it understands the relations of things.

But when can we ever really say "a machine understands" something? So perhaps there is not much of a difference?


I'm sorry if I didn't understand. Wouldn't a JSON- or XML-type data structure (in which some Wikipedia content is already stored) support this?


While the implications are huge, it is hard to think about what is actually being done. XML and JSON are just languages, and the type of information could be stored in any number of ways.

From my point of view, the problem here is that you could say something like "'water' and 'heat' produces 'steam'", but knowledge is never that simple, and understanding that information is even more complicated.

I would think that Abstract Wikipedia is not the first attempt to solve such a problem, and I am very curious to see what they come up with.

EDIT: Looks like they have an example: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Examples


Isn't this the same as expert systems that led to the first AI winter?


Weren't expert systems the second AI winter? In any case, I see the end of the Cold War as being the driving factor behind that winter, whatever its ordinal.


Maybe, but that doesn't mean this can't be pulled off. Neural networks were dismissed as non-viable, and look at them now!


Anyone who has studied old-school AI will know that this is an incredibly ambitious project; it is essentially throwing itself at the problem of "knowledge frames", i.e. how to encode information about the world in a way that an AI system can access it and, well, be intelligent about it. (Also at the problem of natural language generation, but as hard as that is, at the moment it seems like the easier of the two.)

But...

One of the biggest problems with a lot of the old "Big AI" projects that were developing some sort of knowledge frames (and there were several, and some of them still exist and have public faces) was, who the hell is going to get all the info in there in a way that's complete enough to be useful? Now you have a learning problem on top of the knowledge representation problem. But throw the wikimedia community at it and crowdsource the information?

This actually starts to seem plausible.


It seems more similar to an elaborate version of the internationalization and translation of messages done in any program that targets multiple languages? If you think of it as a principled template language for generating text from the results of canned database queries, it starts seeming a lot more feasible. The templates themselves do need to be translated into every language, much like the messages in internationalization.

Ideally this enables something like an improved version of the ICU library, with a lot more data available.


Yes, and the paper lists only two relevant references regarding the decades of work in this domain.


Even if it's not successful, there's certainly enough interest in it to make it worth trying.

Maybe we only get 30% of the way. So what? That's 30% more than zero!


Technically, 30% more than zero is still zero though.

But yeah 30% of the way more than zero is something :-)


How is that different from Wolfram Alpha?


So do people find Wikidata that impressive? Here's what Wikidata says about Earth, an item that is number 2 in the ID list, and also on their front page as an example of incredible data.

https://www.wikidata.org/wiki/Q2

I struggle to find anything interesting on this page. It is apparently a "topic of geography", whatever that means as a statement. It has a WordLift URL. It is an instance of an inner planet.

The first perhaps verifiable, solid fact, that Earth has a diameter of "12,742 kilometre", is immediately suspect. There is no clarifying remark, not even a note, that Earth is not any uniform shape and cannot have a single value as its diameter.

This is my problem with SPARQL, and with "databases" in that sense. Data alone is useless without a context or a framework in which it can be truly understood. Facts like this can have multiple values depending on exactly what you're measuring, or what you're using the measurement for.

And this is the page for Earth, an example that is used on their front page and has the ID of 2. It is the second item ever created in Wikidata, after Q1, "Universe", and yet everything on it is useless.


I find it pretty well stuffed with appropriate information. You're looking at an ontology, not a wikipedia article, it's supposed to be dry (subject, relation, object). It's being used to disambiguate concepts and named entities, and to support machine learning models with general knowledge in a standard format. There are plenty of papers on the topic of link prediction, auto-completion and triplet mining.

Also, if you look:

> radius: 6,378.137±0.001 kilometre

> applies to part: equator

So it clearly states how the radius was measured.


> I find it pretty well stuffed with appropriate information. You're looking at an ontology, not a wikipedia article, it's supposed to be dry (subject, relation, object).

We're talking about a research project with a large amount of funding to go from the former to the latter. But pretty much none of the stuff on Earth's Wikipedia page is represented here.

> applies to part: equator

An equator (the general concept to which the ontology links) has no given orientation. Earth's Equator is a human construct distinct from an oblate spheroid's equator, as are the specific locations of the poles. Nowhere is it specified in the ontology that this is measured at a specific Equator, not just any equator.

This is all human context and understanding that we've built on top, and it's part of what I mean when I say that the data is kinda pointless. All of these facts depend on culture to understand.


Well, the linked equator (Q23528) has a geoshape which defines what it is.


I believe that in most modern human cultures the sentence "the diameter of the Earth" has a very imprecise, very informal, but very recognisable meaning. In fact, I really doubt that most people on the Earth would think of what precisely is the shape of the Earth when talking about its diameter.


Q2 is just an ID; one probably shouldn't read too much into it except that it identifies an entity. Regarding the diameter, it probably depends on how you define it. For instance, according to Wikipedia one can generalize it as sup { d(x,y) }, which seems legitimate to me, although Wikidata's referenced diameter definition (P2386) isn't that general; probably it should be updated... But to be fair, Earth (Q2) has the shape (P1419) oblate spheroid (Q3241540) under sourcing circumstances (P1480) approximation (Q27058) :-)

To me Wikidata (and similar projects like OSM) shine because they tend to have so many details.


I've worked with the Wikidata set a bit. On first glance the entries do seem to lack any useful information as it's all heavily abstracted into other items and properties - as well as containing a bunch of references and qualifiers to validate the facts.

Once you start connecting the items to other items and properties, you begin to see better information and context.

A lot of the "snaks" of items are units of measurement, so there are no worries about converting them into other languages. This project should help in generating articles in other languages based on these facts.
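If you want to poke at that structure yourself, the raw item JSON is easy to pull. A rough sketch (Special:EntityData is Wikidata's real JSON endpoint and P2120 is the "radius" property; picking the first statement is just for illustration, since an item can carry several statements per property):

  # Fetch the raw JSON for Earth (Q2) and look at one statement's main snak.
  import requests

  entity = requests.get(
      "https://www.wikidata.org/wiki/Special:EntityData/Q2.json",
      headers={"User-Agent": "abstract-wikipedia-example/0.1"},
  ).json()["entities"]["Q2"]

  print(entity["labels"]["en"]["value"])    # "Earth"

  # Each statement has a mainsnak holding the value, plus qualifiers such as
  # "applies to part: equator" stored alongside it.
  radius = entity["claims"]["P2120"][0]["mainsnak"]["datavalue"]["value"]
  print(radius["amount"], radius["unit"])   # a quantity plus a unit item URI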


I don't think it's interesting in itself so much as in its applications. I remember talking to someone once who was working on a project where you stick a probe in some soil, and then it uses Wikidata to tell you the best type of plant to grow. I have no idea what ever happened to this project, or whether it worked, but it always struck me as a great example of the enabling value of Wikidata: you can use it to power ideas totally unrelated to the original purpose the data was collected for.



Anything that gives a boost to Wikidata is great. Being able to run queries over wiki remains one of the most magical things on the internet:

https://query.wikidata.org/


I recently learned that words and translations from Wiktionary are in Wikidata's graph as well, which enables e.g. this simple lemmatizer: https://tools.wmflabs.org/ordia/text-to-lexemes (The Wikidata query it uses is linked at the bottom.)


Lexical content on Wikidata is (unfortunately) separate and independent of Wiktionary. Same topic, different execution (and much smaller coverage at the moment). https://m.wikidata.org/wiki/Wikidata:Lexicographical_data


It's magic, but the SPARQL language is very hard to learn.


It's the identifiers that make querying Wikidata difficult, IMO. SPARQL is pretty easy, certainly no more difficult than SQL. It might even be easier than SQL since there are no joins.


I found the hardest part of SPARQL is forgetting my SQL knowledge. It's a very different query language from SQL, but some constructs look similar, and it's very easy to confuse yourself into thinking a construct does the same thing as in SQL when it really doesn't.


> It might even be easier than SQL since there are no joins.

Every dot between Triple Patterns in a Basic Graph Pattern is actually a JOIN; you just don't need to worry about using them.

As for the identifiers, you get used to them if you work regularly with them, and query.wikidata.org actually has completion for identifiers if you press CTRL-Space.


Right, the joins are in the graph.


There are some libs that simplify access, for example https://github.com/molybdenum-99/reality


Thanks for sharing this, didn't know it existed.


Hi, founder of Diffbot here. We are an AI research company, a spinout from Stanford, that generates the world's largest knowledge graph by crawling the whole web. I didn't want to comment, but I see a lot of misunderstandings here about knowledge graphs, abstract representations of language, and the extent to which this project uses ML.

First of all, having a machine-readable database of knowledge(i.e. Wikidata) is no doubt a great thing. It's maintained by a large community of human curators and always growing. However, generating actually useful natural language that rivals the value you get from reading a Wikipedia page from an abstract representation is problematic.

If you look at the walkthrough for how this would work (https://github.com/google/abstracttext/blob/master/eneyj/doc...), this project does not use machine learning and instead uses CFG-like production rules to generate natural sentences. That works great for generating toy sentences like "X is a Y".

However, human languages are not programming languages. Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (Those who have taken a grammar class can relate to the number of exceptions to the ruleset.)

Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format. Plenty of text is opinion, is subjective, or describes notions that don't have a proper entity. Of course there are ways to engineer around this; however, they will exponentially grow the complexity of your ontology and the number of properties, and make for a terrible user experience for the annotators.

A much better and direct approach to the stated intention of making the knowledge accessible to more readers is to advance the state of machine translation, which would capture nuance and non-facts present in the original article. Additionally, exploring ML-based ways of NL generation from the dataset this will produce will have academic impact.


> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (...)

> Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format.

It doesn't seem like the goal of this project is to describe those languages, or to represent every sentence in a typical Wikipedia article? The goal doesn't seem to be to have all Wikipedia articles generated from Wikidata, but rather to have a couple of templates of the form "if I have this data available about this type of subject, generate this stub article about it". That would allow the smaller Wikipedia language editions to automatically generate many baseline articles that they might not currently have.

For example, the Dutch Wikipedia is one of the largest editions mainly because a large percentage of its articles were created by bots [1] that created a lot of articles on small towns ("x is a town in the municipality of y, founded in z. It is nearby m, n and o.") and obscure species of plants. This just seems like a more structured plan to apply that approach to many of the smaller Wikipedia's that may be missing a lot of basic articles and are thus not exposing many basic facts.

[1] https://en.wikipedia.org/wiki/Dutch_Wikipedia#Internet_bots


This is addressed in the white paper describing the project's architecture:

10.2 Machine translation

Another widely used approach —mostly for readers, much less for contributors— is the use of automatic translation services like Google Translate. A reader finds an article they are interested in and then asks the service to translate it into a language they understand. Google Translate currently supports about a hundred languages — about a third of the languages Wikipedia supports. Also the quality of these translations can vary widely — and almost never achieves the quality a reader expects from an encyclopedia [33, 86].

Unfortunately, the quality of the translations often correlates with the availability of content in the given language [1], which leads to a Matthew effect: languages that already have larger amounts of content also feature better results in translation. This is an inherent problem with the way Machine Translation is currently trained, using large corpora. Whereas further breakthroughs in Machine Translation are expected [43], these are hard to plan for.

In short, relying on Machine Translation may delay the achievement of the Wikipedia mission by a rather unpredictable time frame.

One advantage Abstract Wikipedia would lead to is that Machine Translation systems can use the natural language generation system available in Wikilambda to generate high-quality and high-fidelity parallel corpora for even more languages, which can be used to train Machine Translation systems which then can resolve the brittleness a symbolic system will undoubtedly encounter. So Abstract Wikipedia will increase the speed Machine Translation will become better and cover more languages in.

https://arxiv.org/abs/2004.04733

(There's more discussion of machine learning in the paper, but I'm quoting the section on machine translation in particular).


Additionally of course Google Translate is a proprietary service from Google, and Wikimedia projects can't integrate it in any way without abandoning their principles. It's left for the reader to enter pages into Google Translate themselves, and will only work as long as Google is providing the service.

What is the quality of open source translation these days?


State of the art is always open source in MT.


>> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them.

Is that really true? If natural languages have rules, then there exists a ruleset that can describe any natural language: the set of all rules in that language. Of course, a "rule" is a compact representation of a set of strings, so if natural languages don't have such rules it's difficult to see how any automated system can represent a natural language "compactly". A system without any kind of "rules" would have to store every grammatical string in a language. That must be impossible in theory and in practice.

If I may offer a personal perspective, I think that the goal of the plan is to produce better automated translations than is currently possible with machine translation between language pairs for which there are very few parallel texts. My personal perspective is that I'm Greek, and I am sad to report that basically translation from any language to Greek by e.g. Google Translate (which I use occasionally) is laughably, cringe-inducingly bad. From what I understand, the reason for that is not only the morphology of the Greek language, which is kind of a linguistic isolate (as opposed to, say, the Romance languages), but also that, because there are not many parallel texts between most languages (on Google Translate) and Greek, the translation goes through English - which results in completely distorted syntax and meaning. Any project that can improve on this sorry state of affairs (and not just for Greek - there are languages with many fewer speakers and no parallel texts at all, not even with English) is worth every second of its time.

To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.


> To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.

The training data for machine translation models is also human-created. Given some fixed amount of human hours, would you rather they be spent annotating text that can train a translation system that can be used for many things, or a system that can only be used for this project? It all depends on the yield that you get per man-hour.


As the paper I quote below says, the system that would result from this project could be re-used in many other tasks, one of which is generating data for machine translation algorithms.

I think this makes sense. The project aims to create a program, basically ("a set of functions"). There are, intuitively, more uses for a program than for a set of labelled data.


> Of course there are ways to engineer around this; however, they will exponentially grow the complexity of your ontology and the number of properties, and make for a terrible user experience for the annotators.

So, the obvious solution is to create robo-annotators, and that's what your company is supposedly trying to do?


> The project will allow volunteers to assemble the fundamentals of an article using words and entities from Wikidata. Because Wikidata uses conceptual models that are meant to be universal across languages, it should be possible to use and extend these building blocks of knowledge to create models for articles that also have universal value. Using code, volunteers will be able to translate these abstract “articles” into their own languages. If successful, this could eventually allow everyone to read about any topic in Wikidata in their own language.

This is a great idea. I bet the translations will be interesting. I was wondering how the translation was going to work, and it looks like they thought of that as well. They're going to use code to help with the translation.

> Wikilambda is a new Wikimedia project that allows to create and maintain code. This is useful in many different ways. It provides a catalog of all kind of functions that anyone can call, write, maintain, and use. It also provides code that translates the language-independent article from Abstract Wikipedia into the language of a Wikipedia. This allows everyone to read the article in their language. Wikilambda will use knowledge about words and entities from Wikidata.


Pretty-printing the abstract content into an arbitrary target language (a better way of putting it than "translation") would be quite the challenge, because "conceptual models" do vary by language. One can attempt to come up with something that's "as abstract/universal as possible" but it remains to be seen how practically useful that would be.

For that matter, making the source model "logical" and "compositional", as implied by the Wikilambda idea, only opens up further cans of worms. Linguists and cognitive scientists have explored the idea of a "logical" semantics for natural language, even drawing on the λ-calculus itself (e.g. in Montague grammar and Montague semantics), but one can be sure that a lot of complexity will be involved in trying to express realistic notions by relying on anything like that.


I didn't assume the translations would be lossless. It's obvious there will be conceptual mismatches but that's why this is interesting. Because when the abstract model is made concrete people can notice the gaps and improve the abstract model. I can imagine a feedback loop that improves both the abstract and concrete/translated models as people work on improving both to reduce the conceptual gaps between the abstract and concrete models.


I’m starting to feel the structure and content of “abstract content” is going to be quite like the Wikipedia pages in the target languages zipped into a single archive, plus overhead...


Sorry to be the typical pessimistic HN commenter (e.g., Dropbox is just ftp), but this seems ambitious enough to remind me of https://en.wikipedia.org/wiki/Cyc.


Even Wikidata today is already a lot more usable and scalable than Cyc. The latter always seemed like a largely-pointless proof of concept; Wikidata by contrast is very clearly something that can contain real info, and be queried in useful ways. (Of course knowledge is not always consistently represented, but that issue is inherent to any general-purpose knowledge base - and Wikidata does at least try to address it, if only via leveraging the well-known principle "many eyes make all bugs shallow".)


Scalable? Citation needed.

It is well known that Wikidata does not scale, whether in terms of the number of data contributions or the number of queries. Not only that, but the current infrastructure is... not great. WBStack [0] tries to tackle that, but it is still much more difficult to enter the party than it could be. Changes API? None. That means it is not possible to keep track of changes in your own Wikidata/Wikibase instance enriched with some domain-specific knowledge. A change-request mechanism? Not even on the roadmap. Neither is it possible to query for history of changes over the triples.

The Wikidata GUI can be attractive and easy to use. Still, there is a big gap between the GUI and the actual RDF dump; that is, making sense of the RDF dump is a big endeavor. Who else wants to remember properties by number? It might be a problem of tooling. Question: how do you add a new type of object to the GUI? PHP? Sorry.

I do not downplay the role of wikimedia.

[0] https://addshore.com/2020/01/wbstack-infrastructure/


> Neither is it possible to query for history of changes over the triples.

And why should it be? The triples (and hence the full RDF dump as well) are a “lossy” translation of the actual information encoded in the graph (there are actually two different translations: the “truthy” triples, which throw away large parts of the data, and the full dump, which reifies the full statements but is therefore much more verbose). Revision history for the _actual_ items has been queryable via the MediaWiki API for a long time.


With regards to bugs: apparently the largest human by mass is a 20-year-old gymnast:

https://www.wikidata.org/wiki/Q15710550


Looks like someone fixed it after your comment. Thanks for contributing your eyeballs to the hunt!


Yeah, I think that query should actually return this result:

https://www.wikidata.org/wiki/Q3572342


I don't think that's pessimistic, more like cautionary. For a project like this it behooves them (IMO) to do a "related work" review, eh?


Agreed. "[since 1982,] by 2017 [Lenat] and his team had spent about 2,000 person-years building Cyc, approximately 24 million rules and assertions (not counting "facts") and 2,000 person-years of effort." https://en.wikipedia.org/wiki/Douglas_Lenat


I think Patrick Cassidy (dict) and his MICRA project might be vaguely similar as well.

http://micra.com


Why is thinking of an analogy pessimistic?


Because Cyc is not seen as having been successful, so comparing a new project to it implies that Abstract Wikipedia won't be successful either. And, of course, all new approaches in each discipline fail, until sometimes they start succeeding.


I didn't read it that way. I thought it was an interesting comparison.


Because Cyc crashed in a truly spectacular fashion.


How did it crash? And what was spectacular about it?


Cyc got hyped for a while in the early 90s. It became apparent, however, that rule-based AI wasn't going to play as big a role as ML in the future of AI research. Cyc still exists, but the company is really secretive and hasn't released anything viable in years.

[edit: I wasn't alive back then, so most of what I know comes from the Wikipedia article and a recent HN thread: https://news.ycombinator.com/item?id=21781597 . My view of Cyc probably comes across as slightly negative. Their (Cycorp) view seems to have evolved since then, and they seem to be creating some really interesting stuff.]


Always worthwhile to revisit assumptions. I don't know much about Cyc, I was just curious.


I hope at least 20-30% of the people involved in the project are at least near-native level speakers of non-Indo-European languages. Linguistic biases based on your mother tongue die hard, and I know this from having waded through tons and tons of software designed with biases built-in that woefully disregard Asian syntax, typography, input, grammar, semantics, etc etc etc. As the whole point of the project is multilingual support, I really hope the developers don’t underestimate how grammatically and semantically distant different language families can be.


I think a consistent multilingual Wikipedia is a fantastic goal.

But I'm not sure this is the right way to do it.

Given that most of the information on Wikipedia is "narrative", and doesn't consist of facts contained in Wikidata (e.g. a history article recounting a battle, or a movie article explaining the plot), the scope for this will be extremely limited. The creators are attempting to address this by actually encoding every single aspect of a movie's plot as a fact, with sentences as functions that express those facts... but this seems entirely unwieldy and just too much work.

What I've wished for instead, for years, is actually an underlying "metalanguage" that expresses the vocabulary and grammatical concepts in all languages. Very loosely, think of an "intermediate" linguistic representation layer in Google Translate.

Obviously nobody can write in that directly in a user-friendly way. But what you could do is take English (or any language) text, do an automated translation into that intermediate representation, then ask the author or volunteers to resolve all the ambiguous language cases -- e.g. it would ask if "he signed" means he made his signature or communicated in sign language. It would also ask for clarifications that perhaps aren't needed in your own language but are in others -- e.g. what noun "it" refers to, so another language will know whether to use the masculine or feminine form. All of this can be done within your own language to produce an accurate, language-agnostic "text".

Then, from this canonical intermediate interpretation, every article on Wikipedia would be generated back out in all languages, perfectly accurately, because the output program isn't even ML -- it's just a straight-up rule engine.
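To make the shape of that concrete, here is a toy sketch under my own made-up names -- nothing here corresponds to a real tool, it just shows the three steps (analyze, disambiguate, render) as data:

  # Toy sketch of the proposed pipeline (entirely hypothetical; no real tool or
  # API is implied): analysis emits a language-neutral node plus open questions,
  # the author resolves them, and per-language rule engines render the result.
  from dataclasses import dataclass, field

  @dataclass
  class Node:
      predicate: str                      # language-neutral sense id
      args: dict = field(default_factory=dict)
      open_questions: list = field(default_factory=list)

  # Step 1: automatic analysis of "He signed." leaves ambiguities to resolve.
  node = Node(
      predicate="sign(?)",
      args={"agent": "he(?)"},
      open_questions=[
          ("sense of 'signed'", ["sign_document", "use_sign_language"]),
          ("referent of 'he'", ["MAYOR", "GOVERNOR"]),
      ],
  )

  # Step 2: the author answers the questions, still working in their own language.
  node.predicate = "sign_document"
  node.args["agent"] = "MAYOR"
  node.open_questions.clear()

  # Step 3: per-language rule engines (tiny lookup tables here) render the node.
  LEXICON = {
      "en": {"MAYOR": "The mayor", "sign_document": "signed the document"},
      "de": {"MAYOR": "Der Bürgermeister", "sign_document": "unterschrieb das Dokument"},
  }

  def render(node, lang):
      lex = LEXICON[lang]
      return f"{lex[node.args['agent']]} {lex[node.predicate]}."

  for lang in ("en", "de"):
      print(lang, "->", render(node, lang))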

Interestingly, an English-language original might come back out a little bit different, but in ways that don't change the meaning. Almost like a language "linter".

Anyways -- I think it would actually be doable. The key part is a "Google Translate"-type tool that does 99% of the work. It would need manual curation of the intermediate layer with a professional linguist from each language, as well as manually curated output rules (although those could be generated by ML as a first pass).

But something like that could fundamentally change communication. Imagine being able to make any article available, perfectly translated, to anyone, just with the extra work of resolving all the ambiguities the translation program finds.


>The creators are attempting to address this by actually containing every single aspect of a movie's plot as a fact, and that sentences are functions that express those facts... but this seems entirely unwieldy and just too much work.

Doesn't this also get into issues where facts aren't clearly defined? I can think of a lot of interpretation of meaning from my literature classes, but there are also questions such as ownership of land at contested borders, whether something was a legal acquisition or theft, or even coming up with a factual distinction between grave robbery and archaeology. A personal favorite would be mental illness, especially with some of the DSM V changes that have largely been rejected (or outright ignored) by society. And there are all sorts of political disagreements.

And as this applies to different languages, and different languages are likely aimed at different cultures and different nations, this gets messy. I could see some differences between an article written in Hindi and one written in Chinese concerning issues involving both China and India. Creating a common language will force a unification of differences that currently exist in a sort of stalemate, with each linguistic side maintained by the dominant country for that language.


> questions such as ownership of land at contested borders

The entity representing Kashmir https://www.wikidata.org/wiki/Q43100 has three statements each for "country" and "territory claimed by", reflecting the claims by China, India and Pakistan. There are separate entities for Taiwan (the Republic of China) https://www.wikidata.org/wiki/Q865 , Taiwan (the island) https://www.wikidata.org/wiki/Q22502 , Taiwan (the province of the Republic of China) https://www.wikidata.org/wiki/Q32081 and Taiwan (the province of the People's Republic of China) https://www.wikidata.org/wiki/Q57251 .
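You can see the competing claims directly with the standard wbgetentities call; a rough sketch (P17 is "country"; P1336 should be "territory claimed by", but double-check that ID before relying on it):

  # Rough sketch: list the conflicting "country" / "territory claimed by"
  # claims on the Kashmir item (Q43100) via the standard Wikidata API.
  import requests

  API = "https://www.wikidata.org/w/api.php"

  resp = requests.get(API, params={
      "action": "wbgetentities",
      "ids": "Q43100",
      "props": "claims",
      "format": "json",
  }).json()

  claims = resp["entities"]["Q43100"]["claims"]
  for prop in ("P17", "P1336"):        # P17 = country; P1336 = territory claimed by (as I recall)
      for claim in claims.get(prop, []):
          value = claim["mainsnak"].get("datavalue", {}).get("value", {})
          print(prop, "->", value.get("id"))   # prints the QIDs of the claimant states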

So Wikidata can handle conflicting information by collecting all of it, but clearly separating the different viewpoints in a kind of "split brain". That works so long as the different sides can agree that their opponent's views are what they state they are.

In an Abstract Wikipedia article, that means that all viewpoints with a sufficiently large userbase might end up represented equally in all language versions, but they'll still be clearly distinguished as such, so the reader can apply their own value judgments to support their own side's viewpoint over that of their enemies.


So Abstract Wikipedia could achieve the ultimate goal of being banned in China, India and Pakistan at the same time.


And maybe even in the USA.


Ambiguity, probability, and conflict can themselves be modeled.

As long as participants agree on something resembling a common reality, they should at least be able to agree that there is a disagreement.


This reminds me greatly of Leonard Cohen's line

"There is a war between the ones who say there is a war and the ones who say there isn't."

https://www.youtube.com/watch?v=1qjmxTQ1o0M

(In Wikipedia jargon, I guess some people or countries are apt to say that their opponents' points of view are non-notable?)


> But what you could do is take English...

Then it is not English but a subset of English. There are translation systems that already work the way you describe, using restricted natural-language grammars.


To the contrary, it would be a superset. The point is that it is not restricted, but would also require additional distinctions that aren't even present in English.

Although, there may be some cases where truly identical meanings in English are collapsed into one. For example, the arguable lack of difference between "it isn't" and "it's not". (Things that are quotations could be marked up as such so as not to disturb the text in its original language, but quotations could still be translated into other languages.)


Even if you wrote the articles in a subset of English like the https://simple.wikipedia.org/wiki/Main_Page and then used ML to translate into other languages and then formed a feedback loop with the translation so that the original author could have some assurance that the translated texts were valid, this would be huge.


>What I've wished for instead, for years, is actually an underlying "metalanguage" that expresses the vocabulary and grammatical concepts in all languages.

We've been down this road before... https://en.wikipedia.org/wiki/Characteristica_universalis


Thanks for the reference. What I'm talking about is quite different, though.

Leibniz's proposal and similar ones are about a language intended to represent knowledge (formal concepts), and usable by people (with training).

I'm talking about simply a translation layer. It doesn't "know" or symbolize anything on its own -- it's merely a central mapping between all constructs in different languages. Nobody would read or write in it directly, and it couldn't be used to make inferences or anything like that.

What Abstract Wikipedia is proposing is actually quite similar to what Leibniz proposed, an intricately detailed map of knowledge and true propositions.

What I'm proposing is vastly less ambitious and vastly more useful: just a way to store the meaning of text in an intermediate representation that belongs to no single language. But that representation doesn't bother with what are facts or propositions or anything like that -- just entities and relations that map to however we ultimately understand languages in the real world. And assisted with a ton of ML, rather than built up from elementary propositions or set theory or something.


I think ultimately this endeavor would run into the same issues which befell the efforts of the GOFAI research program. A translation layer would need to "understand" embedded higher-order contexts of language making implicit references to the world, and which vary by different languages and cultures. Such contexts are not analytically decomposable in a manner which permits a centralized mapping.

As with GOFAI, a formalized representation of meaning entails certain ontological assumptions about "how the world hangs together". These assumptions often hold true for an isolated domain of inquiry but falter when dealing with more complex phenomena. I could, however, see such a system useful for 'Simple Wikipedia' articles which largely consist of existentially quantifiable sentences.

There is a wealth of philosophical scholarship on the issues of semantics and formal representation; I recommend the works of Hubert Dreyfus if you want to dig into this more.


hasn’t it been suggested that the internals of google translate and similar systems have effectively constructed such a meta-language?

It seems like you could construct a meta-linguistic sentence by writing a phrase in one language and choosing from possible translations in additional languages, which would hone the cross-linguistic meaning. You might not even have to define a representation of the metalanguage: you could just keep a set of (phrase, score) tuples and interpolate translations of those phrases into the target language at display time! That almost sounds practical to me.
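The bookkeeping for that would be almost trivial; a hypothetical sketch with made-up phrases and scores:

  # Hypothetical sketch of the (phrase, score) idea: keep scored candidate
  # translations per phrase and pick the best one at display time.
  # Phrases and scores below are invented for illustration.
  meta_sentence = {
      "en": [("the bank of the river", 0.9), ("the riverbank", 0.8)],
      "de": [("das Flussufer", 0.9), ("die Bank des Flusses", 0.2)],
      "fr": [("la berge de la rivière", 0.85)],
  }

  def display(meta, lang):
      # interpolate the highest-scoring candidate for the target language
      phrase, _ = max(meta[lang], key=lambda t: t[1])
      return phrase

  print(display(meta_sentence, "de"))   # -> "das Flussufer"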


>hasn’t it been suggested that the internals of google translate and similar systems have effectively constructed such a meta-language?

No, Google Translate and other state of the art machine translation software rely on large corpuses of linguistic data from which they make statistical inference. They do not have a 'centralized metalanguage' which ontologizes meaning in the manner described by OP.


Is the Grammatical Framework something like the thing you are imagining?


There could be some very interesting meta analytics that could be done on knowledge structured in this way. For example, this research which identifies the structural differences in the fact graphs of conspiracy theories vs accurate accounts: https://phys.org/news/2020-06-conspiracy-theories-emergeand-...


Interesting link, thanks for sharing. I wonder what this means precisely:

  If you take out one of the characters or story elements of a conspiracy theory, the connections between the other elements of the story fall apart.
I guess I have to read the paper, but what are these "connections" and what does "fall apart" actually mean?

EDIT: I just skimmed the paper https://journals.plos.org/plosone/article?id=10.1371/journal...

The connections capture context-specific relationships, such as co-occurrences. The "fall apart" part comes from the fact that conspiracy theories rely on hidden, unsubstantiated, subjective interpretations of intent or actions whose validity can be questioned. If they are key pillars of the narrative, then their falsity can negate the truth of the narrative.

This reminds me of a philosophical discussion around what "truth" means. Coherence theory of truth: truth is defined as coherence within a set of beliefs. It can also be used as an epistemic justification -- that is, any set of internally consistent beliefs can be taken as true. Of course, in practice, certain truth statements have to correspond to reality, which is where the correspondence theory of truth comes in.


Great article. Thanks for the recommendation/link.


I truly love the mobile design of Wikipedia and find myself adding ".m" to every Wikipedia link I visit. It has larger fonts, more readable copy (for me at least), and works great on mobile. Surprisingly, the trick worked with this one as well!

How come the mobile design is not the default?


Because it is limited in design. It's missing the sidebar (which has useful links), the discussion page, the history page, and account links. Also, the more compact text of the desktop version is preferable to some.

You can automatically redirect to the mobile version if you want using a user script. I'm using a similar one for the reverse.

Edit: Here you go. https://gist.github.com/leakypixel/1b0a30fbdc815016c14264b82...


On mobile I struggle to remove the .m from every link. The mobile version does not work well with the search feature, and I never took the time to figure out where you switch to another language.


I'm not very involved in Wikipedia politics so I might be wrong here, but my perception is that the desktop Wikipedia has a lot of eyes on it and any change is received with an "aahh, change is scary" response.

The people who react negatively to change are also the people who don't like mobile versions of websites, so the mobile site is more free to experiment and evolve its design.


Mobile is read- and very-occasionally-edit.

Desktop is basically the full admin interface for a Google-scale website with only rudimentary authentication requirements and a default-allowed policy. That works.


Truly hate that I have to deal with deleting the .m off other people's links. It's so ugly, it has these disgustingly oversized fonts, and because all the sections are hidden by default, it works badly on mobile and even worse on a real computer.

How come I can't just stop seeing that loathsome mobile design, both on my desktop and on my phone?

And no, I'm not kidding.


Note that the mobile skin is separate from the mobile site. If you set the Minerva Neue skin in your preferences, you will get it on desktop (just the skin; there are non-skin differences with the mobile site you won't get).

As for why it's not the default: I imagine it's partially because the mobile site tends to be unpopular with power users (of course, opinions vary).


The math template (the HTML one, not MathJax) and many other templates don't work well with the mobile version, and the talk page isn't there.

On desktop, I suggest installing Stylus and using a more readable and elegant Wikipedia theme. I like wikipedia.rehash by krasjet.


Ironically, switching to different language versions of the same page is irritating on the mobile version.


How do you even do it? My biggest complaint about the mobile site.


I seem to remember it used to be a foldout thingie as you scrolled down (at which point you might as well just hit the 'Desktop' link at the bottom). But apparently no more! I just found it again after some intense staring - it's the icon with two characters on the far left of the bar right under the article title.


Wow, I've seen that button many times but I was almost certain it was related to something... else. Thank you!


Try Wikiwand. It's a browser extension that gives all wikipedia pages a mobile-like design.


If successful, this could open huge doors in machine translation and NLP. Very cool.


It kinda would; basically, a huge library of labelled NLP data may become available as a result of this.


A Wikipedia Signpost article[1] gives a more detailed overview of the goals of the project, but it also made me think of an interesting failure case. From the article:

> Instead of saying "in order to deny her the advantage of the incumbent, the board votes in January 2018 to replace her with Mark Farrell as interim mayor until the special elections", imagine we say something more abstract such as elect(elector: Board of Supervisors, electee: Mark Farrell, position: Mayor of San Francisco, reason: deny(advantage of incumbency, London Breed)) – and even more, all of these would be language-independent identifiers, so that thing would actually look more like Q40231(Q3658756, Q6767574, Q1343202(Q6015536, Q6669880)).

But Q1343202 doesn't mean "denial" as in "preventing someone else from getting something", it means "denial" as in "refusing to accept reality". (See [2].) The two concepts are represented by the same word in English, but they might not be in other languages.

It seems like it'd be kind of tricky to create an interface that ensures other English-speaking editors indicate the right meaning of "denial".

[1] https://en.m.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost...

[2] https://m.wikidata.org/wiki/Q1343202


I think the answer is to be as clear as possible in the interface, but also to accept that mistakes will be made. People make grammar mistakes in (normal) Wikipedia all the time, and then other people come along and fix them. I expect the same will occur here.
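One way to "be clear in the interface" would be to echo back an item's label and description before accepting it, which would have flagged the Q1343202 mix-up; a rough sketch using the standard wbgetentities call (the confirmation flow itself is imaginary):

  # Rough sketch: before accepting a QID in an abstract-content editor, show
  # the editor its label and description so a wrong sense (like Q1343202,
  # "denial" as refusal to accept reality) is easy to spot.
  import requests

  API = "https://www.wikidata.org/w/api.php"

  def describe(qid, lang="en"):
      data = requests.get(API, params={
          "action": "wbgetentities",
          "ids": qid,
          "props": "labels|descriptions",
          "languages": lang,
          "format": "json",
      }).json()["entities"][qid]
      label = data.get("labels", {}).get(lang, {}).get("value", "?")
      desc = data.get("descriptions", {}).get(lang, {}).get("value", "?")
      return f"{qid}: {label} -- {desc}"

  # The hypothetical editor would display this and ask "is this what you meant?"
  print(describe("Q1343202"))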


Can someone please ELI5 what the end product would look like? I couldn't understand anything concrete from the article.


This article from the SignPost is much more informative:

https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2...


This appears to be an attempt to make a Wikipedia using the semantic data from Wikidata. The semantic web ideas of Tim Berners-Lee may be catching on.


That already happens, that's what Wikidata does. [0]

[0] https://www.wikidata.org/wiki/Wikidata:RDF


This is of course an interesting idea, but it has a number of huge technical hurdles to overcome. Here is the biggest:

Right now, if you want to become an editor of Wikipedia, you simply need to have a passing familiarity with wikitext, and how the syntax of wikitext translates into the final presentation of the article.

However, if you want to become an editor of Abstract Wikipedia, you'd need to have an in-depth knowledge of lambda calculus, and possibly a Ph.D. in linguistics. Without a quantum leap in editing technology and accessibility for beginners, there's little hope for this to gain any traction.


> Right now, if you want to become an editor of Wikipedia, you simply need to have a passing familiarity with wikitext

Wikipedia has had a WYSIWYG editor for years.

> if you want to become an editor of Abstract Wikipedia, you'd need to have an in-depth knowledge of lambda calculus, and possibly a Ph.D. in linguistics

No, this is not how it's intended. First, the data itself is supposed to come from Wikidata, which is super simple to edit. Secondly, surely they can come up with a UI for the other parts.


> surely they can come up with a UI for the other parts

Those "other parts" is the huge hurdle that I'm referring to, and can't be hand-waved away. There are already tools that can take Wikidata and transform it into human-readable articles [1]

But it's not at all obvious how to build a simple UI for writing completely abstract lambda expressions that take arbitrary data and apply linguistic nuances to produce readable text with correct grammar.

[1] https://reasonator.toolforge.org/


Why do you need a PhD in linguistics to write code?


It's not just writing code, it's writing code that needs to be aware of every linguistic nuance of your native language, so that you can coax the data to come out as a human-readable sentence. [1]

[1] https://meta.wikimedia.org/wiki/Wikilambda/Examples
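To give a flavor of what "aware of every linguistic nuance" means, here is a toy sketch of a German renderer for one sentence pattern; it is purely illustrative and nothing like the actual Wikilambda function model:

  # Toy sketch (hypothetical, not the real Wikilambda model) of why renderers
  # need linguistic knowledge: even "the Nth-largest city" needs gender
  # agreement in German.
  GENDER = {"Stadt": "f", "Zentrum": "n", "Fluss": "m"}
  DEF_ARTICLE_NOM = {"m": "der", "f": "die", "n": "das"}
  ORDINAL_DE = {3: "dritt", 4: "viert"}

  def nth_largest_city_de(name, rank, region):
      noun = "Stadt"
      art = DEF_ARTICLE_NOM[GENDER[noun]]      # "die" -- Stadt is feminine
      adj = f"{ORDINAL_DE[rank]}größte"        # e.g. "drittgrößte"
      return f"{name} ist {art} {adj} {noun} in {region}."

  print(nth_largest_city_de("München", 3, "Deutschland"))
  # -> München ist die drittgrößte Stadt in Deutschland.

And that is just one pattern in one language; every ordinal, case, and agreement rule multiplies the work.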


What, didn't Chomsky already solve that in the 1960s? /s


How would you improve it?


I am sure that was hyperbole.


Yes. I know.


The research article doesn't mention UNL a single time, despite it being a really similar effort (encoding texts in an abstract representation that is then generated and used by tools to translate automatically into various languages). The hard part of the project is not encoding facts into cute little RDF triples (that's the super easy part, and as usual that's where the SemWeb researchers put their focus); it's generating natural language from the abstract representation.

This means precise linguistic information must be present in the abstract representation to generate correct sentences. Spoiler: that information seems absent, and the renderings presented in the paper are very basic. The data part of the project seems OK, but I don't predict it will go well, because the NLP is largely ignored.


Wouldn't improving online translation tools achieve the same thing? That seems like a much more reasonable task, perhaps (or perhaps not, I am not an expert).

https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Examples

I used Google Translate on the English sentences here and it output the exact same German sentences. I feel like this is already a somewhat solved problem.


It's also interesting to note that, already, bots on some Wikipedias are the largest contributors of articles in that language. The Swedish, Waray, and Cebuano Wikipedias already have an estimated "between 80% and 99% of the total" all written by one bot, Lsjbot [1].

[1]. https://en.wikipedia.org/wiki/Lsjbot


I wonder if Lsjbot has increased single-contribution users. Wikipedia (or EN Wikipedia, anyway) gates article creation but not editing. If other Wikipedias do that as well, then single-edit users won't be able to create an article and hence can't contribute. But if Lsjbot has created the stub, then people can contribute.


This is what I hope the future of Wikipedia looks like. If all "facts" are stored in Wikidata and pulled from there by individual articles, it would be simple to keep things up to date. I'd love to see Wikidata grow to encompass all sorts of things - citations would be especially interesting and it could potentially solve the problem of an article citing the same source multiple times.
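The plumbing for pulling a single fact already exists today via the public SPARQL endpoint; a minimal sketch (Q64 and P1082 are the real IDs for Berlin and "population", to the best of my knowledge; the User-Agent string is arbitrary):

  # Minimal sketch: pull one fact (Berlin's population) from the public
  # Wikidata SPARQL endpoint instead of hard-coding it in an article.
  import requests

  QUERY = """
  SELECT ?population WHERE {
    wd:Q64 wdt:P1082 ?population .    # Q64 = Berlin, P1082 = population
  }
  LIMIT 1
  """

  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "abstract-wikipedia-example/0.1"},
  )
  rows = resp.json()["results"]["bindings"]
  print("Berlin population:", rows[0]["population"]["value"])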


I don't quite get what problem it's trying to solve. Save labor? Improve factual consistency across languages?


"Knowledge Equity."

To increase the availability of knowledge for speakers of less popular languages. Once encoded in Abstract form, it can be made available in every human language.

That is an improvement over the current situation where knowledge is concentrated in just a few of the most popular languages.


This has been tried and failed many times before.

Why is this different?

What is the fundamental structural difference that will allow this to work?


There are many more English articles than articles in any other language on Wikipedia, even though there are more non-English speakers in the world.

To me, it seems this project will allow for at least "stub" articles in essentially every other language, which at the very least provides some basic information about each entity to readers in their preferred language.


My first reaction was to look for a Wikipedia article for an overview of this. I couldn't find one yesterday, but one was created today:

https://en.wikipedia.org/wiki/Abstract_Wikipedia


AI Researchers: heavy breathing


Will this mean it will be like the bot-generated Wikipedias (like the Cebuano Wikipedia), except done by a Wikidata-powered template? It might work for basic data-backed facts like populations of villages, but what about more complicated statements?


It seems like it's the LLVM of linguistics.

> Such a translation to natural languages is achieved through the encoding of a lot of linguistic knowledge and of algorithms and functions to support the creation of the human-readable renderings of the content


Might be a good idea, but the multilingual argument doesn't convince me one bit. If this project is any useful, it won't be because of its multilingual part.

Any person worth reading in STEM fields already knows English, and I don't know why anyone would want to read Wikipedia in any language other than English.

I'm Latin American. I used the Internet in Spanish in my early teens before learning English, and it's a joke compared to the English Internet. I don't even like English from a grammatical and phonetic point of view, but trying to cater to the non-English-speaking public seems like a waste of time in 2020. Just learn English already if you don't know it; it will be a much better use of your time than reading subpar material in another language.


> Any person worth reading in STEM fields already knows English, and I don't know why anyone would want to read Wikipedia in any language other than English.

> Just learn English already if you don't know it

Some people are merely fine at English, or uncomfortable reading "casually" in their second/third languages...

It's actually not unreasonable for someone to want learning content in their native language. And there are loads of opportunities to discover new content when people in different places are writing in different languages, with new angles and takes.

For example the best intro to LaTeX is a book originally written in French[0].

And sometimes content just makes better sense in other languages because primary materials will be in that language (if you had the choice, would you rather read about the great Tokyo Fire in English or in Japanese?)

Sure, having access to English content is really important! But trying to have multilingual content is normal.

[0]: https://www.latexpourlimpatient.fr/


I often read wikipedia in several languages, because the differences between the articles sometimes offer almost as many bits of information as the commonalities.

What amazes me are the small-audience ones. For instance, who uses https://pdc.wikipedia.org/wiki/Haaptblatt given that most of the native speakers of that dialect adhere to a religion which mandates that cell phones belong, not on one's person, but in the barn, and furthermore be used strictly for business?

(I did first learn about the Mennonite origins of the https://pdc.wikipedia.org/wiki/New_Holland_Machine_Company from this wikipedia)

To the main point, I expect the lojban wikipedia to profit immensely from this project :-)

https://jbo.wikipedia.org/wiki/uikipedi%27as:ralju


> Any person worth reading in STEM fields already knows English, and I don't know why anyone would want to read Wikipedia in any language other than English

That's not the goal of Wikipedia. The goal is to make knowledge freely available to as many people as possible. That's fantastic that you were given the opportunity to learn English and have made the best of it. That is not everyone though. It's not reasonable to expect a farmer in Malaysia to know English or to learn it purely to take advantage of what's there.


> I used the Internet in Spanish in my early teens before learning English, and it's a joke compared to the English Internet.

This project seems like one decent way to try and fix that.


On the contrary, Wikipedia is a wonderful multilingual resource.

English is one of my mother tongues but I regularly read Wikipedia in half a dozen other languages to improve my knowledge of them. Being able to cross-reference what you're reading to the English version of the text, even though it's not a literal translation, gives valuable context as well as perspective on how a topic is viewed by different language groups.

It's also often more useful than a dictionary for finding the name of a flower or fish in another language. Or even some topics that you wouldn't find in a dictionary at all.


Hello from Japanese internet.


What's the point of this with the current high quality of state-of-the-art machine translation? Don't we expect machine translation to surpass humans in the near future?

People who are domain experts in various fields don't know how, don't care to, and shouldn't code. They should just edit the articles in natural language.

A lot of the content of Wikidata isn't numbers and is natural language also, so you'd still need to (machine?) translate it. But this time the machine translation algorithm would not have the benefit of the long-term context from the encompassing paragraph.

There are too many reasons why this is a bad idea. Almost makes me mad.


Where is the high quality machine translation? I spend most of my time in countries where I don't speak the same language as the majority of people and text that I encounter, so I am using machine translation many times per day. My experience of the average quality of machine translation is extremely low. It garbles meaning a majority of the time, and in a significant minority of cases destroys meaning completely.

To me the idea that you could translate an encyclopedia, where accuracy of meaning is critical, using such technology in its current state is horrifying. By contrast the abstract/semantic approach seems to have some potential, although I can't imagine it working well for all articles.


> People who are domain experts in various fields don't know how, don't care to, and shouldn't code.

My impression (correct me if I'm wrong, though) is that this is less "domain experts writing code" and more "editors can export a Wikidata 'subject' to a prebuilt translated page that domain experts can later expand". The aim of Wikidata is to collect information in a language-agnostic way, whereas the aim of Abstract Wikipedia seems to be to take that information and turn it into autogenerated pages in whatever language (even if that page is more-or-less a stub).


They are not solving your problem. They are solving theirs.

This is an experimental project so there is nothing to be angry about. Time will tell if this was a good idea or not.


I’m pretty interested in this actually. Although I’m not part of the Wikidata community, it would be interesting to see which language groups dictate the most involvement.


As a regular user of at least four Wikipedias, this seems like a very attractive direction. Interested to see whether it produces the outcomes it's designed for.


holy fuck


Extremely high barrier to entry. Less space than a Nomad. Lame.


> Because Wikidata uses conceptual models that are meant to be universal across languages,

Shows a deep misunderstanding of how human language works.


Reminder that "wiki" means quick, so when you read Wikipedia you only have surface knowledge.


As the set of information encoded into Wikipedia approaches the sum total of human knowledge, there's no particular reason that needs to remain true.


> As the set of information encoded into Wikipedia approaches the sum total of human knowledge

An outcome Wikipedia will never get close to. They'll never reach 1% of the way there. Closer to 0% of human knowledge gets recorded, rather than closer to 1%. Of the knowledge that is recorded, a small fraction of it will end up on Wikipedia. Most of what gets recorded is universal or widely experienced knowledge, which is a minuscule subset of "the sum total of human knowledge."

Wikipedia has already begun to stagnate badly. For the most part, it's over. That's why they're attempting Abstract Wikipedia now (aka another round of the failed insular semantic Web for elite Wiki nerds that won't accomplish much of anything for the average reader that wants to learn something); and it's why Wikimedia wants to rebrand itself to Wikipedia; and it's why their system is being overtaken by partisan politics (as momentum continues to decline the system will rot and pull apart in various negative ways at an accelerating clip). The growth is running out, and the Wiki bureaucracy wants to keep expanding, that's what this is about.


> They'll never reach 1% of the way there

> Closer to 0% of human knowledge gets recorded

> Wikipedia has already begun to stagnate badly

Anything more concrete you can link to for further reading on this? I understand the difficulty in quantifying such claims and measures but I'd appreciate reading something that attempts to do so objectively.



