Example notation for the project, called AbstractText:
————
Input 1:
Subclassification(Wikipedia, Encyclopedia)
Result 1:
English: Wikipedias are encyclopedias.
German: Wikipedien sind Enzyklopädien.
————
Input 2:
Article(
content: [
Instantiation(
instance: San Francisco (Q62),
class: Object_with_modifier_and_of(
object: center,
modifier: And_modifier(
conjuncts: [cultural, commercial, financial]
),
of: Northern California (Q1066807)
)
),
Ranking(
subject: San Francisco (Q62),
rank: 4,
object: city (Q515),
by: population (Q1613416),
local_constraint: California (Q99),
after: [Los Angeles (Q65), San Diego (Q16552), San Jose (Q16553)]
)
]
)
Result 2:
English: San Francisco is the cultural, commercial, and financial center of Northern California. It is the fourth-most populous city in California, after Los Angeles, San Diego and San Jose.
German: San Francisco ist das kulturelle, kommerzielle und finanzielle Zentrum Nordkaliforniens. Es ist, nach Los Angeles, San Diego und San Jose, die viertgrößte Stadt in Kalifornien.
It's what bright children learning about computers in the '70s thought.
Fifty years later it still hasn't been solved, because it doesn't work that way.
Is there a real example not using proper nouns?
A city changes, and a population changes depending on the country, the language, and the time. Town X with population Y might be considered village X with population Z, because in some countries the population figure includes the rural surroundings; the population of San Francisco might be counted differently in another country.
The rabbit hole goes on forever, and more importantly it's been tried constantly for over 50 years.
Machine translation, by contrast, is amazing compared to 50 years ago and getting better, and you can see how it could be integrated more deeply with Wikipedia (it's already used), yet it's tossed aside in the white paper for no good reason I can see. There's also lots of stuff like Duolingo-style methods that you could look at.
English and German have very similar vocabulary and syntactic structure. So this example is not very elucidating. Comparing it to Chinese, Turkish or Javanese would probably be better.
This maps rather nicely to something like Grammatical Framework [0]. I wonder whether they'll adopt an existing project for translation; getting things into this graph form seems like the hard part, honestly.
As far as the comparison goes, it should be easy enough to map the trees from the abstract form into language-specific trees. Were you hoping to understand the current limitations? Maybe get a benchmark of the state of things that updates automatically as the project continues?
>English and German have very similar vocabulary and syntactic structure
Hm, the sentences are structured in a parallel way, but is that really proper German? I don't remember anything from high school German class, but people make jokes about putting the verb way at the end. Or is that an obsolete style?
It's actually great German. Syntactically sophisticated. I am surprised by the use of the subclause (not sure what the proper name for this is) which puts the three larger cities in the middle of the last sentence.
It would have been possible to place the three larger cities at the end of the sentence similar to the English example. This would have sounded a bit more bot-like, and was somehow what I expected.
So this particular German sentence is actually quite a good example of the power of this approach.
Yes, this. Well, in this case, the solution is obvious: you need to have two separate concepts for center. But…
When I first learned about the OmegaWiki project (called WiktionaryZ then, I think), I was thrilled. It tried to represent lexical (Wiktionary) definitions and other language concepts using data. For each sense of each word, a so called DefinedMeaning was created. In the same sense, Wikidata has its entities. But soon, I learned about a problematic aspect of OmegaWiki’s concept, and the same thing appears on Wikidata: You represent some set of concepts in a single language, then another language comes and needs to split some concepts in two, because your language uses one word for both, but the other differentiates between them. Then, a third language comes and it maps its concepts to your existing set still a bit differently, so you might get four entities for just three languages. Etc.
On Wikidata, more focus is, I guess, on “concrete” entities: people, places, etc., where this does not appear that often. But it contains the abstract entities as well, and the problem appears there all the time. You might try to “fix” the problematic entities by splitting them to more elementary, linked using “subclass of” etc.; in some cases it might work quite fine (but losing the interwiki links in the process, which is unfortunate, given those were the original use case of Wikidata), in others, it is basically impossible without a degree in philosophy and deep understanding of ten languages, to be able to correctly distinguish and represent their relations. And imagine somebody trying to _use_ those entities. Like “I would like to say this person was a writer”, but there are seventeen entities with the English label of “writer”, distinguished by some obscure difference used by a group of Sino-Tibetan languages.
And… Wikidata entities represent basically just nouns.
In Input 2 "center" is a keyword, because the markup is using English for keywords. The example output just happens to be in English as well. I assume it will be mapped to a more appropriate word in another language.
But the word/concept 'Center' does not appear anywhere in the input data, as far as I can see? It just lists a number of things for which SF ranks highly, and whether that means you call it a 'center' is up to the template writer - unless I'm misreading.
The key idea is that if the semantic description is abstracted enough, a grammar engine can convert the ideas encoded in it into the right structure for the language.
Not all languages have "X is Y" constructs, but all known human languages have some structure to declare that object X has property Y. Capture the idea "Object X has property Y" in your semantic language, and a grammar engine can wire that down to your target language.
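To make that concrete, here is a minimal sketch in Python (the statement format and lexicon layout are my own invention, not the project's actual notation): one abstract statement, and per-language renderers that own both the surface words and the surface structure.

# Hypothetical sketch: an abstract "X is an instance of Y" statement, rendered
# by per-language rules that supply their own lexicon and word order.
STATEMENT = {"constructor": "Instantiation",
             "instance": "wikipedia", "class": "encyclopedia"}

LEXICON = {
    "en": {"wikipedia": "Wikipedia", "encyclopedia": "an encyclopedia"},
    "de": {"wikipedia": "Wikipedia", "encyclopedia": "eine Enzyklopädie"},
}

def render(statement, lang):
    lex = LEXICON[lang]
    subject = lex[statement["instance"]]
    predicate = lex[statement["class"]]
    # These two languages happen to share the copula pattern; a language
    # without an "X is Y" construct would plug in a different rule here.
    pattern = {"en": "{s} is {p}.", "de": "{s} ist {p}."}[lang]
    return pattern.format(s=subject, p=predicate)

print(render(STATEMENT, "en"))   # Wikipedia is an encyclopedia.
print(render(STATEMENT, "de"))   # Wikipedia ist eine Enzyklopädie.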
The largest risk is that the resulting text will be dry as hell, not that it's an impossible task.
Being dry doesn't diminish the value of the text, though. Very exciting.
I'd also be worried about ambiguity; humans can (sometimes) detect when they may be parsed the wrong way in context. I wonder if there will be a way to flag results that don't properly convey the data. How would that be integrated into the generator? (There's probably an answer in the literature.)
The main problem is that language X has an implicit definition of Foo, which is similar but not identical to language Y's definition of Bar. This might work when the languages share common ancestry like German and English, where Foo and Bar are both descendent from Baz and have similar meanings, but will not work when you try to translate to language Z, whose speakers have a different word Foobar which has a meaning that encompasses Baz and Qux but excluding Xyzzy and with a completely different connotation.
Finnish would likely work, though it would require very extensive rules on declensions. Some compound-word and list rules are also fun... Finnish is rather liberal in word order, but that's a simple fix.
What is hard is that the conjuncts do not have unique identifiers in the example. That is an essential thing to have, as there are plenty of synonyms and meanings might change. The same applies to 'center'.
This is one big hurdle, I think. If one has to refer to the English meaning of words for the whole project to work, then how is this different from just writing the whole thing in English and translating everything from that?
What a horribly myopic way to organize information. They seem to have unthinkingly copied from vernacular English various loosely defined concepts like "city". What do they mean by San Francisco? The City and County of San Francisco? What about Los Angeles? Is that the entire LA metro or just LA county? Is Santa Monica a part of Los Angeles or a separate settlement? How is the concept of "city", "metro", and "town" going to translate into "市", "Burg", and "Grad"?
This is getting very close to the Universal Language that Umberto Eco describes in his book The Search for the Perfect Language. I wonder what he would think about this if he were alive today...
That's the beauty here. It's not the syntax. It's just a syntax to express the abstract thing. Saying this syntax is an issue is like saying "I don't like binary trees because their syntax is so weird". One particular syntax may be weird, but the syntax is only specific to one specific representation. Everybody will be free to choose any representation they like, as long as it can somewhat automatically be translated back into the abstract thing that this project is aiming to produce and maintain.
The way I'd do it, would be to store an intermediate representation, and have multiple front-ends with different syntaxes. Have the editable text be generated from the IR.
This would be a huge plus, as it would not require the editor to know English keywords. Most keywords could be translated into the contributor's native language, lowering the barrier for editing.
It would also allow the syntax to be changed over time, or provide multiple different syntax paradigms, a bit like wikipedia's code vs visual editors.
Of course, comments are an issue, but hopefully, this is as close to "self-commenting" code as it gets.
For reference, this is from the same developer [1] that created Semantic MediaWiki [2] and led the development of Wikidata [3]. Here's a link to the white paper [4] describing Abstract Wikipedia (and Wikilambda). Considering the success of Wikidata, I'm hopeful this effort succeeds, but it is pretty ambitious.
Considering the close relationship with Google and Wikimedia https://en.wikipedia.org/wiki/Google_and_Wikipedia and the considerable money Google gives them, how can one not see this project as "crowdsourcing better training data-sets for Google?"
I don't think the relationship is that close. All it says is that Google donated a chunk of money in 2010 and in 2019; it was a large chunk (~3% of donations), but not so much as to create a dependency.
> Can the data be licensed as GPL-3 or similar?
Pretty unlikely, to be honest. I don't know if anything has been decided for licensing, but if it is to be a "copyleft" license it would be CC BY-SA (like Wikipedia), since this is not a program.
Keep in mind that in the United States an abstract list of facts cannot be copyrighted, AFAIK (I don't think this qualifies as that; Wikidata might, though).
How so? Wikimedia-provided data can be used by anyone. Google could have kept using and building on their Freebase dataset had they wanted to - other actors in the industry don't have it nearly as easy.
Denny seems to be leaving Google and joining Wikimedia Foundation to lead the project this month, so probably you do not need to worry too much about Denny's affiliation with Google.
As a long-time Wikipedian, this track record is actually worrisome.
Semantic Mediawiki (which I attempted to use at one point) is difficult to work with and far too complicated and abstract for the average Wiki editor. (See also Tim Berners-Lee and the failure of Semantic Web.)
WikiData is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.
> WikiData is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.
Note that the internal data format used by Wikidata is _not_ RDF triples [0], and it's also highly non-relational, since every statement can be annotated by a set of property-value pairs; the full data set is available as a JSON dump. The RDF export (there's actually two, I'm referring to the full dump here) maps this to RDF by reifying statements as RDF nodes; if you wanted to end up with something queryable by SQL, you would also need to resort to reification – but then SPARQL is still the better choice of query language since it allows you to easily do path queries, whereas WITH RECURSIVE at the very least makes your SQL queries quite clunky.
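To illustrate the path-query point, here is a hedged sketch in Python against the public query.wikidata.org endpoint (the query itself is ordinary SPARQL 1.1; wdt:P279* walks the "subclass of" chain in a single pattern, which SQL would need a recursive CTE to express):

import requests

# Find a few transitive subclasses of "city" (Q515) via the public endpoint.
query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P279* wd:Q515 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "example-script/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"])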
How do you dump general purpose, encyclopedic data into a relational database? What database schema would you use? The whole point of "triples" as a data format is that they're extremely general and extensible.
Now you need a graph for representing connections between pages, but as long as the format is consistent (as it is in templates/infoboxes) that can be done with foreign keys.
Table capital:
  ID  | Name
  123 | Foo
  456 | Bar

Table country:
  Name  | Capital_id | Population
  Aland | 123        | 100
  Bland | 456        | 200
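To make the toy schema above concrete, here is a hedged sketch in Python with sqlite3 (purely illustrative; it says nothing about how Wikidata is actually modeled):

import sqlite3

# Toy relational version of the example above (illustrative only).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE capital (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE country (name TEXT, capital_id INTEGER REFERENCES capital(id), population INTEGER);
INSERT INTO capital VALUES (123, 'Foo'), (456, 'Bar');
INSERT INTO country VALUES ('Aland', 123, 100), ('Bland', 456, 200);
""")
query = """
SELECT country.name, capital.name, country.population
FROM country JOIN capital ON country.capital_id = capital.id
"""
for row in db.execute(query):
    print(row)   # ('Aland', 'Foo', 100) then ('Bland', 'Bar', 200)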
> Most structured data in Wikipedia articles is in either infoboxes or tables
Most of the data in Wikidata does not end up in either infoboxes or tables on some Wikipedia, however, and graph-like data such as family trees works quite poorly in a relational database, even if you don't consider qualifiers at all.
Those infoboxes get edited all the time to add new data, change data formats, etc. With a relational db, every single such edit would be a schema change. And you would have to somehow keep old schemas around for the wiki history. A triple-based format is a lot more general than that.
People might be interested to know that semantic web ideas have been more successful in some niches than others. Computational biology, for example, makes extensive use of "ontologies", which are domain-specific DAGs that do exactly what Abstract Wikipedia is attempting. Much of the analysis of organisms' genomes and related sequences relies on these ontologies to automatically annotate the results so that meaningful relationships can be discovered.
There are of course HUGE issues with the ontologies. They are not sexy projects, so they are often underfunded and under-resourced - even though the entirety of bioinformatics uses them! The ontologies are incomplete and sometimes their information is years behind the current research.
For the curious, the Gene Ontology is the golden child of biology ontologies. See here: http://geneontology.org/
Semantic Web[1] reborn (after alleged[2] death)? Also I wonder how helpful Prolog infrastructure could be since they provided some useful frameworks [3][4] for that.
We actually looked into the SWI-Prolog semantic web package for corporate work! We ended up finding RDFox ( https://www.oxfordsemantic.tech/ ), which is the bleeding edge in research on inference databases and linked data. Unfortunately COVID changed the plans, but we were really, really impressed with the capabilities.
The Semantic Web is used broadly: the Google structured data you see for reviews and infoboxes, Wikidata. Data is broadly available, even if jobs in semantic technologies are not.
We're familiar with common databases like key-value stores, OLAP, OLTP, but reasoning technology offers unique properties many people aren't aware of. For example, you can have your business logic integrated with your database in a way that's much more flexible than stored procedures. You express your business rules as logic programs; they automatically run multi-core; they run as soon as data is inserted into the database, and there is no function call; the data does not need to be aware of what logic is in the database; logical rules are applied incrementally, so adding new data or new rules does not trigger re-computation of all the data; business rules can use data produced by other business rules; and finally, you can use the explain command to get a mathematical proof of why an outcome happened.
Reasoning technology may be old, but recently this idea of automatically stating things in a declarative form and having the application reconcile the differences has been the differentiating factor for the most popular software out there: Kubernetes, Terraform, Ansible, React, GraphQL, Flutter. Without the declarative reasoning capabilities, these tools may not be considered some of the best.
Think PostgreSQL 12 generated columns, except infinitely chainable, recursive, and connectable to other tables. Think pre-computed materialized views, but automatically updated as new data is inserted (no refresh needed).
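To give a flavor of that "rules fire incrementally as data arrives" idea, here is a minimal forward-chaining sketch in Python. It is my own toy, not how RDFox or any real reasoner works, and the fact/rule shapes are invented for illustration:

# Facts are triples; rules fire as soon as a new fact is inserted, and
# derived facts can trigger further rules. Known facts are never recomputed.
facts = set()
rules = []   # each rule: fact -> iterable of newly derived facts

def add_rule(rule):
    rules.append(rule)
    for fact in list(facts):        # apply the new rule to existing data once
        for derived in rule(fact):
            insert(derived)

def insert(fact):
    if fact in facts:
        return                      # incremental: duplicates cost nothing
    facts.add(fact)
    for rule in rules:              # only the new fact is examined
        for derived in rule(fact):
            insert(derived)

# toy business rule: every capital is also a city
add_rule(lambda f: [(f[0], "is_a", "city")] if f[1:] == ("is_a", "capital") else [])
insert(("Foo", "is_a", "capital"))
print(facts)   # contains ('Foo', 'is_a', 'city'), derived automatically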
OH MY GOD. Is this jimmyruska of jimmyr.com and those youtube tutorials from way way back?
I'm going to take every opportunity I get to tell you this: you are the reason I'm where I am. I followed you since middle/primary school. Always checked up on you every now and then, and your site [0] is STILL my homepage. In fact I reached this very link from your HN tab (although the link to the HN tab leads to a webarchive).
Thank you. This is a little weird, I'm sure, but you've definitely had a tangible and very significant impact on my life!
>> Also I wonder how helpful Prolog infrastructure could be since they provided some useful frameworks [3][4] for that.
That's a good point, because looking at the working paper on the proposed architecture of the project [1], the example of a "constructor" in Figure 1 is basically a set of frames, and it has a straightforward translation into Prolog; the example of a "renderer" in English is basically a pattern with holes, which also has a very straightforward Prolog implementation via Definite Clause Grammars. In fact the whole architecture reminds me a lot of IBM's Watson - the good bits (i.e. the Prolog stuff they used to store the knowledgebase).
1. As you say, to see more information about where the paper comes from.
2. It's easy to get from the abstract page to the PDF, but not vice versa.
Personally, I also think it's good for people to get into the habit of linking to a text description of data-heavy resources rather than directly to the resources. PDFs aren't that data-heavy, but there are plenty of other things that are that could do with a text landing page, and I think it's good to get in that habit.
A bit has changed in AI in the last 30 years. The way we use the Internet has changed as well. Perhaps if we had a better semantic network and today's algorithms, we could go further?
> The goal of Abstract Wikipedia is to let more people share in more knowledge in more languages. Abstract Wikipedia is an extension of Wikidata. In Abstract Wikipedia, people can create and maintain Wikipedia articles in a language-independent way. A Wikipedia in a language can translate this language-independent article into its language. Code does the translation.
Very cool. I’m fascinated by the Wolfram Language paradigm of Knowledge Base+Programming language=Computable everything (demo: https://youtu.be/3yrVuM2SYZ8). But I could never get into the Wolfram ecosystem because it’s totally proprietary. This makes me think, does Wikidata’s model (ontologies?) provide a way to recreate the Wolfram computable everything concept as an open community project?
Wikidata models assertions of facts about things and it has a very powerful system for writing queries to get at the facts and relationships you are interested in.
Here are some examples of what you can do with Wikidata:
Perhaps, but this seems to be moving towards a more holistic machine-readable article graph. If you look at a page from Wikidata [0], it seems to be basically a key-value database (e.g. earth.highest point = [ mount everest { from sea level, 8000m } ]), while the "full article" terminology used in the announcement seems like it may be even more connected/informative/structured than that.
I don't see any indication that Abstract Wikipedia articles are anything more than a sequence of "constructors" and those "constructors" are essentially just triples (with qualifiers) that a "renderer" turns into a specific human language.
The example they give is the constructor:
rank(SanFrancisco, city, 4, population, California)
And the English renderer will output:
"San Francisco is the fourth largest city by population in California."
Agreed, but my point was that the aim has always been to encode these facts and then mix them into wikipedia for any assertion / attribute, so that any fact is backed by an assertion.
I'm curious whether this new project has been driven in any way by the difficulty of integrating data from Wikidata into Wikipedia. It varies a lot by language, but the user communities are quite hostile to Wikidata in some cases. I think it's generally on the grounds that since Wikidata is a wiki, it can be easily vandalized and its data can't be trusted.
It assumes articles will say the same thing in every language, which to me means that edit wars can now proceed on a more global basis. You're no longer fighting only the people who feel comfortable enough with your language to edit in it, you're fighting anyone who can edit the article at all around the world.
Do the Hebrew Wikipedia and the Arabic Wikipedia agree on the status of Israel?
Unless they agree to disagree about the facts in a case where one side is factually right while the other has the consistency of a lie your 3-year-old would make up to stop you from discovering they ate all the cookies.
Agreeing to disagree is like saying there is no right side in the matter, which is okay for topics where there isn't one. Many topics, however, are not a matter of perspective; they are a matter of who is factually right.
IMO in such a case agreeing to disagree can often be destructive, because it legitimises a position which is factually wrong and constructs an illusion of balance where there is none.
If one side says it rains and the other says it doesn't, agreeing to disagree is wrong. If there is disagreement about what the facts mean, then sometimes both perspectives can be true at the same time. This is, however, not as often the case as I wish it were.
"If one side says it rains and the other says it doesn't, agreeing to disagree is wrong"
No. Agreeing to disagree is accepting that a subject is disputed, with different, even contradictory, opinions.
An AI could in theory extract both versions and present them as disputed, meaning there are different definitions of a word and different views on the correctness of the facts.
And "raining or not" is also not as simple as you think. A person from Spain will consider a certain state to be rain, which a British person might see as just some humidity in the air.
Let’s take the theory of evolution for a spin. Being a theory and not a hypothesis, it is a proven scientific fact. Yet a large chunk of the population chooses to not believe it to be true. So how do we encode this knowledge?
Option 1: we call it a controversy and make it sound like because some people don’t believe in it means it might not be correct.
Option 2: we state upfront that the theory of evolution is correct but link to an “incorrect but competing viewpoint” of creationism.
Option 3: we create three articles: one about the theory of evolution, one about creationism, and one about the disagreements between the creationists and the rest of the modern world.
I like option 3 best as it is the most complete picture of the three, and all it requires is the abstract concept of controversy or disagreement. I don’t know how you encode “dumbass” in this new format but it might be a useful concept to explain to aliens if they decide to visit Earth.
>Being a theory and not a hypothesis, it is a proven scientific fact
A theory is distinct from a hypothesis, but surely it isn't itself a fact either.
Wouldn't it be better to say that the theory of evolution explains a lot of facts, which we may also call (observed) evolution? Really, aren't we just calling two related concepts "evolution"?
We use evolution every day in pharmaceuticals. We use it in our crops and in domesticated animals. If it was under question we wouldn’t call it a theory. Evolution is as real as gravity except it has even more evidence and scientific understanding, while gravity still doesn’t play nice with quantum mechanics and of course general relativity defines it as something most people cannot intuit. Evolution is a fact. I don’t see two related concepts here. Besides, consider that the second closest hypothesis that explains life on earth is that a bearded man in the sky got bored one day and created the universe, then made a man and a woman and gave them a bunch of dinosaurs to play with but instead they played with a snake, an apple, and each other’s bodies until he kicked them out of his play garden because they didn’t play by his rules, so the two of them through tremendous amounts of incest populated the earth (a fact easily disproven by a number of methods including simple genetic testing). No I don’t think it would be better to call the theory of evolution anything but a proven fact. If people want to believe in fairy tales that’s fine. But that’s not a reason to cloud scientific discovery.
I don't know about other people, but your definition of "theory" doesn't match mine. To me the word "theory" is almost identical to "hypothesis", but generally a bit more comprehensive (e.g. it consists of multiple hypotheses). Calling something a theory doesn't require any proof nor that it be true.
There are a number of theories related to gravity. That doesn’t mean gravity isn’t a fact, it just means we don’t 100% understand how it works in all situations (eg quantum).
Similar for evolution - there are various scientific theories about the origins of species. And it is absolutely the case that we don’t know 100% how we got from Big Bang to here, thus the theories are still theories. If you just want to point to Darwinism and say fact, you’d be doing a large disservice to us all.
Gravity continued being the same thing before Newton, after Newton, and after Einstein. If we come up with a new theory of gravity, it's distinct from the phenomenon we observe. The planets are still going to go around and around, etc.
The semantic structure is the structure of the language (or the independent thing between languages). This does not automatically facilitate machine-understandable knowledge. You would need to write the code to understand it first, which is probably almost as difficult as understanding English, for example.
Only when relations are defined in a kind of Prolog style in those examples will it be usable as knowledge about things other than language.
A computer might spit out texts which it guesses are connected to some question you ask it, though. That does not mean it understands the relations between things.
But when can we ever really say "a machine understands" something? So perhaps there is not much of a difference?
While the implications are huge, it is hard to think about what is actually being done. XML and JSON are just languages, and the type of information could be stored in any number of ways.
From my point of view, the problem here is that you could say something like "'water' and 'heat' produces 'steam'", but knowledge is never that simple, and understanding of that information is even more complicated.
I would think that Abstract Wikipedia is not the first attempt to solve such a problem, and I am very curious to see what they come up with.
Weren't expert systems the second AI winter? In any case, I see the end of the cold war as being the driving factor behind that winter, whatever its ordinal.
Anyone who has studied old-school AI will know that this is an incredibly ambitious project; it is essentially throwing itself at the problem of "knowledge frames", i.e. how to encode information about the world in a way that an AI system can access it and, well, be intelligent about it. (Also at the problem of natural language generation, but as hard as that is, at the moment it seems like the easier of the two.)
But...
One of the biggest problems with a lot of the old "Big AI" projects that were developing some sort of knowledge frames (and there were several, and some of them still exist and have public faces) was, who the hell is going to get all the info in there in a way that's complete enough to be useful? Now you have a learning problem on top of the knowledge representation problem. But throw the wikimedia community at it and crowdsource the information?
It seems more similar to an elaborate version of the internationalization and translation of messages done in any program that targets multiple languages? If you think of it as a principled template language for generating text from the results of canned database queries, it starts seeming a lot more feasible. The templates themselves do need to be translated into every language, much like the messages in internationalization.
Ideally this enables something like an improved version of the ICU library, with a lot more data available.
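The i18n analogy can be made concrete with a tiny sketch (hypothetical code, not ICU's actual API): each language ships its own plural-selection rule and its own message forms, and the result of a canned query is poured into the right one.

# Per-language plural rules and templates (illustrative only).
PLURAL_RULES = {
    "en": lambda n: "one" if n == 1 else "other",   # English: "one" vs "other"
}
TEMPLATES = {
    "en": {"one": "{name} has {n} inhabitant.",
           "other": "{name} has {n} inhabitants."},
}

def render(lang, name, n):
    category = PLURAL_RULES[lang](n)
    return TEMPLATES[lang][category].format(name=name, n=n)

print(render("en", "Bland", 200))   # Bland has 200 inhabitants.

ICU's MessageFormat already handles plural categories roughly like this; the "improved version" imagined above would presumably extend the same idea to richer grammar (case, gender, word order).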
So do people find Wikidata that impressive? Here's what Wikidata says about Earth, an item that is number 2 in the ID list, and also on their front page as an example of incredible data.
I struggle to find anything interesting on this page. It is apparently a "topic of geography", whatever that means as a statement. It has a WordLift URL. It is an instance of an inner planet.
The first perhaps verifiable, solid fact, that Earth has a diameter of "12,742 kilometre", is immediately suspect. There is no clarifying remark, not even a note, that Earth is not any uniform shape and cannot have a single value as its diameter.
This is my problem with SPARQL, with "data bases", in that sense. Data alone is useless without a context or a framework in which it can be truly understood. Facts like this can have multiple values depending on exactly what you're measuring, or what you're using the measurement for.
And this on the page for Earth, an example that is used on their front page, and has the ID of 2. It is the second item to ever be created in Wikidata, after Q1, "Universe", and yet everything on it is useless.
I find it pretty well stuffed with appropriate information. You're looking at an ontology, not a wikipedia article, it's supposed to be dry (subject, relation, object). It's being used to disambiguate concepts, named entities and support machine learning models with general knowledge in a standard format. There are plenty of papers on the topic of link prediction, auto-completion and triplet mining.
> I find it pretty well stuffed with appropriate information. You're looking at an ontology, not a wikipedia article, it's supposed to be dry (subject, relation, object).
We're talking about a research project with a large amount of funding to go from the former to the latter. But pretty much none of the stuff on Earth's Wikipedia page is represented here.
> applies to part: equator
An equator (the general concept to which the ontology links) has no given orientation. Earth's Equator is a human construct distinct from an oblate spheroid's equator, as are the specific locations of the poles. Nowhere is it specified in the ontology that this is measured at a specific Equator, not just any equator.
This is all human context and understanding that we've built on top, and it's part of what I mean when I say that the data is kinda pointless. All of these facts depend on culture to understand.
I believe that in most modern human cultures the sentence "the diameter of the Earth" has a very imprecise, very informal, but very recognisable meaning. In fact, I really doubt that most people on the Earth would think of what precisely is the shape of the Earth when talking about its diameter.
Q2 is just an id, probably one shouldn't interpret too much into it except that it defines an entity. Regarding the diameter, probably it depends how you define it. For instance according to Wikipedia one can generalize it as sup { d(x,y) }, seems legitimate to me although Wikidata's referenced diameter definition (P2386) isn't that general, probably it should be updated... But to be fair, Earth (Q2) has the shape (P1419) oblate spheroid (Q3241540) under sourcing circumstances (P1480) approximation (Q27058) :-)
To me Wikidata (and similar projects like OSM) shine because they tend to have so many details.
I've worked with the Wikidata set a bit. On first glance the entries do seem to lack any useful information as it's all heavily abstracted into other items and properties - as well as containing a bunch of references and qualifiers to validate the facts.
Once you start connecting the items to other items and properties, you begin to see better information and context.
A lot of the "snaks" of items are units of measurement, so no worries converting them into other languages. This project should help in generating articles in other languages based on these facts.
I don't think it's interesting in itself so much as in its applications. I remember talking to someone once who was working on a project where you stick a probe in some soil, and then it uses Wikidata to tell you the best type of plant to grow. I have no idea what ever happened to this project, or whether it worked, but it always struck me as a great example of the enabling value of Wikidata: you can use it to power ideas totally unrelated to the original purpose the data was collected for.
I recently learned that words and translations from Wiktionary are in Wikidata's graph as well, which enables e.g. this simple lemmatizer: https://tools.wmflabs.org/ordia/text-to-lexemes (The Wikidata query it uses is linked at the bottom.)
It's the identifiers that make querying Wikidata difficult, IMO. SPARQL is pretty easy, certainly no more difficult than SQL. It might even be easier than SQL since there are no joins.
I found the hardest part of SPARQL is forgetting my SQL knowledge. It's a very different query language from SQL, but some constructs look similar, and it's very easy to confuse yourself into thinking a construct does the same thing as in SQL when it really doesn't.
> It might even be easier than SQL since there are no joins.
Every dot between Triple Patterns in a Basic Graph Pattern is actually a JOIN; you just don't need to worry about using them.
As for the identifiers, you get used to them if you work regularly with them, and query.wikidata.org actually has completion for identifiers if you press CTRL-Space.
Hi, founder of Diffbot here. We are an AI research company spun out of Stanford that generates the world's largest knowledge graph by crawling the whole web. I didn't want to comment, but I see a lot of misunderstandings here about knowledge graphs, abstract representations of language, and the extent to which this project uses ML.
First of all, having a machine-readable database of knowledge (i.e. Wikidata) is no doubt a great thing. It's maintained by a large community of human curators and always growing. However, generating actually useful natural language from an abstract representation that rivals the value you get from reading a Wikipedia page is problematic.
If you look at the walkthrough for how this would work (https://github.com/google/abstracttext/blob/master/eneyj/doc...), this project does not use machine learning and uses CFG-like production rules to generate natural sentences. That works great for generating toy sentences like "X is a Y".
However, human languages are not programming languages. Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (Those who have taken a grammar class can relate to the number of exceptions to the ruleset.)
Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format. Plenty of text is opinion, subjective, or describes notions that don't have a proper entity. Of course there are ways to engineer around this; however, they will exponentially grow the complexity of your ontology and number of properties, and make for a terrible user experience for the annotators.
A much better and more direct approach to the stated intention of making the knowledge accessible to more readers is to advance the state of machine translation, which would capture nuance and non-facts present in the original article. Additionally, exploring ML-based ways of NL generation from the dataset this will produce will have academic impact.
> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (...)
> Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format.
It doesn't seem like the goal of this project is to describe those languages, or to represent every sentence in a typical Wikipedia article. The goal doesn't seem to be to have all Wikipedia articles generated from Wikidata, but rather to have a couple of templates on the order of "if I have this data available about this type of subject, generate this stub article about it". That would allow the smaller Wikipedia language editions to automatically generate many baseline articles that they might not currently have.
For example, the Dutch Wikipedia is one of the largest editions mainly because a large percentage of its articles were created by bots [1] that created a lot of articles on small towns ("x is a town in the municipality of y, founded in z. It is nearby m, n and o.") and obscure species of plants. This just seems like a more structured plan to apply that approach to many of the smaller Wikipedias that may be missing a lot of basic articles and are thus not exposing many basic facts.
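The bot-stub pattern described above is essentially template filling; a minimal sketch in Python (my own illustration, not how the Dutch bots or Abstract Wikipedia actually generate text):

# Fill a fixed stub template from structured fields (illustrative only).
def town_stub(name, municipality, founded, neighbours):
    # join the neighbour list as "m, n and o"
    nearby = neighbours[0] if len(neighbours) == 1 else \
        ", ".join(neighbours[:-1]) + " and " + neighbours[-1]
    return (f"{name} is a town in the municipality of {municipality}, "
            f"founded in {founded}. It is nearby {nearby}.")

print(town_stub("X", "Y", "Z", ["M", "N", "O"]))
# X is a town in the municipality of Y, founded in Z. It is nearby M, N and O.

The hard part Abstract Wikipedia adds on top is making the template itself language-independent, so each language renders the same fields with its own grammar rather than its own hand-written template.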
This is addressed in the white paper describing the project's architecture:
10.2 Machine translation
Another widely used approach — mostly for readers, much less for contributors — is the use of automatic translation services like Google Translate. A reader finds an article they are interested in and then asks the service to translate it into a language they understand. Google Translate currently supports about a hundred languages — about a third of the languages Wikipedia supports. Also the quality of these translations can vary widely — and almost never achieves the quality a reader expects from an encyclopedia [33, 86].
Unfortunately, the quality of the translations often correlates with the availability of content in the given language [1], which leads to a Matthew effect: languages that already have larger amounts of content also feature better results in translation. This is an inherent problem with the way Machine Translation is currently trained, using large corpora. Whereas further breakthroughs in Machine Translation are expected [43], these are hard to plan for.
In short, relying on Machine Translation may delay the achievement of the Wikipedia mission by a rather unpredictable time frame.
One advantage Abstract Wikipedia would lead to is that Machine Translation systems can use the natural language generation system available in Wikilambda to generate high-quality and high-fidelity parallel corpora for even more languages, which can be used to train Machine Translation systems which can then resolve the brittleness a symbolic system will undoubtedly encounter. So Abstract Wikipedia will increase the speed at which Machine Translation will become better and cover more languages.
Additionally of course Google Translate is a proprietary service from Google, and Wikimedia projects can't integrate it in any way without abandoning their principles. It's left for the reader to enter pages into Google Translate themselves, and will only work as long as Google is providing the service.
What is the quality of open source translation these days?
>> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them.
Is that really true? If natural languages have rules, then there exists a ruleset that can describe any natural language: the set of all rules in that language. Of course, a "rule" is a compact representation of a set of strings, so if natural languages don't have such rules it's difficult to see how any automated system can represent a natural language "compactly". A system without any kind of "rules" would have to store every grammatical string in a language. That must be impossible in theory and in practice.
If I may offer a personal perspective, I think that the goal of the plan is to produce better automated translations than is currently possible with machine translation between language pairs for which there are very few parallel texts. My personal perspective is that I'm Greek, and I am sad to report that basically translation from any language to Greek by e.g. Google Translate (which I use occasionally) is laughably, cringe-inducingly bad. From what I understand, the reason for that is not only the morphology of the Greek language, which is kind of a linguistic isolate (as opposed to, say, Romance languages), but also that, because there are not many parallel texts between most languages (on Google Translate) and Greek, the translation goes through English, which results in completely distorted syntax and meaning. Any project that can improve on this sorry state of affairs (and not just for Greek; there are languages with many fewer speakers and no parallel texts at all, not even with English) is worth every second of its time.
To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.
> To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.
The training data for machine translation models is also human-created. Given some fixed amount of human hours, would you rather they be spent annotating text that can train a translation system that can be used for many things, or a system that can just be used for this project? It all depends on the yield that you get per man-hour.
As the paper I quote below says, the system that would result from this project could be re-used in many other tasks, one of which is generating data for machine translation algorithms.
I think this makes sense. The project aims to create a program, basically ("a set of functions"). There are, intuitively, more uses for a program than for a set of labelled data.
> Of course there are ways that engineer around this, however they will exponential grow the complexity of your ontology, number of properties, and make for a terrible user experience for the annotators.
So, the obvious solution is to create robo-annotators, and that's what your company is supposedly trying to do?
> The project will allow volunteers to assemble the fundamentals of an article using words and entities from Wikidata. Because Wikidata uses conceptual models that are meant to be universal across languages, it should be possible to use and extend these building blocks of knowledge to create models for articles that also have universal value. Using code, volunteers will be able to translate these abstract “articles” into their own languages. If successful, this could eventually allow everyone to read about any topic in Wikidata in their own language.
This is a great idea. I bet the translations will be interesting as well. I was wondering about how the translation was going to work and it looks like they thought of that as well. They're going to use code to help with the translation.
> Wikilambda is a new Wikimedia project that allows to create and maintain code. This is useful in many different ways. It provides a catalog of all kind of functions that anyone can call, write, maintain, and use. It also provides code that translates the language-independent article from Abstract Wikipedia into the language of a Wikipedia. This allows everyone to read the article in their language. Wikilambda will use knowledge about words and entities from Wikidata.
Pretty-printing the abstract content into an arbitrary target language (a better way of putting it than "translation") would be quite the challenge, because "conceptual models" do vary by language. One can attempt to come up with something that's "as abstract/universal as possible" but it remains to be seen how practically useful that would be.
For that matter, making the source model "logical" and "compositional", as implied by the Wikilambda idea, only opens up further cans of worms. Linguists and cognitive scientists have explored the idea of a "logical" semantics for natural language, even drawing on the λ-calculus itself (e.g. in Montague grammar and Montague semantics), but one can be sure that a lot of complexity will be involved in trying to express realistic notions by relying on anything like that.
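For readers who haven't met Montague-style semantics, the flavor is roughly this kind of function composition, here as a Python toy rather than the λ-calculus, and wildly simplified:

# Toy compositional semantics: word meanings are functions, and the meaning of
# a sentence is their composition, evaluated against a tiny model.
DOMAIN = {"San Francisco", "San Jose", "California"}
CITIES = {"San Francisco", "San Jose"}

def is_a_city(x):            # meaning of the predicate "is a city"
    return x in CITIES

def every(restrictor):       # meaning of the quantifier "every"
    return lambda body: all(body(x) for x in DOMAIN if restrictor(x))

# "Every city is a city" -> True; "Every city is California" -> False
print(every(is_a_city)(is_a_city))
print(every(is_a_city)(lambda x: x == "California"))

The point above stands: scaling anything like this from toy sentences to realistic encyclopedic prose is where the complexity explodes.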
I didn't assume the translations would be lossless. It's obvious there will be conceptual mismatches but that's why this is interesting. Because when the abstract model is made concrete people can notice the gaps and improve the abstract model. I can imagine a feedback loop that improves both the abstract and concrete/translated models as people work on improving both to reduce the conceptual gaps between the abstract and concrete models.
I’m starting to feel the structure and content of “abstract content” is going to be quite like the Wikipedia pages in all the target languages zipped into a single archive, plus overhead...
Sorry to be the typical pessimistic HN commenter (e.g., Dropbox is just ftp), but this seems ambitious enough to remind me of https://en.wikipedia.org/wiki/Cyc.
Even Wikidata today is already a lot more usable and scalable than Cyc. The latter always seemed like a largely-pointless proof of concept; Wikidata by contrast is very clearly something that can contain real info, and be queried in useful ways. (Of course knowledge is not always consistently represented, but that issue is inherent to any general-purpose knowledge base - and Wikidata does at least try to address it, if only via leveraging the well-known principle "many eyes make all bugs shallow".)
It is well known that Wikidata does not scale, whether in terms of the number of data contributions or the number of queries. Not only that, but the current infrastructure is... not great. WBStack [0] tries to tackle that, but it is still much more difficult to enter the party than it could be. Changes API? None. That means it is not possible to keep track of changes in your own Wikidata/Wikibase instance improved with some domain-specific knowledge. Change-request mechanic? Not even on the roadmap. Neither is it possible to query for the history of changes over the triples.
The Wikidata GUI can be attractive and easy to use. Still, there is a big gap between the GUI and the actual RDF dump; that is, making sense of the RDF dump is a big endeavor. Who else wants to remember properties by number? It might be a problem of tooling. Question: how do you add a new type of object to the GUI? PHP? Sorry.
> Neither is it possible to query for history of changes over the triples.
And why should it be? The triples (and hence the full RDF dump as well) are a "lossy" translation of the actual information encoded in the graph (there are actually two different translations: the "truthy" triples that throw away large parts of the data, and the full dump that reifies the full statements but is therefore much more verbose). Revision history for the _actual_ items has been queryable via the MediaWiki API for a long time.
Agreed. "[since 1982,] by 2017 [Lenat] and his team had spent about 2,000 person-years building Cyc, approximately 24 million rules and assertions (not counting "facts") and 2,000 person-years of effort." https://en.wikipedia.org/wiki/Douglas_Lenat
Because Cyc is not seen as having been successful, so comparing a new project to it implies that Abstract Wikipedia won't be successful either. And, of course, all new approaches in each discipline fail, until sometimes they start succeeding.
Cyc got hyped for a while in the early 90s. It became apparent, however, that rule-based AI wasn't going to play as big a role as ML in the future of AI research. It still exists, but the company is really secretive and hasn't released anything viable in years.
[edit: I wasn't alive back then, so most of what I know comes from the Wikipedia article and a recent HN thread: https://news.ycombinator.com/item?id=21781597 . My view of Cyc probably comes across as slightly negative. Their (Cycorp) view seems to have evolved since then, and they seem to be creating some really interesting stuff.]
I hope at least 20-30% of the people involved in the project are at least near-native level speakers of non-Indo-European languages. Linguistic biases based on your mother tongue die hard, and I know this from having waded through tons and tons of software designed with biases built-in that woefully disregard Asian syntax, typography, input, grammar, semantics, etc etc etc. As the whole point of the project is multilingual support, I really hope the developers don’t underestimate how grammatically and semantically distant different language families can be.
I think a consistent multilingual Wikipedia is a fantastic goal.
But I'm not sure this is the right way to do it.
Given that most of the information on Wikipedia is "narrative", and doesn't consist of facts contained in Wikidata (e.g. a history article recounting a battle, or a movie article explaining the plot), the scope for this will be extremely limited. The creators are attempting to address this by actually encoding every single aspect of a movie's plot as facts, with sentences as functions that express those facts... but this seems entirely unwieldy and just too much work.
What I've wished for instead, for years, is actually an underlying "metalanguage" that expresses the vocabulary and grammatical concepts in all languages. Very loosely, think of an "intermediate" linguistic representation layer in Google Translate.
Obviously nobody can write in that directly in a user-friendly way. But what you could do is take English (or any language) text, do an automated translation into that intermediate representation, then ask the author or volunteers to resolve all ambiguous language cases -- e.g. it would ask if "he signed" means made his signature, or communicated in sign language. It would also ask for things that would need clarification perhaps not in your own language but in other languages -- e.g. what noun does "it" refer to, so another language will know to use the masculine or feminine version. All of this can be done within your own language to produce an accurate language-agnostic "text".
Then, out of this intermediate canonical interpretation, every article on Wikipedia would be generated back out of it, in all languages, and perfectly accurately, because the output program isn't even ML, it's just a straight-up rule engine.
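A data-structure sketch of that intermediate representation plus the clarification step (entirely invented here, just to make the idea concrete):

# The intermediate form stores senses and referents explicitly; the authoring
# tool asks the writer to resolve whatever the automatic pass could not.
sentence = {
    "predicate": {"lemma": "sign", "sense": None},  # signature or sign language?
    "agent": {"ref": "person#1", "gender": None},   # needed for he/she/il/elle...
}

def unresolved(node):
    return [slot for slot, value in node.items() if value is None]

for role, node in sentence.items():
    for slot in unresolved(node):
        # In a real tool this would be an interactive prompt to the author.
        print(f"Please clarify the '{slot}' of '{role}'.")

Once every slot is filled, the language-specific rule engines described above have everything they need to pick the right word sense and the right grammatical forms.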
Interestingly, an English-language original might be output just a little bit different but in ways that don't change the meaning. Almost like a language "linter".
Anyways -- I think it would actually be doable. The key part is a "Google Translate"-type tool that does 99% of the work. It would need manual curation of the intermediate layer with a professional linguist from each language, as well as manually curated output rules (although those could be generated by ML as a first pass).
But something like that could fundamentally change communication. Imagine if any article you wanted to make available perfectly translated to anyone, you could do, just with the extra work of resolving all the ambiguities a translating program finds.
>The creators are attempting to address this by actually encoding every single aspect of a movie's plot as facts, with sentences as functions that express those facts... but this seems entirely unwieldy and just too much work.
Doesn't this also get into issues where facts aren't clearly defined? I can think of a lot of interpretation of meaning from my literature classes, but there are also questions such as ownership of land at contested borders, whether something was a legal acquisition or theft, or even coming up with a factual distinction between when something is grave robbery vs. archaeology. A personal favorite would be mental illness, especially with some of the DSM-5 changes that have largely been rejected (or outright ignored) by society. And there are all sorts of political disagreements.
And as this applies to different languages, and different languages are likely aimed at different cultures and different nations, this gets messy. I could see some differences between an article written in Hindi vs. Chinese concerning issues involving both China and India. Creating a common language will force a unification of differences that currently exist in a sort of stalemate, with each linguistic side maintained by the dominant country for that language.
So Wikidata can handle conflicting information by collecting all of it, but clearly separating the different viewpoints in a kind of "split brain". That works so long as the different sides can agree that their opponent's views are what they state they are.
In an Abstract Wikipedia article, that means that all viewpoints with a sufficiently large userbase might end up represented equally in all language versions, but they'll still be clearly distinguished as such, so the reader can apply their own value judgments to support their own side's viewpoint over that of their enemies.
Then it is not English but a subset of English. There are already translation systems that work the way you describe, using restricted natural language grammars.
To the contrary, it would be a superset. The point is that it is not restricted, but would also require additional distinctions that aren't even present in English.
Although, there may be some cases where truly identical meanings in English are collapsed into one. For example, the arguable lack of difference between "it isn't" and "it's not". (Things that are quotations could be marked up as such, to not disturb the text in its original language, but quotations could still be translated into other languages.)
Even if you wrote the articles in a subset of English like the https://simple.wikipedia.org/wiki/Main_Page and then used ML to translate into other languages and then formed a feedback loop with the translation so that the original author could have some assurance that the translated texts were valid, this would be huge.
>What I've wished for instead, for years, is actually an underlying "metalanguage" that expresses the vocabulary and grammatical concepts in all languages.
Thanks for the reference. What I'm talking about is quite different, though.
Leibniz's proposal and similar ones are about a language intended to represent knowledge (formal concepts), and usable by people (with training).
I'm talking about simply a translation layer. It doesn't "know" or symbolize anything on its own -- it's merely a central mapping between all constructs in different languages. Nobody would read or write in it directly, and it couldn't be used to make inferences or anything like that.
What Abstract Wikipedia is proposing is actually quite similar to what Leibniz proposed, an intricately detailed map of knowledge and true propositions.
What I'm proposing is vastly less ambitious and vastly more useful: just a way to store the meaning of text in an intermediate representation that belongs to no single language. But that representation doesn't bother with what are facts or propositions or anything like that -- just entities and relations that map to however we ultimately understand languages in the real world. And assisted with a ton of ML, rather than built up from elementary propositions or set theory or something.
I think ultimately this endeavor would run into the same issues which befell the efforts of the GOFAI research program. A translation layer would need to "understand" embedded higher-order contexts of language making implicit references to the world, and which vary by different languages and cultures. Such contexts are not analytically decomposable in a manner which permits a centralized mapping.
As with GOFAI, a formalized representation of meaning entails certain ontological assumptions about "how the world hangs together". These assumptions often hold true for an isolated domain of inquiry but falter when dealing with more complex phenomena. I could, however, see such a system useful for 'Simple Wikipedia' articles which largely consist of existentially quantifiable sentences.
There is a wealth of philosophical scholarship on the issues of semantics and formal representation; I recommend the works of Hubert Dreyfus if you want to dig into this more.
Hasn't it been suggested that the internals of Google Translate and similar systems have effectively constructed such a meta-language?
It seems like you could construct a meta-linguistic sentence by writing a phrase in one language and choosing from possible translations in additional languages, which would hone the cross-linguistic meaning. You might not even have to define a representation of the metalanguage: you could just keep a set of (phrase, score) tuples and interpolate translations of those phrases into the target language at display time. That almost sounds practical to me.
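Something like this, maybe (the phrase table below is invented):

    # Keep scored candidate translations per (phrase, language) pair and pick the
    # best-scoring one at display time; no explicit metalanguage is ever defined.
    phrase_table = {
        ("bank of a river", "de"): [("Flussufer", 0.92), ("Bank", 0.31)],
        ("bank of a river", "fr"): [("rive", 0.88), ("banque", 0.12)],
    }

    def render(phrase: str, lang: str) -> str:
        candidates = phrase_table.get((phrase, lang), [])
        if not candidates:
            return phrase  # fall back to the source phrase
        best, _score = max(candidates, key=lambda c: c[1])
        return best

    print(render("bank of a river", "de"))  # -> Flussufer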
>Hasn't it been suggested that the internals of Google Translate and similar systems have effectively constructed such a meta-language?
No, Google Translate and other state-of-the-art machine translation systems rely on large corpora of linguistic data from which they make statistical inferences. They do not have a 'centralized metalanguage' which ontologizes meaning in the manner described by OP.
There could be some very interesting meta analytics that could be done on knowledge structured in this way. For example, this research which identifies the structural differences in the fact graphs of conspiracy theories vs accurate accounts: https://phys.org/news/2020-06-conspiracy-theories-emergeand-...
The connections capture context-specific relationships, such as co-occurrences. The "fall apart" part comes from the fact that conspiracy theories rely on hidden, unsubstantiated, subjective interpretations of intent or actions whose validity can be questioned. If they are key pillars of the narrative, then their falsity can negate the truth of the narrative.
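A toy way to see the "key pillar" point, using networkx to check whether the narrative graph stays connected (the nodes and edges here are invented):

    import networkx as nx

    g = nx.Graph()
    g.add_edges_from([
        ("event A", "secret meeting"),   # the unsubstantiated pillar
        ("event B", "secret meeting"),
        ("event A", "event C"),
    ])

    print(nx.is_connected(g))   # True: the narrative hangs together
    g.remove_node("secret meeting")
    print(nx.is_connected(g))   # False: without the pillar, it falls apart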
This reminds me of a philosophical discussion around what "truth" means. Coherence theory of truth: truth is defined as a property that's coherent among a set of beliefs. It can also be used as an epistemic justification -- that is, any set of internally consistent beliefs can be taken as true. Of course, in practice, certain truth statements have to correspond to reality, which is where the correspondence theory of truth comes in.
I truly love the mobile design of Wikipedia and find myself adding ".m" to every link that I visit on Wikipedia. It has larger fonts, more readable copy (for me at least), and works great on mobile. Surprisingly, the trick worked with this one as well!
Because it is limited in design. It's missing the sidebar (which has useful links), the discussion page, and the history page, and it doesn't have account links. Also, the more compact text of the desktop version is preferable to some.
You can automatically redirect to the mobile version if you want using a user script. I'm using a similar one for the reverse.
On mobile, I struggle to remove the m. from every link.
The mobile version does not work well with the search feature, and I never took the time to figure out where you switch to another language.
I'm not very involved in Wikipedia politics so I might be wrong here, but my perception is that the desktop Wikipedia has a lot of eyes on it and any change is received with an "aahh, change is scary" response.
The people who react negatively to change are also the people who don't like mobile versions of websites, so the mobile site is freer to experiment and evolve its design.
Desktop is basically the full admin interface for a Google-scale website with only rudimentary authentication requirements and a default-allowed policy. That works.
I truly hate that I have to deal with deleting the .m off other people's links. It's so ugly, it has these disgustingly oversized fonts, and because all the sections are hidden by default, it works badly on mobile and even worse on a real computer.
How come I can't just stop seeing that loathsome mobile design, both on my desktop and on my phone?
Note that the mobile skin is separate from the mobile site. If you set the Minerva Neue skin in your preferences, you will get it on desktop (just the skin; there are non-skin differences with the mobile site that you won't get).
As for why it's not the default: I imagine it's partly because the mobile site tends to be unpopular with power users (of course, opinions vary).
I seem to remember it used to be a foldout thingie as you scrolled down (at which point you might as well just hit the 'Desktop' link at the bottom). But apparently no more! I just found it again after some intense staring - it's the icon with two characters on the far left of the bar right under the article title.
A Wikipedia Signpost article[1] gives a more detailed overview of the goals of the project, but it also made me think of an interesting failure case. From the article:
> Instead of saying "in order to deny her the advantage of the incumbent, the board votes in January 2018 to replace her with Mark Farrell as interim mayor until the special elections", imagine we say something more abstract such as elect(elector: Board of Supervisors, electee: Mark Farrell, position: Mayor of San Francisco, reason: deny(advantage of incumbency, London Breed)) – and even more, all of these would be language-independent identifiers, so that thing would actually look more like Q40231(Q3658756, Q6767574, Q1343202(Q6015536, Q6669880)).
But Q1343202 doesn't mean "denial" as in "preventing someone else from getting something", it means "denial" as in "refusing to accept reality". (See [2].) The two concepts are represented by the same word in English, but they might not be in other languages.
It seems like it'd be kind of tricky to create an interface that ensures other English-speaking editors indicate the right meaning of "denial".
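One option would be an interface that forces the editor to pick an explicit sense rather than typing the ambiguous English word. A hypothetical sketch (the sense identifiers are invented placeholders, not real Q-ids, and this is not the actual Abstract Wikipedia design):

    from enum import Enum

    class DenySense(Enum):
        WITHHOLD_FROM = "deny_sense:withhold-something-from-someone"
        REJECT_AS_FALSE = "deny_sense:reject-a-claim-as-false"

    def deny(sense: DenySense, what: str, whom: str) -> dict:
        """Build an abstract 'deny' node with an explicit sense, so renderers for
        languages that use different words for the two meanings can pick correctly."""
        return {"constructor": sense.value, "what": what, "whom": whom}

    print(deny(DenySense.WITHHOLD_FROM, "advantage of incumbency", "London Breed"))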
I think the answer is to be as clear as possible in the interface, but also to accept that mistakes will be made. People make grammar mistakes in (normal) Wikipedia all the time, and then other people come along and fix them. I expect the same will occur here.
This is of course an interesting idea, but it has a number of huge technical hurdles to overcome. Here is the biggest:
Right now, if you want to become an editor of Wikipedia, you simply need to have a passing familiarity with wikitext, and how the syntax of wikitext translates into the final presentation of the article.
However, if you want to become an editor of Abstract Wikipedia, you'd need to have an in-depth knowledge of lambda calculus, and possibly a Ph.D. in linguistics. Without a quantum leap in editing technology and accessibility for beginners, there's little hope for this to gain any traction.
> Right now, if you want to become an editor of Wikipedia, you simply need to have a passing familiarity with wikitext
Wikipedia has had a WYSIWYG editor for years.
> if you want to become an editor of Abstract Wikipedia, you'd need to have an in-depth knowledge of lambda calculus, and possibly a Ph.D. in linguistics
No, this is not how it's intended. First, the data itself is supposed to come from Wikidata, which is super simple to edit. Second, surely they can come up with a UI for the other parts.
> surely they can come up with a UI for the other parts
Those "other parts" is the huge hurdle that I'm referring to, and can't be hand-waved away. There are already tools that can take Wikidata and transform it into human-readable articles [1]
But it's not at all obvious how to build a simple UI for writing completely abstract lambda expressions that take arbitrary data and apply linguistic nuances to produce readable text with correct grammar.
It's not just writing code, it's writing code that needs to be aware of every linguistic nuance of your native language, so that you can coax the data to come out as a human-readable sentence. [1]
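Even a toy renderer shows how quickly the grammar creeps in. The sketch below is nothing like the project's actual code, but it already needs per-language plural tables just to render a single sentence pattern:

    # Render a simple count statement; all data here is invented for illustration.
    PLURALS = {
        "en": {"museum": ("museum", "museums")},
        "de": {"museum": ("Museum", "Museen")},   # German plural is irregular
    }

    def render_count(lang: str, noun: str, n: int) -> str:
        singular, plural = PLURALS[lang][noun]
        word = singular if n == 1 else plural
        if lang == "en":
            return f"The city has {n} {word}."
        if lang == "de":
            return f"Die Stadt hat {n} {word}."
        raise NotImplementedError(lang)

    print(render_count("en", "museum", 5))  # The city has 5 museums.
    print(render_count("de", "museum", 5))  # Die Stadt hat 5 Museen.

And this still ignores languages with more than two plural categories, gendered articles, case marking, and so on.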
The research article doesn't mention UNL a single time, despite it being a really similar effort (encoding texts in an abstract representation which tools then use to generate translations automatically in various languages). The hard part of the project is not encoding facts into cute little RDF triples (that's the super easy part, and as usual that's where the SemWeb researchers put their focus); it's generating natural language from the abstract representation.
This means precise linguistic information must be present in the abstract representation to generate correct sentences. Spoiler: it seems absent, and the renderings presented in the paper are very basic. The data part of the project seems OK, but I predict it won't go well because the NLP part is largely ignored.
I used Google Translate on the English sentences here and it output the exact same German sentences. I feel like this is already a somewhat solved problem.
It's also interesting to note that bots on some Wikipedias are already the largest contributors of articles in those languages. The Swedish, Waray, and Cebuano Wikipedias already have an estimated "between 80% and 99% of the total" all written by one bot, Lsjbot [1].
I wonder if Lsjbot has increased the number of single-contribution users. Wikipedia (or EN Wikipedia anyway) gates article creation but not editing. If other Wikipedias do that as well, then single-edit users won't be able to create an article and hence can't contribute. But if Lsjbot has created the stub, then people can contribute.
This is what I hope the future of Wikipedia looks like. If all "facts" are stored in Wikidata and pulled from there by individual articles, it would be simple to keep things up to date. I'd love to see Wikidata grow to encompass all sorts of things - citations would be especially interesting and it could potentially solve the problem of an article citing the same source multiple times.
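The query service already makes the "pull facts from Wikidata" part possible today. A sketch against the public SPARQL endpoint (assuming Q64 is Berlin and P1082 is the population property; worth double-checking both before relying on them):

    import requests

    QUERY = """
    SELECT ?population WHERE {
      wd:Q64 wdt:P1082 ?population .   # Q64 = Berlin, P1082 = population (assumed)
    }
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "abstract-wikipedia-example/0.1"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["population"]["value"])

The hard part, as others point out, is turning such values into fluent sentences in every language, not fetching them.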
To increase the availability of knowledge for speakers of less popular languages. Once encoded in Abstract form, it can be made available in every human language.
That is an improvement over the current situation where knowledge is concentrated in just a few of the most popular languages.
There are many more English articles on Wikipedia than articles in any other language, even though there are more non-English speakers in the world.
To me, it seems this project will allow for at least "stub" articles in essentially every-other-language which at the very least provides some basic information about each entity to a reader in their preferred language.
Will this mean it will be like the bot-generated Wikipedias (like the Cebuano Wikipedia), except done by a Wikidata-powered template? It might work for basic data-based facts like populations of villages, but what about more complicated statements?
> Such a translation to natural languages is achieved through the encoding of a lot of linguistic knowledge and of algorithms and functions to support the creation of the human-readable renderings of the content
Might be a good idea, but the multilingual argument doesn't convince me one bit. If this project is any useful, it won't be because of its multilingual part.
Any person worth reading in STEM fields already knows English, and I don't know why anyone would want to read Wikipedia in any language other than English.
I'm Latin American. I used the Internet in Spanish in my early teens before learning English, and it's a joke compared to the English Internet. I don't even like English from a grammatical and phonetic point of view, but trying to cater to the non-English-speaking public seems like a waste of time in 2020. Just learn English already if you don't know it; it will be a much better use of your time than reading subpar material in another language.
> Any person worth reading in STEM fields already knows English, and I don't know why anyone would want to read Wikipedia in any language other than English.
> Just learn English already if you don't know it
Some people are merely fine at English, or uncomfortable reading "casually" in their second/third languages...
It's actually not unreasonable for someone to want learning content in their native language. And there are loads of opportunities to try out new content when people in different places are writing content in different languages, with new angles and takes.
For example the best intro to LaTeX is a book originally written in French[0].
And sometimes content just makes better sense in other languages because primary materials will be in that language (if you had the choice, would you rather read about the great Tokyo Fire in English or in Japanese?)
Sure, having access to English content is really important! But trying to have multilingual content is normal.
I often read Wikipedia in several languages, because the differences between the articles sometimes offer almost as many bits of information as the commonalities.
What amazes me are the small-audience ones. For instance, who uses https://pdc.wikipedia.org/wiki/Haaptblatt given that most of the native speakers of that dialect adhere to a religion which mandates that cell phones belong, not on one's person, but in the barn, and furthermore be used strictly for business?
> Any person worth reading in STEM fields already knows English, and I don't know why anyone would want to read Wikipedia in any language other than English
That's not the goal of Wikipedia. The goal is to make knowledge freely available to as many people as possible. It's fantastic that you were given the opportunity to learn English and have made the best of it, but that is not everyone's situation. It's not reasonable to expect a farmer in Malaysia to know English or to learn it purely to take advantage of what's there.
On the contrary, Wikipedia is a wonderful multilingual resource.
English is one of my mother tongues but I regularly read Wikipedia in half a dozen other languages to improve my knowledge of them. Being able to cross-reference what you're reading to the English version of the text, even though it's not a literal translation, gives valuable context as well as perspective on how a topic is viewed by different language groups.
It's also often more useful than a dictionary for finding the name of a flower or fish in another language. Or even some topics that you wouldn't find in a dictionary at all.
What's the point of this with the current high quality of state-of-the-art machine translation? Don't we expect machine translation to surpass humans in the near future?
People who are domain experts in various fields don't know how, don't care to, and shouldn't code. They should just edit the articles in natural language.
A lot of the content of Wikidata isn't numbers and is natural language also, so you'd still need to (machine?) translate it. But this time the machine translation algorithm would not have the benefit of the long-term context from the encompassing paragraph.
There are too many reasons why this is a bad idea. Almost makes me mad.
Where is the high quality machine translation? I spend most of my time in countries where I don't speak the same language as the majority of people and text that I encounter, so I am using machine translation many times per day. My experience of the average quality of machine translation is extremely low. It garbles meaning a majority of the time, and in a significant minority of cases destroys meaning completely.
To me the idea that you could translate an encyclopedia, where accuracy of meaning is critical, using such technology in its current state is horrifying. By contrast the abstract/semantic approach seems to have some potential, although I can't imagine it working well for all articles.
> People who are domain experts in various fields don't know how, don't care to, and shouldn't code.
My impression (correct me if I'm wrong, though) is that this is less "domain experts writing code" and more "editors can export a Wikidata 'subject' to a prebuilt translated page that domain experts can later expand". The aim of Wikidata is to collect information in a language-agnostic way, whereas the aim of Abstract Wikipedia seems to be to take that information and turn it into autogenerated pages in whatever language (even if that page is more-or-less a stub).
I’m pretty interested in this actually. Although I’m not part of the Wikidata community, it would be interesting to see which language groups dictate the most involvement.
As a regular user of at least four Wikipedias, this seems like a very attractive direction. Interested to see whether it produces the outcomes it's designed for.
> As the set of information encoded into Wikipedia approaches the sum total of human knowledge
An outcome Wikipedia will never get close to. They'll never reach 1% of the way there; closer to 0% of human knowledge gets recorded than 1%, and of the knowledge that is recorded, only a small fraction will end up on Wikipedia. Most of what gets recorded is universal or widely experienced knowledge, which is a minuscule subset of "the sum total of human knowledge."
Wikipedia has already begun to stagnate badly. For the most part, it's over. That's why they're attempting Abstract Wikipedia now (aka another round of the failed, insular Semantic Web for elite wiki nerds that won't accomplish much of anything for the average reader who wants to learn something); it's why Wikimedia wants to rebrand itself as Wikipedia; and it's why their system is being overtaken by partisan politics (as momentum continues to decline, the system will rot and pull apart in various negative ways at an accelerating clip). The growth is running out, and the Wiki bureaucracy wants to keep expanding; that's what this is about.
Anything more concrete you can link to for further reading on this? I understand the difficulty in quantifying such claims and measures but I'd appreciate reading something that attempts to do so objectively.
I didn't quite understand what the proposal was until I saw the examples at https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Examples