Citation File Format

benrbray · on Aug 21, 2021

What is the advantage of this format over CSL-JSON [1] or BibTeX, which are already supported by software like pandoc [3]? There is even a standard YAML citation format [2] used within YAML metadata for Markdown files, so I wonder why GitHub couldn't have chosen one of the existing options, lobbying the spec maintainers if any changes were needed.

[1] https://citeproc-js.readthedocs.io/en/latest/csl-json/markup...

[2] https://ymlthis.r-lib.org/reference/yml_reference.html

[3] https://pandoc.org/MANUAL.html#citations

grenoire · on Aug 21, 2021

BibTeX might not be perfect, but damn do I love being able to copy-paste pre-generated entries from practically all journals on Earth.

sdruskat · on Aug 21, 2021

Now including GitHub, that gives you BibTeX generated from CFF files :).

xucheng · on Aug 21, 2021

Since many papers have existing bibtex, it would be better if GitHub either provides tools to convert bibtex to CFF or supports bibtex directly.

FWIW, bibtex is the de facto standard.

sdruskat · on Aug 21, 2021

Not for providing citation metadata for software, where BibTeX misses important fields.

xucheng · on Aug 21, 2021

May I ask what missing fields you are referring to? Why can't the @online/@software/@dataset types in biblatex [1] do the job?

That being said, I think GitHub should acknowledge that it is common for authors to want people to cite their paper (or multiple papers) rather than simply the source code, because that is what counts toward citations in academia. At the same time, there is no reason not to support bibtex/biblatex in addition to CFF.

[1]: See section 2.1.1 in http://mirrors.ibiblio.org/CTAN/macros/latex/contrib/biblate...

sdruskat · on Aug 21, 2021

@software is simply an alias for the fallback @misc, i.e., semantics are lost, no fields like different URLs for different software media (code, build artifacts, etc.), no software identifier support, etc.

Also, you can have people cite your paper on GitHub by giving it as a preferred citation in CFF, and GitHub will render that instead of the source code. Which is, btw, against the software citation principles [1], but caters to people who need time adapting and want traditional credit now.

[1]: https://doi.org/10.7717/peerj-cs.86

sdruskat · on Aug 21, 2021

Hi, co-lead of the CFF project here.

One advantage I see is semantics, and that it's single-purpose. Downstream clients (archives, indexers, GitHub citation feature, etc.) know exactly what they're dealing with (citation information for software or datasets). Also has better support for software-related fields than BibTeX for example (@software being an alias for @misc).

colonelxc · on Aug 21, 2021

CSL doesn't appear to support software as a 'type' of thing, for which it has a hardcoded list of options [0]. Of course, maybe they should have just fixed an existing format instead of creating a new one.

[0] https://github.com/citation-style-language/schema/blob/maste...

riedel · on Aug 21, 2021

I don't think this is anything chosen by GitHub. Seems like a random format endorsed by some German and Dutch research orgs. Having said that, it would be cool to have a standard format/best practice that would also be picked up, e.g., by Google Scholar (I guess BibTeX is the best bet). Like a CITEME.bib file, which would be pretty self-explanatory.

riedel · on Aug 21, 2021

I found the post now: https://news.ycombinator.com/item?id=28253293

So it seems to be official. Strange things happen...

cratermoon · on Aug 21, 2021

> What is the advantage of this

Right? There's also RIS, EndNote, and others. Did we really need ANOTHER format?

cormacrelf · on Aug 21, 2021

TLDR: Think of it like Zotero having field names that change depending on the item type. If you select "Journal Article", the CSL `container-title` field is displayed as "Publication", whereas for a book chapter, it's "Book Title". The `software` type in CSL gets a similar treatment in Zotero, and CFF is like that but as a YAML schema for writing by hand. Additionally, you get to give reference data for the dependencies or datasets you built the present one from.

The main page for CFF does not argue for its existence very effectively. You're better off reading the schema (https://github.com/citation-file-format/citation-file-format...). Two takeaways:

1. It is specifically for citing software and datasets. Only those two things. The fields look useful for its intended purpose: you can have multiple DOIs for different versions of the software as published. You can include a few more different URLs than most formats, like zipped repo contents for a version or a dataset download link (repository-artifact) etc. If you use the preferred-citation field to point people to a paper instead, then you are warned it is against the principle of citing software and datasets as if they are papers themselves.

2. Unfortunately, though, because the schema only has enough fields for citing those two, if your software is (e.g.) an implementation of an algorithm described in a paper, you cannot express that in CFF. There is a `references` field, but in reality it can only contain other software and datasets because CFF can't describe anything else. It would be better if there were separate fields for CFF software/dataset references and other kinds of reference data, the latter incorporating CSL-JSON by reference. CSL-JSON isn't really written by hand except by a handful of people making CSL tests (me!); the point of such a field would be a space to dump an export from your reference library.

So in sum, what is the advantage of this?

- Anything but JSON. People hate writing it by hand.

- Software-specific field names, whereas if you used CSL-JSON directly `date-released` would be `issued`... who issues software?

- Separates the "main" citation and the "references", whereas other formats are a flat dump of a reference library. It's got structure enough to produce different bibtex/etc for different versions of the code, selectable at conversion time.

(Disclosure: I work for Zotero.)
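
To make the `date-released` versus `issued` point concrete, here is a minimal Python sketch of how a few CFF keys might map onto CSL-JSON. The function name `cff_to_csl` and the example metadata are made up for illustration, the mapping covers only a handful of fields, and it assumes every author is a person with `family-names` and `given-names`. It is not how Zotero, GitHub, or any existing converter does it.

```python
from datetime import date


def cff_to_csl(cff: dict) -> dict:
    """Map a small, illustrative subset of CFF keys to CSL-JSON keys."""
    released = date.fromisoformat(cff["date-released"])
    return {
        "type": "software",
        "title": cff["title"],
        "version": cff.get("version"),
        # CSL-JSON has no software-specific date field,
        # so the release date ends up in `issued`.
        "issued": {
            "date-parts": [[released.year, released.month, released.day]]
        },
        # Assumes person authors only (no entity authors).
        "author": [
            {"family": a["family-names"], "given": a["given-names"]}
            for a in cff["authors"]
        ],
    }


# Hypothetical CITATION.cff content, written as a dict to avoid a YAML dependency.
example = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite it as below.",
    "title": "Example Tool",
    "version": "1.0.0",
    "date-released": "2021-08-21",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
}

csl = cff_to_csl(example)
print(csl["issued"])  # {'date-parts': [[2021, 8, 21]]}
```

The hand-written side keeps the software-flavoured names, and the generic CSL names only show up after conversion.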

sdruskat · on Aug 21, 2021

Hi, and thanks for supporting CFF through the Zotero connector (for GitHub repos) now :).

FYI, we're in the process of improving the website atm, including a Rationale section, etc. which will hopefully make it clearer why we think the format is a good idea, at least for the time being.

As for your takeaway 2.:

`references` can take all kinds of references, not just software and datasets, including articles, so a paper describing the algorithm implemented in the software that the CFF file describes is exactly in scope for that (the paper being, e.g., a prior work).

cormacrelf · on Aug 21, 2021

Ah, I missed the fields that would help you define that info on “definition.reference”. https://github.com/citation-file-format/citation-file-format...

Clearly I’m not the person who built the connector if I missed that :)

avian · on Aug 21, 2021

One thing I’m missing on GitHub is support for providing a citation for a journal paper about the software, not the software itself. It’s common to see something like “if you use this code, please cite this paper” in README files.

There are many reasons why people put this into their READMEs, but it mostly boils down to the fact that paper citations affect various metrics, while citations of a GitHub repo mostly don’t matter.

The citation file format does include a field for providing a list of extended references, but it seems that GitHub doesn’t support that.

sdruskat · on Aug 21, 2021

This is something that you can do by providing a `preferred-citation` in a CITATION.cff file. This will be rendered on GitHub as the thing to cite. See the schema guide section about this here: https://github.com/citation-file-format/citation-file-format....
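
As a rough illustration of that behaviour, a consumer of a CITATION.cff file could pick what to render along these lines. This is a sketch under assumptions, not GitHub's actual implementation: the function name `citation_to_render`, the example metadata, and the placeholder DOI are all hypothetical, and the file is shown as an already-parsed dict.

```python
def citation_to_render(cff: dict) -> dict:
    """Return the preferred citation if present, else the software's own metadata."""
    preferred = cff.get("preferred-citation")
    if preferred:
        return preferred
    # Fall back to the top-level entry, leaving out keys that are
    # not part of the reference itself.
    skipped = ("cff-version", "message", "references")
    return {key: value for key, value in cff.items() if key not in skipped}


# Hypothetical parsed CITATION.cff for a repository that only cites the software.
software_only = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite it as below.",
    "title": "Example Tool",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
}

# The same file with a paper given as the thing to cite.
with_paper = dict(software_only)
with_paper["preferred-citation"] = {
    "type": "article",
    "title": "Example Tool: a paper about the tool",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
    "journal": "Journal of Examples",
    "year": 2021,
    "doi": "10.0000/placeholder",  # placeholder, not a real DOI
}

print(citation_to_render(software_only)["title"])  # Example Tool
print(citation_to_render(with_paper)["type"])      # article
```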

avian · on Aug 21, 2021

Ah, interesting. Thanks! I was experimenting with the "references" key to see if it can do what I want and somehow missed that "preferred-citation" exists. Was this added after the initial GitHub announcement?

sdruskat · on Aug 21, 2021

The agenda release and docs update came a few days later, yes. I think it might have been in the Gem already, albeit experimentally.

polm23 · on Aug 21, 2021

I was a little surprised that wasn't the default usage model, but I followed the instructions and it showed up as expected.

https://github.com/polm/fugashi

sdruskat · on Aug 21, 2021

This is because CFF is mainly built to support the software citation principles [1], where it is argued (rightly so, if you ask me) that software is important enough to be cited in its own right.

Also, there will likely be no new paper for each version of software, so if you want to cite the version you have used in your work (e.g. towards reproducibility), the paper may be useless.

[1]: https://doi.org/10.7717/peerj-cs.86

_Algernon_ · on Aug 21, 2021

What does this do that a .bib file doesn't? It doesn't follow a standard which is made clear even by the creators themselves: "When you put a CITATION.cff file in the default branch of your GitHub repository, it is automatically linked from the repository landing page, and the citation information is rendered on the repository page, and also provided as BibTeX snippet which users can simply copy!"

I don't see why they don't stick to the established standard and parse a CITATION.bib instead. It would be less complex, more friendly to the user, and less likely to cause lock-in.

sdruskat · on Aug 21, 2021

I guess the answer is semantics: who will guarantee (e.g. to downstream services) a CITATION.bib file will contain the metadata for the software in the repo? CFF is single-purpose and made for just that.

spicybright · on Aug 21, 2021

How can you guarantee a CFF file will have the right metadata?

sdruskat · on Aug 21, 2021

The guarantee is that you have citation information for a specific research output type: software (or dataset, as defined), and that it is the output you have found the CFF file with. Unless people want to break the principle on purpose, against which no format/mechanism can do anything ;).

thenoblesunfish · on Aug 21, 2021

This seems like a trap to make GitHub stickier. If something makes it harder to leave GitHub and host your code elsewhere, beware. In this situation, it seems safer to give people some BibTeX to copy-paste.

sdruskat · on Aug 21, 2021

We're trying to bring CFF to other platforms as well, so everything just becomes stickier ;). E.g. https://gitlab.com/gitlab-org/gitlab/-/issues/337368.

Athas · on Aug 21, 2021

I'm all for citing software. I still don't understand the guidelines for who to list as an author, though. In modern open source projects, you're going to have a very long tail of drive-by contributors. Some of these might not even touch the software itself, but merely fix a typo in the README or similar. Should the citation list all of these? As for myself, I'm very much in favour of crediting anyone for any contribution, no matter how small, but standard scientific practice is more exclusive. And what about names? When I look at the author list that Zenodo generates for my own main project [0], it's not only very long, but contains lots of online pseudonyms and even a few duplicates, due to differences in spelling or inclusion of middle names and such. My background is in the hacker community, so I think this is great, but I don't think a journal editor would agree.

I could manually curate a list of the "main authors", which would be much smaller, but I'm not particularly enthusiastic about being the arbiter of when someone's contributions are major enough to become a "main author".

[0]: https://zenodo.org/record/5062209

sdruskat · on Aug 21, 2021

This is indeed something that needs to be solved. I think the current path in the schol comms community leans towards having contributors (with different roles) as well as authors.

Also, summary authors ("the <project> contributors") is one way to relatively elegantly circumvent this, and something you could do in a CFF file for example (these are being picked up by Zenodo).
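
For what it's worth, here is a small Python sketch of how a summary author could sit next to named people when turning a CFF author list into a BibTeX `author` field. In CFF a person has `family-names` and `given-names`, while an entity has `name`. The function name `bibtex_author_field` and the example names are hypothetical, and wrapping the entity in braces is just the usual BibTeX convention for keeping a collective name from being split into first and last parts, not something the thread prescribes.

```python
def bibtex_author_field(authors: list) -> str:
    """Build a BibTeX author string from CFF person and entity authors."""
    parts = []
    for author in authors:
        if "name" in author:
            # Entity author: brace it so BibTeX treats it as one literal name.
            parts.append("{" + author["name"] + "}")
        else:
            # Person author in "Family, Given" form.
            parts.append(f'{author["family-names"]}, {author["given-names"]}')
    return " and ".join(parts)


# Hypothetical author list: one curated main author plus a summary author.
authors = [
    {"family-names": "Doe", "given-names": "Jane"},
    {"name": "The Example Project contributors"},
]

print(bibtex_author_field(authors))
# Doe, Jane and {The Example Project contributors}
```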

pwdisswordfish8 · on Aug 21, 2021

When citing collective written works (like conference paper collections), the editors’ names are usually listed. I don’t see why not do the same here: cite the maintainers’ names when citing the whole, and also specific authors when citing particular fragments.

angrais · on Aug 21, 2021

Why not simply cite the original creators? Then, if major features changed between versions, the key contributors?

Original creators (whose idea the software is) should be first author.

Athas · on Aug 21, 2021

As I understand it, an important point of software citations is to help academic researchers who are (unfortunately) measured by citation metrics. Will whatever tools the bean counters are using connect "the <project> contributors" citations properly to people? I don't see how they could.

cratermoon · on Aug 21, 2021

oboy yet another citation format standard

tannhaeuser · on Aug 21, 2021

https://xkcd.com/927/

tejtm · on Aug 21, 2021

How does (or should) this tie into that last tool they released that suggested snippets of other people's code ... Copilot.

When your code ends up with a morally equivalent section to something copilot suggested out of my repo should GH add my citation to your code?