Citation File Format (citation-file-format.github.io)
82 points by polm23 on Aug 21, 2021 | 35 comments


What is the advantage of this format over CSL-JSON [1] or BibTeX, which are already supported by software like pandoc [3]? There is even a standard YAML citation format [2] used within YAML metadata for Markdown files, so I wonder why GitHub couldn't have chosen one of the existing options, lobbying the spec maintainers if any changes were needed.

[1] https://citeproc-js.readthedocs.io/en/latest/csl-json/markup...

[2] https://ymlthis.r-lib.org/reference/yml_reference.html

[3] https://pandoc.org/MANUAL.html#citations


BibTeX might not be perfect, but damn do I love being able to copy-paste pre-generated entries from practically all journals on Earth.


Now including GitHub, that gives you BibTeX generated from CFF files :).
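
To make the CFF-to-BibTeX idea concrete, here is a minimal sketch of such a conversion. It is not GitHub's implementation: `cff_to_bibtex`, `format_author`, and the sample metadata are hypothetical, and the input is assumed to be a CITATION.cff file that has already been parsed into a plain dict.

```python
# Sketch only: render an already-parsed CITATION.cff (a plain dict) as BibTeX.
# The helper names and the sample metadata are hypothetical, not GitHub's code.

def format_author(author: dict) -> str:
    # Person authors carry family/given names; entity authors only have "name".
    if "family-names" in author:
        return f"{author['family-names']}, {author.get('given-names', '')}".rstrip(", ")
    return "{" + author["name"] + "}"


def cff_to_bibtex(cff: dict, key: str) -> str:
    fields = {
        "author": " and ".join(format_author(a) for a in cff["authors"]),
        "title": "{" + cff["title"] + "}",  # extra braces protect capitalization
        "version": cff.get("version"),
        "year": cff.get("date-released", "")[:4],
        "doi": cff.get("doi"),
        "url": cff.get("repository-code"),
    }
    # Skip fields that are missing or empty in the CFF metadata.
    lines = [f"  {name} = {{{value}}}" for name, value in fields.items() if value]
    return "@software{" + key + ",\n" + ",\n".join(lines) + "\n}"


example = {
    "cff-version": "1.2.0",
    "title": "Example Tool",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
    "version": "1.0.0",
    "date-released": "2021-08-21",
    "repository-code": "https://example.org/example-tool",
}

print(cff_to_bibtex(example, key="doe_example_tool"))
```

The conversion only goes one way cleanly: CFF fields that have no BibTeX counterpart are simply dropped in this sketch.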


Since many papers have existing bibtex, it would be better if GitHub either provides tools to convert bibtex to CFF or supports bibtex directly.

FWIW, bibtex is the de facto standard.


Not for providing citation metadata for software, where BibTeX misses important fields.


May I ask what missing fields you are referring to? Why can't the @online/@software/@dataset types in biblatex [1] do the job?

That being said, I think GitHub should acknowledge that it is common for authors to want people to cite their paper (or multiple papers) rather than simply the source code, because that is what counts toward citation metrics in academia. At the same time, there is no reason not to support bibtex/biblatex in addition to CFF.

[1]: See section 2.1.1 in http://mirrors.ibiblio.org/CTAN/macros/latex/contrib/biblate...


@software is simply an alias for the fallback @misc, i.e., semantics are lost, no fields like different URLs for different software media (code, build artifacts, etc.), no software identifier support, etc.
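
As an illustration of those fields, here is a rough sketch of what a parsed CITATION.cff could look like with separate URLs per medium and a list of identifiers. The key names follow my reading of the CFF 1.2.0 schema; every value, and the two helper functions, are made up for the example.

```python
# Mirrors what a YAML parser would return for a CITATION.cff; all values are made up.
software = {
    "cff-version": "1.2.0",
    "type": "software",
    "title": "Example Tool",
    "repository-code": "https://example.org/example-tool",          # source code
    "repository-artifact": "https://example.org/example-tool.zip",  # build artifact
    "url": "https://example.org/example-tool/docs",                 # landing page
    "identifiers": [
        {"type": "doi", "value": "10.0000/example.1", "description": "Version 1.0.0"},
        {"type": "url", "value": "https://example.org/example-tool/releases/1.0.0"},
    ],
}


def links_by_role(cff: dict) -> dict:
    # Hypothetical helper: keep the URL fields apart instead of collapsing them into one.
    roles = ("repository-code", "repository-artifact", "url")
    return {role: cff[role] for role in roles if role in cff}


def identifiers_of_type(cff: dict, kind: str) -> list:
    # Hypothetical helper: pick out all identifiers of one kind, e.g. "doi".
    return [item["value"] for item in cff.get("identifiers", []) if item["type"] == kind]


print(links_by_role(software))
print(identifiers_of_type(software, "doi"))
```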

Also, you can have people cite your paper on GitHub by giving it as a preferred citation in CFF, and GitHub will render that instead of the source code. Which is, btw, against the software citation principles [1], but caters to people who need time adapting and want traditional credit now.

[1]: https://doi.org/10.7717/peerj-cs.86


Hi, co-lead of the CFF project here.

One advantage I see is semantics, and that it's single-purpose. Downstream clients (archives, indexers, GitHub citation feature, etc.) know exactly what they're dealing with (citation information for software or datasets). Also has better support for software-related fields than BibTeX for example (@software being an alias for @misc).


CSL doesn't appear to support software as a 'type' of thing, for which it has a hardcoded list of options [0]. Of course, maybe they should have just fixed an existing format instead of creating a new one.

[0] https://github.com/citation-style-language/schema/blob/maste...


I don't think this is anything chosen by GitHub. Seems like a random format endorsed by some German and Dutch research orgs. Having said that, it would be cool to have a standard format/best practice that would also be picked up e.g. by Google Scholar (I guess BibTeX is the best bet). Like a CITEME.bib file, which would be pretty self-explanatory.


I found the post now: https://news.ycombinator.com/item?id=28253293

So it seems to be official. Strange things happen...


> What is the advantage of this

Right? There's also RIS, EndNote, and a few others. Did we really need ANOTHER format?


TLDR: Think of it like Zotero having field names that change depending on the item type. If you select "Journal Article", the CSL `container-title` field is displayed as "Publication", whereas for a book chapter, it's "Book Title". The `software` type in CSL gets a similar treatment in Zotero, and CFF is like that but as a YAML schema for writing by hand. Additionally, you get to give reference data for the dependencies or datasets you built the present one from.

The main page for CFF does not argue for its existence very effectively. You're better off reading the schema (https://github.com/citation-file-format/citation-file-format...). Two takeaways:

1. It is specifically for citing software and datasets. Only those two things. The fields look useful for its intended purpose: You can have multiple DOIs, for different versions of the software as published. You can include a few more different URLs than most formats, like zipped repo contents for a version or a dataset download link (repository-artifact) etc. If you use the preferred-citation field to point people to a paper instead, then you are warned it is against the principle of citing software and datasets as if they are papers themselves.

2. Unfortunately, though, because the schema only has enough fields for citing those two, if your software is (e.g.) an implementation of an algorithm described in a paper, you cannot express that in CFF. There is a `references` field, but in reality it can only contain other software and datasets because CFF can't describe anything else. It would be better if there were separate fields for CFF software/dataset references and other kinds of reference data, the latter incorporating CSL-JSON by reference. CSL-JSON isn't really written by hand except by a handful of people making CSL tests (me!); the point of such a field would be a space to dump an export from your reference library.

So in sum, what is the advantage of this?

- Anything but JSON. People hate writing it by hand.

- Software-specific field names, whereas if you used CSL-JSON directly `date-released` would be `issued`... who issues software?

- Separates the "main" citation and the "references", whereas other formats are a flat dump of a reference library. It's got structure enough to produce different bibtex/etc for different versions of the code, selectable at conversion time.

(Disclosure: I work for Zotero.)
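
The field-name point can be sketched as a small mapping from CFF keys to CSL-JSON keys. This is an assumption-laden illustration, not what Zotero or any converter actually runs: `cff_to_csl` is a hypothetical helper, it assumes a CSL version that has the `software` type, and it only handles person authors.

```python
from datetime import date


def cff_to_csl(cff: dict) -> dict:
    # CFF's software-specific `date-released` becomes CSL's generic `issued`.
    released = date.fromisoformat(cff["date-released"])
    return {
        "type": "software",  # assumes a CSL version that defines this type
        "title": cff["title"],
        "author": [
            {"family": a["family-names"], "given": a["given-names"]}
            for a in cff["authors"]
        ],
        "version": cff["version"],
        "issued": {"date-parts": [[released.year, released.month, released.day]]},
    }


# Made-up metadata, shaped like a parsed CITATION.cff.
cff = {
    "title": "Example Tool",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
    "version": "1.0.0",
    "date-released": "2021-08-21",
}

csl = cff_to_csl(cff)
print(csl["issued"])
```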


Hi, and thanks for supporting CFF through the Zotero connector (for GitHub repos) now :).

FYI, we're in the process of improving the website atm, including a Rationale section, etc. which will hopefully make it clearer why we think the format is a good idea, at least for the time being.

As for your takeaway 2.:

`references` can take all kinds of references, not just software and datasets, including articles, so a paper describing the algorithm implemented in the software that the CFF file describes is exactly in scope for that (the paper being, e.g., a prior work).
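
For illustration, a parsed CITATION.cff with mixed `references` might look roughly like this. The key names follow my reading of the CFF 1.2.0 schema; the titles, names, and the `references_of_type` helper are invented for the example.

```python
# Made-up metadata, shaped like a parsed CITATION.cff.
cff = {
    "cff-version": "1.2.0",
    "title": "Example Tool",
    "references": [
        {
            "type": "article",  # the paper describing the implemented algorithm
            "title": "An Example Algorithm",
            "authors": [{"family-names": "Doe", "given-names": "Jane"}],
            "journal": "Journal of Examples",
            "year": 2020,
        },
        {
            "type": "software",  # a dependency the tool builds on
            "title": "Example Dependency",
            "authors": [{"name": "The Example Dependency contributors"}],
            "version": "2.3.1",
        },
    ],
}


def references_of_type(cff: dict, kind: str) -> list:
    # Hypothetical helper: list the titles of all references of one type.
    return [ref["title"] for ref in cff.get("references", []) if ref["type"] == kind]


print(references_of_type(cff, "article"))
```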


Ah, I missed the fields that would help you define that info on “definition.reference”. https://github.com/citation-file-format/citation-file-format...

Clearly I’m not the person who built the connector if I missed that :)


One thing I’m missing on GitHub is support for providing a citation for a journal paper about the software, not the software itself. It’s common to see something like “if you use this code, please cite this paper” in README files.

There are many reasons why people put this into their READMEs, but it mostly boils down to the fact that paper citations affect various metrics, while citations of a GitHub repo mostly don’t matter.

The citation file format does include a field for providing a list of extended references, but it seems that GitHub doesn’t support that.


This is something that you can do by providing a `preferred-citation` in a CITATION.cff file. This will be rendered on GitHub as the thing to cite. See the schema guide section about this here: https://github.com/citation-file-format/citation-file-format....
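
Roughly, a consumer of the file can treat `preferred-citation` as an override. The sketch below assumes the file is already parsed into a dict; `what_to_cite` and the sample metadata are hypothetical and do not reflect how GitHub implements it.

```python
def what_to_cite(cff: dict) -> dict:
    # Fall back to the software's own metadata when no preferred citation is given.
    return cff.get("preferred-citation", cff)


# Made-up metadata, shaped like a parsed CITATION.cff.
with_paper = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite the paper below.",
    "title": "Example Tool",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
    "preferred-citation": {
        "type": "article",
        "title": "An Example Algorithm",
        "authors": [{"family-names": "Doe", "given-names": "Jane"}],
        "journal": "Journal of Examples",
        "year": 2020,
    },
}
without_paper = {key: value for key, value in with_paper.items() if key != "preferred-citation"}

print(what_to_cite(with_paper)["title"])
print(what_to_cite(without_paper)["title"])
```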


Ah, interesting. Thanks! I was experimenting with the "references" key to see if it can do what I want and somehow missed that "preferred-citation" exists. Was this added after the initial GitHub announcement?


The agenda release and docs update came a few days later, yes. I think it might have been in the Gem already, albeit experimentally.


I was a little surprised that wasn't the default usage model, but I followed the instructions and it showed up as expected.

https://github.com/polm/fugashi


This is because CFF is mainly built to support the software citation principles [1], where it is argued (rightly so, if you ask me) that software is important enough to be cited in its own right.

Also, there will likely be no new paper for each version of software, so if you want to cite the version you have used in your work (e.g. towards reproducibility), the paper may be useless.

[1]: https://doi.org/10.7717/peerj-cs.86


What does this do that a .bib file doesn't? It doesn't follow a standard which is made clear even by the creators themselves: "When you put a CITATION.cff file in the default branch of your GitHub repository, it is automatically linked from the repository landing page, and the citation information is rendered on the repository page, and also provided as BibTeX snippet which users can simply copy!"

I don't see why they don't stick to the established standard and parse a CITATION.bib instead. It would be less complex, more friendly to the user, and less likely to cause lock-in.


I guess the answer is semantics: who will guarantee (e.g. to downstream services) a CITATION.bib file will contain the metadata for the software in the repo? CFF is single-purpose and made for just that.


How can you guarantee a CFF file will have the right metadata?


The guarantee is that you have citation information for a specific research output type: software (or dataset, as defined), and that it is the output you have found the CFF file with. Unless people want to break the principle on purpose, against which no format/mechanism can do anything ;).


This seems like a trap to make GitHub stickier. If something makes it harder to leave GitHub and host your code elsewhere, beware. In this situation, it seems safer to give people some BibTeX to copy-paste.


We're trying to bring CFF to other platforms as well, so everything just becomes stickier ;). E.g. https://gitlab.com/gitlab-org/gitlab/-/issues/337368.


I'm all for citing software. I still don't understand the guidelines for who to list as an author, though. In modern open source projects, you're going to have a very long tail of drive-by contributors. Some of these might not even touch the software itself, but merely fix a typo in the README or similar. Should the citation list all of these? As for myself, I'm very much in favour of crediting anyone for any contribution, no matter how small, but standard scientific practice is more exclusive. And what about names? When I look at the author list that Zenodo generates for my own main project [0], it's not only very long, but contains lots of online pseudonyms and even a few duplicates, due to differences in spelling or inclusion of middle names and such. My background is in the hacker community, so I think this is great, but I don't think a journal editor would agree.

I could manually curate a list of the "main authors", which would be much smaller, but I'm not particularly enthusiastic about being the arbiter of when someone's contributions are major enough to become a "main author".

[0]: https://zenodo.org/record/5062209


This is indeed something that needs to be solved. I think the current path in the schol comms community leans towards having contributors (with different roles) as well as authors.

Also, summary authors ("the <project> contributors") is one way to relatively elegantly circumvent this, and something you could do in a CFF file for example (these are being picked up by Zenodo).
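
A sketch of what that could look like once the file is parsed: person authors use `family-names`/`given-names`, while an entity author only has a `name`. The names and the `display_name` helper are invented for the example.

```python
# Made-up author list, shaped like the `authors` key of a parsed CITATION.cff.
authors = [
    {"family-names": "Doe", "given-names": "Jane"},  # a person
    {"name": "The Example Tool contributors"},       # an entity standing in for the long tail
]


def display_name(author: dict) -> str:
    # Hypothetical helper: entities have a single name, persons have two parts.
    if "name" in author:
        return author["name"]
    return f"{author['given-names']} {author['family-names']}"


print([display_name(a) for a in authors])
```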


When citing collective written works (like conference paper collections), the editors’ names are usually listed. I don’t see why not do the same here: cite the maintainers’ names when citing the whole, and also specific authors when citing particular fragments.


Why not simply cite the original creators? Then, if major features changed between versions, add the key contributors?

Original creators (whose idea the software is) should be first author.


As I understand it, an important point of software citations is to help academic researchers who are (unfortunately) measured by citation metrics. Will whatever tools the bean counters are using connect "the <project> contributors" citations properly to people? I don't see how they could.


oboy yet another citation format standard



How does (or should) this tie into that last tool they released that suggested snippets of other people's code ... Copilot.

When your code ends up with a morally equivalent section to something copilot suggested out of my repo should GH add my citation to your code?



