"Professors usually have this legacy code on hand (often code they wrote themselves decades ago) and pass this code on to their students. This saves their students time, and also takes uncertainty out of the debugging process."
This is so true. I'm a PhD student in physics using Fortran for pretty much that reason. At the start of my PhD, in response to my supervisor telling me I should learn Fortran to modify our current codebase, I asked if I could rewrite what I'd be working on into C++ first, since I was already familiar with it and wanted to bring future development into a more "modern" language.
His response was "You could do that and it would probably be enough to earn your PhD, since it'll take you at least three years. But I suspect you'll want to work on something else during that time".
He was right. I later learnt one of our "rival groups" attempted the same thing, and it took three PhD students working full time for a year to rewrite their code from Fortran to C++.
"Within a month of his arrival, Randy solved some trivial computer problems for one of the other grad students. A week later, the chairman of the astronomy department called him over and said, “So, you’re the UNIX guru.”
"At the time, Randy was still stupid enough to be flattered by this attention, when he should have recognized them as bone-chilling words. Three years later, he left the Astronomy Department without a degree, and with nothing to show for his labors except six hundred dollars in his bank account and a staggeringly comprehensive knowledge of UNIX."
1. "A staggeringly comprehensive knowledge of UNIX", three years of domain-specific education, and a network of people who trust you to--no, who depend on you to be able to get things done with that knowledge
2. A piece of paper just like everyone else in the department.
And while the former might be more difficult to make use of, I think it could be much more valuable in the long run.
In context, he was underpaid tech support who was taken advantage of and got no grad school education while he paid for grad school. A bunch of scientists who know you as "that guy who fixes my email and doesn't publish or know anything" is not a great network.
1. "A staggeringly comprehensive knowledge of UNIX", three years of domain-specific education, and a network of people who trust you to--no, who depend on you to be able to get things done with that knowledge
2. A piece of paper just like everyone else in the department, three years of domain-specific education, and a network of people who trust you to--no, who depend on you to be able to get things done with that knowledge (and probably a girlfriend)
> three years of domain-specific education, and a network of people who trust you to--no, who depend on you to be able to get things done with that knowledge
This might be the case, but he's now like the janitor who unplugs the toilet. His work may be appreciated, and it might be necessary, but it's not respected or well remunerated. I guess he gets a little bit of respect for unplugging more theoretical pipes, but a little respect isn't a PhD.
Not useful if he wanted to be an astronomer, but a much more marketable set of skills if his goal was to find a job in industry. Probably also made a more positive (albeit uncredited/unrewarded) contribution to astronomy than most of the grad students.
Trusting legacy code with few users is a dangerous proposition. My roommate was given some "state-of-the-art" code and told to run simulations with it. The only graphical output was PostScript (for some reason), so every frame was 150 MiB and took minutes to dump - so usually, this was only done at the end to show the result.
I managed to hack in a step which just dumped the memory of the resulting frame to a file, and then we wrote a Python script to read that and produce a PNG.
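For what it's worth, the raw-dump-to-PNG step can be tiny. A minimal sketch in Python, assuming a hypothetical frame size and file names (with a random array standing in for the simulation's actual dump):

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical frame geometry; the Fortran side just writes its array
    # to disk as a raw stream of float64 values.
    nx, ny = 512, 512
    np.random.rand(ny, nx).tofile("frame_0042.raw")  # stand-in for the real dump

    frame = np.fromfile("frame_0042.raw", dtype=np.float64).reshape(ny, nx)
    # Fortran arrays are column-major, so a real dump may need a transpose (.T).
    plt.imsave("frame_0042.png", frame, cmap="viridis")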
We combined the PNGs into an animation and showed someone else in the department, because the supervisor wasn't in that day. "Cool! But, hrmm, those boundary conditions look wrong." Sadly, this was two-thirds of the way into his Masters.
Like much code in physics these days, the original code was simply incorrect. This happens all the time, and papers get retracted because of it - well, that's the best case, if somebody notices.
At the end of the day, programming languages come with ecosystems, and must be chosen simply as tools. The problem is not with Fortran itself, but that Fortran often means outdated development practices and, in physics, horrible code hacked on by 10+ people without prior programming experience.
Another problem is that most people who wrote this code aren't programmers - they don't write clean code, don't write tests, etc. They don't really know those things are important. Sometimes code that has been used and updated for years looks like a dirty prototype.
I don't know what can be done about it except hiring programmers to write the code, which would be neither easy nor cheap.
Well, I guess one moral of the story is that doing image manipulation is easier in Python than Fortran, so it's again using the right tool for the job. And I think the push for Physicists to use Python if possible is good, as the learning curve is less steep. And once you have that tool at your disposal, you might use it more often (instead of Excel).
I can see many projects being improved by providing a Python pre-processor that writes out e.g. a binary config file, which the hardcore Fortran/C simulation code reads before spitting out the simulation results, and then a Python post-processor that does the pretty stuff at the end.
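A minimal sketch of that split, with hypothetical parameter names and file names (and a random array standing in for what the compiled simulation would produce):

    import numpy as np

    # Hypothetical run parameters.
    nx, ny, dt, t_end = 256, 256, 1e-3, 10.0

    # Pre-processor: write a flat binary config the Fortran/C code can
    # slurp with a single unformatted/stream read.
    np.array([nx, ny, dt, t_end], dtype=np.float64).tofile("run.cfg")

    # ... the compiled simulation reads run.cfg and writes results.bin ...
    np.random.rand(ny, nx).tofile("results.bin")  # stand-in for that step

    # Post-processor: load the raw results and do the pretty stuff.
    data = np.fromfile("results.bin", dtype=np.float64).reshape(ny, nx)
    print(data.min(), data.max(), data.mean())  # or hand it to matplotlib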
Academically, it seems that pairing CS undergrads with Physics undergrads to do e.g. a molecular dynamics (MD) course would be cool. The Physics behind MD isn't too hard, and given the right parameters the programming part would be manageable. Then again, CS undergrads aren't necessarily great programmers either...
Clean code and tests are overrated. Version control and a big ecosystem are underrated. It's like maths: physicists don't understand maths, they just use it, the way a carpenter uses a nail without understanding metallurgy, or a programmer uses a CPU without understanding solid-state physics. And that's okay.
I would have thought a more natural candidate for scientists would be languages like Java or C# rather than Python: you get a language that is relatively easy to write, with no memory to manage, no hornet's nest of pointers to pointers, nice debuggers, and lots of third-party libraries, while at the same time getting the performance of static typing and of running compiled code once the program has been JIT-compiled (which for long-running code is a negligible cost). It's never going to be as fast as C++ or Fortran, but you get a compromise between performance and ease of use, whereas I understand Python goes to the other extreme, i.e. all ease of use at the expense of performance. That doesn't matter for a simple simulation, but for some large data analysis projects, I assume it could make a big difference.
But that doesn't seem the case. The choice seems to be between python and c++/fortran. Does anyone know why?
1. Nice bindings already exist for numerical libraries in python. It is more accurate in some ways to say that scientists are using a domain specific language based on numpy than they are using python.
2. The choice is either "fastest possible" or "I just don't care how long it takes" - for either development or run time. There is usually no middle ground.
3. Tooling - the scientist is most likely using a text editor not an IDE (especially when they start working). Fortran and python are both low enough on boilerplate to not need IDE support.
4. Abstraction level. Scientists in general don't care about abstraction level at all. Procedural programming's abstractions are usually more than enough for them (and may actually be the correct level for some numerical work - think cache misses vs. hits).
5. Some areas of science do use Java (check out imagej for example).
Why not use both? You can reuse existing Fortran code within Python, and benefit from the prior experience and speed while enjoying the qualities of Python.
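For example, numpy's f2py will wrap existing Fortran routines into an importable module. A rough sketch, where legacy.f90 and the routine name are hypothetical:

    # Build the extension once from the existing Fortran source:
    #   python -m numpy.f2py -c legacy.f90 -m legacy
    import numpy as np
    import legacy  # the module f2py generated above (hypothetical name)

    x = np.linspace(0.0, 1.0, 1001)
    y = legacy.integrate(x)  # call the wrapped Fortran routine on a numpy array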
If you don't have clean code and tests, a VCS just lets you switch between old and new bad, probably non-working, code. That's nice, but probably not what you'd want. If I had to choose one out of all of them, I'd choose tests. From my experience, it's nearly impossible to implement a big system without tests. Maybe I'm just not good enough as a programmer.
They probably did at some point in the misty past. The problem is that running relevant experiments is often very expensive (otherwise they wouldn't need to do it in software), and they can only test a tiny handful of cases. More often than not you're calibrating your model against another model that was calibrated against another model that was calibrated against a set of simple experiments done in the '60s.
I've heard horror stories of supposedly quite serious software modelling programs behaving quite differently when their input problem was rotated a few degrees.
> I've heard horror stories of supposedly quite serious software modelling programs behaving quite differently when their input problem was rotated a few degrees.
Sounds perfectly normal :)
The standard advice here at work for one of our modelling packages is: if it crashes try rotating your input model by one degree and try again. Most of the time that will fix it. If it doesn't fix it try one degree in the other direction.
I'm sure lots of people do. We're a commercial company selling consultancy services around this software (technically around some in-house tools built on top of this software), among many other things.
Is correctness somehow assured provided it doesn't crash?
On the whole, well-tested PDE solvers either crash, give answers that are off by several orders of magnitude, or give a correct answer. (Whether the answer they give is relevant to what you're trying to model is left as an exercise for the reader.)
We're reasonably sure that, if the calculations converge, the solution provided is a correct numeric approximation (within the error bounds given) of the PDEs we're claiming to solve. We also believe that the PDEs we're using provide a reasonable balance between modeling what we claim we're modelling and our computation running in a reasonable amount of time.
Yep. Most of my programming classes were in either Pascal or C when I was in undergrad, but in my physics courses, all the way through my graduate education, I used FORTRAN. It's because physics doesn't change that much over the years: time-tested FORTRAN code that works already exists, and there's no need to re-invent the wheel in another language when more interesting and important problems exist to solve.
David Baker's Rosetta code, which made him a pile of money, is IMO a spectacular example of craptastic C++ written by people who really didn't understand C++ but didn't let that deter them from using every single feature of the language, badly. Some years back we tried to port it to CUDA but there were so many levels of indirection, dereferencing, and virtual functions that it was nearly impossible to make any progress.
In contrast, porting the 30+ year old molecular dynamics package AMBER to CUDA took about 3 months and probably established me as a CUDA expert. In my opinion its well-maintained Fortran 90 code was far easier to understand and refactor.
While my primary languages today are C++ and CUDA, there is something clean about Fortran when it comes to understanding underlying algorithms. I have a similar opinion about well written C code.
I wonder if you can formalize this. Assume TDD, and then measure the compressed size of the codebase vs. the compressed size of the test suite. Code is "worse" when it requires more "stuff" (informational entropy) to do less "stuff" (pass test cases).
The compression would presumably remove the redundancy of the language itself as a factor (including differences in idiomatic cyclomatic-complexity "depths" of various stdlibs), and also remove any redundancy in the way the test cases were specified. So it'd be down to a measure of how much circumlocution and over-engineering you did in the process of implementing the solution.
I'd worry slightly that code-golf solutions would be rewarded, though. Maybe pass everything through an obfuscator + linter before computing the metric, so that things like identifier lengths and spaces aren't considered.
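A rough sketch of the metric, assuming a hypothetical layout with the implementation under src/ and the test suite under tests/:

    import zlib
    from pathlib import Path

    def compressed_size(paths):
        """Concatenate the files and return their zlib-compressed size in bytes."""
        blob = b"".join(p.read_bytes() for p in paths)
        return len(zlib.compress(blob, level=9))

    code_bits = compressed_size(sorted(Path("src").rglob("*.py")))
    test_bits = compressed_size(sorted(Path("tests").rglob("*.py")))

    # "Worse" code needs more compressed implementation per unit of compressed spec.
    print("entropy ratio (code / tests):", code_bits / test_bits)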
Oh, you're that Scott Grand? Allow me to thank you for the work you did on pmemd.cuda, my old research group wouldn't be where they are now without you.
You're welcome; it was ironically based on ideas drawn from a C port of the Amber potential function I wrote back in grad school. While I'm not proud of the code these days, I put it on SourceForge a long time ago.
Why write the port in the first place? Because back in the days of yore, I was doing the equivalent of adversarial search to try and design a better potential function for predicting protein tertiary structure. I ultimately arrived at the result that there were too many adversaries to make linear models and single hidden layer neural networks work.
And unfortunately, my postdoctoral advisor at the time didn't consider this publishable research.
It's a product of the UW, so no surprise the quality is crap. IMO their Cybersecurity course (which is mandatory for CS students) is the biggest load of shit I've ever seen in a class.
The head of information security at the UW calls the PSTN bulletproof secure, but VOIP insecure, then babbles on about how he shares classified info with his buddies casually (happens to be a felony to share said info).
Needless to say, UW's Avaya phone system barebacks the internet, with no regard for using silly security things like TLS or SRTP.
Now you need to do a follow on study to see how much science the 'rival' group does with a more modern codebase than your 'legacy' group does. Would you know if there are enough examples of two groups who have diverged like this to get meaningful (as in statistically significant) results on the cost benefit of porting / not porting?
> Now you need to do a follow on study to see how much science the 'rival' group does with a more modern codebase than your 'legacy' group does.
I would guess that a C++ codebase written by PhD students, not by seasoned C++ experts, is more complicated and much slower to debug, than a corresponding Fortran codebase.
Don't just assume that a 30-page-long main function is bad code. The code can still be well organized and readable. Functions are not the only way to organize code. Knuth's literate programming, for example, was invented to better organize code.
Functions have their own downsides. Using one for code that is only ever called once is really questionable, justified mostly by the lack of other means.
The very fact that the article itself, with a completely straight face, presents all that malloc nonsense as "the way you have to do it in C++" is very telling of how easy it is to get the wrong end of the stick with C++ unless you are au fait with the history of the language and its evolution towards better, more modern idioms.
Fortran is a much simpler and safer language than C++. And faster to learn. Fortran is even somewhat simpler than C or Java, whereas C++ is probably the most complicated language in the known universe.
So especially in the hands of non-experts, Fortran should produce fewer bugs.
> Now you need to do a follow on study to see how much science the 'rival' group does with a more modern codebase than your 'legacy' group does. Would you know if there are enough examples of two groups who have diverged like this to get meaningful (as in statistically significant) results on the cost benefit of porting / not porting?
I don't know of any studies like this. But there are two aspects to consider. One is what you've pointed out, which is the long-term scientific gains and the productivity of the research group as an entity. But the other aspect is whether those students who rewrote the code into C++ were more or less successful than the students who weren't rewriting code.
Obviously, how you judge success in the latter will be tricky, since there may be differences in the interest of students who rewrote the code and their desired outcome after the PhD (e.g., stay in academia versus go into industry).
That is a good point, if the majority of the code is living in Fortran then there is definitely some value in understanding large legacy Fortran code bases.
This is true. Though I realized I wasn't specific enough in what I meant. I was thinking more about the limited time available to students, and that rewriting code might mean the students do less science and so are less able to obtain a good postdoctoral position (or whatever position typically follows a PhD in the particular field).
True, but the students that re-implement these codebases also might have a much greater understanding of the underlying techniques, theory, and the interaction between them; compared to others who produce more results, but use these codebases as black boxes. I think both pathways lead to equally good academic careers, they just branch out into separate paths.
> His response was "You could do that and it would probably be enough to earn your PhD, since it'll take you at least three years. But I suspect you'll want to work on something else during that time".
As someone who had contact with that codebase, do you have any insights as to why that is?
Was it the sheer size of the thing? Was it some nuance that Fortran had as an advantage over other languages? Was the math just difficult to follow?
I reimplemented all of my professor's Fortran code at CERN in C during several undergrad classes in my physics major. Then I quit physics and became a developer.
Pretty much just size! There were about 20,000 lines of it; it's legacy code that has been gradually added to since the 1980s, so it would have taken a while to rewrite all the parts to work with each other. Perhaps some day though.
Started in the 80s doesn't mean the code is 30 years old, as they keep adding to it and modifying it in Fortran. Plus most of the analysis part is probably just math routines that don't need modification ever after.
20,000 lines of code really doesn't seem like that much. I probably output that much in about 2-3 months of biomedical research so there has to be more to it than that.
>> legacy code that has been gradually added to since the 1980s
> 20,000 lines of code really doesn't seem like that much.
It's not "20k lines of code" but "n lines of code that grew organically over decades to 20k lines" - with the help of probably way more than 100 people who all are not trained as developers. I think it's a safe bet to say the current state only has a faint memory of being a consistent code base.
> 20,000 lines of code really doesn't seem like that much.
Once I spent close to a month hunting down a subtle but nasty bug in a number crunching module which was perhaps around 2k LoC. Writing code is very easy. Verifying and validating number crunching code is very hard and very time consuming.
20,000 lines of your code is one thing. 20,000 lines of someone else's code is another thing entirely. 20,000 lines of many other people's code can be insane.
To do such a rewrite, you need domain knowledge much more than you need software engineering knowledge. This means you need to get research physicists to contribute major effort to such a rewrite. To get there, you need to make them see a clear benefit to translating a system (that works fine today, thank you) from language A to language B.
A Ph.D. student or a young researcher might be more interested if a rewrite lets them run experiments using Amazon / CUDA / whatever for a few thousand dollars instead of spending scarce grant money on dedicated hardware.
They are talking about large-scale, massively parallel code to run on a supercomputer. Most grad students in physics who need it have access to a system much faster than Amazon. For example, my friend's code runs on Cori @ NERSC, which is a 622,336-core Cray.
I'm taking an (educated) guess, because the people estimating the rewrite were all physicists and not developers. And certainly not developers with 5+ years of experience, which is the minimum I'd trust with architecting a major rewrite.
Yes, C++ does not buy you that much. Sure, it is a more modern language, but when used by scientists (not software engineers) the advantages are hardly earth-shattering. Much of the complexity is hidden (as it should be) in libraries.
IMO, Python stands a better chance of breaking Fortran's lock on physics-related computing. Give it a few more years and enough numpy-based libraries might make Python a real competitor. My 2c
I have to be honest with you, as someone who writes in both C++ and Python, I really do not see Python being more of a candidate than C++ for displacing Fortran.
Can you clarify why you think Python might be able to do it? For scientific computation with high performance requirements, Python is not competitive with Fortran or C++. For work that continues to happen in Fortran due to "academic inertia", my impression (and vague experience) is that researchers find the convenience of what they're accustomed to (Fortran) to be greater than the convenience of things like rapid prototyping offered by the various Python scientific computing and data analysis libraries. There is a mental overhead in switching, and I think most academics are sort of okay writing code in whatever is familiar and battle-tested if it means they can focus more on the research at hand.
In other words, I'm not saying you're wrong, but I'm not following your reasoning.
A disclaimer -- I do not work in physics, however a couple of recent physics Ph.D.s that I work with expressed similar views.
Speed is not nearly as important now as it was 10-15 years ago. A typical scientist's workstation has 20 CPU cores. If I want to use 100 CPUs for a few days, it is trivial, and 1000 is easy to get. Thus the fact that Python is slow(er) does not bother me unless I am setting up something major.
What matters though, is the availability of libraries that let me reliably run my experiments. If there is a bug, it must be in my code, not the library -- a "discovery" caused by a software bug is humiliating. This is where Fortran shines and Python is not quite there yet -- the decades of beating on those libraries made them very well understood and trusted.
What users? I do work in research and this is what I see almost every time -- doubling the number of CPUs available for a simulation is trivial; getting someone to rewrite existing code to make it twice as fast on the same CPU is hard.
I am not talking about interactive programs -- a slow browser is annoying. I am talking about scientific computing. This is just my experience, can you provide some counterexamples?
I worked as a programmer in a molecular dynamics research group for a while. I was asked to work in Python because it was what the boss was familiar with - so there's some of the same inertia happening again, just with a new(er) language, I guess.
Speed is not a huge issue if you're happy to leave your simulation running overnight anyway, or if you have the option to just throw more and more cores at the problem (or in our case, both). The goal is to get your papers published. Code developing/interpreting/debugging time is generally a far more serious obstacle to that goal than simulation speed. I was the only developer there. Everyone else in the group coded on an almost daily basis, but none of them considered themselves programmers, and very few of them ever actually learned about good programming practices. They're biophysics researchers.
Having not worked in Python much before, I was also pretty pleased to find that, when I needed Dijkstra's algorithm for path-finding and then to work out the smallest standard deviation among certain sets of data points, Python came with libraries to do both of those things off the cuff. It's just so easy, I can see why it has a favoured place in this field.
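For the curious, both of those are a few lines on top of scipy and numpy (a toy sketch, not the actual code from that group):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import dijkstra

    # Toy weighted graph as an adjacency matrix (0 = no edge).
    adj = np.array([[0, 1, 4, 0],
                    [0, 0, 2, 5],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=float)
    dist, pred = dijkstra(csr_matrix(adj), directed=True,
                          indices=0, return_predecessors=True)
    print(dist)  # shortest distances from node 0 to every other node

    # Smallest standard deviation among several sets of data points.
    datasets = [np.random.normal(0.0, s, size=100) for s in (0.5, 1.0, 2.0)]
    tightest = min(datasets, key=np.std)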
As long as Fortran is able to keep up with Python's capabilities (or any other language's, for that matter), there seems to be absolutely no reason to make a change, considering Fortran's history and familiarity in physics. In other words, other languages need to make a big enough leap forward, or Fortran needs to lag behind enough, to justify a change. Your example just shows that it's fine to use other languages, and I agree, but it's not very compelling in promoting a change of language.
> Speed is not a huge issue if you're happy to leave your simulation running overnight anyway, or if you have the option to just throw more and more cores at the problem
What about those cases where it's the difference between throwing it at a 1000-core cluster and whether we can have results before the next conference in a couple of months? That is what really defines scientific programming -- it's about feasible versus infeasible. For other jobs, isn't it just a matter of taste?
It depends; not all scientific programming is at universities.
When I worked for BHR Group (a hydrodynamics research organisation), sometimes we had emergency projects with a < 24-hour turnaround. I recall one where a client had had a serious issue at a plant, and we ran the simulation and produced a report in a single day.
Just because Python is slow doesn't necessarily mean that every library is also slow. With numpy/scipy, which are written in C with Python bindings, scientific computation is pretty fast. Python is fast becoming the standard language for data science, with so many tools like notebooks that make it extremely easy to do many things. I am not sure how well it compares to Fortran, but it is a no-brainer to use it over C++, especially for research needs.
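A quick illustration of that point; the exact timings vary by machine, but the gap between a Python-level loop and one call into numpy's compiled code is the thing to notice:

    import numpy as np
    from timeit import timeit

    x = np.random.rand(1_000_000)

    python_loop = lambda: sum(v * v for v in x)  # interpreted, element by element
    numpy_call = lambda: float(np.dot(x, x))     # one call into compiled C/BLAS

    print("pure Python:", timeit(python_loop, number=10))
    print("numpy      :", timeit(numpy_call, number=10))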
I don't see Python displacing Fortran or C++, but I do see Python showing up in HPC as a platform for DSLs. Instead of handling the expensive part of the computation, Python acts as a UI layer / configuration manager / code generator. For an example, see fenics: [https://fenicsproject.org/]
In a previous discussion of this kind (I'm not in this area myself), a scientist had said that the speed of the languages is less important than iterating the algorithms, and it was a lot easier for non-developer scientists to work with python than with C++.
Hardly any of numpy is written in Fortran. It's basically all C and Python. Numpy does, however, link to and provide wrappers around some existing Fortran libraries.
Potentially your "rival groups" will have a long term advantage now though. This might be bad for the graduate students that did the work, but could be good for the professor and group over the next decade.
Graduate students need to think in terms of 4-6 years while forward looking professors might want to think in terms of 5-15 years.
I do wonder about this a little, but so far I haven't seen anything implemented elsewhere in my field that hasn't been possible to do with our existing code. In fact, I have spent a fair bit of my time replicating others' results with our code and getting sub-percent level agreement.
However, I guess one drawback is that a lot of the things we currently implement are written from scratch (for very standard things such as numerical differentiation and parameter optimisation), which has the advantage of giving us "control" over and more understanding of the code, but saves less time and is potentially not as efficient as using pre-existing libraries.
I agree, it is very project/field dependent. There are times when a painful redesign might pay off in the future, but I am sure there are also times when it is a waste of effort. In your case a redesign might have been a waste, you are in the best position to judge this.
I use Python for most things and write C extensions, which can use OpenMP or CUDA and which are hooked in via Cython, for the slow parts. I find this works well, although it can lead to you duplicating things unnecessarily sometimes (making a C function callable from Python requires you to write a wrapper for it).
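The same wrapping pattern can be sketched with ctypes instead of Cython (the library, build line, and function here are all made up for illustration; a real Cython wrapper would look different):

    import ctypes
    import numpy as np

    # Hypothetical shared library, e.g. built with:
    #   gcc -O3 -fopenmp -shared -fPIC hotloop.c -o libhotloop.so
    lib = ctypes.CDLL("./libhotloop.so")
    lib.sum_sq.restype = ctypes.c_double
    lib.sum_sq.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]

    x = np.ascontiguousarray(np.random.rand(10_000_000))
    total = lib.sum_sq(x.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), x.size)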
I might actually disagree on this one. If it was the first year of their PhD, then yes, it does seem like grunt work which is dissatisfying, but they will have fully learned the ins and outs of their group's code framework.
In the process, they would have understood nearly every approach taken by the group towards producing the results that it does, and I bet that has helped them when they've ended up modifying that same code later for their research. And with fewer supervisor meetings to work out exactly what X, Y and Z part of the code does because they will have worked on it themselves.
I'd say I've easily spent 6 months just getting my head around all the code we work with anyway, so at the very least, I hope that made their following years of research more productive.
Not necessarily. It was definitely bad if they want to become academics and needed to publish papers. But if they knew from the start what they were getting into and wanted to transition from physics to software development then it might not have been so bad (although there might have been a more optimal path).
A bad advisor wouldn't care what his students wanted, but a good advisor might still have students work on this kind of project as long as they were aware it wasn't going to help them get a tenure-track position in the future. If they wanted to go work at a national lab doing HPC work and programming, it might have been perfectly fine for their career (this is what I am transitioning to now), or if they want to go work for a hedge fund or Apple, it might also be an okay option.
I don't know. Having "rewrote X kloc of scientific Fortran to modern idiomatic C++" on your CV should get you to the head of the line in many places when looking for a job.
Only if you go after run of the mill coding jobs after your phd, in which case why bother at all. When applying for a postdoc, you'll very much want to bury that part of your work.
Plenty of high-paying quantitative jobs require such a degree and experience. The majority of PhDs don't go on to support their family with a tenure-track academic salary.
Or you could be applying for an HPC research job in one of the countless non-university settings that also do that sort of thing. Then having both a relevant PhD and some relevant hands-on experience will be extremely helpful.
>>I later learnt one of our "rival groups" attempted the same thing and it took three phd students working full time for a year to rewrite their code from fortran to C++.
I wonder if there are any transcoders for Fortran to C++? I wonder if there is even a market for something like that?
I've written alpha versions of transcoders for C to Java and Java to Objective-C and I think I could do the same for Fortran to another language but why?
Writing a transcoder first is the right way to do it and it should take you a couple of months to do it, especially if you already have some experience with this kind of stuff. Definitely less than a year for a single person. I would not try to translate the code by hand. It would be a never ending project full of bugs.
Edit: Apparently there are plenty of Fortran to C++ transcoders. Here is just one I found during my google search:
http://cci.lbl.gov/fable/
When I was working for LROC, most of the planetary geologists used IDL (Fortran-ish in many ways) but there was still plenty of actual FORTRAN floating around. Oh, look, in the last couple of years this legend of photometry worked up this new method and here's the associated Fortran code. That kind of thing.
I did actually rewrite both IDL and Fortran, but it was always smallish, single-purpose programs or functions.
Note that the SPICE Toolkit[1] is still written in Fortran today, translated to C using f2c, and the C version used as the base for other language support. The Fortran part is unlikely to ever go away, since support for processing past missions is crucially important and all that processing code was written in Fortran. Also, talk about your stable, backward compatible APIs...
Just before I left JPL, the SPICE team had announced a plan for a complete rewrite in C++. The SPICE team is exceptionally small (especially considering how widely used and impactful the software has been). IIRC from the team's inception to the time I left, it has been about 4-5 people. Their primary goal has always been stability and correctness over speed. Recalling a conversation with Boris, they have something like 2.5 million lines of test code. So it would take some time to port over. The codebase is probably the most documented I have ever seen; every mathematical deduction is described in great detail.
That being said, as someone who has integrated CSPICE into several C++ and python projects, the modernization would be a very welcome change. The current arch depends far too much on global state, and none of it is threadsafe.
I recall that a local aerospace outfit was looking for C++ programmers to redo a metric F%^K ton of Fortran into C++.
I actually rang the recruiter to ask why it would not be simpler to train their existing staff to use Fortran - the ad stayed up for years and years; I always wonder if it ever got ported.
Not really; it's very odd, because when I worked at one of this organisation's peers straight out of high school, I was told to get a Fortran book from the company library and learn it. I also had an hour's basic instruction in how to boot the PDP11.
"...and it took three phd students working full time for a year to rewrite their code from fortran to C++"
There's a salient lesson here. Clearly, this was a stupid decision for two reasons:
(1) There are many thousands of scientific routines that have had 40-60 years of fine honing and careful debugging, and they just work! For instance, we send probes—Voyager, Cassini, etc.—to the ends of the solar system and they invariably get there; the Fortran routines that get them there do exactly what they're supposed to do (unlike much of today's poorly written C code)!
(2) Rewriting that already-reliable fully-debugged Fortran code into any other language will almost certainly make it far less reliable, thus it's a no-brainer to stick with the original .F source.
Yes, Fortran is simpler than C [consider it BASIC on steroids], and years ago it had long since evolved well past John Backus's 1954 incarnation of the language into a solid workhorse that both physicists and engineers use regularly.
Just because something is old and out of fashion doesn't mean that it's broken or doesn't work well. (Longevity ensures that there's been sufficient time for many hands to make it reliable.)
(Oh, BTW, I'm reminded that decades ago when I was just beginning to learn Fortran using punch cards on an IBM360 mainframe with the WATFOR FORTRAN-IV compiler, that I made four errors in only six lines of code. After dutifully printing ERROR against each offending line, the compiler finished with the message: "YOU NEED TO SEEK ANOTHER CAREER " or words to that effect [yes, it was in uppercase]. Eventually, I got considerably better.)
It's not just that the code is already written and debugged; in many cases results have been published using analysis done with this legacy software. Using the same code ensures a certain amount of consistency between experiments from the same lab.
From this point of view, rewriting code is an extremely high-risk proposition. As we know, the likelihood of discovering bugs during this process is quite high.
>> Professors usually have this legacy code on hand (often code they wrote themselves decades ago) and pass this code on to their students
So cut-n-paste code. The students are running code without really knowing what it does, and it might not even be correct... it's like Stack Overflow in academia.
He certainly used to, back when qmail vs. Postfix as the better sendmail replacement was a real debate 'we' were regularly having. If he's not getting any flak today, it's most likely because he's fallen more or less into irrelevance when it comes to practical day-to-day operations.
Speaking from a government contracting point of view: Nobody is going to pay you to rewrite existing code that's already working. Nobody. The customer doesn't give a flying shit about the implementation. He'd be happy with a box of diodes as an implementation, as long as it worked and came in on time and on budget.
When you're writing up your proposal for a contract or a grant, the theme should always be that you're "adding capabilities" (which should be well-defined and constrained) to the existing codebase. If you get the money, then you've got carte blanche to rewrite to your heart's content - just don't tell the customer that this is what you're doing. Just make sure that those new capabilities indeed make it into the re-write and that you introduce no regressions in the new code.
People don't tend to write tests for their Fortran code, so the assumption that it's already working and the numbers coming out are correct is a matter of faith.
Writing tests for the kind of numerical code that FORTRAN is usually used for is hard. Sometimes there is no direct way of testing it, because if you knew any of the results already you wouldn't need to run the simulation in the first place. Quite often, the best that you can do is proper sanity checks like conservation of energy and momentum, or things like that.
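A minimal example of that kind of sanity check, using a toy 1-D harmonic oscillator in Python rather than any real simulation code:

    def leapfrog_step(x, v, dt, omega=1.0):
        """One velocity-Verlet step for the oscillator a = -omega**2 * x."""
        v_half = v - 0.5 * dt * omega**2 * x
        x_new = x + dt * v_half
        v_new = v_half - 0.5 * dt * omega**2 * x_new
        return x_new, v_new

    def test_energy_is_conserved(n_steps=10_000, dt=1e-3, rtol=1e-4):
        x, v = 1.0, 0.0
        e0 = 0.5 * v**2 + 0.5 * x**2  # total energy at t = 0
        for _ in range(n_steps):
            x, v = leapfrog_step(x, v, dt)
        e = 0.5 * v**2 + 0.5 * x**2
        assert abs(e - e0) / e0 < rtol, "energy drifted more than expected"

    test_energy_is_conserved()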
As with any sort of big transition (changing e-mail, using a new password manager, changing programming language), the solution is always to do it incrementally.
For e-mail, I generally create a new one, and over a year or two, I create new accounts with the new e-mail and gradually move accounts until the old one is seldom used. Similarly here, it may be a bit tricky, and it really depends on how intertwined it all is, but gradually write the new pieces that you're adding in a new language, or use C++ for pieces that you're rewriting; eventually you'll be much closer than if you'd tried to do it all at once.
The one thing that helps, I think, is writing codes that are open source. Yes, it's a sticky point regarding getting funding, but in that imaginary world where you are funded well, transitioning to open codes (save the things that are... export controlled) would be beneficial for all of us.
I hate how I can't publish easily on modifications I make to our PIC code because it isn't open source; eventually I'm planning to switch to another code (and might implement a needed solver for it) just for the sake of my publications.
As someone working on an Exascale project for electronic structure calculations, I have a theory about the longevity of Fortran. It's the fact that many of these codes were started years ago, and the people who have the credentials and ability to get funding for supercomputing projects learned on Fortran and stayed with Fortran because they were scientists first and programmers second.
Modern Fortran has many nice features in 2017, but the people that wanted these features moved to C/C++ long before the features became available in Fortran, and those that are left using Fortran are usually scientists, not programmers, and so don't care so much about these features. I think it is largely the older generation that says they will never stop using Fortran, in the survey mentioned in the article.
Just to suggest where the field is moving though. NwChem is a large successful electronic structure package using Fortran. Its next gen version NwChemEx that is being designed for exascale will exclusively be written in C++ (https://www.pnnl.gov/science/highlights/highlight.asp?id=441...).
Also, just from experience, people who work in HPC mostly would rather be writing C/C++, but use Fortran because they have to, not because they want to.
It makes no sense for people to have moved to C for things that have only recently appeared in Fortran. What are the features they've been missing which have appeared in Fortran in the last 10 years?
So why is NWChem being re-written in C++, and what relationship does that have to exascale? Richard O'Keefe said 10 years ago, "Why not start by rewriting the Fortran code _in_ Fortran? Fortran 90 is a very pleasant language." (NWChem is just in Fortran 77 as far as I remember.)
I work in HPC, though I don't write numerical code these days, but I'd definitely prefer to write it in Fortran than C, and I'm not clever enough to use C++. I'd also much rather maintain typical scientific Fortran.
I don't use Fortran, but I agree that as of today (and probably post-Fortran 90) it is nice enough; people moved to C++ for things like polymorphism, templates, and pointers, among other things. You could probably also rewrite NWChemEx in modern Fortran just fine. I'll give you one example of why C++ might be easier, though. Next-gen linear algebra libraries are being written in C/C++, such as Elemental and DPLASMA, plus the Cyclops Tensor Framework and probably others. When the libraries you want to use are in C/C++, that can be the easiest path forward.
Yes, the electronic structure community is rapidly moving away from Fortran.
In my opinion, Fortran will fall behind as newer hardware, libraries, etc. drop Fortran support. Also, newer grad students are all much more interested in C/C++/Python. I think this is in part because the newer languages are widely used outside of science, and therefore there is much more documentation and there are more tutorials/guides. Not to mention that skills in those languages are transferable to other areas (data science, machine learning, etc.).
As a side note: Wow someone working on the ECP on HN? I'm tangentially related to the project (and just visited PNNL last week)
> Yes, the electronic structure community is rapidly moving away from Fortran.
Really? Seems to me that with very few exceptions (e.g. GPAW which is python/C and nwchemEx which I've never heard about until the parent poster mentioned it), electronic structure is pretty much a Fortran bastion.
(Source: I did a Phd doing mostly electronic structure calculations, graduated ~5 years ago)
New libraries are being written in C/C++, and maybe Python. This includes libraries that should form the foundation of the QM community (matrix/tensor and integral libraries). These are meant to take advantage of newer hardware and libraries which themselves are written in C/C++, and often in a way that is inaccessible from Fortran.
The old Fortran code will be around for a long time, but I don't know of any large-scale, serious efforts to develop new packages or major new functionality that are starting with Fortran.
(I'm not totally against Fortran - I just spent a week developing in it. But I still much prefer C++ and Python.)
I don't understand why the implementation language of low level libraries should determine a high level language that uses them. In what way are C libraries inaccessible from Fortran, given that it defines interoperability with C?
I'm afraid you need a 10-, or preferably 20-, year perspective, not a week.
I have been developing in Fortran for years. I just mentioned that as an aside.
About C compatibility: Many C libraries use pointers in their interfaces. Interoperability is indeed defined by the Fortran 2003 standard, which I have used several times to wrap existing C libraries. However, much of the existing code is F90 only (some even F77...), and a vast majority of Fortran developers in the field are not familiar with even modules and other F90 features, let alone iso_c_binding from 2003.
Also, newer libraries tend to be C++ as well, which is more difficult (or at least more awkward) to wrap.
It's partly the power of library authoring that is moving things, I think. To my knowledge, most post-LAPACK tensor algebra is all C/C++ as well. I don't know Fortran, but I am not sure it has the same flexibility when it comes to generic code and things like writing expression-template math libraries.
Finally, the fact that groups like Facebook and Google are writing their machine learning code in C++ shows that (1) they find it useful and (2) it's plenty performant. This kinda became a response to the comment above yours, sorry.
If you don't mind, I'd like to maybe chat with you a bit and get your opinion on some things (and maybe see if I've actually met you before). My (mainly throwaway) email is ytterbium35 (at major email service run by google).
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is written in C++ and it's over 20 years old. The FORTRAN codebases seem to be centered around the finite elements/difference methods and the fluid dynamics community.
Sorry, I wasn't specific. My comment mostly applies to the quantum mechanics (QM) community, rather than molecular dynamics. In QM, many people still run Gaussian/GAMESS/ADF/MolCAS/MolPro/Dalton, which are all Fortran (or majority Fortran). And most are Fortran 90 or earlier.
My background is in QM, so I guess that's my bias showing through :)
There is a lot of electronic structure code in C and C++. For SMP there are Psi4 and ORCA; also, Garnet Chan has a new Python/C++ package. And some widely used integral code generators generate C or C++ (libcint, Libint2).
On the parallel front MPQC has been around a long time and is C++.
I have a somewhat amusing FORTRAN story from my undergrad days at the Florida Institute of Technology...
So my first programming class ever was a Numerical Analysis class taught at FIT, and to be honest, this was my first exposure to a "real" editor (vi on a PDP11 in this case)...up to then it was all MS-BASIC with that wonky line editor and, of course, gotos and line numbers.
At the end of the first class (8am, ugh), the instructor announced that anyone looking to get extra credit, and perhaps skip having to come to the early class, should talk to him after class. Of course that sounded good to me, so I went to see him and he said "ok...if you can write me a bowling league manager in 10 weeks you will get an A and not have to ever come to class."
Ok...hell yes in fact! This sounded a ton more interesting than sitting around in a silly class talking about programming. He gave me a spec sheet and away I went to the lab to begin my struggles with vi and FORTRAN.
It wasn't easy, but holy shit did I learn a lot...more than I ever could have just doing the exercises in floating-point rounding error and non-linear simulations (I ended up doing those later as well) that were "taught" in class.
I can still remember FORTRAN (77, I believe) had a very strict formatting scheme where the column had to match the keyword in order for the program to compile, or something stringent like that. But mostly, coming from BASIC, it was a breath of fresh air.
I ended up completing the program with extra bells and whistles...sorting, multiple leagues and other things...and the instructor was duly impressed.
I knocked out a bunch of lower division CS classes at a JC (taking 5 at a time). I think I went to the first class and a few before midterms and finals. Just got the labs and handed them in the next day.
Hurray for attendance-optional JCs! :D
Most of it transferred to a UC and then the fun began:
- caching http/1.0 forking select() proxy server as the third project in a networking class, circa 2002
- Java subset to MIPS assembly compiler
- Reimplement most of the OpenGL pipeline in C++, quaternions and write a trapezoid (scanline to scanline) engine (on which a triangle engine could be built). Oh and then model the interior of the building.
- Pipelined, microcoded, simple branch-predicting processor. Bonus points for smallest microcode and fewest microcycles. (I Huffman mapped the histogram of the sample assembly programs’ executed instructions to the user-defined binary macro ISA (students had to write the assembler too), and then used progressive decoding in the microcode (43 micro ops long microprogram IIRC). Blew the doors off the extra credit in that class.)
Some of the points brought up here are in fact correct, mainly legacy, testing, and awesome compilers tuned for supercomputers. However, a lot of these "why Fortran" articles (on both sides) I find are written by people who don't dabble enough on both sides of the fence, and are ignorant of what the other side offers. For example, numpy implements a lot of the stuff from Fortran the author listed, like broadcasting operations across arbitrarily shaped arrays, striding and negative indices, etc., not to mention the scipy library that contains leagues of the famous Fortran codes... and you get all that with a quick and easy-to-prototype language for the stuff that isn't the bottleneck.
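For instance, the numpy features named above:

    import numpy as np

    a = np.arange(12.0).reshape(3, 4)    # a 3x4 array

    centered = a - a.mean(axis=0)        # broadcasting a length-4 row over all rows
    last_col = a[:, -1]                  # negative indices
    every_other_reversed = a[::2, ::-1]  # strided views, no copy made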
Another issue is that computational people think C++ is about OOP. FFS, what a way to sell C++ short and ignore the more significant tool C++ brings to the table: generic programming. Whenever I talk to my computational colleagues, they talk about "C++ and OOP" as if they are two peas in a pod; what if I told you you didn't need to use inheritance to leverage the best of what C++ offers (what if I told you you didn't need inheritance to even leverage OOP!?). Templates have the potential to be a powerhouse for performance in these codes, I feel; it's just that no one on the computational side has leveraged them, because they quite simply don't understand them.
The same sort of thing is true for CS people and their critiques of Fortran usage, but I'll leave my scathing comments for one of those stories when they get shared here.
Indeed! But stuff like parsing input files and plotting is tons easier to code in Python than in Fortran. The point is that the performance-critical parts are not in Python, while the parts where we don't want the difficult baggage can be written in a much easier and kinder language like Python.
C++'s generic-programming feature still has shortcomings - I think functional-programming has more relevance to scientific computing, but C++'s functional features are "okay" but still not as capable or proven as, say, Haskell's or OCaml's - for example for tail-recursion you still depend on the compiler supporting that optimization, you can't force it or necessarily assume it will happen, with fun consequences for your stack if it doesn't.
For scientific computing, whether a language has very advanced or merely okay functional features is only a superficial style issue.
Regarding more important performance issues like low level control of memory layout and avoiding pointless copying and indirection, C++ and Fortran are both at the most effective end of the spectrum, while typical functional languages lie between "don't even think about it, by design" and "it might be OK but only a fool would put a project at the mercy of what optimizations a relatively unproven compiler opts to do".
Functional languages force you to be more correct, more often. Eliminates a bunch of classes of bugs which are anathema to scientific computing, and are generally so high level that compilers can optimize extremely aggressively. Also, scientific programming is usually much more about data flow and transformation, which is FP’s wheelhouse.
Output results would be the same - that's a matter of program correctness, regardless of whether it's written in a functional, object-oriented, or procedural paradigm.
The comparison should instead be how long it took to engineer and build the system or program in a particular paradigm, and the qualitative engineering aspects of a particular platform. FP may be amazing for certain areas, but basing a large-scale project or business on it would be hampered by the small supply of developers who can comfortably program in it.
The article is wrong or misleading in a number of respects. For instance, OpenMPI doesn't define the language interfaces -- the MPI standard does. It talks about "no aliasing of memory" -- the rules actually concern "storage association" -- and then claims Fortran passes by reference, misunderstanding the whole thing. The Benchmarks Game is pretty useless generally, but it's clearly useless to compare supposed language speed by using two different compilers anyway.
I don't mean to knock Fortran.
One of the key points of the article is that there is a lot of legacy code written in Fortran. As a former high energy physicist, I have an anecdote here that some people might find interesting.
There was a library written in Fortran called CERNLIB which included a broad variety of miscellaneous numerical algorithm implementations (e.g. minimization, integration, special functions, random number generation) [1]. I couldn't tell you exactly when the library was first released, but my best guess would be the early 80s. It can't possibly be later than 1986, when PAW was initially released [2]. The field has since transitioned from the Fortran-based PAW to the C++-based ROOT, but many high energy collaborations still rely on CERNLIB for their own analysis frameworks (keep in mind that many of these experiments had been in planning and development stages for over a decade before they actually turned on).
The thing about this that I find interesting is that compiling CERNLIB has become a lost art and that this fact has had far reaching consequences. The last available binaries were compiled with GCC 4.3 in 2006 and packages are only available for Scientific Linux 4 [3]. This crucial dependency has led to collaborations using extremely outdated Linux distributions and GCC versions in their computing facilities. The majority of analysis code is written in C++, but not even C++11 additions can be used because everything is frozen on GCC 4.3. Nobody can even run the analysis environment on their local machines without resorting to the use of virtual machines running SL4. It was really a nightmare to deal with.
Cernlib is available in current Debian (and also in EPEL 6). I don't know why it's not in anything else under the Fedora banner, and wonder what nasty non-standard stuff CERNLIB does. I don't remember ever having to look at more than bits of it.
I've heard an anecdote of HEP analysis code that was written by a team in C++ and wasn't ready for impending data collection on LHC. Someone apparently rescued the situation by turning up with a working Fortran system he'd written on his own. I don't know details other than which university group the report originated from; I'd be interested to know more about it.
Huh, funny thing. I had to use a heavy-ion collision simulation program written in Fortran, which had to be compiled using a certain compiler implementation (1). After 2-3 weeks of debugging and trying different compilers in vain, my supervisor put me on the phone with a guy who had been more successful than me, and he told me which Fortran compiler to use.
(1) Each compiler gave different results: compilation errors, or code that goes into an infinite loop.
Getting locked to a particular compiler isn't usually a property of the language, but of the development team. It happens with all languages: when the underlying platform remains fixed at a site, the code evolves to rely on every capability of that particular instance, not the spec. This is why it is important to have platform diversity from the start. Run your unit tests on at least two toolchains, hopefully on at least two different distros if not kernels. I like to use GCC/Clang and CentOS/Ubuntu as my base platform matrix.
One good example of this is to run your code on 32-bit, 64-bit, big-endian and little-endian machines, in all of those combinations. That works pretty well as a way to keep you on your toes with respect to portability.
The bigger experiments at CERN use C++14 and Python quite a lot. There's still a bit of wrapped FORTRAN code kicking around, and we definitely use rather old-school distributions, but I haven't seen anything as bad as your anecdote.
> The benchmarks where Fortran is much slower than C/C++ involve processes where most of the time is spent reading and writing data, for which Fortran is known to be slow.
Why would IO be slow in any language? What does the language have to do besides buffering and system calls?
> In Fortran, variables are usually passed by reference, not by value. Under the hood the Fortran compiler automatically optimizes the passing so as to be most efficient.
Aren't arrays implicitly passed by reference in C also?
I believe many Fortran implementations default to unbuffered io.
Which is probably easy enough to change.
But I think that's really the core issue. Physicists don't want to learn more about programming languages. They want whatever mostly works out of the box and has local documentation and expertise specific to their problem domain.
>Aren't arrays implicitly passed by reference in C also?
He covers that. C passes arrays by reference, but the individual elements aren't contiguous. He says Fortran passes an optimized reference.
he's definitely doing it wrong there. As a scientific C programmer, I would never do that, I would malloc one contiguous array of nrows*ncolumns. And my freeing of the array would be as simple as his Fortran deallocate.
Sure, C could make multi-dimensional array handling nicer, but I have macros that basically do his A[x,y,z] for me, admittedly a bit more verbosely.
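For anyone who hasn't seen that style, a minimal sketch of what a contiguous allocation plus an indexing macro can look like in C; the struct and macro names here are mine, not the commenter's actual macros:

    #include <stdlib.h>

    /* One contiguous block holding an nx*ny*nz array of doubles. */
    typedef struct {
        size_t nx, ny, nz;
        double *data;
    } Array3;

    /* A[x,y,z]-style access, flattened to a single offset. */
    #define IDX3(a, x, y, z)  ((a).data[((x) * (a).ny + (y)) * (a).nz + (z)])

    int array3_alloc(Array3 *a, size_t nx, size_t ny, size_t nz) {
        a->nx = nx; a->ny = ny; a->nz = nz;
        a->data = malloc(nx * ny * nz * sizeof *a->data);  /* one malloc */
        return a->data != NULL;
    }

    void array3_free(Array3 *a) {
        free(a->data);   /* one free, mirroring the single malloc */
        a->data = NULL;
    }

Usage is then IDX3(a, i, j, k) = 0.0; and deallocation is a single call, much like a Fortran deallocate.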
Many of his points about Fortran are true, but almost all his statements about C are false, or nearly so. Yes, copying an array of floats requires calling a function, memcpy, rather than just using an equals sign, but that isn't rough, and it fits his own idea of showing "what actually happens 'under the hood' inside a computer". And others are easily handled by adopting things like the MKL (want to take the sin of everything in an array? see https://software.intel.com/en-us/mkl-developer-reference-c-t... )
Fair enough then; I suppose that in C you have to do something like that if you want your matrix size to be determined at runtime and still be able to index it with m[x][y] rather than through some function.
That’s not strictly true. You can malloc an m*n sized array and then fill a second array with pointers to the start of each row (or column, depending on your layout), which gives you standard-style indexing.
If you do that, every access actually dereferences that extra pointer, and the compiler can't optimize the standard-style indexing down to a multiplication and an addition, so this still carries a significant performance cost.
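A sketch of the pointer-table approach being described, assuming row-major storage (the function names are my own):

    #include <stdlib.h>

    /* Allocate an m x n matrix as one contiguous block, plus a table of
       row pointers so it can be indexed as mat[i][j]. */
    double **matrix_rows(size_t m, size_t n) {
        double *block = malloc(m * n * sizeof *block);
        double **rows = malloc(m * sizeof *rows);
        if (!block || !rows) { free(block); free(rows); return NULL; }
        for (size_t i = 0; i < m; i++)
            rows[i] = block + i * n;   /* each entry points into the block */
        return rows;
    }

    void matrix_rows_free(double **rows) {
        if (rows) { free(rows[0]); free(rows); }  /* the block, then the table */
    }

The data itself stays contiguous here; the objection above is the extra pointer load on each mat[i][j] access compared with computing a flat index directly.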
Fortran is much easier to use than its direct competitors if you are writing array-based number crunching. For all other uses, it’s hopelessly outdated. If I were to explain its purpose to a non-Fortran programmer, I’d say that Fortran is useful for number crunching kinda like regular expressions are great for certain text-processing tasks. It’s basically a domain-specific language and should be used accordingly. Wrap it with C++ or f2py and call the routines from Python/C++, where you do the «software parts»: IO, GUI, ...
That being said, I usually just use python and surf on other people’s hard work!
Oh, and the author is wrong on many of the specific details. For instance, MPI is available to many languages, including python.
> [Fortran is] basically a domain specific language and should be used accordingly.
I read this whole discussion with interest, and I think this is the most compact and insightful statement here. Thinking about it this way makes the situation very clear.
The paragraph on "legacy code" is a bit weak and half-hearted because it underemphasizes one of the most important arguments for using old code: it's been thoroughly debugged already. The most the author can summon on the topic is the fact that legacy code "takes uncertainty out of the debugging process." What? There is no debugging process, because that code has been debugged for 40 years and is damn near bulletproof at this point!
Everybody is used to cringing when they hear "legacy code," and that's justifiable for several good reasons. Note that "not wanting to learn an unfamiliar language" isn't one of them. And "not having, or not being willing to use/cultivate, the skill set of reading someone else's code" isn't one of them either.
But there is obviously a lot of bad code out there. And that's the thing, there are only two kinds of code: good code and bad code. And by extension there is bad legacy code and there is good legacy code. Don't assume legacy code is always bad code. If something has been used successfully for 40 years, do yourself a favor and try to have the humility to assume people implemented it well, found all the bugs, know what they're doing, and/or generally are rational-thinking adults who make good choices... instead of the usual naïve assumption that everybody's an idiot but I'm going to change all that! No, you're going to duplicate a lot of effort, and possibly (depending on the faithfulness of your reading of the code) reintroduce some of the same bugs that were dealt with years ago.
I don't know about the mathematical domain, so this might be off-topic, but in some cases a legacy code base that works perfectly can still be brittle and full of holes and latent bugs that never manifested themselves, because the code that doesn't directly interface with the input data is only ever exercised by a subset of the data it should support, and many of the possible code paths have never been evaluated.
But if you try to refactor the code to, say, support another feature or optimise it, you might get into nastiness that is beyond comprehension, and you cannot count on the code being coherent or correct.
This is my experience maintaining a legacy base that has been in production for many years. It's just a pile of frozen code that nobody has properly refactored, probably out of fear of breaking something. You end up with unreadable layers and weird technology-specific hacks that were carefully made to just barely work, probably written without understanding what the existing code actually did but cargo-culting it, resulting in a massive amount of code that does very little.
Other reasons not mentioned: 1) built-in support for complex numbers; 2) a Fortran compiler usually generates faster code (with fewer pointers to worry about, it can make more aliasing assumptions). But recently I see more and more physics code written in C++.
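On point 1, for comparison: Fortran has had a COMPLEX type since its earliest versions, while C only gained a native complex type with C99's <complex.h>. A minimal C example:

    #include <complex.h>
    #include <stdio.h>

    int main(void) {
        double complex a = 1.0 + 2.0 * I;   /* native complex type since C99 */
        double complex b = 3.0 - 1.0 * I;
        double complex c = a * b;           /* arithmetic operators work directly */
        printf("%.1f%+.1fi\n", creal(c), cimag(c));   /* prints 5.0+5.0i */
        return 0;
    }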
> Interestingly, C/C++ beats Fortran on all but two of the benchmarks, although they are fairly close on most.
I think this is fairly recent that C/C++ wins. I don’t know how recent exactly, but I remember a colloquium not too long ago by a compiler researcher who said that cross-compiling to Fortran and then optimizing almost always produced faster code than the C/C++ compiler could. Fortran is apparently easier to optimize.
For science codes, where Fortran is still used, the most expensive pieces (think DGEMM kernels) are largely written in architecture-dependent ASM anyway, so the programming language doesn't have that much of an effect on the most time-critical pieces of large HPC programs.
If a specific function is important enough it will be hand optimized in a way that the language doesn't really matter. Sometimes that means calling MKL, or using Cuda, or writing your own assembly.
DGEMM doesn’t necessarily win over MATMUL for small matrices. Useful in e.g rotation matrices in finite element codes, and lots of other areas. Since matmul can be done with inlined loops, you also avoid the function call overhead, but it looks like a function call which is good for readability.
What I don't get is why someone doesn't make a superb template library for C++ that hides the restricts; it would make writing performant codes in C++ easier, and you'd get the whole host of other things C++ has to offer.
Agreed. However, C++ pragmatically does have restrict in both clang and g++, which is still useful in cases where you can deductively prove the lack of aliasing.
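A minimal sketch of what that buys you, using C99's standard restrict (g++ and clang accept the equivalent __restrict spelling in C++); the function below is my own illustration, not something from the thread:

    #include <stddef.h>

    /* restrict promises the compiler that out, a and b never alias,
       so the loop can be vectorized without runtime overlap checks. */
    void axpy(size_t n, double alpha,
              const double *restrict a,
              const double *restrict b,
              double *restrict out)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = alpha * a[i] + b[i];
    }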
> Even if old code is hard to read, poorly documented, and not the most efficient, it is often faster to use old validated code than to write new code.
Amen. A one-character mistake might take a week to find as it exhibits only subtly wrong behavior (e.g. wrong grid convergence rate, overly noisy boundary condition, odd symmetry breaking beyond IEEE floating point). During that week no science happens.
Yup. Think of C as a rusty straight razor and Fortran as a barn full of rusty implements, about ready to fall down at any time. C++ is maybe a rusty safety razor.
Originally, Fortran had manual memory management, as per the times. Thankfully, the language progressed.
Overall, the evolution of languages from assembly/raw instructions to procedural ones needed early languages like Fortran, on which other higher-level languages, tools and OSes could later be built/bootstrapped.
Our physics group has a core library, first written in 1987, that is in Fortran (a Microsoft dialect, to be specific).
Why haven't we moved to something else? It works, it is time-tested, and the original author continues to maintain it.
(P.S. I'd like to compile it with the gfortran tools, in order to preserve the library for the future. Is there any documentation for simple conversions from Microsoft's implementation of the language to the more-traditional spec?)
A problem for computational science is that people care about their publications more than about others being able to reproduce their work. The funny thing is that an open-source code is a sure way to attain a legacy.
Is MS Fortran still a supported product? I'm not sure I'd call software that depends on unsupported and unmaintained other software properly maintained.
There are some caveats AFAIK. Intel changed some default behaviors. I believe you can change them back with a command-line switch though. Of note is the SAVE property on all local variables being the default on MS. Now that’s an insane default if there ever was one!
Sounds like it's "legacy" or "heritage" supported, which is not a good place to be. Yeah, it still manages to work, and it's available for use because it costs Intel effectively nothing to leave an installer up on the web for download, but there's no real "support".
That's not a piece of software I'd want to tie my code to.
It's not, unfortunately. But how are you "keeping the lights on" with your old MS Fortran based code? Still using an ancient Powerstation? compiler, or Compaq Visual Fortran, or ... ?
We're still running the old compiler, probably within DOSBox or something similar. There are one or two boxes on which the executable can currently be built.
A slight gap in knowledge in the piece: Most Python libraries for numerical computation are written in C/C++ or... Fortran. Last time I had to compile scipy from scratch I had to install gfortran.
An immense amount of effort was put into the performance, correctness, feature set, and numerical stability of such widely used libraries. Replacing them without a very good reason is hardly feasible.
BLAS is an interface. All performant BLAS libraries need to be written specifically to take advantage of modern hardware. Things like Netlib exist as reference implementations; they are not high performance.
The point I am trying to make is that Numpy is unlikely to still be using the exact same code that was written in the 1970's unless its performance isn't critical. It is making the same function calls but the actual code will look pretty different.
Numpy can use optimised BLAS but I believe it ships with an f2c transpiled reference BLAS so it can run (although slower) without Fortran where necessary.
Some from the 1960s. Quoting numpy/lib/function_base.py:
We use the algorithm published by Clenshaw [1]_ and referenced by
Abramowitz and Stegun [2]_, for which the function domain is ...
.. [1] C. W. Clenshaw, "Chebyshev series for mathematical functions", in
*National Physical Laboratory Mathematical Tables*, vol. 5, London:
Her Majesty's Stationery Office, 1962.
.. [2] M. Abramowitz and I. A. Stegun, *Handbook of Mathematical
Functions*, 10th printing, New York: Dover, 1964, pp. 379.
http://www.math.sfu.ca/~cbm/aands/page_379.htm
I needed a .NET math library. My IT department insisted on my trying a wrapper around a Fortran library (I believe purely because they already had a license). All the class and variable names are short hexadecimal strings. None of the interface is idiomatic to .NET (the library only has void-returning methods which pass error codes as byref arguments instead of using exceptions, and it requires a parameter for the length of any array passed as an argument instead of reading it from the array), etc...
Basically I refused to touch this thing; using a library which makes the code unreadable is going to be a bug magnet. I would be surprised if I was the only one having this reaction to 1960s coding conventions surfacing in modern code.
So selling new licenses should be a good enough reason.
Your .NET wrapper doesn't need to expose the same name or calling convention. That's why you use a wrapper: to abstract away distasteful non-idiomatic decisions in the wrapped API.
Sounds like your IT department botched the wrapper. Wouldn't a more successful candidate likely just have used an abstraction layer between the .net conventions you were expecting and whatever the native code was doing?
I spent several summers working at Los Alamos National Lab in their HPC group on I/O, among other things. Supporting legacy codes was a requirement, even if it meant that we couldn't explore obvious and important optimizations. I remember an anecdote from the division lead, that the reason a lot of legacy code is never replaced was because of certification. Codes are designed to simulate some critical system, and it takes years to trust the result. So any change that forces a multi-year validation process was a non-starter.
My understanding is that the code modernization and co-design efforts that are a part of the Exascale initiative are changing this.
That was my initial reaction as well. However, I would guess that C++ is mostly used as a "C with classes" in the domain the author is talking about, so "C/C++" wouldn't be so incorrect.
Maybe that's more just because some people just need C and some people need C++, but both are very standard, go to compiled languages, and are often learned in tandem.
I don't think the choice to continue using Fortran is as well thought out as the author suggests. I work in computational physics and have written a lot of Fortran, C++ and Python.
The predominant issue is that physics codes are typically written by PhD candidates, often with little to no programming experience. The projects can span 3-5 years and continue existing in the physics ecosystem for decades. Good programming practices are seldom employed, and the codes become these massive patchwork ships, leaking everywhere with holes plugged by spaghetti code.
The issue is not that the students aren't smart enough to learn good programming practices, it's that the advisers are not patient enough to wait for documentation, unit tests and so forth. They view good practices as wasted time, after all it's "physics not computer science".
Unfortunately, the path of least resistance is actually to adopt unit tests, write code documentation and generally employ industry programming practices; it's just that this path has a barrier to entry, and the benefits are not immediately apparent to older academics who have fallen out of the loop.
The end result is quite sad: students with advanced programming skills are chronically underappreciated in the field. Professors will happily bring them on board as postdocs to develop their simulations, but they balk at giving them jobs based on their computational skill set. It is thus no surprise that the computational talent leaves physics for careers in the private sector, where the union of math skills and programming is in high demand.
A lot of the arguments in this post can simply be rebutted with basic abstractions. Things like "Dynamically allocating and deallocating ... 2D array" are easy in C++. You could easily have someone define a MathArray<Type, Rows, Columns> class and turn the messy Fortran allocation code into something that looks like:
auto *my_matrix = new MathArray<double, 10, 10>();
The code the author showed demonstrates a lack of understanding of C and C++. Even if you restrict it to C your matrix code should look something like this
For me "real, dimension(:,:), allocatable :: " is much more complicated than "matrix_make"
Many of the issues people see in the speed difference between Fortran and C code are likely based on a misunderstanding of how Fortran actually lays out their data, and of what the computer hardware (and what you're describing to the C compiler) actually does. This "double array" that was defined would never be allowed in production code. The amount you'd be hitting the OS for even small allocations is crazy.
The arguments for Fortran, as far as I'm concerned, are:
1. We already know it
2. We're not going to get grant money to rewrite a library
3. We've built a bunch of computer clusters and have to justify what we spent (rather than buying 2 GPUs for your workstation)
4. We've all spent a lot of time learning how to use MPI that we're never getting back.
You're making a pretty big deal about trivial syntactic issues.
> Into something that looks like
>
> auto *my_matrix = new MathArray<double, 10, 10>();
The dimensions here are a template parameter. They must be known at compile time. Also, if you go this route and you want to write a function that, say, adds two matrices, you get one copy of that function in your binary for every matrix size that occurs in your program. You also can't naturally interoperate with someone else's matrix code unless that someone else specifically wrote against your MathArray template class.
> And if you'd really like you can hide the sizeof via a macro...
>
> #define MATRIX_MAKE(r, c, type) matrix_make(r, c, sizeof(type))
>
> For me "real, dimension(:,:), allocatable :: " is much more complicated than "matrix_make"
It looks uglier, but it's language syntax rather than custom code. Whoever is reading your code doesn't have to unpack a macro and then look into a function to figure out what you're doing. (And as a bonus, the Fortran user can index the matrix without mentioning that it's a 'double' matrix when he's indexing.)
> Many of the issues people see in the speed difference between Fortran and C code are likely based on a misunderstanding of how Fortran actually lays out their data, and of what the computer hardware (and what you're describing to the C compiler) actually does. This "double array" that was defined would never be allowed in production code. The amount you'd be hitting the OS for even small allocations is crazy.
Which many issues are you thinking about? Idiomatic Fortran has an inherent advantage over idiomatic C in that better aliasing information is available to the compiler.
If you want to make an implementation that supports runtime-specified matrix sizes, you change your implementation from...
auto *my_matrix = new MathArray<double, 10, 10>();
Into
auto *my_matrix = new MathArray<double>(10, 10);
Most programs don't need dynamically sized arrays (rows and columns) and as such it makes sense to also provide a template Row and Column width. By doing this you can likely implement matrix multiplication and addition as a constexpr (with some effort) and thus get....
1. Compile time matrix evaluation
2. Vectorized multiplication/addition of matrices
3. Pipeline-efficient code
I think your 3. and 4. are phrased a bit dismissively, though I don't entirely disagree. Not every workload in scientific computing maps cleanly into the GPGPU paradigm. Though that's not really a compelling reason to use Fortran over C++
An additional reason I've heard (I don't personally use Fortran) for keeping Fortran around is that it's straightforward to convert Matlab prototype code into Fortran
Physicists / engineers / mathematicians are the target audience for Fortran. For heavy number crunching it's still quite good, it's people trying to use it for other stuff that causes problems.
That said, Fortran really is dying. Scientific code is much larger nowadays with more functionality and scientists want to do everything in one language. C++ and Python are taking over.
The vast majority of scientists are not able to write idiomatic Fortran, let alone idiomatic C++. Scientific C++ code that didn't have oversight from a professional C++ developer will always be horrible. Scientific Fortran code written without such oversight can sometimes be bearable. This is perhaps the main advantage of Fortran.
Eh, I'm talking mostly about the large scientific code packages that are being developed with millions of dollars in funding and large, organized teams. The people writing these sorts of codes know what they are doing and a lot of migration to C++ is because they are more familiar with it and it's easier to hire skilled people.
1) Some stuff is already written in Fortran so they don't want to rewrite that. I dig it.
2) It's fast (though C is sometimes faster) but easier to write than C. Like 100x faster than Python.
I'm not sure about number two. With the gpu processing revolution wouldn't a python/TensorFlow stack be faster than Fortran? Am I missing something?
I remember talking to someone who had worked heavily on atmospheric weather predictors recently and her description of the program was: we divide the space into tiny little cubes and then run some differential physics equations to predict what will happen next. My basic questions to her:
1) From a computational perspective this seems very GPU friendly.
2) Why not use a convolutional neural network? If you use the same data for training you will probably wind up with a more accurate prediction than a theoretically based physics system.
Her reaction was basically that she hadn't heard of these things before so my impression is in fact that the physics community is just behind and they will catch up when they are ready.
TensorFlow is for machine learning, not general purpose computations. And no, you will not get better results from a neural network than state of the art computational fluid dynamics.
As for general-purpose GPU programming, some parts of physics are GPU friendly but not all of them.
IIRC there was some project doing more or less real-time weather forecasts for small areas (think airports and such) that used deep learning instead of traditional CFD style simulations. They were able to do it with several orders of magnitude less CPU usage than the CFD calculations.
But was it more accurate? Was it for the same purpose? Getting within a few percent of a CFD model might be good enough for industrial use. For studying systems (from the perspective of scientific investigation) it's not even close.
Dearest me. I hope you're not under the impression that the performant part of your ML code is written in Python. Looking at its repo on GitHub, the Python in TensorFlow is the interface to a C++ library. I don't use TensorFlow, but if it's anything like numpy and scipy, those are interfaces to C libraries which are extremely performant. If the question is why a CFD code can't be in Python... well, I don't know if anyone has tried it, due to your point 1).
I don't think you understand Tensorflow. Tensorflow compiles tensor mathematics into a shader that can be run on the GPU. This is usually faster than a CPU computation.
A lot of science is done with GPUs ... in FORTRAN and C++. I believe an n body simulation in GPUs is what started the idea of using GPU for general computation.
I would be extremely surprised if you could get similar or better predictions for the same amount of computation from an off the shelf CNN than from a carefully tuned physical model. (I would love to see counterexamples).
Even then I would question how well it extrapolates; you generally have a very good idea of where your physics simulation will break down.
I think your friend is right; the physics community can't afford to throw away as much code as the tech community. I wouldn't spend years building a simulation in tensorflow because it will likely be obsolete before I finished (leading to the same problems as FORTRAN but increasing the amount of dependencies).
In any case, if it works well enough it's not going to be rewritten.
Eventually ML will become part of physics simulations, but I don't think there will be easy wins in these well developed areas.
What do you mean by "discrete problems"? By my understanding, weather prediction is basically solving a large system of partial differential equations. Sure the grid methods are a discrete approximation of the true problem, but it is not what I would normally think of as a "discrete problem".
Training a neural network to map "current state of atmosphere" to "state of atmosphere in the future" is definitely possible to do with a neural network, and sounds like a good idea to me.
While I can imagine that neural networks are useful in predicting rainfalls or precipitations, I don’t see how they are suitable for some tasks, such as the forecasting of typhoon paths. A neural network is basically a function interpolation device. To use it for prediction, the implicit assumption is that the function it approximates is somehow well behaved, but some weather systems are chaotic. Function interpolation doesn’t seem very useful in this regard.
From what I see from googling, cyclone track forecasting by neural network is definitely an active area of research, but apparently it isn't practical yet. The lady you talked to was probably working on some chaotic system like this.
Why not create a DSL in C++ that gives you the same syntax as Fortran for doing array/matrix manipulations? That's really the one main advantage I gleaned from this article
Things like this exist, but there are two issues. They aren't part of the standard, and Fortran compilers are better at optimizing built-in Fortran abstractions than C++ compilers are at optimizing user-made abstractions.
I suspect it’s the lack of standardization that’s really the issue.
I wish there was a clean nomenclature for the context of program use. My world of programming is academia and industrial research. In that setting the vast majority of software is fashioned for ad-hoc use. There is no expectation that a large population will ever make use of what we code except to adapt it to some new specific related need. Scientific publishing promotes novel investigations, novel investigations promote one-off programs. And the user interface doesn't have to be, and is very seldom, "elegant" in any sense. It's in that context of "rapid cycle until it works then seldom use it again" that Fortran (or in my hands, python scripts which call Fortran numerical libraries) still makes sense. Understanding the context will free us from the head-scratching.
On array convenience: Fortran also supports optional runtime bounds checking of array indexes. I've worked writing signal processing code in Fortran, and it's really quite nice to have that when doing dev, even if you turn it off in prod for performance reasons.
GP is saying the contrapositive for Fortran: you can have runtime bounds checks during development instead of disabling them for production builds.
I'm not sure which feature you're referring to, because while GCC has -fbounds-check, that's for the GNU Compiler Collection and only for frontends that support it (to wit, Fortran and Java). I don't know of any runtime bounds checking that ever made it into vanilla. Clang and GCC both have some limited array bounds checking, but it's static, and there are plenty of issues it won't catch. People maintained third party patches for a long time, but these are obsolete now. Perhaps you're thinking of ASAN/-fstack-protector-*?
I think this just shows how scientists don't understand programming. Pretty much every 'advantage' listed for Fortran over C++ could be added to C++ in an afternoon. C++ is an extensible language, you can define your own types and operators, so all the examples can be easily implemented. For some of the examples it's just sad that they were brought up. For example, 'you have to write a loop to allocate an array' should say 'you have to write a function one time to allocate arrays and never write this loop again'.
I don't think that's the case; that seems a little too extreme. I believe the central point, for someone whose actual job isn't programming (or who doesn't treat it as a serious hobby), of picking Fortran over C/C++ has to do with the features it has out of the box, or at least those that are apparent from a surface-level knowledge of the language. Even if it takes one afternoon to implement a certain feature, would scientists WANT or NEED to do that?
Don't get me wrong, this is probably an opinion I would defend if it were exclusively related to programmers being "too lazy to learn a new language", but this is a different learning purpose.
I can't agree, comparing two languages purely by a superficial "this one lets me write code like this out of the box" ignores everything that matters.
Let's say I wanted to write a web blog server. One language has "import blog; blog.run()" and I am up and running instantly. Another language makes me install a blog library and some other side stuff, and choose a webserver. The point is it isn't just built right in. Which language is better for writing web blog software? The answer is, you have literally no idea from what I just told you. My analysis is insanely superficial and meaningless. Presumably, if I am going to spend hundreds or thousands of hours in some coding environment, what is 'built in with no effort' matters somewhat at hour 0, but virtually not at all by hour 200. Professional scientists presumably spend thousands of hours on this stuff; it's really not too much to ask that they become somewhat competent with the tools they are using.
language built-ins are not fully replaceable with extensions and libraries. Most notably: (1) Compilers have more ways to optimize if given higher level abstractions, (2) built-ins are easier to use for the user because of specific syntax (e.g. highlighting), (3) built-ins have support in debuggers (e.g. you can easily print an array slice in a Fortran debugger).
Just a case in point: which do you think is easier to use, more standards-conformant, and more typesafe:
a) a typedef that gives you "__restrict const * const double"
Actually for C++ you are almost perfectly wrong. It's specifically a design goal of the language to not have the builtins be 'superior' to what you can implement yourself (well the builtins are built by world class experts, but the point is there is no bias in the language).
> researchers at MIT have decided to tackle this challenge with full force by developing a brand new language for HPC called Julia
This and the linked news piece [1] from MIT News sound pretty weird to me. The OP article probably takes the bit about "researchers at MIT" developing Julia from the MIT News page, so that's the real source of the issue - the MIT News pages seems to have been written with weird biases, making it sound like a primarily MIT project that other people have just tacked a few things on to. And then there's:
> A few years ago, when an HPC startup Edelman was involved in [...] was acquired by Microsoft, he launched a new project with three others.
That to me sounds like an implication that Edelman was the one to initiate the project and take in the others. They seem to be writing from the usual academic bias of "the senior faculty gets the credit even if the actual work is done by the PhD/graduate students". Edelman was Bezanson's thesis advisor and a crucial part of Julia's history, but this article seems to be downplaying the role of the other core contributors and the open source community.
I had assumed university news, at least in such technical topics, would be more reliable and less inherently biased, learned something new today.
In the late 90’s, I helped port a nuclear reactor simulator to Win32. It was around 20 million lines of Fortran and was actively developed by physicists and engineers (none were really software engineers). And, at that time, the codebase was around 40 years old. Apart from disabling virtual memory, it worked nearly flawlessly on Windows on a COTS PC and ran about 50% faster than the fastest *nix test lab box.
It’s done mostly for historical tradition reasons, and it costs nontrivial time and money to switch.
The same reason many big systems still use COBOL: it works, it's well tested, why change it? Usually they just run it as long as they have hardware they can run it on...
I've always wondered about the following approach for old apex predators like COBOL and FORTRAN, it follows what .NET does.
The idea is to take current COBOL and FORTRAN code and compile it down to IL similar to .NET's CIL, in the .NET world once code is brought down to CIL it can be read back in VB.NET, C#, C++.NET
Essentially, bring it down to some sort of lossless IL to convert to another language. It should be possible to do this given that we have the source code. In certain cases where the source code doesn't match the binary (which happens over the years due to monkey-patching the binaries, etc.), we'd have to take the approach a few folks at IBM are discussing: recompilation and reoptimization of existing old COBOL binaries. [1]
Don't throw away that debugged and battle-hardened code, change the IL and the final compile target, if possible re-interpret the IL into a newer language if its not lossy.
"Compile COBOL applications directly to Microsoft intermediate language for deployment within Microsoft .NET... Compile COBOL applications to Java byte code for deployment within the Java Virtual Machine"
I've actually seen this idea suggested, at least as part of a bake-off between design ideas. There's a certain amount of sense to it.
I remember when I was a junior engineer working on aerospace simulation software (defense contracting). I was primarily writing C++, but we had a large collection of physics algorithms written in Ye Olde FORTRAN that we had to link in.
I brought up a possible rewrite, and one of the greybeards told me that they had looked at that in the past, but the govt V&V process on any rewritten algorithms would have been so onerous that they eventually dropped the idea.
The last guy who actually understood that old code retired about a year after I started there and then things started to get... interesting.
As an aside, for 'new' code, I'm actually ok with the science folks dumping a pile of matlab script on me so I can rewrite it in Python, Java, etc. (rather than letting scientists write production code, which I'll have to rewrite later anyway).
Often there are many (more or less reasonable) extrinsic reasons for non-tech domains staying with the stacks they use.
I've done some programming for cognitive psychology experiments, fMRI analysis etc, and although I didn't like the often proprietary systems used (E-Prime, Presentation etc), I could see it would have required hefty investments of very scarce time to switch to something 'better'. The vast bulk of the software was written by non-programmer grad students, for whom the tech was a very 3rd order issue: they just needed their experiments up and running. This was generally done by finding a close-enough prior experiment, and tweaking it in a hurry, often with limited understanding of how the system worked. There was in most cases no possibility of paying programmers to do the work.
Modern fortran is quite awesome for matrix manipulation. The ease of using it for math makes me wonder why we don't use it more often. The main downside to modern fortran is its I/O capabilities are quite stunted. I honestly think we need an open source fortran95 to gpu compiler.
If you are using the GNU compilers (and the appropriate compiler flags with gcc), there isn't much difference in performance --- and there shouldn't be.
There are a few FORTRAN-only compilers out there; I'd be curious to see how well they do.
What would you use for a HPC application?
You need a very fast language that doesn't get in your way in data management.
Trust me, even in the basic examples we did in our parallel algorithms course, minor changes in data layout could save hours of computational time, and that was in C, where no crappy garbage collection gets in your way. Not to even mention the masses of good low-level performance analysis tools and parallel libraries made for Fortran and C/C++, but not for other languages.
Why would you use anything else?
I think some people misunderstand what makes languages good. It is not generalisable; it depends on the case. Sure, to write a script, e.g. to quickly automate a few things, shell or common scripting languages like Ruby or Python may make sense, because it is relatively easy to get going and write something in them. But that is not an important question for an HPC application. You need to write code that will definitively give the correct result and that will run extremely fast on the cluster of machines that make up the supercomputer. You need language-level means to define in great detail how your memory is to be laid out, etc. The very thing that makes a language annoying to use in a scripting context is a feature here. On the other hand, you don't care about ease of portability; in fact you will want to optimise it for one specific architecture as much as time allows. That the program will have to be recompiled is a minor issue in comparison to the memory layout, threading schedule or network communication changes to the algorithm needed to optimise it for a new system.
No language is truly superior to all others; the question is always the context and the conditions and constraints it puts on the developer.
For physicists, Fortran or C are the best choices. Even Go uses a garbage collector, which breaks it for large HPC scenarios. VM-based languages are completely useless. Their low speed is already a nuisance for simple common tasks, never mind problems that already take days or weeks to execute when they are properly optimised. If you think Java, Ruby or any such language could be used, look at benchmarks. You will find CPU time of 1.5-2.5x and memory at least 5-7x the amount needed by the same problem executed by a program written in C.
Fortran and Cobol both originated in the late 1950s and are still in widespread use in certain domains. It's extraordinarily cheap in these domains to keep using legacy Fortran and Cobol software. These folks are passionate about their domains, but they probably couldn't care less about so-called modern software languages, new development tools, and new hardware. Those of us who thrive on change find this very hard to accept. We need to get over it.
I think one problem with Fortran is that it has no standard library. To a Fortran beginner, it isn't obvious where to look for high quality third-party libraries.
FORTRAN is still great for fast numerical applications and easier for me to read than C++.
My industry is rewriting its FORTRAN base in C++ and a lot of physicists are switching to Python + Numpy for all but the most intensive tasks. I see FORTRAN being only used in legacy systems within the next 5-10 years.
FORTRAN is still in much of the R Core. Fortran is faster than C++ and I actually feel that many people are starting to realize why we still have FORTRAN around.
Though many are using Python + Numpy, R still has a larger user base in scientific and mathematical spaces.
When people say that FORTRAN is faster than C++ it is obviously for numerical code, since general systems code is very hard to do in FORTRAN. Many people will complain that C++ can be faster than FORTRAN, however the problem is that you need to know a lot of C++ or have access to the right libraries to write competitive numeric code. In FORTRAN you can do that out of the box.
Moreover, the article mentions this point, references the benchmarks game, and says "However, the two benchmarks where Fortran wins (n-body simulation and calculation of spectra) are the most physics-y".
"However, the two benchmarks where Fortran wins (n-body simulation and calculation of spectra) are the most physics-y. The results vary somewhat depending on whether one compares a single core or quad core machine with Fortran lagging a bit more behind C++ on the quad core."
Though I don't see how the single core version is measured. As far as I can tell, the spectra calculations are always using 4 CPUs while the n-body is always using 1 CPU.
I'd be interested in some stats around this. My father was a physicist, and I did some programming work for him at a large research institute, and while there was Fortran around, most of the scientists were reaching for new tools where they could.
I don't have much to add to this conversation, but it surprises me that Rust hasn't come up in the comments. Would/could Rust be appropriate for these kinds of use cases today/in the future?
In theory, yes. In practice, we need to get SIMD working on stable, which is underway but not done yet. We also need to get RFC 2000, const integers, implemented. And I'm sure other things too.
Basically, Rust could be excellent at this in the future, but right now, is merely okay.
I wonder if Haskell wouldn't be a better fit. I realize there's no rewriting all that legacy code in any language though -- but I'm thinking of new code here.
glmnet, one of the core packages for machine learning is written in Fortran. So are many of the linear algebra packages commonly used. It's not just physicists that still rely on this stuff.
I think that's more a matter of: once a language has a hold on a particular industry (Ex: TCL in semiconductors), it takes a while for something else to replace it due to all the in-house code written in it. If the vendor only offers a TCL API, then double the time to switch. Assuming TCL isn't doing a great job serving their needs.