
Pandas isn't the problem; the problem is assuming that "$LIBRARY releases the GIL, so things will be fast!". It's a penny-wise, pound-foolish approach to performance. Someone will write a function assuming the user is only going to pass in a list of ints, someone else will extend that function to take a list of Tuple[str, int] or something, and all of a sudden your program has a difficult-to-debug performance regression.

In general, the "just rewrite the slow parts in C!" motto is terrible advice because it's unlikely to actually make your code appreciably faster, and even when it does, the speedup is very likely to be defeated unexpectedly as soon as requirements change. Using FFI to make things faster can work, but only if you've really considered your problem and you're quite sure you can safely predict relevant changes to requirements.



> Someone will write a function assuming the user is only going to pass in a list of ints and someone else will extend that function to take a list of Tuple[str, int]

There are plenty of pitfalls in leaky abstractions. Establishing that fast numeric calculations only work with specific numeric types seems to help.

One thing you seem to be encountering, and that I've seen a few times, is that people don't realize NumPy and core Python are almost orthogonal: the best practices for each are nearly opposite. I try to make it clear when I'm switching from one to the other by explaining the performance optimization (broadly) in comments.

Regardless, any function that receives a ``list`` of ints will need to convert to an ndarray if it wants NumPy speed. If the function interface is modified later, I think it's fair to expect the editor to understand why.
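For example (a hypothetical sketch; `total_fast` is invented here, not from the thread), the conversion itself is a real cost that's easy to forget:

```python
import numpy as np

def total_fast(values):
    # np.asarray is O(n) and has to unbox every Python int; for small or
    # frequently-called inputs, this conversion can dominate the runtime
    # and erase the vectorization win.
    arr = np.asarray(values)
    return int(arr.sum())

print(total_fast([1, 2, 3]))  # 6
```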


> There are plenty of pitfalls in leaky abstractions

Sure, but this is a _massive_ pitfall. It's an optimization that can trivially make your code slower than the naive Python implementation, entirely because of a leaky abstraction.
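Concretely (a made-up illustration, not the GP's actual code): NumPy only stays fast while the data coerces to a native numeric dtype. Hand it the Tuple[str, int] version and it silently falls back to a string or object dtype:

```python
import numpy as np

ints = [1, 2, 3]
pairs = [("k", 1), ("k", 2)]

a = np.asarray(ints)
b = np.asarray(pairs)

print(a.dtype.kind)  # 'i': a native int array, vectorized ops are genuinely fast
print(b.dtype.kind)  # 'U' (unicode) in practice: NumPy coerces the ints to
                     # strings to match, so "fast" numeric ops are slow or wrong
```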

> Regardless, any function that receives a ``list`` of ints will need to convert to an ndarray if it wants NumPy speed. If the function interface is modified later, I think it's fair to expect the editor to understand why.

Yeah, that was a toy example. In practice, the scenario was similar, except the function called into a third-party library that used NumPy under the hood. We introduced a ton of complexity to use this third-party library, on the grounds that "it will make things fast", instead of the naive list implementation, and the very next sprint we had to change it in a way that made it 10x slower than the naive Python implementation.

That's the starkest example, but there have been others and there would have been many more if we didn't have the stark example to point to.

The current slogan is "just use Python; you can always make things fast with NumPy/native code!", but it should be something like "use Python if you understand how NumPy (or whatever native library you're using) makes things fast deeply enough to count on that invariant holding even as your requirements change".
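One cheap way to make that invariant explicit (a sketch with an invented `fast_sum`, not anything from the thread) is to assert the dtype the optimization depends on, so a requirements change fails loudly instead of silently going 10x slower:

```python
import numpy as np

def fast_sum(values):
    arr = np.asarray(values)
    # The speedup only exists for native numeric dtypes (int/uint/float).
    # If a later caller passes tuples or strings, fail loudly rather than
    # silently taking the object/string slow path.
    if arr.dtype.kind not in "iuf":
        raise TypeError(f"fast_sum needs numeric data, got dtype {arr.dtype}")
    return float(arr.sum())

print(fast_sum([1, 2, 3]))  # 6.0
```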


I have mixed feelings about your conclusion. On one hand I don't want to discourage newbies from using Python. On the other, I enjoy that my expertise is valuable.

It seems reasonable that different parts of the code are appropriate for modification by engineers of differing skills.



