
Fundamentally, the "right place" here differs between Windows and Linux. On Windows, command line arguments really are unicode (UTF-16 actually). On Linux, they're just bytes. In Python 2, on Linux you got the bytes as-is; but on Windows you got the command line arguments converted to the system codepage. Note that the Windows system codepage generally isn't a Unicode encoding, so there was unavoidable data loss even before the first line of your code started running (AFAIK neither sys.argv nor sys.environ had a unicode-supporting alternative in Python 2). However, on Linux, Python 2 was just fine.

Now with Python 3 it's the other way around -- Windows is fine but Linux has issues. However, the problems for Linux are less severe: often you can get away with assuming that everything is UTF-8. And you can still work with bytes if you absolutely need to.
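
For example (rough sketch, assuming a POSIX system), os.fsencode lets you recover the original argument bytes in Python 3:

    import os
    import sys

    # Python 3 decodes argv using the filesystem encoding with the
    # surrogateescape error handler, so undecodable bytes survive as
    # lone surrogates; os.fsencode() round-trips them back to raw bytes.
    raw_args = [os.fsencode(arg) for arg in sys.argv[1:]]
    for raw in raw_args:
        print(repr(raw))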



> On Windows, command line arguments really are unicode (UTF-16 actually)

No, they're not. Windows can't magically send your program Unicode. It sends your program strings of bytes, which your program interprets as Unicode with the UTF-16 encoding. The actual raw data Windows sends your program is still strings of bytes.

> you can still work with bytes if you absolutely need to

In your own code, yes, you can, but you can't tell the Standard Library to treat sys.std{in|out|err} as bytes, or fix their encodings (at least, not until Python 3.7, when you can do the latter), when it incorrectly detects the encoding of whatever Unicode the system is sending/receiving to/from them.

> AFAIK neither sys.argv nor sys.environ had a unicode-supporting alternative in Python 2

That's because none was needed. You got strings of bytes and you could decode them to whatever you wanted, if you knew the encoding and wanted to work with them as Unicode. That's exactly what a language/library should do when it can't rely on a particular encoding or on detecting the encoding--work with the lowest common denominator, which is strings of bytes.
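
E.g. in Python 2 (just a sketch; the utf-8 here is an assumption about the caller's locale, not something Python knew for you):

    # Python 2: sys.argv items are byte strings; decoding is explicit
    # and the encoding is whatever you decide it should be.
    import sys

    encoding = 'utf-8'  # assumed; could come from the locale module instead
    unicode_args = [arg.decode(encoding) for arg in sys.argv[1:]]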


> In your own code, yes, you can, but you can't tell the Standard Library to treat sys.std{in|out|err} as bytes,

Actually you can: use sys.std{in,out,err}.buffer, which is binary. [1]
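
For example (minimal sketch), writing raw bytes straight past the text layer:

    import sys

    # .buffer is the underlying binary stream; no encoding step happens here.
    sys.stdout.buffer.write(b'raw bytes, no text-layer encoding\n')
    sys.stdout.buffer.flush()

    # Reading works the same way: data = sys.stdin.buffer.read()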

> or fix their encodings (at least, not until Python 3.7, when you can do the latter), when it incorrectly detects the encoding of whatever Unicode the system is sending/receiving to/from them.

I'm assuming you're talking about the scenario where LANG/LC_* is not defined; in that case Python assumed the US-ASCII encoding. I think in 3.7 they changed the default to UTF-8.
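
3.7 also added reconfigure(), so you can change the streams' encoding after startup instead of rewrapping them -- roughly:

    import sys

    # Python 3.7+: switch the existing text streams to UTF-8 in place.
    sys.stdout.reconfigure(encoding='utf-8')
    sys.stderr.reconfigure(encoding='utf-8')

(Before 3.7 the usual override was the PYTHONIOENCODING environment variable.)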

[1] https://docs.python.org/3/library/sys.html#sys.stdin


> Actually you can: use sys.std{in,out,err}.buffer,

That's fine for your own code, as I said. It doesn't help at all for code in standard library modules that uses the standard streams, which is what I was referring to.

> I think in 3.7 they changed the default to UTF-8

Yes, they did, which is certainly a saner default in today's world than ASCII, but it still doesn't cover all use cases. It would have been better to not have a default at all and make application programs explicitly do encoding/decoding wherever it made the most sense for the application.


> That's fine for your own code, as I said. It doesn't help at all for code in standard library modules that uses the standard streams, which is what I was referring to.

I'm not sure what code you're talking about. All the functions I can think of expect you to provide streams explicitly.

> Yes, they did, which is certainly a saner default in today's world than ASCII, but it still doesn't cover all use cases. It would have been better to not have a default at all and make application programs explicitly do encoding/decoding wherever it made the most sense for the application.

I disagree; it would be far more confusing if stdin/stdout/stderr were sometimes text and sometimes binary. If you meant that they should always be binary, that's also suboptimal. In most use cases a user works with text.


> I'm not sure what code you're talking about.

All the places in the standard library that explicitly write output or error messages to sys.stdout or sys.stderr. (There are far fewer places that explicitly take input from sys.stdin, so there's that, I suppose.)

> it would be far more confusing if stdin/stdout/stderr were sometimes text and sometimes binary

I am not suggesting that. They should always be binary, i.e., streams of bytes. That's the lowest common denominator for all use cases, so that's what a language runtime and a library should be doing.

> If you meant that they should always be binary, that's also suboptimal. In most use cases a user works with text.

Users who work with text can easily wrap binary streams in a TextIOWrapper (or an appropriate alternative) if the basic streams are always binary.
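
Something like this (a sketch; the utf-8 is just whatever encoding the application decides it wants):

    import io
    import sys

    # Wrap the binary stream in a text layer with an encoding the
    # application chose, rather than whatever Python guessed at startup.
    out = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8',
                           line_buffering=True)
    out.write('text, encoded the way the application decided\n')
    out.flush()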

Users who work with binary but can't control library code that insists on treating things as text are SOL if the basic streams are text; the buffer attributes let user code use the binary version, but only in code the user explicitly controls.


Linux had issues whenever the LANG/LC_* variables weren't defined: Python assumed US-ASCII. I believe that was changed recently to just assume UTF-8.



