
Fundamentally, the "right place" here differs between Windows and Linux. On Windows, command line arguments really are unicode (UTF-16 actually). On Linux, they're just bytes. In Python 2, on Linux you got the bytes as-is; but on Windows you got the command line arguments converted to the system codepage. Note that the Windows system codepage generally isn't a Unicode encoding, so there was unavoidable data loss even before the first line of your code started running (AFAIK neither sys.argv nor sys.environ had a unicode-supporting alternative in Python 2). However, on Linux, Python 2 was just fine.

Now with Python 3 it's the other way around -- Windows is fine but Linux has issues. However, the problems for Linux are less severe: often you can get away with assuming that everything is UTF-8. And you can still work with bytes if you absolutely need to.
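
For example (rough sketch, assuming a POSIX system), os.fsencode lets you recover the original argument bytes in Python 3:

    import os
    import sys

    # Python 3 decodes argv using the filesystem encoding with the
    # surrogateescape error handler, so undecodable bytes survive as
    # lone surrogates; os.fsencode() round-trips them back to raw bytes.
    raw_args = [os.fsencode(arg) for arg in sys.argv[1:]]
    for raw in raw_args:
        print(repr(raw))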



> On Windows, command line arguments really are unicode (UTF-16 actually)

No, they're not. Windows can't magically send your program Unicode. It sends your program strings of bytes, which your program interprets as Unicode with the UTF-16 encoding. The actual raw data Windows sends your program is still strings of bytes.

> you can still work with bytes if you absolutely need to

In your own code, yes, you can, but you can't tell the Standard Library to treat sys.std{in|out|err} as bytes, or fix their encodings (at least, not until Python 3.7, when you can do the latter), when it incorrectly detects the encoding of whatever Unicode the system is sending/receiving to/from them.

> AFAIK neither sys.argv nor sys.environ had a unicode-supporting alternative in Python 2

That's because none was needed. You got strings of bytes and you could decode them to whatever you wanted, if you knew the encoding and wanted to work with them as Unicode. That's exactly what a language/library should do when it can't rely on a particular encoding or on detecting the encoding--work with the lowest common denominator, which is strings of bytes.
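
E.g. in Python 2 (just a sketch; the utf-8 here is an assumption about the caller's locale, not something Python knew for you):

    # Python 2: sys.argv items are byte strings; decoding is explicit
    # and the encoding is whatever you decide it should be.
    import sys

    encoding = 'utf-8'  # assumed; could come from the locale module instead
    unicode_args = [arg.decode(encoding) for arg in sys.argv[1:]]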


> In your own code, yes, you can, but you can't tell the Standard Library to treat sys.std{in|out|err} as bytes,

Actually you can: use sys.std{in,out,err}.buffer, which is binary. [1]
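
For example (minimal sketch), writing raw bytes straight past the text layer:

    import sys

    # .buffer is the underlying binary stream; no encoding step happens here.
    sys.stdout.buffer.write(b'raw bytes, no text-layer encoding\n')
    sys.stdout.buffer.flush()

    # Reading works the same way: data = sys.stdin.buffer.read()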

> or fix their encodings (at least, not until Python 3.7, when you can do the latter), when it incorrectly detects the encoding of whatever Unicode the system is sending/receiving to/from them.

I'm assuming you're talking about the scenario where LANG/LC_* is not defined; in that case Python assumed the US-ASCII encoding. I think in 3.7 they changed the default to UTF-8.
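
3.7 also added reconfigure(), so you can change the streams' encoding after startup instead of rewrapping them -- roughly:

    import sys

    # Python 3.7+: switch the existing text streams to UTF-8 in place.
    sys.stdout.reconfigure(encoding='utf-8')
    sys.stderr.reconfigure(encoding='utf-8')

(Before 3.7 the usual override was the PYTHONIOENCODING environment variable.)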

[1] https://docs.python.org/3/library/sys.html#sys.stdin


> Actually you can: use sys.std{in,out,err}.buffer,

That's fine for your own code, as I said. It doesn't help at all for code in standard library modules that uses the standard streams, which is what I was referring to.

> I think in 3.7 they changed the default to UTF-8

Yes, they did, which is certainly a saner default in today's world than ASCII, but it still doesn't cover all use cases. It would have been better to not have a default at all and make application programs explicitly do encoding/decoding wherever it made the most sense for the application.


> That's fine for your own code, as I said. It doesn't help at all for code in standard library modules that uses the standard streams, which is what I was referring to.

I'm not sure what code you're talking about. All the functions I can think of expect you to provide streams explicitly.

> Yes, they did, which is certainly a saner default in today's world than ASCII, but it still doesn't cover all use cases. It would have been better to not have a default at all and make application programs explicitly do encoding/decoding wherever it made the most sense for the application.

I disagree; it would be far more confusing if stdin/stdout/stderr were sometimes text and sometimes binary. If you meant that they should always be binary, that's also suboptimal. In most use cases a user works with text.


> I'm not sure what code you're talking about.

All the places in the standard library that explicitly write output or error messages to sys.stdout or sys.stderr. (There are far fewer places that explicitly take input from sys.stdin, so there's that, I suppose.)

> it would be far more confusing if stdin/stdout/stderr were sometimes text and sometimes binary

I am not suggesting that. They should always be binary, i.e., streams of bytes. That's the lowest common denominator for all use cases, so that's what a language runtime and a library should be doing.

> If you meant that they should always be binary, that's also suboptimal. In most use cases a user works with text.

Users who work with text can easily wrap binary streams in a TextIOWrapper (or an appropriate alternative) if the basic streams are always binary.
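
Something like this (a sketch; the utf-8 is just whatever encoding the application decides it wants):

    import io
    import sys

    # Wrap the binary stream in a text layer with an encoding the
    # application chose, rather than whatever Python guessed at startup.
    out = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8',
                           line_buffering=True)
    out.write('text, encoded the way the application decided\n')
    out.flush()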

Users who work with binary but can't control library code that insists on treating things as text are SOL if the basic streams are text; the buffer attributes let user code use the binary version, but only in code the user explicitly controls.


Linux had issues whenever the LANG/LC_* variables weren't defined: Python assumed US-ASCII. I believe that was changed recently to just assume UTF-8.



