Yep. My wife just started as a professor (humanities) and she entered on an H-1B visa last week, as a green card takes years to obtain. I have been offered a teaching job at the same institution as a partner hire, and they have filed an H-1B petition for me.
Unless they clarify that education is exempt from these rules, my wife will surely have to quit her new job. She is supposed to go on fieldwork later this year and she won’t be able to re-enter. Not to mention I can kiss my lecturer offer goodbye. This is a completely absurd situation.
Execution with masking is pretty much how branching works on GPUs. What’s more relevant, however, is that conditional statements add overhead in terms of additional instructions and execution state management. Eliminating small branches using conditional moves or manual masking can be a performance win.
No, branching works on the GPU just like everywhere else: the instruction pointer gets changed to another value. But you cannot branch on a vector value unless every element of the vector is the same, which is why branching on vector values is a bad idea. However, if your vectorized computation is naturally divergent, there is no way around it; conditional moves are not going to help, as they also evaluate both sides of the conditional. The best you can do is arrange the computation so that you only add work instead of alternating between paths, i.e. you do if () ... instead of if () ... else ... — then you only pay for the longest path instead of the sum of both paths.
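To make this concrete, here is a minimal CPU-side sketch of the same idea using Swift's standard SIMD types (cheap, heavy and the input values are made up for illustration); on a GPU the hardware applies the per-lane mask for you when a vectorized branch diverges:

    // Per-lane select: both paths are evaluated, a mask picks the result.
    func cheap(_ v: SIMD4<Float>) -> SIMD4<Float> { v * 2 }
    func heavy(_ v: SIMD4<Float>) -> SIMD4<Float> { v * v + SIMD4<Float>(repeating: 1) }

    let x = SIMD4<Float>(-1, 0.5, 2, -3)
    let mask = x .> SIMD4<Float>.zero            // per-lane condition
    let selected = cheap(x).replacing(with: heavy(x), where: mask)

    // The "if () ... without else" arrangement: keep one shared path and
    // only *add* the extra work where the mask is set, so the total cost
    // is the longest path rather than the sum of both paths.
    let base = cheap(x)
    let extra = (heavy(x) - base).replacing(with: .zero, where: .!mask)
    let added = base + extra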
This reminds me that people who believe the GPU is not capable of branching do stupid things like writing multiple shaders instead of branching on a shader constant. E.g. you have some special mode in a game, say x-ray vision, and instead of adding a branch to your materials, you write an alternative version of every shader.
Quick note: I looked at the bindless proposal linked from the blog post and their description of Metal is quite outdated. MTLArgumentEncoder has been deprecated for a while now, the layout is a transparent C struct that you populate at will with GPU addresses. There are still descriptors for textures and samplers, but these are hidden from the user (the API will maintain internal tables). It's a very convenient model and probably the simplest and most flexible of all current APIs. I'd love to see something similar for WebGPU.
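For reference, here is roughly what that looks like on the host side with Metal 3 in Swift. This is a hedged sketch, not a canonical recipe: the struct layout and the names (MyArguments, positions, colorMap) are made up, and the Swift struct has to match whatever struct the shader declares.

    import Metal

    // The "argument buffer" is plain memory: no MTLArgumentEncoder, just a
    // struct filled with GPU addresses and opaque resource handles.
    struct MyArguments {
        var positions: UInt64           // device address of a buffer
        var colorMap: MTLResourceID     // opaque handle for a texture
    }

    func makeArgumentBuffer(device: MTLDevice,
                            positions: MTLBuffer,
                            colorMap: MTLTexture) -> MTLBuffer? {
        guard let argBuffer = device.makeBuffer(length: MemoryLayout<MyArguments>.stride,
                                                options: .storageModeShared) else { return nil }
        var args = MyArguments(positions: positions.gpuAddress,
                               colorMap: colorMap.gpuResourceID)
        argBuffer.contents().copyMemory(from: &args,
                                        byteCount: MemoryLayout<MyArguments>.stride)
        return argBuffer
    }

(You still have to make the referenced resources resident, e.g. via useResource(_:usage:) on the encoder, but the descriptor side really is just a memcpy.)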
The M3 GPU uses a new instruction encoding, among other things. It also has a new memory partitioning scheme (aka Dynamic Caching), which probably requires a bunch of changes to both the driver interface and the shader compiler. I hope the Asahi team gets around to publishing the details of the M3 soon; I have been curious about this for a while.
Are you talking about Vulkan or about geometry shaders? The latter is simple: geometry shaders are a badly designed feature that sucks on modern GPUs. Apple designed Metal to only support things that are actually fast. Their solution for geometry generation is mesh shaders, which is a modern and scalable feature that actually works.
If you are talking about Vulkan, that is much more complicated. My guess is that they want to maintain their independence as a hardware and software innovator. That is hard to do if you are locked into a design-by-committee API. Apple has had some bad experiences with these things in the past (e.g. they donated OpenCL to Khronos only to see it sabotaged by Nvidia). Also, Apple wanted a lean and easy-to-learn GPU API for their platform, and Vulkan is neither.
While their stance can be annoying to both developers and users, I think it can be understood at some level. My feelings about Vulkan are mixed at best. I don't think it is a very good API, and I think it makes too many unnecessary compromises. Compare, for example, VK_EXT_descriptor_buffer and Apple's argument buffers. Vulkan's approach is extremely convoluted: you are required to query descriptor sizes at runtime and perform manual offset computation. Apple's implementation is just 64-bit handles/pointers and memcpy: extremely lean and immediately understandable to anyone with basic C experience. I understand that Vulkan needs to support different types of hardware where these details can differ. However, I do not understand why they have to penalize developer experience in order to support some crazy hardware with 256-byte data descriptors.
I’m not a game programmer, so I just sort of watch all this with a slightly interested eye.
I honestly wonder how much the rallying around Vulkan is just that it is a) newer than OpenGL and b) not DirectX.
I understand it’s good to have a graphics API that isn’t owned by one company and is cross platform. But I get the impression that that’s kind of Vulkan’s main strong suit. That technically there’s a lot of stuff people aren’t thrilled with, but it has points A and B above so that makes it their preference.
(This is only in regard to how it’s talked about; I’m not suggesting people stop using it or switch off it to something else.)
Nothing stops them from providing both their own API and Vulkan. So your arguments only explain why they might want their own API; they don't explain why they completely deny Vulkan support alongside it. There is no good reason for that, and the apparent reason is lock-in.
Apple not supporting Vulkan is a business decision. They wanted a lean and easy to learn API that they can quickly iterate upon, and they want you to optimize for their hardware. Vulkan does not cater to either of these goals.
Interestingly, Apple was on the list of the initial Vulkan backers, but they pulled out at some point before the first version was released. I suppose they saw the API moving in a direction they were not interested in. So far, their strategy has been a mixed bag: they failed to attract substantial developer interest, but at the same time they delivered what I consider to be the best general-purpose GPU API around.
Regarding programmable tessellation, Apple's approach is mesh shaders. As far as I am aware, they are the only platform that offers standard mesh shader functionality across all devices.
Is this really a new approach? On a cursory look this seems like implicit error propagation with checked exceptions. I am also curious about the author's presentation of the topic. To me, an important feature of an error handling design is whether fallible contexts are marked (e.g., with a try statement) or not.
(Author) Thanks! Checked exceptions have indeed been brought to my attention in this thread, and in theory they are very similar. However, as I understand it, changing the type of a checked exception causes refactoring hell, so they are painful to use in practice.
The other aspect is ergonomics. A major problem I have with try..catch blocks that the article doesn't touch on is that they create nested blocks with their own scopes. This either results in nesting hell, or the variables that are assigned inside the try block must also be declared outside of it. It makes the code simply unpleasant to write.
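To illustrate with a made-up loadConfig example (the types and names are hypothetical): the value produced inside the block is invisible after it, so it either has to be declared outside as an optional or everything using it has to nest further.

    struct Config { var name: String }
    enum ConfigError: Error { case notFound }

    func loadConfig(path: String) throws -> Config {
        guard path == "app.json" else { throw ConfigError.notFound }
        return Config(name: path)
    }

    var config: Config? = nil      // must exist outside the block
    do {
        config = try loadConfig(path: "app.json")
    } catch {
        print("failed to load config: \(error)")
    }
    // From here on, `config` is an optional even though the happy path
    // always assigns it.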
Have you looked at the Swift error model? I really like their design. They use a dedicated try keyword to mark call sites that can fail. Note that this try is not the same as try...catch; Swift has a separate do...catch block construct for catching errors. This design makes sure that you always know where errors can occur when reading the program code, but avoids all the ergonomics issues you mention.
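A small self-contained sketch of what that looks like (parseNumber and sum are made-up names): try marks each fallible call, propagation needs no block at all, and do...catch only appears where you actually handle the error.

    enum ParseError: Error { case notANumber(String) }

    func parseNumber(_ s: String) throws -> Int {
        guard let n = Int(s) else { throw ParseError.notANumber(s) }
        return n
    }

    // Propagation: the function is marked `throws`, each fallible call is
    // marked `try`, and there is no nesting or extra scope.
    func sum(_ lines: [String]) throws -> Int {
        var total = 0
        for line in lines {
            total += try parseNumber(line)
        }
        return total
    }

    // Handling: a do...catch block only where you actually want to catch.
    do {
        print(try sum(["1", "2", "3"]))
    } catch {
        print("invalid input: \(error)")
    }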
Your model seems to be very similar to the traditional implicit model used by languages such as C++, except that you allow switching between implicit and explicit error propagation. I am not sure how useful this is in practice, as it creates inconsistency.
As I understand it, Swift's error handling falls squarely into the explicit camp. That's all good and I love it.
My problem is that it seems it's not for everyone at every stage of their journey. When I teach software development to a musician, it's just in the way. I needed a language that works for professional developers doing low-level work, but also for citizen developers doing their first round of hacking.
Originally, I thought implicit error handling was not for me: I would add support for those who hate explicit error handling and not use it myself. I'm at a point in my journey where I have started questioning the benefits of explicit error checks (i.e. Swift's try keyword or ? in Boomla) when all I do is propagate errors.
I'm thinking that maybe it would be interesting to start with implicit error handling, and turn on explicitness where you need it. Existing error handling approaches don't let you do that. I see the big idea in this approach that you can turn it on incrementally, with no far reaching effects. How you handle errors is a local implementation detail of the function.
Of course, we will see; I might end up sticking to the explicit model, but for now I enjoy it. Luckily, migration won't be an issue: I can just return to using explicit error handling if I change my mind, and current code will keep working as is.
While I understand the argument, it would also be good to see some empirical evidence. So far, all x86 builds need more power to reach the same performance level as ARM. Of course, Apple is still the outlier.
Most NPUs are not directly end-user programmable. The vendor usually provides a custom SDK that allows you to run models created with popular frameworks on their NPUs. Apple is a good example since they have been doing it for a while. They provide a framework called CoreML and tools for converting ML models from frameworks such as PyTorch into a proprietary format that CoreML can work with.
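As a rough sketch of what the consumption side looks like in Swift (the model file "Classifier.mlmodelc" and the feature name "input" are placeholders; the actual conversion from PyTorch happens ahead of time with coremltools):

    import CoreML
    import Foundation

    do {
        let config = MLModelConfiguration()
        config.computeUnits = .all   // let CoreML pick between CPU, GPU and the Neural Engine

        // Load the compiled, converted model and run a single prediction.
        let model = try MLModel(contentsOf: URL(fileURLWithPath: "Classifier.mlmodelc"),
                                configuration: config)
        let input = try MLDictionaryFeatureProvider(dictionary: [
            "input": MLFeatureValue(double: 0.5)
        ])
        let output = try model.prediction(from: input)
        print(output.featureNames)
    } catch {
        print("CoreML failed: \(error)")
    }

Whether the NPU actually runs the model is decided by the framework; the developer only expresses an intent via computeUnits, which is exactly the kind of indirection described above.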
The main reason for this lack of direct programmability is that NPUs are fast-evolving, optimized technology. Hiding the low-level interface allows the designer to change the hardware implementation without affecting end-user software. For example, some NPUs can only work with specific data formats or layer types. Early NPUs were very simple convolution engines based on DSPs; newer designs also have built-in support for common activation functions, normalization, and quantization.
Maybe one day these things will mature enough to have a standard programming interface, but I am skeptical about this becoming a reality any time soon. Some companies (like Tenstorrent) are specifically working on open architectures that will be directly programmable; I'm not sure whether their approach translates to embedded NPUs, though. What would be nice is an open graph-based API and a model format for specifying and encoding ML models.