Hacker News | The-Toon's comments

Sorry to hijack the thread, but how would one get into managing GPU clusters? Modern GPUs are expensive, so it seems difficult to build a homelab to play around with them. Is learning how to run software on a cluster at the end-user level, plus playing around with VMs, enough experience to enter the field?


> Modern GPUs are expensive, so it seems difficult to build a homelab to play around with them.

Simulate a system with multiple high-end GPUs by setting up a system with one low-end GPU, breaking all the video outputs, and plugging it into a timeswitch that makes it lose power 3 times a week.

Learn about industry norms by deciding it's crucial you have read access to production data, but at the same time that your users are doing ad-hoc experimentation and they can barely meet the code maturity requirements of a test environment.

Fill your storage with training data for a project, then have that project "de-prioritised" so the data isn't being used, but it also can't be deleted. Reorganise the department so it's not even clear whose data it is any more.

Broaden your experience to the entire MLOps lifecycle by memorising the crucial shibboleth: "I don't understand Labelbox's pricing."


This is a little too realistic. You left out the quarreling and politics between users of your system that is short on resources.


(Disclaimer, I'm working for Google) In my opinion this is a great question. I believe you can go two ways:

1) Get your hands on a few physical computers and "old" GPUs (like the Nvidia 1000 series or something like that). Make them part of a K8s cluster and you have an amazing setup to play around with. Try to squeeze as many FLOPS out of your hardware as you can. Bonus points if you also architect around failures and test that by pulling out network cables.

2) With a cloud provider, use preemptible/spot instances with a bunch of GPUs for a few hours at a time. I'm not sure about other clouds, but with GCP you can create a GKE node pool that only uses spot instances; in conjunction with the cluster autoscaler, that makes what I described very easy, and you don't really have to clean up much after you're done messing around for the day. GPUs like the K80, T4, or P4 are relatively cheap and, if you use them for just tens of hours a month, you can get away with a bill of tens of dollars [1].

Either option works fine (IMHO).

Another option I am unsure about because I never tried it is to use something like [2] to multiplex your GPU(s) so that you can pretend you have more GPUs to mess around with. However, if your goal is to learn how to manage/write software for multi-gpus/multi-machine clusters this is somewhat limiting because 1) it doesn't teach you much about data locality/compact placement since transfers between virtual GPUs will be extremely fast (i.e. they are sharing the same VRAM pool after all) and 2) you will still have a single machine (even if you are using multiple virtual K8s nodes).

[1] 3 instances with 4 GPUs each, used for 24h in a month, cost $46 according to: https://cloud.google.com/products/calculator/?hl=en&dl=CiRhY...

[2] https://github.com/NVIDIA/k8s-device-plugin


I don't know about GPU clusters specifically, but when the purpose is to learn clusters you typically build a cluster out of the cheapest nodes you can get, even if the whole cluster's performance is less than that of a single machine with sane specifications. Non-GPU clusters for education are often built from single-board computers (Raspberry Pis or even cheaper). So that would be one way to approach it: find a bunch of really old machines with really old GPUs that are worthless for doing actual work on. Some single-board computers even have PCIe slots, so you can install a GPU into an SBC (often with only 1x performance). You could try nodes with iGPUs if that's suitable for your use case (they use a different memory topology, so it might not be). You could put only one GPU in each node, instead of as many as possible like a real cluster would have; if inter-GPU communication on the same node is an important part of your experiment, you could put just 2 GPUs in each node. You might want to scale down the network to keep a similar performance ratio to a real high-end system, maybe by forcing each port to 100Mbps.


There isn't really a school for this stuff. The way I learned was to go work for a company that was building out GPU compute.

It is a lot more than just software, especially on the high end of things.


Meanwhile in job seeking land:

- obligatory “Senior” in title

- requires 3-5 years of building physical GPU infrastructure


Straight up, these are excuses.

I never finished college. I also never let what was written in job descriptions stop me from applying for a job that I wanted. I'm sure that the denials I have received have all been because of my own lack of performance during the interview.

I believe strongly in the whole "If there is a will, there is a way." If you get denied for one interview and you really want the job, you should try again at a later date. Find out what caused you to fail, fix it, and come back stronger.

I'm guessing that the number of people on the planet who've been hands-on in building large-scale physical GPU infrastructure is in the low thousands. It isn't some huge field. We, as an industry, need people who really want to do this stuff and can learn it quickly.

ProTip: you don't need experience with GPUs. I had zero when I started deploying 150,000 of them. What I had was an innate ability to figure shit out based on my other experiences, and that is what got me hired in the first place. I took ownership over the project and made it happen, no matter what it took. That's what people are looking for. Be a doer, not a talker.


It's an example for beginners, presumably to illustrate the range function, so that's probably why it's done that way.

To continue to be nitpicky with that example, the range function produces an iterable object, which is different from a list.
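
A quick sketch of that distinction in Python 3 (the variable names are only illustrative):

```python
numbers = range(5)

# range() gives back a range object, not a list.
print(type(numbers).__name__)      # range
print(isinstance(numbers, list))   # False

# It is iterable, so a for loop or list() can consume it.
as_list = list(numbers)
print(as_list)                     # [0, 1, 2, 3, 4]

# A range never compares equal to a list, even with the same values.
print(numbers == as_list)          # False
```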


This seems to be non-standard, but I make a serious effort to never teach the wrong way to do something. Primacy is just so strong.

It makes teaching harder. You really have to work to come up with great examples. But it makes learning easier and there's less correcting to do later.


In general, I agree, and I think it’s especially important in text.

A teacher is in a position of authoritative trust. As a student, after I learn an example is flawed, I often wonder if I’m just missing some context because I trust the teacher to have gotten it right. In a setting where communication is already established (e.g., a class room) this can be cleared up quickly with a question. In other settings (e.g., reading a text), it can leave me wondering until I reach a much higher level of competence.


When the course reaches the point that students are ready to learn about enumerate (see my reply above) the course will definitely cover it and emphatically point out that it is the better way.


Take it for what it's worth, but that's exactly what I think is a bad idea. It violates the principle of primacy in learning[1]. It also erodes trust in the lessons. (Is this the real way to do it?)

There's a similar strategy of building things up and then refactoring when they get bad (this ifelse is getting too big. We could use a dict). But the difference is every step along the way is valid or immediately corrected.

[1] https://psychology.wikia.org/wiki/Principles_of_learning#Pri...


What would you do in this case?

The student has never seen a tuple, or iterable unpacking of any form. Would you just show them `for index, word in enumerate(words)` and tell them not to worry about what that means?

Quite shortly after this, I ask them to essentially do `zip_longest`. Given two string variables:

    string1 = "Goodbye"
    string2 = "World"

Expected output:

    G W
    o o
    o r
    d l
    b d
    y  
    e  

Here's what I expect their solution to look like:

    length1 = len(string1)
    length2 = len(string2)

    if length1 > length2:  # one could use max, but I don't expect them to
        length = length1
    else:
        length = length2

    for i in range(length):
        if i < len(string1):
            char1 = string1[i]
        else:
            char1 = ' '

        if i < len(string2):
            char2 = string2[i]
        else:
            char2 = ' '

        print(char1 + ' ' + char2)

That's not something that can nicely be solved with enumerate. Do you think this exercise is bad because they should just use zip_longest instead?
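
For comparison, here is roughly what a `zip_longest` version of the same exercise could look like. This is only a sketch: the helper name `side_by_side` is made up for illustration and is not part of the course.

```python
from itertools import zip_longest


def side_by_side(string1, string2):
    """Pair up characters, padding the shorter string with spaces."""
    # zip_longest keeps going until the longer input is exhausted and
    # substitutes fillvalue for the missing characters of the shorter one.
    return [
        char1 + ' ' + char2
        for char1, char2 in zip_longest(string1, string2, fillvalue=' ')
    ]


lines = side_by_side("Goodbye", "World")
print("\n".join(lines))
```

It prints the same seven lines as the hand-written loop, with the last two rows padded by a space on the right.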


I'd split the problem into two. You're talking about these problems like they're fixed, but they're not. This is your project.

I'd teach index access and number generation separately, with small incrementing variations. Then I'd show iteration, then enumeration (maybe after tuple unpacking).

When you combine two concepts it doesn't create just a combination. It creates one or more new concepts. Those new concepts have their own idioms and they should be taught properly.

Combining looping over numbers and index access creates two new concepts: looping over items and looping over items with their index. Both of those things have their own idiom in Python and your solution shows neither.
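
To make that concrete, here is a small sketch of the index-based loop next to the two idioms. The `words` list is just example data, not something from the course.

```python
words = ["alpha", "beta", "gamma"]  # example data only

# Looping over numbers combined with index access.
by_index = []
for i in range(len(words)):
    by_index.append((i, words[i]))

# Idiom 1: looping over items directly.
items = []
for word in words:
    items.append(word)

# Idiom 2: looping over items together with their index.
with_index = []
for index, word in enumerate(words):
    with_index.append((index, word))

print(by_index == with_index)  # True
```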

I think telling someone not to worry about the details of how something works because you'll get back to it later is way better than showing them the wrong thing and then correcting it. One is a promise kept, the other a promise broken.

In my opinion, nailing this kind of stuff is like half of the value of the project you're doing. You need to lean way in on it and get it right.

My take is right now you're thinking backwards from the position of someone who knows how to code. You need to think forward like someone who doesn't, at least more often. A student isn't going to be motivated to learn `a, b = c` by learning `return a, b` first because they won't know that second concept exists!

In short, if your students aren't ready to use enumerate or zip_longest, don't hand them problems that call for enumerate or zip_longest.


> I think telling someone not to worry about the details of how something works because you'll get back to it later is way better than showing them the wrong thing and then correcting it. One is a promise kept, the other a promise broken.

I'm not sure I'd say teaching someone to do something by hand for which they could use an existing function is "wrong". There's huge pedagogical value in knowing how powerful tools you didn't make work. Going bottom-up is a very effective way to do that (just ask the lisp folks), and a culmination of "nice job, you've implemented something so useful that it mirrors what the language designers/library authors did as well, here's how to use their version to save time in the future" is far from a broken promise.


> I'm not sure I'd say teaching someone to do something by hand for which they could use an existing function is "wrong".

I didn't say it was the wrong way to teach. I said the code was wrong. Like, if I removed myself from teaching and I just saw that in a PR, I'd definitely suggest they use enumerate instead.

I'm a huge fan of reimplementing built-in functions as a way to learn. I'm learning Clojure right now (like I stopped to write this comment) and I do it all the time. But, I know I'm doing it.

The other thing that's fine is building up to the abstraction. "Let's get the index, now the item. Okay, there's a better way to do this". But you have to do it immediately.

I'm just not a fan of showing someone something, letting it sink in, and then having to go back and correct it.

And while I hope my arguments stand on their own, and I certainly could be wrong, I'm at least not speaking out of complete inexperience. I've spent a decent amount of time teaching people to code, juggle, work at rope courses, and fly airplanes.


Do you get reply notifications?


No, I just check my comments.


Which is why it specifically says "is similar to the list".

But you may be surprised how many list-like operations `range` supports. Subscripting, `len`, `in`, `.count`, `.index`. It might as well be a tuple.

And if you want to nitpick more, range isn't a function.
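
A short sketch of those operations in Python 3; the particular range is arbitrary:

```python
r = range(0, 20, 2)   # 0, 2, 4, ..., 18

print(r[3])           # 6
print(r[-1])          # 18
print(len(r))         # 10
print(8 in r)         # True
print(7 in r)         # False
print(r.count(4))     # 1
print(r.index(10))    # 5
print(r[2:5])         # range(4, 10, 2)

# range is a class (a type), and calling it constructs a range object.
print(isinstance(range, type))   # True
```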

