Interesting, I'd love to see some more opinions on this. I find the reflexive "MD5 is broken, don't use it" advice harmful. Not all applications are security related or need the absolute highest speed. I've actually had use cases in multiple hobby projects, though of course that doesn't mean there are many.
For example, in a distributed event-based LAN chat, I used MD5 for an "integrity chain": every new event id is the hash of the old event id plus some random bytes. This way you can easily find the last matching event two systems have in common. A random id alone isn't enough when two instances integrate an event from a third system while one of the two has added a new event just before that.
No security is needed, and speed doesn't matter much since it's not designed for high throughput. MD5 seems like a very good choice because it's easy to work with and can be verified on every system.
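For concreteness, here is a minimal sketch of the chaining I mean (simplified, not the actual project code; the names and the all-zero genesis id are just illustrative):

    package main

    import (
        "crypto/md5"
        "crypto/rand"
        "fmt"
    )

    // nextEventID derives a new event id from the previous id plus fresh
    // random bytes. Two instances that share the same history agree on the
    // id of the last common event.
    func nextEventID(prevID [md5.Size]byte) [md5.Size]byte {
        nonce := make([]byte, 16)
        rand.Read(nonce) // error handling omitted in this sketch
        return md5.Sum(append(prevID[:], nonce...))
    }

    func main() {
        var id [md5.Size]byte // genesis id, all zeroes here
        for i := 0; i < 3; i++ {
            id = nextEventID(id)
            fmt.Printf("event %d id: %x\n", i, id[:])
        }
    }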
As far as I know (and that information is probably way out of date), such an MD5 integrity chain could be compromised: someone could switch out some important bytes (say, flip the byte that toggles "I do NOT want to buy this very expensive washing machine") while keeping the hash value intact. So using MD5 as a signature check on documents like invoices going through unsafe channels is not safe.
But this is a security case that requires a hostile actor. If the problem is just checking for data integrity, or in this case data identity, without any danger of manipulation, MD5 should perform fine. I don't see a problem with your use case. I am no expert here and there are probably more optimal hashes, but MD5 has the advantage of being widely implemented in all kinds of systems.
Because understanding the intrinsic weaknesses of hashes isn't trivial, many just recommend "MD5 is broken, don't use it". I think this is just to be on the safe side. Many applications would probably be fine, but because erring on the side of caution is safer, people sometimes talk as if MD5 were the worst hash function ever conceived.
You are using the security argument again. It is not used in an adversarial context. You are correct that this is not secure and that messages can be tampered with. But that is not the application. The threat model is that everything happens in a private context without adversaries: the communication is end-to-end encrypted, and every participant of the chat has total control and is allowed to change everything in the chat, even messages from other users. So there is no point in protecting against adversaries that have access to the secure channel, because they are already allowed to do anything; even if the hash were cryptographically secure, they could still change everything. The integrity is only for synchronization, so that every participant can easily verify the state of the event history up to a specific point.
I think they use master for releases only. The development branch is actively worked on and more than 100 commits ahead of master, which means the project is totally active. The last full release in March 2024 is totally fine; people can always build from the develop branch.
At [company x] someone wrote a starter guide that tells developers to create a "production", "staging" and "dev" branch for any new repo. We have tons of repositories that follow this pattern.
For many of them, each branch has taken on a life of its own and could be considered its own completely different codebase. It's a nightmare to manage and it confuses developers on a regular basis.
Don't do this.
If you want to deploy different versions of your software in different environments, use SEMVER and git tags. Don't create 1 branch per environment...
I have since edited that starter guide and crossed out this recommendation.
It works fine if you review PRs and only allow STG->PRD promotions. It breaks down when people start making separate builds for each env.
Treat env as config; then you just have to manage a config folder in that repo.
I concur, it works fine as long as devs follow the procedure. I also prefer to enforce linear history so that git bisect works properly; but this requires devs to understand how to use --ff-only, and, if you're using GitHub, a GitHub Action to fast forward, since GitHub doesn't natively support fast-forward merges (one of GitHub's many sins).
But then I also find I need to train devs on how git actually works and how to use it properly, because I find that only about 10% of devs actually understand git. It works out best for everyone once all the devs understand git, so generally most devs appreciate it when someone is willing to teach them the ins and outs (though not all of them appreciate it before they've learned it properly).
Sorry, but you are just using source control very wrong if you keep 2 parallel environments in the exact same codebase but on different branches. The build itself should know whether to build for one environment or another!
They are the same only sometimes. Devs work on code on a feature / fix / whatever branch; when they've finished dev testing, you do a code review and it gets fast-forwarded onto the dev branch. Then, when it suits, it gets fast-forwarded to staging (for non-dev-team stakeholder testing / infra testing), and when it passes staging testing (if necessary), it gets fast-forwarded onto prod and deployed. So dev will sometimes point to the same commit as staging and sometimes not, and staging will sometimes point to the same commit as prod and sometimes not. It's a funnel, a conveyor belt if you will.
The mobile app release process will disagree with you. There's a gap of around 4 days between what you consider a release and what can be on prod. If you get rejected by review, you need to edit the code. If you want to roll back, you need to edit the code. You can only be linear if you control releases.
> For many of them, each branch has taken of its own life and could be considered its own completely different codebase.
Seems you have bigger process issues to tackle. There's nothing inherently wrong with having per-env branches (if anything, it's made harder by git being so terrible at branching in the general/long-lived case, but the VCS cannot alone be blamed for developers consistently pushing to inadequate branches).
> There's nothing inherently wrong with having per-env branches
There is when you stop thinking in terms of dev, staging and prod, and you realize that you might have thousands of different environments, all named differently.
Do you create a branch for each one of them?
Using the environment name as the branch name couples your repository to the external infrastructure that's running your code. If that infrastructure changes, you need to change your repository. That in itself is a cue that it's a bad idea to use branches this way.
Another issue with this pattern is that you can't know what's deployed in prod at any given time. Deploying the "production" branch might yield a different result 10 minutes from now than it did 25 minutes ago. (Add caching issues to the mix and you have a great recipe for confusing, hard-to-debug problems.)
If you use tags, which are literally meant for that, combined with semver (not strictly a requirement, but a strong recommendation), you decouple your code from the external environment.
You can now point your "dev" environment to "main", point staging to ">= v1.25.0" and "prod" to "v1.25.0", "dev-alice" to "v2.0.0", "dev-john" to "deadb33f".
When you deploy "v1.25.0" in prod, you _know_ it will deploy v1.25.0 and not commit deadb33f that happened to have been merged to the "production" branch 30 seconds ago.
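To illustrate the constraint idea in code (my own toy sketch, using golang.org/x/mod/semver; the tag list and the ">= v1.25.0" rule are just the example from above, not anyone's actual tooling):

    package main

    import (
        "fmt"

        "golang.org/x/mod/semver"
    )

    // pickRelease returns the highest valid tag that satisfies the minimum
    // version, mimicking an environment that follows ">= v1.25.0".
    func pickRelease(tags []string, min string) string {
        best := ""
        for _, t := range tags {
            if !semver.IsValid(t) || semver.Compare(t, min) < 0 {
                continue
            }
            if best == "" || semver.Compare(t, best) > 0 {
                best = t
            }
        }
        return best
    }

    func main() {
        tags := []string{"v1.24.3", "v1.25.0", "v1.25.1", "v2.0.0"}
        fmt.Println("staging (>= v1.25.0) deploys:", pickRelease(tags, "v1.25.0"))
        fmt.Println("prod (pinned) deploys: v1.25.0") // exact tag, no surprises
    }

The point is that each environment resolves to an immutable tag at deploy time, instead of to whatever a mutable branch happens to point at.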
Before git abused the terminology, a branch used to refer to a long-lived/persistent commit lineage, most often implemented as a commit-level flag/attribute. Git branches, OTOH, are pointers to one single commit (with the git UI sometimes converting this information into "that commit, specifically" and sometimes into "all ancestor commits leading to that commit", with more or less success and consistency).
Where it matters (besides fostering good/consistent UX) is when you merge several (topological) branches together: git won't be able to tell whether you merged A into B or B into A. Although the content is identical at the code level, the semantics/intent of the merge is lost. Similarly, once the head has progressed far ahead and your history is riddled with merges, you can't tell from the DAG where the individual features/PRs/series start and end. This makes bisecting very hard: while hunting down a regression, you would rather avoid checking out mid-series commits that might break the build, and instead stick to the series boundaries. You can't do that natively with git. It also makes maintaining concurrent versions unnecessarily difficult, and many projects struggle with that: have you seen, for instance, Django¹ prefixing each and every commit with the (long-lived) branch name? That's what you get with git, while most other VCSes (like Mercurial, my preference) got this right from the start.
Branch is semantics. The true unit is the commit, and the tree is the result of applying a set of commits. Branching is just selecting a set of commits for a tree. There's no wrong or right branch; there is just the matter of generating the wrong patch.
Branches are mutable and regularly point to a new commit. Branching is selecting an active line of development, a set of commits that change over time.
That's why git also offers tags. Tags are immutable.
There are multiple valid branching strategies. Your recommended strategy works well[0] with evergreen deployments, but would fail hard if you intend to support multiple release versions of an app, which happens often in the embedded world with multiple hardware targets, or self-hosted, large enterprise apps that require qualification sign-offs.
0. Semver has many issues that I won't repeat here, mostly stemming from projecting a graph of changes onto a single dimension.
I always thought multiple hardware targets were solved by build flags, keeping the one branch. E.g. in Go you can include/exclude a file based on "build tags".
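Roughly like this (a toy sketch; the hw1/hw2 tag names and file layout are made up, not from any real project):

    // main.go -- shared code, compiled for every target
    package main

    import "fmt"

    func main() {
        fmt.Println("building for:", boardName)
    }

    // board_hw1.go -- only compiled when the hw1 tag is set
    //go:build hw1

    package main

    const boardName = "hw1"

    // board_hw2.go -- only compiled when the hw2 tag is set
    //go:build hw2

    package main

    const boardName = "hw2"

Then `go build -tags hw1 .` and `go build -tags hw2 .` produce the two target binaries from the same branch (building with neither tag fails, since boardName is then undefined).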
> but would fail hard if you intend to support multiple release versions of an app, which happens often in the embedded world with multiple hardware targets, or self-hosted, large enterprise apps that require qualification sign-offs.
I don't have experience in this world, indeed.
But isn't "multiple release versions of an app" just "one application, with multiple different configurations"? The application code is the same (same version), the configuration (which is external to the application) is different.
Your build system takes your application code and the configuration as input, and outputs artifacts for that specific combination of inputs.
> But isn't "multiple release versions of an app" just "one application, with multiple different configurations"?
That would be nice (and evergreen), but that's not always the case. It's common to have different versions of the app released simultaneously, with different features and bugfixes shipped.
Think of Microsoft simultaneously supporting Windows 10 and 11 while still releasing patches for XP: they are all individual OSes that share some common code, but they can't be disentangled at build time.[1]
The customer will be reluctant to upgrade major versions due to licensing costs and the risk of breakage (your code, or their integrations), but still expects bugfixes (and only bugfixes) on their deployed versions, which you're contracted to provide. That doesn't work with the evergreen approach.
I'm not convinced that using build flags to manage which code is shipped is superior to release branches; I fall on the side of release branches because being able to bisect is invaluable.
1. I suppose that as long as the build system is Turing complete, one could hypothetically build Windows XP, 7, 8, 10 and 11 off the same codebase using flags. I would not envy that person.
"At company x, they had a kitchen and a couple meeting rooms. Devs started using the rooms for cooking, and the kitchen for team standups."
Tools are just there; it's people who misuse them. If devs at company x are incapable of understanding that you shouldn't be cooking an omelette in the meeting room, to be honest that's on the devs, not on the separation of concerns that was put there for them.
Probably what's missing there is training to set the expectations, policing of their actions, and a soft reprimand when the typical first-time mistake is made. But if the metaphorical rooms have no indicators, no name tags, and no obvious usage guidelines, because no one bothered to set them up, then yeah, expect board meetings to end up happening in the kitchen.
Reminds me of when someone at the company didn't like the branch name "master", so they unilaterally directed their team to start working on "main", resulting in a massive merge conflict that took over two weeks of dedicated effort to resolve, ugh…
Thanks for that. I loathe replies like "but lang/framework/... can do/will be able to do something similar/does something else which I like better/...". Well, it's not about that. It's about how easy it is to use, how good it is at preventing you from shooting yourself in the foot, sometimes how performant it is, ...
>> The stronger versions, things from List 1 and List 2, are mostly only seen in defense and intelligence
And I don't think that is enough. I agree that for most systems it is easier, and sufficient, to just be connected to the internet. But health, aviation and critical infrastructure in general should try to be offline as much as possible. Many of the issues described with being offline stem from having many third-party dependencies (which typically assume internet access). In general, but for critical infrastructure especially, you want as few third-party dependencies as possible. Sure, it's not as easy as saying "we don't want third-party dependencies" and all is well; you'll have to make a lot of sacrifices. But you also have a lot to gain when dramatically decreasing complexity, not only from a security standpoint. I really do believe there are many cases where it would be better to use a severely limited tech stack (hardware and software) and use a data-diode-like approach where necessary.
One of the key headaches mentioned when going offline is TLS. I agree and I think the solution is to not use TLS at all. Using a VPN inside the air-gapped network should be slightly better. It's still a huge headache and you have to get this right, but being connected to the internet at all times is also a HUGE headache.
There are unfortunately states within the US that do allow civil forfeiture. It's become quite well known that people get pulled over with larger amounts of cash ($10k+), the cops don't like their answer to why they have the money, and they take it. It then becomes the burden of the owner to prove they lawfully own that money.
> Why would the police keep the money? It doesn't get to keep other confiscated money, does it? Wouldn't the state be the "default" recipient?
That really depends on the jurisdiction. Many police departments in the US self-fund via confiscations, sometimes not even as a penalty for breaking the law. In a big city, that's a small part of the budget. In some more rural counties, it can be a surprisingly large amount of the budget.
That sounds the same. It seems surprising to take the car. If I kill someone I don't have to give the state my house, or even my pen. Possessions - unless illegal goods - are not the state's to take.
But you are going backwards though. You have a SHA-256 value and want to find an input with the same result. But this input again has to be a SHA-256 result, and you need to find an input for that as well, right? This would only work if you have the intermediate SHA-256 value that produces the final SHA-256, or if you can find a collision that is itself a SHA-256 value.
Going backwards, as you say, is called a pre-image attack. That's different from a collision attack, which is generating two inputs with the same hash.
Pre-image attacks are MUCH more difficult. How much more? Well, MD5 is considered broken, and yet there isn't one for it.
There is a pre-image attack for MD5, it's just not considered good enough to be practical. Quoting Wikipedia:
> In April 2009, an attack against MD5 was published that breaks MD5's preimage resistance. This attack is only theoretical, with a computational complexity of 2^123.4 for full preimage.
Yes, but that's very little improvement over the generic 2^128 attack, i.e. trying random messages until one happens to match the target hash. The attack quoted by Wikipedia achieves only 4.6 bits of speedup (2^128 / 2^123.4 = 2^4.6, roughly a factor of 24; and note that it's 2^123.4, not 2123.4 :) ). There are attacks of this sort against many cryptographic primitives, including AES, where you can gain just a few bits over the generic / brute-force attacks.
Now, I find a collision string SS (of length 128 bits, like an MD5 hash), where MD5(SS) == Y
Then I find a collision string SSS (this time, length doesn't matter), where MD5(SSS) == SS
Then we have MD5(MD5(SSS)) == Y, which was only twice as hard as finding a single MD5 collision.
Could someone explain what is wrong with my reasoning?
Edit: Oh okay, got it, when we say "MD5 is broken, it's possible to do a collision attack", what we mean is that we can easily find 2 strings S1 and S2 where MD5(S1) == MD5(S2)
But S1 and S2 are found randomly; we don't have a way to find a string S3 where MD5(S3) == Y for a given Y value (that is what we call a pre-image attack, not a collision attack).
Pre-image is approximately "twice as difficult" as a collision. A generic attack on, say, a 256-bit hash function takes 2^128 time to find a collision, but 2^256 time to find a preimage. And like you say, this also shows up in practice: both MD5 and SHA-1 are completely broken when it comes to collision resistance, but both are (probably) still OK for preimage resistance. I would still not recommend either of them for anything.
Where on earth did you get this idea from? What is a "generic attack"? How could you turn a collision somehow into a pre-image attack? How is many orders of magnitude "twice"?
You can find this in any introductory cryptography textbook/course. "Generic attack" is a common term for "just use brute force" [1]. It's called "generic" because it works regardless of the implementation of the primitive. For pre-image resistance, the generic attack just hashes messages until it finds the right image; for collision resistance you get a quadratic speedup via the so-called birthday problem / birthday attack [1][2], where you keep hashing messages and storing the hashes until any two of the messages happen to hash to the same value.
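To make the birthday attack concrete, here is a toy sketch against a deliberately truncated 32-bit hash (the truncation is my own choice so it finishes in a fraction of a second; a full 128-bit MD5 collision would generically need around 2^64 hashes):

    package main

    import (
        "crypto/md5"
        "encoding/binary"
        "fmt"
    )

    // hash32 truncates MD5 to 32 bits, so a collision is expected after
    // roughly 2^16 attempts (the birthday bound), not 2^32.
    func hash32(msg []byte) uint32 {
        sum := md5.Sum(msg)
        return binary.BigEndian.Uint32(sum[:4])
    }

    func main() {
        seen := map[uint32]string{} // hash -> first message that produced it
        for i := 0; ; i++ {
            msg := fmt.Sprintf("message-%d", i)
            h := hash32([]byte(msg))
            if prev, ok := seen[h]; ok {
                fmt.Printf("collision after %d hashes: %q and %q -> %08x\n",
                    i+1, prev, msg, h)
                return
            }
            seen[h] = msg
        }
    }

A pre-image against the same truncated hash (finding a message for one specific 32-bit value) would take around 2^32 attempts on average, which is the 2^(n/2) vs 2^n gap being discussed.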
I don't think that "look, raw brute force has this property" is at all useful in this context, where you'd obviously compare against a real attack, not brute force. There's no reason to believe (and every reason not to) that the same property somehow applies.
That Stack Exchange answer also immediately set off alarm bells in my head because it pretends to be entirely generic, but the obvious thing to do with entirely generic cryptographic intuitions is to apply them to the one-time pad and check that their answers work. This intuition doesn't work: even if you could try all the possible keys, you learn nothing, because of the hand-waving about "plausible" plaintext.
The birthday attack is a real attack and often useful in practice. "Just use brute force" is a huge oversimplification, but the SO link explains it in more detail.
The one-time pad is not a hash algorithm, so obviously a generic collision attack on a hash function doesn't apply to it.
I used to struggle with this too, but now I look at it this way: you're always at risk of being breached when connecting to the Internet (zero-days in the browser, the router, maybe IoT devices on the local network, a supply-chain attack on some installed software, ...). Everything you add to your system/network adds attack surface. But somewhat popular GitHub projects are usually low risk, because 1) enough people are looking into them to be reasonably sure there's nothing funny in the code base, and 2) they're not big enough to be an interesting target for malicious actors.
I think a big part of why it feels scary is the unpredictability you mention: you don't know how you would be compromised, or whether you would even notice. Sure, you could get compromised and then spread the infection, but it's extremely hard to build malware like that. The much more likely scenario is that the malware tries to steal crypto or encrypts your files. The chances that something really bad would happen are very slim (do you even have large amounts of crypto? do you have no backups of important files?). In the end, that's just a risk you'll have to live with (when connecting to the Internet), just like you're at risk of getting hit by a car when going outside.