Is there a good read on how RNA and DNA work for a Computer Scientist? And more generally, how biology, genetics, epigenetics, viruses, etc. work?
Many vulgarization sources say that DNA is like the source code of life. But they mostly skim across the issue and go to conclusions like "this gene or set of genes are responsible for that outcome".
But coming from a CS background that sounds a bit like nonsense. I feel like it is like saying that "this processor instruction is responsible for that outcome". But in the end what is important is not the individual instruction but the interaction between them and the environment (Input / Output).
I wouldn't recommend trying to understand it through analogy to CS - I'm a biologist and know enough CS to feel uncomfortable really using any of the usual analogies. The problem is that they work ok for a layperson's understanding, but are fundamentally wrong enough that if you want to understand it at the level of something like virology you'll constantly be led astray. I don't know of a better recommendation than "watch some recorded bio 101 lectures or pick up a textbook, and learn it like a bio major instead of a CS major." It's a big time commitment, but I think you have to put in the time to actually learn it from first principles.
For context, imagine if I as a biologist asked you, what's a good read on CS for a biologist? And generally how algorithms, data science, AI, and operating systems work?
You can understand why it would be difficult to recommend any readings on it. For me to learn to code at any level, I had to spend the time learning it from the fundamentals, and it was worth spending the time for me as it was far more rewarding than trying to wrap my mind around something like "so imagine source code is this thing that's like DNA + epigenetics + DNA topology that codes for the information you want, but instead of a complex set of chemical and physical interactions, it's math, and you can change it at will and nothing is leaky on purpose, and the code you write doesn't evolve unless you tell it to, which you might want to do sometimes for AI stuff, but usually not."
For example, the covid-19 virus's RNA ends with AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA (33 As) which, as someone pointed out, looks suspiciously like a "NOP sled"[1] (i.e. so that a protein coming in hot can hit anywhere on the sled and then slide down to the actual information)
A poly A tail is at the end of the transcript, not the start. This is a perfect example of the dangers of analogies because this one is completely incorrect. Poly A tails protect RNA from degradation. A ribosome will not bind to that part; it will bind to something within the 5' untranslated region on the other end of the genome (ribosome binding site, Kozak consensus sequence, etc.). If you bound to the poly A tail and slid "down," you'd just slide right off the end of the gene.
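To make the 5'/3' point concrete, here is a minimal Python sketch. The toy transcript and both helper names are invented for illustration, and "find the first AUG" is a crude stand-in for how a ribosome actually picks a start site.

```python
# Toy mRNA written 5' -> 3'. The sequence is invented for illustration only.
TOY_MRNA = "GGCACC" + "AUGGCUUAA" + "CUGC" + "A" * 33

def poly_a_length(mrna: str) -> int:
    """Count the run of A's at the 3' end (the right-hand end of the string)."""
    return len(mrna) - len(mrna.rstrip("A"))

def first_start_codon(mrna: str) -> int:
    """Index of the first AUG when scanning from the 5' end, or -1 if absent."""
    return mrna.find("AUG")

tail = poly_a_length(TOY_MRNA)
start = first_start_codon(TOY_MRNA)
print(f"poly-A tail: last {tail} bases; first AUG at index {start}")
```

The start codon sits near the left (5') end and the tail is at the right (3') end, so anything landing on the tail and moving toward the 3' end has nothing left to read.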
Definitely nothing wrong with trying to find things that look similar across fields and try and gain differentiated insights, it's certainly an admirable thing and an important part of making many discoveries. But a first principles understanding of the structure of DNA, what a 5' end and a 3' end is, and how transcription/translation works (all bio 101 type stuff) would provide a much better foundation to do that with. You don't need a degree in biology to know the basics, just like you don't need a degree in CS to know the basics, but it's worth learning the basics if you want to get an understanding of the field or work with it.
It’s hard to disagree with what you're saying, but analogies are fun.
I think the best analogy is DNA is a series of automated assembly lines. The workers, raw materials, and signals to turn sections on or off are separate. Each nucleotide station is called A, T, C, or G and adds a single element at a time to long chains which then get folded up into useful tools. DNA can also be used to manufacture either more strands of DNA or to do very simple repairs.
Alternatively, a stack of computer punch cards with a program on them. In that it’s a standard format for information transfer that is physically altered by the message it carries. Further, on its own nothing happens but connected to the right machinery it’s useful.
Edit - I'll concede that analogies are absolutely fun haha. They might be useful to come up with when you yourself are trying to better understand something by drawing connections to other things you know and all. But I don't think they're ideal for being taught the subject. And again, for a layperson, analogies are totally fine if you just want a surface level understanding of "oh, mitochondria are the powerhouses of the cell." But if you want to do anything with that, beware the dangers.
The second analogy is ok, but again not really great if you want to understand what's actually going on. If you just want to get an intuition that DNA has something to do with storing information, as long as that comes with an understanding that information evolves (biology is really the study of evolution in many ways), that's probably more than enough for a layperson. But it's a dangerous understanding if you plan on using it as a foundation to try and explore more complicated topics in biology. There's no need to do so, but if you are going to, it's far far better to spend the time getting the correct first-principles understanding and get rid of this idea of "DNA as source code" so you can better grapple with the subject.
I don't think the first analogy makes much sense since it makes nucleotides the sort of active agents here and needlessly focuses on individual nucleotides when codon triplets are what are involved in coding for proteins (probably better to, if you have to use an analogy, treat nucleotides as letters instead of words). And, especially in the context of this thread, neither analogy builds the understanding needed to really dive into what RNA is, which is important for understanding what RNA viruses are, and what the coronavirus is.
The problem with including codons in such a high level analogy of DNA is they’re context sensitive. DNA doesn’t care about those details and can be sliced and diced to completely change how it’s interpreted.
Also, you really should include activation sites etc, but soon you’re out of the realms of analogy and just describing the details.
The big component missing in the analogy is gene expression. That’s what makes DNA behave differently from simple instruction mappings.
I guess we could visualize it as a series of drains, where the size of the hole can be modulated. Since gene expression can decrease or increase the rate at which DNA is transcribed
I have been using this for a while. What would you suggest instead of:
> The workers, raw materials, and signals to turn sections on or off are separate.
People are used to the idea of thermostats using on/off to maintain temperature which is not a terrible association. As you say gene expression is different, but I can’t find a better analogy.
That's why I figured an analogy would include nucleotides as letters that make up words rather than acting on their own to be able to encompass all that (and I love that I'm trying to help craft better analogies in a chain where I'm also railing against analogies)
And it also doesn't contain all of the source code and the error correction is when it tries to copy itself? I mean, just thinking of it as a type of biological source code that makes more of itself with mistakes and is messy is a decent enough analogy, and again I think most of these are fine just to be able to know as a layperson "oh right, DNA is that thing that has a lot of information about what the living thing is going to be like." You don't really need to know more than that day to day as long as other people do - I don't know how to fix my car but I know there are people out there that know how it works. If you are going to be more involved with it or want to know more about it for whatever reason though it's maybe worth diving into the bio 101 stuff then.
I don't actually know of any. I've heard of like biostar and maybe a couple others but I don't know how good those are. Researchgate is kind of a forum I guess? But most of the questions I've seen in the social media side of that are like "how do I turn this experiment"
I think most of the discussions I've had have been journal clubs during lab meetings. An hn style website for virtual journal clubs would be super cool...
IMHO, biology is most easily looked at as a recipe for order, from whatever ingredients happen to be lying around.
Essentially, everything is in the service of increasing order. And maintaining it.
Polymerase creates more genetic code out of raw ingredients.
Genetic code creates proteins.
Proteins perform all kinds of functions by leveraging physics and chemistry.
Membranes prevent all of this from being blown away and scattered in the chemical equivalent of a slight breeze.
... so, as you ask, 'Why genetic material in the first place?'
And to that, something of a tautology: because reproducing order reproduces.
Randomly assembled self-assembling assemblages constitute the major components of our environment, because everything else... didn't.
A rock is still one rock. Water is still water.
A blade of grass is now a meadow. A tree, a forest. A slightly intelligent primate, civilization as we know it. And a single mutated coronavirus, a pandemic in every country on the globe.
To me, what blows most comparisons and analogies out of the water, is the insane complexity to it all. That genes can influence and interact with each other. So gene expression (= gene -> protein) can change gene expression, e.g. based on environmental factors or folding of DNA etc...
Gene A is responsible for property A is way way way too simplified.
I have a microbiology degree and an EE degree (but more like comp E).
DNA truly is a blueprint. (Almost) every cell in your body has the source code for your entire body.
"this gene or set of genes are responsible for that outcome" is generally going to be a true statement.
The equivalent wouldn't necessarily be a processor instruction.
If you wanted an analogy, a cell is like an entire minimal computer. Power supply, processor, memory, and I/O. Each computer can run different parts of the code, but every computer has the entire code base.
Within the computer are specialized "organelles" for storage, graphics, network, etc. These are comparable to organelles such as mitochondria, nucleus, endoplasmic reticulum etc.
DNA acts like code in that it produces outputs (messages) and receives inputs. Proteins are also like code in that they produce outputs and receive inputs.
It might be that DNA is like the executable source code that is stored as files and proteins are more like in memory processes such as services and actively running programs that get spawned from the files to perform various tasks.
Viruses are like hostile code inbound in an email or coming through as tainted packets. They can't actually do anything on their own - they are inert, but once the processor starts executing the code, they hijack the processor. Sometimes to spawn more malicious code to other computers in a trusted network.
To carry the analogy further, departments in a company are a bit like organs. The computers in that department all roughly do the same thing. Imagine if IT imaged every computer in the company the exact same way, but what software was actually used would be based on your department.
The reason why CS analogies don't work very well is that molecular biology is very context driven, and we often don't know what the context is.
For example, in eukaryotes, the same piece of DNA can produce different amounts of different proteins (i.e. with different amino acid sequences and different post-translational modifications), depending on things like DNA methylation (there are a number of different types), genomic location, accessibility, transcript methylation, multiple mRNA isoforms - the list goes on. A combination of these and other still unknown factors produces the final product, which may vary within a cell, between cells in a tissue, and between tissues.
Things behave consistently and predictably, except when they don't. Makes the field interesting and frustrating.
Analogies obviously don't hold for many cases. But it's a good entry point to explain some basic concepts such as the Central Dogma. I am a biologist who has to teach molecular biology to computer scientists in our Bioinformatics department. It was the only way to bridge the gap. Once we got it we could start on the more detailed stuff.
Biophys grad student here. If I'm understanding your frustration correctly, you're basically saying: "yes for the 1000th time, I KNOW DNA is the recipe book, but who is the chef?" Well, I'm sorry to report that there are a lot of chefs. And to make things worse, every cell is different, so there are also a lot of restaurants. For this reason, it is not possible to make a book on any of the topics you listed above without skimming.
The solution to this is narrow your scope: is there a particular subtopic that interests you?
I have discussed this exact analogy with a couple colleagues from a CS background. If someone said the genome is the source code that's very wrong. The more accurate analogy is that your genome is a compiled, highly compressed executable file (more or less 200 MB in size?) where the optimizer blindly reuses variables and memory locations for multiple routines that may or may not be related. It probably is quite similar to Oracle's database codebase, where all this shit is in C, and you change some variable here and hell breaks loose somewhere else. For the most part though, if you don't tinker with it, it runs like a well oiled machine because a blind watchmaker programmer (with great intuition but no CS organization fundamentals) has debugged and patched it for billions of years, patch on top of patch millions of times over.
It's like writing ultimate spaghetti code in a dynamic language.
The nature of biology is much like classes and data floating around and all seemingly randomly interacting with each other, because ultimately this is just chemistry with thousands of unique compounds, which makes things messy.
But through natural selection it seems to work. Like a totally incomprehensible compression algorithm learned by neural nets or evolved by genetic code. There should be a source somewhere around here...
Once I tried to learn this stuff by asking a biochemist, but I soon became frustrated because it seems there are no 100% rules, only "it is usually like this" all the way down.
Like, did they teach you at school that eukaryotes have two sets of chromosomes, one from each parent? Yeah, "usually". But then also:
https://en.wikipedia.org/wiki/Polyploidy
> It's like writing ultimate spaghetti code in a dynamic language.
> The nature of biology is much like classes and data floating around and all seemingly randomly interacting with each other, because ultimately this is just chemistry with thousands of unique compounds, which makes things messy.
It should be noted that biological systems were one of the explicit inspirations for extremely dynamic and object oriented languages like Smalltalk.
Far from being a negative aspect, the 'centerless' design and high degree of decentralization is messy, but also extremely robust. The internet as a whole is another example of a human technology that utilizes that sort of design.
Proteins are built out of a chain of amino acids. There are 20 different kinds of amino acid, and the particular sequence of amino acids determines the shape of the protein, and therefore its biological function.
To build a protein, the cell uses a strand of RNA as a template. The RNA consists of a sequence of 4 kinds of bases (A=Adenine, C=Cytosine, G=Guanine, U=Uracil) and there is a genetic code that maps each group of 3 RNA bases into one amino acid. For example, "AGC" corresponds to serine, "AAG" means lysine, and "AGCAAG" means serine followed by lysine.
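The codon lookup described above can be sketched in a few lines of Python. The table below is deliberately partial (the full standard code has 64 entries), and the function name is just illustrative.

```python
# Partial codon table: only the codons needed for this example.
# The full standard genetic code has 64 entries.
CODON_TABLE = {
    "AGC": "Ser",
    "AAG": "Lys",
    "AUG": "Met",
    "UAA": "STOP",
}

def translate(rna: str) -> list[str]:
    """Read the RNA three bases at a time and look up each codon."""
    protein = []
    usable_length = len(rna) - len(rna) % 3  # ignore an incomplete trailing codon
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

print(translate("AGCAAG"))  # ['Ser', 'Lys']
```

A codon missing from the partial table raises a KeyError, which is fine for a sketch but would need the full table for real sequences.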
DNA is built of similar pieces as RNA. The main difference is that RNA is single stranded, while DNA is double stranded with that famous double helix structure. Additionally, DNA uses T=Thymine in place of U=Uracil.
DNA acts as a store of information. Each gene contains the genetic sequence that describes how to build one protein. When a gene is active, the cell copies that part of the DNA into a messenger RNA, and then uses that RNA as a template to build the protein in question.
Every cell in the body has the exact same set of genes, but they differ in what genes are active at each time.
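As a rough sketch of "same genes, different activity", assume a made-up two-gene genome and two made-up cell types. Transcription is simplified here to copying the coding strand with U in place of T.

```python
# Hypothetical two-gene genome; names and sequences are invented.
GENOME = {"gene_x": "ATGAGCAAG", "gene_y": "ATGAAGAGC"}

# Every cell type carries the same GENOME; only the active set differs.
ACTIVE_GENES = {
    "skin_cell": {"gene_x"},
    "neuron": {"gene_x", "gene_y"},
}

def transcribe(coding_strand: str) -> str:
    """mRNA matches the DNA coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")

def messenger_rnas(cell_type: str) -> dict[str, str]:
    """Transcribe only the genes that are active in this cell type."""
    return {
        name: transcribe(GENOME[name])
        for name in sorted(ACTIVE_GENES[cell_type])
    }

print(messenger_rnas("skin_cell"))
print(messenger_rnas("neuron"))
```

Both cell types read from the same dictionary, so any difference in output comes purely from which genes are switched on.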
------------------------------------
Now on to viruses...
The coronavirus is an RNA virus. Each individual virus is a little ball made of proteins and lipids, encasing a genome made of RNA.
By itself the virus is inert, but once that RNA finds its way to inside a human cell, it behaves like a fork bomb. The cell translates the viral RNA into proteins, just like it would with our own RNA. One of the proteins is a polymerase enzyme, which then makes even more copies of the viral genome. Soon, there are thousands of copies of the virus. Additionally, the viral genome also encodes the structural proteins for the capsule of the virus, including the "spike" protein that gives coronaviruses their signature look.
The polymerase in most RNA viruses is an RNA-dependent RNA polymerase, an enzyme that makes copies of RNA. So what happens is: a single piece of viral RNA comes in the cell. Then it gets translated by the cell machinery, producing the polymerase enzyme. That polymerase then makes more copies of the viral genome. Which in turn gets translated into even more polymerase. Which then make even more copies of the virus... That is where the fork bomb analogy comes in.
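A toy simulation of that feedback loop, with the caveat that the two rate parameters are arbitrary numbers chosen for illustration and not measured values:

```python
def simulate_replication(cycles: int,
                         polymerases_per_genome: int = 2,
                         copies_per_polymerase: int = 3) -> list[int]:
    """Return the viral genome count after each cycle, starting from one genome.

    Both rate parameters are invented for illustration.
    """
    genomes, polymerases = 1, 0
    history = [genomes]
    for _ in range(cycles):
        # Every genome gets translated into new polymerase enzymes...
        polymerases += genomes * polymerases_per_genome
        # ...and every polymerase copies the genome.
        genomes += polymerases * copies_per_polymerase
        history.append(genomes)
    return history

print(simulate_replication(3))  # [1, 7, 55, 433]
```

Because each product feeds the production of the other, the count grows roughly geometrically, which is the fork bomb behaviour described above.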
That kind of thing doesn't normally happen to us because we don't have those enzymes that can make copies of their own messenger RNA. Putting those fork bombs in production would be asking for trouble :)
Humans have developed protections against some rogue RNA, since out of control fork bombs randomly happening isn't a thing that makes you successful at reproduction.
The Virus reproduces using a fork bomb but for us it's deadly.
I listened to the Great Courses audiobook about Biology. There's a lot about how the machinery of the cell works, how it evolved, and so on. He goes through everything else as well, including the immune system, ecology, and so on. It all ties together.
Regarding the genes and central dogma, he gives a LOT of detail about how for instance it was discovered that codons work in threes (not twos or fours). Also how it was discovered that DNA is the thing with the code, how the transcription process works, how the different kinds of RNA are involved, all the way to how the protein comes out.
The audiobook also comes with a downloadable book in case you'd rather read it.
> Many vulgarization sources say that DNA is like the source code of life.
DNA is a tape archive of files (genes). They are not binary (base-2) encoded, but base-4 encoded. A reading head (polymerase) reads sections of the tape and copies them into working memory. A 3D printer (ribosome) receives these sections from the working memory and prints nano machines (proteins) based on these tape snippets. DNA is not source code, it does not contain conditions or jumps (loops), it's data for the 3D printer. (Actually it's just a 1D printer, but the result folds into a 3D shape).
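The base-4 point can be sketched as a 2-bits-per-base packing in Python. The particular bit assignment below is arbitrary, chosen only for the example.

```python
# Arbitrary 2-bit assignment; any one-to-one mapping would work.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def pack(dna: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in dna:
        value = (value << 2) | BASE_TO_BITS[base]
    return value

def unpack(value: int, length: int) -> str:
    """Inverse of pack; the length is needed because leading A's pack to zero bits."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))

packed = pack("GATTACA")
print(bin(packed), unpack(packed, 7))
```

At two bits per base, four bases fit in one byte, which is where back-of-envelope genome size estimates come from.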
I have a CS background and for me the best explanation of how life fundamentally works was the MIT 7.00x "Introduction to Biology - The Secret of Life" course. It is absolutely amazing, requires only high school level knowledge and covers so much incredible stuff that I cannot possibly list all of it here. And not only is this course teaching you how stuff works but also the history behind it, how we, humans, figured it out. And, it's free and available online!
If you ever wanted to get deeper into biology than reading news articles, I cannot recommend this course enough.
The cell is an analog computer, but one whose computational state we are still working out how to describe (it consists of the concentrations of transcription factors as well as epigenetic modifications of DNA and chromatin), and whose evolution in time is governed by simple binding and enzyme kinetics, but that are integrated into networks that can be very complicated.
One really simplified model (inaccurate, but useful) of the computational state and evolution of the cell involves DNA methylation. DNA is methylated to shut it off, demethylation turns it on. To turn some DNA off or on, either DNA methyltransferase (DNMT3a/b) for methylation or one of the Tet proteins for demethylation needs to be guided to that region of DNA, usually through the assembly of some transcription factor complexes. The assembly of these complexes is combinatorial and performs logical addressing depending on the cell state. For example the cell may be expressing TF's A, B, and C, and these guide DNMT3a/b to DNA locus 3xq3945 (made up, whatever) if beta catenin translocates to the nucleus as a response to Wnt signaling, to turn something off. If TF's A, B, and D are expressed, it may bring it to a different location, and may bring Tet there instead, for demethylation.
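Here is how that simplified model might look as code. The two rules mirror the made-up example above; "locus_2" and the exact rule shapes are additional assumptions, and real regulation is nowhere near this tidy.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    required_tfs: frozenset   # transcription factors that must all be expressed
    needs_wnt_signal: bool    # whether beta catenin must be in the nucleus
    enzyme: str               # "DNMT3" methylates (off), "TET" demethylates (on)
    locus: str                # made-up locus name

RULES = [
    Rule(frozenset("ABC"), True, "DNMT3", "3xq3945"),
    Rule(frozenset("ABD"), False, "TET", "locus_2"),  # "locus_2" is hypothetical
]

def apply_rules(methylated: dict, expressed_tfs: set, wnt_signal: bool) -> dict:
    """Return a new locus -> is_methylated map after every matching rule fires."""
    state = dict(methylated)
    for rule in RULES:
        tfs_present = rule.required_tfs <= expressed_tfs
        signal_ok = wnt_signal or not rule.needs_wnt_signal
        if tfs_present and signal_ok:
            state[rule.locus] = (rule.enzyme == "DNMT3")
    return state

start = {"3xq3945": False, "locus_2": True}
print(apply_rules(start, {"A", "B", "C"}, wnt_signal=True))
print(apply_rules(start, {"A", "B", "D"}, wnt_signal=False))
```

The combination of expressed factors acts like an address: swapping C for D sends a different enzyme to a different locus.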
Anyways, that's an example of "computery" description of cell biology. I don't really like how other responders shot your question down, I think the computer science mindset is really relevant to biology.
Not necessarily specifically RNA/DNA but a super accessible description of biology is The Machinery of Life (https://www.goodreads.com/book/show/6601267-the-machinery-of...). I can't get over how much it makes me want to learn more, and as a lay-person, I think it gave me a much better idea of what's going on at pretty much an atomic level.
> Is there a good read on how RNA and DNA work for a Computer Scientist?
DNA is source code. RNA is an intermediate representation. Enzymes compile DNA into mRNA. Ribosomes compile mRNA into proteins which can actually execute a wide variety of functions.
Proteins may require post-processing before they're usable. The output of a ribosome is a primary protein: a linear, one-dimensional structure. By interacting with itself and other cell machinery, the protein could be folded into three-dimensional secondary and tertiary structures. Several of them might be assembled into a quaternary protein.
The mitochondrion and the ATP synthase are like a hydroelectric power plant. The electrochemical gradient is the gravity, the hydrogen is the water and the enzyme is the turbine.
Enzymes transform substrate into product. They're like functions. When one enzyme's product is another's substrate, a pipeline is formed. Metabolism is a parallel process. I've written about metabolism as a programming metaphor before:
Sometimes bacteria save useful snippets of their own source code into gene cassettes. This allows the code to be exported to other bacteria via horizontal gene transfer.
Developed an awesome antibiotic-destroying enzyme? Make copies and send them to friends through a pipe. Didn't manage to survive? Said friends might be able to absorb the DNA from the environment anyway. They can obtain the fallen's powers just like Mega Man.
DNA is literally only instructions for building proteins (each 'gene' commonly encodes one protein). It's a very low level code, then.
The various proteins that are built are actually interacting with the environment.
So, your desire to look at interaction with the environment might be better captured by a field other than genetics: maybe proteomics or molecular biochemistry.
Maybe you can help me understand a simple example - if there is a gene that controls the eye color, how does it express itself as eye color (vs something completely different)
It's important to note that there isn't a single gene that controls eye colour.
Eye colour is modulated by several genes which haven't necessarily been identified - I understand this to be an open question.
I can give an example, though:
Eye colour is a function of what pigments are present in the iris, the most relevant pigment (in humans) being melanin, which also contributes to dark skin and tanning.
So, melanin is produced via the biochemical pathway 'melanogenesis'.
Among other things, the process of melanogenesis requires the enzyme tyrosinase, a protein which has a decent length wikipedia page.
If you are missing the TYR gene which codes for the protein tyrosinase, then you will likely be albino. This won't only affect your eyes, then.
I say you will 'likely' be albino only because it's hypothetically possible that a lack of tyrosinase could be compensated for with another biochemical pathway. And, of course, albinism could come about by some other break in melanin production.
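A toy version of that pathway in Python. Only TYR/tyrosinase comes from the discussion above; the "DOWNSTREAM" entry is a placeholder that lumps together every later step, and the sketch ignores the compensating pathways just mentioned.

```python
# Each step: (substrate, product, gene whose enzyme catalyses the step).
# "DOWNSTREAM" is a placeholder standing in for all later steps of melanogenesis.
PATHWAY = [
    ("tyrosine", "dopaquinone", "TYR"),
    ("dopaquinone", "melanin", "DOWNSTREAM"),
]

def run_pathway(working_genes: set, start: str = "tyrosine") -> str:
    """Return the compound the pathway ends on, given which genes work."""
    compound = start
    for substrate, product, gene in PATHWAY:
        if compound != substrate or gene not in working_genes:
            break  # the pathway stalls at the first missing enzyme
        compound = product
    return compound

print(run_pathway({"TYR", "DOWNSTREAM"}))  # melanin
print(run_pathway({"DOWNSTREAM"}))         # tyrosine
```

Losing TYR stalls the chain at the very first step, so no melanin is made anywhere that uses this pathway, not just in the eyes.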
A gene is like a DLL. Asking how it works would require a lot of know-how that’s specific to the usage.
When talking about how DNA is an instruction for making a protein, that protein is your data structure or a function. The next level of fundamental building block. Genes can contain a lot of information or non-information for proteins. Just like a program might have a lot of functions or data. The way they work together and interact is what creates a larger effect.
I think maybe the best starting point for understanding biology is learning about enzymes up close. They are proteins that act as “machines”. It gets down to physics and chemistry to determine how they work, so that’s probably when to drop the analogies.
Basically, the variance in melanin is what causes different eye colors. So, if you have a gene that produces melanin, and then turn its expression down, then you might get bluer eyes.
The complicated part is that a lot of genes actually control for eye color - that is where the complications come. Each gene interacts with every other gene in a unique way.
DNA code (acgaa...) is not the only thing to be taken into account. Folding of DNA is just one more factor that influences outcome aside from raw code.
I read this as a kid: https://www.amazon.co.uk/Cartoon-Guide-Genetics/dp/006273099...
It provides a very useful bridge from a mechanistic / physic-y / computer point of view, to the more messy biochemistry. Still, whenever I think of a biochemical process, it's a cartoon from here that comes to my mind's eye.
So, from a CS point of view, I guess instead of comparing DNA to procedurally executed machine code we should be comparing it to the source code of a set of classes, which once instantiated produces specific proteins that perform all sorts of biological tasks in our bodies.
- proteins: executable object code (ribosomes are also proteins, mostly)
- epigenetics: build config options
- promoter site: `ifdef
First, imagine a system where every time you compile something, the object code is immediately launched in its own thread executing truly in parallel with all the others.
The overall behavior of the cell is the interaction of all these threads. A gene is a single block of code. They can do lots of things, including changing build options and enabling/disabling the generation of other blocks of code. We say a particular gene causes a particular effect, but really it's the aggregate interaction that does things. Most traits require the cooperation of many genes to express. If we associate a gene with a particular trait, it's usually because it is a critical component of that trait; necessary so if you break the gene the normal aggregate behavior goes away, but not sufficient - it still needs a lot of other working genes to have the right effect.
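The "necessary but not sufficient" distinction can be spelled out in a small sketch. The gene names and the all-or-nothing rule are hypothetical simplifications.

```python
# Hypothetical trait that needs three cooperating genes.
TRAIT_REQUIRES = frozenset({"gene_a", "gene_b", "gene_c"})

def trait_expressed(working_genes) -> bool:
    """The trait shows up only when every required gene works."""
    return TRAIT_REQUIRES <= set(working_genes)

def is_necessary(gene: str) -> bool:
    """Breaking this one gene (with all others intact) removes the trait."""
    return trait_expressed(TRAIT_REQUIRES) and not trait_expressed(TRAIT_REQUIRES - {gene})

def is_sufficient(gene: str) -> bool:
    """This gene alone produces the trait."""
    return trait_expressed({gene})

for gene in sorted(TRAIT_REQUIRES):
    print(gene, "necessary:", is_necessary(gene), "sufficient:", is_sufficient(gene))
```

A knockout experiment would flag each of the three genes as "the gene for" the trait, even though none of them does anything useful on its own.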
Now imagine that some of those threads run a constant cleanup process that kills running threads, deletes your object code and deletes files in your working directory (largely at random) while another set of threads are constantly fetching (transcriptase) and recompiling (ribosomes) code to launch new threads to keep the system working and your services up. During this process the relative number and type of running threads will change according to a combination of code and the environment.
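A minimal sketch of that tug-of-war between constant production and random cleanup, assuming a fixed production rate and a fixed fraction removed per step (both numbers are invented):

```python
def simulate_turnover(synthesis_per_step: float,
                      decay_fraction: float,
                      steps: int,
                      start: float = 0.0) -> float:
    """Level of one kind of thread/protein after `steps` rounds of make-and-destroy."""
    level = start
    for _ in range(steps):
        # New copies are launched while a fixed fraction of existing ones is removed.
        level += synthesis_per_step - decay_fraction * level
    return level

# The level settles near synthesis_per_step / decay_fraction.
print(round(simulate_turnover(10.0, 0.1, 200), 3))
print(round(simulate_turnover(20.0, 0.1, 200), 3))
```

Nothing is stored permanently; the amount present is a balance point, and changing either rate (by code or by environment) moves it.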
Viruses are a tiny executable (protein coat) that copies some rogue code (RNA) into your working directory, so that constant recompilation will generate new instances that copy more code, etc.
Too many running virus instances will crash the system (kill the cell) by starving/corrupting normal processes. The random deletion and constant fetching from the repo slows the virus replication down however, and may stop it entirely.
_Retro_viruses have an additional piece:
- reverse transcriptase: 'git commit; git push'
which will copy the virus code back into your repo. Now the regular 'fetch' will copy the virus code back into your working directory.
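For the retrovirus step, a rough sketch of the string operations involved. The sequences, the insertion position, and the function names are all made up, and real integration is far more involved than slicing a string.

```python
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_transcribe(rna: str) -> str:
    """First DNA strand, complementary to the RNA, written 5' -> 3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rna))

def second_strand(dna: str) -> str:
    """Complement of a DNA strand, written 5' -> 3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))

def integrate(host_genome: str, viral_rna: str, position: int) -> str:
    """Insert the DNA copy of the viral RNA into the host sequence (the 'push')."""
    provirus = second_strand(reverse_transcribe(viral_rna))
    return host_genome[:position] + provirus + host_genome[position:]

print(integrate("CCCCGGGG", "AUGGC", 4))  # CCCCATGGCGGGG
```

After integration the viral sequence is just more host DNA, so ordinary transcription will keep producing viral RNA from it.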
Your immune system has processes which run around checking object hashes for threads. There's a whitelist for expected hashes (normal threads). If the same unexpected hash shows up too many times, it goes on a blacklist and you start generating antibodies- special threads that go around searching for a specific object hash and tagging that thread for deletion. If too many bad hashes are found in the same place, a white blood cell nukes the whole site from orbit.
Once you generate enough antibodies, the virus threads start getting killed before they can replicate and you're immune. If the virus mutates, the hash may not match anymore and you'll be vulnerable to the new strain. Things like the common cold and influenza mutate all the time, so you can catch them over and over, while chicken pox rarely mutates and you can usually only get it once.
Vaccines are a bunch of copies of (usually inactive) threads with a particular hash, to encourage your immune system to put that hash on the blacklist.
Autoimmune diseases happen when valid code accidentally gets on the blacklist.
Cancer (a fork bomb) tends to evade the immune system because its code was already on the whitelist.
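The whitelist/blacklist picture can be written down directly. Everything here is part of the analogy rather than immunology: the class, the threshold, and the use of SHA-256 are illustrative assumptions, and real antibodies match molecular shapes, not exact hashes.

```python
import hashlib
from collections import Counter

def object_hash(code: str) -> str:
    """Short fingerprint of a piece of 'object code'."""
    return hashlib.sha256(code.encode()).hexdigest()[:8]

class ToyImmuneSystem:
    def __init__(self, self_code: list[str], threshold: int = 3):
        self.whitelist = {object_hash(code) for code in self_code}
        self.blacklist: set[str] = set()
        self.sightings: Counter = Counter()
        self.threshold = threshold  # unknown sightings needed before blacklisting

    def inspect(self, code: str) -> str:
        fingerprint = object_hash(code)
        if fingerprint in self.whitelist:
            return "ignore"
        if fingerprint in self.blacklist:
            return "kill"
        self.sightings[fingerprint] += 1
        if self.sightings[fingerprint] >= self.threshold:
            self.blacklist.add(fingerprint)  # start making 'antibodies'
        return "unknown"

    def vaccinate(self, code: str) -> None:
        """Blacklist a fingerprint without going through an infection first."""
        self.blacklist.add(object_hash(code))

immune = ToyImmuneSystem(self_code=["AUGAAA"], threshold=2)
print([immune.inspect("GGGCCC") for _ in range(3)])
print(immune.inspect("GGGCCA"))  # one-letter mutation, new fingerprint
```

A single changed letter gives a completely different fingerprint, which is the analogy's version of a mutated strain slipping past existing immunity.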
But... this is a super simplified version of how things work, and really only applies to mammals. Do not take these analogies too far.
Biology is fascinating, and definitely worth deeper study.