This program was written for the "Genetische Algorithmen" ("Genetic Algorithms") lab of the Hochschule Darmstadt University of Applied Sciences (H_DA).
Jan Parisek (734059)
2022-01-31
Reference specs: i7-4790K, DDR3-1600 memory
Results are usually above 30, with run times of around 7-8 seconds.
The following candidate was achieved within 7.6 seconds.
Rendering of the best candidate.
All candidates were achieved with the following settings:
- 2000 generations
- 1000 candidates
- Sequence `SEQ60`
- 1.25 mutation base rate
- 125 crossover operations per generation
- Tournament selection
- 10 candidates per tournament
- Intra-generational genetic diversity dynamic mutation modifier
The following candidate was achieved within 15 seconds.
Rendering of the best candidate.
- The `Settings` class can be used to make adjustments to the algorithm.
- The `Examples` class can be used to add custom sequences.
- Results are put into the `./docs/` directory.
The code itself may be changed at your own discretion.
- The protein sequence specified in `Settings` gets parsed from `Examples` into a `Sequence` instance.
- 100 `Protein`s are created and put into a `Population`, representing a generation. Each one has a random genotype, i.e. a folding sequence.
- A `Protein` consists of multiple `Aminoacid`s. Each `Aminoacid` contains its own value (hydrophobic or hydrophilic) and a cached position. The position is cached as a performance optimization.
- Several candidates of the current population are selected; a candidate may be selected more than once and thus be duplicated. After that, random mutations and crossovers may happen.
- The current generation's data gets logged with the `Logger` class.
- Finally, the best candidate is rendered by the `Renderer` class.
- All `Protein`s of one generation are put into a list of pairs.
- Each pair consists of a fitness value and the `Protein` object.
- The fitness of all `Protein`s is added together and used to normalize the fitness of each `Protein`, so that the normalized values add up to 1.0.
- The list of pairs gets populated. The fitness value of each pair consists of the previous pair's fitness value plus the attached `Protein`'s normalized fitness value. This represents a chain with increasing (cumulative) fitness values.
- A random double value between 0 and 1 is chosen.
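The cumulative chain described above is what is commonly called fitness-proportionate (roulette wheel) selection. The following Python sketch is illustrative only and is not taken from the project's code. The names `build_cumulative` and `roulette_select` are hypothetical, and the final lookup step (taking the first pair whose cumulative value exceeds the random number) is an assumption based on how this technique usually works.

```python
import bisect
import random
from itertools import accumulate


def build_cumulative(fitnesses):
    """Return the chain of cumulative, normalized fitness values."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("total fitness must be positive")
    return list(accumulate(f / total for f in fitnesses))


def roulette_select(population, fitnesses, rng=random):
    """Pick one candidate with probability proportional to its fitness."""
    chain = build_cumulative(fitnesses)
    r = rng.random()  # random value in [0, 1)
    # Assumed final step: first pair whose cumulative value exceeds r.
    index = bisect.bisect_right(chain, r)
    # Guard against floating-point round-off in the last chain entry.
    return population[min(index, len(population) - 1)]
```

A candidate with a larger share of the total fitness covers a larger part of the interval from 0 to 1 and is therefore hit more often by the random value.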
- A number of `Protein`s is selected at random from the old generation.
- The best `Protein` from that selection is added to the next generation.
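The two tournament steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and drawing the contestants without replacement is an assumption, since the text only says they are selected at random.

```python
import random


def tournament_select(population, fitness, tournament_size=10, rng=random):
    """Return the fittest of a random handful of candidates."""
    size = min(tournament_size, len(population))
    # Assumption: contestants are drawn without replacement.
    contestants = rng.sample(population, size)
    return max(contestants, key=fitness)


def next_generation(population, fitness, tournament_size=10, rng=random):
    """Run one tournament per slot of the new generation."""
    return [
        tournament_select(population, fitness, tournament_size, rng)
        for _ in population
    ]
```

A larger `tournament_size` increases selection pressure, because weak candidates are less likely to be the best of their tournament.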
These algorithms dynamically adjust the mutation rate by acting as a multiplier.
- The first gene of every `Protein` is counted and stored in a map.
- The most dominant gene, i.e. the value with the highest count, is determined.
- The count of the most dominant gene is mapped from the range between its theoretical minimum and maximum to a value between 0 and 1. The more the first genes of all `Protein`s differ, the lower the value. The more they are alike, the higher the value. This value represents the diversity of the first gene in a generation.
- The previous three steps are repeated for the remaining genes. The diversity values of all genes are added into a total diversity.
Oddly enough, this value does not seem to vary as much as expected.
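The per-gene measure can be sketched as follows in Python. This is an interpretation of the steps above, not the project's code. It assumes three possible gene values, so that the theoretical minimum count of the most dominant gene is the population size divided by three (rounded up) and the maximum is the population size. The names `gene_similarity` and `total_diversity` are hypothetical.

```python
import math
from collections import Counter

ALLELES = 3  # assumption: three possible gene values


def gene_similarity(genes, alleles=ALLELES):
    """Map the count of the most dominant gene value to [0, 1]."""
    n = len(genes)
    dominant = Counter(genes).most_common(1)[0][1]
    lowest = math.ceil(n / alleles)  # most even split possible
    highest = n                      # every candidate has the same gene
    if highest == lowest:
        return 1.0
    return (dominant - lowest) / (highest - lowest)


def total_diversity(genotypes, alleles=ALLELES):
    """Sum the per-gene values over all gene positions."""
    # zip(*genotypes) yields one column per gene position.
    return sum(gene_similarity(col, alleles) for col in zip(*genotypes))
```

With this mapping, a population whose candidates all share the same genotype reaches the maximum value, which equals the number of genes.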
- The fitness values of the most recent generations are stored in a queue.
- The fitness of the past generations gets averaged.
- The deviation from the average (mean) fitness gets calculated for each candidate.
- The deviations of all candidates are added together into a total variance. The higher the value, the higher the variance. The lower the value, the lower the variance.
- The total variance is inverted.
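A rough Python sketch of these steps is shown below. Several details are assumptions, because the text leaves them open: the window size, the use of absolute deviations, treating each stored fitness value as one entry to compare against the mean, and the inversion formula `1 / (1 + total)`. The class name `FitnessVarianceModifier` is hypothetical.

```python
from collections import deque


class FitnessVarianceModifier:
    """Turn the spread of recent fitness values into a multiplier."""

    def __init__(self, window=10):
        # A bounded deque drops the oldest value automatically.
        self.history = deque(maxlen=window)

    def update(self, generation_fitness):
        self.history.append(generation_fitness)
        mean = sum(self.history) / len(self.history)
        # Assumption: absolute deviations are summed into the total.
        total = sum(abs(f - mean) for f in self.history)
        # Assumption: inversion as 1 / (1 + total), which stays in (0, 1].
        return 1.0 / (1.0 + total)
```

Under these assumptions the multiplier is highest when fitness has stagnated over the last generations and drops while fitness is still changing.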
Simple sinusoidal mutation rate
Great care was put into optimizing the run time of this program.
Many `Protein`s with a potentially bad score can be avoided by not allowing amino acids to be placed on the previous coordinate. This is done by encoding proteins with the values `LEFT`, `RIGHT` and `FORWARD` (3 local directions) rather than 4 global directions.
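To illustrate why local directions rule out stepping back onto the previous coordinate, here is a small Python sketch that decodes a folding sequence into grid positions. It is not the project's decoder. The `Turn` enum, the `decode` function, the start coordinate and the initial heading are all assumptions made for this example.

```python
from enum import Enum


class Turn(Enum):
    LEFT = "L"
    FORWARD = "F"
    RIGHT = "R"


def decode(turns, start=(0, 0), heading=(1, 0)):
    """Return the grid positions of a chain folded with local turns."""
    x, y = start
    dx, dy = heading
    positions = [(x, y)]
    for turn in turns:
        if turn is Turn.LEFT:
            dx, dy = -dy, dx   # rotate heading 90 degrees counterclockwise
        elif turn is Turn.RIGHT:
            dx, dy = dy, -dx   # rotate heading 90 degrees clockwise
        x, y = x + dx, y + dy  # step one cell along the heading
        positions.append((x, y))
    return positions
```

Because there is no "backward" turn, the heading can never reverse in a single step, so a new amino acid cannot land on the cell two places back in the chain. Overlaps with older parts of the chain are still possible.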
`Aminoacid`s store their own position. The position is calculated before evaluating the fitness. This means the position does not need to be re-calculated when comparing `Aminoacid`s.
- One `Aminoacid` is always skipped: the `Aminoacid` currently looked at is never compared to its direct successor. This can be done without error, because with local directions a direct successor can never overlap its predecessor. This improves performance from O(n^2-n/2) to O(n^2-(n-1)). The evaluation remains quadratic, but fewer comparisons are made.
- Neighbors and overlaps are determined by calculating the Manhattan distance between `Aminoacid`s. If the distance is above 2, a couple of checks can be skipped, because the next `Aminoacid`s may be too far away for comparisons to matter. This check is placed first, since `Aminoacid`s are more likely to be far apart than next to each other or overlapping. If they are far apart, no further evaluation needs to happen.
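The two optimizations above can be sketched in Python as follows. This is one possible reading of the description, not the project's code. The function name `evaluate`, the return value (contacts and overlaps as a pair) and the exact skip rule are assumptions. The skip rule relies on the fact that each step along the chain changes the Manhattan distance by exactly 1, so after measuring a distance `d` greater than 1, the next `d - 2` amino acids cannot be neighbors either.

```python
def evaluate(hydrophobic, positions):
    """Count hydrophobic contacts and overlaps of a folded chain.

    hydrophobic: one bool per amino acid
    positions:   cached (x, y) grid coordinates per amino acid
    """
    contacts = 0
    overlaps = 0
    n = len(positions)
    for i in range(n - 2):
        xi, yi = positions[i]
        j = i + 2  # the direct successor is never compared
        while j < n:
            xj, yj = positions[j]
            distance = abs(xi - xj) + abs(yi - yj)
            if distance > 1:
                # Far apart: jump to the next index that could be adjacent.
                j += distance - 1
                continue
            if distance == 0:
                overlaps += 1
            elif hydrophobic[i] and hydrophobic[j]:
                contacts += 1
            j += 1
    return contacts, overlaps
```

For a stretched-out chain most comparisons are skipped by the distance jump, while a compact fold still requires close to the full number of pair checks.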
All classes were constructed to be as lightweight as possible. The less there is to copy, the faster everything performs.
The fitness evaluation of `Protein`s within one generation is handled concurrently by multiple threads.
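The general structure of such a concurrent evaluation can be sketched with a thread pool. This Python example only shows the pattern and is not the project's threading code. The function name `evaluate_generation` and the `workers` parameter are hypothetical. Note that in CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so the sketch illustrates the structure rather than the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_generation(population, fitness, workers=4):
    """Evaluate the fitness of every candidate using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in the same order as the population.
        return list(pool.map(fitness, population))
```

Since each candidate is evaluated independently and nothing is shared between the tasks, no locking is needed in this pattern.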
This part of the code needs improvement, since it's a bit messy.
Yeah, it should be readable.