
 

mapping sequences

Version 2 / April 26, 2012 / Benjamin Aaron Degenhart
can you find the mistake in Version 1? :)

 

The question this algorithm answers is: given that A1 is coupled with B1 and A2 is coupled with B2, what will B3 be if we know A3?
Or put differently: what is the result of sending the sequence A3 through a mapping process that has been informed by the coupling of A1 with B1 and the coupling of A2 with B2?

You can enter one, two or more couples. The number of source sequences has to equal the number of target sequences.

 

I thought about what an artificial creativity would need to be able to do. I came to think of creativity as recombining existing elements so that the combination becomes an element in itself. The exponential growth of possible combinations with an increasing number of elements requires better and better ways to choose which branch to run down. Human minds can use metaphors to test branches, by 'mimicking' the structural information from successful combinations in other domains. The metaphor establishes a mapping between a source- and a target-level. Changes in either one of them will then be interpreted in the other one. 'Artificial creativity' needs a way to pull in structural information from other domains to test possible combinations in a pool of elements more effectively - a digital metaphor? It has to go 'collect' graph-navigational data from targets in other domains. By making the mapping-process complex enough, the source can be mapped upon any branch in any target - of course it is of interest to first choose targets that are likely to seed potent structural information into the source.

This algorithm is restricted to monochronic sequences and requires that the data has already been processed in a way where the elements come from a pool that scopes the totality of its domain. Things like "pasta" and "noodles" have to be unified/matched.

In this simplified example with phrases there are two source-sequences from the category of <daily routines> and two target-sequences from the category of <making food>. The first coupling says "these 4 steps I go through every morning are like these 5 steps I go through for making pasta" and the second coupling says "these 4 steps I go through every evening are like these 5 steps I go through when eating a cheese sandwich" (for whatever reason I choose to compare these two levels). Then I go wild in my daily routine and find myself performing a new sequence of 4 steps. I wonder: what would this new chain of events look like if it were about making food? Most results will be complete nonsense or impossible to apply to actual food making - but maybe, maybe there's one that pokes me towards a new recipe, a combination that hasn't been tested before.

The result in that example doesn't really make sense when interpreted in the context of food, I admit. I will put up a better phrase-example when I think of one :)

 

 

 


How it works (shown with the prefilled example):

3 couples given:
DFB -> acre
FAD -> brncd
BF -> erfd

task:
AFD -> ?

Using the average of the length ratios in the given couples, the length of the task's target-sequence can be calculated by the rule of proportion:

((4/3) + (5/3) + (4/2)) /3 = x / 3

-> x = 5
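The length calculation above can be sketched in a few lines of Python (my own sketch; variable names are mine, not from the original implementation):

```python
from fractions import Fraction

# The couples and the new source sequence from the prefilled example.
couples = [("DFB", "acre"), ("FAD", "brncd"), ("BF", "erfd")]
new_source = "AFD"

# Average the target/source length ratios, then scale the new source length.
ratios = [Fraction(len(target), len(source)) for source, target in couples]
average = sum(ratios) / len(ratios)               # (4/3 + 5/3 + 4/2) / 3 = 5/3
target_length = round(average * len(new_source))  # 5/3 * 3 = 5
print(target_length)  # -> 5
```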

Find the LCM (least common multiple) of the sequence lengths in each coupling and repeat each element by the factor [couple-LCM / own sequence length] so that both sequences in the coupling have the same length.

LCM of 3 & 4 is 12:

D D D D F F F F B B B B
a a a c c c r r r e e e

LCM of 3 & 5 is 15:

F F F F F A A A A A D D D D D
b b b r r r n n n c c c d d d

LCM of 2 & 4 is 4:

B B F F
e r f d
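The stretching step above can be sketched like this (a Python sketch; function names are my own):

```python
from math import gcd

def stretch(seq, length):
    # Repeat each element length // len(seq) times; length is a multiple of len(seq).
    return [x for x in seq for _ in range(length // len(seq))]

def equalize(source, target):
    # Stretch both sequences of a coupling to the LCM of their lengths.
    l = len(source) * len(target) // gcd(len(source), len(target))
    return stretch(source, l), stretch(target, l)

src, tgt = equalize("DFB", "acre")
print("".join(src))  # -> DDDDFFFFBBBB
print("".join(tgt))  # -> aaacccrrreee
```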

Find out whether there are elements in the new source sequence that are not present in any given source sequence (not the case in this example). If so (e.g. "X"), add ",X" to the source input and ",#" to the target input; code-internally it will be treated as another coupling, and we know it's a placeholder when "#" appears in the new target result.

Extract the corresponding target-elements for each element in the new source sequence AFD and rewrite them in encoded form:

A: r n n n c [1:r,3:n,1:c]

F: c c r r / b b b r r / f d [2:c,2:r/3:b,2:r/1:f,1:d]

D: a a a c / c c d d d [3:a,1:c/2:c,3:d]

Concatenate encoded target-string:

AFD -> 1:r,3:n,1:c + 2:c,2:r/3:b,2:r/1:f,1:d + 3:a,1:c/2:c,3:d
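The read-in and run-length encoding of the target-elements could look like this (a sketch of mine, using the stretched couplings from above):

```python
from itertools import groupby

# The stretched couplings from the example above (source line, target line).
couplings = [
    ("DDDDFFFFBBBB",    "aaacccrrreee"),
    ("FFFFFAAAAADDDDD", "bbbrrrnnncccddd"),
    ("BBFF",            "erfd"),
]

def packages_for(element):
    # Collect the target characters sitting under `element` in each coupling
    # and run-length-encode them into (count, char) bundles.
    packages = []
    for src, tgt in couplings:
        snippet = [t for s, t in zip(src, tgt) if s == element]
        if snippet:
            packages.append([(len(list(run)), ch) for ch, run in groupby(snippet)])
    return packages

print(packages_for("A"))  # -> [[(1, 'r'), (3, 'n'), (1, 'c')]]
print(packages_for("F"))  # -> [[(2, 'c'), (2, 'r')], [(3, 'b'), (2, 'r')], [(1, 'f'), (1, 'd')]]
```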

 

If we stopped at this point, the target-sequence for the task would be rnccrbrfdaccd (one character per bundle) - clearly too long compared to the target-length of 5 that was calculated above. So it needs to be compressed! That brings up issues; single characters can't be split any further - so which is the best selection that is allowed to survive the compression?
It needs to be ensured that every character from the target-element-pool gets the correct chance to appear in the final target-sequence after the compression.

Hence each target-character-bundle ("1:r" is the first one, "3:n" the second one, ..., "3:d" the last one) is now going to be multiplied by specific factors that ensure the correct probability of its compression-survival.

First we need to compensate for the fact that the packages (= all bundles from one read-in; "1:r,3:n,1:c" is the first one, "2:c,2:r" the second one, ..., "2:c,3:d" the last one) were read in from couples that were stretched with different LCMs. Determining the LCM with which they'll be equalized requires looking at the sums of the characters each package contains:

5 + 4/5/2 + 4/5

The three different package-sums of 2, 4 and 5 give an LCM of 20. Enforce that on each package in the target-string (divide the LCM by the package-sum and multiply each bundle by the resulting factor):

(20/5) * (1:r,3:n,1:c) + (20/4) * (2:c,2:r) / (20/5) * (3:b,2:r) / (20/2) * (1:f,1:d) + (20/4) * (3:a,1:c) / (20/5) * (2:c,3:d)

-> 4:r,12:n,4:c + 10:c,10:r/12:b,8:r/10:f,10:d + 15:a,5:c/8:c,12:d

Now we need to compensate for the disproportionate weight that "multibooked" source-elements bring in (the F, for instance, has read in packages from 3 different targets) - look at all multibook-numbers:

1 + 3 + 2

The LCM of these is 6. Treat the target-string with that; non-multibooked packages receive the full LCM as multiplier, whereas the multibooked ones receive LCM / "multibookedness":

(6/1) * (4:r,12:n,4:c) + (6/3) * (10:c,10:r/12:b,8:r/10:f,10:d) + (6/2) * (15:a,5:c/8:c,12:d)

-> 24:r,72:n,24:c + 20:c,20:r/24:b,16:r/20:f,20:d + 45:a,15:c/24:c,36:d
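Both compensation steps can be sketched together (my own sketch; needs Python 3.9+ for math.lcm, and the nested-list layout is my choice, not the original code's):

```python
from math import lcm  # Python 3.9+

# One list of packages per source element (A, F, D); a package is a
# list of (count, char) bundles as read in above.
encoded = [
    [[(1, "r"), (3, "n"), (1, "c")]],
    [[(2, "c"), (2, "r")], [(3, "b"), (2, "r")], [(1, "f"), (1, "d")]],
    [[(3, "a"), (1, "c")], [(2, "c"), (3, "d")]],
]

def scale(pkg, factor):
    return [(n * factor, ch) for n, ch in pkg]

# Step 1: equalize package-sums -> multiply each package by LCM(sums) / own sum.
sums = [sum(n for n, _ in pkg) for element in encoded for pkg in element]
sum_lcm = lcm(*sums)  # lcm(5, 4, 5, 2, 4, 5) = 20
encoded = [[scale(pkg, sum_lcm // sum(n for n, _ in pkg)) for pkg in element]
           for element in encoded]

# Step 2: compensate multibooking -> multiply by LCM(multibook) / own multibook.
multi_lcm = lcm(*(len(element) for element in encoded))  # lcm(1, 3, 2) = 6
encoded = [[scale(pkg, multi_lcm // len(element)) for pkg in element]
           for element in encoded]

print(encoded[0])  # -> [[(24, 'r'), (72, 'n'), (24, 'c')]]  (12 * 6 = 72)
```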

Now each character has been given the correct weight and one could go ahead and run the compression. However, since equal treatment has been ensured, we can now look at nothing but the bundles. What becomes obvious is that in this case two pairs of neighbouring c-bundles would have to be merged before the compression takes place - which significantly increases the chances of c to survive the compression:

24:r 72:n 24:c 20:c 20:r 24:b 16:r 20:f 20:d 45:a 15:c 24:c 36:d

-> 24:r 72:n 44:c 20:r 24:b 16:r 20:f 20:d 45:a 39:c 36:d
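Merging neighbouring bundles of the same character is straightforward (sketch of mine):

```python
def merge_adjacent(bundles):
    # Merge neighbouring (count, char) bundles carrying the same character.
    merged = []
    for count, ch in bundles:
        if merged and merged[-1][1] == ch:
            merged[-1] = (merged[-1][0] + count, ch)
        else:
            merged.append((count, ch))
    return merged

bundles = [(24, "r"), (72, "n"), (24, "c"), (20, "c"), (20, "r"), (24, "b"),
           (16, "r"), (20, "f"), (20, "d"), (45, "a"), (15, "c"), (24, "c"),
           (36, "d")]
print(merge_adjacent(bundles))
# -> [(24, 'r'), (72, 'n'), (44, 'c'), (20, 'r'), (24, 'b'), (16, 'r'),
#     (20, 'f'), (20, 'd'), (45, 'a'), (39, 'c'), (36, 'd')]
```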

As the merging depends on which bundles lie next to each other, it becomes obvious that the constellation of the packages in the multibooked source-elements matters fundamentally. The overall arrangement of the packages is also crucial, because the order before the compression is preserved after the compression, just with fewer elements. So there is no way around running through all possible arrangements (= permutations, n!) of the packages of the multibooked source-elements.

To do that I reduce the packages in the uncompressed target-string to single characters; a code-internal mapping:

(A) 24:r,72:n,24:c
(B) 20:c,20:r (C) 24:b,16:r (D) 20:f,20:d
(E) 45:a,15:c (F) 24:c,36:d

-> A+BCD+EF is what the target-string looks like now, doubly encoded :)

Build a matrix with the packages horizontally and all their possible arrangements (my code to create all permutations is interesting (or funny) too, but won't be explained here) vertically:

1!  3!   2!
A   DCB  FE
    CDB  EF
    CBD
    DBC
    BDC
    BCD

To build a tree from that matrix i run through all 1! * 3! * 2! = 12 paths between the first [1 1 1] (=A+DCB+FE) and the last [1 6 2] (=A+BCD+EF) and collect the stored elements on the way (my code for that is interesting (or funny) too but won't be explained here).

1 1 1 1 A DCB FE
2 1 2 1 A CDB FE
3 1 3 1 A CBD FE
4 1 4 1 A DBC FE
5 1 5 1 A BDC FE
6 1 6 1 A BCD FE
7 1 1 2 A DCB EF
8 1 2 2 A CDB EF
9 1 3 2 A CBD EF
10 1 4 2 A DBC EF
11 1 5 2 A BDC EF
12 1 6 2 A BCD EF
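The 12 paths can also be enumerated with itertools instead of a hand-built tree (my sketch; the enumeration order differs from the listing above, but the set of paths is the same):

```python
from itertools import permutations, product

# Package labels per source element; only the multibooked groups permute.
groups = [["A"], ["B", "C", "D"], ["E", "F"]]

paths = ["".join("".join(p) for p in combo)
         for combo in product(*(permutations(g) for g in groups))]
print(len(paths))  # -> 12  (1! * 3! * 2!)
print(paths[0])    # -> ABCDEF
```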

Now decode these doubly-encoded strings back into the encoded target-string, keep only the bundles and merge those that lie next to each other (important!). At this point in the previous version of this algorithm I ran, in each branch, through ALL possible ways a subgroup of 5 can be formed. With n being 12 and k being 5 that gives [n! / ((n-k)! * k!)] = 792 subgroups. Then I'd check the bundle-sum in each and give out the highest one(s). Here's an implementation that achieves the same but is less processing-costly: sort the bundle-numbers and look at the top 5 (that's the target-length after compression), then choose the branches with the highest top-5-bundle-number-sums:

For each branch: its bundles, then the 5 top bundle-numbers, then their sum:
1
24:r 72:n 24:c 20:f 20:d 24:b 16:r 20:c 20:r 24:c 36:d 45:a 15:c
72 45 36 24 24
201
2
24:r 72:n 24:c 24:b 16:r 20:f 20:d 20:c 20:r 24:c 36:d 45:a 15:c
72 45 36 24 24
201
3
24:r 72:n 24:c 24:b 16:r 20:c 20:r 20:f 20:d 24:c 36:d 45:a 15:c
72 45 36 24 24
201
4
24:r 72:n 24:c 20:f 20:d 20:c 20:r 24:b 16:r 24:c 36:d 45:a 15:c
72 45 36 24 24
201
5
24:r 72:n 44:c 20:r 20:f 20:d 24:b 16:r 24:c 36:d 45:a 15:c
72 45 44 36 24
221
6
24:r 72:n 44:c 20:r 24:b 16:r 20:f 20:d 24:c 36:d 45:a 15:c
72 45 44 36 24
221
7
24:r 72:n 24:c 20:f 20:d 24:b 16:r 20:c 20:r 45:a 39:c 36:d
72 45 39 36 24
216
8
24:r 72:n 24:c 24:b 16:r 20:f 20:d 20:c 20:r 45:a 39:c 36:d
72 45 39 36 24
216
9
24:r 72:n 24:c 24:b 16:r 20:c 20:r 20:f 20:d 45:a 39:c 36:d
72 45 39 36 24
216
10
24:r 72:n 24:c 20:f 20:d 20:c 20:r 24:b 16:r 45:a 39:c 36:d
72 45 39 36 24
216
11
24:r 72:n 44:c 20:r 20:f 20:d 24:b 16:r 45:a 39:c 36:d
72 45 44 39 36
236
12
24:r 72:n 44:c 20:r 24:b 16:r 20:f 20:d 45:a 39:c 36:d
72 45 44 39 36
236

In this case the fifth top-bundle-number <36> occurs only once, hence the compression simply means cutting out everything but the top-bundles. Both winning branches give the same result after compression: 72:n 44:c 45:a 39:c 36:d, which gives ncacd as the final result. This solution says: sending the sequence <AFD> through a mapping process that has been informed by the coupling of <DFB> with <acre>, the coupling of <FAD> with <brncd> and the coupling of <BF> with <erfd> gives the resulting sequence <ncacd>.
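The compression of a single branch can be sketched as follows (my sketch; it assumes the cut-off bundle-number is unique, as it is in this branch):

```python
def compress(bundles, target_length):
    # Keep the `target_length` largest bundles, preserving their order.
    # Assumes the cut-off value is not tied; ties need extra branching.
    cutoff = sorted((n for n, _ in bundles), reverse=True)[target_length - 1]
    return "".join(ch for n, ch in bundles if n >= cutoff)

# Branch 12 after decoding and merging: A BCD EF.
branch = [(24, "r"), (72, "n"), (44, "c"), (20, "r"), (24, "b"), (16, "r"),
          (20, "f"), (20, "d"), (45, "a"), (39, "c"), (36, "d")]
print(compress(branch, 5))  # -> ncacd
```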

- - - - - -

It can also happen that the fifth top-bundle-number occurs more than once; to show that possibility let's pretend only the first branch [24:r 72:n 24:c 20:f 20:d 24:b 16:r 20:c 20:r 24:c 36:d 45:a 15:c] had the highest sum.
If that happens, all equally valid compressions have to be carried out. As many results as there are different subgroups of 2 in 4 (= 6) have to be given out (the code I made for that is interesting but won't be explained here; it can be understood from reading the log, though, with different inputs than the prefilled ones):

  24:r 72:n 24:c 20:f 20:d 24:b 16:r 20:c 20:r 24:c 36:d 45:a 15:c  
1 24:r 72:n 24:c     36:d 45:a > r n c d a
2 24:r 72:n   24:b   36:d 45:a > r n b d a
3 24:r 72:n     24:c 36:d 45:a > r n c d a
4   72:n 24:c 24:b   36:d 45:a > n c b d a
5   72:n 24:c   24:c 36:d 45:a > n c c d a
6   72:n   24:b 24:c 36:d 45:a > n b c d a
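The tie-handling can be sketched with itertools.combinations (my sketch; it reproduces the 6 rows above, duplicates included):

```python
from itertools import combinations

def compress_all(bundles, target_length):
    # All equally valid compressions when the cut-off bundle-number is tied:
    # keep the bundles strictly above the cut-off, then try every subgroup
    # of the tied bundles that fills the remaining slots.
    cutoff = sorted((n for n, _ in bundles), reverse=True)[target_length - 1]
    sure = [i for i, (n, _) in enumerate(bundles) if n > cutoff]
    tied = [i for i, (n, _) in enumerate(bundles) if n == cutoff]
    return ["".join(bundles[i][1] for i in sorted(sure + list(pick)))
            for pick in combinations(tied, target_length - len(sure))]

branch = [(24, "r"), (72, "n"), (24, "c"), (20, "f"), (20, "d"), (24, "b"),
          (16, "r"), (20, "c"), (20, "r"), (24, "c"), (36, "d"), (45, "a"),
          (15, "c")]
print(compress_all(branch, 5))
# -> ['rncda', 'rnbda', 'rncda', 'ncbda', 'nccda', 'nbcda']
```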

 

 

Improvements / alternative approaches:

In the implementation above it could be argued that multibooked elements deserve more space, so the compensation for multibookedness could be dropped. Also, in the compression-process, solutions that contain target-character-duplicates could be ranked lower or suppressed entirely. I think the permutation-process over the multibooked arrangements could be avoided if mergers were found and favoured before all possible arrangements have to be checked. In this case for instance...
24:r,72:n,24:c + 20:c,20:r/24:b,16:r/20:f,20:d + 45:a,15:c/24:c,36:d
...you can detect before the permutation that only two arrangements (out of the 12 possible ones) give the opportunity to merge characters at both spots and will therefore be ranked higher. As you can see in the walkthrough above, the merged 44:c indeed won the race.

I am sure there are different and probably more effective (less line/loop-costly) approaches to a sequence-mapping algorithm.

Scanning first for meta-patterns (like reverse or mirror) can surely save lots of processing power. In a situation like [given: ABC -> defg, task: CBA -> ?], it's clearly quicker to recognize that the source-sequence was reversed and therefore simply reverse the target as well.
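A reverse-check like the one described could look like this (sketch of mine):

```python
def try_reverse(couples, new_source):
    # If the new source is the reverse of a known source, reverse its target.
    for source, target in couples:
        if new_source == source[::-1]:
            return target[::-1]
    return None  # no meta-pattern found; fall back to the full algorithm

print(try_reverse([("ABC", "defg")], "CBA"))  # -> gfed
```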

Also, I can think of an approach that looks at the relative position of each element within the sequence as a whole. So in the situation [given: ABCD -> efghij, task: BAC -> ?], the 'movement' that happened on source-level from given to task could be encoded as [1/4 -> 2/3, 2/4 -> 1/3, 3/4 -> 3/3, 4/4 -> _ ] (A moved from 1st among 4 to 2nd among 3, B moved from ...) and mapped onto the movement that needs to happen accordingly on target-level.
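Encoding those 'movements' could start like this (a sketch; I represent each position as an (index, length) pair and use None for dropped elements):

```python
def movements(given, task):
    # Relative position of each element of `given` in `task` (None = dropped).
    moves = {}
    for i, element in enumerate(given, 1):
        position = (task.index(element) + 1, len(task)) if element in task else None
        moves[element] = ((i, len(given)), position)
    return moves

print(movements("ABCD", "BAC"))
# -> {'A': ((1, 4), (2, 3)), 'B': ((2, 4), (1, 3)),
#     'C': ((3, 4), (3, 3)), 'D': ((4, 4), None)}
```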

Distances between all subgroups could be another approach - probably the most costly, but also the most precise?