Motivation

The production of most commercial DNA probe arrays is based on a parallel chemical synthesis usually driven by a set of photolithographic masks. Due to the natural properties of light and the ever shrinking feature sizes, the arrangement of the probes on the chip (probe location or placement) and the order in which their nucleotides are synthesized (probe embedding) play an important role on the quality of the final product.

Microarray chips can contain more than one million spots, with each spot accommodating several million copies of a probe. Probes are typically 25 nucleotides long and are synthesized in parallel, on the chip, in a series of repetitive steps. Each step appends the same nucleotide to probes of selected regions of the chip. Selection occurs by exposure to light with the help of a photolithographic mask that allows or obstructs the passage of light accordingly. The sequence of nucleotides added at each masking step is called the nucleotide deposition sequence, and it is clearly a supersequence of all probe sequences.

In general, a probe can be embedded within the deposition sequence in several ways. It is convenient to think of an embedding as a binary string where a zero marks a masked step and a 1 signals an unmasked step in the production of probe. An unmasked step is also productive while a masked step is called unproductive.

Layout Evaluation

Due to diffraction of light or internal reflection, untargeted spots can sometimes be accidentally activated in a certain step, producing unpredicted probes. The problem is more likely to occur near the borders between masked and unmasked spots of a certain step. Hannenhalli et al. formulated the Border Length Minimization Problem, which aims at finding an arrangement of the probes (together with their embeddings) that minimizes the number of border conflicts during mask exposure steps.

Kahng et al. noted that the definition of border length did not take into account two simple yet important practical considerations: a) stray light might activate not only immediate neighbors but also probes that lie as far as three cells away from the targeted spot; and b) imperfections produced in the middle of a probe are more harmful than in its extremities. With these observations in mind, we define the conflict index of a spot using a distance-dependent weighting function that accounts for observation a) and position-dependent weights to account for observation b).

Layout Optimization

We can distinguish between two main approaches to the problem of microarray layout design: placement algorithms and post-placement optimizations. The first concerns algorithms that, given a set of probes, a deposition sequence and layout constraints (chip dimensions), finds a placement of the probes minimizing the sum of conflicts (either border length or conflict indices). The second approach takes an initial solution (a placement together with probe embeddings) and tries to reduce the conflicts by re-embedding the probes without changing their placement. In the following sections we will address some of the existing alternatives as well as some ideas for future research.

Placement Algorithms

Several placing algorithms have been suggested but only a few have been found to be useful for the latest chip dimensions (up to 1164×1164 spots and over a million probes). In fact, the problem size and its inherent properties naturally suggest the use of partitioning strategies to minimize running time, especially for those computing-intensive algorithms.

A common feature of partitioning strategies is that the chip is recursively divided into smaller regions. At the same time, the set of probes is divided into smaller sub-sets which are assigned to one of the final regions. In the end, these algorithms they rely on another placement algorithms to actually place the sub-set of probes on their assigned regions.

In the following sub-sections we describe the only partitioning strategy published so far and two novel partitioning strategies.

Centroid-based Quadrisection

The Centroid-based Quadrisection algorithm is a simple recursive procedure proposed by Kahng et al. It starts by randomly selecting a probe, p1, from the probe set. Then it examines all other probes and finds probe p2 with maximum distance to p1. Similarly it finds p3 maximizing the sum of distances to p1 and p2, and p4 maximizing the sum of distances to p1, p2 and p3. These are called centroids.

Then, all other remaining probes are compared to the centroids and assigned to the one that minimizes the Hamming distance between them (which can be computed in linear time). Then the chip is partitioned into four quadrants, each being assigned to a set of probes. Each sub-region can then be recursively partitioned until a maximum partitioning depth is reached.

Two-dimensional Gray-Code-based Partitioning

Pivotal-driven Partitioning

Post-placement Optimization

We now turn to the post-placement optimization problem, which aims at minimizing conflicts by re-embedding probes without changing their location on the chip. A common feature of such strategies is the use of an Optimum Single-Probe Embedding (OSPE) algorithm proposed by Kahng et al. This algorithm is able to compute an optimum embedding of a probe in regards to its neighbors, whose embeddings are considered fixed. The algorithm is based on a simple dynamic programming recursion that takes quadratic time.

Note that the OSPE algorithm can never increase the amount of conflicts and thus, all optimization algorithms can be repeated several times until a local optimal solution is found (or until the improvements drop below a given threshold).

In the next sub-sections we describe three post-placement algorithms described by Kahng et al. and propose a new heuristic.

Greedy and Batched Greedy Optimization

The Greedy algorithm computes, for each spot, the maximum reduction of conflicts that could be achieved by re-embedding its probe with the OSPE algorithm. Then it greedily selects the spot with higher gain and re-embeds its probe optimally in regards to its neighbors, updating the gains of affected spots. The Batched Greedy version of the algorithm sacrifices its greedy nature by postponing the update of gains and re-embedding all unaffected probes of an iteration.

Chessboard Optimization

The Chessboard optimization is based on a simple heuristic. From the point of view of border length, a chip can be bi-colored just like a chessboard, in such a way that the embeddings of probes located on white spots are independent of those placed on black spots, and vice-versa. The idea is to use this coloring to alternate the optimal re-embedding of probes located on black and white spots.

Sequential Optimization

The Sequential optimization is a strikingly simple yet effective post-placement optimization. This algorithm also re-embeds probes with the OSPE but instead of trying to find a particular order, it just proceeds spot by spot from top to bottom, left to right. Surprisingly, it outperforms all other know post-placement algorithms.

Priority-driven Optimization

Microarray Layout and the Quadratic Assignment Problem

We now turn to a different approach to the array layout problem, namely, to model the placement problem as a quadratic assignment (QAP). The quadratic assignment is combinatorial problem that seeks to assign a set of objects to a set of places with a minimum cost. The cost is defined by the product of two functions: the distance between the places and the flow among the objects.

If we think of the spots as places and the probe as objects to be assigned to these places, we can see the layout problem as an instance of the QAP. The goal function is then defined as the distance between the spots multiplied by distance between the probes (the flow in the QAP formulation). This formulation conveniently resembles the definition of conflict index.

The only problem is that QAP is notoriously hard. Problems of size greater than 15 remain nearly intractable. However, heuristic algorithms can find sub-optimal solutions in a reasonable time. In particular an algorithm called GRASP due to Resende et al. is able to compute good solutions for problem of sizes of up to a few hundred.

We have experimentally used GRASP to improve the layout of small regions of the chip with excellent results. We believe that GRASP can be better used to find a good placement of a small region produced, for example, by a partitioning algorithm.

However, we also believe that GRASP can be used as an optimization algorithm if some changes to the implementation are introduced. The idea is that we can examine an already placed region of the chip and, in most of the times, produce a better placement of that region with GRASP. However, as it is used so far, we also observe an increase in conflicts in the border between the optimized region and the probes that lie immediately close to that region. Clearly this results from the fact that, currently, we cannot pass to GRASP the part of the cost of placing a probe in a spots that is due to the probes outside the considered region.

If a modification can be done in the main GRASP routine so that it also takes into account the cost associated with probes outside the region under consideration, we believe that we can substantially reduce conflicts in a microarray chip. We also think that such an approach could be used in a sliding-window manner to optimize the full set of probes of a chip. Moreover, with an overlapping sliding-window, we expect to move probes around and further increase the chances of reducing conflicts.