Shared Clustering takes your long, undifferentiated list of DNA matches and divides it into smaller clusters — groups of matches who are likely related to each other. Knowing which matches belong together is often a huge help in working out how any one of them is related to you.
The clustering engine is based on the open-source Shared Clustering algorithm created by Jonathan Brecher. DNAGedcom integrates that algorithm directly into the Client, with several additions of its own:
Shared Clustering operates on data already in your database. For clustering directly from exported CSV files, use the Collins-Leeds Method instead.
This page covers both how to run Shared Clustering in the Client and how to interpret the results. The interpretation material draws heavily on Jonathan Brecher's excellent Shared Clustering wiki, which is well worth reading in full.
By definition, a cluster is a group of DNA matches, most of whom match one another. That is the only part that is always true. Everything beyond that is interpretation, and the interpretation depends on which matches are in the cluster.
In a perfect cluster, every member appears in the shared-match list of every other member. Real clusters are rarely perfect: biology is messy, and the testing companies only report so much. The idea behind clustering is similarity, not perfection.
Most of the time, a cluster represents a group of people who share some DNA segment among them. That makes sense — clusters are built from shared-match lists, and those lists come from the underlying DNA data. But whether a cluster points to a common ancestor or just a shared segment depends on how close the matches are.
This distinction is the single most important idea for interpreting your results:
Clustering does not answer genealogical questions on its own. What it does is turn an impossible amount of data into something a human can actually work with. Many people have tens of thousands of matches; nobody can review those by hand. A cluster of a few dozen matches is something anyone can look at.
The point is not a pretty picture — it is focused leads. You can't predict where the next breakthrough comes from. The last member of a cluster might be the one who inherited a family bible from the 1700s. Several members might have public trees with the same unusual surname, or the same surname spelled three different ways — obvious only when you see them side by side. Clustering finds the links that are worth investigating; proving them is the research that follows.
Shared Clustering — like all shared-match clustering — works by looking for similarities among the shared-match lists of your matches. If Alice and Bobby both have Nancy, Oscar, and Patty on their shared-match lists, and Charlie has a completely different set, the algorithm puts Alice and Bobby together and Charlie somewhere else.
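The Alice, Bobby, and Charlie example can be made concrete with a small sketch. It scores the overlap between two shared-match lists with the Jaccard index, which is an assumption chosen for illustration and not necessarily the measure the algorithm uses. Charlie's shared matches (Quinn and Rita) are invented for the example.

```python
# Hypothetical shared-match lists, using the names from the example above.
shared_matches = {
    "Alice": {"Nancy", "Oscar", "Patty"},
    "Bobby": {"Nancy", "Oscar", "Patty"},
    "Charlie": {"Quinn", "Rita"},  # invented names: a completely different set
}


def jaccard(a, b):
    """Overlap of two shared-match lists: 0.0 = nothing in common, 1.0 = identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


alice_bobby = jaccard(shared_matches["Alice"], shared_matches["Bobby"])
alice_charlie = jaccard(shared_matches["Alice"], shared_matches["Charlie"])
print(alice_bobby, alice_charlie)
```

Alice and Bobby score as identical and Alice and Charlie share nothing, which is why the first two end up together and Charlie ends up elsewhere.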
It uses only three pieces of information, none of it private or identifiable: the ID of each match, the shared centimorgans of each match, and each match's list of shared-match IDs. The process runs in three stages:
An important point: clustering is a side-effect of similarity. The algorithm never decides "this is a cluster" directly — it just keeps merging the most similar matches, and clusters fall out naturally. That is why it produces larger, more useful clusters than network-graph ("clique-finding") approaches, and why it can still find useful structure in the presence of endogamy without having to exclude your closest, most informative matches.
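To illustrate the "keep merging the most similar matches" idea, here is a minimal sketch of a greedy merge. It is not Brecher's implementation: the Jaccard score, the rule for combining groups (union of shared-match lists), and every name in it are assumptions made for the example. It only shows how an ordering in which similar matches sit side by side can fall out of repeated merging.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two sets of shared-match IDs (illustrative similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_order(shared_matches):
    """Repeatedly merge the two most similar groups; return the final ordering."""
    # Each group is (ordered member names, union of the members' shared-match lists).
    groups = [
        ([name], set(icw))
        for name, icw in sorted(shared_matches.items(), key=lambda kv: kv[0])
    ]
    while len(groups) > 1:
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda pair: jaccard(groups[pair[0]][1], groups[pair[1]][1]),
        )
        merged = (groups[i][0] + groups[j][0], groups[i][1] | groups[j][1])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0][0]


# Hypothetical data: two pairs of matches with identical shared-match lists.
example = {
    "Alice": {"Nancy", "Oscar", "Patty"},
    "Abby": {"Quinn", "Rita"},
    "Bobby": {"Nancy", "Oscar", "Patty"},
    "Charlie": {"Quinn", "Rita"},
}
order = merge_order(example)
print(order)
```

In the resulting order Alice sits next to Bobby and Abby next to Charlie. No step ever declared either pair a cluster; the grouping is a by-product of the merge order.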
DNAGedcom's heat map shows the correlation between each pair of matches as the maximum of the two directional scores, so the rendered map is symmetric — the cell at row A / column B looks the same as the cell at row B / column A.
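A short sketch of that symmetrization, assuming the directional scores are held in a square list of lists (the values below are invented):

```python
def symmetrize(directional):
    """Return a matrix whose [a][b] cell is max(directional[a][b], directional[b][a])."""
    n = len(directional)
    return [
        [max(directional[a][b], directional[b][a]) for b in range(n)]
        for a in range(n)
    ]


# Hypothetical directional scores: row A / column B differs from row B / column A.
directional = [
    [1.0, 0.8, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.0, 1.0],
]
heat = symmetrize(directional)
print(heat)
```

After this step the cell for any pair is the same whichever match is on the row, so the map reads identically above and below the diagonal.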
Before running Shared Clustering, gather data for the kit using the Gather tools in the Client.
| Data | Required? | Purpose |
|---|---|---|
| Matches | Required | The list of DNA matches to cluster. |
| ICW (In Common With) | Required | The shared-match relationships that drive the entire algorithm. A match with no ICW data cannot be clustered. |
Shared Clustering uses no chromosome-segment data and no tree or ancestor data for the clustering itself. (Ancestor data, when present, is summarized in the output's Shared Ancestors table, but it does not affect which matches land in which cluster.)
| Setting | Default | Purpose |
|---|---|---|
| Kit Filter | — | Type-ahead filter to narrow the DNA Kit dropdown. |
| DNA Kit | — | The kit to cluster. |
| cM Range (From / To) | 20 – 400 | Includes matches whose total shared cM falls in this range. |
| Show Heat Map | On | Include the heat map in the HTML output. |
| Open HTML When Done | On | Open the HTML output automatically when clustering finishes. |
| Max Cluster Size | 0 (disabled) | Dissolve any cluster larger than this. Must be at least 3 if enabled. |
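| Min cM to Cluster | 0 (disabled) | Enables a second, extension phase: matches between this value and the cM Range minimum are attached to the clusters found in the first phase. Must be less than the cM Range minimum if enabled. |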
| Min Cluster Size | 3 | Dissolve any cluster with fewer than this many members. |
| Max Heatmap Size | 1000 | Skip the heat map if the match count exceeds this. 0 = unlimited. |
| Save Excel | Off | Also generate an Excel workbook alongside the HTML. |
| Open Excel When Done | Off | Open the Excel workbook automatically when clustering finishes. |
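The two constraints in the table (Max Cluster Size at least 3 when enabled, Min cM to Cluster below the cM Range minimum when enabled) can be written down as a quick check. This is a sketch only: the `ClusterSettings` class and its field names are hypothetical and are not part of the Client. Only the default values and the two rules come from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSettings:
    """Hypothetical container; field names are illustrative, defaults follow the table."""

    cm_from: float = 20.0
    cm_to: float = 400.0
    max_cluster_size: int = 0        # 0 = disabled
    min_cm_to_cluster: float = 0.0   # 0 = disabled
    min_cluster_size: int = 3
    max_heatmap_size: int = 1000     # 0 = unlimited

    def problems(self) -> List[str]:
        """Return one message per documented rule that these settings break."""
        found = []
        if self.max_cluster_size != 0 and self.max_cluster_size < 3:
            found.append("Max Cluster Size must be at least 3 if enabled.")
        if self.min_cm_to_cluster != 0 and self.min_cm_to_cluster >= self.cm_from:
            found.append("Min cM to Cluster must be less than the cM Range minimum.")
        return found


print(ClusterSettings().problems())
print(ClusterSettings(max_cluster_size=2, min_cm_to_cluster=20).problems())
```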
The cM Range defaults to 20 to 400 cM.
The clustering algorithm loves data — the more matches you feed it, the better the clusters it can find. Brecher's guidance is emphatic on this point: do not reflexively narrow the cM range. A wide range gives the algorithm the most to work with.
A good habit: run a wide range first to see the full picture, then re-run focused on a subset if a particular region needs a closer look.
Show Heat Map defaults to On. The heat map is the visual correlation matrix, the part most people think of as "the cluster diagram." Turn it off for faster output on very large kits, or when you only need the cluster table.
Max Cluster Size defaults to 0 (disabled). When set, any cluster larger than this value is dissolved into singletons. Must be at least 3 if enabled. This is primarily an endogamy aid: in endogamous populations a single runaway cluster can swallow a huge share of the matches, and dissolving it lets the smaller, more meaningful structure show through.
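Min cM to Cluster defaults to 0 (disabled). When enabled, clustering runs in two phases: matches inside the cM Range are clustered first, and matches between this value and the cM Range minimum are then attached to the clusters that already exist. The value must be lower than the cM Range minimum.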
Example: a cM Range of 20–400 with Min cM to Cluster set to 10 clusters the 20+ cM matches first, then attaches the 10–20 cM matches to existing clusters. Because the testing companies don't report shared matches below 20 cM, those weaker matches can't form clean square blocks — they get added above and below the clusters they belong to instead. The payoff is large: a full extension pass often roughly triples the number of matches assigned to clusters, surfacing matches you have almost certainly never looked at.
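The attach step can be sketched as a simple vote: each weaker match goes to the cluster that holds the most of its shared matches. This is an illustrative assumption, not the Client's actual rule, and all names and data below are hypothetical. The sketch also assigns a single cluster, whereas the real output can show a weaker match against more than one.

```python
from collections import Counter


def attach_weak_matches(cluster_of, weak_shared):
    """Assign each weak match to the cluster holding most of its shared matches.

    cluster_of:  match name -> cluster number from the first phase
    weak_shared: weak match name -> set of its shared-match names
    Returns weak match name -> cluster number, or None if nothing overlaps.
    """
    assigned = {}
    for name, icw in weak_shared.items():
        votes = Counter(cluster_of[m] for m in icw if m in cluster_of)
        assigned[name] = votes.most_common(1)[0][0] if votes else None
    return assigned


# Phase one result (20+ cM matches) and hypothetical 10-20 cM matches.
cluster_of = {"Nancy": 1, "Oscar": 1, "Patty": 1, "Quinn": 2, "Rita": 2}
weak_shared = {
    "Walt": {"Nancy", "Oscar"},  # both shared matches are in cluster 1
    "Xena": {"Rita"},            # only shared match is in cluster 2
    "Yuri": set(),               # no shared matches, so it stays unassigned
}
attached = attach_weak_matches(cluster_of, weak_shared)
print(attached)
```

Note that the weak matches never vote for each other. Their shared-match lists contain only first-phase matches, which is why they attach to clusters rather than forming blocks of their own.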
Min Cluster Size defaults to 3. Clusters with fewer members are dissolved. Raise it to suppress small, low-confidence clusters, but not too aggressively: small clusters can sometimes be the interesting surprises (see Caveats).
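The two size limits (Min Cluster Size and Max Cluster Size) amount to a filter over the finished clusters. The sketch below assumes that members of a dissolved cluster simply become unclustered. The function name and data are hypothetical.

```python
def apply_size_limits(clusters, min_size=3, max_size=0):
    """Dissolve clusters outside the size limits; max_size=0 means no upper limit.

    Returns (kept clusters, names released from dissolved clusters).
    """
    kept, unclustered = [], []
    for members in clusters:
        too_small = len(members) < min_size
        too_big = max_size > 0 and len(members) > max_size
        if too_small or too_big:
            unclustered.extend(members)
        else:
            kept.append(members)
    return kept, unclustered


# Hypothetical clusters of sizes 2, 3, and 6.
clusters = [["a", "b"], ["c", "d", "e"], ["f", "g", "h", "i", "j", "k"]]
kept_default, loose_default = apply_size_limits(clusters)
kept_capped, loose_capped = apply_size_limits(clusters, min_size=3, max_size=5)
print(kept_default, loose_default)
print(kept_capped, loose_capped)
```

With the defaults only the two-member cluster is dissolved. With a cap of 5, the six-member cluster is dissolved as well.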
Max Heatmap Size defaults to 1000. If the match count exceeds this value, the heat map is skipped (the cluster table is still produced) because rendering a very large matrix in a browser is slow. Set it to 0 for unlimited, which is not recommended for large kits.
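The interaction between Show Heat Map and Max Heatmap Size, including the special meaning of 0, can be summarized in a few lines. The function and parameter names are hypothetical. Only the rule itself comes from the settings described here.

```python
def should_render_heat_map(match_count, show_heat_map=True, max_heatmap_size=1000):
    """Decide whether the heat map is drawn; max_heatmap_size=0 means unlimited."""
    if not show_heat_map:
        return False
    if max_heatmap_size == 0:
        return True
    # "Exceeds" is read as strictly greater than, so exactly 1000 still renders.
    return match_count <= max_heatmap_size


print(should_render_heat_map(1000))                       # at the limit
print(should_render_heat_map(1001))                       # over the limit
print(should_render_heat_map(50000, max_heatmap_size=0))  # unlimited
```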
Save Excel defaults to Off. When enabled, an Excel workbook is written alongside the HTML output, with the cluster assignments and the correlation matrix. Useful for annotating, sorting, and sharing results outside the browser.
To stop a run in progress, click Cancel. Use Open Output Folder to jump to the folder where the HTML and Excel files are saved (your database folder).
Shared Clustering is not available in Demo mode — it needs real gathered match and ICW data.
The HTML file has up to four sections, linked from the navigation bar at the top: Clusters, Heat Map, Ancestors, and Legend.
Every cluster is listed with a colored header row you can click to collapse or expand. Within each cluster, every match is shown as a row with these columns:
| Column | Meaning |
|---|---|
| Color swatch | The cluster's color. The palette cycles through 10 colors, so colors repeat for clusters 11+ — the color identifies the cluster on this diagram only; it has no meaning across runs. |
| Name | The match's name. |
| cM | Total shared centimorgans with the test taker. |
| Segments | Number of shared DNA segments, when the service provides it. This column is one of the most useful for interpretation — see Interpreting Clusters. |
| Starred | A star (★) if you starred the match on the testing service. |
| Hint | Shows “Hint” when the service reports a Common Ancestor hint for the match. |
| Correlated | The numbers of any other clusters this match also overlaps with, color-coded. This is how you spot relationships between clusters without reading the heat map. |
| Tree Type | None, Unlinked, Private, or Public — tells you at a glance whether a match has a tree worth checking. |
Matches that have at least one ICW relationship but did not land in any cluster are collected at the end under an Unclustered group with a gray header.
The heat map is an N×N grid where every row and every column is one DNA match, ordered so that clustered matches sit next to each other. A colored strip along the top and left edges shows each match's cluster color, and a set of checkboxes above the map lets you filter the display down to specific clusters (plus an option to show the unclustered matches).
| Color | Meaning |
|---|---|
| Blue (diagonal) | A match against itself. The diagonal is just a visual reference line. |
| Red (deeper = stronger) | Direct ICW. The column match appears in the shared-match list between you and the row match — the two very likely share a DNA segment, and are likely on the same line of descent. Deeper red means a stronger correlation. |
| Gray (darker = stronger) | Indirect correlation. The two matches are not direct shared matches, but both appear in the shared-match list of some third person. They might be on the same line of descent, or on two lines joined by a marriage further back in your tree. |
| White | No correlation between the two matches. |
Borders mark the cluster boundaries. The exact numeric value in any one cell rarely matters — what matters is the pattern of red, gray, and white, and especially the way clusters overlap.
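The four colors in the table follow a simple decision order, sketched below. The sketch ignores shading (how deep the red or gray is) and treats a cell as red if either match appears in the other's shared-match list. The function name and data are hypothetical.

```python
def cell_color(row, col, shared_matches):
    """Classify one heat-map cell as blue, red, gray, or white."""
    if row == col:
        return "blue"  # a match against itself (the diagonal)
    if col in shared_matches.get(row, set()) or row in shared_matches.get(col, set()):
        return "red"   # direct ICW
    for third, icw in shared_matches.items():
        if third not in (row, col) and row in icw and col in icw:
            return "gray"  # both appear in a third person's shared-match list
    return "white"     # no correlation


# Hypothetical shared-match lists.
shared_matches = {
    "Alice": {"Bobby"},
    "Bobby": {"Alice"},
    "Nancy": {"Alice", "Charlie"},
    "Charlie": set(),
    "Dana": set(),
}
colors = {
    pair: cell_color(pair[0], pair[1], shared_matches)
    for pair in [
        ("Alice", "Alice"),
        ("Alice", "Bobby"),
        ("Alice", "Charlie"),
        ("Alice", "Dana"),
    ]
}
print(colors)
```

Alice and Charlie come out gray because neither lists the other, but both appear on Nancy's list.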
When the gathered data includes ancestor or tree information, the output ends with an All Shared Ancestors table: a roll-up of ancestors that appear across your matches, with the surname, given name, years, the clusters they show up in, and how many matches reference them. It is a fast way to spot a surname or couple that ties several clusters together. Remember that this table is built after clustering and has no effect on which matches landed where — it is a research aid, not part of the algorithm.
The output includes its own legend covering the heat-map colors and the meaning of the Starred, Hint, Correlated, and Tree Type columns — handy when you come back to a saved file later.
When Save Excel is enabled, the workbook contains:
Note that the workbook's coloring relies on conditional formatting that some other spreadsheet programs handle poorly, so the workbook is most reliable when opened in Microsoft Excel itself.
A lot of the value is in the clusters themselves — but you can learn even more from the relationships between clusters. The patterns below are the ones worth learning to recognize.
A red cell means two people who very likely share a DNA segment — likely on the same line of descent. A gray cell means two people who don't share DNA directly but both share with some third person — they might be on the same line, or on two lines joined by a marriage somewhere back in the tree. That single distinction drives most of the interpretation that follows.
Red overlap — a red block where two clusters meet — means matches that share more than one segment with you. This is some of the most valuable output of the whole analysis.
Gray overlap is harder to read because "indirectly related" can mean many things. Small gray speckles probably mean little. But a large gray rectangle shared between clusters can indicate co-descent the same way red overlap does — a set of segments that travelled down the generations together, with different descendants inheriting different combinations. When you see a big, complex gray area tying several clusters together, you can be confident they all come from the same general part of your tree; the work is then to identify even one match in it to anchor the rest.
Most of the action is on the diagonal, but sometimes a dark block appears away from it. Follow that block horizontally and vertically and it will line up with two clusters that are on the diagonal: it is telling you those two clusters overlap even though they aren't neighbors. This happens when one cluster overlaps three or more others. In a flat two-dimensional diagram a cluster has only two neighbors, so the extra overlapping clusters have to be shown off-diagonal. A cluster that overlaps three others which don't overlap each other is a strong hint: for example, one pair of great-grandparents (the central cluster) feeding three of the four great-great-grandparent lines.
An isolated cluster has no meaningful red or gray running off of it — just white.
Very close relatives (roughly 600 cM and up) show as dark stripes running the full height and width of the diagram rather than as tidy clusters. They aren't very helpful for interpreting the clusters: children and grandchildren contribute nothing, a parent or first cousin matches half of everything, a grandparent or second cousin matches about a quarter. The stripes are interesting to look at because they show how segments were passed down, but if a close relative is cluttering the diagram, lowering the cM Range maximum below that relative's shared cM will tuck them away.
Endogamy — generations of intermarriage within a population — is hard on every clustering tool. Instead of clean red squares on a white background, an endogamous kit produces large areas of near-solid gray flecked with red "speckles." Endogamous testers also see far more matches and far more shared matches per match, which is a reporting artifact, not extra relatives.
You can still extract leads:
The clustering algorithm is based on the open-source Shared Clustering project by Jonathan Brecher, used under its open-source license. The interpretation guidance on this page is adapted from the project's wiki.