Shared Clustering

Overview

Shared Clustering takes your long, undifferentiated list of DNA matches and divides it into smaller clusters — groups of matches who are likely related to each other. Knowing which matches belong together is often a huge help in working out how any one of them is related to you.

The clustering engine is based on the open-source Shared Clustering algorithm created by Jonathan Brecher. DNAGedcom integrates that algorithm directly into the Client, with several additions of its own:

  • Works on your gathered database — no separate download step. Cluster any kit you have already gathered.
  • All supported services — A*, FTDNA, 23andMe, MyHeritage, and GEDmatch, not just one vendor.
  • Interactive HTML output — a sortable cluster table, a filterable heat map, and a shared-ancestors table, generated as a single HTML file. An Excel workbook is optional.
  • Endogamy and tuning controls — Max Cluster Size, Min Cluster Size, and a two-phase Min cM to Cluster extension pipeline.

Shared Clustering operates on data already in your database. For clustering directly from exported CSV files, use the Collins-Leeds Method instead.

This page covers both how to run Shared Clustering in the Client and how to interpret the results. The interpretation material draws heavily on Jonathan Brecher's excellent Shared Clustering wiki, which is well worth reading in full.

What Is a Cluster?

By definition, a cluster is a group of DNA matches who mostly match each other. That is the only part that is always true. Everything beyond that is interpretation, and the interpretation depends on which matches are in the cluster.

In a perfect cluster, every member appears in the shared-match list of every other member. Real clusters are rarely perfect: biology is messy, and the testing companies only report so much. The idea behind clustering is similarity, not perfection.

Most of the time, a cluster represents a group of people who share some DNA segment among them. That makes sense — clusters are built from shared-match lists, and those lists come from the underlying DNA data. But whether a cluster points to a common ancestor or just a shared segment depends on how close the matches are.

Close, distant, and intermediate matches

This distinction is the single most important idea for interpreting your results:

  • Clusters of close matches (roughly 50 cM and up) tend to represent a group of people all related to you through a single ancestor. Close matches share many segments with you and with each other, so they have lots of opportunities to show up on each other's match lists. A cluster of matches over 50 cM might all descend from one of your great-grandparents; a cluster of matches over 90 cM might all descend from one of your grandparents. These clusters are the most useful for adoptees and anyone working on close family.
  • Clusters of distant matches (around 20 cM) tend to represent a single DNA segment rather than a single ancestor. Distant matches usually share just one segment with you, and the people who happen to share that exact segment automatically match each other. A 20 cM cluster can contain cousins of wildly different degrees — fourth through eighth-or-more — all strung along the path that one segment took down the generations.
  • Intermediate matches are the transition between the two. The physical difference is the number of segments shared, not the centimorgan count. The 50 cM default is just a guess that works well for most people — some need 60–70 cM for clean ancestor-based clusters, others get them all the way down to 35–40 cM. It depends entirely on which of your relatives have tested.

Why Cluster?

Clustering does not answer genealogical questions on its own. What it does is turn an impossible amount of data into something a human can actually work with. Many people have tens of thousands of matches; nobody can review those by hand. A cluster of a few dozen matches is something anyone can look at.

The point is not a pretty picture — it is focused leads. You can't predict where the next breakthrough comes from. The last member of a cluster might be the one who inherited a family bible from the 1700s. Several members might have public trees with the same unusual surname, or the same surname spelled three different ways — obvious only when you see them side by side. Clustering finds the links that are worth investigating; proving them is the research that follows.

How the Algorithm Works

Shared Clustering — like all shared-match clustering — works by looking for similarities among the shared-match lists of your matches. If Alice and Bobby both have Nancy, Oscar, and Patty on their shared-match lists, and Charlie has a completely different set, the algorithm puts Alice and Bobby together and Charlie somewhere else.
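
The Alice, Bobby, and Charlie example can be sketched in a few lines. This is only an illustration: the names Quinn, Rosa, and Sam are invented for Charlie's list, and Jaccard overlap is a stand-in for the real similarity measure, which is described in the stages that follow.

```python
# Hypothetical shared-match lists, keyed by match name (illustrative data only).
shared_matches = {
    "Alice":   {"Nancy", "Oscar", "Patty"},
    "Bobby":   {"Nancy", "Oscar", "Patty"},
    "Charlie": {"Quinn", "Rosa", "Sam"},
}

def overlap(a: str, b: str) -> float:
    """Jaccard overlap of two shared-match lists: 0.0 = disjoint, 1.0 = identical."""
    left, right = shared_matches[a], shared_matches[b]
    union = left | right
    return len(left & right) / len(union) if union else 0.0

print(overlap("Alice", "Bobby"))    # 1.0 -> grouped together
print(overlap("Alice", "Charlie"))  # 0.0 -> placed elsewhere
```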

It uses only three pieces of information, none of it private or identifiable: the ID of each match, the shared centimorgans of each match, and each match's list of shared-match IDs. The process runs in three stages:

  1. Building the correlation matrix. Each match is scored against every other match by how often the two appear together in the same shared-match lists, on a scale of 0 to 1. If the two are also direct shared matches of each other, the pair gets an extra +1, which puts its value in the 1–2 range. Values of 0–1 are the gray range; values of 1–2 are the red range.
  2. Measuring similarity. The algorithm compares each match's correlation vector against the others using a modified Euclidean distance, tuned to build clusters that are as large as possible without sacrificing accuracy.
  3. Hierarchical agglomerative clustering. The two most similar matches are merged, then the next two, and so on, building a tree from the bottom up. Dense blocks that form along the diagonal of the sorted matrix become the primary clusters.
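
The three stages can be sketched end to end on toy data. This is a simplified illustration, not the Client's implementation: the `icw` data is invented, the co-occurrence formula is one plausible reading of stage 1, plain Euclidean distance stands in for the modified distance of stage 2, and the average-linkage rule with a fixed `threshold` is an assumption made only so the sketch has a stopping point.

```python
from itertools import combinations
from math import dist

# Hypothetical ICW data: each match's shared-match list (IDs only).
icw = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "X": {"Y"},
    "Y": {"X"},
}
ids = sorted(icw)

def correlation(a: str, b: str) -> float:
    """Stage 1: co-occurrence frequency (0 to 1), plus 1 for a direct shared match."""
    lists_with_a = [members for members in icw.values() if a in members]
    together = sum(1 for members in lists_with_a if b in members)
    freq = together / len(lists_with_a) if lists_with_a else 0.0
    return freq + (1.0 if b in icw[a] else 0.0)

# One correlation vector (one matrix row) per match.
vectors = {a: [correlation(a, b) for b in ids] for a in ids}

def linkage(c1: list, c2: list) -> float:
    """Stage 2: average Euclidean distance between the vectors of two groups."""
    total = sum(dist(vectors[a], vectors[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def cluster(threshold: float = 1.5) -> list:
    """Stage 3: repeatedly merge the two most similar groups, bottom up."""
    groups = [[m] for m in ids]
    while len(groups) > 1:
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda p: linkage(groups[p[0]], groups[p[1]]))
        if linkage(groups[i], groups[j]) > threshold:
            break  # assumed cut-off so the sketch stops before merging everything
        merged = groups.pop(j)  # j > i, so index i is still valid
        groups[i].extend(merged)
    return groups

print(cluster())  # [['A', 'B', 'C'], ['X', 'Y']]
```

In this toy data A, B, and C all list one another, so their vectors are close and they merge early; X and Y only list each other and end up in a separate group.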

An important point: clustering is a side-effect of similarity. The algorithm never decides "this is a cluster" directly — it just keeps merging the most similar matches, and clusters fall out naturally. That is why it produces larger, more useful clusters than network-graph ("clique-finding") approaches, and why it can still find useful structure in the presence of endogamy without having to exclude your closest, most informative matches.

DNAGedcom's heat map shows the correlation between each pair of matches as the maximum of the two directional scores, so the rendered map is symmetric — the cell at row A / column B looks the same as the cell at row B / column A.
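
As a minimal sketch of that symmetry rule, assuming the directional scores are held in a dictionary keyed by (row, column) pairs (the `directional` data and the `cell` helper are hypothetical names, not part of the Client):

```python
# Hypothetical directional scores: the A-to-B score need not equal the B-to-A score.
directional = {
    ("A", "B"): 1.4, ("B", "A"): 1.1,
    ("A", "C"): 0.0, ("C", "A"): 0.3,
}

def cell(row: str, col: str) -> float:
    """Heat-map value for a pair: the larger of the two directional scores."""
    forward = directional.get((row, col), 0.0)
    backward = directional.get((col, row), 0.0)
    return max(forward, backward)

print(cell("A", "B"), cell("B", "A"))  # 1.4 1.4
print(cell("A", "C"), cell("C", "A"))  # 0.3 0.3
```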

Prerequisites

Before running Shared Clustering, gather data for the kit using the Gather tools in the Client.

Data | Required? | Purpose
Matches | Required | The list of DNA matches to cluster.
ICW (In Common With) | Required | The shared-match relationships that drive the entire algorithm. A match with no ICW data cannot be clustered.

Shared Clustering uses no chromosome-segment data and no tree or ancestor data for the clustering itself. (Ancestor data, when present, is summarized in the output's Shared Ancestors table, but it does not affect which matches land in which cluster.)

Settings

Setting | Default | Purpose
Kit Filter | (none) | Type-ahead filter to narrow the DNA Kit dropdown.
DNA Kit | (none) | The kit to cluster.
cM Range (From / To) | 20 – 400 | Includes matches whose total shared cM falls in this range.
Show Heat Map | On | Include the heat map in the HTML output.
Open HTML When Done | On | Open the HTML output automatically when clustering finishes.
Max Cluster Size | 0 (disabled) | Dissolve any cluster larger than this. Must be at least 3 if enabled.
Min cM to Cluster | 0 (disabled) | Two-phase extension pipeline. Must be less than the cM Range minimum if enabled.
Min Cluster Size | 3 | Dissolve any cluster with fewer than this many members.
Max Heatmap Size | 1000 | Skip the heat map if the match count exceeds this. 0 = unlimited.
Save Excel | Off | Also generate an Excel workbook alongside the HTML.
Open Excel When Done | Off | Open the Excel workbook automatically when clustering finishes.
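
The constraints in the table can be expressed as a small validation sketch. The `ClusterSettings` class and its field names are hypothetical, chosen to mirror the table; the Client's own settings objects are not documented here.

```python
from dataclasses import dataclass

@dataclass
class ClusterSettings:
    """Hypothetical container mirroring the defaults in the Settings table."""
    cm_from: float = 20
    cm_to: float = 400
    show_heat_map: bool = True
    max_cluster_size: int = 0      # 0 = disabled
    min_cm_to_cluster: float = 0   # 0 = disabled
    min_cluster_size: int = 3
    max_heatmap_size: int = 1000   # 0 = unlimited

    def problems(self) -> list:
        """Return a message for each documented constraint that is violated."""
        found = []
        if self.max_cluster_size != 0 and self.max_cluster_size < 3:
            found.append("Max Cluster Size must be at least 3 if enabled.")
        if self.min_cm_to_cluster != 0 and self.min_cm_to_cluster >= self.cm_from:
            found.append("Min cM to Cluster must be less than the cM Range minimum.")
        return found

    def draws_heat_map(self, match_count: int) -> bool:
        """The heat map is skipped when turned off or when the kit exceeds the limit."""
        within_limit = self.max_heatmap_size == 0 or match_count <= self.max_heatmap_size
        return self.show_heat_map and within_limit

print(ClusterSettings().problems())                      # []
print(ClusterSettings(min_cm_to_cluster=25).problems())  # one message
print(ClusterSettings().draws_heat_map(2500))            # False
```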

Setting Details

cM Range

Default: 20 to 400 cM.

The clustering algorithm loves data — the more matches you feed it, the better the clusters it can find. Brecher's guidance is emphatic on this point: do not reflexively narrow the cM range. A wide range gives the algorithm the most to work with.

  • 20 cM lower bound captures everything the testing companies show in shared-match lists. This is the broadest practical starting point.
  • 50 cM lower bound focuses on closer, higher-confidence matches and tends to produce ancestor-based clusters that are easier to interpret — often four regions corresponding to your four grandparents.
  • Below 20 cM can still be useful, but the companies don't report shared matches that weak, so those matches can't form the neat square clusters — see Min cM to Cluster below.
  • The 400 cM upper bound is mostly cosmetic. Very close relatives produce long stripes rather than tidy clusters (see Close Relatives); excluding them tidies the diagram but they do no harm if left in.

A good habit: run a wide range first to see the full picture, then re-run focused on a subset if a particular region needs a closer look.

Show Heat Map

Default: On. The heat map is the visual correlation matrix — the part most people think of as "the cluster diagram." Turn it off for faster output on very large kits, or when you only need the cluster table.

Max Cluster Size

Default: 0 (disabled). When set, any cluster larger than this value is dissolved into singletons. Must be at least 3 if enabled. This is primarily an endogamy aid: in endogamous populations a single runaway cluster can swallow a huge share of the matches, and dissolving it lets the smaller, more meaningful structure show through.

Min cM to Cluster

Default: 0 (disabled). When enabled, clustering runs as a two-phase pipeline. The value must be lower than the cM Range minimum.

  • Phase 1 — build clusters using the matches inside the cM Range.
  • Phase 2 — take the weaker matches (between Min cM to Cluster and the cM Range minimum) and place each one into the nearest existing cluster.

Example: a cM Range of 20–400 with Min cM to Cluster set to 10 clusters the 20+ cM matches first, then attaches the 10–20 cM matches to existing clusters. Because the testing companies don't report shared matches below 20 cM, those weaker matches can't form clean square blocks — they get added above and below the clusters they belong to instead. The payoff is large: a full extension pass often roughly triples the number of matches assigned to clusters, surfacing matches you have almost certainly never looked at.
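
Phase 2 can be pictured with the sketch below. It assumes "nearest" means the cluster that contains the most members of the weaker match's shared-match list; that rule, the sample data, and the `nearest_cluster` helper are illustrative assumptions, not the Client's documented behavior.

```python
from typing import Optional

# Phase 1 result (hypothetical): clusters built from matches inside the cM Range.
clusters = {1: {"A", "B", "C"}, 2: {"X", "Y"}}

# Shared-match lists of weaker matches, e.g. 10 to 20 cM (hypothetical data).
weak_icw = {"W1": {"A", "C", "X"}, "W2": {"Y"}, "W3": set()}

def nearest_cluster(shared: set) -> Optional[int]:
    """Pick the cluster holding the most of this match's shared matches."""
    best = max(clusters, key=lambda cid: len(clusters[cid] & shared))
    # A match with no overlap at all stays unassigned.
    return best if clusters[best] & shared else None

assignments = {match: nearest_cluster(shared) for match, shared in weak_icw.items()}
print(assignments)  # {'W1': 1, 'W2': 2, 'W3': None}
```

The weaker matches are attached to existing clusters rather than forming new ones, which matches the description above: they extend a cluster without adding a square block of their own.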

Min Cluster Size

Default: 3. Clusters with fewer members are dissolved. Raise it to suppress small, low-confidence clusters; note that small clusters can sometimes be the interesting surprises (see Caveats), so don't raise it too aggressively.
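
The two size limits (Max Cluster Size and Min Cluster Size) can be sketched together as a single filter. The function name and the choice to apply both checks in one pass are assumptions for illustration; the order the Client uses is not stated here.

```python
def apply_size_limits(clusters, min_size=3, max_size=0):
    """Dissolve clusters that are too small or, when max_size > 0, too large.

    Returns (kept_clusters, dissolved_members).
    """
    kept, dissolved = [], []
    for members in clusters:
        too_big = max_size > 0 and len(members) > max_size
        too_small = len(members) < min_size
        if too_big or too_small:
            dissolved.extend(members)  # members become unclustered singletons
        else:
            kept.append(members)
    return kept, dissolved

found = [["a", "b"], ["c", "d", "e"], ["f", "g", "h", "i", "j", "k"]]
kept, dissolved = apply_size_limits(found, min_size=3, max_size=5)
print(kept)       # [['c', 'd', 'e']]
print(dissolved)  # ['a', 'b', 'f', 'g', 'h', 'i', 'j', 'k']
```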

Max Heatmap Size

Default: 1000. If the match count exceeds this value, the heat map is skipped (the cluster table is still produced) because rendering a very large matrix in a browser is slow. Set to 0 for unlimited — not recommended for large kits.

Save Excel / Open Excel When Done

Default: Off. When enabled, an Excel workbook is written alongside the HTML output, with the cluster assignments and the correlation matrix. Useful for annotating, sorting, and sharing results outside the browser.

Running Shared Clustering

  1. Open Autosomal > Shared Clustering.
  2. Select your DNA Kit from the dropdown (use the Kit Filter box to narrow a long list).
  3. Adjust the cM Range and any advanced settings as needed.
  4. Click Run Clustering.
  5. Status messages report progress through loading match data, loading ICW data, the clustering stages, and output generation.
  6. The HTML output opens automatically when complete (if Open HTML When Done is on).

To stop a run in progress, click Cancel. Use Open Output Folder to jump to the folder where the HTML and Excel files are saved (your database folder).

Shared Clustering is not available in Demo mode — it needs real gathered match and ICW data.

Reading the HTML Output

The HTML file has up to four sections, linked from the navigation bar at the top: Clusters, Heat Map, Ancestors, and Legend.

The Clusters table

Every cluster is listed with a colored header row you can click to collapse or expand. Within each cluster, every match is shown as a row with these columns:

Column | Meaning
Color swatch | The cluster's color. The palette cycles through 10 colors, so colors repeat for clusters 11+. The color identifies the cluster on this diagram only; it has no meaning across runs.
Name | The match's name.
cM | Total shared centimorgans with the test taker.
Segments | Number of shared DNA segments, when the service provides it. This column is one of the most useful for interpretation; see Interpreting Clusters.
Starred | A star (★) if you starred the match on the testing service.
Hint | Shows “Hint” when the service reports a Common Ancestor hint for the match.
Correlated | The numbers of any other clusters this match also overlaps with, color-coded. This is how you spot relationships between clusters without reading the heat map.
Tree Type | None, Unlinked, Private, or Public. Tells you at a glance whether a match has a tree worth checking.
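
The color cycling in the Color swatch row amounts to modular arithmetic. A minimal sketch, assuming clusters are numbered from 1 and colors are assigned in order (`palette_index` is a hypothetical helper):

```python
PALETTE_SIZE = 10  # the palette cycles through 10 colors

def palette_index(cluster_number: int) -> int:
    """Zero-based palette slot for a 1-based cluster number."""
    return (cluster_number - 1) % PALETTE_SIZE

print(palette_index(1), palette_index(11))  # 0 0 -> clusters 1 and 11 share a color
print(palette_index(10))                    # 9
```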

Matches that have at least one ICW relationship but did not land in any cluster are collected at the end under an Unclustered group with a gray header.

The Heat Map

The heat map is an N×N grid where every row and every column is one DNA match, ordered so that clustered matches sit next to each other. A colored strip along the top and left edges shows each match's cluster color, and a set of checkboxes above the map lets you filter the display down to specific clusters (plus an option to show the unclustered matches).

Color | Meaning
Blue (diagonal) | A match against itself. The diagonal is just a visual reference line.
Red (deeper = stronger) | Direct ICW. The column match appears in the shared-match list between you and the row match; the two very likely share a DNA segment, and are likely on the same line of descent. Deeper red means a stronger correlation.
Gray (darker = stronger) | Indirect correlation. The two matches are not direct shared matches, but both appear in the shared-match list of some third person. They might be on the same line of descent, or on two lines joined by a marriage further back in your tree.
White | No correlation between the two matches.
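
Combining this table with the 0–1 (gray) and 1–2 (red) value ranges from How the Algorithm Works, the color choice for a cell can be sketched as below. The `cell_color` function is hypothetical, and treating a value of exactly 1 as red is an assumption about the boundary.

```python
def cell_color(row: str, col: str, value: float) -> str:
    """Map one heat-map cell to its color family."""
    if row == col:
        return "blue"    # a match against itself (the diagonal)
    if value >= 1.0:
        return "red"     # direct ICW: the +1 bonus lifts the value into the 1-2 range
    if value > 0.0:
        return "gray"    # indirect correlation only
    return "white"       # no correlation

print(cell_color("A", "A", 1.0))  # blue
print(cell_color("A", "B", 1.6))  # red
print(cell_color("A", "C", 0.4))  # gray
print(cell_color("A", "D", 0.0))  # white
```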

Borders mark the cluster boundaries. The exact numeric value in any one cell rarely matters — what matters is the pattern of red, gray, and white, and especially the way clusters overlap.

The Shared Ancestors table

When the gathered data includes ancestor or tree information, the output ends with an All Shared Ancestors table: a roll-up of ancestors that appear across your matches, with the surname, given name, years, the clusters they show up in, and how many matches reference them. It is a fast way to spot a surname or couple that ties several clusters together. Remember that this table is built after clustering and has no effect on which matches landed where — it is a research aid, not part of the algorithm.

The Legend

The output includes its own legend covering the heat-map colors and the meaning of the Starred, Hint, Correlated, and Tree Type columns — handy when you come back to a saved file later.

Reading the Excel Output

When Save Excel is enabled, the workbook contains:

  • Clusters sheet — the color-coded match list grouped by cluster, the same data as the HTML cluster table.
  • Matrix sheet — the full correlation matrix rendered in color-scaled cells. Generated only when the match count is within the Max Heatmap Size limit.

Note that the workbook's cell colors rely on conditional formatting that some other spreadsheet programs handle poorly, so the workbook is most reliable when opened in Microsoft Excel itself.

Interpreting Clusters

A lot of the value is in the clusters themselves — but you can learn even more from the relationships between clusters. The patterns below are the ones worth learning to recognize.

What the cell colors tell you

A red cell means two people who very likely share a DNA segment — likely on the same line of descent. A gray cell means two people who don't share DNA directly but both share with some third person — they might be on the same line, or on two lines joined by a marriage somewhere back in the tree. That single distinction drives most of the interpretation that follows.

Red overlap between clusters

Red overlap — a red block where two clusters meet — means matches that share more than one segment with you. This is some of the most valuable output of the whole analysis.

  • Small overlap is golden. The match (or few matches) in a small overlap usually has a much higher total cM than its neighbors, because it shares two segments instead of one. A 160 cM match sitting in the overlap between two 20–25 cM clusters is often a match you can already identify — and if you can, you've just learned that the two segments those clusters represent coexisted in one ancestral couple. That is a powerful tool for breaking through brick walls.
  • Large overlap is less exciting. It usually means one longer segment that frittered away at one end or the other over a long line of descent; the members are likely related through a single line, but possibly spanning third through eighth-or-more cousins. Check the Segments column — if everyone, including the overlap, shares just one segment, you are almost certainly looking at one segment shortened in different ways.

Gray overlap between clusters

Gray overlap is harder to read because "indirectly related" can mean many things. Small gray speckles probably mean little. But a large gray rectangle shared between clusters can indicate co-descent the same way red overlap does — a set of segments that travelled down the generations together, with different descendants inheriting different combinations. When you see a big, complex gray area tying several clusters together, you can be confident they all come from the same general part of your tree; the work is then to identify even one match in it to anchor the rest.

Dark areas off the diagonal

Most of the action is on the diagonal, but sometimes a dark block appears off of it. Follow that block horizontally and vertically and it will line up with two clusters that are on the diagonal — it is telling you those two clusters overlap even though they aren't neighbors. This happens when one cluster overlaps three or more others: in a flat two-dimensional diagram, a cluster only has two neighbors, so the extra overlapping clusters have to be shown off-diagonal. A cluster that overlaps three others which don't overlap each other is a strong hint — for example, one pair of great-grandparents (the central cluster) feeding three of the four great-great-grandparent lines.

Isolated clusters

An isolated cluster has no meaningful red or gray running off of it — just white.

  • Simple, near-perfect squares are seductive but often the least useful. They are easy to find by hand, so you probably already know about them, and they usually represent distant cousins whose common ancestor may be an eighth-great-grandparent or older — useful only if your tree already reaches that far.
  • Sparse clusters — an irregular red pattern on a gray background — are more interesting. Every member matches some of the others, making a valid (if imperfect) cluster, but the pattern is irregular enough that you probably never spotted it by hand. Sparse clusters can hold genuine surprises.

Close relatives

Very close relatives (roughly 600 cM and up) show as dark stripes running the full height (and width) of the diagram rather than as tidy clusters. They aren't very helpful for interpreting the clusters: children and grandchildren contribute nothing, a parent or first cousin matches half of everything, a grandparent or second cousin matches about a quarter. The stripes are interesting to look at because they show how segments were passed down, but if a close relative is cluttering the diagram, lowering the cM Range maximum (the upper bound) will tuck them away.

Endogamy

Endogamy — generations of intermarriage within a population — is hard on every clustering tool. Instead of clean red squares on a white background, an endogamous kit produces large areas of near-solid gray flecked with red "speckles." Endogamous testers also see far more matches and far more shared matches per match, which is a reporting artifact, not extra relatives.

You can still extract leads:

  • Use Max Cluster Size to dissolve the runaway mega-cluster that endogamy tends to create, so the smaller real structure underneath can show.
  • Limit to your strongest matches — raise the cM Range minimum to 50 cM or higher. Stronger matches survive the endogamic noise better.
  • Anchor on what you know. Find a cluster containing a relative you have already identified; that tells you a great deal about the rest of that cluster. Failing that, look for any tight group of 3–4 matches — even unidentified, it may open a new branch.

Tips for Researching Your Clusters

  • Start with what you know. Find the clusters that contain relatives you have already identified and work outward from there — the other members are often related through the same ancestors, or through that family's ancestors or descendants.
  • Mine the public trees. Look across a cluster's trees for surnames you recognize, surnames that recur across several trees, and shared geographic locations. A quick speculative tree can sometimes connect an unfamiliar tree back to your own.
  • Message the matches. The members of a cluster are likely related to each other, which is useful information to them too. Ask whether they recognize anyone else in the cluster. Some won't reply; some will hand you exactly what you needed.
  • Run a wide range first, then focus. Cluster broadly to see the whole structure, then re-run on a narrower cM range or a region of interest for a closer look.
  • Compare tools. Run the same kit through the Collins-Leeds Method and the Warthen Interactive Cluster. Different algorithms on the same data give you more confidence where they agree and a useful second opinion where they don't.

Caveats & Important Notes

  • Clusters suggest; they do not prove. A cluster points you toward a likely shared ancestor or segment — it is a lead for research, not a conclusion.
  • A shared cluster is not a shared ancestor. Members of one cluster may be a mix of second, fourth, and sixth cousins. They likely share a segment; they do not necessarily share the same most-recent common ancestor.
  • Small clusters can be quirks. In a three-person cluster, the three could match each other for three different reasons — no single segment shared by all. The larger a cluster gets, the more the real signal reinforces itself and the more those quirks wash out.
  • Even a shared segment may not be IBD. A segment shared across a cluster could be identical by descent or merely identical by state. Clustering can't tell the difference.
  • Cluster order and color carry no meaning across runs. There is no maternal-then-paternal ordering and no fixed color assignment — the software can only say the clusters exist; what they mean is your research to do.
  • Shared Clustering is not available in Demo mode.

The clustering algorithm is based on the open-source Shared Clustering project by Jonathan Brecher, used under its open-source license. The interpretation guidance on this page is adapted from the project's wiki.