TWF : A DSL/flat file format to describe family trees

Published on March 23, 2021

Pretty much everyone has drawn a family tree. Most children do it, probably as early as their kindergarten years, and for most people it never progresses beyond that kindergarten level: usually just 3 layers in the tree (grandparents, parents and "me"). Although I too have drawn those trees, I wanted to push beyond that to see how far up the tree I could climb, and that was the initial impetus that started me on the path of trying to document a larger portion of my family tree.

I started off trying to collate the information that my Dad and Mom had written down on various sheets of paper. These were usually hand-drawn trees made of nested lists of names that invariably ran into the right edge of the paper as the family tree got fleshed out over multiple generations. Ambiguous lines would then be added to connect the remaining parts of the tree by starting over again on the left edge of the paper. The resulting mess usually did nothing to illuminate your relationship to someone you met at a family function, even if they happened to actually be listed in that family tree.

I decided that the right solution to all of this was to transfer all of the available information into a computer. After researching the various options listed on Wikipedia to see how people track genealogy information, I narrowed down my list to just two pieces of open source software, since all of the remaining options were closed source/proprietary/walled-garden web sites with no way to export the data that you had painstakingly entered into them.

The two open source programs that I decided to use, both of which let you track genealogy information locally on your own computer, were Gramps ² and Lifelines/llines ³. I tried llines first and then moved to Gramps to enter the meager data that I had collected, and found both of them rather wearisome to use. Using the web-based genealogy sites also involved a lot of clicking and tabbing between fields, which I found detestable.

Now, all of my work (even personal stuff) is done on computers running Solaris/Linux, so I do have a preference for the command line, unlike most folks who probably use the web or other GUIs to track and capture genealogy related information. Most of these GUIs such as Gramps and even the TUIs such as llines can read and write the GEDCOM file format. So I wondered if I could just edit the GEDCOM file as a flat file directly. While it is possible to do that, it is not for the faint of heart, since you are now tasked with tracking various minutiae that the software front ends would have handled for you.

Since I found the input format that GEDCOM used fairly unwieldy and the output formats that the tools generated from it were not quite to my liking, it was a double whammy.

Now, I'm a software guy so I should be able to write software to scratch this particular itch, right? My obvious knee jerk reaction was to try to see if I could code up something on my own that could be used in place of GEDCOM (which of course, I could continue to use as a backend, for interoperability).

After a lot of thinking, I decided that what I really wanted was for the input to be in a flat file format which could be edited (with any editor of your choice) and then processed to generate whatever output format (GEDCOM, trees, linear lists etc) was desired. The input format I decided to use was one that could be generated by recursively walking the family tree and listing each family on a separate line, one line per "family unit". Since this format is the output of a "tree walk", I'll call it "Tree Walk Format"/twf. The reverse operation of parsing each line and recursively rebuilding the original tree should also be possible, although this imposes the constraint that the ordering of lines matters. It also requires a means of annotating which nodes in the tree need to be expanded out as branches and which are left as terminal nodes.

The initial format I came up with was that each family was listed one per line with each line having a format of:

father + mother -> children

where the children are listed in birth order. This format could obviously be simplified even further into a plain CSV like format:

father, mother, child1, child2, ...

where the position determines who each person is in relation to everyone else.
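As a rough sketch of the "tree walk" (in Python, purely for illustration; it is not the script described later in this post), the lines can be produced by a pre-order walk that prints a family and then recurses into each child in birth order. The `Person` class and the `walk` function are hypothetical names, and the sketch assumes that each person heads at most one family and is listed first on their own line.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    """One person; `spouse` is set only if they head a family of their own."""
    name: str
    spouse: Optional[str] = None
    children: List["Person"] = field(default_factory=list)


def walk(person: Person) -> List[str]:
    """Pre-order walk: emit this person's family, then each child's family."""
    if person.spouse is None:
        return []  # no family unit of their own, so no line is emitted
    names = [person.name, person.spouse] + [c.name for c in person.children]
    lines = [", ".join(names)]
    for child in person.children:  # children are stored in birth order
        lines.extend(walk(child))
    return lines


# Illustrative data only: a small fragment of the Genesis genealogy.
enoch = Person("Enoch")
cain = Person("Cain", "Wife_of_Cain", [enoch])
abel = Person("Abel")
enosh = Person("Enosh")
seth = Person("Seth", "Wife_of_Seth", [enosh])
adam = Person("Adam", "Eve", [cain, abel, seth])

for line in walk(adam):
    print(line)
```

Because the walk is depth first, Cain's family is printed before Seth's, which is the same ordering constraint that the reverse (parsing) direction relies on.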

This format has the nice property that a person without access to a computer could also use this exact format to write things down in a notebook. When I started to use this format to document family relationships manually in a book I realized that it was easier if each line was numbered so that families could be added out of order and tagged using the line numbers to link them up. For example, we could use some data starting with the Book of Genesis in the Bible to get:

1. Adam, Eve, Cain⁴, Abel, Seth², Other, Sons, And, Daughters
2. Seth, Wife_of_Seth, Enosh³, Other, Sons, And, Daughters
3. Enosh, Wife_of_Enosh, Kenan⁸, Other, Sons, And, Daughters
4. Cain, Wife_of_Cain, Enoch
5. Abraham⁶, Sarah, Isaac⁷
6. Abraham, Hagar, Ishmael
7. Isaac, Rebekah,
8. Kenan, Wife_of_Kenan, ...
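To make the tagging rule concrete, here is a small sketch (Python, for illustration only) that checks that every superscript tag points at a line whose father or mother is the tagged person. It assumes that a line's number is simply its position in the list; `split_tag` and `check_tags` are hypothetical helper names.

```python
SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
TO_ASCII = str.maketrans(SUPERSCRIPT_DIGITS, "0123456789")


def split_tag(entry):
    """Split 'Cain⁴' into ('Cain', 4); untagged names give (name, None)."""
    name = entry.rstrip(SUPERSCRIPT_DIGITS)
    tag = entry[len(name):]
    number = int(tag.translate(TO_ASCII)) if tag else None
    return name, number


def check_tags(lines):
    """Return a list of problems; an empty list means every tag resolves."""
    families = [[e.strip() for e in line.split(",") if e.strip()]
                for line in lines]
    problems = []
    for number, family in enumerate(families, start=1):
        for entry in family:
            name, target = split_tag(entry)
            if target is None:
                continue
            if not 1 <= target <= len(families):
                problems.append(
                    f"line {number}: {name} points at missing line {target}")
                continue
            # The tagged person must be one of the two parents on that line.
            parents = [split_tag(e)[0] for e in families[target - 1][:2]]
            if name not in parents:
                problems.append(
                    f"line {number}: line {target} is not {name}'s family")
    return problems


notebook = [
    "Adam, Eve, Cain⁴, Abel, Seth², Other, Sons, And, Daughters",
    "Seth, Wife_of_Seth, Enosh³, Other, Sons, And, Daughters",
    "Enosh, Wife_of_Enosh, Kenan⁸, Other, Sons, And, Daughters",
    "Cain, Wife_of_Cain, Enoch",
    "Abraham⁶, Sarah, Isaac⁷",
    "Abraham, Hagar, Ishmael",
    "Isaac, Rebekah,",
    "Kenan, Wife_of_Kenan, ...",
]
print(check_tags(notebook))
```

A tag on a parent (such as Abraham⁶) is treated the same way as a tag on a child: it just has to land on another line where that person is listed as a parent.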

On a computer though, reordering lines is easy, so we don't need to support line numbers or tagging and we can expect expansions to be done strictly in order. To denote a leaf node on the computer, I decided that adding a period after the name was intuitive/clear enough. Since the comma/',' character needed to be used for other purposes⁴, I decided to use the '|' character as the separator. With these changes in place, the newer format would look as shown below, with the parts after '#' used for comments.

Here's the same family tree described above starting with Adam from the Book of Genesis in the Bible as formatted for use on a computer:

# Adam and Eve have 3 children. Abel died without any descendants
Adam.|Eve.|Cain|Abel.|Seth  # Abel gets a '.' to denote leaf node
# Since Cain is the first "branch" node, it needs to be expanded first
Cain.|Wife.|Enoch.      # Cain's family is not expanded for now
# Since Seth was listed after Cain, it is expanded next after skipping
# over Abel since he is marked as a leaf/no descendants node
Seth.|Wife.|Enosh       # Seth's family is now expanded via ...
Enosh.|Wife.|Kenan      # ... Enosh, his son who in turn
Kenan.|Wife.|           # ... is expanded via his son Kenan
...             # ... and on and on ...
Abraham|Sarah.|Isaac        # ... until we get to Abraham
Abraham|Hagar.|Ishmael.     # ... who had children by 3 women
Abraham.|Keturah.| ... etc  # ... assuming I counted correctly
# Abraham is finally a leaf node, so we backtrack and expand the next
# non leaf node by skipping Sarah and moving to Isaac
Isaac.|Rebekah.|Jacob|Esau  # Isaac's children need expanding
Jacob.|...          # ... and you know how this story goes
Esau.|...           # ... on and on and on ...

From an implementation point of view, all that is needed is a stack of names with the leftmost (oldest) name at the top of the stack, with the convention that only names that don't end in a period/'.' need further expansion.

For each new line, we just need to look for a match of the topmost item on the stack and error out if it isn't seen thus helping spot errors as early as possible. So, using this simple flat file format it is clear how a family tree can be rebuilt given a flat file. Also, given a family tree, you can see how this flat file can be generated quite mechanically by "walking the tree" and listing one family per line in the output.
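Here is a minimal sketch of that stack discipline in Python. It is my illustration of the idea rather than the actual script, and it makes a few assumptions: the name on top of the stack may appear anywhere on the next line, a name repeated without a period simply goes back on the stack, and the ",age" suffix and the operators introduced further below are not handled. The names `parse_line` and `check_order` are hypothetical.

```python
def parse_line(raw):
    """Return the names on one TWF line as (name, is_leaf) pairs."""
    text = raw.split("#", 1)[0]            # drop the trailing comment, if any
    fields = [f.strip() for f in text.split("|")]
    return [(f.rstrip("."), f.endswith(".")) for f in fields if f]


def check_order(lines):
    """Replay the stack; return the names still waiting to be expanded."""
    stack = []                              # the top of the stack is stack[-1]
    for number, raw in enumerate(lines, start=1):
        names = parse_line(raw)
        if not names:
            continue                        # blank or comment-only line
        if stack:
            expected = stack[-1]
            if expected not in [name for name, _ in names]:
                raise ValueError(
                    f"line {number}: expected a family for {expected}")
            stack.pop()
        pending = [name for name, is_leaf in names if not is_leaf]
        stack.extend(reversed(pending))     # leftmost pending name ends on top
    return stack


sample = [
    "# Adam and Eve have 3 children",
    "Adam.|Eve.|Cain|Abel.|Seth  # Abel is a leaf node",
    "Cain.|Wife.|Enoch.",
    "Seth.|Wife.|Enosh",
    "Enosh.|Wife.|Kenan",
    "Kenan.|Wife.|",
]
print(check_order(sample))                  # names left unexpanded, if any
```

Swapping the Cain and Seth lines makes `check_order` raise a `ValueError` on the first out-of-order line, which is the "error out as early as possible" behaviour described above.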

So far, I've structured the family tree shown above as a "mathematical tree" where branches don't intersect. In reality, family trees are never really trees; they are more like graphs, since cousins and other (un)related people can have offspring either within a marriage or outside it (e.g. Abraham and Hagar).

Let's flesh out Rebekah's family to see the adjustments that need to be made to this scheme. To keep things simple, let's ignore the other wives of Abraham (and their children) and focus on Rebekah's family as described in the Book of Genesis in the Bible.

Abraham.|Sarah.|Isaac       # Isaac is the only node to be expanded
Isaac.|Rebekah|Jacob|Esau   # but we now add Rebekah's tree as well
Bethuel|Wife.|Rebekah.      # by adding Bethuel as her dad and
Nahor|Milcah.|Bethuel       # tracing back to Nahor and Haran
Terah.|Wife.|Abraham.|Nahor.|Haran  # who were Abraham's brothers

This shows that when related people are married (in this specific case, Isaac and Rebekah are first cousins once removed), the notation runs into a problem, since we need some way to refer forward/backward to the same person via different family links. For this, I've chosen to use two operators which, like the period/'.' operator, are written after the name.

The forward pointer/'^' operator after a name indicates that it will be resolved later in the file, not immediately on the next line. The cut/'!' operator after a name is a means of resolving the closest forward reference which was listed earlier in the file.

Using these operators the tree described earlier can be rewritten as:

Abraham^|Sarah.|Isaac       # Add a forward reference for Abraham
Isaac.|Rebekah|Jacob|Esau   # ... now fill in Rebekah's tree as well
Bethuel|Wife.|Rebekah.      # by adding Bethuel as her dad and
Nahor|Milcah.|Bethuel       # tracing back to Nahor, Abram's bro
Terah.|Wife.|Abraham!|Nahor.|Haran  # and resolve Abraham's ptr here

Obviously this is just one way to resolve the circular reference, so pick whichever modification reduces the total number of changes that are needed.
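One way to sanity check these two operators is sketched below in Python. It is only an illustration, under my reading of the rule that a cut/'!' resolves the most recent unresolved forward pointer/'^' for the same name; `check_pointers` is a hypothetical helper and it ignores the stack ordering rules entirely.

```python
def check_pointers(lines):
    """Pair each cut ('!') with the closest earlier forward pointer ('^')."""
    open_refs = {}      # name -> line numbers of its unresolved '^' markers
    pairs = []          # (name, line of '^', line of '!')
    for number, raw in enumerate(lines, start=1):
        text = raw.split("#", 1)[0]          # ignore comments
        for field in text.split("|"):
            field = field.strip()
            if field.endswith("^"):
                open_refs.setdefault(field[:-1], []).append(number)
            elif field.endswith("!"):
                name = field[:-1]
                if not open_refs.get(name):
                    raise ValueError(
                        f"line {number}: no forward pointer for {name}")
                pairs.append((name, open_refs[name].pop(), number))
    dangling = sorted(name for name, refs in open_refs.items() if refs)
    if dangling:
        raise ValueError(
            "unresolved forward pointers: " + ", ".join(dangling))
    return pairs


# A shortened copy of the example above (Rebekah's ancestors left out).
example = [
    "Abraham^|Sarah.|Isaac       # Add a forward reference for Abraham",
    "Isaac.|Rebekah|Jacob|Esau",
    "Terah.|Wife.|Abraham!|Nahor.|Haran  # resolve Abraham's ptr here",
]
print(check_pointers(example))
```

A '!' without an earlier '^', or a '^' that is never resolved, is reported as an error, so a typo in either operator shows up before any tree is drawn.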

With the addition of the forward pointer/'^' and cut/'!' operators, I assume that even the most complicated family trees become fairly easy to describe in a flat file format whose rules themselves fit in just a few paragraphs. For example, here are Lot's complicated family dynamics, again straight from the Book of Genesis in the Bible:

Lot|Wife.|Daughter1^|Daughter2^
Lot|Daughter1!|Moab.
Lot.|Daughter2!|Ben-Ammi.

Although this seems sufficient to describe the complicated relationships from the Bible, I'm not even going to attempt to see if the families from Greek mythology can be covered by this simple set of operators (i.e. whether these operators meet the requirements of both (a) necessity and (b) sufficiency).

So I decided to skip the theory/academic parts and move forward to real world testing by trying to apply this format to describe my own family tree.

Using a bunch of data collected by Vavanty¹, with my Mom acting as the intermediary to send me the data over WhatsApp and with Fr. Joseph filling in various missing bits and pieces and providing needed encouragement, I was able to flesh out my family tree to list roughly 250 families. The toolchain that I've developed currently uses a Perl script (called, appropriately enough, "twf") to turn the input file in the "tree walk format" described above into an output file in groff/dot format. That file is then post processed using dot to generate a PDF file that depicts the family tree with all the interconnections in a single graphical display sheet that can be zoomed in/out to your heart's content.
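To give a feel for the last step, here is a small Python sketch (not the Perl script, and not necessarily how it lays things out) that turns TWF lines into Graphviz dot source, with one small junction node per family. `to_dot` is a hypothetical name, names are used directly as node identifiers (so two different people both called "Wife" would collapse into one node), and the ",age" suffix is not handled.

```python
def to_dot(lines):
    """Return Graphviz dot source with one junction node per family line."""
    out = ["digraph family {", "  node [shape=box];"]
    family = 0
    for raw in lines:
        text = raw.split("#", 1)[0]                   # ignore comments
        names = [f.strip().rstrip(".^!") for f in text.split("|")]
        names = [n for n in names if n]
        if len(names) < 2:
            continue                                  # need both parents
        family += 1
        hub = f"f{family}"                            # junction for this family
        out.append(f"  {hub} [shape=point];")
        for parent in names[:2]:
            out.append(f'  "{parent}" -> {hub};')
        for child in names[2:]:
            out.append(f'  {hub} -> "{child}";')
    out.append("}")
    return "\n".join(out)


lot = [
    "Lot|Wife.|Daughter1^|Daughter2^",
    "Lot|Daughter1!|Moab.",
    "Lot.|Daughter2!|Ben-Ammi.",
]
print(to_dot(lot))
```

Saving the output as, say, family.dot and running `dot -Tpdf family.dot -o family.pdf` should then produce a PDF, assuming Graphviz is installed.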

<strike>I'll update this post to add a link to the code here if/when I get around to cleaning it up and documenting it and hosting it somewhere.</strike> [Update: 11 Mar 2025]: The source code has been uploaded to GitHub.

¹ Adeena Paul
² Blurb: "GRAMPS is a GNOME genealogy program for Linux and FreeBSD"
³ Blurb: "LifeLines is a genealogy program to help with your family history"
⁴ Currently, in the script, the age at time of death is denoted by appending ",age" to the name of the person.