Fun with bad data

Working from manually transcribed data can have its moments… I was normalizing some small data sets[W], indexes of record holdings, which the client had not accurately specified. Which had resulted in a bug, and which I had tracked down and exterminated.

Naturally, though, I was much less trusting of the data I had been given. Yes it was dual-entry, and yes we had cleaned up all the discrepancies, including the undocumented additional data column. But I no longer trusted it.

So I started writing tests. Each column, each row… And sure enough I uncovered a discrepancy: 6 dates did not match the original indexes.

There are a lot of ways to store data. The best standard is to store only one copy of any given node; if you have 10 rows with the same item, you should have only one item entry, and 10 references to that item. In practice this often ends up with far more work (and code), but it does eliminate what can be a very subtle type of error.  When it comes to dates, though, a really good way to avoid the complexity of references and reduce the chance of errors is to use unix time.

Unix time is the number of seconds which have occurred since 00:00:00 01 Jan 1970. (It is not 100% accurate because it does not account for leap seconds[W], and there have been 27 leap seconds since 1972.) Like the Julian calendar[R], one of the benefits of Unix time is you can easily add or subtract time, to know how far apart any two events were, so it is especially useful for genealogy. Any day is assumed to begin at exactly 00:00:00, so any event where only the date is recorded officially happens at that absolute moment of midnight. (And if the date is before 1 Jan 1970, it is a negative number of seconds before that moment.)

However, that date has to reflect the calendar. The 6 dates which were not matching turned out to have been incorrectly recorded in the original documents. One was for 31 April, another for 31 September, three supposedly occurred on 31 November, and one happened on 29 February 1886 – a year which was not actually a leap-year. Each date was impossible in the Gregorian calendar year.

The date libraries I was using to translate from a written date to Unix time assume that humans make errors. If you have the 42nd of January, you must mean the 11th of February because there are only 31 days in January, so 42 – 31 = 11 days into the next month. Since April, September, and November only have 30 days, those dates were assumed to be the first of the next month. And, again, since there were only 28 days in February in 1886, that came back as 1 March.

These are reasonable assumptions, and based on the records we will probably never really know if these events happened on the first, or if they really did happen on the last day of the previous month. But, after talking it over with the client we decided it was slightly more probable the recording official mis-remembered the day rather than the month, and pushed the events back to the last day of the recorded months.

Personally I found it amusing – after I figured out why it was happening.