Andrea Siotto | When data online is a complete mess.

When data online is a complete mess.

Sometimes we find data gems that are unusable for analysis or even unfriendly for research. One of such gems is the lifelong labor of love of https://www.joebaugher.com/. Well hidden behind a 1990s façade is a lifelong project to document all the serial numbers of military aircrafts produced in the U.S. from 1908 to now. Meticulous years of research created a beefy archive of serial numbers, often annotated with very useful commentaries on the life of the airplane and his fate.

The data, however, is less than friendly. There is not any trace of modern standards and not only it is presented like a 1980s published data report, but it even escapes the normal html tag system with a baffling use of the <pre> tag. Clearly the rationale is to maintain the formatting of some kind of .txt or word file.

The organization of the Data by visual indentation.

If we want to organize the data in a coherent and clean way, we have to do TEXTSCRAPING! Ah, the lovely nightmare of teach a computer to read formatting designed for human eyes and brains!

The whole archive is organized logically, but only by visual indentation. Blocks of content can pertain to a single serial number or to a bunch of them grouped together. Lines in these blocks can be commentary on the whole block or descriptions of a single serial, which can be single line or multiline. Years are sometimes at the top of the page, some other times they are not.

To a reader, navigate the data is pretty trivial, once you understood the organization, but my computer refuses to be able to understand automatically if a line of text describes a serial number or if it is connected to the previous one. In addition, once I obtained the data and started to work on it, I soon discovered that the visual indentation was not clean and consistent: it could be 4 \t or a bunch of spaces; sometimes contained blank characters or random punctuation slipped through the dedication and continual editing of the author.

The only solution was to plan out a series of steps to decode the data, organize it, and reformat it to my choice of file (I had to transform it to an csv file, but it could be Json or SQL database).

Create a number of regex patterns to recognize when a line represents the starting of a group, the name of an airplane model, comments, the beginning of a serial number data, or just continuing text of a multi-line serial number description.
Crawl the text and parse it with the regex patterns to split it in a logical tree that recreates the logical connections in the text.
Unfurl the tree to create single lines of well-organized data.

REGEX

line starts with a serial-range 
    r"(?P<half_serial>^\d+)-(?P<rangemin>\d+)/(?P<rangemax>\d+)(\s+)?(?P<ac_type>.+)?"
Starts with a serial number without range
    r"^((?<![\t ])(?P<halfserial>\d\d)(?:-))?(?P<serial>\d+)\t+(?P<ac_type>.+)"
line starts only with a serial
    r"^(?P<minrange>\d+)/(?P<maxrange>\d+)\s+(?!cancelled)(?P<ac_type>.+)"
Starts with a serial (inside a group)
    r"^((\t+)|(\s{4,}))(?P<poss_serial>\d+)(?!(\d+)?(?:[-/;,:(]))(?P<data>.+)"

The regex is not trivial, but it is not crazy. My sole suggestion here is LEARN REGEX if you want to work with text and data. Natural language toolkit is another must, but in cases like this is not really useful. You can use online regex tools such as https://regex101.com/ to see if your patterns work. You will almost surely miss some fringe cases, but during debugging you can fix your regex to allow for exceptions.

TREE STRUCTURE

Once the rules are working, it is time to work on your classes for the tree structure and on your crawling function to take apart the data. (below) Remember to document your functions with docstrings. You will forget stuff even while you are coding, and docstrings can keep you sane.

class TreeNode:
    """
    Generic Class to represent a node of the serial numbers tree
    """
    def __init__(self, parent, data=None):
        self.parent = parent  # the parent node
        self.data = [] 
        if data:
            self.data.append(data)

    def add_data(self, toadd: str):
        self.data.append(toadd.strip())

    def get_data(self):
        return " ".join(self.data).strip()


class TypeNode(TreeNode):
    """ Class to represent a 'Group' Node 
        (could be also a single serial)
        Attributes: ac_type, half_serial, serial,
                   minrange, maxrange, children[], padding
        Inherit: data[], parent

    """
    def __init__(self, parent, 
                 ac_type="",
                 half_serial="",
                 minrange="0",
                 maxrange="0",
                 data=None,
                 children=None):
        super().__init__(parent, data)
        self.ac_type = ac_type
        self.half_serial = half_serial
        self.serial = None
        self.minrange = int(minrange)
        self.maxrange = int(maxrange)
        self.children = []
        self.padding = 0
        if children:
            self.children.append(children)

    def add_child(self, node):
        self.children.append(node)

    def get_actype(self):
        return self.ac_type.strip()
    
    def get_single_serial(self):
        if self.half_serial:
            return f"{self.half_serial}-{self.serial}"
        else:
            return f"{self.serial}"

    def set_serial(self, value):
        self.serial = value
        
    def get_half_serial(self):
        return self.half_serial

    def set_padding(self, value):
        self.padding = value


class AircraftNode(TreeNode):
    """ Class to represent a 'Single Airplane (serial #)' Node 
        Attributes: serial, half_serial, serial,
                   minrange, maxrange, children, padding
        Inherit: data[], parent
        Note: it is a leaf node with no children.
    """
    def __init__(self, parent, serial="", data=None):
        super().__init__(parent, data)
        self.serial = serial
        self.half_serial = parent.get_half_serial()
    
    def get_serial(self):
        if self.half_serial != "":
            return f"{self.half_serial}-{self.serial}"
        else:
            return f"{self.serial}"
        
    def get_int_serial(self):
        return int(self.serial)

These are my classes. I not calling myself a code guru, so take all with a grain of salt. However, they worked. If it is more than a script like here it is a good idea to add type checking. Add instance methods for your classes also helps in streamlining your code and avoid mistakes on data manipulation.

PARSING FUNCTION

A pretty beefy function. For cleanliness there is a number of inner functions.

The code crawls line by line the text, recognizes if the line starts a new group/branch, if it starts a serial/leaf node, or if it is data to add to the latest node.

def parselines_to_tree(list_lines: list) -> Classes.TreeNode:
    def cleanline(lineText: str) -> str:
        pattern = re.compile(r"^([^\t\da-zA-Z]{,1})\t")
        lineText = re.sub(pattern, "\t", lineText)
        return lineText.strip("\n")

    def start_with_range(lineText: str) -> dict | None:
        """
        check if the line start with a definition of serial number range.
        If so, it returns a dict containing 
        rangemin: int, rangemax: int, ac_type: str, half_serial: str.
        else it returns None.
        """
        pattern = re.compile(
            r"(?P<half_serial>^\d+)-(?P<rangemin>\d+)/(?P<rangemax>\d+)(\s+)?(?P<ac_type>.+)?"
        )
        if match := re.search(pattern, lineText):
            return {
                "half_serial": match.group("half_serial"),
                "rangemin": match.group("rangemin"),
                "rangemax": match.group("rangemax"),
                "ac_type": match.group("ac_type"),
            }
        return None

    def start_with_onlyrange(
        lineText: str,
    ) -> tuple | None:  # beginning of line, just range, then name of model
        """check if it is an only range case.
        If so return the range.
        Otherwise None
        """
        pattern = re.compile(
            r"^(?P<minrange>\d+)/(?P<maxrange>\d+)\s+(?!cancelled)(?P<ac_type>.+)"
        )
        if match := re.search(pattern, lineText):
            return (
                match.group("minrange"),
                match.group("maxrange"),
                match.group("ac_type"),
            )
        return None

    def start_with_serial(lineText: str) -> dict | None:
        """check if the line start with a serial number,
        in case returns a dict{'poss_serial': str, 'data': str}
        else returns None
        """
        # implement the automatic extraction of the serial number, append the first line of the data
        pattern = re.compile(
            r"^((\t+)|(\s{4,}))(?P<poss_serial>\d+)(?!(\d+)?(?:[-/;,:(]))(?P<data>.+)"
        )
        if match := re.search(pattern, lineText):
            return {
                "serial": match.group("poss_serial"),
                "data": match.group("data"),
            }

        return None

    def serial_no_range(lineText: str) -> dict | None:
        """less common case of single serial for aircraft"""
        pattern = re.compile(
            r"^((?<![\t ])(?P<halfserial>\d\d)(?:-))?(?P<serial>\d+)\t+(?P<ac_type>.+)"
        )
        if match := re.search(pattern, lineText):
            fullserial = match.group("serial")
            if match.group("halfserial"):
                fullserial = f"{match.group('halfserial')}-{match.group('serial')}"
            return {
                "serial": match.group("serial"),
                "half_serial": match.group("halfserial"),
                "ac_type": match.group("ac_type"),
            }
        return None

    def add_type_node(
        ac_type: str,
        half_serial: str,
        serial="",
        minrange=0,
        maxrange=0,
        Root=None,
        padding=0,
    ) -> Classes.TypeNode:
        type_node = Classes.TypeNode(
            parent=root,
            ac_type=ac_type,
            half_serial=half_serial,
            minrange=minrange,
            maxrange=maxrange,
        )
        if serial:
            type_node.set_serial(serial)
        if padding > 0:
            type_node.set_padding(padding)
        Root.add_child(type_node)
        return type_node

    def add_aircraft_node(Parent, serial, data) -> Classes.AircraftNode:
        aircraft_node = Classes.AircraftNode(parent=Parent, serial=serial, data=data)
        if isinstance(Parent, Classes.AircraftNode):
            Parent.parent.add_child(aircraft_node)
        else:
            Parent.add_child(aircraft_node)
        return aircraft_node

    root = Classes.TypeNode(parent=None)
    current_ac_type = ""
    rangemin = 0
    rangemax = 0
    activeNode = root
    for line in list_lines:

        line = cleanline(line)
        if line:
            if only_serial_Node := start_with_onlyrange(line):
                rangemin, rangemax, current_ac_type = only_serial_Node
                padding = len(str(rangemin))
                activeNode = add_type_node(
                    ac_type=current_ac_type,
                    half_serial="",
                    minrange=rangemin,
                    maxrange=rangemax,
                    Root=root,
                    padding=padding,
                )
                # add a 'type group node'

            elif result := start_with_range(line):
                padding = len(str(result["rangemin"]))
                activeNode = add_type_node(
                    ac_type=result["ac_type"],
                    half_serial=result["half_serial"],
                    minrange=int(result["rangemin"]),
                    maxrange=int(result["rangemax"]),
                    Root=root,
                    padding=padding,
                )

            elif result := serial_no_range(line):
                # case of single serial number aircraft at the start of line
                padding = len(str(result["serial"]))
                activeNode = add_type_node(
                    ac_type=result["ac_type"],
                    half_serial=result["half_serial"],
                    serial=result["serial"],
                    minrange=int(result["serial"]),
                    maxrange=int(result["serial"]),
                    Root=root,
                    padding=padding,
                )

            elif result := start_with_serial(
                line
            ):  
                current_parent = activeNode
                if isinstance(activeNode, Classes.AircraftNode):
                    current_parent = activeNode.parent
                if (
                    int(current_parent.minrange)
                    <= int(result["serial"])
                    <= int(current_parent.maxrange)
                ):
                    activeNode = add_aircraft_node(
                        Parent=current_parent,
                        serial=result["serial"],
                        data=result["data"],
                    )
                else:
                    activeNode.add_data(line)

            else:
                activeNode.add_data(line)
    return root

Once you are able to create a logical tree with the data, the formatting is a matter of thinking how you need your data organized and if you need more analysis. For example, I needed to be able to search by date, but the date formatting was following multiple standards, so I wrote another function that uses regex again to recognize dates and add them all to a new column consistently formatted for future searches.