When data online is a complete mess.
Sometimes we find data gems that are unusable for analysis or even unfriendly for research. One of such gems is the lifelong labor of love of https://www.joebaugher.com/. Well hidden behind a 1990s façade is a lifelong project to document all the serial numbers of military aircrafts produced in the U.S. from 1908 to now. Meticulous years of research created a beefy archive of serial numbers, often annotated with very useful commentaries on the life of the airplane and his fate.
The data, however, is less than friendly. There is not any trace of modern standards and not only it is presented like a 1980s published data report, but it even escapes the normal html tag system with a baffling use of the <pre> tag. Clearly the rationale is to maintain the formatting of some kind of .txt or word file.

If we want to organize the data in a coherent and clean way, we have to do TEXTSCRAPING! Ah, the lovely nightmare of teach a computer to read formatting designed for human eyes and brains!
The whole archive is organized logically, but only by visual indentation. Blocks of content can pertain to a single serial number or to a bunch of them grouped together. Lines in these blocks can be commentary on the whole block or descriptions of a single serial, which can be single line or multiline. Years are sometimes at the top of the page, some other times they are not.
To a reader, navigate the data is pretty trivial, once you understood the organization, but my computer refuses to be able to understand automatically if a line of text describes a serial number or if it is connected to the previous one. In addition, once I obtained the data and started to work on it, I soon discovered that the visual indentation was not clean and consistent: it could be 4 \t or a bunch of spaces; sometimes contained blank characters or random punctuation slipped through the dedication and continual editing of the author.
The only solution was to plan out a series of steps to decode the data, organize it, and reformat it to my choice of file (I had to transform it to an csv file, but it could be Json or SQL database).
- Create a number of regex patterns to recognize when a line represents the starting of a group, the name of an airplane model, comments, the beginning of a serial number data, or just continuing text of a multi-line serial number description.
- Crawl the text and parse it with the regex patterns to split it in a logical tree that recreates the logical connections in the text.
- Unfurl the tree to create single lines of well-organized data.
REGEX
1 2 3 4 5 6 7 8 | line starts with a serial - range r "(?P<half_serial>^\d+)-(?P<rangemin>\d+)/(?P<rangemax>\d+)(\s+)?(?P<ac_type>.+)?" Starts with a serial number without range r "^((?<![\t ])(?P<halfserial>\d\d)(?:-))?(?P<serial>\d+)\t+(?P<ac_type>.+)" line starts only with a serial r "^(?P<minrange>\d+)/(?P<maxrange>\d+)\s+(?!cancelled)(?P<ac_type>.+)" Starts with a serial (inside a group) r "^((\t+)|(\s{4,}))(?P<poss_serial>\d+)(?!(\d+)?(?:[-/;,:(]))(?P<data>.+)" |
The regex is not trivial, but it is not crazy. My sole suggestion here is LEARN REGEX if you want to work with text and data. Natural language toolkit is another must, but in cases like this is not really useful. You can use online regex tools such as https://regex101.com/ to see if your patterns work. You will almost surely miss some fringe cases, but during debugging you can fix your regex to allow for exceptions.
TREE STRUCTURE
Once the rules are working, it is time to work on your classes for the tree structure and on your crawling function to take apart the data. (below) Remember to document your functions with docstrings. You will forget stuff even while you are coding, and docstrings can keep you sane.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 | class TreeNode: """ Generic Class to represent a node of the serial numbers tree """ def __init__( self , parent, data = None ): self .parent = parent # the parent node self .data = [] if data: self .data.append(data) def add_data( self , toadd: str ): self .data.append(toadd.strip()) def get_data( self ): return " " .join( self .data).strip() class TypeNode(TreeNode): """ Class to represent a 'Group' Node (could be also a single serial) Attributes: ac_type, half_serial, serial, minrange, maxrange, children[], padding Inherit: data[], parent """ def __init__( self , parent, ac_type = "", half_serial = "", minrange = "0" , maxrange = "0" , data = None , children = None ): super ().__init__(parent, data) self .ac_type = ac_type self .half_serial = half_serial self .serial = None self .minrange = int (minrange) self .maxrange = int (maxrange) self .children = [] self .padding = 0 if children: self .children.append(children) def add_child( self , node): self .children.append(node) def get_actype( self ): return self .ac_type.strip() def get_single_serial( self ): if self .half_serial: return f "{self.half_serial}-{self.serial}" else : return f "{self.serial}" def set_serial( self , value): self .serial = value def get_half_serial( self ): return self .half_serial def set_padding( self , value): self .padding = value class AircraftNode(TreeNode): """ Class to represent a 'Single Airplane (serial #)' Node Attributes: serial, half_serial, serial, minrange, maxrange, children, padding Inherit: data[], parent Note: it is a leaf node with no children. """ def __init__( self , parent, serial = "", data = None ): super ().__init__(parent, data) self .serial = serial self .half_serial = parent.get_half_serial() def get_serial( self ): if self .half_serial ! = "": return f "{self.half_serial}-{self.serial}" else : return f "{self.serial}" def get_int_serial( self ): return int ( self .serial) |
These are my classes. I not calling myself a code guru, so take all with a grain of salt. However, they worked. If it is more than a script like here it is a good idea to add type checking. Add instance methods for your classes also helps in streamlining your code and avoid mistakes on data manipulation.
PARSING FUNCTION
A pretty beefy function. For cleanliness there is a number of inner functions.
The code crawls line by line the text, recognizes if the line starts a new group/branch, if it starts a serial/leaf node, or if it is data to add to the latest node.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 | def parselines_to_tree(list_lines: list ) - > Classes.TreeNode: def cleanline(lineText: str ) - > str : pattern = re. compile (r "^([^\t\da-zA-Z]{,1})\t" ) lineText = re.sub(pattern, "\t" , lineText) return lineText.strip( "\n" ) def start_with_range(lineText: str ) - > dict | None : """ check if the line start with a definition of serial number range. If so, it returns a dict containing rangemin: int, rangemax: int, ac_type: str, half_serial: str. else it returns None. """ pattern = re. compile ( r "(?P<half_serial>^\d+)-(?P<rangemin>\d+)/(?P<rangemax>\d+)(\s+)?(?P<ac_type>.+)?" ) if match : = re.search(pattern, lineText): return { "half_serial" : match.group( "half_serial" ), "rangemin" : match.group( "rangemin" ), "rangemax" : match.group( "rangemax" ), "ac_type" : match.group( "ac_type" ), } return None def start_with_onlyrange( lineText: str , ) - > tuple | None : # beginning of line, just range, then name of model """check if it is an only range case. If so return the range. Otherwise None """ pattern = re. compile ( r "^(?P<minrange>\d+)/(?P<maxrange>\d+)\s+(?!cancelled)(?P<ac_type>.+)" ) if match : = re.search(pattern, lineText): return ( match.group( "minrange" ), match.group( "maxrange" ), match.group( "ac_type" ), ) return None def start_with_serial(lineText: str ) - > dict | None : """check if the line start with a serial number, in case returns a dict{'poss_serial': str, 'data': str} else returns None """ # implement the automatic extraction of the serial number, append the first line of the data pattern = re. compile ( r "^((\t+)|(\s{4,}))(?P<poss_serial>\d+)(?!(\d+)?(?:[-/;,:(]))(?P<data>.+)" ) if match : = re.search(pattern, lineText): return { "serial" : match.group( "poss_serial" ), "data" : match.group( "data" ), } return None def serial_no_range(lineText: str ) - > dict | None : """less common case of single serial for aircraft""" pattern = re. compile ( r "^((?<![\t ])(?P<halfserial>\d\d)(?:-))?(?P<serial>\d+)\t+(?P<ac_type>.+)" ) if match : = re.search(pattern, lineText): fullserial = match.group( "serial" ) if match.group( "halfserial" ): fullserial = f "{match.group('halfserial')}-{match.group('serial')}" return { "serial" : match.group( "serial" ), "half_serial" : match.group( "halfserial" ), "ac_type" : match.group( "ac_type" ), } return None def add_type_node( ac_type: str , half_serial: str , serial = "", minrange = 0 , maxrange = 0 , Root = None , padding = 0 , ) - > Classes.TypeNode: type_node = Classes.TypeNode( parent = root, ac_type = ac_type, half_serial = half_serial, minrange = minrange, maxrange = maxrange, ) if serial: type_node.set_serial(serial) if padding > 0 : type_node.set_padding(padding) Root.add_child(type_node) return type_node def add_aircraft_node(Parent, serial, data) - > Classes.AircraftNode: aircraft_node = Classes.AircraftNode(parent = Parent, serial = serial, data = data) if isinstance (Parent, Classes.AircraftNode): Parent.parent.add_child(aircraft_node) else : Parent.add_child(aircraft_node) return aircraft_node root = Classes.TypeNode(parent = None ) current_ac_type = "" rangemin = 0 rangemax = 0 activeNode = root for line in list_lines: line = cleanline(line) if line: if only_serial_Node : = start_with_onlyrange(line): rangemin, rangemax, current_ac_type = only_serial_Node padding = len ( str (rangemin)) activeNode = add_type_node( ac_type = current_ac_type, half_serial = "", minrange = rangemin, maxrange = rangemax, Root = root, padding = padding, ) # add a 'type group node' elif result : = start_with_range(line): padding = len ( str (result[ "rangemin" ])) activeNode = add_type_node( ac_type = result[ "ac_type" ], half_serial = result[ "half_serial" ], minrange = int (result[ "rangemin" ]), maxrange = int (result[ "rangemax" ]), Root = root, padding = padding, ) elif result : = serial_no_range(line): # case of single serial number aircraft at the start of line padding = len ( str (result[ "serial" ])) activeNode = add_type_node( ac_type = result[ "ac_type" ], half_serial = result[ "half_serial" ], serial = result[ "serial" ], minrange = int (result[ "serial" ]), maxrange = int (result[ "serial" ]), Root = root, padding = padding, ) elif result : = start_with_serial( line ): current_parent = activeNode if isinstance (activeNode, Classes.AircraftNode): current_parent = activeNode.parent if ( int (current_parent.minrange) < = int (result[ "serial" ]) < = int (current_parent.maxrange) ): activeNode = add_aircraft_node( Parent = current_parent, serial = result[ "serial" ], data = result[ "data" ], ) else : activeNode.add_data(line) else : activeNode.add_data(line) return root |
Once you are able to create a logical tree with the data, the formatting is a matter of thinking how you need your data organized and if you need more analysis. For example, I needed to be able to search by date, but the date formatting was following multiple standards, so I wrote another function that uses regex again to recognize dates and add them all to a new column consistently formatted for future searches.