Still coding the reader for the Surface Mapping System (SMS) data. Last week I got most of the functions working: the files are read in, converted to tab-delimited text, and then stored in a main file. I’ve found that many small file operations are actually much more efficient than manipulating one large file in memory. Originally all of the text (140,000 lines) was held in memory until it was written out in a single step, but that took about 5 minutes to complete. With multiple file I/O commands, the program now runs in about 4 seconds.
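As a rough illustration of the write-as-you-go approach, here is a minimal Python sketch. The function names (`convert_line`, `append_converted`) are made up for this example, and it assumes the raw records are whitespace-separated, which may not match the real SMS format. Each line is converted and written straight to the main file instead of being collected into one big string first.

```python
import io
from typing import Iterable, TextIO


def convert_line(line: str) -> str:
    """Turn one whitespace-separated record into a tab-delimited one.

    Assumption: fields are separated by runs of spaces or tabs.
    """
    return "\t".join(line.split())


def append_converted(source: Iterable[str], main_file: TextIO) -> int:
    """Stream lines from `source` into `main_file`, one write per line.

    Nothing is accumulated in memory; returns the number of lines written.
    """
    written = 0
    for line in source:
        if not line.strip():
            continue  # skip blank lines
        main_file.write(convert_line(line) + "\n")
        written += 1
    return written


# Usage with in-memory stand-ins for the real files (hypothetical data).
demo_in = io.StringIO("1.0  2.0   3.5\n\n4.0 5.0 6.5\n")
demo_out = io.StringIO()
count = append_converted(demo_in, demo_out)
```

With real files, `source` would be an opened input file and `main_file` the main file opened in append mode, so each input file can be processed in turn without holding the whole data set in memory.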
The next problem to solve is allowing only alphanumeric characters and signs in the text. I’ve tried Regex and several variants of compare and replace, but still haven’t found one that works 100% of the time. Roughly 1 out of every 1,000-2,000 lines contains random symbols that don’t appear with any consistency. The function that cleans the data should ideally remove any characters that aren’t on an accepted-characters list, or at least drop the whole line. My current idea is to replace all the accepted characters in a copy of the text with either whitespace or null, then compare the result to the original. Whatever survives in the copy is an unwanted character, so any character that appears in both could be removed from the original, leaving only the characters on the list, in theory.
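A minimal Python sketch of the accepted-characters idea follows; it is an illustration, not the code used in the reader. The accepted list here (ASCII letters, digits, `+`, `-`, `.`, space, and tab) is an assumption about what "alphanumeric characters and signs" should cover, and all function names are hypothetical. It shows a direct filter, a check for dropping bad lines instead, and the two-step "strip the accepted characters, then remove what is left" variant.

```python
import re

# Assumed accepted list: ASCII letters, digits, plus/minus signs,
# decimal point, space, and tab (the field delimiter).
_ACCEPTED = re.compile(r"[A-Za-z0-9+\-. \t]")
_NOT_ACCEPTED = re.compile(r"[^A-Za-z0-9+\-. \t]")


def clean_line(line: str) -> str:
    """Remove every character that is not on the accepted list."""
    return _NOT_ACCEPTED.sub("", line)


def is_clean(line: str) -> bool:
    """True if the line has no unwanted characters (else it could be dropped)."""
    return _NOT_ACCEPTED.search(line) is None


def clean_line_two_step(line: str) -> str:
    """Two-step variant: find the leftovers first, then remove them."""
    # Step 1: delete the accepted characters; what remains is unwanted.
    leftovers = set(_ACCEPTED.sub("", line))
    # Step 2: remove those leftover characters from the original line.
    return "".join(ch for ch in line if ch not in leftovers)


# Usage on a made-up line (newline already stripped) with two stray symbols.
dirty = "12.5\t-3.0\x07\t7\u00a48"
cleaned = clean_line(dirty)  # "12.5\t-3.0\t78"
```

Both cleaning functions give the same result; the direct filter does it in one pass, so the comparison step is only needed if the list of unwanted characters is itself of interest. Note that the newline is not on the accepted list, so lines should have their line ending stripped before cleaning and re-added when written.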