Now that I've hooked up preliminary SmartWord filtering into the main Filter utility, I think I'm about ready to make a first public release. There are still a few issues to iron out and these are listed below...
Filter is dropping words... This seems to happen when the word in question spans two "lines". (I define a line here as 255 characters.) At the moment I read one line at a time from the input file and split it into words, which I then check against the database. If a word is found, I write it to the output line and continue to the next word. If it isn't found, it goes into the unknowns list, which is written to a file when the program finishes with the document. What I think I should be doing is this: if a word isn't in the dictionary, check whether we're at the end of the line. If we are, store the word and move on to the beginning of the next line. Then, if the first word of the new line isn't in the dictionary either, add it to the end of the stored word and see whether the combination gives a match. I've tried variations of this a couple of times now, with different but undesired results.
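For anyone following along, here is a rough sketch of the carry-over idea in Python rather than BASIC. It is not the Filter code, and it uses a simpler variant than the one described above: instead of deciding by dictionary lookup, it holds back the last word of any chunk that doesn't end in whitespace and joins it to the start of the next chunk. The names `iter_words`, `chunked` and `filter_words` are made up for illustration, and the dictionary is assumed to be a plain set of lower-case words.

```python
def iter_words(chunks):
    """Yield whole words from an iterable of text chunks.

    A word cut in two by a chunk boundary is rejoined before it is yielded.
    """
    carry = ""  # trailing fragment held over from the previous chunk
    for chunk in chunks:
        text = carry + chunk
        words = text.split()
        # If the text does not end in whitespace, the last word may continue
        # in the next chunk, so hold it back instead of yielding it now.
        if words and not text[-1].isspace():
            carry = words.pop()
        else:
            carry = ""
        yield from words
    if carry:
        yield carry  # the final word of the document


def chunked(text, size=255):
    """Split text into fixed-size "lines" (hypothetical stand-in for file reads)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def filter_words(chunks, dictionary):
    """Sort words into known and unknown lists (dictionary is assumed to be a set)."""
    known, unknown = [], []
    for word in iter_words(chunks):
        (known if word.lower() in dictionary else unknown).append(word)
    return known, unknown
```

With a chunk size of 7, `chunked("the quick brown fox", 7)` cuts both "quick" and "brown" in half, but `iter_words` still yields the four original words. The trade-off is that a word is never looked up until its end has been seen, so there's no guessing about whether two fragments belong together.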
AddWords single word logging... Still haven't got anywhere with this long-standing bug. A full description of it is in the !ReadMe file distributed with the archive.
Done a little optimisation here and there... Managed to shave 2k off the SmartWord API (Libs.SmartW). Also got the word counts for each start letter properly right-aligned in the database statistics report options within SWAdmin.
This update seems to have introduced a few more bugs. For some reason, the output produced by Filter when SmartWord filtering is turned on loses all the spaces between words. I have no idea why, as I haven't made any changes to that part. The main part of the update has been to expand the dictionary to over 5000 words. Still a long way to go, but the program is now at a stage where it can scan a document and produce a list of unknowns from it. This list is then manually tidied up and inserted into the database. Yes... I do put each of the test documents through a spell-checker BEFORE I stick them through this, so all words are correctly spelt.
Made a few changes to the way the tools initialise libraries. Added a feature to the SWAdmin tool that lets the user save a dump of the brief report to the root directory of the database. Cleaned a few other things up and included LibASH in preparation for a rewrite of the SmartWord parsing feature inside the Filter tool.
Hopefully, by using LibASH blocks rather than BASIC strings, I can get around the 255-character limit that's causing words to split between lines on occasion.
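To show why a block should help, here is a minimal Python sketch that treats the whole document as one contiguous byte buffer and finds word boundaries by offset, so no word is ever cut by a line limit. This is only an illustration of the idea: it doesn't use or model the LibASH API, and `words_in_block` is a hypothetical name.

```python
def words_in_block(block):
    """Yield (start, end) byte offsets of each word in a bytes-like block."""
    start = None  # offset where the current word began, or None between words
    for i, byte in enumerate(block):
        is_space = byte in b" \t\r\n"
        if is_space and start is not None:
            yield start, i  # word ended just before this whitespace byte
            start = None
        elif not is_space and start is None:
            start = i  # first byte of a new word
    if start is not None:
        yield start, len(block)  # word running to the end of the block


# Usage: a word longer than 255 bytes still comes out in one piece.
block = bytearray(b"short " + b"x" * 300 + b" end")
words = [bytes(block[s:e]).decode("ascii") for s, e in words_in_block(block)]
```

Here `words` holds three entries, and the middle one is the full 300-character run.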
Anyhow... For those of you that are following, the latest download is at...