A number of source files are provided and the resultant output files will be placed in the same folder.
A concordance is a means of determining the places where the same string of words is repeated in a text.
Usually the concordance is constructed for sequences of words for length 1 up to some defined value N.
Usually used for large texts.
Output comprises the strings of words and where they were found in the text.
Processing is carried out in a number of distinct phases.
Phase 1
Read file line-by-line and extract words removing extraneous punctuation
Calculate an integer value based upon the letters that make up the word including hyphens and apostrophes.
It is easier to compare using integers
Just use the ASCII coding for the letter values.
Save values in a list called word-List
This also provides the required data for N = 1
Phase 2
Each value of N will have its own data structures
For strings of length 2 to N
Sum sequences of values depending on the length
Save these values in a list called a sequence-List
Phase 3
For each of the sequence-Lists 1 up to N Find the index of each element that has the same value Store this in a map, called equal-Key-Map comprising: Key : value Entry: list of index values where that value was found
Phase 4
For each of the N equal-Key-Maps Process each entry in turn Problem is that the same key value may result from different word strings Build a map comprising Key: String of words Entry: index of places where that string was found This map is the concordance for that value of N.
Commentary
Each data structure is indexed by N
Each data structure is only written to by one of the phases.
The original word list is referred to in phase 4 but is only read.
Hence we can do the processing in parallel for each value of N
Author, Licence and Copyright statement author Jon Kerridge School of Computing Edinburgh Napier University Merchiston Campus, Colinton Road Edinburgh EH10 5DT Author contact: j.kerridge (at) napier.ac.uk Copyright Jon Kerridge Edinburgh Napier University * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Class | Description |
---|---|
ConcordanceCombine | |
ConcordanceData | |
ConcordanceResults | |
ConcordanceWords | |
RunConcordancePoG | |
RunExtendedConcordance | |
RunGoPConcordance | |
SeqConcordance | |
SeqExpandedConcordance |