Package org.jcsp.gpp.demos.concordance

Package org.jcsp.gpp.demos.concordance Description

Concordance provides a typical problem scenario that justifies the use of a parallel solution. Both a sequential and parallel version are provided that use the same data classes.

A number of source files are provided and the resultant output files will be placed in the same folder.

A concordance is a means of determining the places where the same string of words is repeated in a text.

Usually the concordance is constructed for sequences of words for length 1 up to some defined value N.

Usually used for large texts.

Output comprises the strings of words and where they were found in the text.

Processing is carried out in a number of distinct phases.

Phase 1
Read file line-by-line and extract words removing extraneous punctuation
Calculate an integer value based upon the letters that make up the word including hyphens and apostrophes.
It is easier to compare using integers
Just use the ASCII coding for the letter values.
Save values in a list called word-List
This also provides the required data for N = 1

Phase 2
Each value of N will have its own data structures
For strings of length 2 to N
Sum sequences of values depending on the length
Save these values in a list called a sequence-List

Phase 3

 For each of the sequence-Lists 1 up to N
   Find the index of each element that has the same value
      Store this in a map, called equal-Key-Map comprising:
        Key : value
        Entry: list of index values where that value was found
 

Phase 4

 For each of the N equal-Key-Maps
   Process each entry in turn
     Problem is that the same key value may result from different word strings
     Build a map comprising
       Key: String of words
       Entry: index of places where that string was found
     This map is the concordance for that value of N.
 

Commentary
Each data structure is indexed by N
Each data structure is only written to by one of the phases.
The original word list is referred to in phase 4 but is only read.
Hence we can do the processing in parallel for each value of N

 Author, Licence and Copyright statement
 author  Jon Kerridge
 		   School of Computing 
 		   Edinburgh Napier University
 		   Merchiston Campus, 
 		   Colinton Road 
 		   Edinburgh EH10 5DT
 
 Author contact: j.kerridge (at) napier.ac.uk
   
 Copyright  Jon Kerridge Edinburgh Napier University *   
  
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
 
     http://www.apache.org/licenses/LICENSE-2.0
 
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.