Friday, October 17, 2008

Root algorithm

Root algorithm sounds like a dental procedure, but it isn't.

What I have is a preliminary algorithm for extracting the root given a Hebrew lexical word. There is nothing about meaning yet! The mechanism seems to work more often than not.

Here is the sequence in English. What do you think? The prefixes and suffixes are explained in more detail here.

Common single prefixes 'א:ה:ב:ל:ת:מ:י:ו:כ'
Common single suffixes 'ה:ך:י:ו:ת'
Common double prefixes 'וא:וה:וב:ול:ות:ומ:וי:וכ'
Common double plural suffixes 'ים:ות'
Common double possessive suffixes 'כם:נו'
Common triple suffixes 'ינו:יכם'

and given a table of roots (you can't do this without some internal memory)
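As a rough sketch, the affix lists above could be loaded into sets for quick lookup. The names used here (`parse_affixes`, `SINGLE_PREFIXES`, `ROOT_TABLE` and so on) are my own labels for illustration, and the three roots in the table are sample entries only, not the real table.

```python
def parse_affixes(spec):
    """Split a colon-separated affix list, as written above, into a set."""
    return set(spec.split(':'))

# Affix lists copied from the post.
SINGLE_PREFIXES = parse_affixes('א:ה:ב:ל:ת:מ:י:ו:כ')
SINGLE_SUFFIXES = parse_affixes('ה:ך:י:ו:ת')
DOUBLE_PREFIXES = parse_affixes('וא:וה:וב:ול:ות:ומ:וי:וכ')
PLURAL_SUFFIXES = parse_affixes('ים:ות')
POSSESSIVE_SUFFIXES = parse_affixes('כם:נו')
TRIPLE_SUFFIXES = parse_affixes('ינו:יכם')

# Hypothetical sample of a root table; the real one would be far larger.
ROOT_TABLE = {'שמר', 'דבר', 'מלך'}
```

Sets make both "is this a common prefix?" and "is this a root?" a single membership test.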

Step 1 - strip obvious plurals and possessives
  • when the length of the lexical word is greater than 5 and the last three characters are a common triple suffix then strip the last three characters - else take the whole word
  • when the length of what remains is greater than 4 and the last two characters are a common double possessive suffix then strip the last two characters - else take what remains
  • when the length of what remains is greater than 5 and the last two characters are a common double suffix and the first two characters are NOT a common double prefix then strip the last two characters
  • when the length of what remains is greater than 4 and the last two characters are a common double suffix and the first character is NOT a common single prefix then strip the last two characters
  • when the length of what remains is greater than 5 and the last character is a common single suffix and the first two characters are NOT a common double prefix then strip the last character
  • when the length of what remains is greater than 4 and the last character is a common single suffix and the first character is NOT a common single prefix then strip the last character
  • else take what remains
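Step 1 might look something like this in Python. It is a sketch under several assumptions of mine: the word is unpointed (one character per consonant), "common double suffix" means the plural suffixes, and the last four conditions are tried in order with only the first one that fits being applied. The function name `strip_affixes` is made up for illustration.

```python
SINGLE_PREFIXES = set('א:ה:ב:ל:ת:מ:י:ו:כ'.split(':'))
SINGLE_SUFFIXES = set('ה:ך:י:ו:ת'.split(':'))
DOUBLE_PREFIXES = set('וא:וה:וב:ול:ות:ומ:וי:וכ'.split(':'))
PLURAL_SUFFIXES = set('ים:ות'.split(':'))
POSSESSIVE_SUFFIXES = set('כם:נו'.split(':'))
TRIPLE_SUFFIXES = set('ינו:יכם'.split(':'))

def strip_affixes(word):
    """Step 1: strip obvious plurals and possessives from an unpointed word."""
    rest = word
    # Triple suffix, only on words longer than 5 characters.
    if len(rest) > 5 and rest[-3:] in TRIPLE_SUFFIXES:
        rest = rest[:-3]
    # Double possessive suffix, only when more than 4 characters remain.
    if len(rest) > 4 and rest[-2:] in POSSESSIVE_SUFFIXES:
        rest = rest[:-2]
    n = len(rest)
    # Assumption: the next four conditions are alternatives, first fit wins.
    if n > 5 and rest[-2:] in PLURAL_SUFFIXES and rest[:2] not in DOUBLE_PREFIXES:
        return rest[:-2]
    if n > 4 and rest[-2:] in PLURAL_SUFFIXES and rest[0] not in SINGLE_PREFIXES:
        return rest[:-2]
    if n > 5 and rest[-1] in SINGLE_SUFFIXES and rest[:2] not in DOUBLE_PREFIXES:
        return rest[:-1]
    if n > 4 and rest[-1] in SINGLE_SUFFIXES and rest[0] not in SINGLE_PREFIXES:
        return rest[:-1]
    # Otherwise take what remains.
    return rest
```

With these rules, `strip_affixes('דברים')` gives `'דבר'` and `strip_affixes('סוסינו')` gives `'סוס'`, while `'מלכים'` comes back unchanged because its first letter is itself a common single prefix.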
Step 2 - apply the following tests in order; as soon as one succeeds you are done. Note the distinction in phrasing: 'match' as opposed to 'contained in'
  1. If you find a match in the root table for what remains, you are done
  2. if what remains is still plural and its length is 4 see if the singular form is in the root table and matches the first two characters
  3. if what remains ends with a common single suffix see if there is a three character root that matches the rest of the word
  4. if what remains begins with a common single prefix see if there is a root that matches the rest of the word
  5. if what remains begins with a common double prefix see if there is a root that matches the rest of the word
  6. if what remains is longer than 4 characters and begins with a common single prefix and ends with a common single suffix see if there is a root that matches the rest of the word
  7. if what remains is longer than 5 characters and begins with a common double prefix and ends with a common single suffix see if there is a root that matches the rest of the word
  8. if what remains is longer than 3 characters and begins with a common single prefix see if there is a root longer than 2 characters that is contained in the rest of the word
  9. if what remains is longer than 4 characters and begins with a common double prefix see if there is a root longer than 2 characters that is contained in the rest of the word
  10. if what remains begins with a common single prefix and ends with a common single suffix see if there is a root matching the rest of the word
  11. if what remains is plural see if the singular form matches the first remaining characters
  12. if what remains contains a mater vav, see if you can find a matching root without the mater
  13. if what remains contains a mater yod, see if you can find a matching root without the mater
  14. if what remains begins with a common single prefix see if there is a root that is contained in the rest of the word
  15. See if there is a root that is contained in the word
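Here is a partial sketch of Step 2 covering only tests 1, 3, 4, 5, 12, 13 and 15, mainly to show the difference between 'match' (the string is an entry in the root table) and 'contained in' (a root appears somewhere inside the string). The function name `find_root`, the sample roots, removing every occurrence of the mater, and preferring the longest contained root in test 15 are all assumptions of mine. It returns the number of the test that succeeded along with the root, which helps when experimenting with the order.

```python
SINGLE_PREFIXES = set('א:ה:ב:ל:ת:מ:י:ו:כ'.split(':'))
SINGLE_SUFFIXES = set('ה:ך:י:ו:ת'.split(':'))
DOUBLE_PREFIXES = set('וא:וה:וב:ול:ות:ומ:וי:וכ'.split(':'))

def find_root(rest, roots):
    """Return (test_number, root) for the first test that succeeds, else None."""
    # Test 1: what remains matches a root exactly.
    if rest in roots:
        return 1, rest
    # Test 3: single suffix, and a three-character root matches the rest.
    if rest[-1:] in SINGLE_SUFFIXES and len(rest) == 4 and rest[:-1] in roots:
        return 3, rest[:-1]
    # Test 4: single prefix, and a root matches the rest.
    if rest[:1] in SINGLE_PREFIXES and rest[1:] in roots:
        return 4, rest[1:]
    # Test 5: double prefix, and a root matches the rest.
    if rest[:2] in DOUBLE_PREFIXES and rest[2:] in roots:
        return 5, rest[2:]
    # Tests 12 and 13: drop a mater vav, then a mater yod, and look for a match.
    # Assumption: every occurrence of the letter is removed.
    for number, mater in ((12, 'ו'), (13, 'י')):
        candidate = rest.replace(mater, '')
        if mater in rest and candidate in roots:
            return number, candidate
    # Test 15: any root contained in the word (assumption: longest root first).
    for root in sorted(roots, key=lambda r: (-len(r), r)):
        if root in rest:
            return 15, root
    return None

# Hypothetical sample root table.
ROOTS = {'שמר', 'דבר', 'מלך'}
```

For example, `find_root('ודבר', ROOTS)` succeeds at test 4, `find_root('והמלך', ROOTS)` at test 5, and `find_root('שומר', ROOTS)` only at test 12, once the vav is dropped.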
Whew! The sequence is critical - I do not have a sense of which sequence will give the best results, because only sometimes can I immediately recognize a correct result! The algorithm knows nothing of the distinction between a noun and a verb. And when root characters disappear - which they do - the algorithm is blind.
