i like words and language, so i decided to make a script that builds English sentences (ok, a bit free-form sometimes, call it poetry) out of random words. a part of this script would be a way of detecting syllables, in order to arrange words in some metrically desirable way, or something.
anyway, here is the 'detect_syllables' script i just finished. it is a bit messy, and if people have suggestions for improvements etc, that is very welcome indeed. i did all this through loads of Google and experimentation, so that explains the mess a bit i think.
#!/bin/bash
# script to detect syllables in a word

# count the vowels in the word.
# subtract any silent vowels, (like the silent e at the end of a word, or the second vowel when two vowels are together in a syllable)
# subtract one vowel from every diphthong (diphthongs only count as one vowel sound.)
# the number of vowels sounds left is the same as the number of syllables.

# usage: detect_syllables [word] ([word] [word] ...)

# exit when no argument is given
if [ $# -lt 1 ]; then
    echo "$(basename $0): no argument given." >&2
    exit 1
fi

# continue if there is an argument
full_count=""
for arg in "$@"; do

    # convert to lowercase
    word="$( tr [:upper:] [:lower:] <<< $arg )"

    # cleanup of the word for syllable-matching:
    # remove silent vowel 'e' at the end, unless it is the only vowel in the word
    # remove stuff ending in '...ened', which has 2 vowels but is 1 syllable
    # remove diphthongs (ou,ai,ei), replace with 'o'
    # include 'y' when flanked by consonants (which makes it a vowel), f.e. in 'dyke', replace with 'o'
    # ! is special character, translates into a single vowel.
    # '#' is special character, translates into double vowels.
    # '_' is special character, translates into nothing
    clean=$(sed 's/^...$/!/ ; s/^e[^aeiou!#_]/!/ ; s/coax\|ua\|ire\|ove/#/g ; s/i[eao][rt]/#/g ; s/ce$\|se$\|ve$\|mes$\|fe$/_/ ; s/[a-z][oiu].e/!/g ; s/ou/!/g ; s/theater/th#t!r/g ; s/[^aeiou#!_]le$/!/g ; s/e[rt]e/!/ ; s/[aeiou][aeiou]/!/g ; s/[bgt][aeiou][aeiou]/!/g ; s/[^aeiou#!_]y/!/g ; s/#/ii/g ; s/!/i/g' <<< $word)
    # 1] 3 letter words are always one syllable
    # 2] 'ua' must stay as 2 vowels
    # 3] 'ou' must always become 1 vowel, even after 'th'
    # 2] to catch 'theatre' (and hopefully more..?)
    # 3] consonant l-e syllables. consonants followed by 'le'
    # 6] catch double vowels (unless preceded by 'th')
    # 7] catch double vowels a 2nd time, unless preceded by certain consonants
    # 8] to catch the cases where the 'y' becomes a vowel

    # debug: echo -n "$clean, "

    # --- PROBLEM WORDS:
    # bake
    # above
    # immediately
    # reactions
    # cautious
    # closer
    # sixes
    # strangely
    # powerful
    # meander
    # equaly
    # headquarters
    # dozens
    # custodians

    # count the syllables
    syll_count=$(grep -io [aeiou] <<< $clean | wc -w)

    # when syllable count returns 0, it must be 1
    if [ $syll_count == 0 ]; then
        syll_count=1
    fi

    # add the syllable count to the full count
    full_count="$full_count $syll_count"
done

echo $full_count | tail -c +1
exit
Last edited by rhowaldt (2011-09-18 04:27:00)
All I can suggest is some trivial polishing up...
Send message to standard error instead of standard output with >&2, and set non-zero exit code:
echo "$(basename $0): no argument given." >&2
exit 1
tr [:upper:] [:lower:] is safer than tr '[A-Z]' '[a-z]' for some alphabets.
Save a subshell "echo $var|process" by using "herestring" <<<
word="$( tr [:upper:] [:lower:] <<< $1 )"
You can put all those sed commands together, separated by semicolons:
clean=$(sed 's/e$//;s/ened$/o/;s/[aeiou][aeiou]/o/g;s/[^aeiou]y[^aeiou]/o/g' <<< $word)
You don't need a temporary file just for grep - you can send it standard input directly from $clean instead.
echo $clean | grep...
syll_count=$(grep -io [aeiou] <<< $clean | wc -w)
WARNING I haven't actually tried any of this...
@johnraff: thanks, incorporated all those and it is still working perfectly fine
did discover some errors, but they are not in the code but in the method of syllable detection. seems i have not taken into account every single English word out there
just found out 'many' registered as 1 syllable. changed that already. now to find a way for 'beautiful' to register as 3 instead of 4 syllables...
edit: edited my 1st post with the new script. also incorporated the ability to provide multiple words as arguments, which is working great as well.
Last edited by rhowaldt (2011-09-17 16:27:36)
Hmm. I'll need to keep an eye on this topic and try it a bit later, for sure. This certainly could be something interesting in the long run. Thank you, rhowaldt.
@dubois: thanks man, no problem. i never thought anyone but myself would be able to find a use for something like this. pleasantly surprised! if you're interested, here is my 'words' script that used the 'detect_syllables' script. it is very much under construction, and you do not have the word-lists that need to go with it (if you want i can provide you those as well of course), but i think you can see what i am getting at here.
#!/bin/bash
# not 100% sure what this is yet. something to do with words (?)
# ... [adjective] [noun]

# define variables for different types of words
ADJ=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/adjective.words)
ADV=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/adverbs.words)
DESC=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/descriptive.words)
NOUN_S=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/nouns_singular.words)
NOUN_P=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/nouns_plural.words)
TRANS=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/transitional.words)
VERB=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/verbs.words)

# detect syllables to determine metrum
DS=~/scripts/detect_syllables

echo "$ADV ($($DS $ADV)) $DESC ($($DS $DESC)) $ADJ ($($DS $ADJ)) $NOUN_P ($($DS $NOUN_P)) $VERB ($($DS $VERB)) $TRANS ($($DS $TRANS))."
exit
some example outputs as it is right now (this will not be its final form of course):
[18:24:54]$ words
many times (2 2) frightfully (3) valid (2) mothers (2) joke (1) after a few days (2 1 1 1).
[18:25:10]$ words
always (2) lazy (2) beautiful (4) twenties (2) stitch (1) for this purpose (1 1 2).
[18:25:20]$ words
in order to (1 2 1) skillful (2) perplexed (3) governors (3) beep (1) however (3).
[18:27:53]$ words
all (1) faintly (2) optimum (3) originals (4) creak (1) for this purpose (1 1 2).
[18:32:59]$ words
yesterday (3) quiet (2) angry (2) formats (2) snooze (1) yet (1).
Last edited by rhowaldt (2011-09-17 16:33:13)
@dubois: something like that, yes. generative poetry. was thinking of using the syllable-detection for building haikus as well. it can be difficult to make somewhat proper sentences though, that is something i will really have to work on.
Last edited by rhowaldt (2011-09-17 17:14:23)
# --- PROBLEM WORDS:
# coax
# eager
# extenuations
# although vs theatre
that is about the 10th list so far..
damn, it is pretty difficult writing a script that accurately determines syllables for words. there are so many combinations of letters and grammar is a system of exceptions upon exceptions upon exceptions. but i am determined and will go on. maybe do some Googling on a set of rules for syllable-detection or something..
Last edited by rhowaldt (2011-09-18 01:38:08)
clean=$(sed 's/^...$/!/ ; s/^e[^aeiou!#_]/!/ ; s/coax\|ua\|ire\|ove/#/g ; s/i[eao][rt]/#/g ; s/ce$\|se$\|ve$\|mes$\|fe$/_/ ; s/[a-z][oiu].e/!/g ; s/ou/!/g ; s/theater/th#t!r/g ; s/[^aeiou#!_]le$/!/g ; s/e[rt]e/!/ ; s/[aeiou][aeiou]/!/g ; s/[bgt][aeiou][aeiou]/!/g ; s/[^aeiou#!_]y/!/g ; s/#/ii/g ; s/!/i/g' <<< $word)

# --- PROBLEM WORDS:
# bake
# above
# immediately
# reactions
# cautious
# closer
# sixes
# strangely
# powerful
# meander
# equaly
# headquarters
# dozens
# custodians
can you believe that with such a sed-expression i still have this huge (and incomplete) 'problem word' list?
now, off to detect the patterns in these words and catch them with sed! (damn i need to go to sleep, 06.31 here..)
i would comment the main sed command differently:
do not use shell commands in front of and after the sed command but use sed comments inside the sed command. and also do not use semicolon in long sed commands but newlines.
clean=$( sed '
    # does this
    s/^...$/!/
    #does that
    s/^e[^aeiou!#_]/!/
    #does something
    s/coax\|ua\|ire\|ove/#/g
    #and so on
' <<< $word )
you might not want to move all comments inside the sed expression but some might also help others to better/faster understand your code.
btw very interesting project
@luc: thanks man, i wanted to have it like that, but tried with the bash escaped newline and it didn't work. that is why i settled for the comments after the whole sed-expression. i'll try your suggestion, would be much better!
so i decided to Google 'syllable algorithm', and check it out: http://stackoverflow.com/questions/4051 … -in-a-word
no real solution yet, but seeing that at least two people wrote theses and stuff on this, it is even more difficult than i already thought. might go for either using an existing algorithm or using parts of them. at least i got some complicated theses to read now which will keep me busy for a while.
(my wife asked 'why are you doing this?' - i said 'just for fun'... i must be insane or something :)
so, progress report:
practically shoved my entire old arm-length sed-code out, and started afresh. too much hit&miss, and that was my own fault for not working systematically enough on this. so, out with the old, in with the new. even named all the different rules for reference, and specified which words must be excepted and such:
#!/bin/bash
# script to detect syllables in a word

# count the vowels in the word.
# subtract any silent vowels, (like the silent e at the end of a word, or the second vowel when two vowels are together in a syllable)
# subtract one vowel from every diphthong (diphthongs only count as one vowel sound.)
# the number of vowels sounds left is the same as the number of syllables.

# usage: detect_syllables [word] ([word] [word] ...)

# exit when no argument is given
if [ $# -lt 1 ]; then
    echo "$(basename $0): no argument given." >&2
    exit 1
fi

# continue if there is an argument
full_count=""
for arg in "$@"; do

    # convert to lowercase
    word="$( tr [:upper:] [:lower:] <<< $arg )"

    # cleanup of the word for syllable-matching
    # '!' is special character, translates into a single vowel.
    # '#' is special character, translates into double vowels.
    # '>' is special character, translates into a single consonant.
    # '_' is special character, translates into no vowels.
    clean=$( sed '
        # --- PREREQUISITES
        s/^..$/!/  # A1. 2-letter words are 1 vowel
        # --- STARTERS
        s/^dia/d#/  # S1. starting with dia-: 2 vowels
        # --- ENDINGS
        s/ion[s]\?$/!/  # E1. ending in -ion(s): 1 vowel
        s/[auoe]re[s]\?$/!>/  # E2a. ending in -re(s): 1 vowel
        # E2a. excluded forms: yre, require, fire
        s/[dfkmnptvw]e[s]\?$/_/  # E2b. ending in -[dfkmnptvw]e(s): 0 vowels
        s/[cghsxz]e$/_/  # E2c. ending in -[cghsxz]e: 0 vowels
        # E2c. ending in -[cghsxz]es: 1 vowel
        s/ye$/!/  # E2d. ending in -ye: 1 vowel
        s/[aeiou]le$\|[aeiou]be$/_/  # E2e. ending in (vowel) -le/-be: 0 vowels
        # E2e. ending in (consonant) -le/-be: 1 vowel
        s/[^bcdfgkpstxz]led$/_>/  # E3a. NOT ending in -[bcdfgkpstxz]led: 0 vowels
        s/\(.*\)[^edtl]ed$/\1_>/  # E3b. ending in -ed: 0 vowels
        s/\([aeiou]\)yer$/\1!/  # E3c. ending in -yer: 1 vowel (y because of rule X1.)
        s/[uio]er$/#/  # E3d. ending in -[uio]er: 2 vowels
        s/ally$/l!/  # E4a. ending in -ally: 1 vowel
        s/[ui]ary$/#r!/  # E4b. ending in -[ui]ary: 3 vowels
        s/\([^aeiou!#]\)y$/\1!/  # E4c. ending in -y: 1 vowel
        s/iest$/#/  # E4d. ending in -iest: 2 vowels
        s/cie/c#/  # T5. cie must be 2 vowels
        # --- DIPHTHONGS
        s/ee\|oo/!/g  # D1. ee: 1 vowel
        s/eau/!/  # D2. eau: 1 vowel
        s/\([cgx]\)ious/\1!s/  # D . [cgx]ious: 1 vowel
        s/ies/!s/  # D . ies: 1 vowel
        s/[aeiou][aeiou]\([^_aeiou]\)/!\1/  # D1. catch these because of E2b.
        # --- TEST
        s/ace/!/  # T1. ace is 1 vowel
        s/ver[y!]/v!r!/  # T2a. very must be 2 vowels (E4c.)
        s/[aeiou]ve\([^aeiou!#][aeiou!#]\)/!v\1/  # T2b. *ve*
        s/e[r]e/!/  # T3. ere must be 1 vowel
        s/ike/!k_/  # T4. ike must be 1 vowel
        #s/^e[^aeiou!#_]/!/  # e then consonant at start=1 vowel (why? maybe to save this vowel?)
        #s/[^q]ua\|ire\|ove[^aeiou#!]\|ia\|io\|ea[c]/#/g  # these must always be 2 vowels
        #s/i[eao][rt]/#/g  # ie, ia, io followed by r or t must be 2 vowels
        #s/[^eaoiu#!][aoiu][^eaoiu#!_]e[^r]/!/g  # consonant then aoiu then consonant then e must be 1 vowel
        #s/i*ou/!/g  # these must always be 1 vowel
        #s/[aeiou][aeiou]/!/g  # remaining double vowels must be 1 vowel
        #s/[bgt][aeiou][aeiou]/!/g  # remaining double vowels must be 1 vowel when preceded by certain consonant
        #s/[et][^aeiou#!_]y/!/g  # e or t then consonant then y make y a vowel and total 1 vowel
        # --- SPECIALS
        s/[^aeiou#!]y/>!/g  # X1. remaining y must be a vowel
        s/>/q/g  # X2. translate special char > into a consonant
        s/#/ii/g  # X3. translate special char # into 2 vowels
        s/!/i/g  # X4. translate special char ! into 1 vowel
    ' <<< $word )

    # debug: echo -n "$clean, "

    # --- PROBLEM WORDS:
    # determine (ete often 1 vowel)
    # triangles
    # reinforcement (correct no. vowels, wrong substitution)
    # influences

    # cat nouns_singular_ds.txt | grep "\*"
    # (ended at n)

    # --- SPECIALS (probably catch these separately)
    # theater, meander, area(s), creative, seance
    # coax, chaos
    # (E1.) dandelion, lion, scion, ion, pion, axion
    # (E2b.) gimme
    # (E2b.) recipe
    # (E2b.) coyote
    # (E2c.) fiance
    # (E2c.) blase
    # (E2c.) *aches
    # (E3b.) embed, seabed, flatbed, roadbed, sickbed, deathbed, slugabed, waterbed, flowerbed
    # (E3b.) unfed, malfed, overfed, underfed, breastfed
    # (E3b.) unshed, cowshed, -thed
    # (E3b.) naked
    # (E3d.) -[gq]uer
    # (E3d.) tattooer, shampooer
    # (E3e.) priest
    # (D .) bluest, truest
    # (T2a.) every

    # count the syllables
    syll_count=$(grep -io [aeiou] <<< $clean | wc -w)

    # when syllable count returns 0, it must be 1
    if [ $syll_count == 0 ]; then
        syll_count=1
    fi

    # add the syllable count to the full count
    full_count="$full_count $syll_count"
done

echo $full_count | tail -c +1
exit
so, it's a bit of a mess combined with orderly goodness, which is what a work in progress should look like, imo.
last post was about people writing theses on this subject. i read one of those and it focuses on word-breaking in print, for the TeX system. since i am only after counting syllables, not breaking off words in the proper places, this did not matter at all to me. so i just went ahead.
i must now pay homage to the greatest website ever for anyone attempting this. check this out: http://www.morewords.com/ends-with-by-length/yer/
that has helped me so much i cannot believe it. yes, tediously skimming through all of those words, but still.
speaking of skimming through lists of words, i have just applied my script to a 3000-word list of singular nouns, and have proceeded to look through this entire list word for word to determine where the mistakes were. this was a tedious process, but it leaves me with a reference-list which i can use to automate this process next time around.
that's it! hope you're all still with me :) still going strong here!
Last edited by rhowaldt (2011-09-26 20:21:28)
looks nice so far.
one small cosmetic bug:
the test command (aka "[") has special operators for comparing numbers. they are: -eq -ne -gt -ge -lt -le. the symbols = != < > are used to compare strings. and == is only valid in bash but not in sh. This is not really breaking anything but still i would change "if [ $syll_count == 0 ]; then" to "if [ $syll_count -eq 0 ]; then"
(read "man test" for more)
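a small sketch of the difference, in case it helps. the function names is_zero and is_zero_string are made up for this illustration and are not part of the detect_syllables script:

```shell
#!/bin/bash
# illustration only: is_zero and is_zero_string are hypothetical helper names

# -eq compares both sides as integers
is_zero() {
    [ "$1" -eq 0 ]
}

# = compares both sides as strings, character by character
is_zero_string() {
    [ "$1" = 0 ]
}

is_zero 0 && echo "0 is numerically zero"
is_zero 00 && echo "00 is numerically zero as well"
is_zero_string 00 || echo "but the string 00 is not the string 0"
```

for a value that comes out of wc the two forms will normally agree, so this is about saying what you mean (and staying portable to sh) more than about fixing a visible bug.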
@luc: thanks, this makes sense and i kinda knew about it, but it is easy to confuse different programming languages, that's why i did it with == because i'm used to that from other languages... will change it up later.
short update: all the damn manual checking is now finished. i almost have a headache, but luckily not yet.
i now have:
- lists of words (nouns.words, verbs.words, adverbs.words, adjectives.words)
- lists of syllable counts (nouns.wc, verbs.wc, adverbs.wc, adjectives.wc)
- a list with all the wrong conversions (wrong_ds_270911.txt)
still to do:
- write a script to quickly compare the output of a list (the syllable counts) to the syllable counts in the .wc files
- may be possible to write a script that automatically detects a change to a previously good word suddenly turned bad as a result of the introduction of a new rule
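the second idea could be sketched roughly like this, assuming there are two lists of failing words (one per line), one made before a rule change and one after. the function name regressions and the file names in the usage line are hypothetical:

```shell
#!/bin/bash
# sketch only: regressions is a hypothetical helper, not an existing script

# regressions OLD_LIST NEW_LIST
# print every line of NEW_LIST that does not appear as a whole line in OLD_LIST,
# i.e. words that were fine before and fail now
regressions() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -Fxv -f "$1" "$2"
}

# usage (hypothetical file names):
#   regressions wrong_before.txt wrong_after.txt
```

if the lists also carry the detected count, like 'beautiful (4)', a word that was wrong before and is now wrong with a different count would show up as new as well, so it is probably easier to compare the bare words.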
to get an overview of the current state of things:
   979 adjective.words
    78 adverbs.words
   123 descriptive.words
  2635 nouns_plural.words
  2877 nouns_singular.words
    72 transitional.words
   270 verbs.words
  7034 total

   528 wrong_ds_270911.txt
that means i have a 7,5% failure rate with my script in its current form. unless my math is totally off, which it could well be as i pretty much suck at it. anyway, i feel like i'm making good progress here!
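the math can be checked from the shell. a minimal sketch using awk, since bash arithmetic only does integers; failure_rate is a made-up helper name:

```shell
#!/bin/bash
# sketch only: failure_rate is a hypothetical helper name

# failure_rate WRONG TOTAL
# print WRONG as a percentage of TOTAL, rounded to one decimal
failure_rate() {
    awk -v wrong="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * wrong / total }'
}

# 528 wrong conversions out of 7034 words
failure_rate 528 7034
```

528 / 7034 is about 0.0751, so the 7,5% figure holds.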
Last edited by rhowaldt (2011-10-03 21:10:50)
finished my script for comparing the lists of correct syllable-counts to the lists outputted by the 'detect_syllables' script!
if anyone has recommendations for stuff to change or do more efficient or whatever, please let me know, happy to learn. while i was at it i figured out how to make a fancy progress counter as well :)
#!/bin/bash
# detect_syllables helper script
# to check for differences with .wc files

DS=~/scripts/detect_syllables
DATE="$(date +%d%m%y)"

ADJ=/home/rhowaldt/scripts/words_ref/adjective.words
ADV=/home/rhowaldt/scripts/words_ref/adverbs.words
DESC=/home/rhowaldt/scripts/words_ref/descriptive.words
NOUN_S=/home/rhowaldt/scripts/words_ref/nouns_singular.words
NOUN_P=/home/rhowaldt/scripts/words_ref/nouns_plural.words
TRANS=/home/rhowaldt/scripts/words_ref/transitional.words
VERB=/home/rhowaldt/scripts/words_ref/verbs.words
TEST=/home/rhowaldt/scripts/words_ref/test.words

echo "1. adjectives"
echo "2. adverbs"
echo "3. descriptive"
echo "4. nouns (singular)"
echo "5. nouns (plural)"
echo "6. transitional"
echo "7. verbs"
echo -n "List to check: "
read LIST

case $LIST in
    1) FILE=$ADJ ;;
    2) FILE=$ADV ;;
    3) FILE=$DESC ;;
    4) FILE=$NOUN_S ;;
    5) FILE=$NOUN_P ;;
    6) FILE=$TRANS ;;
    7) FILE=$VERB ;;
    *) echo "WRONG!"
       exit 1 ;;
esac

echo "Detecting syllables for $(basename $FILE)..."
BS=$(echo $(basename $FILE) | sed 's/\.words//')
WC="$BS.wc"
TFILE="$BS.tmp-$DATE"
DFILE="$BS.chk-$DATE"

COUNT=0
cat $FILE | while read LINE
do
    ((COUNT++))
    echo -en "\r$COUNT"
    echo "$LINE ($($DS $LINE))" >> $TFILE
done
echo "; Done."

echo "Making comparison..."
LN=0
cat $TFILE | while read COMP
do
    ((LN++))
    echo -en "\r$LN"
    LNP=$LN"p"
    NUM=$(echo $COMP | grep -o [0-9])
    NUM2=$(sed -n "$LNP" $WC)
    if [ $NUM -ne $NUM2 ]; then
        echo $(sed -n "$LNP" $TFILE) >> $DFILE
    fi
done
echo "; Done."
echo
echo "Results: $TFILE"
echo "Difference: $DFILE"
exit 0
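one possible simplification, offered as a sketch and not as a replacement for the script above: read the word list and the .wc file side by side on two file descriptors, so there is no need to call sed once per line to fetch line N. this assumes the .wc file holds the expected count(s) on the same line number as the word, in the same format the counter prints (for example '2 2' for a two-word entry). compare_counts and its COUNTER argument are hypothetical names:

```shell
#!/bin/bash
# sketch only: compare_counts is a hypothetical helper

# compare_counts WORDS_FILE WC_FILE COUNTER
# COUNTER is the command that prints the syllable count(s) for its arguments
# prints "entry (actual) expected: N" for every line where the counts differ
compare_counts() {
    # fd 3 reads the word list, fd 4 reads the expected counts, line by line
    while IFS= read -r entry <&3 && IFS= read -r expected <&4; do
        # $entry is left unquoted on purpose: multi-word entries become separate arguments
        actual=$($3 $entry)
        # string comparison, so multi-number results like "2 2" also work
        if [ "$actual" != "$expected" ]; then
            echo "$entry ($actual) expected: $expected"
        fi
    done 3< "$1" 4< "$2"
}

# usage (paths as in the script above):
#   compare_counts adjective.words adjective.wc ~/scripts/detect_syllables
```

the loop stops at the end of the shorter file, so a length mismatch between the two lists would go unnoticed here; the running counter in the original script makes that easier to spot.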
@tranche: thanks, read that one! however, that was written with word break-off in mind, in type-setting situations. i just want to determine the amount of syllables, not their correct break-off points. so that piece, although very informative, turned out not to be so relevant in the end.
how about command line options instead of interactive script control? I (nearly always) prefer command line options over interactive selection. you can do it like this:
case $1 in
    adj|adjective) FILE=$ADJ;;
    adv|adverb) FILE=$ADV;;
    desc|descriptive) FILE=$DESC;;
    singular|noun_singular) FILE=$NOUN_S;;
    plural|noun_plural) FILE=$NOUN_P;;
    trans|transitional) FILE=$TRANS;;
    verb) FILE=$VERB;;
    -h|--help) echo "Help message." >&2; exit;;
    *) echo "WRONG!"; exit 1;;
esac
@luc: thanks for that, really good example answering some questions i would've had would you just have suggested that without the example :)
i think i will build this in. i share your love for commandline-options, and actually the only reason i did it through a menu was because i'd never built such a menu before and i wanted to try it out for a change. just teaching myself new stuff as i go along, like the 'count in place' thingy. still think it is awesomely cool how the numbers run up in place. (before that, i just threw a 'echo -n "."' in those while-loops, giving me a pretty cool line of dots... the numbers, however, are even cooler and more functional as i can actually check if the two files are the same length)
Last edited by rhowaldt (2011-10-05 13:04:57)
so, it has been a while since i worked on this (couple of months i suppose), but started again tonight. i removed some over-complicated ideas i implemented when i last worked on the script, because they were over-complicated
what i added is a system (or at least the start of it) to separate composite words. for some cases this is easy, and these cases i have already implemented. for example, split a word like 'autoimmune' into 'auto-immune', so the script may perform its tasks on the separate parts 'auto' and 'immune', instead of mistaking 'oi' for a diphthong (and - wrongly - applying my diphthong rules).
this system has already shown its worth, and it has improved my algorithm considerably. i believe a system such as that is necessary for detecting the syllables in words like 'milestone', 'guideline', 'horsepower', where a silent e in the first part of the composite word is blocked by the second part.
however, it would be almost impossible to consider all possible composites. doing this would mean saying 'if a word ends in 'power', make it '-power' (for the horsepower-example)'. how many words can you come up with that i'd need to check for? the list is pretty much endless. currently i'm already checking for a bunch of them, like '-able', 'auto-', 'hydro-' etc.
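to make the idea concrete, here is a minimal sketch of such a splitting step. it is not the code from the actual script; split_composite and the short prefix/suffix lists are assumptions chosen only to cover the examples mentioned above:

```shell
#!/bin/bash
# sketch only: split_composite and its word lists are hypothetical

# split_composite WORD
# insert a hyphen after a known prefix and before a known suffix
split_composite() {
    printf '%s\n' "$1" | sed -E '
        # known prefixes, only when at least one more letter follows
        s/^(auto|hydro)([a-z])/\1-\2/
        # known suffixes, only when a letter (not a hyphen) comes right before
        s/([a-z])(power|stone)$/\1-\2/
    '
}

split_composite autoimmune
split_composite horsepower
split_composite milestone
```

the parts could then be counted separately and the counts added up. a fixed list like this can only ever cover the composites someone thought of, which is exactly the problem described above.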
so what is on the agenda now is to figure this problem out. how to determine where to split a word? i think this might prove pretty difficult. however, i feel it is also one of the final hurdles i'll have to take to complete this script, so there is some good stuff on the horizon, or so it seems.
well, so much for the update. hope someone is still following this
sorry johnraff, it seems the forum didn't give me an update on this thread so i never saw your question. luckily for you, when you read the above story, you'll see i have recently improved the script again. i have posted the current code on pastebin, so you can use it if you still want to: http://pastebin.com/SYaendP9
oh, a couple of hints: remove the 'debug' line (283?) to not output the result of the conversion and the counted number of syllables to the screen.
also, when you are reading the code, and you see !#Exc., that means that word is marked as an exception, and i need to handle it separately because i cannot properly catch it with rules. i have not yet built these exceptions in, so they will fail. i'll see when i get to building those in.
Last edited by rhowaldt (2011-12-09 23:44:15)