i like words and language, so i decided to make a script that builds English sentences (ok, a bit free-form sometimes, call it poetry) out of random words. a part of this script would be a way of detecting syllables, in order to arrange words in some metrically desirable way, or something.
anyway, here is the 'detect_syllables' script i just finished. it is a bit messy, and if people have suggestions for improvements etc, that is very welcome indeed. i did all this through loads of Google and experimentation, so that explains the mess a bit i think.
#!/bin/bash
# script to detect syllables in a word
# count the vowels in the word.
# subtract any silent vowels, (like the silent e at the end of a word, or the second vowel when two vowels are together in a syllable)
# subtract one vowel from every diphthong (diphthongs only count as one vowel sound.)
# the number of vowels sounds left is the same as the number of syllables.
# usage: detect_syllables [word] ([word] [word] ...)
# exit when no argument is given
if [ $# -lt 1 ]; then
echo "$(basename $0): no argument given." >&2
exit 1
fi
# continue if there is an argument
full_count=""
for arg in "$@"; do
# convert to lowercase
word="$( tr '[:upper:]' '[:lower:]' <<< "$arg" )"
# cleanup of the word for syllable-matching:
# remove silent vowel 'e' at the end, unless it is the only vowel in the word
# remove stuff ending in '...ened', which has 2 vowels but is 1 syllable
# remove diphthongs (ou,ai,ei), replace with 'o'
# include 'y' when flanked by consonants (which makes it a vowel), f.e. in 'dyke', replace with 'o'
# ! is special character, translates into a single vowel.
# '#' is special character, translates into double vowels.
# '_' is special character, translates into nothing
clean=$(sed 's/^...$/!/ ; s/^e[^aeiou!#_]/!/ ; s/coax\|ua\|ire\|ove/#/g ; s/i[eao][rt]/#/g ; s/ce$\|se$\|ve$\|mes$\|fe$/_/ ; s/[a-z][oiu].e/!/g ; s/ou/!/g ; s/theater/th#t!r/g ; s/[^aeiou#!_]le$/!/g ; s/e[rt]e/!/ ; s/[aeiou][aeiou]/!/g ; s/[bgt][aeiou][aeiou]/!/g ; s/[^aeiou#!_]y/!/g ; s/#/ii/g ; s/!/i/g' <<< $word)
# 1] 3 letter words are always one syllable
# 2] 'ua' must stay as 2 vowels
# 3] 'ou' must always become 1 vowel, even after 'th'
# 2] to catch 'theatre' (and hopefully more..?)
# 3] consonant l-e syllables. consonants followed by 'le'
# 6] catch double vowels (unless preceded by 'th')
# 7] catch double vowels a 2nd time, unless preceded by certain consonants
# 8] to catch the cases where the 'y' becomes a vowel
# debug:
echo -n "$clean, "
# --- PROBLEM WORDS:
# bake
# above
# immediately
# reactions
# cautious
# closer
# sixes
# strangely
# powerful
# meander
# equaly
# headquarters
# dozens
# custodians
# count the syllables
syll_count=$(grep -io '[aeiou]' <<< "$clean" | wc -w)
# when syllable count returns 0, it must be 1
if [ $syll_count == 0 ]; then
syll_count=1
fi
# add the syllable count to the full count
full_count="$full_count $syll_count"
done
echo $full_count | tail -c +1
exit
Last edited by rhowaldt (2011-09-18 04:27:00)
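For anyone skimming: the counting idea from the header comments (count the vowels, drop a silent final e, treat a run of vowels as one sound) can be sketched in a few lines. This is a simplified illustration and not the script above. `naive_syllables` is a made-up name, it treats y as a vowel everywhere, and it has none of the special-case rules, so plenty of words will still come out wrong.

```shell
#!/bin/bash
# naive_syllables: hypothetical helper name, a simplified sketch only.
naive_syllables() {
    local word groups count
    # lowercase the input
    word=$(tr '[:upper:]' '[:lower:]' <<< "$1")
    # drop a trailing 'e' only when an earlier vowel (or y) exists
    if [[ $word == *[aeiouy]*e ]]; then
        word=${word%e}
    fi
    # replace every run of vowels with one marker, then keep only the markers
    groups=$(sed 's/[aeiouy][aeiouy]*/V/g' <<< "$word")
    groups=${groups//[!V]/}
    count=${#groups}
    # every word has at least one syllable
    if [ "$count" -eq 0 ]; then
        count=1
    fi
    echo "$count"
}

naive_syllables bake        # 1
naive_syllables many        # 2
naive_syllables beautiful   # 3
```

Collapsing vowel runs is what makes 'beautiful' come out as 3 here, but the same shortcut undercounts words where two adjacent vowels really are two sounds.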
Offline
All I can suggest is some trivial polishing up...
Send message to standard error instead of standard output with >&2, and set non-zero exit code:
echo "$(basename $0): no argument given." >&2
exit 1
tr [:upper:] [:lower:] is safer than tr '[A-Z]' '[a-z]' for some alphabets.
Save a subshell "echo $var|process" by using "herestring" <<<
word="$( tr [:upper:] [:lower:] <<< $1 )"
You can put all those sed commands together, separated by semicolons:
clean=$(sed 's/e$//;s/ened$/o/;s/[aeiou][aeiou]/o/g;s/[^aeiou]y[^aeiou]/o/g' <<< $word)
You don't need a temporary file just for grep - you can send it standard input directly from $clean instead.
echo $clean | grep...
or
syll_count=$(grep -io [aeiou] <<< $clean | wc -w)
WARNING I haven't actually tried any of this...
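A tiny sketch of the two here-string tips above, with quoting added so the shell cannot expand the bracket expressions as filename globs. `to_lower` and `count_vowels` are illustrative names only, and `wc -l` is used instead of `wc -w` simply because `grep -o` prints one match per line.

```shell
#!/bin/bash
# to_lower / count_vowels: hypothetical helper names for illustration.
to_lower() {
    # here-string instead of: echo "$1" | tr ...
    tr '[:upper:]' '[:lower:]' <<< "$1"
}

count_vowels() {
    # grep -o prints each match on its own line; tr strips any padding from wc
    grep -io '[aeiou]' <<< "$1" | wc -l | tr -d ' '
}

to_lower "Hello"
count_vowels "Hello"
```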
John
--------------------
( a boring Japan blog , and idle twitterings )
Offline
@johnraff: thanks, incorporated all those and it is still working perfectly fine 
did discover some errors, but they are not in the code but in the method of syllable detection. seems i have not taken into account every single English word out there 
just found out 'many' registered as 1 syllable. changed that already. now to find a way for 'beautiful' to register as 3 instead of 4 syllables...
edit: edited my 1st post with the new script. also incorporated the ability to provide multiple words as arguments, which is working great as well.
Last edited by rhowaldt (2011-09-17 16:27:36)
Offline
Hmm. I'll need to keep an eye on this topic and try it a bit later, for sure. This certainly could be something interesting in the long run. Thank you, rhowaldt. 
Offline
@dubois: thanks man, no problem. i never thought anyone but myself would be able to find a use for something like this. pleasantly surprised! if you're interested, here is my 'words' script that uses the 'detect_syllables' script. it is very much under construction, and you do not have the word-lists that need to go with it (if you want i can provide you those as well of course), but i think you can see what i am getting at here.
#!/bin/bash
# not 100% sure what this is yet. something to do with words (?)
# ... [adjective] [noun]
# define variables for different types of words
ADJ=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/adjective.words)
ADV=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/adverbs.words)
DESC=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/descriptive.words)
NOUN_S=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/nouns_singular.words)
NOUN_P=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/nouns_plural.words)
TRANS=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/transitional.words)
VERB=$(shuf -n 1 /home/rhowaldt/scripts/words_ref/verbs.words)
# detect syllables to determine metrum
DS=~/scripts/detect_syllables
echo "$ADV ($($DS $ADV)) $DESC ($($DS $DESC)) $ADJ ($($DS $ADJ)) $NOUN_P ($($DS $NOUN_P)) $VERB ($($DS $VERB)) $TRANS ($($DS $TRANS))."
exit
some example outputs as it is right now (this will not be its final form of course):
[18:24:54]$ words
many times (2 2) frightfully (3) valid (2) mothers (2) joke (1) after a few days (2 1 1 1).
[18:25:10]$ words
always (2) lazy (2) beautiful (4) twenties (2) stitch (1) for this purpose (1 1 2).
[18:25:20]$ words
in order to (1 2 1) skillful (2) perplexed (3) governors (3) beep (1) however (3).
[18:27:53]$ words
all (1) faintly (2) optimum (3) originals (4) creak (1) for this purpose (1 1 2).
[18:32:59]$ words
yesterday (3) quiet (2) angry (2) formats (2) snooze (1) yet (1).
Last edited by rhowaldt (2011-09-17 16:33:13)
Offline
Yeah. Like this but for computers? I'll definitely give this a go later.
Offline
@dubois: something like that, yes. generative poetry. was thinking of using the syllable-detection for building haikus as well. it can be difficult to make somewhat proper sentences though, that is something i will really have to work on.
Last edited by rhowaldt (2011-09-17 17:14:23)
Offline
# --- PROBLEM WORDS:
# coax
# eager
# extenuations
# although vs theatre
that is about the 10th list so far..
damn, it is pretty difficult writing a script that accurately determines syllables for words. there are so many combinations of letters and grammar is a system of exceptions upon exceptions upon exceptions. but i am determined and will go on. maybe do some Googling on a set of rules for syllable-detection or something..
Last edited by rhowaldt (2011-09-18 01:38:08)
Offline
clean=$(sed 's/^...$/!/ ; s/^e[^aeiou!#_]/!/ ; s/coax\|ua\|ire\|ove/#/g ; s/i[eao][rt]/#/g ; s/ce$\|se$\|ve$\|mes$\|fe$/_/ ; s/[a-z][oiu].e/!/g ; s/ou/!/g ; s/theater/th#t!r/g ; s/[^aeiou#!_]le$/!/g ; s/e[rt]e/!/ ; s/[aeiou][aeiou]/!/g ; s/[bgt][aeiou][aeiou]/!/g ; s/[^aeiou#!_]y/!/g ; s/#/ii/g ; s/!/i/g' <<< $word)
# --- PROBLEM WORDS:
# bake
# above
# immediately
# reactions
# cautious
# closer
# sixes
# strangely
# powerful
# meander
# equaly
# headquarters
# dozens
# custodians
can you believe that with such a sed-expression i still have this huge (and incomplete) 'problem word' list?
now, off to detect the patterns in these words and catch them with sed!
(damn i need to go to sleep, 06.31 here..)
Offline
i would comment the main sed command differently:
do not use shell comments in front of and after the sed command but use sed comments inside the sed command. and also do not use semicolons in long sed commands but newlines.
like this
clean=$(
sed '
# does this
s/^...$/!/
#does that
s/^e[^aeiou!#_]/!/
#does something
s/coax\|ua\|ire\|ove/#/g
#and so on
' <<< $word
)
you might not want to move all comments inside the sed expression but some might also help others to better/faster understand your code.
btw very interesting project
luc
Offline
@luc: thanks man, i wanted to have it like that, but tried with the bash escaped newline and it didn't work. that is why i settled for the comments after the whole sed-expression. i'll try your suggestion, would be much better!
Offline
so i decided to Google 'syllable algorithm', and check it out: http://stackoverflow.com/questions/4051 … -in-a-word
no real solution yet, but seeing that at least two people wrote theses and stuff on this, it is even more difficult than i already thought. might go for either using an existing algorithm or using parts of them. at least i got some complicated theses to read now which will keep me busy for a while.
(my wife asked 'why are you doing this?' - i said 'just for fun'... i must be insane or something :)
Offline
@luc Thank you for that sed insight! 
John
Offline
so, progress report:
practically shoved my entire old arm-length sed-code out, and started afresh. too much hit&miss, and that was my own fault for not working systematically enough on this. so, out with the old, in with the new. even named all the different rules for reference, and specified which words must be excepted and such:
#!/bin/bash
# script to detect syllables in a word
# count the vowels in the word.
# subtract any silent vowels, (like the silent e at the end of a word, or the second vowel when two vowels are together in a syllable)
# subtract one vowel from every diphthong (diphthongs only count as one vowel sound.)
# the number of vowels sounds left is the same as the number of syllables.
# usage: detect_syllables [word] ([word] [word] ...)
# exit when no argument is given
if [ $# -lt 1 ]; then
echo "$(basename $0): no argument given." >&2
exit 1
fi
# continue if there is an argument
full_count=""
for arg in "$@"; do
# convert to lowercase
word="$( tr '[:upper:]' '[:lower:]' <<< "$arg" )"
# cleanup of the word for syllable-matching
# '!' is special character, translates into a single vowel.
# '#' is special character, translates into double vowels.
# '>' is special character, translates into a single consonant.
# '_' is special character, translates into no vowels.
clean=$(
sed '
# --- PREREQUISITES
s/^..$/!/
# A1. 2-letter words are 1 vowel
# --- STARTERS
s/^dia/d#/
# S1. starting with dia-: 2 vowels
# --- ENDINGS
s/ion[s]\?$/!/
# E1. ending in -ion(s): 1 vowel
s/[auoe]re[s]\?$/!>/
# E2a. ending in -re(s): 1 vowel
# E2a. excluded forms: yre, require, fire
s/[dfkmnptvw]e[s]\?$/_/
# E2b. ending in -[dfkmnptvw]e(s): 0 vowels
s/[cghsxz]e$/_/
# E2c. ending in -[cghsxz]e: 0 vowels
# E2c. ending in -[cghsxz]es: 1 vowel
s/ye$/!/
# E2d. ending in -ye: 1 vowel
s/[aeiou]le$\|[aeiou]be$/_/
# E2e. ending in (vowel) -le/-be: 0 vowels
# E2e. ending in (consonant) -le/-be: 1 vowel
s/[^bcdfgkpstxz]led$/_>/
# E3a. NOT ending in -[bcdfgkpstxz]led: 0 vowels
s/\(.*\)[^edtl]ed$/\1_>/
# E3b. ending in -ed: 0 vowels
s/\([aeiou]\)yer$/\1!/
# E3c. ending in -yer: 1 vowel (y because of rule X1.)
s/[uio]er$/#/
# E3d. ending in -[uio]er: 2 vowels
s/ally$/l!/
# E4a. ending in -ally: 1 vowel
s/[ui]ary$/#r!/
# E4b. ending in -[ui]ary: 3 vowels
s/\([^aeiou!#]\)y$/\1!/
# E4c. ending in -y: 1 vowel
s/iest$/#/
# E4d. ending in -iest: 2 vowels
s/cie/c#/
# T5. cie must be 2 vowels
# --- DIPHTHONGS
s/ee\|oo/!/g
# D1. ee: 1 vowel
s/eau/!/
# D2. eau: 1 vowel
s/\([cgx]\)ious/\1!s/
# D . [cgx]ious: 1 vowel
s/ies/!s/
# D . ies: 1 vowel
s/[aeiou][aeiou]\([^_aeiou]\)/!\1/
# D1. catch these because of E2b.
# --- TEST
s/ace/!/
# T1. ace is 1 vowel
s/ver[y!]/v!r!/
# T2a. very must be 2 vowels (E4c.)
s/[aeiou]ve\([^aeiou!#][aeiou!#]\)/!v\1/
# T2b. *ve*
s/e[r]e/!/
# T3. ere must be 1 vowel
s/ike/!k_/
# T4. ike must be 1 vowel
#s/^e[^aeiou!#_]/!/
# e then consonant at start=1 vowel (why? maybe to save this vowel?)
#s/[^q]ua\|ire\|ove[^aeiou#!]\|ia\|io\|ea[c]/#/g
# these must always be 2 vowels
#s/i[eao][rt]/#/g
# ie, ia, io followed by r or t must be 2 vowels
#s/[^eaoiu#!][aoiu][^eaoiu#!_]e[^r]/!/g
# consonant then aoiu then consonant then e must be 1 vowel
#s/i*ou/!/g
# these must always be 1 vowel
#s/[aeiou][aeiou]/!/g
# remaining double vowels must be 1 vowel
#s/[bgt][aeiou][aeiou]/!/g
# remaining double vowels must be 1 vowel when preceded by certain consonant
#s/[et][^aeiou#!_]y/!/g
# e or t then consonant then y make y a vowel and total 1 vowel
# --- SPECIALS
s/[^aeiou#!]y/>!/g
# X1. remaining y must be a vowel
s/>/q/g
# X2. translate special char > into a consonant
s/#/ii/g
# X3. translate special char # into 2 vowels
s/!/i/g
# X4. translate special char ! into 1 vowel
' <<< $word
)
# debug:
echo -n "$clean, "
# --- PROBLEM WORDS:
# determine (ete often 1 vowel)
# triangles
# reinforcement (correct no. vowels, wrong substitution)
# influences
# cat nouns_singular_ds.txt | grep "\*"
# (ended at n)
# --- SPECIALS (probably catch these separately)
# theater, meander, area(s), creative, seance
# coax, chaos
# (E1.) dandelion, lion, scion, ion, pion, axion
# (E2b.) gimme
# (E2b.) recipe
# (E2b.) coyote
# (E2c.) fiance
# (E2c.) blase
# (E2c.) *aches
# (E3b.) embed, seabed, flatbed, roadbed, sickbed, deathbed, slugabed, waterbed, flowerbed
# (E3b.) unfed, malfed, overfed, underfed, breastfed
# (E3b.) unshed, cowshed, -thed
# (E3b.) naked
# (E3d.) -[gq]uer
# (E3d.) tattooer, shampooer
# (E3e.) priest
# (D .) bluest, truest
# (T2a.) every
# count the syllables
syll_count=$(grep -io '[aeiou]' <<< "$clean" | wc -w)
# when syllable count returns 0, it must be 1
if [ $syll_count == 0 ]; then
syll_count=1
fi
# add the syllable count to the full count
full_count="$full_count $syll_count"
done
echo $full_count | tail -c +1
exit
so, it's a bit of a mess combined with orderly goodness, which is what a work in progress should look like, imo.
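The marker trick the script relies on ('!' stands for one vowel sound, '#' for two, and both get expanded to plain vowels right before counting) can be shown in isolation. The sketch below borrows only two of the rules (S1 and E1) and is not a replacement for the full rule set. `marker_count` is a hypothetical name.

```shell
#!/bin/bash
# marker_count: hypothetical, stripped-down illustration of the marker idea.
marker_count() {
    local clean
    clean=$(sed '
        # -ion(s) at the end counts as one vowel sound
        s/ions\{0,1\}$/!/
        # "dia" at the start counts as two vowel sounds
        s/^dia/d#/
        # expand the markers: # becomes two vowels, ! becomes one
        s/#/ii/g
        s/!/i/g
    ' <<< "$1")
    # count whatever vowels are left
    grep -o '[aeiou]' <<< "$clean" | wc -l | tr -d ' '
}

marker_count nation    # 2
marker_count diagram   # 3
```

Because the markers are not vowels themselves, later rules that match on [aeiou] cannot touch a sound that an earlier rule has already decided on.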
last post was about people writing theses on this subject. read one of those and it focuses on word-breaking in print, for the TeX system. since i am only after counting syllables, not breaking off words in the proper places, this did not matter at all to me. so i just went ahead.
i must now pay homage to the greatest website ever for anyone attempting this. check this out: http://www.morewords.com/ends-with-by-length/yer/
that has helped me so much i cannot believe it. yes, tediously skimming through all of those words, but still.
speaking of skimming through lists of words, i have just applied my script to a 3000-word list of singular nouns, and have proceeded to look through this entire list word for word to determine where the mistakes were. this was a tedious process, but it leaves me with a reference-list which i can use to automate this process next time around.
that's it! hope you're all still with me :) still going strong here!
Last edited by rhowaldt (2011-09-26 20:21:28)
Offline
looks nice so far.
one small cosmetic bug:
the test command (aka "[") has special operators for comparing numbers. they are: -eq -ne -gt -ge -lt -le. the symbols = != < > are used to compare strings. and == is only valid in bash but not in sh. This is not really breaking anything but still i would change "if [ $syll_count == 0 ]; then" to "if [ $syll_count -eq 0 ]; then"
(read "man test" for more)
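A small sketch of the difference, assuming nothing beyond the standard test command: -eq compares the values as integers, while = compares them character by character. The function names are made up for the example.

```shell
#!/bin/bash
# num_equal / str_equal: hypothetical wrappers around the two kinds of test.
num_equal() { [ "$1" -eq "$2" ]; }   # integer comparison
str_equal() { [ "$1" = "$2" ]; }     # string comparison

if num_equal 01 1; then
    echo "01 and 1 are the same number"
fi
if ! str_equal 01 1; then
    echo "01 and 1 are different strings"
fi
```

For a syllable count the numeric form is the one that says what is meant, which is why -eq reads better in the script.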
luc
Offline
@luc: thanks, this makes sense and i kinda knew about it, but it is easy to confuse different programming languages, that's why i did it with == because i'm used to that from other languages... will change it up later.
Offline
short update: all the damn manual checking is now finished. i almost have a headache, but luckily not yet.
i now have:
- lists of words (nouns.words, verbs.words, adverbs.words, adjectives.words)
- lists of syllable counts (nouns.wc, verbs.wc, adverbs.wc, adjectives.wc)
- a list with all the wrong conversions (wrong_ds_270911.txt)
to do:
- write a script to quickly compare the output of a list (the syllable counts) to the syllable counts in the .wc files
- may be possible to write a script that automatically detects a change to a previously good word suddenly turned bad as a result of the introduction of a new rule
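The second to-do item could be approached by keeping the list of failing words from the previous run and comparing it with the list from the new run. The sketch below assumes both files hold one failing word per line, which may not match the actual layout of wrong_ds_270911.txt. `new_failures` and the sample words are illustrative only.

```shell
#!/bin/bash
# new_failures OLD NEW: print words that fail in NEW but did not fail in OLD.
# Assumes one word per line in both files (an assumption, not the thread's format).
new_failures() {
    # comm needs sorted input; -13 keeps only lines unique to the second file
    comm -13 <(sort "$1") <(sort "$2")
}

old=$(mktemp)
new=$(mktemp)
printf '%s\n' theater coax meander > "$old"
printf '%s\n' theater coax every > "$new"
new_failures "$old" "$new"   # prints: every
rm -f "$old" "$new"
```

Swapping the two arguments gives the opposite list, the words that a new rule has fixed.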
to get an overview of the current state of things:
979 adjective.words
78 adverbs.words
123 descriptive.words
2635 nouns_plural.words
2877 nouns_singular.words
72 transitional.words
270 verbs.words
7034 total
528 wrong_ds_270911.txt
that means i have a 7.5% failure rate with my script in its current form (528 wrong out of 7034 words). unless my math is totally off, which it could well be as i pretty much suck at it. anyway, i feel like i'm making good progress here!
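The math checks out. A quick way to recompute it after each run, as a sketch with a made-up function name:

```shell
#!/bin/bash
# failure_rate WRONG TOTAL: percentage of wrong words, one decimal place.
failure_rate() {
    # LC_ALL=C forces a decimal point regardless of the system locale
    LC_ALL=C awk -v wrong="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", wrong / total * 100 }'
}

failure_rate 528 7034   # prints 7.5
```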
Last edited by rhowaldt (2011-10-03 21:10:50)
Offline
finished my script for comparing the lists of correct syllable-counts to the lists outputted by the 'detect_syllables' script!
if anyone has recommendations for stuff to change or do more efficient or whatever, please let me know, happy to learn. while i was at it i figured out how to make a fancy progress counter as well :)
#!/bin/bash
# detect_syllables helper script
# to check for differences with .wc files
DS=~/scripts/detect_syllables
DATE="$(date +%d%m%y)"
ADJ=/home/rhowaldt/scripts/words_ref/adjective.words
ADV=/home/rhowaldt/scripts/words_ref/adverbs.words
DESC=/home/rhowaldt/scripts/words_ref/descriptive.words
NOUN_S=/home/rhowaldt/scripts/words_ref/nouns_singular.words
NOUN_P=/home/rhowaldt/scripts/words_ref/nouns_plural.words
TRANS=/home/rhowaldt/scripts/words_ref/transitional.words
VERB=/home/rhowaldt/scripts/words_ref/verbs.words
TEST=/home/rhowaldt/scripts/words_ref/test.words
echo "1. adjectives"
echo "2. adverbs"
echo "3. descriptive"
echo "4. nouns (singular)"
echo "5. nouns (plural)"
echo "6. transitional"
echo "7. verbs"
echo -n "List to check: "
read LIST
case $LIST in
1)
FILE=$ADJ
;;
2)
FILE=$ADV
;;
3)
FILE=$DESC
;;
4)
FILE=$NOUN_S
;;
5)
FILE=$NOUN_P
;;
6)
FILE=$TRANS
;;
7)
FILE=$VERB
;;
*)
echo "WRONG!"
exit 1
;;
esac
echo "Detecting syllables for $(basename $FILE)..."
BS=$(basename "$FILE" .words)
WC="$BS.wc"
TFILE="$BS.tmp-$DATE"
DFILE="$BS.chk-$DATE"
COUNT=0
while read -r LINE
do
((COUNT++))
echo -en "\r$COUNT"
# $LINE is left unquoted on purpose: multi-word lines become separate arguments
echo "$LINE ($($DS $LINE))" >> "$TFILE"
done < "$FILE"
echo "; Done."
echo "Making comparison..."
LN=0
while read -r COMP
do
((LN++))
echo -en "\r$LN"
LNP=$LN"p"
# note: this only works for single-word lines with a one-digit count
NUM=$(echo "$COMP" | grep -o '[0-9]')
NUM2=$(sed -n "$LNP" "$WC")
if [ "$NUM" -ne "$NUM2" ]; then
echo "$COMP" >> "$DFILE"
fi
done < "$TFILE"
echo "; Done."
echo
echo "Results: $TFILE"
echo "Difference: $DFILE"
exit 0
Offline
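An alternative way to do the comparison step, as a rough sketch: put the words, the detected counts and the reference counts side by side with paste and let awk print the rows that disagree. It assumes three parallel files with one entry per line, which is a simplification of the .words and .wc setup in the script above. `mismatches` and the sample data are illustrative.

```shell
#!/bin/bash
# mismatches WORDS GOT EXPECTED: print every word whose detected count
# differs from the reference count. All three files are assumed to be
# parallel, one entry per line.
mismatches() {
    paste "$1" "$2" "$3" |
        awk -F '\t' '$2 != $3 { print $1 " (got " $2 ", expected " $3 ")" }'
}

dir=$(mktemp -d)
printf '%s\n' many beautiful quiet > "$dir/words"
printf '%s\n' 2 4 2 > "$dir/got"
printf '%s\n' 2 3 2 > "$dir/expected"
mismatches "$dir/words" "$dir/got" "$dir/expected"   # prints: beautiful (got 4, expected 3)
rm -rf "$dir"
```

Since paste joins on tabs, an entry such as "2 2" for a two-word line stays in one column and is compared as a whole.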
See also "Word Hy-phen-a-tion by Com-put-er"
Offline
@tranche: thanks, read that one! however, that was written with word break-off in mind, in type-setting situations. i just want to determine the amount of syllables, not their correct break-off points. so that piece, although very informative, turned out not to be so relevant in the end.
Offline
how about command line options instead of interactive script control? I (nearly always) prefer command line options over interactive selection. you can do it like this:
case $1 in
adj|adjective) FILE=$ADJ;;
adv|adverb) FILE=$ADV;;
desc|descriptive) FILE=$DESC;;
singular|noun_singular) FILE=$NOUN_S;;
plural|noun_plural) FILE=$NOUN_P;;
trans|transitional) FILE=$TRANS;;
verb) FILE=$VERB;;
-h|--help) echo "Help message." >&2; exit;;
*) echo "WRONG!"; exit 1;;
esac
Offline
@luc: thanks for that, really good example answering some questions i would've had if you had just suggested that without the example :)
i think i will build this in. i share your love for commandline-options, and actually the only reason i did it through a menu was because i'd never built such a menu before and i wanted to try it out for a change. just teaching myself new stuff as i go along, like the 'count in place' thingy. still think it is awesomely cool how the numbers run up in place. (before that, i just threw a 'echo -n "."' in those while-loops, giving me a pretty cool line of dots... the numbers, however, are even cooler and more functional as i can actually check if the two files are the same length)
Last edited by rhowaldt (2011-10-05 13:04:57)
Offline
@rhowaldt Which is the current "best" version? I'd like to use it in a haiku-generator...
John
Offline
so, it has been a while since i worked on this (couple of months i suppose), but started again tonight. i removed some over-complicated ideas i implemented when i last worked on the script, because they were over-complicated 
what i added is a system (or at least the start of it) to separate composite words. for some cases this is easy, and these cases i have already implemented. for example, split a word like 'autoimmune' into 'auto-immune', so the script may perform its tasks on the separate parts 'auto' and 'immune', instead of mistaking 'oi' for a diphthong (and - wrongly - applying my diphthong rules).
this system has already shown its worth, and it has improved my algorithm considerably. i believe a system such as that is necessary for detecting the syllables in words like 'milestone', 'guideline', 'horsepower', where a silent e in the first part of the composite word is blocked by the second part.
however, it would be almost impossible to consider all possible composites. doing this would mean saying 'if a word ends in 'power', make it '-power' (for the horsepower-example)'. how many words can you come up with that i'd need to check for? the list is pretty much endless. currently i'm already checking for a bunch of them, like '-able', 'auto-', 'hydro-' etc.
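A minimal sketch of the list-based splitting described above, not the code from the actual script. The prefixes and word endings are just the examples mentioned in this post and the previous one, and `split_compound` is a made-up name. Each side has to keep at least three letters so that 'auto' or 'power' on their own are left alone.

```shell
#!/bin/bash
# split_compound: hypothetical helper; the prefix and ending lists are
# illustrative and far from complete.
split_compound() {
    sed -E '
        s/^(auto|hydro)([a-z]{3,})$/\1-\2/
        s/^([a-z]{3,})(power|stone|line)$/\1-\2/
    ' <<< "$1"
}

split_compound autoimmune   # auto-immune
split_compound horsepower   # horse-power
split_compound milestone    # mile-stone
```

The weakness is the one already described: every fixed list produces false splits sooner or later ('praline' would become 'pra-line' here), so the list approach only goes so far.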
so what is on the agenda now is to figure this problem out. how to determine where to split a word? i think this might prove pretty difficult. however, i feel it is also one of the final hurdles i'll have to take to complete this script, so there is some good stuff on the horizon, or so it seems.
well, so much for the update. hope someone is still following this 
Offline
sorry johnraff, it seems the forum didn't give me an update on this thread so i never saw your question. luckily for you, when you read the above story, you'll see i have recently improved the script again. i have posted the current code on pastebin, so you can use it if you still want to: http://pastebin.com/SYaendP9
oh, a couple of hints: remove the 'debug' line (283?) to not output the result of the conversion and the counted number of syllables to the screen.
also, when you are reading the code, and you see !#Exc., that means that word is marked as an exception, and i need to handle it separately because i cannot properly catch it with rules. i have not yet built these exceptions in, so they will fail. i'll see when i get to building those in.
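One possible shape for the exception handling, purely as a sketch since it is not built into the pastebin version: look the word up in a fixed table first and only fall back to the rules when it is not listed. The function names are hypothetical, the words come from the SPECIALS list posted earlier, and `rule_based_count` is a stub standing in for the real script.

```shell
#!/bin/bash
# exception_count WORD: print a hard-coded count, or fail if WORD is not listed.
exception_count() {
    case "$1" in
        chaos|lion|naked) echo 2 ;;
        recipe)           echo 3 ;;
        dandelion)        echo 4 ;;
        *)                return 1 ;;
    esac
}

# syllables_for WORD FALLBACK: FALLBACK is the name of a command that counts
# syllables by rules (assumed to exist; a stub is used below).
syllables_for() {
    exception_count "$1" || "$2" "$1"
}

# stub standing in for the rule-based script, always answers 1
rule_based_count() { echo 1; }

syllables_for recipe rule_based_count   # 3, taken from the table
syllables_for cat rule_based_count      # 1, taken from the stub
```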
Last edited by rhowaldt (2011-12-09 23:44:15)
Offline
Copyright © 2012 CrunchBang Linux.