On a Thread of the Web: Wordle (4)

Wednesday, March 30, 2022

Wordle (4)

Continued from Wordle (3)

While playing Wordle, I started to wonder about a few things: how likely is it that a word appearing in English text has exactly five letters, and which word length occurs most often?

To find out, I wrote a shell script that extracts the words from a given text or PDF file and tabulates the distribution of their lengths.

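#!/bin/bash
# Arguments: $1 = longest word length to count,
#            $2 = file name without its extension,
#            $3 = file type (txt or pdf)

case "$3" in
  "txt")
    echo "txt"
    ;;
  "pdf")
    echo "pdf"
    # Convert $2.pdf to $2.txt
    pdftotext "$2.pdf"
    ;;
  *)
    echo "undefined"
    ;;
esac

for ((i=1 ; i<=$1 ; i++))
do
  # Extract every word of exactly $i letters and fold it to lower case
  perl -nse 'while (/\b[a-z]{$num}\b/ig) {print "$&\n";}' -- -num="$i" "$2.txt" | tr A-Z a-z > tmp.txt
  # Line i of $2-s1.txt: total number of i-letter words
  # (>> appends, so delete old output files before a rerun)
  wc -l < tmp.txt >> "$2-s1.txt"
  # Line i of $2-s2.txt: number of distinct i-letter words
  sort -u tmp.txt | wc -l >> "$2-s2.txt"
  rm tmp.txt
done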
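
The same tally can probably be produced in a single pass over the text, which also makes it easy to print the share of each word length directly. The following is only a sketch, not the script used in this post: the function name `length_distribution` is hypothetical, and it treats every non-letter character as a word separator. That is slightly different from the `\b` word boundary in the Perl pattern, which counts digits and underscores as part of a word, so the numbers may differ a little on text that mixes letters and digits.

```shell
# Hypothetical helper (an assumption, not part of the original script):
# read text on stdin and print one line per word length with the columns
#   length  total_words  distinct_words  share_of_all_words(%)
length_distribution() {
  LC_ALL=C tr 'A-Z' 'a-z' |        # fold to lower case
  LC_ALL=C tr -cs 'a-z' '\n' |     # turn every run of non-letters into one newline
  awk '
    length($0) > 0 {
      n = length($0)
      total[n]++                       # every occurrence of an n-letter word
      if (!seen[$0]++) distinct[n]++   # count each distinct word only once
      words++
      if (n > max) max = n
    }
    END {
      for (n = 1; n <= max; n++)
        printf "%d %d %d %.1f\n", n, total[n], distinct[n], 100 * total[n] / words
    }'
}

# Example: a tiny input with eight words
printf 'The cat and the dog sat. A bird!\n' | length_distribution
```

For a PDF, something like `pdftotext file.pdf - | length_distribution` should work, since `pdftotext` writes to standard output when the output file is given as `-`. The row for length 5 then gives the share of five-letter words that the question above asks about.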