Heterogeneous sensors for teaching spoken French

On October 11, 2003, I sent the following email to Dr. Stephen LaRocca, who at that time was employed by the United States Military Academy as director of the Center for Technology Enhanced Language Learning (CTELL) and Professor of French and Linguistics.

Subject: Nasality/lip-shape detection for CALL

Dr. LaRocca,

I just found your CTELL web site and I found it very interesting. I am a speech recognition researcher, and I am interested in computer-aided language learning (CALL) but I have never worked in it.

The CALL technology I have read about seems based on one-dimensional pronunciation scoring, sometimes enhanced by supplying information about places in the utterance where the score was particularly bad. It would be nice if the software went past this to giving specific feedback (in articulatory or acoustic terms) on how to improve. I wonder if this goal could be brought closer by using additional input devices besides the microphone.

I am interested in your opinion of the following ideas as an expert on CALL and French instruction. I expect you are busy and will not be offended if there is no reply. What I am wondering is, assuming the technology described below can be made to work reliably, do you think there would be enough pedagogical benefit to justify its use? (Let's say either kind of extra device costs $50/seat and requires an extra minute spent at the start of each session adjusting the device's position.)

- To determine whether the student is nasalizing: a nose clip which holds under the nostril a pressure sensor (to detect air flow) or a temperature sensor (to detect warm air from the nostril)

- To determine whether the student is sticking their lips out: video camera(s)

The examples (nasality, lip use) relate to difficulties an English speaker might have with French pronunciation.

The use of a camera to observe lips has the advantage that it observes the lips separately from the other articulators, as opposed to a microphone signal which has the effects of various articulators mixed together. I used the simple example of how far the lips are out because of my ignorance of other aspects of lip usage in French and of how much sophistication computer vision technology can provide.

Thank you, David Gelbart

Dr. LaRocca replied:

Hello David.

Allow me to paraphrase your question in hopes of demonstrating that I understand it. Do I think that additional input devices, including a sensor for nasal airflow and a videocamera aimed at the lips, would benefit those of us who want to automate pronunciation evaluations for learners of languages such as French?

Actually, I have never considered sensors other than microphones for this purpose, though I do understand why you have.

My first reaction is to say hey, why not? Bring on the noseclamps (ouch!) and the videocams. Neither device seems particularly difficult to build or install. Try them out; some students will presumably learn more/better with them than without them. A project at Johns Hopkins University's CLSP Summer 2000 Workshop showed that adding mouth geometry data derived from video camera input improved speech recognition accuracy for televised news broadcasts. A similar approach might be applicable for pronunciation feedback as well. Look under Audio-Visual Speech Recognition at www.clsp.jhu.edu/ws2000

By way of a second reaction, allow me to point out that synchronizing the detection of nasal airflow and lip rounding/spreading with sound segment time boundaries is likely to be difficult. Our CTELL is wrestling right now with how to teach tone in multiword Chinese utterances, and a similar synchronization issue faces us.

So my answer is, I am not sure. Inexpensive modifications to existing computer workstations that directly sense nasal air flow and lip rounding/spreading might provide considerable help to some students of French. Of the two sensors, I like the vidcam better. Sorting out the double set of French front vowels (rounded and unrounded) using a pre-synchronized image with sound (from the camera and its microphone) might be a great place to start.

Good luck with your investigations!

Steve LaRocca
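
The synchronization issue raised in the reply can be made concrete with a small sketch. Nothing in it comes from the correspondence itself: the function name `mean_per_segment`, the sample rate, the `sensor_delay_s` parameter, and the toy data are all hypothetical. It assumes the sound segment boundaries are already known (for example from a speech recognizer's time alignment) and that the sensor lags the audio by a fixed, known delay, which may not hold for a real airflow or temperature sensor.

```python
from statistics import mean


def mean_per_segment(samples, sample_rate_hz, segments, sensor_delay_s=0.0):
    """Average a sensor signal over each labeled sound segment.

    samples        -- sensor readings taken at a fixed rate (hypothetical data)
    sample_rate_hz -- sensor sampling rate in samples per second
    segments       -- list of (label, start_s, end_s), times on the audio clock
    sensor_delay_s -- assumed fixed lag of the sensor behind the audio
    """
    results = []
    for label, start_s, end_s in segments:
        # Shift the audio-clock boundaries onto the sensor's timeline,
        # then convert seconds to sample indices, clamped to the recording.
        first = max(0, round((start_s + sensor_delay_s) * sample_rate_hz))
        last = min(len(samples), round((end_s + sensor_delay_s) * sample_rate_hz))
        window = samples[first:last]
        # A segment with no sensor samples gets None rather than a guess.
        results.append((label, mean(window) if window else None))
    return results


# Toy example: no nasal airflow for 0.1 s, then steady airflow for 0.1 s.
airflow = [0.0] * 10 + [1.0] * 10
segments = [("oral vowel", 0.0, 0.1), ("nasal vowel", 0.1, 0.2)]
aligned = mean_per_segment(airflow, 100, segments)
misaligned = mean_per_segment(airflow, 100, segments, sensor_delay_s=0.05)
```

In the toy data, a 50 ms error in the assumed delay is enough to mix airflow from the nasal vowel into the average for the oral vowel, which is one way to see why per-segment feedback depends on getting the alignment right.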