In a fascinating, if commonsense, 2014 study, Kawase, Hannah and Wang found that being able to see the speakers' lip configuration as they produced the consonant 'r', for example, had a significant impact on how the intelligibility of the word was rated. (Full citation below.) From a teaching perspective, providing visual support or schema for pronunciation work is a given. Many methods, especially those available on the web, rely strongly on learners mirroring visual models, many of them dynamic and very "colorful." Likewise, many, perhaps most, f2f pronunciation teachers are very attentive to using lip configuration, their own or video models, in the classroom.
What is intriguing to me is the contribution of lip configuration and general appearance to f2f intelligibility. There are literally hundreds of studies that have established the impact of facial appearance on perceived speaker credibility and desirability. So why are there none that I can find on perceived intelligibility based on judges' viewing of video recordings, as opposed to just audio? In general, the rationale for audio-only rating is to isolate speech, not allowing the broader communicative abilities of the subjects to "contaminate" the study. That makes real sense on a theoretical level, bypassing racial, ethnic and "cosmetic" differences, but almost none on a practical, personal level.
There are any number of ways to "fake" a consonant or vowel, coming off quite intelligibly while at the same time doing something very different from what a native speaker would do. So why shouldn't there be an established criterion for how the mouth and face look as you speak, in addition to how the sounds come out? It turns out that there is, in some sense. In f2f interviews, being influenced by the way the mouth and eyes are "moving" is inescapable.
Should we be attending more to holistic pronunciation, that is, to what the learner both looks and sounds like as they speak? Indeed. There are a number of methods today that have learners working more from visual models and video self-recordings. That is, I believe, the future of pronunciation teaching, with software systems that provide formative feedback on both motion and sound. Some of that is now available in speech pathology and rehabilitation.
There is more to this pronunciation work than meets the eye! The key, however, is not just visual or video models, but principled "lip service": focused intervention by the instructor (or software system) to assist the learner in intelligibly "mouthing" the words as well.
This gives new meaning to the idea of "good looking" instruction!
Kawase, S., Hannah, B., & Wang, Y. (2014). The influence of visual speech information on the intelligibility of English consonants produced by non-native speakers. J Acoust Soc Am, 136(3), 1352. doi: 10.1121/1.4892770.