There are any number of studies in various fields exploring the impact of racial, age or ethnic "physical presence" (what you look like) on perception of accent or intelligibility. In effect, what you see is what you "get!" The visual will often override the audio, that is, what the learner actually sounds like. Actually, that may be a good thing at times . . .
Haptic pronunciation teaching and similar movement-based methods use visual-signalling techniques, such as gesture, to communicate with learners about the status of sounds, words and phrases. Exactly how that works has always been a question.
Research by Collegio, Nah, Scotti and Shomstein of George Washington University, summarized by Neurosciencenews.com as "Attention scales according to inferred real-world object size", points to something of the underlying mechanism involved: perception of relative object size. The study compared subjects' reaction or processing time when attempting to identify the relative size of objects (as opposed to the size of the image of the object presented on the screen). What they discovered is that, regardless of the size of the images on the screen, the objects that were in reality larger consistently occupied more processing time or attention.
In other words, in "deciding" whether an object is bigger than an adjacent object in the visual field, the brain accesses a spatial model or template of the object, not just the size of the visual image itself. A key element of that process is the longer processing time tied to the actual size of the object.
How does this relate to gesture-based pronunciation teaching? In a couple of ways, potentially. If students have "simply" seen the gestures provided by instructors (e.g., Chan, 2018) and, for example, in effect have just been commanded to make some kind of adjustment, that is one thing. The gesture is, in essence, a mnemonic, a symbol, similar to a grapheme, a letter. The same applies to superficial signalling systems such as color, numbers or finger contortions.
If, on the other hand, the learner has been initially trained in using or experiencing the sign itself, as in sign language, there is a different embodied referent or mapping, one of experienced physical action across space.
In haptic work, adjacent sounds in the conceptual and visual field are first embodied experientially. Students are briefly trained in using three different gesture types, of distinctive lengths and speeds, accompanied by three distinctive types of touch. In initial instruction, students do exercises in which they physically experience combinations of those different parameters as they say the sounds, etc.
For example, the contrastive gestural patterns (done as the sound is articulated) for [I], [i], [i:] and [iy] are progressively longer and more complex: (See linked video models.)
a. Lax vowels, e.g., [I] ("it") - Middle finger of the left hand quickly and lightly taps the palm of the right hand.
b. Tense vowels, e.g., [i] ("happy") - Left and right hands touch lightly with fingertips momentarily.
c. Vowel before a voiced consonant, e.g., [i:] ("dean") - Left hand pushes the right hand, palms touching, firmly 5 centimeters to the right.
d. Tense vowel plus off-glide, e.g., [iy] ("see") - Fingernails of the left hand drag across the palm of the right hand and, staying in contact, then slide up about 10 centimeters and pause.
The same principle applies to most sets of contrastive structures and processes, such as intonation, rhythm and consonants. See what I mean, then, about why embodied gesture for signalling pronunciation differences is much more effective? If not, go here, do a few haptic pedagogical movement patterns (PMPs) just to get the feel of them, and then reconsider!