Daniel Johnson – “The Emotions that Get Stuck in Your Throat”: Expressivity in Speech, Script, and Sound in Japanese Animation

The final episode of the first season of the TV anime My Dress-Up Darling (Sono bisuku dōru ha koi wo suru) (Tokyo MX, 2022) concludes with a phone call between its two protagonists, Gojo Wakana and Kitagawa Marin. The duo are high school classmates and have grown close over their collaborations on Kitagawa’s cosplay hobby, which involves dressing up as characters from her favorite computer games. Gojo, who comes from a family of artisans who make traditional Japanese dolls (hina ningyō), possesses skills that are unexpectedly suited to designing and crafting outfits for Kitagawa. The scene in question comes at the end of their summer vacation.

Kitagawa begins the call by explaining how she has worked herself into a fright by watching a horror movie on her own and wants to talk in order to calm down before going to bed. Gojo agrees, but he’s exhausted from his day and falls asleep midway through their conversation. His phone is still on, however, so Kitagawa can hear his breathing as he dozes off. Knowing that he likely can no longer hear her, she quietly confesses her feelings over the phone, telling Gojo suki da yo (“I love you”). There is no answer from Gojo save for the gentle sounds of his breathing as he drifts off into sleep.

Whispers and breathing are intimate forms of vocalization. They project meaning in how they sound the speaking body, highlighting the physical texture of the voice alongside the sensation of proximity between speaker and listener. The opening example from My Dress-Up Darling is evocative of this potential of the voice in animation for expressing emotion and physical presence through its hushed tones and sighs. Following that sentiment, this article is concerned with the expression of emotion in contemporary TV animation from Japan. The ways that technology (such as the telephone and the microphone) captures emotion through sounds of breathing and whispering voices will form a significant part of this investigation into expressivity, particularly in terms of the sensation of corporeality – a vocalizing body – suggested by said expressions. The animated embodiment of speech and writing will therefore take priority over the denotational content those expressions offer. I see this as part of what is “animating” in animation, in which the unity of form of the human body and structuredness of language are open to playful revision at the service of a cartoon logic of expression that finds meaning in overflowing emotion rather than rational, contained forms of expression.[i]

My Dress-Up Darling (hereafter Dress-Up) is an example this article will return to, although the primary focus is the TV anime adaptation of Komi Can’t Communicate (Komi-san ha komyusho desu) (hereafter Komi-san), which, like Dress-Up, is a high-school coming of age love story and comedy. This article navigates the ways that voice, script, and body in traditional animation are rendered through a technological condition that cartoonishly dramatizes the externalization of emotion through the animation of onomatopoeic script and the overflowing expressivity of the whispering voice. I argue that these two expressive forms provide a way for thinking about the sensation of emotion in Japanese animation, and in a way that connects that mode of expressivity to the drama of being “in” and “outside” of the animated body via the dual registers of voice and text. For the voice, we can observe this through the dynamic created between the speaking character we see on screen and the voice of the unseen actor (seiyū) that figuratively share the same body. In the case of animated text, it is in how onomatopoeic script is used to visually externalize emotions felt by the character(s) through bold sound effects and ideophones.

This article will focus on forms of expression that emphasize emotion and bodily sensation rather than structured communication. Whispers and sighs will take priority over dialogue, and the sound effects of onomatopoeic script that create mimetic shocks and jolts will provide a visual rhyme to this type of overflowing expressivity. Furthermore, in linking these two forms of expression to the (animated) body as a vessel for representing emotion, I will also pay attention to their technological rendition: the microphone for the voice and compositing tools for animated onomatopoeia. It is through this ambivalence between the technological mediation of emotion and the cartoon logic of its expression that I will locate how contemporary Japanese animation figuratively resolves the drama of overflowing emotion in a way that evokes the very mediation of the animated body.


First Day of the New School Year

Komi Can’t Communicate began as a manga written and drawn by Oda Tomohito. It first appeared in 2015 as a one-shot in Weekly Shonen Sunday and launched as a regular series in 2016 in the same magazine. A TV animation produced by OLM Inc. first aired in 2021 on TV Tokyo and its regional affiliates in their late-nighttime slot. It resumed for a second series in 2022. The manga has progressed far beyond this point in the story, with Komi’s ability to vocalize speech growing past what she is capable of in the chapters adapted in the animation.

The series follows a group of classmates at an elite high school in Eastern Japan. It begins with their first year at school as everyone is getting acclimated to the new environment and peer group. The primary focus is on two characters, the female student Komi Shoko and her male classmate Tadano Hitohito. Komi is presented as a striking beauty who quickly becomes the idol of her classmates, but who also suffers from intense anxiety when facing other people. This makes it nearly impossible for her to speak with other people face-to-face.[ii] Tadano is the first to realize that she struggles with communicating with others (rather than merely being aloof or cold), and he decides to help her open up and succeed in her goal of making one hundred friends. The process of Komi making friends serves as a way for the series to introduce new characters, often in small batches, as the classroom peer group begins to expand. The story also quickly grows into a teenage romance between Komi and Tadano as they grapple with their feelings for one another alongside Komi’s difficulties in expressing herself.

The use of onomatopoeia is a common element in the expressive forms of manga, and the print version of Komi-san makes extensive use of these effects to render comic elaborations on its characters’ emotional states. Many of these effects are used in tandem with Komi’s inability to speak freely by employing gitaigo script. These are onomatopoeia used to represent sensations rather than literal sounds and include sensations such as gogogo (a gloomy or menacing aura) or dododo (rising intensity), which appear when another character is confused by the absence of verbal response from Komi. Komi is also frequently depicted in a chibi (diminutively proportioned cartoon illustration) form without a mouth to further emphasize her inability to speak and accompanying awkwardness. Because manga lacks actual sound, these effects often become proxies for representing sounds or the idea of sound, with sensations of loudness, duration of silence, and proximity of source being rendered figuratively through onomatopoeic script and related visual cues (Yomota 94).

The TV animation adaptation of Komi-san includes voiced dialogue, music, and other sound effects, but it still maintains many of the visual cues used to represent sounds that were found in the original manga. This presentation of onomatopoeia was accomplished by editing tools that allow for the compositing of these scripts into the drawn image, a technique that first gained widespread recognition in the 2012 TV animation of JoJo’s Bizarre Adventure (JoJo no kimyō na bōken) (Tokyo MX) made by David Production.[iii] These strings of text often appear as they were originally written in the manga, but with animated effects that produce sensations of movement. Other examples of TV anime that use onomatopoeia effects include the aforementioned My Dress-Up Darling (Tokyo MX), produced by CloverWorks in 2022, and David Production’s 2022 version of Urusei Yatsura (Fuji TV).

Komi’s voice in the TV anime adaptation is provided by Koga Aoi. However, because this character rarely vocalizes speech, her voice is instead portrayed largely through sighing and gasping. In all, her speech is represented primarily through different forms of breathing. The moments in which she does speak – often through a telephone rather than face-to-face communication – are depicted as a whisper. This helps portray her struggles with making the words leave her body. The animation compensates for Komi’s muted speech by relying on what are essentially “close-ups” of her dialogue. This is sometimes achieved by images of her mouth depicted in extreme close-up, but more generally, it is done by artificially amplifying the volume of her voice so that it can be heard clearly despite our understanding that her voice is actually difficult to hear for those around her. The voices of other characters are not amplified in the same manner, hence the feeling of Komi’s voice being portrayed in “close-up.” The treatment of Komi’s hushed voice in this manner demonstrates a variation on the “vococentric” model of sound design Michel Chion has remarked upon, in which the clarity of the voice is privileged within sound design (6).[iv] This also has the added effect of making Komi’s voice (that of Koga Aoi) sound as if it were whispering into our ears.

We can observe how these two forms of expression contribute to the emotional tenor of the series. This is particularly salient in how overflowing emotion is constantly being articulated in relation to a spatial relationship between speaker and listener. The use of visual cues (onomatopoeic script) to represent sounds and sensations turns things that are felt within a character’s heart or mind into something that floats and travels outside of their bodies, but the pronounced emphasis on breathing for Komi’s speech also makes more tangible the sensation of the voice actor speaking or breathing into the microphone. That allusion to the voice actor’s own speaking body introduces another element of expressivity. These strong traces of emotive utterances are both deeply entwined in the material production of animation and how voice is used in said productions. In that sense, the technologies of emotion will be the central focus of my analysis of Komi-san as a work of TV animation from Japan in the twenty-first century.


Whispering into the Receiver

One of the first instances of Komi speaking occurs after she obtains a cell phone for the first time. This happens in episode 3 of the animation. The new acquisition leads her to ask her new friends for their numbers, a process that goes off the rails due to her inability to clearly vocalize that wish. However, after getting her friends to enter their respective numbers into her phone at school, she returns home and fidgets with the new device in her room, enjoying the novelty of being able to (potentially) communicate in a new manner. The model of phone her parents bought for her has the feature of automatically dialing when it is held against the user’s ear, though, so when Komi play-acts at making a phone call, she accidentally dials Tadano. Caught off guard, she immediately hangs up, but he calls her right back to make sure everything is okay.

Komi does manage to answer the phone with her voice, but still in a panic, pretends to be an automated message about the number being out of service. Her voice is extremely timid, speaking in a hushed whisper and barely able to get the words out. Even before answering the phone, Komi’s voice is expressed through a series of short, panicked breaths and cries of distress as she realizes what has happened. Upon answering the phone, she begins by taking a breath that can be heard over the line (the image is of Tadano holding his phone up to speak). She then stammers through the fake disconnection notice to escape from the embarrassing situation. This is followed by another bout of nervous breathing into the phone, after which she hangs up. As the call ends, Tadano notices that she is able to speak – if only a little – over the phone, before rushing to the window and exclaiming “she has a really beautiful voice!” (koe meccha kirei!).

The relationship between the spoken voice and the telephone is a familiar trope in popular media. This is due in large part to the telephone’s ability to close physical distance, which in turn has allowed storytellers to use that device to link distinct spaces via simultaneous action (Gunning; Schantz). Writing on technologies of sound in modern Japan, Kerim Yasar observes how the telephone allowed “the intimacy of the voice to reach the ear in a new way” (Yasar 40-41). In this sense, the telephone is comparable to the radio as a device for making voices feel near and personal. The most salient function of the phone in Komi-san is its role as a virtual microphone within the diegesis for capturing and amplifying Komi’s hushed whispers. It is used, as Simon Frith has written of the microphone, as a device that allows us to “hear people in ways that normally implied intimacy—the whisper, the caress, the murmur” (187). The microphone and telephone both collapse the distance between speaker and listener, a routine that allows for this “implied intimacy” to be expressed by the voice that carries across space, but also allows for things that might normally be difficult to express to be said.

A more prolonged sequence featuring Komi’s voice being carried through the phone occurs in episode 19. Komi and Tadano’s class learns that they will be going on a class trip to Kyoto. The two friends walk home together after the school day is over, but as they part ways, Komi hides herself behind the wall outside of a suburban home and calls Tadano to confess that she has never been to Kyoto before (despite having claimed earlier that this is where her middle school class trip went to). The characters are not facing one another during this sequence, but the use of their phones as a way of bridging the distance (and direction) allows for the conversation to still feel personal and intimate. This conversation is also the most that Komi has spoken aloud to this point, and even with the rendering of her speech as a whisper, we can observe the emotion in what she says as she tries to explain to Tadano how lonely and isolated she has felt in the past. Each piece of dialogue is preceded by a stressed breath, and she repeatedly sighs as she tries to gather her words. Her eyes are rarely shown, and her mouth is drawn as a very small shape – features that emphasize her struggle with communicating (such as being unable to look Tadano in the face to speak to him) while also visually signifying the softness of her voice.

Following Kerim Yasar’s point on the voice “reaching the ear” in a new, intimate way via the telephone, it’s worth considering two complementary elements in how this sense of “reaching” is achieved within animation. The first is how Komi’s spoken voice is represented as hushed, almost to the point of being inaudible. This is achieved through Koga Aoi’s performance of the character, which she describes as whispering “feelings that get stuck in her throat” (nodomoto de tsukkaechau kimochi) (Weekly Shonen Sunday 4). The second is how that same voice is projected as loud through overlapping technological forms – the telephone within the drama of the diegesis, and the microphone within the production of the animation. What we end up with is a voice that is dramatically understood to be quiet and strained, but which we hear as perfectly loud and legible.

This overlap between telephone and microphone creates a doubling of how the voice can be augmented through technology. The animation insists on the amplifying power of the telephone by constantly showing us Komi and Tadano speaking into these devices and holding their phones close to their ears to be able to hear one another. But there is also the sensation (and awareness) of the voice actor whispering into the microphone in the studio. This suggests another form of proximity of the voice and even intimacy of expression that complements what the telephone is represented as providing: the “reaching the ear” as intimate expression. The intimacy of the voice is staged twofold through its technological capture and transmission, and with a speaking body felt more tangibly through its amplified sense of proximity to the listener.

The emphasis on breathing, gasps, and sighs foregrounds the hushed whispers of Komi’s speech, and through their technological mediation, we still understand these expressions despite the constraints on her ability to communicate. In that sense, the priority of expression in Komi’s voice is not found in words or language in a conventional sense, but rather through the ways in which the volume of the voice can project a feeling of closeness between speaker and listener. The voice we’re being offered here is one that projects a sense of being near and personal, filled with meaning and emotion, but sometimes in ways that feel difficult to grasp. This is the sensation of being “stuck in your throat,” as Komi’s seiyū describes it. This notion of the voice being “stuck” or somehow between different roles/performers has been an important part of how voicework is imagined in Japan (and especially for animation), which the following section will elaborate upon further.


Speaking of the Voice

Dieter Mersch has described the voice as what “marks boundaries between the inwardness and outwardness of the subject” (26). In animation, this sense of the voice being something that holds expressivity within and without the body seems especially relevant due to the composite nature of how the recorded voice is paired with the drawn body. Pamela Robertson Wojcik has similarly observed that voice acting is sometimes described as “unfastened from the body,” a sentiment that echoes Mersch’s assertion (71). Shunsuke Nozawa has described how the language of “the person inside” (naka no hito) has developed as a kind of homegrown theorization by voice actors to describe the process of recording in the sound booth and the sensation of mediation found within that setting (170). Sugawa Akiko elaborates on this notion, describing how the discourse on “the person inside” in magazines and internet media helped establish a routine of fantasy in which the voice actor is figuratively located at the verge of 2D and 3D regimes of expression, with their voice belonging to themselves but also the character(s) they play (19, 62). This disassociation of the voice from the performer is frequently repeated when voice actors speak at live appearances in front of their fans. Such appearances serve as a kind of performance of shared values that entertains the fans’ interest in seeing their fantasy of how voices are subsumed by characters be reflected back to them, but also in reinforcing the fantasy of voices as animating forces that “breathe life” into characters with vitalizing agency (Nozawa 172).

We can see that sensibility in the way that Koga Aoi describes her relationship to Komi. In an interview with Weekly Shonen Sunday, Koga describes how she felt emotionally closer to the character due to being nervous during the early recording sessions, which she likens to Komi’s own extreme anxiety in speaking before others (Weekly Shonen Sunday 10). This feeling of association is further articulated when Koga describes how she was “proud” of Komi for gaining the courage to express herself throughout the course of the series, similar to how Koga herself began to feel more confident with the role over time (Weekly Shonen Sunday 10).

This routine of “disembodiment” of actor from character is a regular part of the work of a seiyū in Japan. The extra-textual spaces of interviews, public appearances, and radio are where the self-sustaining discourse between performers and audiences often affirms itself. However, on a more material level, we can find this disassociation of body and voice in the production stage. Nozawa notes that what makes this sensation of disembodiment function is the microphone, the device that technologically separates the voice from the speaking body (172). Fujitsu Ryouta offers a detailed explanation of how the recording of the voice in animation is practiced in Japan, also noting how the “feeling of vitality” (seimeikan) that we perceive in an animated character is allowed by the process of “after-recording,” in which the voice actor conducts their performance with footage of the character on a screen before them (103). The overlapping (kasanari) of the “color” (iro) of both actor and character is, as Sugawa Akiko describes, a result of the virtual synthesis of the actor’s “body” (shintai) with the “information” (jōhō) of the character within the animated image (29). Sugawa extends this idea to consider the significance of the corporeal body (nikutai) and copresence (kyōzai) in live, in-person performances by voice actors in Japan (such as musicals) where the bodies of performers (and audiences) share space with the characters (37).

Breathing into the microphone offers a particularly powerful feeling that it is the corporeal body (nikutai) of a human that is speaking. The “color” of a character’s voice – which may be defined by hyperbolic ways of speaking or even language not normally used in everyday speech – is reduced in favor of a very human and very ordinary form of vocalization, and one that we associate more readily with a human body than an animated one. The expression of a character’s personality or role within the action is also scaled back by the assertion of the whispering voice, one that more distinctly conjures the presence of a human body vocalizing and the emotional weight of its exhalation.

Writing on the introduction of microphones to radio dramas and singing, Jacob Smith describes how this device allowed for a sensation of closeness in the voice (82). The microphone encouraged the sense of an “intimate connection” between speaker and listener by capturing nuance in the voice and through the directness that was instilled by listening at home and in private (Smith 85). The aesthetic effect of this was to locate the voice of the singer or speaker “in our ear,” coming through clearly even when accompanied by other forms of sound, such as music. Animations such as Komi-san are far removed from the sound culture that Smith is describing, but how the voice is received as if speaking directly and intimately through the microphone conjures a very similar affect of human connection that is felt as overflowing with emotion.

Vocalizations that are not strictly language are particularly impactful in this regard. We find instances of Komi’s breath being used as a way of expressing her emotional state throughout the series. This includes the pattern of a deep exhalation followed by an inhalation being used to mark Komi’s pause as she expects something to happen but also gasps being used to identify moments of shock and embarrassment. We can also observe this in more isolated instances, such as when she deeply exhales after Tadano stumbles in his mock love confession during the “confession game” of episode 14, which shows her exiting the classroom and then deeply sighing into the palms of her hands once out of sight of the other students. This is also accompanied by the sound effect of steam venting as if to figuratively portray her embarrassment in a more comical manner.

One of the most dramatic examples of this can be found in episode 24 when Tadano visits Komi’s home on White Day to return her gift from Valentine’s Day one month earlier.[v] Komi is away when he first arrives, so her mother shows Tadano to Komi’s room and tells him to wait. However, her father soon appears to join Tadano, sitting across from him in awkward silence (like his daughter, he does not readily speak). Komi eventually returns from her errand and, learning that Tadano is already there, bursts into the room. She quickly moves to push her father out in a panic and then tries to hurriedly clean up stray books and clothes so as not to appear disorganized in front of Tadano. This sequence features an array of breath-like vocalizations, from loud gasps of embarrassment and frustration to the labored breathing of the exertion of running around in a hurry.

The preceding moments are filled with (comically) dramatic piano music, exaggerating Tadano’s awkward attempts to impress Komi’s father. However, once Komi arrives, the music stops and the soundtrack is replaced with the sounds of her feet moving about the room and her vocalizations of gasps, sighs, and deep breaths. Throughout the sequence, her face is largely depicted in chibi form, without a drawn mouth, but the noises of her vocalization are amplified to be more audible and exaggerated. Even as she moves about the space of the room, her voice also sounds “close” enough for us to hear, and always at the same volume and tone. The moment when she sits down across from Tadano and tries to collect herself also has her deep breaths accompanied by graphic representations of onomatopoeia (fuuu) and cartoon clouds of breath to further exaggerate the emotional and physical strain of her breathing (see Figure 1). This sequence is uncommonly noisy within the series but is also noteworthy for how much expression of voice it contains while also restricting those expressions to non-speech vocalizations. The dramatic effect is to render Komi’s frantic attempt to save herself from embarrassment (and the ensuing moment of exhaustion), but this use of sound also produces a powerful sensation of the voice actor breathing into the mic in a way that is remarkably physical and felt as having virtual proximity to the audience/listener.

Figure 1: Komi (left) struggles to catch her breath after Tadano (right) visits her home on White Day. Note the use of expressive markers such as the puffs of steam representing Komi’s breathing, the visualized onomatopoeia (fuuu), and the red lines of her face depicting blushing. The beads of sweat on Tadano’s head are used to a similar effect.

What these different examples demonstrate is how the voice is frequently given a composite existence in animation. Voices need to be heard clearly and with meaning but also anchored to the figure of the onscreen character as their source. The pursuit of the former sometimes complicates the latter. This is because the technological compositing of the speaking body re-introduces the voice as belonging to the unseen voice actor (possessing their own physical body), which in turn highlights the technological amplification of the voice. In the previous example, we can observe this relationship between voice and image in how the volume of Komi’s voice is the same no matter where she stands in relation to the “camera,” which suggests a flatness to the soundscape of animation that resonates with the visual flatness that is often associated with television animation in Japan. Thomas Lamarre has described how this flatness in anime emerges from compositing techniques that shift the activity of motion to the surface of the image even as we perceive depth of space (130–137). The use of sound here provides a kind of rhyme to that method in how it artificially levels the soundscape to allow us to focus on the expression of Komi’s emotional state (via her breathing) even if the staging of the characters does not strictly follow that dynamic of space. In other words, if the conventional “flatness” of the anime image allows us to feel sensations of motion and action even when those things aren’t animated in full or in depth, the “flatness” of sound here produces a related feeling of sound being drawn to the “surface” by prioritizing the legibility of perception over the strict matching of Komi’s voice with her position and actions within the image.

If this relationship between voice and image produces one potential effect of “flatness” in how we perceive sound, the use of onomatopoeic script suggests another mixing of dimensional cues. These strings of text within the image also provide a form of emotional expression that transmutes the inner emotional lives of onscreen characters into externalized effects and noise. The following section will discuss onomatopoeic script as an elaboration on how expressions of intense, overflowing emotion are given priority over specific instances of speech.


Expanded Text

Textual depictions of onomatopoeia have been used in manga since the 1950s (Natsume 112). Yomota Inuhiko has described these effects as providing the “clamorous” (sōzōshii) nature of manga and as a way of overcoming the lack of voice or other literal sounds within the written page (93-94). However, it is not just sounds that are represented this way; onomatopoeia also depicts other sensations, including more abstract notions of emotion and duration. These pieces of writing typically appear outside of the spaces where we find dialogue speech balloons, bubbles meant to depict a character’s thoughts, or even the voice of a narrator figure appearing in a window within the panel. Neil Cohn describes speech balloons and thought bubbles as “emergent” due to the way the text within them is “bound” to the image despite not existing within the image matter itself (38). Furthermore, onomatopoeic effects are usually not written in the same standardized script as other forms of text, but rather, as Natsume Fusanosuke has observed, drawn by hand, allowing them to flirt more closely with becoming something akin to a sound or motion within the image (116). These qualities all set portrayals of onomatopoeia in manga apart from other forms of writing that appear within the panel (koma), such as speech balloons.

Robert S. Petersen notes how manga often employ these expressions to “convey the essence of lived sensations” and “fuse the sign/icon into a single sensation” (163). He emphasizes this notion of “sensation” in how onomatopoeic effects are used, describing them as producing a “sensual presence” through which the narrative is embodied (Petersen 165). The affinity for sensation means that onomatopoeic script is conceptually bound to the idea of a body that can be the locus for feeling, whether it be the atmosphere of sound around it, or the ability to receive different sensations of emotional and tactile input. The ability to capture the general abstraction of a feeling (its “essence”) into a single expression of sound or sensation is also made possible by the reference to a body that feels. This reference allows the “essence” of a sensation to be rendered at the scale of the personal and the proximate. Yomota identifies something similar, describing onomatopoeia in manga as making the “essence” (honshitsu) of an emotion clearer by rendering it at the zenith of the sensation it represents, such as someone screaming for joy, an explosion, or the pop of a baseball bat as the batter hits a homerun (93). For Natsume Fusanosuke, this connection with feeling and sensation in manga also lends onomatopoeia an expression of freedom (jizaisa) that is distinct from the more rational mindset of modern society (124).

Paize Keulemans connects written onomatopoeia to the spaces in which sounds are produced (53). Writing on Chinese martial arts literature, Keulemans observes how onomatopoeia can conjure the feeling of public spaces such as markets, temples, and fairs, which are themselves alive with sound. In these novels, onomatopoeia provides the sensation of a space with volume and dimension, but also a form of expression that is always recalling another type of sound and location. This allows these images-as-sounds to be legible to us, and, as with Petersen’s notion of being embodied within a narrative, places the body within a space of action and volume.

Keulemans elaborates further, explaining how the sensory-driven form of expression of onomatopoeia allows for a “dissolution between the boundary between the reading subject and textual object” (Keulemans 56). This recalls Robert Peterson’s argument that onomatopoeia makes the general sensations being described in manga feel personal and intimate, while the sensual element of sounds, actions, and emotions places the window of identification in the body of the reader, rendering what is being written as sensation. Peterson even describes the relationship between the reader and onomatopoeia in manga as one of performance, with the reader figuratively “performing” the action of the comic as a type of ventriloquist as they whisper the sounds and sensations depicted as onomatopoeia (164). In a way, this echoes the sensation of the voice being “in” and “outside” of the animated body through its dual-anchoring in both the onscreen character and the offscreen voice actor, forming an ambivalent interval of sound and body.

The use of animated onomatopoeia in anime adds some complexities to these understandings of how said forms are used in literature and manga. In animations such as Komi-san, these scripts exist and move within the space of the image but are also given a more strongly pronounced textual appearance. They appear in colorful text, move in short repeating loops, and hover around or behind characters as specters of sensation. Their animation through digital editing tools even allows for warping effects to depict these scripts as moving in ways that mimic the sensations they represent. Furthermore, although some of these animated blocks of text are used to depict sounds, they are more commonly and boldly used to depict sensations of emotion. The coiling waves of gogogo in JoJo’s Bizarre Adventure provide a weight of dread to go along with the menacing figures they accompany, while the kokkuri of Komi-san is used to further “animate” the nodding of her head in lieu of vocalized speech as she communicates with other characters.

What is perhaps distinct about the use of onomatopoeia in anime is the way these image-sounds and sensation-sounds appear to exist within a volume of space. The inserts of onomatopoeia often appear to hover around characters, sometimes floating behind and beside them, or even moving around the characters as the script extends to express something. This is because these script effects are drawn independently of the primary image and then composited into the frame, often as looped clips that play as sub-animations within the image. One example from episode 6 of Dress-Up depicts Kitagawa Marin (the female lead) surrounded by floating scripts of dokidoki that throb like beating hearts. This is accompanied by her internal monologue about the realization of her feelings for her friend, Gojo Wakana (Figure 2). However, despite being part of the space of the image, these effects remain ambivalent in terms of their expressivity within the diegesis. The pronounced use of sound effect onomatopoeia (giongo) allows for the sense that the characters also hear these noises (hence the collapse of diegetic separation that Keulemans describes), but because anime such as Dress-Up and Komi-san tend to favor onomatopoeia that depict sensations of emotion, the notion that they are perceivable to the characters is less clear. In some cases, such as the gogogo effect in JoJo’s Bizarre Adventure, what the onomatopoeia represents is an aura that the character is projecting to others. In that sense, we might surmise that these are a more tangible manifestation of how a character is perceived by others around them. However, those same characters that behold this spectacle of text don’t react specifically to the onomatopoeia graphics, so it might be more precise to think of them as a projection of what other characters feel or perceive rather than a specific, literal evocation.

Figure 2: Kitagawa Marin is figuratively swarmed by script representing the sound of her pounding heart as she realizes her feelings for her friend, Gojo Wakana. Note the use of heart-shapes for the dakuten diacritics.

An instructive example of this can be found in the first episode of Komi-san. One of the first scenes shows Tadano arriving at school on the first day of classes, thinking to himself about how the year will go as he retrieves his shoes from the locker. He notices another student (who turns out to be Komi) has arrived at the same row of lockers, and turns to exchange greetings. However, before he can finish saying “good morning,” he is struck by her incredible beauty, and then arrested by the (seemingly) cold aura she is giving off and her surprisingly rigid way of moving.

Komi turning her head to face Tadano is accompanied by a musical cue one might expect in a thriller or suspense film, but also by animated text depicting gigigi, the onomatopoeia used for a grating noise. The script for this expression seems to radiate from Komi’s head and neck, but also “floats” in the image slightly behind her, creating a sudden feeling of depth in the image. This moment concludes with a gakun, which is used to represent a jerking movement, again referring to Komi’s awkward, stilted movements. The animation then cuts to a shot of Komi’s face, with her eyes looking down at Tadano (off screen) in a seemingly menacing manner, and waves of dododo onomatopoeia moving in a vertical column beside her. The dododo strands continue in the next shot but now circle around and between Komi and Tadano, as if the two of them were standing inside a whirlpool of text (Figure 3). After a moment Komi scampers off in a hurry and without saying a word, with the onomatopoeia dododo being used again, this time to represent the sound of her hasty footsteps. The scene then concludes with Tadano catching his breath after the strange encounter, with animated effects of dokidoki (representing a pounding heartbeat) emerging from his clutched chest.[vi]

This sequence presents a series of onomatopoeia effects, each representing an actual sound or a more abstract sensation of emotion. They are all based around the same “sound,” dododo, but are applied with different iterations of what that can express: the feeling of panic, the sound of tapping or knocking, and finally the evolution into a pounding heartbeat (dokidoki). The use of the same sound effect creates a rhyming scheme that, particularly in how the script is animated, allows for the intensification of the moment to rapidly escalate, exaggerating the strange intensity and Tadano’s perception of Komi’s aura.[vii]

Figure 3: Tadano reacts to Komi’s intimidating aura as they meet for the first time. The two are surrounded by swirling scripts of dododo, which give figurative voice to both Komi’s cold aura and Tadano’s racing heartbeat.

Philip Brophy has observed what he describes as the “core dilemma” of animation, which is resolving the organically produced sounds of music and the human voice with the artificial images that move onscreen (134). He elaborates by describing how our relationship to the animated world is distinct from our relation to the natural world. Nature is formal and material, while animation is “syncopative” and percussive, emphasizing how music and sound seem to help “bring to life” the things we see onscreen through routines of synchronization (Brophy 135). In a way, the script effects of onomatopoeia in animations such as Komi-san produce a similar sensation by not only making the sounds and emotions of that world visible to us but also by portraying them as things that virtually inhabit the same world as the characters they are anchored to, providing a (as Brophy might call it) syncopative articulation of the “organic” materials of feeling and emotion with the “artificial” materials of animation and its technological construction. There is a synchronization of how internal emotion is rendered as an external image and sign. This offers another sense of the fantasy of what is “inside” and “out” à la the discourse on “the person inside” within anime subculture. Just as the concept of “the person inside” offers an explanation to the fantasy of characters having their own vitality and existence, the projection of onomatopoeia as images that move outside of the body of the one who feels is a fantastical way of portraying how emotions can be understood and experienced.

With that in mind, we might return to Peterson’s claim that the reader “performs” the action of manga by reading and subvocalizing onomatopoeia effects (164). In anime, the external agency of the viewer isn’t needed to “perform” the vocalization of onomatopoeia, so the mode of performance that Peterson describes might not be at hand in the same manner. We, as the audience, are no longer performing the action of sensation via onomatopoeia, but the animated renditions of onomatopoeia are projecting back to us something like Peterson’s notion of “lived sensations” of emotion. These visual effects are also often accompanied by sound effects or music, furthering the sense of their “animation” by complementary sensation, but also producing something like the “oblique polyphony” that Ryan Holmberg has described in gekiga (adult-oriented) manga that use text and diagrams to describe one another as caption-like elements that share the same page (443).


Hooked on a Feeling

As a way of concluding this article, I would like to return to the topic of emotion and its expression via animated onomatopoeic script and whispering vocalization. These are both hyperbolic forms of expression, but also in some ways imprecise. They project an abundance of feelings, but the content of those feelings is often less clear. This is because the sense of emotion portrayed in these expressions often captures that very quality of being overflowing in a more present or impactful way than it does a particular meaning, or even the ability to express something in a concrete way. Yomota Inuhiko’s observation about the ways in which onomatopoeia is often used to depict the zenith of emotion in manga is a significant point in understanding what register of feeling is being portrayed in animations such as Komi-san (93). Natsume Fusanosuke’s point about onomatopoeia in manga offering an expression of freedom that is somehow distinct from or beyond the rationality of modern society is also important in understanding how this style of expression of feeling is often so intense but also unformed (124). Whispers, gasps, and cries of embarrassment similarly convey emotions that are played at (forgive the pun) maximum volume, conjuring a powerful sensation of their feeling as something that is reverberating through a human body (the voice actor) in addition to that of the animated character.

Returning to the concept of register, these are the emotions that are almost too strong to escape the speaker’s mouth, with feelings of embarrassment or bewilderment being among the most common forms of expression that we find. The heavy sighs and deep breaths, as well as the bold, dramatic scripts of onomatopoeia, are pitched at a register that approximates these feelings and the frustration of their expression. The voice whispered into the microphone and animated scripts of onomatopoeia allow for the rescaling of emotion to be “big” (as in a loud voice or a feeling projected outward) but also personal and intimate, close to the body and the ear. This is accomplished through technological mediation that allows for figurative amplification of sound and emotion alike.

Part of the aesthetic effect of techniques such as whispering and animated onomatopoeia is to make the daunting weight of those overflowing feelings seem manageable, but also playful. This isn’t to say that this style of manga or animation is cathartic or therapeutic. Instead, it arrives at a form of expressivity that resonates with the imagination of emotion found in anime in which bold declaration and action provide a fantasy of emotional directness and agency of feeling. Japanese TV dramas, popular music, and animation frequently connect this frustrated expression of emotion to the turbulence of youth (seishun). The “zenith” of emotional expression is sometimes dramatized in moments such as love confessions between students or shared triumphs by teammates, but we also find many representations of the inability to express emotion directly in popular song lyrics and scenes in TV dramas, such as stalled love confessions that get (to borrow Koga Aoi’s phrase again) “stuck in the throat.” Dorothy Finan describes how the language of youth (particularly the term seishun) has been featured in popular music in Japan, and particularly in how said music depicts a fantasy of shared experience through settings such as schools and the common routines of the “struggle” of youth in contemporary Japan (2022).

The inability to articulate a feeling is, in a way, being caught between the language of expression and the body that feels. Much in the way that the voice in animation is always between one body and another (character and seiyū) and that onomatopoeic script is between text and image, the sensation of being caught between things and imagining some type of meaning within that ambivalence is part of the power of these expressions. Animations such as Komi-san and Dress-Up render that ambivalence of expression and feeling as something characterful and charming, but also in a way that allows them to feel like they make more sense by turning what is inside (the “feelings caught in your throat”) into something that is projected or amplified out into the world. In that sense, emotion in anime such as Komi-san is no longer felt by just the body that expresses, but by those that sense that body in image and sound.


Daniel Johnson teaches and publishes in Japanese media studies and contemporary culture. His previous writing has touched on subjects related to electronic text, image rendering, and dubbing.


Works Cited

Brophy, Philip. “The Animation of Sound,” in Movie Music, The Film Reader, ed. Kay Dickinson, New York: Routledge, 2002, pp. 133–142.

Buhler, James. “The End(s) of Vococentrism,” in Voicing the Cinema: Film Music and the Integrated Soundtrack, ed. James Buhler and Hannah Lewis, Champaign, Illinois: University of Illinois Press, 2020, pp. 278–296.

Bukatman, Scott. “Some Observations Pertaining to Cartoon Physics; or, the Cartoon Cat in the Machine,” in Animating Film Theory, ed. Karen Beckman, Durham: Duke University Press, 2014, pp. 301–316.

Chion, Michel. The Voice in Cinema, trans. Claudia Gorbman, New York: Columbia University Press, 1999.

Cohn, Neil. “Beyond Speech Balloons and Thought Bubbles: The Integration of Text and Image,” Semiotica Vol. 197, 2013, pp. 35–63.

Finan, Dorothy Lisa. “Struggle as Value: Exploring the Thematic Importance of seishun in Japanese Idol Music,” PhD dissertation, School of East Asian Studies, University of Sheffield, 2022.

Frith, Simon. Performing Rites: On the Value of Popular Music, Cambridge: Harvard University Press, 1998.

Fujitsu, Ryouta. “On Voice Actors: An Empirical Investigation of Their History,” (seiyū-ron tsuushiteki, jyusshouteki ikkousatsu), in An introduction to anime studies: Investigating 11 points (Anime kenkyu nyumon: Anime wo kiwameru 11 no kotsu), ed. Koyama Masahiro, Sugawa Akiko, Gendai Shokan, 2018, pp. 93–117.

Gunning, Tom. “Heard It Over the Phone: The Lonely Villa and the de Lorde Traditions of the Terrors of Technology,” Screen, Vol. 32 No. 2, 1991, pp. 184–196.

Holmberg, Ryan. “For Your Words, I Shall Rip Out Your Tongue: Shirato Sanpei and the Talking Head of Manga,” International Journal of Comic Arts, Vol. 8 No. 1, 2006, pp. 426–455.

Keulemans, Paize. “Listening to the Printed Martial Arts Scene: Onomatopoeia and the Qing Dynasty Storyteller’s Voice,” Harvard Journal of Asiatic Studies, Vol. 67 No. 1, 2007, pp. 51–87.

Komi Can’t Communicate – Anime Cast Interview!! (Komi-san ha komyushou desu anime kyasuto intabyuu!!), Shogakukan No. 19, April 6, 2022, pp. 10–11.

Lamarre, Thomas. The Anime Machine: A Media Theory of Animation, Minneapolis: University of Minnesota Press.

Mersch, Dieter. “Presence and Ethicity of the Voice,” in Vocal Music and Contemporary Identities: Unlimited Voices in East Asia and the West, ed. Christian Utz and Frederick Lau, London: Routledge, 2017, pp. 25–44.

Natsume, Fusanosuke. Why is Manga so Interesting? (Manga ha naze omosiroi no ka – sono hyōgen to bunpo), NHK Library, 1997.

Nozawa, Shunsuke. “Ensoulment and Effacement in Japanese Voice Acting,” in Media Convergence in Japan, ed. Jason Karlin and Patrick Galbraith, Kinema Club, 2016, pp. 169–199.

Oda, Tomohito. “Interview with Oda-sensei,” (Oda-sensei intabyuu) in Komi Can’t Communicate Official Fan Book (Komi-san ha komyusho desu kōseki fuan bukku), Shogakukan, 2021, pp. 178–183.

Ono, Masahiro. The World of Onomatopoeia (Onomatope – giongo gitaigo no sekai), Kadokawa, 2019.

Peterson, Robert. “The Acoustics of Manga,” in A Comics Studies Reader, ed. Jeet Heer and Kent Worcester, Jackson: University Press of Mississippi, 2009, pp. 163–171.

Sakamoto, Mamoru. “The Merits and Demerits of Overflowing Television Subtitles” (Hanran suru jimaku bangumi no kōzai), GALAC, June, 1999. http://www.maroon.dti.ne.jp/mamos/tv/jimaku.html (accessed September 24, 2023).

Schantz, Ted. “Telephonic Film,” Film Quarterly, Vol. 56 No. 4, 2003, pp. 23–35.

Smith, Jacob. Vocal Tracks: Performance and Sound Media, Oakland: University of California Press, 2008.

Sterne, Jonathan. The Audible Past: Cultural Origins of Sound Reproduction, Durham: Duke University Press, 2003.

Sugawa, Akiko. Cultural Theory of 2.5 Dimensions: Stage, Character, and Fandom (2.5 jigen bunkaron – butai, kyarakutaa, fandamu), Seikyuusha, 2021.

Weekly Shonen Sunday (Shuukan shonen sandee) (2021) Komi Can’t Communicate – Anime Cast Interview!! (Komi-san ha komyushou desu anime kyasuto intabyuu!!), Shogakukan No. 45, October 6, 2021, pp. 4–5.

Wojcik, Pamela Robertson. “The Sound of Film Acting,” Journal of Film and Video, Vol. 58 No. 1, 2006, pp. 71–83.

Yasar, Kerim. Electrified Voices: How the Telephone, Phonograph, and Radio Shaped Modern Japan, 1868–1945, New York: Columbia University Press, 2018.

Yomota, Inuhiko. A Discussion of Manga (Manga genron), Chikuma Shobō, 1994.


[i] My use of “cartoon logic of expression” is borrowed from Scott Bukatman’s (2014) discussion of cartoon physics. However, rather than approaching the “cartoon” as a matter of the plasmatic in action and form, I am interested in the particular register of emotional expression found in animation.

[ii] According to Oda Tomohito, the author of Komi-san, the phrase “communication disorder” (komyu shō) was gaining traction among the general public in Japan circa 2016, which inspired him to use that concept for his manga (178).

[iii] Other forms of onscreen text, commonly known as telop (television opaque projector), have been used in Japanese television since the 1960s. These were originally achieved through the use of slides that could be run through transparent and opaque projection systems, allowing for the effect of text being superimposed over the key image. The use of comic book style captions for comedic effect became popular in the 1980s and 90s and has since become a staple of variety television in Japan. For more on the rise of telop effects in Japanese television, see Sakamoto (1999). For a dramatic portrayal of how telop have been produced in Japanese television, see the film We Couldn’t Become Adults (Boku tachi ha minna otona ni narenakatta) (2021, dir. Yoshihiro Mori).

[iv] James Buhler has suggested that “expressive sound effects” such as robot voices present an alternative to Chion’s vococentrism by employing “centrifugal” sound alongside a “centripetal” image (289).

[v] White Day in Japan is the companion to Valentine’s Day and is held on March 14 (one month later). Girls and women typically give gifts of chocolate on Valentine’s Day, with those gifts being reciprocated with similar treats by boys on White Day.

[vi] The manga features an additional moment within this scene that was not included in the anime. After Tadano tries to catch his breath and understand what happened, there is an additional panel showing Komi also trying to collect herself after the stress of being spoken to by another person. She is also portrayed with a dokidoki onomatopoeia effect, which is additionally identified by a caption as representing her racing heartbeat (shinon).

[vii] The opposite effect of this – the same type of sound being represented by multiple forms of onomatopoeia – is more common in manga. Ono Masahiro explains this by analyzing the example of a lighter or match producing a flame in the manga Golgo 13. He identifies the sound shupon as a sound produced by a lighter that, through its presence or suddenness, grabs the attention of those nearby. Other sounds, such as katchikatchi, can be used to represent a cheap, disposable lighter that doesn’t call attention to itself with the same volume. See Ono (2019, 43, 44).