For ease of use, I recommend combining jQuery and Popcorn.js whenever you want to integrate media with HTML, and vice versa. See this jsFiddle for an example.
For the record, should jsfiddle ever disappear, here's the code:
<audio id="greeting" src="" controls></audio>
<div id="text">
<span id="w1" class="word" data-start="1.0">Hello</span>,
and <span id="w2" class="word" data-start="2.0">welcome</span>
to Stack <span id="w3" class="word" data-start="3.0">Overflow</span>.
Thank you for asking your question.
</div>
.word {
  color: red;
}
.word:hover, .word.selected {
  color: blue;
}
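The markup and styles above still need a script to tie the words to the audio. What follows is only a minimal sketch of that wiring, not the code from the fiddle: it swaps jQuery and Popcorn.js for the browser's native `timeupdate` event so that it has no dependencies. The helper name `activeWordId` and the `cues` array are made up for illustration; the `greeting` id, the `word` and `selected` classes, and the `data-start` attributes come from the markup above. It assumes the script runs after the elements exist (for example, at the end of the body).

```javascript
// Hypothetical helper (not part of the original fiddle): given word cues
// sorted by start time, return the id of the latest word whose start time
// has been reached, or null if playback is still before the first cue.
function activeWordId(cues, time) {
  let active = null;
  for (const cue of cues) {
    if (cue.start <= time) {
      active = cue.id;
    } else {
      break; // cues are sorted, so no later cue can match
    }
  }
  return active;
}

// Browser wiring; skipped when no DOM is available (e.g. under Node).
if (typeof document !== "undefined") {
  const audio = document.getElementById("greeting");
  const words = Array.from(document.querySelectorAll(".word"));

  // Build the cue list from the data-start attributes in the markup.
  const cues = words
    .map((el) => ({ id: el.id, start: parseFloat(el.dataset.start) }))
    .sort((a, b) => a.start - b.start);

  // As playback advances, mark the current word with the "selected" class.
  audio.addEventListener("timeupdate", () => {
    const id = activeWordId(cues, audio.currentTime);
    words.forEach((el) => el.classList.toggle("selected", el.id === id));
  });

  // Clicking a word seeks to its start time and plays from there.
  words.forEach((el) => {
    el.addEventListener("click", () => {
      audio.currentTime = parseFloat(el.dataset.start);
      audio.play();
    });
  });
}
```

Because the markup only carries start times, a word in this sketch stays highlighted until the next one begins. If you want the highlight to clear sooner, you would need to add an end time per word as well.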