Monday, August 3, 2015

Lyrics for text processing

I've recently been looking into lyrics before implementing some code for processing song lyrics.
Many aspects make song lyrics different from ordinary text. They are quite interesting, and they are going to give me some pain. A few of them are introduced below. Please enjoy!

  • Incomplete sentences
Scars make us who we are
Hearts and homes are broken, broken
Far, we could go so far
With our minds wide open, open

Love you forever and forever 
Love you with all my heart 
Love you whenever we're together 
Love you when we're apart 

In terms of grammar, it's easy to find incomplete sentences. What makes them trickier, however, is that they lack proper punctuation. That would be okay for word-level processing (though there are still some issues, such as negation), but sentence-level processing would require heavy pre-processing. A newline character doesn't always indicate the end of a sentence, of course!

  • Use of arbitrary special characters
사르륵녹은 그대를 보면 사랑을 느끼죠
oh so beautiful 사랑을 말해봐요 
매일 너와 함께! ~~해!~~~
조금더 다가와줘
다가와- 느껴봐 음- 
This may be more about Korean speakers' writing habits. Korean (and Japanese) writers prefer '~' to '-'. In some cases, -- is automatically 'corrected' to a long dash, which makes the problem more complex. (Not that much, actually, since all of them should simply be ignored as stopwords.)
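
As a rough sketch of how such decorative characters could be dropped before tokenising: the pattern below treats runs of tildes, hyphens, dashes, and the wave-dash variants as noise. The name `strip_decoration` and the exact character set are my assumptions, not a complete list.

```python
import re

# Runs of decorative characters: ASCII tilde and hyphen, en/em dashes
# (U+2013, U+2014), wave dash (U+301C) and full-width tilde (U+FF5E).
# The character set is an assumption and probably incomplete.
DECORATION = re.compile(r"[~\-\u2013\u2014\u301C\uFF5E]+")


def strip_decoration(line: str) -> str:
    """Replace each decorative run with a space, then collapse whitespace."""
    cleaned = DECORATION.sub(" ", line)
    return " ".join(cleaned.split())


print(strip_decoration("매일 너와 함께! ~~해!~~~"))
print(strip_decoration("다가와- 느껴봐 음- "))
```

One caveat: this also splits hyphenated words such as "pre-chorus" into two tokens, so it should run after any step that needs the hyphen.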

  • Multiple languages
You make me happy
夕べの すれ違い
(유우베노 스레치가이)
저녁때의 엇갈림
まだまだ 埋まってない
(마다마다 우맛테나이)
아직아직 채워지지 않아
So I'm waiting ソワソワ Oh

The example above is an extreme one, written in Korean, Japanese, and English (with Hangul transliterations of the Japanese lines in parentheses, each followed by a Korean translation). Still, it's very common for Korean lyrics to include English words.
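
One possible way to see which scripts a line mixes is to look at the Unicode name of each character. This is only a sketch: `script_of` and `scripts_in` are names I made up, and the four buckets are a simplification (Han characters alone, for example, cannot tell Japanese from Chinese).

```python
import unicodedata


def script_of(ch: str) -> str:
    """Map one character to a coarse script bucket via its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HANGUL"):
        return "hangul"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "kana"
    if name.startswith("CJK UNIFIED"):
        return "han"
    if name.startswith("LATIN"):
        return "latin"
    return "other"  # spaces, punctuation, digits, and everything else


def scripts_in(line: str) -> set:
    """Return the set of script buckets found in a line, ignoring 'other'."""
    return {script_of(ch) for ch in line} - {"other"}


print(scripts_in("So I'm waiting ソワソワ Oh"))
print(scripts_in("夕べの すれ違い"))
```

A line whose set has more than one element is a mixed-script line and could be routed to per-language handling.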

  • Corruption by other information
Yeah (yeah) 
Shorty got down to come and get me [x2]

There are many non-lyric annotations, such as [Chorus] or [x2], that help the reader while giving me pain.
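
A minimal sketch of removing such annotations, assuming they are always wrapped in square brackets (the function name `strip_annotations` is hypothetical). Parenthesised text such as "(yeah)" is deliberately left alone, since it is usually a sung backing vocal rather than a note to the reader.

```python
import re

# Anything between square brackets, e.g. "[Chorus]" or "[x2]".
ANNOTATION = re.compile(r"\[[^\]]*\]")


def strip_annotations(line: str) -> str:
    """Remove bracketed annotations and collapse leftover whitespace."""
    return " ".join(ANNOTATION.sub(" ", line).split())


print(strip_annotations("Shorty got down to come and get me [x2]"))
```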

  • So many Yeah's and Oh's
There are so many Yeah's and Oh's in lyrics. I'm not sure what I should do with them.

  • To summarise, these are the items that I think I should add to the usual stopword list to process lyrics.
    • Some special characters (which should already be included in the stopword list)
      • * ** *** + " ' ` . .. ... / ~ ~~ ~~~ ~~~~ ~~~~~ ? - -- --- ^, ^^, a, b, c,...z,...
    • Unnecessary (and common) words that appear in lyric texts
      • chorus, verse, pre-chorus, bridge, feat, hook, song, solo, twice, outro, sabi, intro, pre-hook, rap, x2, x3, x4, x5, x6, x7, x8, x9, x10, copyright, azlyrics, writer, br, choir, guitar
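      • (Korean words:) 간주 (interlude), 후렴 (chorus), 반복 (repeat), 가사입력 (lyrics entry), 출처 (source), 작성자 (author), 악보 (sheet music), 연주곡 (instrumental piece), 간주중 (interlude in progress)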
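
To show how the lists above might be applied, here is a small sketch that tokenises a line and drops lyric-specific stopwords. Only a subset of the words is included, the `x\d+` pattern is my generalisation of x2 to x10, and the names `LYRIC_STOPWORDS` and `filter_tokens` are made up for illustration. Yeah and Oh are not filtered, since it isn't decided yet what to do with them.

```python
import re

# A subset of the lyric-specific stopwords listed above.
LYRIC_STOPWORDS = {
    "chorus", "verse", "pre-chorus", "bridge", "feat", "hook",
    "intro", "outro", "간주", "후렴", "반복",
}

# Repeat markers such as x2 ... x10 (generalised here to any "x" + digits).
REPEAT_MARKER = re.compile(r"x\d+")

# Words made of letters/digits, allowing internal hyphens and apostrophes.
TOKEN = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def filter_tokens(text: str) -> list:
    """Lowercase, tokenise, and drop stopwords and repeat markers."""
    tokens = TOKEN.findall(text.lower())
    return [
        tok for tok in tokens
        if tok not in LYRIC_STOPWORDS and not REPEAT_MARKER.fullmatch(tok)
    ]


print(filter_tokens("[Pre-Chorus] Love you forever x2 후렴"))
```

Because the tokeniser only keeps letters and digits, the special characters in the first list disappear without needing their own stopword entries.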
