Monday, August 3, 2015

Lyrics for text processing

Recently I'm looking into lyrics before implementing music lyric processing codes. 
There are many aspects that make song lyrics different from usual text, which are quite interesting and going to give me some pain. Few of them are introduced as below. Please enjoy!

  • Incomplete sentence
example:
Scars make us who we are
Hearts and homes are broken, broken
Far, we could go so far
With our minds wide open, open

example:
Love you forever and forever 
Love you with all my heart 
Love you whenever we're together 
Love you when we're apart 

In terms of grammar, it's easy to find some incomplete sentences. What makes them trickier is however, that they don't have proper punctuation marks. It would be okay for word-level processing (though there is still some issues such as negation..), but sentence-level processing would require a huge pre-processing. New-line character doesn't always indicate the end of the sentences, of course!



  • Use of arbitrary special characters
example:
사르륵녹은 그대를 보면 사랑을 느끼죠
oh so beautiful 사랑을 말해봐요 
매일 너와 함께! ~~해!~~~
조금더 다가와줘
example:
다가와- 느껴봐 음- 
It may be more about Korean speaker's writing habit. Korean (and Japanese) prefers to use '~' than '-'. In some cases, -- is automatically 'corrected' to  , a long dash, which makes the problem more complex. (Not that much, actually, since all of them should be just ignored as a stopword.)

  •  Multiple languages
example:
簡単に
(칸탄니)
간단히
You make me happy
一言で
(히토코토데)
한마디로
夕べの すれ違い
(유우베노 스레치가이)
저녁때의 엇갈림
まだまだ 埋まってない
(마다마다 우맛테나이)
아직아직 채워지지 않아
So I'm waiting ソワソワ Oh
(소와소와)
안절부절

The example above is an extreme one, written with Korean, Japanese, and English (+Korean interpretation for Japanese sentences in parentheses). However, it's very common to use English words in Korean lyrics.


  • Corruption by other information
example:
[Chorus]
Yeah (yeah) 
Shorty got down to come and get me [x2]

There are many non-lyric texts such as [Chorus] or [x2] to help the viewer, while giving me pain.


  • So many Yeah's and Oh's
There are so many Yeah's and Oh's in lyrics. I'm not sure what should I do with them. 





  • To summarise, these are the lists that I think I should add to usual stopwords list to process lyrics.
    • Some special characters (which should be already included in stopwords list)
      • * ** *** + " ' ` . .. ... / ~ ~~ ~~~ ~~~~ ~~~~~ ? - -- --- ^, ^^, a, b, c,...z,...
    • Unnecessary (and common) words that are included in lyrics texts
      • chorus, verse, pre-chorus, bridge, feat, hook, song, solo, twice, outro, sabi, intro, pre-hook, rap, x2, x3, x4, x5, x6, x7, x8, x9, x10, copyright, azlyrics, writer, br, choir, guitar
      • (Korean words:)간주, 후렴, 반복, 가사입력, 출처, 작성자, 악보, 연주곡, 간주중



No comments:

Post a Comment