Recently I've been looking into lyrics before implementing code for processing song lyrics.
Several aspects make song lyrics different from ordinary text. They are quite interesting, and they are going to give me some pain. A few of them are introduced below. Please enjoy!
- Incomplete sentences
example:
Scars make us who we are
Hearts and homes are broken, broken
Far, we could go so far
With our minds wide open, open
In terms of grammar, it's easy to find incomplete sentences. What makes them trickier, however, is that they don't have proper punctuation marks. That would be okay for word-level processing (though there are still some issues, such as negation), but sentence-level processing would require heavy pre-processing. A newline character doesn't always indicate the end of a sentence, of course!
Love you forever and forever
Love you with all my heart
Love you whenever we're together
Love you when we're apart
- Use of arbitrary special characters
example:
사르륵녹은 그대를 보면 사랑을 느끼죠
oh so beautiful 사랑을 말해봐요
매일 너와 함께! ~~해!~~~
조금더 다가와줘
다가와- 느껴봐 음-
It may be more about Korean speakers' writing habits. Korean (and Japanese) writers prefer '~' to '-'. In some cases, -- is automatically 'corrected' to –, a longer dash, which makes the problem more complex. (Not by much, actually, since all of them should simply be ignored as stopwords.)
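A minimal sketch of how such decorative characters might be dropped before tokenizing, using Python's standard `re` module. The function name `strip_decorations` and the character set are my own assumptions for illustration, not a complete list. Note that this also splits hyphenated words such as "pre-chorus", which may or may not be what you want.

```python
import re

# Runs of tildes, hyphens, en/em dashes and carets used as decoration.
# The character set is an assumption; extend it for your own corpus.
_DECORATION = re.compile(r"[~\-\u2013\u2014^]+")

def strip_decorations(line: str) -> str:
    """Replace each decorative run with a space, then collapse whitespace."""
    cleaned = _DECORATION.sub(" ", line)
    return " ".join(cleaned.split())

print(strip_decorations("다가와- 느껴봐 음-"))
```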
- Multiple languages
example:
簡単に(칸탄니)간단히
You make me happy
一言で(히토코토데)한마디로
夕べの すれ違い(유우베노 스레치가이)저녁때의 엇갈림
まだまだ 埋まってない(마다마다 우맛테나이)아직아직 채워지지 않아
So I'm waiting ソワソワ Oh(소와소와)안절부절
The example above is an extreme one, written in Korean, Japanese, and English (with a Korean transliteration of each Japanese phrase in parentheses, followed by its Korean translation). However, it's very common to use English words in Korean lyrics.
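One way to see how mixed a line is would be to count characters per script. The sketch below relies on the Unicode character names from Python's standard `unicodedata` module. The labels (`hangul`, `kana`, `han`, `latin`) and the helper names `script_of` and `script_counts` are hypothetical, and the classification is deliberately coarse: fullwidth Latin letters, for instance, fall into `other`.

```python
import unicodedata
from collections import Counter

def script_of(ch: str) -> str:
    """Guess a coarse script label from the Unicode character name."""
    if not ch.isalpha():
        return "other"  # punctuation, digits, symbols
    name = unicodedata.name(ch, "")
    if name.startswith("HANGUL"):
        return "hangul"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "kana"
    if name.startswith("CJK"):
        return "han"
    if name.startswith("LATIN"):
        return "latin"
    return "other"

def script_counts(line: str) -> Counter:
    """Count non-whitespace characters of a line by script label."""
    return Counter(script_of(ch) for ch in line if not ch.isspace())

print(script_counts("So I'm waiting ソワソワ Oh(소와소와)안절부절"))
```

A line whose counts are spread over several labels could then be routed to a language-specific tokenizer piece by piece, rather than being treated as one language.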
- Corruption by other information
example:
[Chorus]
Yeah (yeah)
Shorty got down to come and get me [x2]
Lyrics often contain non-lyric text such as [Chorus] or [x2]. It helps the reader, while giving me pain.
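A minimal sketch for removing markers like [Chorus] or [x2], using Python's standard `re` module. It assumes that square brackets never contain real lyric text, which may not hold for every source. The name `strip_annotations` is hypothetical.

```python
import re

# Anything inside square brackets, e.g. [Chorus], [x2], [Verse 1].
# Assumption: brackets are only ever used for annotations.
_BRACKETED = re.compile(r"\[[^\]]*\]")

def strip_annotations(line: str) -> str:
    """Remove bracketed annotations and collapse leftover whitespace."""
    cleaned = _BRACKETED.sub(" ", line)
    return " ".join(cleaned.split())

for line in ["[Chorus]", "Yeah (yeah)", "Shorty got down to come and get me [x2]"]:
    print(repr(strip_annotations(line)))
```

Lines that become empty after stripping, such as a bare [Chorus], can simply be skipped.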
- So many Yeah's and Oh's
There are so many Yeah's and Oh's in lyrics. I'm not sure what I should do with them.
- To summarise, these are the items that I think I should add to the usual stopword list in order to process lyrics.
  - Some special characters (which should already be included in the stopword list)
    - * ** *** + " ' ` . .. ... / ~ ~~ ~~~ ~~~~ ~~~~~ ? - -- --- ^, ^^, a, b, c,...z,...
  - Unnecessary (and common) words that appear in lyrics text
    - chorus, verse, pre-chorus, bridge, feat, hook, song, solo, twice, outro, sabi, intro, pre-hook, rap, x2, x3, x4, x5, x6, x7, x8, x9, x10, copyright, azlyrics, writer, br, choir, guitar
    - (Korean words, with English glosses:) 간주 (interlude), 후렴 (refrain), 반복 (repeat), 가사입력 (lyrics entry), 출처 (source), 작성자 (author), 악보 (sheet music), 연주곡 (instrumental piece), 간주중 (during the interlude)
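The lists above could be applied to already-tokenized lyrics roughly as follows. This is only a sketch under a few assumptions: `LYRIC_STOPWORDS` holds just a subset of the words listed, the names `LYRIC_STOPWORDS` and `filter_tokens` are hypothetical, and tokens consisting purely of symbols are caught by a regular expression instead of enumerating every run of `~` or `-`.

```python
import re

# A subset of the lists above; extend with the remaining entries as needed.
LYRIC_STOPWORDS = {
    "chorus", "verse", "pre-chorus", "bridge", "feat", "hook",
    "intro", "outro", "x2", "x3", "간주", "후렴", "반복",
}

# Tokens made up only of punctuation or symbols, e.g. "~~~", "...", "--", "^^".
_SYMBOLS_ONLY = re.compile(r"^[\W_]+$")

def filter_tokens(tokens):
    """Drop lyric-specific stopwords, symbol-only tokens and single Latin letters."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in LYRIC_STOPWORDS:
            continue
        if _SYMBOLS_ONLY.match(tok):
            continue
        if len(low) == 1 and "a" <= low <= "z":
            continue
        kept.append(tok)
    return kept

print(filter_tokens(["Chorus", "Love", "you", "~~~", "x2", "a", "간주", "사랑을"]))
```

Dropping single letters follows the list above, but it also removes real words such as "a" and "I", so whether to keep that rule depends on the task.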