A little text processing library for Scala.
This is a little text processing library which supports language identification, tokenization and stopword filtering, and provides some useful helper functions. The tokenization has been tuned to work well with text conventions common on social media such as Twitter, and handles URLs, emoji, hashtags, emails and @-mentions cleanly. Stopword filtering is currently supported for:
- German
- English
- Spanish
- French
- Indonesian
- Japanese
- Malay
- Dutch
- Portuguese
- Swedish
- Turkish
- Arabic
More to come.
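To give a feel for what social-media-aware tokenization involves, here is a minimal regex-based sketch. It is not lib-text's tokenizer: the object name `TokenizerSketch`, the `tokenize` function and the patterns themselves are assumptions made for illustration, and real-world handling of emoji and non-ASCII text needs more care than this.

```scala
// Hypothetical sketch, NOT lib-text's implementation: names and patterns are illustrative.
object TokenizerSketch {
  // Alternatives are tried left to right, so the more specific patterns come first.
  // Note: \w is ASCII-only by default in Java/Scala regexes.
  private val Token = List(
    """https?://\S+""",                 // URLs, kept whole
    """[\w.+-]+@[\w-]+(?:\.[\w-]+)+""", // email addresses
    """[#@]\w+""",                      // hashtags and @-mentions
    """\w+(?:'\w+)*""",                 // words, keeping internal apostrophes
    """\S"""                            // any other single non-space character, e.g. punctuation
  ).mkString("|").r

  // Returns the tokens in order of appearance; whitespace is skipped.
  def tokenize(text: String): Vector[String] =
    Token.findAllIn(text).toVector
}
```

For example, `TokenizerSketch.tokenize("Thanks @bob, see #scala!")` keeps `@bob` and `#scala` as single tokens and splits the comma and exclamation mark off as their own tokens. The ordering of the alternatives is the main design choice: if the generic word pattern came first, a URL or email address would be broken into pieces.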
Add the following to your sbt build (for example, in build.sbt):
resolvers += "peoplepattern" at "https://dl.bintray.com/peoplepattern/maven/"
libraryDependencies += "com.peoplepattern" %% "lib-text" % "0.3"
Then import the implicits to call the text helpers directly on strings:
import com.peoplepattern.text.Implicits._
val txt = "Did you get your personalised print with your copy of #MadeintheAM on Black Friday? If not, there's still time! http://www.myplaydirect.com/one-direction"
txt.lang
// Some(en)
txt.tokens
// Vector(Did, you, get, your, personalised, print, with, your, copy, of, #MadeintheAM, on, Black, Friday, ?, If, not, ,, there's, still, time, !, http://www.myplaydirect.com/one-direction)
txt.terms
// Set(print, personalised, black, copy, friday, time)
txt.termsPlus
// Set(print, personalised, black, #madeintheam, copy, friday, time)
txt.termBigrams
// Set(black friday, personalised print)
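Judging from the output above, terms appear to be lowercased, alphabetic, non-stopword tokens, and term bigrams appear to be pairs of adjacent tokens that are both terms. The sketch below reproduces that behaviour under those assumptions. It is not lib-text's code: the object `TermsSketch`, its function names and the tiny stopword sample are all hypothetical.

```scala
// Hypothetical sketch, NOT lib-text's implementation: rules are inferred from the example output.
object TermsSketch {
  // A deliberately tiny English stopword sample; a real list is much longer.
  val stopwords: Set[String] = Set(
    "did", "you", "get", "your", "with", "of", "on", "if", "not", "still")

  // A (lowercased) token counts as a term if it is purely alphabetic and not a stopword.
  // This also drops punctuation, URLs, hashtags and @-mentions.
  def isTerm(token: String): Boolean =
    token.nonEmpty && token.forall(_.isLetter) && !stopwords(token)

  // The set of distinct terms in a token sequence.
  def terms(tokens: Seq[String]): Set[String] =
    tokens.map(_.toLowerCase).filter(isTerm).toSet

  // Bigrams of adjacent tokens where both members are terms.
  def termBigrams(tokens: Seq[String]): Set[String] =
    tokens.map(_.toLowerCase).sliding(2).collect {
      case Seq(a, b) if isTerm(a) && isTerm(b) => s"$a $b"
    }.toSet
}
```

Because bigrams are built from adjacent tokens before filtering, a stopword or punctuation mark between two terms prevents them from forming a bigram: `copy of #MadeintheAM` yields no bigram, while `Black Friday` does.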
lib-text is open source and licensed under the Apache License 2.0.
Developed with ❤️ at People Pattern Corporation