Scoup (pronounced "scoop") wraps the JSoup HTML parsing library with implicits for more Scala-idiomatic element operations, as well as additional methods for querying the parsed data.
Scoup is published for Scala 2.11, 2.12, 2.13 and 3.0. Bring in the library by adding the following to your build.sbt:
libraryDependencies ++= Seq(
"com.themillhousegroup" %% "scoup" % "1.0.0"
)
Once you have Scoup added to your project, you can start using it.
In keeping with modern asynchronous software, network operations are performed in a Future, so other work can be done while we wait for the response:
import com.themillhousegroup.scoup.Scoup
import scala.concurrent.ExecutionContext.Implicits.global // needed to map over the Future

// Returns a Future[Document] - map on it to get the doc:
Scoup.parse("http://www.google.com").map { doc =>
  println(s"Got the doc: $doc")
}
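A Future can also fail (for example, when the host is unreachable), so it may help to handle both outcomes. The following is a minimal sketch rather than part of Scoup itself: the object name TitleFetcher, the helper describe, the 30-second timeout and the example.com URL are all illustrative assumptions, and Document.title() is the standard jsoup accessor. Blocking with Await is only reasonable at the edge of a small demo program.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

import com.themillhousegroup.scoup.Scoup

// Hypothetical demo object; only Scoup.parse comes from the library.
object TitleFetcher {

  // Turn the outcome of a fetch into a printable message.
  def describe(outcome: Try[String]): String = outcome match {
    case Success(title) => s"Page title: $title"
    case Failure(error) => s"Fetch failed: ${error.getMessage}"
  }

  def main(args: Array[String]): Unit = {
    // Scoup.parse returns a Future[Document]; map it down to the <title> text.
    val title: Future[String] =
      Scoup.parse("http://www.example.com").map(doc => doc.title())

    // Block only here, with an assumed 30-second timeout; Try captures
    // both a failed Future and a timeout.
    val outcome = Try(Await.result(title, 30.seconds))
    println(describe(outcome))
  }
}
```

Keeping describe separate from the network call means the message formatting can be exercised without touching the network.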
Mix in the ScoupImplicits trait to get automatic conversion from the Jsoup Elements class to a Scala Iterable[Element].
You can just import com.themillhousegroup.scoup.ScoupImplicits._ if you prefer.
From there, you can map, filter etc. as you see fit. For example, here we scrape www.somesite.com, first pulling out all the <h3> elements under .main (into an Iterable[Element]), then keeping only those Elements whose text contains the word "foo", and finally mapping them to an Iterable[String]:
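import com.themillhousegroup.scoup.Scoup
import com.themillhousegroup.scoup.ScoupImplicits
import scala.concurrent.ExecutionContext.Implicits.global // needed to map over the Future

class MyThing extends ScoupImplicits {
  Scoup.parse("http://www.somesite.com").map { doc =>
    // Select every <h3> under .main; the implicits let us treat the result as an Iterable[Element]
    val allHeadings = doc.select(".main h3")
    // Keep only headings whose own text contains "foo", then extract that text
    val fooHeadings = allHeadings.filter(_.ownText.contains("foo")).map(_.ownText)
    ...
  }
}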
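The same filter-then-map pipeline can be tried offline against an HTML string. This is a sketch under assumptions: it uses jsoup's own Jsoup.parse(String) instead of Scoup.parse, the HeadingFilter object and the sample markup are made up for illustration, and it relies on the Elements-to-Iterable[Element] conversion described above being triggered by the type annotation.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Element
import com.themillhousegroup.scoup.ScoupImplicits._

// Hypothetical helper object, not part of Scoup.
object HeadingFilter {

  // Returns the text of every <h3> under .main whose own text contains "foo".
  def fooHeadings(html: String): List[String] = {
    val doc = Jsoup.parse(html)
    // The annotation asks for the implicit Elements => Iterable[Element] view.
    val headings: Iterable[Element] = doc.select(".main h3")
    headings
      .filter(_.ownText().contains("foo"))
      .map(_.ownText())
      .toList
  }

  // Made-up sample: two headings inside .main, one outside it.
  val sample: String =
    """<div class="main"><h3>foo one</h3><h3>bar</h3></div>
      |<h3>foo outside</h3>""".stripMargin
}
```

Calling HeadingFilter.fooHeadings(HeadingFilter.sample) should keep only the first heading, since the second lacks "foo" and the third sits outside .main.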
If you have ScoupImplicits in scope, you get the following methods added to Element:
attribute(name: String): Option[String] - returns None if there's no such attribute or it is blank
attributeRegex(nameRegex: Regex): Option[String] - use a Scala Regex to select an attribute by name
isBefore(other: Element): Boolean - compare the position of this Element with another in the Document
isAfter(other: Element): Boolean - compare the position of this Element with another in the Document
closest(selector: String): Elements - like the jQuery method of the same name, find the match in the Element's hierarchy closest to myself
closestOption(selector: String): Option[Element] - find the closest match in the Element's hierarchy, or None if none found
closestBeforeOption(selector: String): Option[Element] - find the closest match in the Element's hierarchy that is before myself
closestAfterOption(selector: String): Option[Element] - find the closest match in the Element's hierarchy that is after myself
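A few of these helpers might be used together as sketched below. The signatures are the ones listed above; everything else (the ElementHelpersDemo object, the sample markup, the value names) is hypothetical. The sketch also assumes the jsoup version that Scoup itself depends on, since a newer jsoup that defines a method of the same name directly on Element would take precedence over the implicit one.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Element
import com.themillhousegroup.scoup.ScoupImplicits._

// Hypothetical demo object, not part of Scoup.
object ElementHelpersDemo {

  // Made-up markup: the first link has an href, the second does not.
  private val html =
    """<div class="card" id="first"><a href="/one">One</a></div>
      |<div class="card" id="second"><a>Two</a></div>""".stripMargin

  private val doc = Jsoup.parse(html)

  val firstLink: Element  = doc.select("#first a").first()
  val secondLink: Element = doc.select("#second a").first()

  // Option-returning attribute lookup instead of checking for empty strings.
  val href: Option[String]    = firstLink.attribute("href")
  val missing: Option[String] = secondLink.attribute("href")

  // Look up the hierarchy for the nearest element matching the selector.
  val card: Option[Element] = firstLink.closestOption("div.card")

  // Compare document positions of two elements.
  val firstComesFirst: Boolean = firstLink.isBefore(secondLink)
}
```

Because the lookups return Option, callers can chain them with map, flatMap or getOrElse rather than testing for null or blank strings.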
If you have ScoupImplicits in scope, you get these methods (see above for descriptions) added to Elements, in addition to being able to treat it as an Iterable[Element]:
attribute(name: String): Option[String]
attributeRegex(nameRegex: Regex): Option[String]
closestOption(selector: String): Option[Element]
closest(selector: String): Elements
- The awesome JSoup project.
- Filippo De Luca wrote about pimping JSoup, which became SSoup