A pure, typeful, idiomatic Scala wrapper around JSoup.
This is a fork of danielnixon/scalasoup updated to Scala 2.13.7 and JSoup 1.15.1.
- We keep the JSoup API basically intact, unless doing so clashes with any of the below.
- Unlike vanilla JSoup, everything in ScalaSoup is immutable.
- ScalaSoup endeavours to replace all of JSoup's partial functions (those that might return null or throw an exception) with total functions (those that do neither of these barbaric things).
  - We've replaced all nullable return types with `Option`s. No null references, no `NullPointerException`s.
  - We return `Option`s instead of throwing `IndexOutOfBoundsException`s.
  - We encode constraints in types instead of throwing exceptions at runtime. For example, if you call JSoup's `element.remove()` on an element that doesn't have a parent, it will throw. In ScalaSoup this won't even compile (more on this below).
- We use Scala collection types, Scala regexes, etc. instead of the Java equivalents used by vanilla JSoup.
- We drop Java-style `get` prefixes and rename identifiers that are reserved in Scala. For example, `getElementsByTag()` becomes simply `elementsByTag`, and `val` (which is a keyword) becomes `value`. JSoup uses `get` prefixes inconsistently, e.g. `Element.wholeText` vs `TextNode.getWholeText`. In ScalaSoup these are both `wholeText`.
- We support simple mutation by replacing setter methods with `withFoo` methods. The `withFoo` methods return a clone of the object with the modification applied. The original is left unchanged (more on this below).
- We don't expose the `parse` overloads that perform HTTP requests because JSoup's built-in HTTP client is impure and blocking.
- We support more complex mutation by exposing a Free Monad-based DSL (more on this below).
ScalaSoup is probably not for you if:
- You want performance at the cost of correctness. A `def parent: Element` method that returns `null` is probably faster than an honest `def parent: Option[Element]` but I don't care.
Add the dependency to your `build.sbt`:
libraryDependencies += "com.iterable" %% "scalasoup" % "0.1.0"
Then import the `scalasoup` package and use the `ScalaSoup` object as your entrypoint everywhere you would have used `Jsoup`.
import com.iterable.scalasoup._
ScalaSoup.parse(...)
Let's translate the Wikipedia example from the JSoup homepage. The first thing to note is that JSoup's built-in http client is impure and blocking, so ScalaSoup doesn't expose it. You probably already use a library like play-ws or http4s. We encourage you to keep using whatever http library you're already using. For this example we'll use http4s.
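import cats.effect.IO
import org.http4s.client.blaze._
import org.http4s.client.middleware.FollowRedirect
import com.iterable.scalasoup._
// An HTTP client that follows up to three redirects.
val httpClient = FollowRedirect[IO](maxRedirects = 3)(Http1Client[IO]().unsafeRunSync())
val uri = "https://en.wikipedia.org/"
// Describe the work: fetch the page, parse it and print the headlines.
val task = httpClient.expect[String](uri) map { html =>
  val doc = ScalaSoup.parse(html, uri)
  println(doc.title)
  val newsHeadlines = doc.select("#mp-itn b a")
  for (headline <- newsHeadlines) {
    println(s"${headline.attr("title")} ${headline.absUrl("href")}")
  }
}
// Nothing is fetched until the task is run here.
task.unsafeRunSync()
httpClient.shutdownNow()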
JSoup allows you to mutate documents and their constituent parts (elements, nodes, attributes, etc.) in-place. ScalaSoup disables this in order to avoid side-effects. In ScalaSoup everything is effectively immutable. For example, JSoup's `addClass` method, which mutates the element on which it is called, is not exposed by ScalaSoup.
So what do we do instead? We could create a copy of an element, make our changes to the copy and leave the original untouched.
This approach is actually possible in vanilla JSoup. Before moving on, let's see what that might look like:
def withAddClass(element: org.jsoup.nodes.Element, className: String): org.jsoup.nodes.Element = {
  val updatedElement = element.clone
  updatedElement.addClass(className)
  updatedElement
}
val originalElement = new org.jsoup.nodes.Element("div")
val updatedElement = withAddClass(originalElement, "foo")
originalElement.hasClass("foo") // false
updatedElement.hasClass("foo") // true
This works but it has a few flaws.
- We're fighting JSoup and it shows.
- Perhaps crucially, this doesn't actually prevent us from mutating an existing element.
- It's verbose, so it's tempting to avoid this approach and just mutate existing elements.
- It's pretty noisy and the interesting part is obscured by machinery. This has implications for readability, maintainability, etc.
Let's see the same approach using ScalaSoup:
val originalElement = Element("div")
val updatedElement = originalElement.withAddClass("foo")
originalElement.hasClass("foo") // false
updatedElement.hasClass("foo") // true
A few observations:
- In ScalaSoup, it's impossible to call JSoup's `addClass` directly.
- ScalaSoup provides `withAddClass` for you. It does essentially the same thing as the method we wrote in the example above.
- ScalaSoup provides `withFoo` alternatives for all of JSoup's mutating methods (none of which are exposed directly).
- ScalaSoup calls JSoup's `addClass` under the covers, so the mutation is still happening (ScalaSoup is just a wrapper, remember). The crucial point is that the mutation is controlled such that it can only happen on a clone of an existing element and can't be directly observed. Once the modified clone is returned to you it is effectively immutable. Further changes via `withFoo` methods will create additional clones.
- ScalaSoup reverses JSoup's priorities. In ScalaSoup, it's easy to create modified copies of elements but difficult to mutate existing elements. In JSoup, it's difficult to create modified copies but (too) easy to mutate existing elements.
This `withFoo` approach will get you a fair way. If you need something more powerful, see the next section.
One limitation of the `withFoo` approach (above) is that you incur a performance penalty associated with creating clones every time you call a `withFoo` method. For example, `element.withAddClass("foo").withAppendElement("div")` will result in two clones.
It'd be nice if we could batch our modifications and incur the cloning penalty only once per batch of modifications. This is exactly what ScalaSoup's mutation DSL gives us. For the curious, the mutation DSL is implemented using a Cats Free Monad.
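To illustrate why batching helps, here is a minimal sketch using a toy mutable class instead of JSoup. Everything in it (`MutableDoc`, `Edit`, `withTitle`, `withClass`, `modifyAll`, the clone counter) is hypothetical and exists only for illustration; ScalaSoup's actual DSL is built on a Free Monad rather than a plain list of functions.

```scala
object BatchingSketch {
  // Counts clones so the two strategies can be compared.
  var cloneCount: Int = 0

  // Toy stand-in for a mutable JSoup document (hypothetical, not part of either library).
  final class MutableDoc(var title: String, var classes: List[String]) {
    def copyDoc(): MutableDoc = {
      cloneCount += 1
      new MutableDoc(title, classes)
    }
  }

  // A modification is only a description: a function to run later on a clone.
  type Edit = MutableDoc => Unit

  // withFoo style: every call clones before mutating.
  def withTitle(doc: MutableDoc, newTitle: String): MutableDoc = {
    val copy = doc.copyDoc()
    copy.title = newTitle
    copy
  }

  def withClass(doc: MutableDoc, className: String): MutableDoc = {
    val copy = doc.copyDoc()
    copy.classes = className :: copy.classes
    copy
  }

  // Batched style: clone once, then apply every edit to that single clone.
  def modifyAll(doc: MutableDoc, edits: List[Edit]): MutableDoc = {
    val copy = doc.copyDoc()
    edits.foreach(edit => edit(copy))
    copy
  }
}
```

In this sketch, chaining `withTitle` and `withClass` clones twice, whereas passing the same two edits to `modifyAll` clones once. In both cases the original object is left untouched.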
Note that in order to use the DSL you need to add an additional dependency to your `build.sbt`:
libraryDependencies += "com.iterable" %% "scalasoup-dsl" % "0.1.0"
Here's an example that makes two changes to a document, incurring the cloning cost only once.
import com.iterable.scalasoup._
import com.iterable.scalasoup.dsl._
val modifications = for {
  document <- modifyDocument
  _ <- document.setTitle("New Title")
  _ <- document.setHtml("New HTML")
} yield ()
val originalDocument = ScalaSoup.parse(...)
val updatedDocument = originalDocument.modify(modifications)
Some things to observe:
- We need an additional `dsl` wildcard import.
- Our entry-point to the DSL is `modifyDocument`.
- Using a for comprehension, we assemble a description of our modifications.
- The description of our modifications doesn't actually do anything (yet).
- We execute our modifications by calling `modify` on a document. It is at this point that the document is cloned.
- The cloned document is modified using the underlying JSoup methods and an immutable wrapper is returned to us.
- The DSL consistently prefixes the mutating methods with `set` (e.g. `setTitle`, `setHtml`). JSoup almost never prefixes setters with `set`. These are all mutating methods in JSoup: `setBaseUri`, `setWholeData`, `title`, `html`. In ScalaSoup these are always prefixed with `set` and only appear in the DSL.
Here's an example that removes `target` attributes from all `a` tags. Note the use of Cats's `foldMapM` (and the additional `cats` import).
import cats.implicits._
import com.iterable.scalasoup._
import com.iterable.scalasoup.dsl._
val modifications = for {
  document <- modifyDocument
  _ <- document.selectChildren("a").foldMapM(_.removeAttr("target"))
} yield document
val doc = ScalaSoup.parse("<a target=\"_blank\"></a>")
val result = doc.modify(modifications)
Here's an example that builds one DSL program based on another:
val selectLinksProgram = for {
  document <- modifyDocument
} yield document.selectChildren("a")
val modifications = for {
  links <- selectLinksProgram
  _ <- links.foldMapM(_.addClass("foo"))
} yield ()
val doc = ScalaSoup.parse("<a></a>")
val updated = doc.modify(modifications)
Here's an example that builds a DSL with some accumulated value using `modifyAndAccumulate` instead of `modify`:
val modifications = for {
  document <- modifyDocument
  target <- document.selectChildren("a").foldMapM { e =>
    val originalTarget = e.attr("target")
    e.removeAttr("target").map(_ => List(originalTarget))
  }
} yield target
val doc = ScalaSoup.parse("<a target=\"_blank\"></a><a target=\"blah\"></a>")
val (result, removedTargets) = doc.modifyAndAccumulate(modifications)
A number of methods in JSoup throw an exception if the element on which they are called doesn't have a parent.
Here's an example:
val doc = org.jsoup.Jsoup.parse("")
// Throws IllegalArgumentException because you can't remove something from its parent if it _has no_ parent.
// This should throw an IllegalStateException instead, but what matters is that it throws at all.
doc.remove()
Let's try that in ScalaSoup:
val doc = ScalaSoup.parse("")
doc.remove
This time, the invalid program won't even compile:
[error] Foo.scala:12:9: Cannot prove that this node has a parent. You can only call this method on a node with a parent.
[error] doc.remove
[error]
ScalaSoup introduces the concept of a `ParentState` phantom type. All `Node`s (including `Element`s, `Document`s, etc.) have a `ParentState` type parameter, which tells us at compile time whether a node has a parent or not. All the methods that would throw in JSoup (like `remove`) are constrained such that you can only call them on nodes that have a parent. We've eliminated an entire class of runtime exceptions!
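For readers unfamiliar with phantom types, the following is a minimal standalone sketch of the general technique. It is not ScalaSoup's source: the names `HasParentEvidence`, `Node` and the rest are hypothetical, and the library's real encoding may differ.

```scala
import scala.annotation.implicitNotFound

object ParentStateSketch {
  // Phantom types: never instantiated, they only tag other types.
  sealed trait ParentState
  sealed trait HasParent extends ParentState
  sealed trait NoParent extends ParentState

  // Evidence that a state is HasParent. The message is reported when no evidence is found.
  @implicitNotFound("Cannot prove that this node has a parent.")
  sealed trait HasParentEvidence[S <: ParentState]

  object HasParentEvidence {
    // The only instance: evidence exists for HasParent and for nothing else.
    implicit val proof: HasParentEvidence[HasParent] = new HasParentEvidence[HasParent] {}
  }

  // Toy node (hypothetical); the type parameter S records the parent state.
  final class Node[S <: ParentState](val name: String) {
    // Only compiles when S is HasParent, because only then is evidence available.
    def remove(implicit ev: HasParentEvidence[S]): String = s"removed $name"
  }

  val child = new Node[HasParent]("div")
  val root = new Node[NoParent]("#document")

  val removed: String = child.remove
  // root.remove // rejected at compile time: no HasParentEvidence[NoParent] exists
}
```

The design choice here is that the check costs nothing at runtime: the evidence value carries no data, and the state types exist only in signatures.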
Here are some points to keep in mind:
- Newly constructed `Document`s, including those returned by `ScalaSoup.parse`, etc. never have a parent.
- Clones never have a parent (because one of the things JSoup always does when cloning is clear the clone's parent).
- Children returned from methods like `Document.head`, `Document.body`, `Element.child`, `Element.children`, `FormElement.elements` and others always have a parent.
- JSoup does not make elements returned by `Document.createElement` a child of the document, so they never have a parent in ScalaSoup.
This `ParentState` scheme works out reasonably well almost everywhere. There are one or two places where it isn't as simple as we'd like it to be.
The first is the set of methods including `select`, `selectFirst`, `elementsByTag`, `elementsMatchingText`, `allElements`, etc. (all sans the `get` prefix from JSoup, of course).
These methods all return lists (or options) of elements. The returned value could include children of the current element and the current element itself.
This raises a question. What should the `ParentState` of the returned list be?
If the current element is known to have a parent, then we can confidently return `List[Element[ParentState.HasParent]]`. And this is actually how ScalaSoup works. For example:
// Given some element with a parent (a `body` element in this case).
val element: Element[ParentState.HasParent] = ScalaSoup.parse("<div></div>").body.get
// Select all elements using a CSS wildcard. This will include the original body element and its child div.
val results: List[Element[ParentState.HasParent]] = element.select("*")
But what if we don't know the parent state of the current element (or if the current element is known to not have a parent)? In that case we cannot know the parent state of the returned list. For example:
// Given some element without a parent (a document element in this case).
val document: Document[ParentState.NoParent] = ScalaSoup.parse("<div></div>")
// Select all div elements in the document. We (humans) know that the returned list will only include elements with parents, but we haven't persuaded the compiler.
val results: List[Element[_]] = document.select("div")
There are a couple of ways we can work around this. One is to ensure we call the method (`select` in these examples) on something we know has a parent. For example:
val document: Document[ParentState.NoParent] = ScalaSoup.parse("<div></div>")
// We go via the body element, which is known to have a parent.
val results: List[Element[ParentState.HasParent]] = document.body.toList.flatMap(_.select("*"))
Another, perhaps nicer, solution is to use one of the new methods introduced in ScalaSoup (i.e. not present in vanilla JSoup) for this purpose.
Instead of `select` we can call `selectChildren`, which is equivalent in all respects except that it will never include the current element, allowing us to know with confidence that the returned elements all have a parent.
val document: Document[ParentState.NoParent] = ScalaSoup.parse("<div></div>")
// Using `selectChildren`, we know that all the results will have a parent.
val results: List[Element[ParentState.HasParent]] = document.selectChildren("*")
For the other methods, take a look at `selectFirstChild` in place of `selectFirst`, `allChildren` in place of `allElements`, `childrenByClass` in place of `elementsByClass` and so on.
The second issue is raised by the `parent`, `parents` and `parentNode` methods. In these cases we don't know at compile time whether or not the parent has a parent of its own.
This will be annoying in cases like this:
val modifications = for {
  doc <- modifyDocument
  link = doc.selectFirstChild("a")
  linkParent = link.flatMap(_.parent)
  _ <- linkParent.foldMapM(_.remove)
} yield ()
val document = ScalaSoup.parse("<div><a></a></div>")
val updated = document.modify(modifications)
In the above example we:
- select the first `a` tag,
- get its parent,
- try to remove the parent from the document.
This fails to compile because we can't prove that the `a` tag's parent has a parent of its own. Recall that we can't remove an element unless it has a parent. Doing so would manifest as an exception in JSoup.
There are a couple of workarounds:
The bluntest (and least safe) solution is to just resort to using `asInstanceOf`:
val modifications = for {
  doc <- modifyDocument
  link = doc.selectFirstChild("a")
  linkParent = link.flatMap(_.parent.map(_.asInstanceOf[Element[ParentState.HasParent]]))
  _ <- linkParent.foldMapM(_.remove)
} yield ()
A safer solution is to rewrite our program so that it no longer has to work its way back up the tree of elements. In this example let's rewrite using the `has` pseudo-class, which is supported by JSoup:
val modifications = for {
  doc <- modifyDocument
  linkParent = doc.selectFirstChild("div:has(a)")
  _ <- linkParent.foldMapM(_.remove)
} yield ()
Consider the following (flawed) JSoup program:
import scala.jdk.CollectionConverters._
val doc: org.jsoup.nodes.Document = ???
val foo = doc.childNodes.asScala.map {
  case x: org.jsoup.nodes.Element => x.html
  case x: org.jsoup.nodes.DataNode => x.getWholeData
}
Can you see the problem? The pattern match is not exhaustive. If any of the child nodes is something other than an element or a data node, we're going to throw a `MatchError` at runtime. Worse, the compiler cannot warn us about it.
Let's re-write our JSoup program using ScalaSoup:
val doc: Document[_] = ???
val foo = doc.childNodes.map {
  case x: Element[_] => x.html
  case x: DataNode[_] => x.wholeData
}
This time, we see the problem clearly:
[warn] Foo.scala:24:34: match may not be exhaustive.
[warn] It would fail on the following inputs: Comment(), DocumentType(), TextNode(), XmlDeclaration()
[warn] val foo = doc.childNodes.map {
[warn] ^
[warn] one warning found
In ScalaSoup, unlike in JSoup, the `Node`/`Element`/`Document`/etc. class hierarchy is sealed. This allows the compiler to determine when a match is not exhaustive.
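As a standalone illustration of what sealing buys us, here is a small sketch with a toy hierarchy. The case classes below are hypothetical stand-ins rather than ScalaSoup's real node types. Because every subtype is declared in the same file, the compiler can check that the match in `describe` covers all of them.

```scala
object SealedSketch {
  // A toy sealed hierarchy (hypothetical; far smaller than the real node types).
  sealed trait Node
  final case class Element(html: String) extends Node
  final case class DataNode(wholeData: String) extends Node
  final case class TextNode(text: String) extends Node
  final case class Comment(data: String) extends Node

  // Every subtype is handled, so there is no exhaustiveness warning and no MatchError.
  // Deleting any one case would make the compiler warn that the match may not be exhaustive.
  def describe(node: Node): String = node match {
    case Element(html)  => html
    case DataNode(data) => data
    case TextNode(text) => text
    case Comment(_)     => ""
  }

  val nodes: List[Node] = List(Element("<b>hi</b>"), TextNode("plain"), Comment("ignored"))
  val described: List[String] = nodes.map(describe)
}
```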
Consider this JSoup program that finds elements matching a regex (`(foo)` in this case):
val matchingElements = org.jsoup.Jsoup.parse("<div>foo</div>").getElementsMatchingText("(foo)")
But what if we forgot to close that capturing group (`(foo`)?
val matchingElements = org.jsoup.Jsoup.parse("<div>foo</div>").getElementsMatchingText("(foo")
We get a runtime exception: `java.lang.IllegalArgumentException: Pattern syntax error: (foo`.
Let's see the equivalent in ScalaSoup:
val result = ScalaSoup.parse("<div>foo</div>").elementsMatchingText("(foo")
What happens this time? It doesn't even compile.
[error] Foo.scala:183:73: Regex predicate failed: Unclosed group near index 4
[error] (foo
[error] ^
[error] val result = ScalaSoup.parse("<div>foo</div>").elementsMatchingText("(foo")
[error] ^
[error] one error found
[error] (dsl/test:compileIncremental) Compilation failed
We've eliminated yet another entire class of runtime exceptions.
What about CSS selectors?
Take this JSoup example (note the unclosed `[`):
val result = org.jsoup.Jsoup.parse("<div>foo</div>").select("a[href")
What happens? Another runtime exception:
org.jsoup.select.Selector$SelectorParseException: Did not find balanced marker at 'href'
And what about ScalaSoup?
val result = ScalaSoup.parse("<div>foo</div>").select("a[href")
As you guessed, it doesn't even compile:
[error] Foo.scala:183:59: CssSelector predicate failed: Did not find balanced marker at 'href'
[error] val result = ScalaSoup.parse("<div>foo</div>").select("a[href")
[error] ^
[error] one error found
[error] (dsl/test:compileIncremental) Compilation failed
These compile-time checks are courtesy of the excellent refined library.
There is one big limitation: these checks rely on the regex and CSS selectors being baked in at compile time. For example, this won't compile:
val someSelectorFromTheOutsideWorld: String = ???
val result = ScalaSoup.parse("<div>foo</div>").select(someSelectorFromTheOutsideWorld)
Here's the compiler error:
[error] Foo.scala:184:59: compile-time refinement only works with literals
[error] val result = ScalaSoup.parse("<div>foo</div>").select(someSelectorFromTheOutsideWorld)
[error] ^
[error] one error found
[error] (dsl/test:compileIncremental) Compilation failed
ScalaSoup provides a `fromString` method for these cases. It returns an `Either` containing either an error message or a valid selector:
val someSelectorFromTheOutsideWorld: String = ???
val selectorOrError: Either[String, CssSelectorString] = CssSelectorString.fromString(someSelectorFromTheOutsideWorld)
selectorOrError match {
  case Left(errorMessage) => Nil
  case Right(validSelector) => ScalaSoup.parse("<div>foo</div>").select(validSelector)
}
If you're feeling reckless, you can use `fromStringUnsafe`:
val someSelectorFromTheOutsideWorld: String = ???
// This will throw if the selector doesn't parse.
val validSelector: CssSelectorString = CssSelectorString.fromStringUnsafe(someSelectorFromTheOutsideWorld)
ScalaSoup.parse("<div>foo</div>").select(validSelector)
There are equivalent methods for regexes: `RegexString.fromString` and `RegexString.fromStringUnsafe`.
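The general pattern behind a `fromString` that returns an `Either` is a smart constructor. The sketch below shows the idea using only the Java standard library. `ValidRegex` is a hypothetical type invented for this illustration and is not how ScalaSoup's `RegexString` is implemented.

```scala
import java.util.regex.{Pattern, PatternSyntaxException}

object RegexSketch {
  // Private constructor: the only way to obtain a ValidRegex is through fromString.
  final class ValidRegex private (val pattern: Pattern)

  object ValidRegex {
    // Left carries the error description, Right carries a regex known to compile.
    def fromString(s: String): Either[String, ValidRegex] =
      try Right(new ValidRegex(Pattern.compile(s)))
      catch { case e: PatternSyntaxException => Left(e.getDescription) }
  }

  val good: Either[String, ValidRegex] = ValidRegex.fromString("(foo)")
  val bad: Either[String, ValidRegex] = ValidRegex.fromString("(foo")
}
```

Callers are forced to handle the `Left` case before they can get at the compiled pattern, so an invalid regex from the outside world becomes an ordinary value rather than an exception.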
JSoup really likes to return null references. For example:
val doc = org.jsoup.Jsoup.parse("<div></div>")
val element: org.jsoup.nodes.Element = doc.selectFirst("span") // Returns null
val spanHtml = element.html // Throws java.lang.NullPointerException
In ScalaSoup, `selectFirst` is more honest: it returns an `Option[Element]`.
val doc = ScalaSoup.parse("<div></div>")
val maybeElement: Option[Element[_]] = doc.selectFirst("span")
maybeElement match {
  case Some(element) => element.html
  case None => ""
}
ScalaSoup replaces all such nullable return types with `Option`s.
It is of course possible to call `get` on an `Option` (which will throw if `None`), putting us right back in the primitive world we just escaped. To avoid this, consider using WartRemover and its OptionPartial wart.
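As a sketch of what avoiding `get` looks like in practice, the example below wraps a nullable Java call in `Option` and then uses only total operations. The `titles` map and the `title` function are hypothetical, standing in for any Java API that may return null.

```scala
object OptionSketch {
  // Hypothetical nullable Java API: HashMap.get returns null for a missing key.
  private val titles = new java.util.HashMap[String, String]()
  titles.put("home", "Welcome")

  // Option(...) turns a null reference into None and anything else into Some.
  def title(page: String): Option[String] = Option(titles.get(page))

  // Total alternatives to calling get:
  val found: String = title("home").getOrElse("")               // supply a default
  val missing: String = title("nope").fold("(none)")(_.toUpperCase) // handle both cases
  val length: Option[Int] = title("nope").map(_.length)         // stay inside Option
}
```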
JSoup often returns an instance of its `Elements` class (a subclass of `ArrayList<Element>`). Scala's collection types are much richer than Java's, so in ScalaSoup we opt to use simple `List[Element]` lists. We do not expose (a wrapper around) `Elements`.
Consider this method from JSoup's `Elements` class:
public List<FormElement> forms() {
    ArrayList<FormElement> forms = new ArrayList<>();
    for (Element el: this)
        if (el instanceof FormElement)
            forms.add((FormElement) el);
    return forms;
}
Usage might look something like this:
val document = org.jsoup.Jsoup.parse(...)
val forms = document.getAllElements.forms
In ScalaSoup we don't need this; we can simply use `collect`:
val forms = document.allChildren.collect({case f: FormElement[ParentState.HasParent] => f})
Consider this JSoup program:
val doc1 = org.jsoup.Jsoup.parse("<span></span>")
val doc2 = org.jsoup.Jsoup.parse("<div></div>")
doc2.selectFirst("div").replaceWith(doc1.selectFirst("span"))
With one call to `replaceWith`, we've managed to mutate both `doc2` and `doc1`. This is courtesy of JSoup's "reparenting" misfeature.
Let's rewrite this in ScalaSoup:
val doc1 = ScalaSoup.parse("<span></span>")
val doc2 = ScalaSoup.parse("<div></div>")
val modifications = for {
  doc <- modifyDocument
  div = doc.selectFirstChild("div").get
  _ <- div.replaceWith(doc1.selectFirstChild("span").get)
} yield ()
val updatedDoc2 = doc2.modify(modifications)
This time around, neither `doc2` nor `doc1` is mutated. ScalaSoup achieves this by cloning arguments to `replaceWith` and other similar methods.
- Publish to Sonatype.
- More tests.
- Improve this readme.
- Copy JavaDoc from JSoup?
- Use refined for URLs.
- Improve the DSL.
MIT