Tom Ordonez

Data Science, Machine Learning, Growing Teams

Python Lambda and BeautifulSoup


This Python Lambda is a very weird concept. I almost grok it.

I was trying to parse HTML comments using BeautifulSoup.

After a quick Google search I found this solution:

commments = soup.find_all(text=lambda text:isinstance(text, Comment))

Say what?

You lost me at lambda.

Here is an easier lambda example.

def sum(x, y):
    return x + y

This is easy right? Just a simple function that takes x and y and returns the sum.

What if one day you say "I bet there is a one-liner for this".

And I am not talking about this:

def sum(x, y): return x + y

I am talking about some mutant Python code skill that when you write it everybody in the room just faints.

Python Lambda

sum = lambda x,y: x + y

Did you see that?

From this:

def sum(x, y):
    return x + y

To this:

sum = lambda x,y: x + y

I read it like this:

"Invoke the powers of Lambda and take the parameters x and y. Add them together and return the value. Assign the value to sum".

I am doing a MS in Computer Science at Georgia Tech with a focus in Machine Learning. I am writing a weekly newsletter about my lessons learned. Follow my quest to conquer data science.

Python Lambda Details

A Lambda function is an anonymous function. A function defined without a name.

Your usual function is defined with def. While anonymous functions are defined with lambda.

The syntax of the lambda function is:

lambda arguments: expression
  • Any number of arguments
  • One expression to rule them all
  • The expression needs to return a value

Calling Lambda Helloooo Lambda

This works:

>>> lambda1 = lambda x: x**2
>>> lambda1(3)
9

Lambda, BeautifulSoup and HTML Comments

This is where I was tripping a bit.

Given a soup object.

And an HTML such as:

<!-- Python is awesome -->
<!-- Lambda is confusing -->
<!-- name="Homer Simpson" -->
<title>I am grook</title>
<h1>Homer groks Lambda</h1>
<p>Lolcats</p>
<a href="https://www.tomordonez.com">I grok Lambda too</a>
<a href="https://twitter.com/tomordonez">I grok Twitter</a>

I wanted to extract the text from the HTML comments.

BeautifulSoup has a module called Comment that is used for this.

from bs4 import Comment

The solution from StackOverflow says that to extract the comments to a list. You need lambda and the isinstance function.

comments = soup.find_all(text=lambda text: isinstance(text, Comment))

BeautifulSoup and Lambda

Keep as reference the short HTML example above.

The "find all HTML comments code" starts with find_all.

Some people keep using findAll too. But the new syntax is find_all to comply with PEP8. Using underscores and not camelCase.

In BeautifulSoup, the find_all method, searches for all tags in the soup object.

Using find_all()

>>> soup.find_all('a')
[<a href="https://www.tomordonez.com">I grok Lambda too</a>,
<a href="https://twitter.com/tomordonez">I grok Twitter</a>]

Will search for all a anchor tags and return a list of tag objects.

Using find

>>> soup.find('a')
<a href="https://www.tomordonez.com">I grok Lambda too</a>

Will only search for one a anchor tag and return a tag object.

<class 'bs4.element.Tag'>

The default search method of a soup object is find_all so you can also do this:

>>> soup('a')
[<a href="https://www.tomordonez.com">I grok Lambda too</a>,
<a href="https://twitter.com/tomordonez">I grok Twitter</a>]

This will also return a list of a tags.

I guess the difference of using find_all and find is about returning a list of results or returning just one result.

In an HTML document there is only one title tag.

Why would you use find_all? You wouldn't.

>>> soup.find('title')
<title>I am grook</title>

If you use this:

>>> soup('title')

It will default to using find_all.

Anchor Tags and Attributes

In BeautifulSoup the attributes of an anchor tag can be accessed as a dictionary.

Using find will only return the 1st value it finds.:

>>> anchor = soup.find('a')
>>> anchor
<a href="https://www.tomordonez.com">I grok Lambda too</a>

To get the string from href:

>>> anchor['href']
https://www.tomordonez.com

The attributes of the anchor tag are defined as a dictionary such as:

{'href': 'https://www.tomordonez.com'}

Which means that you can find a specific tag that has attributes such as:

soup.find('tag_name', attrs={'key': 'value'})

Anchor Tags and Strings

By default find_all and find are looking for HTML tags.

What about strings such as:

I grok Lambda too

Which is inside of:

<a href="https://www.tomordonez.com">I grok Lambda too</a>

You can use the string argument.

In a previous version of BeautifulSoup it was called text. They changed it to string.

You could do this:

>>> soup.find(string='I grok Lambda too')
"I grok Lambda too"

For this example:

<title>I am grook</title>

You can do this. Which uses find to search for the tag title:

>>> soup.find('title').string
"I am grook"

Or this also works. Which uses the argument string to search for strings instead of tags:

>>> soup.find(string="I am grook")
"I am grook"

Weird Lambda code

Let's review that lambda code again to find HTML comments:

comments = soup.find_all(text=lambda text: isinstance(text, Comment))

First of all text= is not used anymore. You should use string=.

Let's change that:

comments = soup.find_all(string=lambda text: isinstance(text, Comment))

Now, text is just a variable. Which in this case is not a really good name. We are not looking for any text, we are looking for HTML comments. Let's change that:

comments = soup.find_all(string=lambda html_comment: isinstance(html_comment, Comment))

What is Comment?

BeautifulSoup has a module called Comment that helps you find HTML comments.

from bs4 import Comment

What is isinstance?

isinstance is a Python built-in function.

The syntax is:

isinstance(object, classinfo)

It returns True if the object argument is an instance of the classinfo argument.

And you could test it like this:

>>> isinstance('Homer', str)
True
>>> isinstance('Homer', int)
False

Which means that this:

isinstance(html_comment, Comment)

Returns True or False...

"Is html_comment an instance of the Comment object?"

The result will be either True or False.

Lambda function

lambda html_comment: isinstance(html_comment, Comment)

This is the same as doing this:

def comments(html_comment):
    isinstance(html_comment, Comment)

Putting this together:

soup.find_all(string=lambda html_comment: isinstance(html_comment, Comment))

This reads as:

  • find_all strings
  • Pass the string as an argument of the lambda function.
  • The string uses the argument html_comment.
  • isinstance says "Is html_comment an instance of Comment?"
  • If it is then return that string.

Finding HTML Comments in BeautifulSoup

To summarize.

Given this:

from bs4 import BeautifulSoup, Comment

html = '''
<!-- Python is awesome -->
<!-- Lambda is confusing -->
<!-- name="Homer Simpson" -->
<title>I am grook</title>
<h1>Homer groks Lambda</h1>
<p>Lolcats</p>
<a href="https://www.tomordonez.com">I grok Lambda too</a>
<a href="https://twitter.com/tomordonez">I grok Twitter</a>
'''

We create a soup object:

soup = BeautifulSoup(html, 'html.parser')

Using find_all, which is the default search method for soup:

>>> soup.find_all('title')
[<title>I am grook</title>]

Keep in mind that find_all will always return a list of tag objects.

And it's the same as using this:

>>> soup('title')
[<title>I am grook</title>]

Getting the string out of title. Remember that right now we have a list with one element.

If you do this. It won't work:

>>> soup('title').string
Traceback...
...
Did you call find_all() when you
meant to call find()?

This works:

>>> soup('title')[0].string
'I am grook'

But it's weird right?

If there is only one title on the HTML then just use find:

>>> soup.find('title')
<title>I am grook</title>

This doesn't return a list. But only searches for the 1st result it finds.

This is the same as using this:

>>> soup.title
<title>I am grook</title>

To get the string. These two do the same:

>>> soup.find('title').string
>>> soup.title.string

It will give you:

'I am grook'

To find or to find_all. That's the question

The shorthand of find_all is:

>>> soup('title')

The shorthand of find is:

>>> soup.title

It's easy to get confused by this optimization.

It's better to just use the name of the method.

Finding attributes

The methods find_all and find search for tags.

Attributes can be accessed as dictionaries.

Given:

<a href="https://twitter.com/tomordonez">I grok Twitter</a>
<a href="https://www.tomordonez.com" name="Awesome">I grok Lambda too</a>

For the second a anchor, the attributes dictionary is:

{'href': 'https://www.tomordonez.com', 'name': 'Awesome'}

Find an anchor tag with a specific attribute:

>>> soup.find('a', attrs={'name': 'Awesome'})
<a href="https://www.tomordonez.com" name="Awesome">I grok Lambda too</a>

Then to access the href you could do this:

>>> anchor_tag = soup.find('a', attrs={'name': 'Awesome'})
>>> anchor_tag
<a href="https://www.tomordonez.com" name="Awesome">I grok Lambda too</a>

>>> anchor['href']
'https://www.tomordonez.com'

Or all in one:

>>> soup.find('a', attrs={'name': 'Awesome'})['href']

Finding Strings

Use the string argument:

>>> soup.find(string="I grok Lambda")
'I grok Lambda'

If you do this. It won't work:

>>> soup.find(string="I ")

But you can use a regular expression:

>>> soup.find(string = re.compile(r'^I'))
'I grok Lambda'

You can also pass a function to the string argument

Passing a Lambda function to the string argument

Given:

<!-- Python is awesome -->
<!-- Lambda is confusing -->
<!-- name="Homer Simpson" -->

To get only the 1st comment you can use find:

>>> soup.find(string = lambda html_comment: isinstance(html_comment, Comment))
' Python is awesome '

To get a list of comments then use find_all:

>>> soup.find_all(string = lambda html_comment: isinstance(html_comment, Comment))
[' Python is awesome ', ' Lambda is confusing ', ' name="Homer Simpson"']

But keep in mind in this case the HTML comments have leading and trailing whitespace. You can just use the strip() method.

I am doing a MS in Computer Science at Georgia Tech with a focus in Machine Learning. I am writing a weekly newsletter about my lessons learned. Follow my quest to conquer data science.