How to Extract JSON Data from HTML with BeautifulSoup and Regex

Learn how to scrape JSON data from HTML using `BeautifulSoup` and `Regex` in Python. Follow this step-by-step guide for efficient web scraping!
---
This video is based on the question https://stackoverflow.com/q/63705060/ asked by the user 'Dan' ( https://stackoverflow.com/u/13339694/ ) and on the answer https://stackoverflow.com/a/63706802/ provided by the user 'Andrej Kesely' ( https://stackoverflow.com/u/10035985/ ) on the 'Stack Overflow' website. Thanks to these great users and the Stack Exchange community for their contributions.

Visit these links for original content and any more details, such as alternate solutions, latest updates/developments on topic, comments, revision history etc. For example, the original title of the Question was: Beautifulsoup JSON

Content (except music) is licensed under CC BY-SA https://meta.stackexchange.com/help/licensing
The original Question post is licensed under the 'CC BY-SA 4.0' ( https://creativecommons.org/licenses/by-sa/4.0/ ) license, and the original Answer post is licensed under the 'CC BY-SA 4.0' ( https://creativecommons.org/licenses/by-sa/4.0/ ) license.

If anything seems off to you, please feel free to write me at vlogize [AT] gmail [DOT] com.
---
Extracting JSON Data from HTML Using BeautifulSoup and Regex

Web scraping often requires extracting JSON data embedded within HTML documents, a task that can be challenging if you're new to Python or web scraping. In this post, we will solve a common problem of this kind using the BeautifulSoup library along with regular expressions (regex).

The Problem

While scraping a webpage using BeautifulSoup, you may encounter JSON data embedded in <script> tags. The goal here is to isolate this JSON data, specifically the jsonConfig object, for further analysis or manipulation. The user who asked the question had trouble extracting it.

The initial attempt using BeautifulSoup's .text attribute returned an empty result, a common pitfall when the data sits inside a <script> tag rather than in the visible page text. Fortunately, there's a straightforward solution using Python's re and json modules.
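As a side note on this pitfall, BeautifulSoup can still locate the right <script> tag: reading the tag's .string attribute returns the script source as a single string. The sketch below is only an illustration and is not the approach taken in the original answer. The sample HTML, the function name find_config_script, and its marker argument are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real page and its JSON keys are not shown in this post.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>
let jsonConfig = {"productId": "123"};
</script>
</body></html>
"""


def find_config_script(html, marker="jsonConfig"):
    """Return the source of the first <script> containing `marker`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        source = script.string  # None when the tag is empty or has several children
        if source and marker in source:
            return str(source)
    return None


script_source = find_config_script(SAMPLE_HTML)
print(script_source)
```

The returned string is still JavaScript source, so the JSON object has to be cut out of it afterwards, for example with a regular expression.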

The Solution

Let's break down the methods used to extract the jsonConfig JSON data from HTML.

Step 1: Set Up Your Environment

Before we code, make sure the necessary libraries are installed. If they are not, install them with pip; the package names are `requests` and `beautifulsoup4`. The re and json modules ship with Python and need no installation.

Step 2: Fetch the HTML Content

Using the requests library, we can fetch the content of a webpage with a single GET request and read the HTML from the response's text attribute.
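A minimal sketch of this step, assuming the page can be fetched with a plain GET request. The function name fetch_html, the timeout value, and the URL in the usage comment are placeholders, not taken from the original question.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


# Usage (placeholder URL, not the page from the original question):
# html_data = fetch_html("https://example.com/product-page")
```

Setting a timeout and calling raise_for_status() are optional, but they keep a slow or failing request from passing silently into the later steps.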

Step 3: Use Regular Expressions to Extract JSON

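Now we can use a regular expression to find the desired JSON. The re.search() function scans the HTML string and returns a match object for the first place the pattern matches, or None if it matches nowhere. The call uses the following pattern and flag.

r'let jsonConfig = ({.*?})': This regex looks for let jsonConfig = {...} in the HTML and captures the braces together with what lies between them. Because .*? is non-greedy, the capture ends at the first closing brace that lets the rest of the pattern match. With nothing after the group, that is the first } in the object, so nested objects are captured in full only if the pattern also matches what follows the object, such as a terminating semicolon.

re.DOTALL: This flag allows the dot (.) to match newline characters, which matters when the JSON spans several lines.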
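The following sketch applies this kind of pattern to a small sample page. It is a variation on the pattern quoted above, not the exact code from the video: the braces are escaped, and a trailing semicolon is added on the assumption that the page ends the assignment with one. The sample HTML, its keys, and the name extract_json_config are hypothetical.

```python
import re

# Hypothetical sample page; the keys are illustrative only.
SAMPLE_HTML = """
<html><body>
<script>
let jsonConfig = {
    "productId": "123",
    "prices": {
        "finalPrice": {"amount": 19.99}
    }
};
</script>
</body></html>
"""

# The trailing ";" keeps the lazy quantifier from stopping at the first "}"
# of a nested object (assumes the assignment ends with "};").
CONFIG_PATTERN = r"let jsonConfig = (\{.*?\});"


def extract_json_config(html):
    """Return the text of the jsonConfig object literal, or None if absent."""
    match = re.search(CONFIG_PATTERN, html, flags=re.DOTALL)
    return match.group(1) if match else None


json_text = extract_json_config(SAMPLE_HTML)
print(json_text)
```

This anchoring is still a heuristic: a string value that itself contains "};" would end the capture early, so the result should be validated by the JSON parser in the next step.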

Step 4: Load the JSON Data into Python

Once we’ve extracted the JSON string (the first capture group of the match), json.loads() turns it into a Python dictionary.
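A minimal sketch of the parsing step. The json_text value is a hypothetical stand-in for the captured string, with made-up keys. The second call illustrates that json.loads() accepts only strict JSON, so a JavaScript object literal with unquoted keys or single quotes is rejected.

```python
import json

# Hypothetical string standing in for the regex capture group.
json_text = '{"productId": "123", "prices": {"finalPrice": {"amount": 19.99}}}'

data = json.loads(json_text)       # str -> dict
print(type(data).__name__)
print(json.dumps(data, indent=4))  # pretty-print for inspection

# JavaScript-style literals are not valid JSON and raise JSONDecodeError.
try:
    json.loads("{productId: '123'}")
except json.JSONDecodeError as error:
    print("not valid JSON:", error.msg)
```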

Step 5: Access the JSON Data

Now you can access any part of the JSON data with ordinary dictionary indexing. For example, the final price of the product can be read by following its keys down from the top-level dictionary.
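As an illustration, the lookup might resemble the sketch below. The key names prices, finalPrice, oldPrice, and amount are assumptions made for this example; the actual keys depend on the page being scraped.

```python
import json

# Hypothetical data; on a real page this comes from the parsed jsonConfig.
data = json.loads('{"productId": "123", "prices": {"finalPrice": {"amount": 19.99}}}')

# Direct indexing raises KeyError if any level is missing.
final_price = data["prices"]["finalPrice"]["amount"]
print("Final price:", final_price)

# Chained .get() calls return None instead when a level is absent.
old_price = data.get("prices", {}).get("oldPrice", {}).get("amount")
print("Old price:", old_price)
```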

Final Output

After running your script, you should see the entire JSON data printed in a readable format, along with the specific final price.

Conclusion

By following this step-by-step approach, you can extract JSON data from HTML documents more efficiently. The combination of BeautifulSoup for HTML parsing and Regular Expressions for data extraction offers a powerful toolkit for web scraping enthusiasts. Test these techniques on your own projects, and you’ll find that extracting structured data like JSON can become a smooth process!

Happy coding and scraping!

Video "How to Extract JSON Data from HTML with BeautifulSoup and Regex" from the vlogize channel