The State Of Web Scraping in 2021

Author: Mihai Avram | Date: 10/02/2021

The area of web scraping has really expanded in the last few years, and it helps to know some of the main frameworks, protocols, and etiquette so that you can build the next awesome Web Scraping tool to revolutionize our world! Or maybe just your local neighborhood, or workgroup – that’s fine too.

In this post, we will cover:

  • What is web scraping?
  • What are the main programming frameworks for web scraping?
  • What are some of the main enterprise-level paid web scraping frameworks?
  • A Python web scraping example where we extract some information from a site with Beautiful Soup
  • A JavaScript (Node.js) example where we interact with Google Search using Puppeteer
  • The Do’s and Don’ts of Web Scraping

Let’s begin!

What is web scraping?

With the web now hosting almost 5 billion web pages, it would be impossible to view each one of them personally. Even if you somehow knew the address of every page, and spent only about 3 seconds on each, it would take you nearly 500 years to view everything. Now imagine another scenario where you need to take different parts of the web and organize them for a specific purpose – perhaps a project to track prices for your favorite electric car across many dealership websites. Doing this manually would, again, take a very long time. This is where web scraping comes in, and it is defined as such:

Web Scraping
The act of retrieving information from, or interacting with, a site hosted on the web using an automated programming script or process. Such a script is also known as a web crawler or bot.

One can write such an automated script fairly easily, often in less than 10 lines of code, and automatically retrieve information from the web, obviating the need to search, organize, or interact with the website manually. For some specific use-cases, like the car dealership example above, this can save a lot of time, and frankly, a lot of business models have been built on web scrapers. Some examples of working business models that use web scraping are tracker services that can alert you when something you desire is back in stock, review sites that aim to aggregate people’s opinions, travel websites that want to provide trip data in real-time, and even the much-contested media/marketing practice of gathering users’ profiles and preferences. There are even examples of web crawlers filling out website profiles for people, submitting posts, and solving captchas – but this is yet another debated gray area where one must be careful not to get into legal trouble.

With just a little bit of coding knowledge, one can do some really interesting things to retrieve, organize, and even interact with various sites online. In theory, one can automate almost anything done manually on the web – with a wide range in difficulty level of course.

What are the main programming frameworks for web scraping?

Language Agnostic Tools

Playwright – One of the best language-agnostic and feature-rich tools for web scraping. Use it if you are scraping or testing complex applications, are building tooling in multiple languages, or need to perform end-to-end testing.

Selenium – An older and popular language-agnostic tool for web scraping that inspired many of the newer frameworks. Use it if you are working on large scraping or testing projects that need scale, are building tooling in multiple languages, and don’t mind spending more time on configuration.

While Playwright and Selenium each have their own pros and cons, you can judge for yourself what is the best tool for your job via this comparison article.

Python Frameworks

Scrapy – An open-source scraping framework, built with efficiency and flexibility in mind, used to extract data from websites in any format. Use it for complex projects that require scraping multiple sites in various ways.

Beautiful Soup – A Python scraping library that one can use to parse a webpage easily and quickly. Beautiful Soup covers only a fraction of Scrapy’s functionality; however, if parsing a web page is all you have to do – Beautiful Soup is the perfect tool for it. Use it for simple projects where all you need to do is scrape the elements of one web page.

MechanicalSoup – An interactive library that builds on top of Beautiful Soup and provides functionality to not only parse a web page, but also interact with it like filling forms, clicking drop-downs, submitting forms, and more. Use it if you need to interact with web pages.

Honorable mention – Pyppeteer (a Python version of Puppeteer)

JavaScript Frameworks

Cheerio – A fast and flexible JavaScript library inspired by jQuery that can parse elements of a webpage. Use it if you want to quickly extract elements of a web page.

Puppeteer – A NodeJS library that can both scrape a webpage, and also interact with any website by filling forms, clicking buttons, and navigating around the web. Use it for a full web automation experience.

Apify SDK – A web scraping platform that can quickly spin up and scale your web automation needs in a web browser. From retrieving web pages to parsing them, and even interacting with them, Apify can do it all, with custom code libraries and server infrastructure to quickly assist you. Use it if you need to start an involved scraping and web automation project that requires a lot of computing resources.

Java Frameworks

Jaunt – A complete web scraping framework for Java that can scrape and interact with web pages. Use it if you need to both parse web pages and interact with them.

jsoup – A simple web scraping solution that can parse web pages. Use it if you need to quickly parse web pages.

Ruby Frameworks

Kimurai – A scraping solution for Ruby that provides a one-stop shop to scrape and interact with web pages.

Honorable mention – Mechanize and Nokogiri Gems

PHP Frameworks

Goutte – A PHP framework made for web scraping that can both scrape and interact with web pages.

What are some of the main enterprise-level paid web scraping frameworks?

What if I want to scrape the web but I don’t know how to code, or don’t want to? you might ask. That’s where the paid services come in, ranging from hands-on No Code tools to fully automated, hands-off services. Below is a list of some of the most popular paid scraping services that can help you get started quickly without having to know how to code.

Scraper API

A custom API that easily scrapes any site and takes care of proxy rotation, captcha solving, and anti-bot checks. You simply enter the URL of any site you want to scrape, and Scraper API returns all the information. Besides the benefit of it being hands-free, it is also cost-effective, providing free requests and relatively cheap pricing rates. Use it if you want the simplest and cheapest solution to scrape sites in a rudimentary way.


Apify

One of the most established players in the web automation space. Apify allows you to leverage thousands of plugins created by the Apify community, which can solve most of the common scraping problems. From scraping Instagram to interacting with travel websites – Apify can do it all. They even have custom solutions where their developers can write custom code to meet your needs. The platform also works if you have coding chops and want to do everything yourself: you simply write the code, and Apify can host and run it as well as provide you with proxy rotation and security measures to make sure your scraping scripts do not get blocked. Apify is also cost-effective compared to the other custom scraping options out there. Use it if you want the most comprehensive options for web scraping at an affordable price.


Parsehub

A point-and-click approach to web scraping. It works by interacting with the Parsehub desktop app: opening a website there, clicking on the data that needs to be scraped, and simply downloading the results. Parsehub also allows for data extraction using regular expressions or webhooks. One can even host scraping pipelines that run on a schedule. Parsehub is a fairly expensive option. Use it if you want to scrape data using little to no coding chops.


This product takes a different approach by giving the user access to a trillion connected facts across the web. The user can extract them on demand with the service; the data may include Organizational data, Retail data, News, Discussions, Events, and more. The organization also provides Big Data and Machine Learning solutions that can help you make sense of the data collected, establish patterns, and build IP that solves problems with its data. Use it if you want access to a treasure trove of data that is tied to your project or organization – and are able to afford it.


Octoparse

Similar to Parsehub, this product offers a point-and-click solution to web scraping. It also offers all the usual web scraping features such as IP rotation, Regex tools to clean up data, and scraping pipelines that can run scraping projects at scale. While Octoparse can solve almost any scraping problem, the service is certainly not cheap. Use it if you want to scrape data using little to no coding chops.


ScrapingBee

A more custom alternative to Scraper API, ScrapingBee allows users to have more control over the websites they scrape with little to no coding experience. Its rates are a bit pricier; however, the team can write custom solutions for anybody willing to outsource the task to professionals. ScrapingBee also provides proxy rotation to bypass bot-restriction software. Use it if you want a cost-effective way to simply scrape a website – with the option of having custom support from professional developers.

Also, if you are a person who likes to live on the cutting edge of innovation, here are some feature-rich, up-and-coming contenders in this space:

  • Browse AI (#5 Product Hunt product of the month – Sept. 2021)
  • ScrapeOwl (#4 Product Hunt product of the day – Oct. 2020)
  • Crawly (#2 Product Hunt product of the week – 2016)

Here are more resources if you want to learn more about different web scraping tools and offerings.

Now let’s get our hands dirty with some scraping projects, shall we?

A Python web scraping example where we extract some information from a site with Beautiful Soup

Let’s start with a simple Python scraping example. For this, we will use the Web Scraping Sandbox where we can very quickly explain and elucidate the main ideas behind web scraping.

As a prerequisite, make sure to have Python 3 installed as well as Beautiful Soup. Once you have Python 3 simply run the following in a shell terminal to install the needed packages.

$ python -m pip install beautifulsoup4


$ python -m pip install requests

Now let us take a look at the code.

# Imports
import requests  # Used to extract the raw HTML of a web page
# Used to read and interact with elements of a web page
from bs4 import BeautifulSoup
from typing import Dict, List  # Optionally used for type hints


# Functions
def extract_web_page_contents(url: str) -> List[Dict[str, str]]:
    """Requests HTML from the url, and extracts information from
    it using Beautiful Soup. Namely, we are extracting just
    a list of books on the page - their names and prices.

    url (str): The url where we are web scraping.

    book_names_and_prices (List[Dict[str, str]]):
        Names and prices of all books retrieved from the
        web page.
    """
    # Used to store book information
    book_names_and_prices = []
    book_information = {
        'name': None,
        'price': None
    }

    try:
        # Extracting web page contents
        web_page = requests.get(url)
        web_page_contents = BeautifulSoup(web_page.content, 'html.parser')

        # Extracting the names and prices of all books on the page
        for book_content in web_page_contents.find_all(
                'article', class_='product_pod'):
            # Useful to visualize the structure of the HTML of the book content
            print('book content: ', book_content)

            # Extracting the name and price using CSS selectors
            book_title = book_content.select_one('a > img')['alt']
            book_price = book_content.select_one('.price_color').getText()

            # Creating new instance of book information to store the info
            new_book_information = book_information.copy()
            new_book_information['name'] = book_title
            new_book_information['price'] = book_price

            # Aggregating many book names/prices
            book_names_and_prices.append(new_book_information)

        return book_names_and_prices
    except Exception as e:
        print('Something went wrong: ', e)
        return None


# MAIN Code Start
if __name__ == '__main__':
    # The Web Scraping Sandbox used throughout this example
    url = 'http://books.toscrape.com'
    web_page_contents = extract_web_page_contents(url)
    print('All extracted web page contents: ', web_page_contents)

Code Explained

Starting at the “MAIN Code Start” section we define the URL to be a link to the Web Scraping Sandbox, which has a fictitious list of books. The goal of this code is to browse to the first page of that website and extract the first set of books on that page. Namely, we want to extract their names and prices.

The “extract_web_page_contents” function achieves this by first retrieving the HTML using the requests library, and then parsing those contents with the Beautiful Soup library. Everything is easy up until this point. You may wonder, however – now that we have the HTML of the page, which looks something like this (below) – how can we extract the contents for those books?

<!DOCTYPE html> <html lang="en-us" class="no-js"> <head> <title>All products | Books to Scrape - Sandbox</title> <meta http-equiv="content-type" content="text/html; charset=UTF-8"/> <meta name="created" content="24th Jun 2016 09:29"/> <meta name="description" content=""/> <meta name="viewport" content="width=device-width"/> <meta name="robots" content="NOARCHIVE,NOCACHE"/><!--[if lt IE 9]> <script src="//"></script><![endif]--> <link rel="shortcut icon" href="static/oscar/favicon.ico"/> <link rel="stylesheet" type="text/css" href="static/oscar/css/styles.css"/> <link rel="stylesheet" href="static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css"/> <link rel="stylesheet" type="text/css" href="static/oscar/css/datetimepicker.css"/> </head> <body id="default" class="default"> <header class="header container-fluid"> <div class="page_inner"> <div class="row"> <div class="col-sm-8 h1"><a href="index.html">Books to Scrape</a><small> We love being scraped!</small></div></div></div></header> <div class="container-fluid page"> <div class="page_inner"> <ul class="breadcrumb"> <li> <a href="index.html">Home</a> </li><li class="active">All products</li></ul> <div class="row"> <aside class="sidebar col-sm-4 col-md-3"> <div id="promotions_left"> </div><div class="side_categories"> <ul class="nav nav-list"> <li> <a href="catalogue/category/books_1/index.html"> Books </a>
</script> </body></html>

There are two fairly straightforward ways to do this. The first, and most popular, way is to use your browser’s developer tools to select and identify the elements on the page you want to extract information from. The second is to use Beautiful Soup to pretty-print and explore the elements within the code. The second approach is a bit more advanced, so let’s just focus on the first. Here’s an example below using the browser web tools.

In the above example, I browsed to the website via Google Chrome, hit F12 to launch the web tools, then clicked the selector button (see image below) to guide my mouse and click on the book I wanted to extract – the book called “A Light in the Attic”.

On the right side of the image above you will see that the exact HTML content that needs to be extracted automatically became highlighted. The trick is to notice that every single book is under the “article” HTML tag and has a class of “product_pod”. Hence, I directed Beautiful Soup to retrieve all instances of articles with that given class, and voila – we now have a list of all the book contents. You may now be ready to celebrate with a glass of sparkling 💦 , ☕ , or 🍺 if you are old enough; however, not so fast. While we have the book contents, we still have them in HTML format, and so we must use more selectors to extract the exact content we want (name and price).

The price is the easiest of the two because if we explore the HTML of a given article, we will observe that the class for each price element is “price_color” – so we just create a Beautiful Soup selector to get the HTML element with that class, which contains the price within the text.

The book title is a bit more difficult to extract as we don’t quite have a class we can hook into. Class properties (e.g. class=”price_color”) and Ids (e.g. id=2343) are much easier to hook into because they have higher specificity – they are typically more unique for our purposes. By contrast, if we want to find all “a” elements on the page, there could be dozens if not hundreds of them! It helps to understand the basics of CSS selectors, which can help you write rules to identify these elements faster. If you’re interested in learning the basics of CSS selectors, check out this guide. For the purposes of this example, however, if we find the element which includes the title by exploring the HTML via dev tools – we will observe that each title is included in the “alt” property of an element called “img” which is the child of another element called “a”. You could do this by hand by exploring the HTML structure in the dev tools. Or, if you want to take a shortcut, you can simply find any title, select it, then right-click on the HTML of the selected element, and click on the “Copy selector” directive (see image below for an example).

You can then paste it into your favorite text editor and you will see that it will yield a CSS pattern that is very close to the one you will use in the code. For instance, here is what the selector contents would be in the example case above:

#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > h3 > a

This tells us roughly what we suspected: there is an “a” element nested under an “article” element. Then all we have to do is extract the “alt” property as we do in the code, and we have just extracted the title!
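
If you want to sanity-check a copied selector without leaving Python, here is a minimal sketch (assuming the BeautifulSoup object web_page_contents created inside extract_web_page_contents above, and the page structure we just observed in the dev tools); both the long copied selector and a trimmed-down version should land on the same element:

# A quick check (a hypothetical snippet, not part of the original example)
# that the copied selector and a shorter hand-written one match.
copied_selector = (
    '#default > div > div > div > div > section > '
    'div:nth-child(2) > ol > li:nth-child(1) > article > h3 > a'
)
first_book_link = web_page_contents.select_one(copied_selector)
shorter_match = web_page_contents.select_one('article.product_pod h3 > a')
print(first_book_link)
print(shorter_match)  # Should print the same "a" element

In practice, trimming the copied selector down to the stable parts (like the “product_pod” class) makes your scraper less likely to break when the page layout shifts slightly.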

Feel free to review the comments of the code for a deeper dive on how the scraping works with Python and Beautiful Soup. I tried to make the code example as self-explanatory as possible. Now that you feel more comfortable with this basic example – how about you challenge your understanding by extracting the book contents for the first 10 pages? Hint: you may need to handle pagination for this (a rough sketch of one approach follows below).
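
This sketch assumes (as you can confirm in your browser) that the sandbox serves its numbered pages under catalogue/page-N.html, and it reuses the extract_web_page_contents function defined earlier:

# A rough pagination sketch; the page-N.html URL pattern is an
# assumption you should verify in the browser before relying on it.
all_books = []
for page_number in range(1, 11):  # First 10 pages of the sandbox
    page_url = f'http://books.toscrape.com/catalogue/page-{page_number}.html'
    page_books = extract_web_page_contents(page_url)
    if page_books:
        all_books.extend(page_books)

print('Total books extracted:', len(all_books))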

A JavaScript (Node.js) example where we interact with Google Search using Puppeteer

Let’s now delve into a more complex scraping example with JavaScript where we will interact with a web page using the Puppeteer API. In this example, we will go to Google and run a quick Google search.

As a prerequisite, you will have to install the latest version of Node.js and npm, and once you have both installed, you can install Puppeteer by running the following in a terminal:

$ npm install puppeteer

Now let us take a look at the code.

// Imports
const puppeteer = require('puppeteer'); // Used for interacting with a website

// Functions
/**
 * Interacts with a website present at a certain url.
 * For our use-case, this url is Google, and we are
 * simply just searching something using the search bar.
 * @param {string} url The url we are browsing to, and
 *                     interacting with.
 * @returns {void}
 */
const interactWithWebsite = async (url) => {
  // Starting a browser session with puppeteer
  const browser = await puppeteer.launch({ headless: false });
  // Opening a new page
  const page = await browser.newPage();
  // Browsing to the url provided, and waiting until the page loads
  await page.goto(url, { waitUntil: 'networkidle2' });
  // Filling in the search as "Best guides for web scraping"
  await page.$eval(
    'input[title="Search"]',
    (el) => el.value = 'Best guides for web scraping',
  );
  // Clicking on the Google Search button
  await page.evaluate(() => {
    document.querySelector('input[value="Google Search"]').click();
  });
  // Waiting for the results
  await page.waitForSelector('#result-stats');
  // Waiting for 5 seconds to admire the results
  const waitSeconds = (seconds) => new Promise(
    (resolve) => setTimeout(resolve, seconds * 1000),
  );
  await waitSeconds(5);
  // You can do more stuff here, like retrieve the results like we did
  // in the Python example above

  // We are done, closing the browser
  await browser.close();
};

// MAIN code start
const url = 'https://www.google.com';
interactWithWebsite(url);

Code Explained

The code starts with the “interactWithWebsite” function, which will browse to Google. We first launch a browser, open a new page, and go to the URL we provided, waiting for it to load. Then, we find the input field with a “title” attribute of “Search” – and fill it with our query. Just as in the Python example above, one can use the browser web tools to find the component on the page, and the properties, as well as CSS selectors, needed to retrieve it with code. Below is an example of finding the search field component and the HTML that defines it – from there we can create the CSS selector to hook into it.

Finally, we find the “Google Search” button, which is an input field with a “value” attribute of “Google Search”, and click it. On the resulting page, there will be an element with the id of “result-stats” – and we wait for it to load. Our magical and automatic Google search is now complete! Feel free to review the comments of the code for a deeper dive on how the scraping works in Puppeteer and Node.js. I tried to make the code example as self-explanatory as possible.

Now that you have learned to both retrieve information from and interact with web pages, you should be well on your way to creating some automation magic! When you do start, it will be important to understand some scraping etiquette, which we will discuss next.

The Do’s and Don’ts of Web Scraping

While it may seem like web scraping is a free and easy way to extract information from the web, the web is no longer a wild west. This means we have to be careful how we interact with various sites; otherwise, we risk being a nuisance at the very least, and possibly being fined or punished for breaking the law in the worst case.

The most important theme to remember from this post, if you don’t remember anything else, is the notion of not doing any harm to a website, its users, or its owners.

Below are some specific Do’s and Don’ts to help you apply the theme above and engage in web scraping in an ethical way:

  • Use only one IP connection
  • Crawl at off-peak traffic times. If a news service has most of its users present between 9 am and 10 pm – then it might be good to crawl around 11 pm or in the wee hours of the morning.
  • Follow the terms of service of a website you are crawling, and especially if you have to sign up or log in to use the services within.
  • Some websites have a robots.txt file that lists the rules and limitations that scrapers should follow when scraping and interacting with the website automatically. Respect it – a quick way to check a site’s robots.txt programmatically is sketched just after this list.
  • If you are crawling to present the content in a new way, or solve a problem – make sure that the solution is unique and not simply copy/pasting content. The latter can easily be seen as copyright infringement.
  • Don’t breach GDPR or CCPA rules
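
For the robots.txt point above, Python’s standard library can check the rules for you before you send a single scraping request. Here is a minimal sketch (the example.com URLs are just placeholders for whatever site you are targeting):

# Checking robots.txt with the standard library before scraping;
# the URLs below are placeholders, not a real target.
from urllib import robotparser

robots = robotparser.RobotFileParser()
robots.set_url('https://example.com/robots.txt')
robots.read()  # Downloads and parses the robots.txt file

# Only proceed if a generic crawler ("*") is allowed to fetch the page
if robots.can_fetch('*', 'https://example.com/some/page.html'):
    print('The rules allow scraping this page')
else:
    print('The site asks crawlers to stay away from this page')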

Here are some articles to delve more into this topic:

Additionally, here’s an article that describes the robots.txt file at length


That’s all for this post. I really hope you now have enough basic knowledge and guides to get started creating an ethical and world-changing web scraper!

Understanding The Blockchain Ecosystem From The Ground Up

Author: Mihai Avram | Date: 12/08/2020

There is no doubt that Blockchain has been exploding, both as a topic and as a technology, for a few years now. Maybe you are a professional who has simply seen the word blockchain too many times and wants to learn it once and for all. Or maybe you are a blockchain enthusiast who wants to dive deeper into the internals of the blockchain ecosystem. In both cases, you came to the right place!

Here we will cover:

  • How blockchain technology works
  • What blockchain is used for and what industries use it
  • What programming languages to use to build a blockchain
  • What are the leading providers of blockchain technologies
  • How to build a blockchain from the ground up (with code)
  • How to learn more about blockchain

If you want to learn any of these notions then keep reading!

What is Blockchain and How Does it Work

In a nutshell, blockchain is a piece of technology that ensures that transactions (e.g. paying for your groceries, a doctor visit, an artist signing a record label contract, etc.) are transparent and recorded in a securely decentralized fashion, so there is no longer a need for a central authority (such as a bank or government) to oversee or regulate them. Because each block is cryptographically tied to the ones before it, a blockchain is also very difficult to alter or tamper with.

In order to understand how blockchain does this and how it works, let’s envision the following example:

Simple Blockchain Example

Imagine that you and two other friends (let’s call them Friend 1 and Friend 2) are using a blockchain to update your shared expenses online. All three of you will have a file on your computers that automatically updates when you buy or sell an item, either from the internet or from each other. You buy some tickets to a concert, and when you do, your computer quickly updates your file and sends copies of your file to your friends. Once your friends receive those files, their computers quickly check if your transaction makes sense (e.g. did you have enough money to buy the tickets, and it is really you who is buying the tickets). If both friends agree that everything checks out, everyone updates their file to include your transaction. This cycle repeats for every transaction that either you or your friends make so that all three of your files are synced up, and there is no authority to oversee the process.
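
To make the “everyone checks that everything adds up” idea a bit more concrete, here is a toy Python sketch (an illustration only, not a real protocol): each entry in the shared file stores a hash of the entry before it, so if anyone quietly edits an old transaction, every hash that follows stops matching and the tampering is immediately visible.

# A toy sketch (not a real protocol) of why tampering is easy to detect:
# each entry stores the hash of the previous one, so changing any past
# entry breaks every hash that follows it.
import hashlib
import json

ledger = []
previous_hash = '0' * 64  # Placeholder hash for the very first entry

for transaction in ['You buy concert tickets', 'Friend 1 pays you back']:
    entry = {'transaction': transaction, 'previous_hash': previous_hash}
    previous_hash = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    ledger.append(entry)

# Everyone can re-run the same hashing and confirm the links still match
for i in range(1, len(ledger)):
    recomputed = hashlib.sha256(
        json.dumps(ledger[i - 1], sort_keys=True).encode()
    ).hexdigest()
    print('Link', i, 'valid:', recomputed == ledger[i]['previous_hash'])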

There is, of course, a bit more nuance to it, and it gets very technical very quickly when you try to understand how to build such a system from a programming perspective. If you want to understand how blockchain works in depth, you can read the academic paper by Satoshi Nakamoto, who created the first blockchain database.

Original blockchain paper by Satoshi Nakamoto (link)

What is Blockchain Used For?

Blockchain is quickly becoming very widespread, with almost every industry touched by this technology. For inspiration, here are just a handful of examples of how blockchain is used today.

Monetary Payments – Blockchain used in monetary transactions creates a more efficient and secure payment infrastructure.

Global Commerce – Global supply chains are governed by blockchain technologies to ensure a more efficient transactional trade system.

Capital Markets – Blockchain enables audit trails, quicker settlements, and operational improvements.

Healthcare – Secondary health data that cannot identify an individual by itself can be placed on the blockchain, which allows administrators to access such data without worrying about it all sitting in one place – making it more secure.

Energy – Utility processes such as metering, billing, emission allowances, and renewable energy certificates all can be tracked via blockchain transactions in one decentralized place.

Media – Media companies use blockchain to protect IP rights, eliminate fraud, and reduce costs.

Voting – Recording each vote on a decentralized blockchain helps address the problem of elections being hacked or tampered with.

Cybersecurity – Blockchain solutions in the security space ensure that there is no single point of failure, and it also provides privacy as well as end-to-end encryption.

Other real-life examples exist in Regulatory Compliance and Auditing, Insurance, Peer-to-Peer Transactions, Real Estate, Record Management, Identity Management, Taxes, Finance Accounting, Big Data, Data Storage, and IoT among many others.

What are the Most Popular Types of Cryptocurrency?

Bitcoin – The cryptocurrency that started it all. It was launched in 2009 and follows closely the original Satoshi Nakamoto cryptocurrency paper referenced earlier. It is mostly used for monetary transactions.

Litecoin – Created in 2011 as an alternative to Bitcoin. Litecoin is a little faster than Bitcoin, has a larger coin supply limit, and operates on different algorithms.

Ethereum – Ethereum was created in 2015 and focuses on decentralized applications with smart contracts instead of just monetary transactions. This way, transactions beyond monetary exchange can happen, such as trading digital collectibles or IoT activations on a smart-grid network.

Ripple – A cryptocurrency that is not blockchain-based. However, it is often used by companies to move large amounts of money quickly across the globe.

For a more extensive list, check out these resources.

  • An ever-growing list of cryptocurrencies on Wikipedia (link)
  • Understanding The Different Types of Cryptocurrency by SoFi (link)
  • Types of Cryptocurrency Explained by Equity Trust (link)

What are the Best Programming Languages to Develop Blockchain?

C++ – Best if you need to build a blockchain from scratch or change some low-level internals of how blockchain works.

Solidity – Best if you are set on using the Ethereum Blockchain framework and platform.

Python – Best if you want to bring blockchain to general-purpose apps, especially in Data Science.

JavaScript – Best if you want to build a blockchain for the web.

Java – Best if you want to build a general, and large-scale object-oriented application.

There are, however, blockchain developments in almost all programming languages, so pick the one you’re most comfortable with or that is required for the project.

What are the Leading Providers of Blockchain Technologies

Coinbase – A very secure and free API that supports many different cryptocurrencies such as Bitcoin and Ethereum, and also supports different blockchain operations such as generating digital wallets, getting real-time prices, and making crypto exchanges. Use it if you want to create blockchain apps cost-effectively.

Bitcore – Another free and speedy option with many different blockchain transactions possible. Use it if you want to build very fast blockchain applications with quick transaction times.

Blockchain – The oldest and most popular blockchain framework. It has a large developer community and low timeouts. Use it if you need to implement blockchain wallet transactions.

For a more extensive list check out the following resources.

  • Top 10 Best Blockchain APIs: Coinbase, Bitcoin, and more (link)
  • How to Choose the Best Blockchain API for Your Project by Jelvix (link)

How to Learn More About Blockchain

The fastest way to learn about blockchain is to first take a course, and then start building one yourself. If you’re also serious about blockchain and want to learn it continuously, you should subscribe to some blockchain newsletters.

Here are some links to the courses. Look for the ones with the highest ratings and popularity:

  • [Top 10] Best Blockchain Courses to learn in 2020 (link)
  • 10 Best Blockchain Courses and Certification in 2020 (link)

Also, if you want to build a blockchain, check out this well-sourced Quora post. Furthermore, here is a list of good newsletters to learn more about blockchain from your inbox!

How to Build a Blockchain, a Brief Introduction (With Code)

Let’s build a simple blockchain so that we can understand some of its subtler nuances. The most important inner workings of a blockchain are the following: the chain itself, which stores transactional information; a way to mine new possible slots in the chain; the proof of work that identifies if the chain is valid; and a consensus algorithm that allows nodes or computers to vote on whether the chain is valid. The code labels these notions as # CHAIN step, # MINING step, # POW step, and # CONSENSUS step respectively so you can trace them back. Note that there is an important aspect of the proof of work: a proof that a new block is valid should be very easy to verify, but very hard to create from scratch (mining a new block). This property is important because it allows us to easily check that a blockchain has not been tampered with, and prevents attackers from re-creating a blockchain easily (it becomes effectively immutable). We will build all these things below. Pay close attention to the comments as they explain the purpose of each component. Also, note that some functions, such as is_valid_proof_pattern, get_blockchain, and block_matches_proof, are left unimplemented to keep this post short, so just imagine that they exist and do what they are supposed to do.

Note that the code below is not an exact replica of a blockchain. Instead, it is a simplified representation meant for inspiration and intuition, not a rigorous implementation of a blockchain.
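
Before diving into the larger sketch, here is a tiny, self-contained illustration of that proof-of-work property (a toy example of my own; the “0000” prefix is an arbitrary difficulty setting, not part of the code below): finding a valid nonce takes many hash attempts, while verifying someone else’s nonce takes just one.

# A toy proof-of-work sketch: finding a nonce is slow (many attempts),
# but checking someone else's nonce is a single hash.
import hashlib

def find_proof(block_data: str) -> int:
    nonce = 0
    while True:
        attempt = hashlib.sha256(f'{block_data}{nonce}'.encode()).hexdigest()
        if attempt.startswith('0000'):  # Hard: brute-force many nonces
            return nonce
        nonce += 1

def verify_proof(block_data: str, nonce: int) -> bool:
    attempt = hashlib.sha256(f'{block_data}{nonce}'.encode()).hexdigest()
    return attempt.startswith('0000')  # Easy: one hash to check

nonce = find_proof('Joe pays Amy 10 mBTC')
print('Found nonce:', nonce, '| valid:', verify_proof('Joe pays Amy 10 mBTC', nonce))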

Blockchain Server Code

""" Blockchain Server Code

On the blockchain server is where we store the main
implementation of the blockchain. The clients (or apps
such as your banking app) would hit a server like this
as they create new transactions and store them on the
blockchain, or as miners try to mine new blocks.
The classes and code below represents the code that
sits on the blockchain server.

# Imports

from datetime import datetime  # Generates unique timestamps
import hashlib  # Used for hasing our blocks

# Classes

class Transaction():
    A given monetary transaction.
    Example: Joe pays Amy 10 mBTC.
  __init__(self, frm, to, amount):
    self.frm = frm = to
    self.amount = amount

class Block():
    A block on the blockchain containing blockchain
    transactions. Note that every block has a hash
    that is associated to previous blocks.

    self.index = index
    self.previous_hash = previous_hash
    self.proof_of_work = proof_of_work
    self.timestamp = timestamp
    self.transactions = transactions

class Blockchain():
		The blockchain containing various blocks
		that build on each other as well as methods
		to add and mine new blocks. (# CHAIN step)
		self.blocks = []
		self.all_transactions = []

		# Every blockchain starts with a genesis first block
		genesis_block = new Block(


	def add_block(block):
		"""Adds a new block to the blockchain.

		       block (Block class): A new block for the


	def add_new_transaction(transaction):
	    """Adds a new transaction to the blockchain.

		       transaction (Transaction class): A new transaction
			                                    for the blockchain

	def get_full_chain():
	    """Returns all the blockchain blocks.

	           all_blocks (List[Block class]): All the blocks in
			                                   the blockchain.
	    all_blocks = self.blocks
	    return all_blocks

	def get_last_block():
	    """Gets the last block in the blockchain.

	           last_block (Block class): The last block in the
	    last_block = None
	    if self.blocks:
	        last_block = self.blocks[-1]

	    return last_block

    def hash(block):
        """Computes a hashed version of a block and returns it.

               block (Block class): A block in the blockchain.

               hashed_block (str): A hash of the block.
        stringified_block = json.dumps(
			block, sort_keys=True
        hashed_block = hashlib.sha256(
        return hashed_block

	def mine_new_block(possibilities):
	    """An attempt to mine a new block in the blockchain.
           (# MINING step)

		       possibilities (List[Possibility class]):
			   	All possibilities for mining that the
				miners compute/create.
		       reward (str): A reward for the miners if they

        last_block = self.get_last_block()

		# Go through many possible proofs, which is equivalent to
		# using computational power, to find the new block.
		for possibility in possibilities:
			mining_success = False
			previous_hash = self.hash(last_block)
			possible_proof = hashlib.sha256(

			# We imagine this method exists (# POW step)
			if is_valid_proof_pattern(possible_proof,
				# Our possible proof was correct, so miner was
				# able to mine a new block!

				# Forge the new Block by adding it to the chain
				index = last_block.index + 1
				proof_of_work = possible_proof
				timestamp = timestamp.utcnow()
				transactions = self.all_transactions

				new_block = new Block(

				# The mining was a success, we stop mining
				mining_success = True

		# Give reward to miner if mining was a success
		reward = '0 mBTC'
		if mining_success:
		    reward = '0.1 mBTC' # The reward can be anything

		return reward

In short, the server code contains a blockchain, which contains blocks and transactions. Miners can use computational power to mine new blocks and, as an incentive for doing so, they get rewarded. Consumers can add transactions to the blockchain (e.g. you pay a friend back for lunch) and those transactions will then live on the blockchain. The blockchain is really a chain of blocks of transactions that are tied to one another in a way that can be verified.

Client Code Accessing The Blockchain

""" Client Code Accessing The Blockchain

The client or blockchain application that gets
API requests for new transactions. It primarily
interacts with the blockchain server from above,
but has some internal helper functions to store the
new transactions. Note that there could be dozens if
not thousands of these clients that do the same things
as decentralized transactions are written to the
blockchain. Imagine an app like Apple Pay where
everyone is paying each other, client connections
like these would register the transactions on the
blockchain. Below are the client helper functions and

# Functions

def check_consensus(all_nodes, our_blockchain):
    """Compares our blockchain with blockchains from
	   other nodes in the network, and attempts to
	   find the longest valid blockchain, and returns it.
       (# CONSENSUS step)

	       all_nodes (List[Node class]): All nodes in
		   								 the network.
	       our_blockchain (Blockchain class): Our blockchain.

           longest_valid_blockchain (Blockchain class):
		   		The longest valid blockchain.
	longest_valid_blockchain = our_blockchain
	longest_blockchain_len = len(

	for node in all_nodes:
		# Imagine the get_blockchain method exists on the node
		node_blockchain = node.get_blockchain()

		is_valid_chain = True
		for block in node_blockchain.get_full_chain():
			# Imagine the block_matches_proof method exists
			if not block_matches_proof(block):
				is_valid_chain = False

		current_blockchain_len = len(
		if (is_valid_chain
		    and current_blockchain_len > longest_blockchain_len):
			longest_valid_blockchain = node_blockchain
			longest_blockchain_len = len(

	return longest_valid_blockchain

def get_other_nodes_in_network():
	    Returns all nodes, or servers/computers in the network.
	    Code not written here as it is application dependent.
	return all_nodes

def get_our_stored_blockchain():
	    Retrieves the current blockchain on our node or server.
	    Code not written here as it is application dependent.
	return our_blockchain

def set_our_stored_blockchain(new_blockchain):
	    Sets the current blockchain on our node or server.
	    Code not written here as it is application dependent.
	return status

# Now let's say that Joe wants to pay Amy 10 mBTC and
# the client prepares this transaction to write it
# to the blockchain. This is roughly what happens below.

# We first prepare the transaction
frm = 'Joe'
to = 'Amy'
amount = '10 mBTC'
new_transaction = new Transaction(frm, to, amount)

# Then we get the longest valid blockchain we can write our
# new transaction to.
our_blockchain = get_our_stored_blockchain()
all_nodes = get_other_nodes_in_network()
longest_valid_blockchain = check_consensus(
	all_nodes, our_blockchain
if our_blockchain != longest_valid_blockchain:
	# We have an out of date or invalid blockchain
	# so we update our blockchain as well.
	our_blockchain = get_our_stored_blockchain()

# Now that we have the current up-to-date blockchain
# we simply write our new transaction to our blockchain.

All the client code needs to do is make sure that the blockchain it is working with is up to date by checking the consensus between all the nodes (or servers) in the blockchain network. Once the client code has the proper, up-to-date blockchain, a new transaction can be written.

Code That Miners Use

""" Code That Miners Use

The miners also leverage the blockchain server from above. 
The role of the miners is to come up with compute possibilities 
to create new blocks using compute power. They first retrieve 
the most current blockchain, and then try to mine a new 
block via calling the following methods, and getting rewarded 
in the process if they are successful.

# Code for the generate_possibilities function is application 
# dependent.
possibilities = generate_possibilities()  
reward = current_blockchain.mine_new_block(possibilities)

As the miners keep mining new blocks, the blockchain grows and more transactions can be stored on it. By understanding the server, client, and miner parts of the blockchain lifecycle, you should have a good grasp of the different components of a blockchain. There are also more intricacies to a blockchain than the components covered here, such as the details of the proof of work, how transactions are stored, hashed, and regulated, double spending, the verification process, and much more. Taking a course is one of the best ways to understand these nuances.

Below are some resources to other simple blockchain implementations if you’re curious.

Learn Blockchains by Building One (link)

Simple Blockchain in 5 Minutes [Video]

In Conclusion

Well, there you have it, a good primer on this new technology that is dawning upon us. I hope that by understanding blockchain high-level and by diving deeper into the links provided, you can become proficient with blockchain in no time!

Why Use Machine Learning Pipelines and What Frameworks Exist for Them?

Author: Mihai Avram | Date: 5/17/2020

Machine Learning has evolved far beyond just training a model on data and running that trained model to return classification results. In order to efficiently build Machine Learning solutions that run effectively in production environments, we must expand our solutions to provision, clean, train, validate, and monitor the data and models at scale. This requires a new set of skills and tooling centered on the Machine Learning pipeline.

Scikit-learn is a very popular Machine Learning framework, so let’s frame this idea around it and start with a simple pipeline example.

A Simple scikit-learn Machine Learning Pipeline

Scikit-learn is one of the most popular machine learning libraries implemented in Python, and the key is the Pipeline class from sklearn.pipeline, as shown in the code. We start with the following code.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Retrieving our data using our custom function
x_vals, y_vals = load_in_raw_data()

# Building the pipeline
pipeline = Pipeline([
    ('scalar_step', StandardScaler()),
    # More data preprocessing steps can go here
    ('dimensionality_reduction_step', PCA(n_components=3)),
    ('classification_step', LogisticRegression())
])

# Running our pipeline against our data to fit and create the model
pipeline.fit(x_vals, y_vals)

Let us go through the code step by step.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Here we import all of the packages needed from the scikit-learn library in order to build our pipeline. You may need to import more of them based on the problem you have at hand and the steps involved in your pipeline. For instance, if your pipeline involves a Naive Bayes classifier, then the following import would be needed.

>>> from sklearn.naive_bayes import GaussianNB


# Retrieving our data using our custom function
x_vals, y_vals = load_in_raw_data()

We leverage a function we have created in our code, load_in_raw_data(), which is not included in this post because it is open to interpretation and varies from case to case. For instance, this function could load the popular Iris Data Set from the UCI ML Repository via sklearn.datasets.load_iris(), or it could simply load a file from disk.
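
For illustration, here is one hypothetical way load_in_raw_data() could look if it took the Iris route mentioned above:

# A hypothetical load_in_raw_data() implementation using the Iris dataset.
from sklearn.datasets import load_iris

def load_in_raw_data():
    iris = load_iris()
    return iris.data, iris.target  # x_vals, y_vals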


# Building the pipeline
pipeline = Pipeline([
    ('scalar_step', StandardScaler()),
    # More data preprocessing steps can go here
    ('dimensionality_reduction_step', PCA(n_components = 3)),
    ('classification_step', LogisticRegression())

We build our pipeline by providing a sequence of transformations that our dataset will go through. These transformations happen in the sequence they are provided, so the scalar_step will happen before the dimensionality_reduction_step. Note that you can include different transformations, and as many as you would like, depending on the Machine Learning problem you are looking to solve.


# Running our pipeline against our data to fit and create the model
pipeline.fit(x_vals, y_vals)

We run our data through our pipeline to create and fit our model to the target values provided (y_vals). You can later use that model to predict future y values based on new x values.
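
For example, a minimal (hypothetical) inference call on the fitted pipeline would look like this, where new_x_vals is an array of unseen samples with the same columns as x_vals:

# new_x_vals is a hypothetical array of new, unseen samples
predictions = pipeline.predict(new_x_vals)
print(predictions)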

And voila! That’s the skinny on scikit-learn pipelines. For more information, you can check out the following resources, which can fortify your knowledge of scikit-learn pipelines.

Scikit-learn documentation (link)

Scikit-learn pipeline examples from Queirozf (link)

Creating Sklearn pipelines in Python (Video)

Creating Pipelines Using Sklearn in Python (Video)

Full-fledged Machine Learning Pipeline Frameworks

Now imagine having to run this Machine Learning task in a full-fledged production environment, servicing many stakeholders, that needs to do the following:

  1. Have the flexibility to quickly be configured and re-configured
  2. Be able to scale quickly
  3. Retrieve and clean data
  4. Perform feature extraction and selection
  5. Train the Machine Learning model(s)
  6. Test and validate the Machine Learning model(s)
  7. Monitor the running Machine Learning model(s)
  8. Take care of algorithm biases, fairness, and safety
  9. Send alerts if there are any anomalies in the system
  10. Follow security best practices
  11. Be fault-tolerant

The simple scikit-learn pipeline does not have the features to be able to take care of these problems. This is where other DevOps frameworks and pipelines come in which we will discuss next.

Tensorflow Machine Learning Pipeline With TFX

(Ref. – link)

TensorFlow Extended (also known as TFX) is a production-grade pipeline framework created by Google. It works by segregating Machine Learning tasks into different components that run in a sequence. One such component may be a code segment that takes in the input data and splits it into training and test sets, while another may be a code segment that trains a Logistic Regression model. All of these components run in a Directed Acyclic Graph (DAG), which is a technical way of saying that the components run sequentially without forming any loops (e.g. step B will always follow step A just once).

A typical TFX pipeline consists of the following components shown below.

ExampleGen – reads in the input data and can split it into training and test sets

StatisticsGen – computes statistics about the input dataset

SchemaGen – creates a schema for the input data and infers data types, categories, ranges, and more

ExampleValidator – validates the input data and checks for training/test skews or anomalies in the data

Transform – creates features from the input data

Trainer – trains the model based on the data and features

Evaluator – tests the trained model and performs validation checks as well as an analysis of the model to assess whether it is ready to be deployed in production

Pusher – deploys the trained, tested, and polished model to production

Here’s a simple example illustrating the Directed Acyclic Graph (DAG) of these steps using Apache Airflow.

(Ref. – link)

As you can see, the first step here is the CsvExampleGen, which feeds into the other steps, and the steps do not loop back around to the top. This creates a dependency graph whereby a step such as the Trainer cannot run until the SchemaGen and Transform have completed.
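
To make that chaining concrete, here is a minimal sketch of how the first few components hand their outputs to one another (assuming a recent TFX 1.x release and a CSV folder path; the exact argument names have shifted between TFX versions, so treat this as illustrative rather than canonical):

# A minimal sketch of TFX components chaining via their outputs
# (assumes a recent TFX 1.x release; argument names vary by version).
from tfx import v1 as tfx

example_gen = tfx.components.CsvExampleGen(input_base='path/to/csv_data')
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'])
# ...ExampleValidator, Transform, Trainer, Evaluator, and Pusher continue
# the same pattern, each consuming the outputs of earlier components.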

An important bit that needs to be highlighted is that after each component finishes, it stores any output artifacts in a metadata store, which then get picked up as input by the next component. This is how the components can execute sequentially by feeding off of each other. As you may imagine, this complex sequential runtime of a Machine Learning pipeline needs to run under an orchestration service. The orchestration service takes care of hosting the pipeline on various machine nodes or even clusters. It can delegate process memory, disk space, and processing power for various tasks and compute nodes, as well as direct the flow of the pipeline in a simple manner. Examples of such orchestrator tools are Kubeflow and Apache Airflow.

TFX is a vast and complete Machine Learning pipeline service that is really best covered in a course. This blog post cannot do it justice beyond a very cursory introduction to what it can do.

For more information, the TensorFlow Extended site has some great starting guides, examples, and tutorials to get you started!

Azure Machine Learning Pipelines

(Ref. – link)

In the same vein of Machine Learning pipelines, another powerful offering is the Microsoft Azure Machine Learning pipeline. While this is a deep topic that requires its own post, it consists of the following steps.

First, one must sign up for the Azure service and create an Azure Machine Learning workspace. Then, one needs to set up the Azure ML SDK to enable the ability to configure the pipeline. Afterward, one needs to set up a datastore for storing artifacts from the pipeline in persistent storage, and a PipelineData object to allow data to flow easily between pipeline steps and enable the steps to communicate with each other. The final step involves configuring the compute nodes or targets on which the pipeline will run. The rest will just consist of code to create and launch pipeline steps for data preparation, model training, model storage, model validation, model deployment, and monitoring. Are you seeing some patterns? This is somewhat similar to the TensorFlow Extended example.
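
As a rough sketch of what those pieces look like in code (assuming the v1 azureml SDK; the workspace config, compute target name, script names, folder, and experiment name below are hypothetical placeholders, and Microsoft’s newer v2 SDK uses a different API):

# A sketch with the v1 Azure ML SDK; all names here are hypothetical.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()  # Assumes a config.json for your workspace
prepared_data = PipelineData(
    'prepared_data', datastore=ws.get_default_datastore())

prep_step = PythonScriptStep(
    name='prepare_data',
    script_name='prep.py',                # Hypothetical script
    arguments=['--output', prepared_data],
    outputs=[prepared_data],
    compute_target='cpu-cluster',         # Hypothetical compute target
    source_directory='pipeline_scripts',  # Hypothetical folder
)
train_step = PythonScriptStep(
    name='train_model',
    script_name='train.py',               # Hypothetical script
    arguments=['--input', prepared_data],
    inputs=[prepared_data],
    compute_target='cpu-cluster',
    source_directory='pipeline_scripts',
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
pipeline.submit(experiment_name='my-ml-pipeline')  # Hypothetical experiment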

For specifics, check out the following two articles as they explain this topic at length.

What are ML Azure pipelines (link)

Creating ML pipelines with the Azure ML SDK (link)

There are also other notable Machine Learning pipeline frameworks that we should be aware of, highlighted below.

Keras using scikit-learn pipelines (link)

Apache Spark pipelines (link)

AWS Machine Learning pipelines using Amazon SageMaker and Apache Airflow (link)

d6tflow (can use PyTorch as well) (link)

Some general Python pipeline packages (link)

AutoML – A Simpler Way to Leverage Machine Learning Pipelines

Google Cloud AutoML

In case you haven’t made this observation yet, creating a Machine Learning pipeline can be incredibly time-consuming and complex. AutoML aims to simplify this process by skipping intermediary steps such as feature selection and model training/tuning, going from the initial raw data straight to final predictions about that data. This is great because one can essentially build a Machine Learning pipeline with very little effort and have it compute results in no time. This does come with drawbacks, however. AutoML frameworks typically emphasize only predictive performance as the end goal (i.e. did it classify well or not?), and often there is more to Machine Learning than performance, such as bias/fairness as well as space and time complexity. Finally, AutoML can build some very powerful standard models, but if you have a more custom or unique problem that requires combining some esoteric Machine Learning and statistical concepts, or you need to squeeze out maximum performance and accuracy, you will be better off building the Machine Learning pipeline yourself.

Some leaders in this space are the following (a minimal usage sketch follows the list):

Google Cloud AutoML (link)

Auto Sklearn (link)

H2O AutoML (link)

Auto Keras (link)
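
As a quick illustration of how little code an AutoML run can take, here is a minimal sketch using the Auto Sklearn package listed above (assuming it is installed via pip and that x_vals/y_vals are the arrays from the earlier example):

# A minimal auto-sklearn sketch; assumes `pip install auto-sklearn`
# and classification labels in y_vals.
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # Give the search 5 minutes overall
)
automl.fit(x_vals, y_vals)
print(automl.sprint_statistics())  # Summary of the models it tried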

Final Remarks

As we conclude, I want to leave you with a final note. If you want to be proficient in quickly learning and using Machine Learning pipeline tools, it may be worthwhile to add Docker to your Machine Learning skillset. Moreover, you should be familiar with Object-Oriented Programming (OOP) principles and have a good understanding of how you will organize all the different components of your Machine Learning applications (e.g. your input files, trainers, optimizers, validators, hyperparameters, etc.). David Chong wrote a good post to help you learn how to do this.

I hope this post has shed light on a more complex and progressive topic of Machine Learning that we should soon pick up on as responsible Data Scientists and Machine Learning engineers.

Cheers and happy coding!