Setting up Python
In this tutorial, we start from scratch, get fully set up, and begin doing real tasks with Python.
Installing Python
This is a straightforward step, very similar to installing any software application. Go to the Python home page, python.org, and select the version that is appropriate for your system under “Downloads”.
Then go through the installation wizard to complete the setup.
As a result of this installation, you now have a new command on the command line, called python. Just as you probably use the host command or other tools, you can now use the python command.
For historical reasons, on many systems (Linux and macOS in particular) the command is actually python3. There is no need to get into the details: we'll use python3 once, and when our environment is activated, we'll continue normally with python (without the “3”).
Creating a virtual environment
When you want to install a mobile app, you generally have constraints, like the device and operating system. You might see that an app needs iPhone 14+ and iOS 16.1+ to work properly. This is the “environment” in which this software (the app) can work.
With Python, a virtual environment is essentially a folder on your machine. It contains two things:
- A Python installation. Eventually you'll probably have several of those, for different projects and/or different versions of Python.
- A set of third-party packages (libraries).
Let’s now start the terminal application, and go to the command line to create our virtual environment.
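On Windows:
python -m venv C:\path\to\new\virtual\environment
On Linux/MacOS:
python3 -m venv /path/to/new/virtual/environment
Command breakdown:
- python3 (python on Windows): The python command
- -m: module, “use the module”
- venv: the name of the virtual environment module
- C:\path\to\new\virtual\environment (or /path/to/new/virtual/environment): where you want the environment to be created.
You can simply run this:
python3 -m venv venv
This will create the virtual environment in the current working directory, under a new directory called venv.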
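Since venv is a standard library module, the same step can also be done from Python itself, which makes it easy to see that an environment really is just a folder. This is a minimal sketch rather than a required step: the function name create_env and the throwaway demo folder are assumptions for illustration, and with_pip is switched off only to keep the demo fast (python3 -m venv includes pip by default).

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir, with_pip=False):
    """Create a virtual environment in env_dir and return its top-level entries."""
    # venv.create is the standard library's convenience function for building
    # an environment, the programmatic counterpart of `python3 -m venv`.
    venv.create(env_dir, with_pip=with_pip)
    return sorted(p.name for p in Path(env_dir).iterdir())


# Demo in a temporary folder so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    entries = create_env(Path(tmp) / "venv")
    print(entries)
```

The printed list differs between operating systems and Python versions, but it should always include a pyvenv.cfg file and a folder holding the python executable (bin on Linux/MacOS, Scripts on Windows).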
Activating the virtual environment
We have created the environment, and now we want to activate it. When an environment is active, the python command runs the particular Python installation of the environment that we just created:
On Windows
- On cmd.exe: <venv>\Scripts\activate.bat
- On PowerShell: <venv>\Scripts\Activate.ps1
On Linux/MacOS:
source <venv>/bin/activate
Replace <venv> with the actual path of your environment.
Once it is activated, the prompt should be updated to show the name of the environment in parentheses at the beginning, for example (venv).
You can also check which Python you are using by running the which command (on Linux/MacOS):
which python
On my machine, the output is a path inside my environment's folder (I called it venv312 so I know that it has Python version 3.12).
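The same check can be made from inside Python. This is a small sketch using only the standard library; the helper name in_virtual_env is an assumption for illustration. It relies on the fact that, inside an environment created with venv, sys.prefix points to the environment folder while sys.base_prefix points to the Python installation it was created from.

```python
import sys


def in_virtual_env():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix (the environment) differs from sys.base_prefix
    # (the base Python installation the environment was created from).
    return sys.prefix != sys.base_prefix


print("Python executable:", sys.executable)
print("Environment folder:", sys.prefix)
print("Inside a virtual environment:", in_virtual_env())
```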
Installing a few Python packages (libraries)
Your mobile phone comes with powerful and useful native apps, but it would still lack a lot of potential without an app store. The same applies to Python, and to programming languages in general.
We now want to install a few packages that will help us in our digital marketing work, especially if you do SEO/SEM.
With the environment activated, run the following:
pip install jupyterlab advertools pandas plotly
Command breakdown:
- pip: The command we use to install packages
- install: The install command (there are a few others, but we won’t cover them here.)
- jupyterlab: The first package we want to install. This is the web app that allows us to interactively run Python commands, and get rich outputs like interactive HTML/JS charts and apps.
- advertools: The main library for SEO/SEM
- pandas: The main library for data processing, manipulation, sorting, reshaping, etc.
- plotly: The library for interactive data visualization
You can install as many libraries as you want in one go; you just have to supply their names, separated by spaces.
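To confirm that the installation worked, you can ask Python which package versions it sees. This sketch uses the standard library's importlib.metadata; the helper name installed_versions is made up for illustration, and the list simply repeats the four package names covered in this section.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in package_names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


packages = ["jupyterlab", "advertools", "pandas", "plotly"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a package shows as not installed, the most likely cause is that pip was run while a different environment (or none) was active.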
Starting jupyter lab
With the environment activated and JupyterLab installed, simply run:
jupyter lab
This should open JupyterLab in your browser, and you should see its interface.
Note that the panel on the left is a regular file browser that you can use like any file browser, and it shows the files on your computer. You can open many formats inside Jupyter in case you want to preview them, like PDF, CSV, images, and of course, most importantly, Jupyter notebooks (with the extension .ipynb).
Now to create a new jupyter notebook, click on the icon under “Notebook”.
You are now ready to start writing code and doing some work with Python.
Crawling a website
In order to use the libraries that we installed, we need to start them or activate them or, in Python terms, import them.
In the first code cell you can run the following by clicking on the play button on the top part of your notebook (or by pressing SHIFT+ENTER after clicking inside the code cell):
import advertools as adv
import pandas as pd
Now we have two libraries imported under the aliases adv and pd. These are simply shortcuts that make them faster to type.
Each one of those libraries has a bunch of functions, classes, and various objects. We can start using them with the dot notation.
We access them just like we use the right click on graphical interfaces. The right click functionality basically displays a contextual menu based on the type of object that you right-clicked. If it’s an image for example, you get “Save image as”, “Copy image address” and so on. If you right-click a string of characters you get a different set of options that are particular to that type of object.
With libraries, we can access the available functions and methods of an object using the dot notation. In Jupyter, typing a dot after the object (for example adv.) and pressing the TAB key lists them.
You can now select any function you want, and we’ll use the crawl function.
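Outside Jupyter, the built-in dir function gives a listing similar to the TAB menu. Here is a minimal sketch that uses the standard library's json module as a stand-in, so that it runs without any installed packages; public_names is a hypothetical helper name. With the libraries from this tutorial imported, the same call on adv should include crawl among the names.

```python
import json


def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# json is used here only as an example module; any imported library works.
names = public_names(json)
print(names)
print("dumps" in names)
```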
This function is a full-fledged crawler, and has many options that you can explore later. For now, we will run it minimally by specifying the URL(s) to crawl and the file where we want to save the crawl results, output.jsonl:
adv.crawl('https://example.com', 'output.jsonl')
Now you should have a new file with the name you chose, and we are going to use pandas to read it into a table, or DataFrame:
pd.read_json('output.jsonl', lines=True)
| | url | title | viewport | charset | h1 | body_text | size | download_timeout | download_slot | download_latency | depth | status | links_url | links_text | links_nofollow | ip_address | crawl_time | resp_headers_Content-Length | resp_headers_Age | resp_headers_Cache-Control | resp_headers_Content-Type | resp_headers_Date | resp_headers_Etag | resp_headers_Expires | resp_headers_Last-Modified | resp_headers_Server | resp_headers_Vary | resp_headers_X-Cache | request_headers_Accept | request_headers_Accept-Language | request_headers_User-Agent | request_headers_Accept-Encoding | resp_headers_Accept-Ranges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://example.com | Example Domain | width=device-width, initial-scale=1 | utf-8 | Example Domain | Example Domain | 1256 | 180 | example.com | 0.103387 | 0 | 200 | https://www.iana.org/domains/example | More information… | False | 93.184.215.14 | 2024-12-28 01:52:49 | 648 | 344584 | max-age=604800 | text/html; charset=UTF-8 | Sat, 28 Dec 2024 01:52:49 GMT | “3147526947+gzip” | Sat, 04 Jan 2025 01:52:49 GMT | Thu, 17 Oct 2019 07:18:26 GMT | ECAcc (bsb/27D1) | Accept-Encoding | HIT | text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 | en | advertools/0.16.3 | gzip, deflate | nan |
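Each row of a crawl table like this corresponds to one line of the output file, and each line is a standalone JSON object. The following sketch, using only the standard library, writes a tiny JSON lines file and reads it back to show the idea. The file name sample.jsonl and the helper read_jsonl are assumptions for illustration, and the single record reuses a few values from the table above rather than coming from a real crawl.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a JSON lines file into a list of dictionaries, one per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# One record with a few of the columns shown in the table.
sample = {
    "url": "https://example.com",
    "title": "Example Domain",
    "status": 200,
    "size": 1256,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    rows = read_jsonl(path)

print(rows[0]["url"], rows[0]["status"])
```

Reading the file with pandas and lines=True does this parsing for you and arranges the keys as columns.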
Now you can explore other functions and build your own workflows. Check out the documentation of the libraries you are interested in, and try playing around with the available options.
If you have gone through the whole process, you have achieved a great deal, and are at a new level where you can experiment with code that you see on the internet.