FindingData

Working with files - pandas 🐼

21 Nov, 2020

🤔 what is pandas ?

pandas is the most powerful library for doing data manipulation and analysis out there. pandas is like excel for python, with tables [which in padas are called DataFrames , and rows and columns [which in pandas called Series

📚 Working with various files format

If you know a little bit about data, we generally see .csv file format or .txt file format around us at the beginner level or even in kaggle like platform but at the industry level we have to deal with more complex dataset and type also. data can be in any file format like Binary file format or json format etc. so dealing with these file formats becomes a little bit though.

In this post, we will see how can we read and write a different type of file formats using pandas.

File Reading and Writing with pandas

📥 Loading data

we will load all the basic and important file formats and will try to understand

Env Setup and pandas Installation

It is a best practice to create a virtual environment and install the neccessary packages inside the virtualenv.

Create virtualenv using anaconda

python>3.6

conda create -n new_env python=3.7
conda activate new_env

Install Pandas and jupyter notebook

pip install jupyter
pip install pandas

🐼 Loading data with pandas

With pandas, we can load data having diffrent file formats.We will cover following

.txt — text file
.csv — comma seperated value
.tsv — tab seperated value
.html — hyper text markup language
.json — javascript object notation
Binary data format ( pickle, hd5 )
Web APIs
Database
pdf

From txt File

path-filename.txt

import pandas as pd
df = pd.read_csv("path_to_text_file", sep="")

From csv file

python

import pandas as pd
df = pd.read_csv("path_to_csv_file") # default par -> sep=","

From Remote URL

python

import pandas as pd
df = pd.read_csv("remote_url_path")
Eg :

From tsv File

python

import pandas as pd
df = pd.read_csv("path_to_tsv_file", sep="\t")

From HTML File

python

url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"
tables = pd.read_html(url)     
# List   
# Read data which is inside the tables. if more than one table then you can
# access using list operations.

From Json Data

python

obj = """
    {"name": "Wes",
     "places_lived": ["United States", "Spain", "Germany"],
     "pet": null,
     "siblings": [{"name": "Scott", "age": 30,
                   "pets": ["Zeus", "Zuko"]},
                  {"name": "Katie", "age": 38,
                   "pets": ["Sixes", "Stache", "Cisco"]}]
    }
    """
import json
import pandas as pd
loaded = json.loads(obj)    # dict
json_data = json.dumps(loaded)  # str
siblings = pd.DataFrame(loaded['siblings'], columns=['name', 'age', 'pets';])

Binary Data Format (pickle)

https://github.com/IMsumitkumar/Internshala-jobs-scrapping/blob/main/scrapped-dataset/DELHI.csv

import pandas as pd
pd.read_pickle('path_to_pkl_filr')

From Web API

python

import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
# if resp -> <Response [200]> -> Accessable! Good to go
data = resp.json()
data[2]

let say i want to grab user data in 2nd issue

python

data = pd.DataFrame.from_dict(data)
df = data['user'].apply(pd.Series)
type(df)
< pandas.core.frame.DataFrame >

From DataBase

In order to read from Database, read the data from DB into a python list and use DataFrame() to create one

python

import pandas as pd
db = # Create DB connection object 
cur = db.cursor()
cur.execute("SELECT * FROM <TABLE>")
df = pd.DataFrame(cur.fetchall())

from pdf

python

import tabula
# Read PDF into list of DataFrame
dataframe = tabula.read_pdf("input.pdf", pages='all')
# Convert PDF into CSV file
tabula.convert_into("input.pdf", "output.csv", output_format="csv", pages='all')
# Read remote PDF into list of DataFrame
dataframe_2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

stackoverflow Source for data reading from pdf

Edit this page on GitHub

FindingData

Working with files - pandas 🐼

Working with files - pandas 🐼

🤔 what is pandas ?

📚 Working with various files format

File Reading and Writing with pandas

📥 Loading data

Env Setup and pandas Installation

🐼 Loading data with pandas

From txt File

From csv file

From Remote URL

From tsv File

From HTML File

From Json Data

Binary Data Format (pickle)

From Web API

From DataBase

from pdf

Recent Posts

On this page