FindingData
Working with files - pandas 🐼
Working with files - pandas 🐼
21 Nov, 2020
🤔 what is pandas ?

pandas is the most powerful library for doing data manipulation and analysis out there.
pandas is like excel for python, with tables [which in padas are called DataFrames ,
and rows and columns [which in pandas called Series
📚 Working with various files format
If you know a little bit about data, we generally see .csv file format or .txt file format around us at the beginner level or even in kaggle like platform but at the industry level we have to deal with more complex dataset and type also. data can be in any file format like Binary file format or json format etc. so dealing with these file formats becomes a little bit though.
In this post, we will see how can we read and write a different type of file formats using pandas.
File Reading and Writing with pandas



📥 Loading data
we will load all the basic and important file formats and will try to understand
Env Setup and pandas Installation
It is a best practice to create a virtual environment and install the neccessary packages inside the virtualenv.
Create virtualenv using anaconda
conda create -n new_env python=3.7conda activate new_envInstall Pandas and jupyter notebook
pip install jupyterpip install pandas🐼 Loading data with pandas
With pandas, we can load data having diffrent file formats.We will cover following
- .txt — text file
- .csv — comma seperated value
- .tsv — tab seperated value
- .html — hyper text markup language
- .json — javascript object notation
- Binary data format ( pickle, hd5 )
- Web APIs
- Database
From txt File
import pandas as pddf = pd.read_csv("path_to_text_file", sep="")From csv file
import pandas as pddf = pd.read_csv("path_to_csv_file") # default par -> sep=","From Remote URL
import pandas as pddf = pd.read_csv("remote_url_path")Eg :From tsv File
import pandas as pddf = pd.read_csv("path_to_tsv_file", sep="\t")From HTML File
url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"tables = pd.read_html(url) # List # Read data which is inside the tables. if more than one table then you can# access using list operations.From Json Data
obj = """ {"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}] } """import jsonimport pandas as pdloaded = json.loads(obj) # dictjson_data = json.dumps(loaded) # strsiblings = pd.DataFrame(loaded['siblings'], columns=['name', 'age', 'pets';])
Binary Data Format (pickle)
import pandas as pdpd.read_pickle('path_to_pkl_filr')From Web API
import requestsurl = 'https://api.github.com/repos/pandas-dev/pandas/issues'resp = requests.get(url)resp# if resp -> <Response [200]> -> Accessable! Good to godata = resp.json()data[2]
let say i want to grab user data in 2nd issue
data = pd.DataFrame.from_dict(data)df = data['user'].apply(pd.Series)type(df)< pandas.core.frame.DataFrame >From DataBase
In order to read from Database, read the data from DB into a python
list and use DataFrame() to create one
import pandas as pddb = # Create DB connection object cur = db.cursor()cur.execute("SELECT * FROM <TABLE>")df = pd.DataFrame(cur.fetchall())from pdf
import tabula# Read PDF into list of DataFramedataframe = tabula.read_pdf("input.pdf", pages='all')# Convert PDF into CSV filetabula.convert_into("input.pdf", "output.csv", output_format="csv", pages='all')# Read remote PDF into list of DataFramedataframe_2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")stackoverflow Source for data reading from pdf
Edit this page on GitHub