At the beginning of my Python ETL journey, I created tables in SQL Server first and then inserted into them. Back then, I thought that was the only way. Later, during my work with pySpark, I wrote SQL tables directly from a pySpark dataframe, and the best part was that I did not need to create the tables at all: pySpark did it for me. (I will write about the pySpark code later.) That good experience made me wonder: is there a better way in Python? Is there a way to write data into SQL Server without creating the table first?
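The full pySpark code is a topic for a later post, but the idea looks roughly like this (a minimal sketch, assuming spark_df is an existing pySpark dataframe, a SparkSession is already running, and the Microsoft JDBC driver is available; server, database and table names are just placeholders):

#rough sketch: write a pySpark dataframe straight to SQL Server over JDBC, no CREATE TABLE needed
spark_df.write \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://LAPTOP-S1V0PRKL;databaseName=pythontest;integratedSecurity=true") \
    .option("dbtable", "schema_test.sometable") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .mode("overwrite") \
    .save()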

My best friend Google told me: yes. And here is how:


#imports
import pandas as pd
from sqlalchemy import create_engine
import pyodbc

#read excel file content into a pandas dataframe
file_xlsx = r"C:\Users\jingh\python_presentation\file_for_ETL.xlsx"
df=pd.read_excel(file_xlsx)
df
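Before loading, it is worth a quick look at the column types pandas inferred, because to_sql uses them when it creates the table (just a sanity check, nothing required):

#quick sanity check: these inferred dtypes drive the column types to_sql will create
df.head()
df.dtypes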


#make connection to SQL server and write the dataframe content to a table 
myservername = 'LAPTOP-S1V0PRKL'
mydatabasename = 'pythontest'

sqlcon = create_engine('mssql+pyodbc://@' + myservername + '/' + mydatabasename + '?trusted_connection=yes&driver=ODBC+Driver+13+for+SQL+Server')

df.to_sql('ETLtest20200321', con = sqlcon, schema ='schema_test', if_exists = 'replace', index = False)

#note: on my personal computer the ODBC driver is version 17, but on my company computer it is 13,
#because my personal computer runs SQL Server 2019 while the company machine runs SQL Server 2016
#check which ODBC driver version your computer has and put that in the create_engine string
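If you are not sure which driver name to use, pyodbc can list the ODBC drivers installed on the machine (a small check, not specific to this example):

#list the ODBC drivers installed on this machine and pick the SQL Server one for the create_engine string
import pyodbc
print(pyodbc.drivers())
#e.g. 'ODBC Driver 17 for SQL Server' or 'ODBC Driver 13 for SQL Server'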

Yes. Simple as That.

But this method has one shortcoming: it is slow. It took about 100 minutes to write 1 million rows. The slowness comes from the row-by-row operation between Python and SQL Server. I have tried many times to speed it up (many other people have tried too). So far, I can only fast-load string datatypes. In short, it is done this way: first, convert the dataframe to a list of tuples; then, use a pyodbc connection with fast_executemany to load the data quickly.



list_of_tuples=list(df.itertuples(index=False, name=None))

SQL_insert_command = """INSERT INTO [schema_name].[table_name]
                        ([Column1], [Column2], [Column3], [Column4], [Column5], [Column6], [Column7])
                        VALUES (?, ?, ?, ?, ?, ?, ?)"""


#open a direct pyodbc connection, reusing the server and database names defined above
connStr = pyodbc.connect('Driver={SQL Server};Server=' + myservername + ';Database=' + mydatabasename + ';Trusted_Connection=yes;')


cursor = connStr.cursor()

#fast_executemany sends the rows in batches instead of one round trip per row
cursor.fast_executemany = True
cursor.executemany(SQL_insert_command, list_of_tuples)

connStr.commit()
cursor.close()
connStr.close()
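If you want to measure the speed-up yourself, wrapping the insert in a simple timer works (a minimal sketch; in practice you would run this in place of the plain executemany above, not in addition to it):

#rough timing around the bulk insert, to compare against the to_sql approach
import time

start = time.time()
cursor.executemany(SQL_insert_command, list_of_tuples)
connStr.commit()
print(f"loaded {len(list_of_tuples)} rows in {time.time() - start:.1f} seconds")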

I am still trying to make this work for broader datatypes. Of course I will share my progress here after I have a breakthrough.

You can see from this code that a new package, pandas, is used here. In pandas, a table is called a dataframe. All of our knowledge about SQL Server tables can be applied to dataframes, just with different code.
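For example, a simple SELECT with a WHERE clause has a direct pandas counterpart (the column names here are just examples, not from the file above):

#SQL:    SELECT Column1, Column2 FROM schema_test.ETLtest20200321 WHERE Column1 > 10
#pandas: the same filter and column selection on the dataframe
subset = df.loc[df['Column1'] > 10, ['Column1', 'Column2']]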

